issue-491: Fix lost wakeups in spinlock

In the uncontended case, SpinLock::{Lock,Unlock} have an overhead of 13ns.
When the lock is contended, SpinLock::Unlock has a bug where it can fail
to wake up waiters, causing the overhead of SpinLock::{Lock,Unlock} to
blow up from 13ns to as high as 256ms. Under heavy lock contention, this
can cause applications to sleep for minutes at a time without doing
anything, appearing to hang.

The following fix slows down SpinLock::Unlock by 2ns in the uncontended
case, but speeds up SpinLock::Lock by up to almost 256ms in the
contended case where a wakeup is missed.

diff --git a/src/base/spinlock.h b/src/base/spinlock.h
index c2be4fd..b0ba09d 100644
--- a/src/base/spinlock.h
+++ b/src/base/spinlock.h
@@ -31,11 +31,6 @@
  * Author: Sanjay Ghemawat
  */
 
-//
-// Fast spinlocks (at least on x86, a lock/unlock pair is approximately
-// half the cost of a Mutex because the unlock just does a store instead
-// of a compare-and-swap which is expensive).
-
 // SpinLock is async signal safe.
 // If used within a signal handler, all lock holders
 // should block the signal even outside the signal handler.
@@ -95,10 +90,9 @@
   // TODO(csilvers): uncomment the annotation when we figure out how to
   //                 support this macro with 0 args (see thread_annotations.h)
   inline void Unlock() /*UNLOCK_FUNCTION()*/ {
-    uint64 wait_cycles =
-        static_cast<uint64>(base::subtle::NoBarrier_Load(&lockword_));
     ANNOTATE_RWLOCK_RELEASED(this, 1);
-    base::subtle::Release_Store(&lockword_, kSpinLockFree);
+    uint64 wait_cycles = static_cast<uint64>(
+        base::subtle::Release_AtomicExchange(&lockword_, kSpinLockFree));
     if (wait_cycles != kSpinLockHeld) {
       // Collect contentionz profile info, and speed the wakeup of any waiter.
       // The wait_cycles value indicates how long this thread spent waiting