From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

NPTL has 3 control counters (total/wake/woken).
so NPTL can know:
  "how many threads enter to wait"(total),
  "how many threads receive wake signal"(wake),
  and "how many threads exit waiting"(woken).

Abstraction of pthread_cond_wait and pthread_cond_signal are:

A01 pthread_cond_wait {
A02   timeout = 0;
A03   lock(counters);
A04     total++;
A05     val = get_from(futex);
A06   unlock(counters);
A07
A08   sys_futex(futex, FUTEX_WAIT, val, timeout);
A09
A10   lock(counters);
A11     woken++;
A12   unlock(counters);
A13 }

B01 pthread_cond_signal {
B02   lock(counters);
B03   if(total>wake) { /* if there is waiter */
B04     wake++;
B05     update_val(futex);
B06     sys_futex(futex, FUTEX_WAKE, 1);
B07   }
B08   unlock(counters);
B09 }

What we have to notice is:
    FUTEX_WAKE could be called before FUTEX_WAIT have called (at A07).
In such case, FUTEX_WAKE will fail if there is no thread in waitqueue.

However, since pthread_cond_signal do not only wake++ but also
update_val(futex), next FUTEX_WAIT will fail with -EWOULDBLOCK because the val
passed to WAIT is now not equal to updated val.  Therefore, as the result, it
seems that the WAKE wakes the WAIT.

-----

The bug will appear if 2 pair of wait & wake called at (nearly)once:

   * Assume 4 threads, wait_A, wait_B, wake_X, and wake_Y
   * counters start from [total/wake/woken]=[0/0/0]
   * the val of futex starts from (0), update means inclement of the val.
   * there is no thread in waitqueue on the futex.

[simulation]

wait_A: calls pthread_cond_wait:
    total++, prepare to call FUTEX_WAIT with val=0.
    # status: [1/0/0] (0) queue={}(empty) #

wake_X: calls pthread_cond_signal:
    no one in waitqueue, just wake++ and update futex val.
    # status: [1/1/0] (1) queue={}(empty) #

wait_B: calls pthread_cond_wait:
    total++, prepare to call FUTEX_WAIT with val=1.
    # status: [2/1/0] (1) queue={}(empty) #

wait_A: calls FUTEX_WAIT with val=0:
    after queueing, compare val. 0!=1 ... this should be blocked...
    # status: [2/1/0] (1) queue={A} #

wait_B: calls FUTEX_WAIT with val=1:
    after queueing, compare val. 1==1 ... OK, let's schedule()...
    # status: [2/1/0] (1) queue={A,B} (B=sleeping) #

wake_Y: calls pthread_cond_signal:
    A is in waitqueue ... dequeue A, wake++ and update futex val.
    # status: [2/2/0] (2) queue={B} (B=sleeping) #

wait_A: end of FUTEX_WAIT with val=0:
    try to dequeue but already dequeued, return anyway.
    # status: [2/2/0] (2) queue={B} (B=sleeping) #

wait_A: end of pthread_cond_wait:
    woken++.
    # status: [2/2/1] (2) queue={B} (B=sleeping) #

This is bug:
   wait_A: wakeup
   wait_B: sleeping
   wake_X: wake A
   wake_Y: wake A again

if subsequent wake_Z try to wake B:

wake_Z: calls pthread_cond_signal:
    since total==wake, do nothing.
    # status: [2/2/1] (2) queue={B} (B=sleeping) #

If wait_C comes, B become to can be woken, but C...

This bug makes the waitqueue to trap some threads in it all time.

-----

>  - According to man of futex:
>      "If the futex was not equal to the expected value, the operation
>       returns -EWOULDBLOCK."
>    but now, here is no description about the rare case:
>      "returns 0 if the futex was not equal to the expected value, but
>       the process was woken by a FUTEX_WAKE call."
>    this behavior on rare case causes the hang which I found.

So to avoid this problem, my patch shut up the window that you said:

 > The patch certainly looks sensible - I can see that without the patch,
 > there is a window in which this process is pointlessly queued up on the
 > futex and that in this window a wakeup attempt might do a bad thing.

-----

In short:
There is an un-documented behavior of futex_wait. This behavior misleads
NPTL to wake a thread doubly, as the result, causes an application hang.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/kernel/futex.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff -puN kernel/futex.c~futex_wait-fix kernel/futex.c
--- 25/kernel/futex.c~futex_wait-fix	Tue Nov  9 13:43:05 2004
+++ 25-akpm/kernel/futex.c	Tue Nov  9 13:43:05 2004
@@ -486,8 +486,6 @@ static int futex_wait(unsigned long uadd
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
-	queue_me(&q, -1, NULL);
-
 	/*
 	 * Access the page after the futex is queued.
 	 * We hold the mmap semaphore, so the mapping cannot have changed
@@ -495,13 +493,15 @@ static int futex_wait(unsigned long uadd
 	 */
 	if (get_user(curval, (int __user *)uaddr) != 0) {
 		ret = -EFAULT;
-		goto out_unqueue;
+		goto out_release_sem;
 	}
 	if (curval != val) {
 		ret = -EWOULDBLOCK;
-		goto out_unqueue;
+		goto out_release_sem;
 	}
 
+	queue_me(&q, -1, NULL);
+
 	/*
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
@@ -542,11 +542,10 @@ static int futex_wait(unsigned long uadd
 	WARN_ON(!signal_pending(current));
 	return -EINTR;
 
- out_unqueue:
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	if (!unqueue_me(&q))
 		ret = 0;
- out_release_sem:
+out_release_sem:
 	up_read(&current->mm->mmap_sem);
 	return ret;
 }
_