Hi Chris F,
Thanks a lot for discovering the race! Looks like that is indeed the reason
for the deadlock. A few quick comments:
* Technically input_end is already the special end-of-input token, it only
propagates along the queues (not through the queues) and the author tried to
take a shortcut and put it directly into the last queue (which of course
went wrong like most shortcuts do).
* Having a special end-of-input token that propagates via the data path is
not safe with multiple producers. Technically it can be that one producer
gets the end-of-input data token and enqueues it before the other producers
can finish their work. This means the end-of-input data token would overtake
some of the input data, which will of course cause trouble.
* The problem can be fixed with little work if we make a flag like input_end
a part of every queue. The producer can then use a special function to
"close" its output queue. The consumers on the other end will then wake up
and recognize that their input queue is not only empty but also closed. They
then in return finish whatever work is left, close all their output queues
and terminate. Dedup uses such a mechanism and it is of course semantically
the same as your end-of-input token, it just uses a separate data path from
the real input data and is hence a little easier to handle IMHO.
* The queues need to know how many producers they have. A close is only
propagated if all producers close the queue. This is the reason why a
separate flag is safer than a special data token. An easy way to implement
it is to have a corresponding "open" function that producers can use
*BEFORE* they fetch any real work. Dedup uses a mechanism where the number
of producers is passed to the queue when they are initialized, so the
counter can only decrease during runtime.
* If the end-of-input token is made a part of the queues via the extra flag,
it must be included in the standard multithreaded producer / consumer
critical sections so the signal can't get lost again like in this case. See
the dedup queues as an example for how the enqueue / dequeue functions can
be changed to do that.
Chris F, would you be willing to submit a patch that fixes the problem? I'd
then include it in the next PARSEC version and make you the author of the
patch submission. You already went through the trouble to find the bug so
you're the one who deserves to be named.
Best,
Chris B
-----Original Message-----
From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Chris
Fensch
Sent: Wednesday, March 03, 2010 1:49 PM
To: parsec-users at lists.cs.princeton.edu
Subject: Re: [parsec-users] ferret deadlock
Hello Marc,
hallo ChrisB.
I just encountered the same problem running the parallel version of
ferret on a single core sparc machine. The problem is rather simple:
ferret uses the global variable "int input_end" to communicate to the
last stage of the pipeline that all input files have been read:
void *t_load (void *dummy)
{
.....bla.....
input_end = 1;
return NULL;
}
The condition for the last stage to finish is:
void *t_out (void *dummy)
{
struct rank_data *rank;
for (;;)
{
.... bla ....
fprintf(stderr, "(%d,%d)\n", cnt_enqueue, cnt_dequeue);
if (input_end && (cnt_enqueue == cnt_dequeue))
{
// signal main thread that work is done
}
}
return NULL;
}
It is quite easy to imagine an interleaving of threads, where the
following happens:
- t_load reads the last file from the disk and enqueues it for
processing, but a context switch happens before it can set input_end =
1.
- the other stages process the data (t_load is still suspended).
- t_out reads the data, process it and checks the condition (input_end
&& (cnt_enqueue == cnt_dequeue)), which fails as input_end == 0. After
that it waits for another element to arrive in its input queue (which
never arrives).
- t_load completes and sets input_end = 1. But this does no longer
matter, as t_out is stuck on dequeue.
A properly better method to signal the end of input would be to enqueue
a special end-of-input data token into the pipeline (possibly bypassing
all stages and inserting it directly in q_rank_out). Once this token has
been received by the last stage AND (cnt_enqueue == cnt_dequeue), the
last stage signals the main thread that the work is done.
Depending on the details of the used cache coherence protocol, the
problem can even happen if t_load sets input_end = 1 before the
condition check, but the change is not propagated in time. In general,
it is a bad idea to use normal global variables for synchronisation. I
think the variable should have at least been declared volatile, to make
sure that it is never cached in a register.
Cheers
Chris F
> Hello Marc,
>> We know of no deadlocks in any of the PARSEC workloads, but due to the
> nondeterministic nature of multithreading we expected that problems like
that
> would show up. We are very interested in fixing this bug, can you provide
us
> with more information? You seem to be able to reproduce the deadlock
> consistently, could you tell us which synchronization primitives are
> involved?
>> - Chris
>>> On Monday 21 April 2008 11:54 am, Marc de Kruijf wrote:
> > I am seeing frequent deadlock running ferret on any input size
> > (happens approx. 50% of the time) . My platform is "i686-linux.gcc",
> > which uses gcc 3.4.4 running on an Intel Core2 Duo. I see the same
> > problem with versions of gcc 4.x as well. Below is a sample output.
> > The deadlock happens as the program is completing.
> >
> > Has nobody seen this before?
> >
> > ----
> >
> > $ bin/parsecmgmt -a run -p ferret -c gcc -i simsmall
> > [PARSEC] Benchmarks to run: ferret
> >
> > [PARSEC] [========== Running benchmark ferret ==========]
> > [PARSEC] Deleting old run directory.
> > [PARSEC] Setting up run directory.
> > [PARSEC] Unpacking benchmark input 'simsmall'.
> > corel/
> > corel/__cass.env
> > corel/corel.raw
> > corel/lsh.lsh
> > corel/map_corel.map
> > queries/
> > queries/acorn.jpg
> > queries/air-fighter.jpg
> > queries/airplane-2.jpg
> > queries/airplane-takeoff-3.jpg
> > queries/alcatraz-island-prison.jpg
> > queries/american-flag-3.jpg
> > queries/apartment.jpg
> > queries/apollo-2.jpg
> > queries/apollo-earth.jpg
> > queries/apple-11.jpg
> > queries/apple-14.jpg
> > queries/apple-16.jpg
> > queries/apple-7.jpg
> > queries/aquarium-fish-25.jpg
> > queries/arches-9.jpg
> > queries/arches.jpg
> > [PARSEC] Running 'time
> >
/media/usbdisk/top/users/dekruijf/parsec-1.0/pkgs/apps/ferret/../../../pkgs
> >/apps/ferret/inst/i686-Linux.gcc/bin/ferret corel lsh queries 10 20 1
> > output.txt':
> > [PARSEC] [---------- Beginning of output ----------]
> > (12,1)
> > (12,2)
> > (12,3)
> > (12,4)
> > (12,5)
> > (12,6)
> > (12,7)
> > (12,8)
> > (12,9)
> > (12,10)
> > (12,11)
> > (12,12)
> > (13,13)
> > (14,14)
> > (15,15)
> > (16,16)
> >
> > ----
> >
> > Marc
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
parsec-users mailing list
parsec-users at lists.cs.princeton.eduhttps://lists.cs.princeton.edu/mailman/listinfo/parsec-users