Consider an array of data. There are multiple lockless writers, each writing to their own section of the array concurrently. I know that this is safe. However, would it be safe to add a concurrent reader which reads the whole array? I understand the data will be constantly changing because of the writers, but are there any issues that may arise? For example, if the reader is reading something, can a writer still write to it, and vice versa? Now that I say it this way, it seems like there would be issues, but I'd like to hear from you guys.
(Essentially, my goal is to create periodic snapshots of the data as it is being calculated, but I don't know if I need to lock it for such a task)


It depends on what the writers are doing and how much consistency you need, but chances are you need to lock.

Do the values being written depend on the value that was there previously? For example, an increment? A conditional test-and-set, etc?

How much data is each writer writing? Is it multiple words? For example, a memcpy?

When you say you want a "snapshot", do you require that all threads have finished their latest update and no new updates have begun?

If you answered yes to any of these questions, you need a lock. Even if you answered no to all 3 questions, it *might* be a candidate for a lock-free read, but even then when you have things running on multiple cores your caches can get out of sync, and sometimes you'll have to insert fences (which incur a slight performance overhead, but much less than a full lock). Chances are you answered yes to at least one question, so you probably don't have to worry about the additional complexities involved in making it lock-free.
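For the single-word, no-dependency case, the guarantees described above can be made explicit. Here is a minimal sketch in modern C++ with illustrative names: `std::atomic` supplies both the tear-free access and the fencing mentioned, so the only remaining ambiguity is timing.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Illustrative sketch: one writer, one reader, a single 32-bit value.
// std::atomic guarantees the reader never sees a torn word, and the
// release/acquire pair supplies the memory fencing mentioned above.
std::atomic<std::uint32_t> pixel{5};

// Run one writer and one reader concurrently; return what the reader saw.
std::uint32_t run_once() {
    std::uint32_t seen = 0;
    std::thread writer([] { pixel.store(6, std::memory_order_release); });
    std::thread reader([&] { seen = pixel.load(std::memory_order_acquire); });
    writer.join();
    reader.join();
    return seen;  // either 5 or 6, never anything in between
}
```

Whether the reader observes the old or the new value depends on timing, but with atomics that is the only ambiguity; with plain unsynchronized access the compiler and the caches give no such guarantee.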


Quote:
Original post by cache_hit
Do the values being written depend on the value that was there previously? For example, an increment? A conditional test-and-set, etc?

No.

Quote:

How much data is each writer writing? Is it multiple words? For example, a memcpy?

A fraction of the total data. We're actually talking about pixels, so given 2 writers and an 800x600 surface, each one is writing to 240k non-consecutive DWORDs (on x86, that is) in a loop.

Quote:

When you say you want a "snapshot", do you require that all threads have finished their latest update and no new updates have begun?

No.

Quote:

If you answered yes to any of these questions, you need a lock. Even if you answered no to all 3 questions, it *might* be a candidate for a lock-free read, but even then when you have things running on multiple cores your caches can get out of sync, and sometimes you'll have to insert fences (which incur a slight performance overhead, but much less than a full lock). Chances are you answered yes to at least one question, so you probably don't have to worry about the additional complexities involved in making it lock-free.

Well, one of them wasn't a yes/no question [grin]... actually, I guess it could be considered a no, because the data isn't consecutive as a memcpy might imply. Anyway, like I said, I don't need synchronization of any sort. I'm simply asking whether the actual operation of reading the data may cause issues while a writer is concurrently writing to the same data, or vice versa.


Quote:
For example, if the reader is reading something, can a writer still write to it, and vice versa? Now that I say it this way, it seems like there would be issues, but I'd like to hear from you guys.

If the individual data unit is an int or a similar atomic unit that can be manipulated in a safe lock-less way (as opposed to, say, a position expressed as 3 ints), then yes.

Otherwise no. Individual operations will still be safe, but there are no transactional semantics spanning multiple atomic operations.
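A minimal C++ sketch of this distinction, with illustrative names: each element is individually safe to read and write, but a scan across elements is not transactional and can observe any mix of old and new values.

```cpp
#include <atomic>
#include <cstdint>

// Each element is individually atomic: a reader never sees a torn 32-bit
// value. But there are no transactional semantics across elements -- a
// reader scanning the array can observe any mix of old and new values.
std::atomic<std::uint32_t> surface[4];

void writer_update() {
    // Four independent atomic stores, not one atomic transaction.
    for (int i = 0; i < 4; ++i)
        surface[i].store(0xFFFFFFFFu, std::memory_order_relaxed);
}

std::uint32_t reader_sample(int i) {
    // Always returns either the old or the new value of element i,
    // never a half-written word.
    return surface[i].load(std::memory_order_relaxed);
}
```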

I use a different approach for snapshots.

I have n buffers. On each update, data is written from the old buffer to a new one. When the new buffer is fully updated, it is pushed into a snapshot queue. The snapshot writer takes only the latest, possibly ignoring multiple old updates. All buffers the snapshot writer processes or discards are then put back into a pool for the writers to consume.

This allows non-blocking operation without any locks, except for switching the buffers. The buffers in the pool can be pre-allocated, or allocated on demand with only m buffers always remaining in the pool to avoid excessive allocations. If the pool size is checked, it can be used to detect overflow (the snapshot writer is too slow) or underflow (the snapshot writer needs to idle, since most of the time no buffers are available). Both of these can be used to tune update rates.
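A minimal sketch of the buffer-pool scheme described above, with hypothetical names; a short mutex guards only the buffer hand-off ("switching the buffers"), and the empty-pool / allocate-on-demand handling is omitted.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

using Buffer = std::vector<std::uint32_t>;

class SnapshotPool {
public:
    SnapshotPool(std::size_t nbuffers, std::size_t size) {
        for (std::size_t i = 0; i < nbuffers; ++i)
            pool_.emplace_back(size);
    }

    // Writer side: grab a free buffer from the pool. (Empty-pool handling,
    // e.g. waiting or allocating on demand, is omitted from this sketch.)
    Buffer acquire() {
        std::lock_guard<std::mutex> g(m_);
        Buffer b = std::move(pool_.back());
        pool_.pop_back();
        return b;
    }

    // Writer side: a fully updated buffer becomes the next snapshot.
    void publish(Buffer b) {
        std::lock_guard<std::mutex> g(m_);
        snapshots_.push_back(std::move(b));
    }

    // Reader side: take the newest snapshot only; everything older is
    // recycled straight back into the pool.
    bool take_latest(Buffer& out) {
        std::lock_guard<std::mutex> g(m_);
        if (snapshots_.empty()) return false;   // underflow: reader idles
        out = std::move(snapshots_.back());
        snapshots_.pop_back();
        for (auto& stale : snapshots_)
            pool_.push_back(std::move(stale));  // discard old updates
        snapshots_.clear();
        return true;
    }

    void recycle(Buffer b) {
        std::lock_guard<std::mutex> g(m_);
        pool_.push_back(std::move(b));
    }

private:
    std::mutex m_;
    std::vector<Buffer> pool_;
    std::deque<Buffer> snapshots_;
};
```

Checking `pool_` size inside the lock is where the overflow/underflow detection mentioned above would hook in.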


Quote:
Original post by Antheus
If individual atomic data unit is int or similar atomic unit which can be manipulated in a safe lock-less way, rather than position expressed as 3 ints, then yes.

Otherwise no. Individual operations will still be safe, but there is no transactional semantics to span multiple atomic operations.

I see what you're saying. And yes, the data is composed of atomic units, unsigned 32-bit ints, to be exact.

Quote:

I use a different approach for snapshots.

I have n buffers. On each update, data is written from old to new. When new is fully updated, it is pushed into snapshot queue. Snapshot writer takes the latest only, possibly ignoring multiple old updates. All arrays that snapshot writer processes or discards are then put back into pool for writers to consume.

Indeed, but this would require that the operation is completed, right ("when new is fully updated")? I'm looking to create snapshots as the data is being calculated; noise and other abnormalities due to the data not being fully updated are not an issue.


1) At time t the value of some pixel is 5. At time t+1 your writer writes the value 6. At time t+2 your reader attempts to read the pixel and sees 5.

Maybe this won't happen though; maybe it will actually see 6. When is the reader guaranteed to see 6? Who knows; maybe never, if your compiler is smart enough. You can give hints such as volatile, but marking an entire array of 240,000 DWORDs volatile seems like a performance disaster waiting to happen.

2) One of your writers is responsible for writing to pixels 1, 3, 5, and 7. A reader comes through and reads. You have no guarantees at all about the state of pixels 1, 3, 5, and 7. Each one can have either the old or new value, for a total of 16 possible combinations of values. In other words, you have no idea what you're getting. Could be the entire old state, could be the entire new state, could be any possible wish-wash of in-between states.

Also, if you want the best performance, you should consider which items are being written to by each thread. You mentioned that they're writing to non-consecutive DWORDs (which I'll assume means 4-byte values). You should try very hard to avoid false sharing, because it can be a big issue. In short, the situation is this:

Suppose the size of a cache line is 128 bytes, i.e. 32 DWORDs. Thread 1 is writing to the first 16, thread 2 is writing to the second 16. In this situation, performance is probably not much different than if your machine had no cache at all: no matter which thread performs a write, the entire cache line is invalidated and both cores have to reload it every single time. Maybe you're already aware of this problem, but when you say that each writer is writing to non-consecutive DWORDs, it's hard not to think of this as a potential problem.

Ideally you want to partition your data set so that each thread only deals with cache-line size chunks of data.
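One way to do that partitioning, sketched with assumed constants (a 64-byte line here, i.e. 16 DWORDs; substitute 128/32 for the figures above): round each thread's pixel range to cache-line boundaries so no line is ever shared between two writers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>

// Assumed cache-line size; query the actual value on your target hardware.
constexpr std::size_t kLineBytes     = 64;
constexpr std::size_t kDwordsPerLine = kLineBytes / sizeof(std::uint32_t); // 16

// Return the [first, last) pixel-index range thread `t` of `n` should own,
// rounded to cache-line boundaries so no line is shared between writers.
std::pair<std::size_t, std::size_t>
chunk_for_thread(std::size_t total_pixels, std::size_t t, std::size_t n) {
    std::size_t lines = (total_pixels + kDwordsPerLine - 1) / kDwordsPerLine;
    std::size_t per   = (lines + n - 1) / n;          // lines per thread
    std::size_t first = std::min(t * per, lines) * kDwordsPerLine;
    std::size_t last  = std::min((t + 1) * per, lines) * kDwordsPerLine;
    return { first, std::min(last, total_pixels) };
}
```

For the 800x600 surface with 2 writers discussed earlier, this hands each thread a contiguous, line-aligned half of the 480,000 pixels.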


Quote:
Original post by cache_hit
1) At time t the value of some pixel is 5. At time t+1 your writer writes the value 6. At time t+2 your reader attempts to read the pixel and sees 5.

[...] marking an entire array of 240,000 DWORDs volatile seems like a performance disaster waiting to happen.

2) One of your writers is responsible for writing to pixels 1, 3, 5, and 7. A reader comes through and reads. You have no guarantees at all about the state of pixels 1, 3, 5, and 7. Each one can have either the old or new value, for a total of 16 possible combinations of values. In other words, you have no idea what you're getting. Could be the entire old state, could be the entire new state, could be any possible wish-wash of in-between states.

I'm aware of these issues:

Quote:

[...] noise and other abnormalities due to the data not being fully updated are not an issue.

Quote:

Also, if you want best performance, you should consider which items are being written to by each thread. You mentioned that they're writing to non-consecutive DWORDs (which I'll assume means 4-byte value). You should try very hard to avoid false sharing, because it can be a big issue. In short, the situation is this:

Indeed, I still need to find the right balance; when I previously arranged the writes to be cache-friendly, some threads ended up calculating extremely intense areas while others hit the boring areas, and randomizing the assigned pixels caused a noticeable speedup. I suppose a much better idea would be to randomize cache-friendly areas per thread instead of individual pixels... in fact, I'm going to go do that right now [smile]
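A sketch of that idea with illustrative names: shuffle tile indices (each tile being a cache-friendly block of pixels) and deal them out round-robin, so each thread gets a random spread of cheap and expensive regions while its individual writes stay contiguous.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Randomize cache-friendly tiles per thread rather than individual pixels.
// Tiles keep each write run contiguous (no false sharing within a tile);
// shuffling spreads expensive regions evenly across threads.
std::vector<std::vector<std::size_t>>
assign_tiles(std::size_t num_tiles, std::size_t num_threads, unsigned seed) {
    std::vector<std::size_t> tiles(num_tiles);
    std::iota(tiles.begin(), tiles.end(), 0);
    std::shuffle(tiles.begin(), tiles.end(), std::mt19937(seed));

    // Deal the shuffled tiles round-robin to the threads.
    std::vector<std::vector<std::size_t>> per_thread(num_threads);
    for (std::size_t i = 0; i < num_tiles; ++i)
        per_thread[i % num_threads].push_back(tiles[i]);
    return per_thread;
}
```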


I've designed a lock-free multiple-writer/single-reader message queue; it's a tricky thing to get right and isn't always the most performant approach, but if you'd like to take a look at it, the source is available here (see lockless.h and mailbox.h).