I was inspired to install and benchmark some ramdisks after reading a forum post on winram+firadisk. After many tests, it appears that all reads and writes of large files to and from the ramdisks use only one thread of my CPU.

I have dual Xeon 5645 CPUs capable of a combined 64 GB/s transfer rate, and 12x4 GB 800 MHz DDR3 memory sticks capable of a combined 76.8 GB/s transfer rate.

In my benchmarks I am getting nowhere near that speed: maybe 4.5 GB/s maximum in the HD Tune benchmarker. In Windows Task Manager I observe that only 1 thread of 1 CPU (1 out of 12 threads) is working at any given moment while I'm transferring data. I imagine that if multiple threads were transferring data to/from the ramdrive, we could see transfer rates of 50+ GB/s!

Could someone please explain to me why these ramdisks aren't utilising multiple CPU cores simultaneously? Could they? I'm certainly new to all this stuff and could be missing something obvious.

Richcopy was 2.3x faster than Windows' drag-and-drop method. Given that it has to read from one virtual hard drive and write to another, this is roughly equivalent to transferring 4.1 GB/s to and from the RAM. This value is similar to my HD Tune benchmark of 4.5 GB/s, so Richcopy/Robocopy didn't reach the 50+ GB/s range I'm hoping for. They appear to be limited by the same thing that limits the HD Tune benchmark software.

Other tests

I performed many tests with different software ramdrives and different bit versions of Richcopy, and experimented with cache and various thread settings. I didn't beat the read or write rates of 4.5 GB/s from the previous benchmark tests.

Conclusion/Guesses:

* 50+ GB/s was not obtained; read+write rates of 4.1 GB/s using Richcopy and Robocopy came reasonably close to the HD Tune benchmark value of 4.5 GB/s.

* My guess is that SoftPerfect Ramdisk / RAMDISK drivers are the bottleneck?

* My CPU has 12 MB of cache; I suspect that SoftPerfect RAMDISK may not be transferring 12 MB slabs of data between the RAM and CPU? (I don't understand cache.)

* Maybe the ramdisk drivers are not communicating in a parallel fashion with cpu threads?

* Maybe I've made a mistake in my 50 GB/s theory?

Perhaps someone has an idea why the ramdisks are not transferring anywhere near their theoretical limits? I noticed in Windows Task Manager that all threads of my CPU were being used while running Richcopy, yet I got nearly the same speeds as when using only 1 CPU thread in the benchmarking programs. I don't expect anyone to have the answer, but a guess, and maybe a way to check the guess, would be good.

As far as I know, even with 2 CPUs, there is only one access at a time to the RAM. So the maximal transfer rate from your CPUs would be closer to 32 GB/s.
This CPU supports 3-channel memory. With 800 MHz memory sticks, we obtain a maximum of 19.2 GB/s.

That said, your results are indeed far from that, so there seems to be an issue somewhere. I would be interested if you find something out.

I expected Qsoft to be the best based on other comparisons on the internet, but SoftPerfect RAMDISK won. IMDISK was the worst (although still amazing compared to my hard drive!). I believe this is because I'm using a different style of PC (a workstation) compared to the gaming PCs used in other ramdisk benchmark comparisons. All my ramdisks are slow compared to others I've seen on the internet, probably because my RAM and CPU have low frequencies, and maybe dual/quad-channel RAM is better?

HD Tune Pro Benchmarker

The link above gives benchmarks in HD Tune for a hard drive, a USB drive, and some ramdisks. Inside this image is a brief discussion of my specs and some STREAM benchmarks from similar systems (same motherboard / similar CPU model) that have 39 GB/s memory rates. It appears that with the hardware I have (or perhaps by replacing my current RAM modules), I should be able to approach or beat these STREAM benchmarks. Is it possible for ramdisks to utilize these speeds? It appears they currently do not, but maybe with some reprogramming they could?

Hypothesis:

Ramdisks only communicate along one memory channel at a time, so communicating with 2 of 9 memory slots splits the bandwidth down to a mere 6.3 GB/s theoretical maximum.

Math for hypothesis:

The CPU has 28.4 GB/s memory bandwidth (32 GB/s including error-checking bits, but we only care about actual data). Divided over one memory channel this gives 28.4/3 = 9.47 GB/s. This is then split between three memory slots, two of which are occupied by memory sticks, so 2/3 x 9.47 GB/s = 6.3 GB/s can flow through a memory channel in the current configuration. The maximum measured transfer rate of 4.5 GB/s and highest burst rate of 5.6 GB/s do not exceed this value.

I'll try swapping my RAM modules around in the motherboard and see what happens; maybe someone can tell me I'm an idiot, that memory channel bandwidth doesn't get split up at the RAM slots, and that the 2/3 factor is wrong.

Thank you for your replies; I've gotten a lot of ideas and am doing plenty of testing. I really want to see ultrafast ramdrives. I'm not that good with computers, and your posts are helping me learn more about the problem.

Summary/Conclusion:

I am getting 1-3 GB/s read/write rates for the ramdisks. They are all pretty much the same, considering the potential bandwidth of my system could be 30+ GB/s as indicated by the STREAM benchmarks of very similar systems.

SoftPerfect was the best ramdisk on my system; ramdisk speeds appear to be system specific. Try a few out to see which best utilizes the system you have.

I will be doing STREAM benchmarks and trying to isolate exactly which part of the system is the limiting factor in ramdisk speeds.

I hypothesize that current ramdrives utilize only one memory channel at a time, resulting in a theoretical max of 6.3 GB/s. This will be tested by physically moving RAM modules around in the motherboard and studying ramdisk performance.

I had the physical memory option selected for the above IMDisk benchmarks. I've re-tested with both virtual and physical memory; virtual memory is much faster, my mistake. I wish I could edit my above post to include a corrected benchmark for IMDisk, as it performs better than my previous benchmarks show. I take back my statement that it is the worst; aside from 4K transfers it does alright. I've attached proper benchmarks for the version I've been using.

@pigeon: Thanks for testing. The "physical memory" option uses an additional driver ("awealloc"), which has a few more things to do. So your results are not surprising.

Finally, Wikipedia confirms my first answer about your maximal memory speed: "According to Intel, a Core i7 with DDR3 operating at 1066 MHz will offer peak data transfer rates of 25.6 GB/s when operating in triple-channel interleaved mode."
We get this result by multiplying 1066 by 8 (64-bit) and by 3 (the number of channels).
It's not the same CPU, but in this case, that changes nothing.

Anyhow, you are obviously not limited by the memory sticks.
It would have been interesting to have SoftPerfect ramdisk as open source... Perhaps you should try to contact its author to ask why this result is so different from the others...

Let's remember that these transfer speeds are theoretical.

In reality, when using ImDisk or other software, you have a driver in the middle which, besides transferring data, also needs to manage a filesystem.

The STREAM benchmark gave a maximum of 10.9 GB/s.
The number of threads had a large impact on the speed of the data transfer: 1 thread gave a mere 4.3 GB/s; more threads = more bandwidth. Ramdisks currently use 1 thread to transfer data and hence suffer from this bandwidth limitation.

Question: So why did multithreaded Robocopy and Richcopy fail to get high speeds when transferring data across ramdisks?
Answer: I believe they have to operate through a ramdisk which, for some reason, only allows 1 thread to transfer data at a time, hence the ramdisk is bottlenecking the transfer of data.

I will struggle to verify my answer since I've never programmed drivers or Windows applications. I suspect it is correct though. Any comments to help me verify or refute my proposed answer? Could this 1-thread bottleneck be removed by reprogramming ramdisks? Maybe modifying IMDISK to use multiple threads to copy data could create the world's fastest ramdisk for multicore systems?

Theory vs STREAM benchmark vs RAMDISK:
Two of the three memory channels on my system are in use (didn't spend enough $$ for extra RAM sticks). So two channels of DDR3 modules @ 1066 MHz give 2x1066x8 = 17 GB/s theoretical maximum speed, meaning the STREAM benchmark is over half the theoretical limit. The best ramdisk gave 3.7 GB/s, far below both the theoretical maximum and the measured system bandwidth.

Conclusion:
* My system has a measured bandwidth of 10.9 GB/s out of a theoretical 17 GB/s.
* Threads have a huge impact on how much bandwidth you can access.
* Ramdisks had a transfer rate of 3.7 GB/s, not far from the 1-thread bandwidth limit of 4.3 GB/s measured in STREAM.
* Ramdisks could access higher bandwidths by utilizing more threads. Maybe a large increase in the speeds of IMDISK and others could be achieved by programming them to use more than 1 thread.

Thank you for your comments so far. I have no idea how practical (or crazy) it would be to reprogram IMDISK to use multiple threads to transfer data.

Above I compare Radeon RAMDisk read/write speeds with 1 vs 16 threads. The 16 threads give a massive speedup to 10.1 GB/s, which is close to the STREAM-benchmarked maximum memory bandwidth of my system (10.9 GB/s).

I have also performed the same test for IMDISK and SoftPerfect; they don't show this crazy speedup. Some Radeon RAMDisk programmer clearly set out with the specific goal of multithreading a ramdisk. I thought I had a unique and awesome idea!

One note I'd like to make: my individual RAM bandwidth is low while my CPU is fast (a weird server build). These speedups may not be the same for gaming-style rigs. Maybe someone with that type of computer might want to test it out?

It's 5 am; I'll post the other test results for IMDISK etc. after some sleep.

Once again, a big thank you to everyone; you've been very knowledgeable and helpful, and I wouldn't have figured this out on my own.

Using a block size of 512K means that it can be held entirely in CPU cache, since you have 1 MB per thread. And it's supposed to increase the speed...

If a thread can achieve 3.2 GB/s, with 16 threads you should get something near 51.2 GB/s. Given the current result, it means that your CPU spends about 80% of its time waiting for data, stuck at instructions that read or write something. So this time cannot even be used to do something else!

Using a block size of 512K means that it can be held entirely in CPU cache, since you have 1 MB per thread. And it's supposed to increase the speed...

Oh, I never thought of that. So I have a total of 24 MB cache on my system, and (16 threads) x (512K block size) = 8 MB, well within the cache limit. Here's a test which I think illustrates what you are saying? It is just a memory benchmark (not a ramdisk):

You can see a sudden drop between 8 MB and 16 MB; this is because my L3 cache (12 MB) is full and can't hold the entire block in one pass. v77, I'm sure you know this inside out already, but I'm posting it in case there are people struggling to understand this stuff like myself.

^^ Problem - The above benchmark uses only 2 threads and gets 9.8 GB/s. Was I wrong about more threads giving more bandwidth?

The STREAM copy benchmark both reads and writes data, which may be slower and heavier on system resources than just reading data from memory. This may explain how 2 threads in the above benchmark got 9.8 GB/s reading, while 2 threads in STREAM got only 4.5 GB/s.

And it's supposed to increase the speed...

I'm confused, didn't it? The benchmark says it read 121 GB of random data from the ramdisk in ~12.5 seconds (timed with a stopwatch). That's a lot of data going from the ramdisk to the CPU in a small amount of time. Isn't this good? Please forgive my ignorance, I'm just a pigeon pecking on a keyboard. It seems you're correct though: Robocopy and Richcopy, which are multithreaded programs, are not giving ~10 GB/s transfer rates on the Radeon multithreaded ramdisk, only standard ramdisk rates of 2-4 GB/s.

If a thread can achieve 3.2 GB/s, with 16 threads you should get something near 51.2 GB/s.

(I believe) the maximum experimental memory bandwidth of my system is 10.9 GB/s, as benchmarked by STREAM. I don't know why it's this low, but I bet this guy does - http://www.cs.virginia.edu/stream/ .

Given the current result, it means that your CPU spends about 80% of its time waiting for data, stuck at instructions that read or write something. So this time cannot even be used to do something else!

Dual monitor screenshot: there was no lag in the video or open programs that I could observe while running the benchmark. I'm sure there's a lot of truth to what you are saying, but I didn't notice any change in performance, so I think it's still OK for normal PC activities?

In fact, what I know comes from what I have read about ways to increase the speed of the "memcpy" function (a well-known C function used to copy a block of memory). And my conclusion is: there is nothing obvious.
In your case, I also wonder if your machine loses some time because you are using 2 CPUs.

Because of the caches, using several threads makes things more complicated: in most cases the L3 cache is shared between all cores of a CPU, while the L1 cache is not.

All these benchmark tools are interesting, but the problem is that they never tell you what they are using to read or write memory. For instance, using SSE2 instructions to read or write 128 bits at once can give better results, but the improvement is not necessarily very high because of the caches. With some implementations, it can even be worse than other methods. Moreover, with 3-channel memory (that is, 3x64 bits), results can also be a bit different.

In conclusion, using several threads can be a way to improve overall speed, but not necessarily the best (because you reduce the efficiency per core). And as you have already seen, it also depends on the ramdisk itself.

Wonko, yes it looks like it was added in the 4.1 update - http://www.radeonmem...are_updates.php . Only certain benchmark programs can access the parallel speeds of the ramdisk. My attempts to use parallel copying programs on this ramdisk fail. This parallel feature seems unusable in the real world.

Only certain benchmark programs can access the parallel speeds of the ramdisk. My attempts to use parallel copying programs on this ramdisk fail. This parallel feature seems unusable in the real world.

I measured the bandwidth on my system using different C routines and multithreaded them with OpenMP. Here are the speeds (GiB/s):

The winners were SSE for reading data and stosq for writing data. Maybe these instructions can be incorporated into a ramdisk?

The maximum speeds were poor compared to Alex's measured 21 GiB/s (of a theoretical 25.6 GiB/s) on a 4-core computer. His RAM runs at 1600 MHz (much faster than mine), hence the higher bandwidth (there, the CPU is the bottleneck). For my computer, the RAM speed and the number of channels I'm using are both bottlenecking the bandwidth. This could be fixed by spending $$ to fill all 3 channels with 1333 MHz RAM, which would shift the theoretical bandwidth to 32 GB/s (~10.7 GB/s per channel).

* Reads/writes of 20-30 GB/s are obtainable with an average modern computer running 1600 MHz RAM.

* Is ~30 GB/s read/write obtainable on my system? It looks possible, but I don't have the $$ to test it.

* Can these commands be incorporated into ramdisks through the use of OpenMP? Maybe?

Wonko, Anvil never replied to me, so I found the above C programming routines that may work in ramdisks.

I find it odd how everyone always seems to have faster read and write speeds than I do. Heck, everyone with older computers seems to stomp mine in terms of RAM read/write speeds. I have quad Xeon E7-8870s, each with 10 hyperthreaded cores, totalling 40 cores (80 threads). Also, all my RAM is 1333 MHz but downclocked to 1066 MHz, as the SuperMicro board seems to do this. I used DataRam for the ramdisk and Anvil's Storage Utilities. Anywhoo, here is my result:

Try the benchmark with Radeon RAMDisk. It should give you better results, as it is the only ramdisk I have found that offers proper multithreaded speedup; others like ImDisk and SoftPerfect didn't have this feature (it appears to be a useless feature for real-world applications though). From what I've read, 1600 MHz RAM and a decent, high-frequency CPU are what you need for a super fast ramdisk. Our Xeon systems have neither (but with proper multithreaded coding I believe we could utilise the entire memory bandwidth and have superior large-file transfer speeds).