Multithreading on multiple hard drives: will it make it faster?

Hi everyone, I have a program which collects file info; currently it's single-threaded.

Does multithreading make it faster to collect file info on multiple hard drives (one thread per drive)? What about hard drives that share a data channel (e.g. IDE primary and secondary)? Has anyone tried this before?

Thanks

Sz
Saturday, December 06, 2008


You might get a speed up, or you might not. It's very difficult to tell a priori and may be hardware-dependent.

The questions you should ask are (1) is this task worth a factor of 2 (or # of drives) speed up? and (2) how costly would it be from a software engineering standpoint to parallelize?

d
Saturday, December 06, 2008


If your target platform supports it, I would look at asynchronous I/O before I went after multi-threading.

The question is unanswerable without more data. You need to look at your current program and answer the question: "Where does it spend most of its time?" Once you know that, you can reduce or eliminate that bottleneck. If the disk is the major bottleneck, then most people use some sort of RAID or striping to get higher disk throughput.

For example: suppose you find that step X in your program is executed 1,000,000 times during a run, and those 1,000,000 executions take a total of 10 minutes, while the total run time of the program is 11 minutes.

Then your objective would be to work on step X. Working on anything else would be a waste of time, since at best it could eliminate 1 minute of an 11-minute run. Therefore you would either:
A. reduce the number of times step X is called, and/or
B. make step X more efficient (say 2x, by improving the algorithm or logic).
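Jim's point can be checked with quick arithmetic. A minimal sketch (the function name is just for illustration; the 10-minute and 11-minute figures are from his example):

```python
def overall_speedup(total_time, part_time, part_speedup):
    """Amdahl's law: the overall speedup when only one part of
    the program is accelerated by a given factor."""
    other = total_time - part_time            # time not affected
    new_total = other + part_time / part_speedup
    return total_time / new_total

# Step X takes 10 of the 11 minutes. Doubling its speed (2x)
# cuts the run from 11 minutes to 1 + 10/2 = 6 minutes,
# about a 1.83x overall speedup.
fast_x = overall_speedup(11, 10, 2)

# Optimizing everything *except* step X, even infinitely,
# can save at most 1 minute: the overall speedup caps at 1.1x.
fast_rest = overall_speedup(11, 1, 1e9)
```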

Jim
Sunday, December 07, 2008


If you can queue up enough read requests, your OS has a chance to order the disk accesses in a more optimal way; pushing them down disk channels correctly, even ordering the disk operations by track number to minimise head seek time. Etc.

@d
>> How costly would it be from a software engineering standpoint to parallelize?
My program is actually already double-threaded: one thread collects the file info, and the main thread updates the GUI. So I think it won't be a problem to parallelize.

If it is spending most of its time on IO, then how much of the whole is that? If you eliminated it entirely, how much faster would the application be? For example, if IO takes 10 seconds in your benchmark and the total time spent on everything is 100 seconds, then the best you can hope for is a total time of 90 seconds (total time, with IO time reduced to 0).

jim
Monday, December 08, 2008


Hard drives read a file serially: bits come off the disk through the read head and are reassembled into bytes, usually a sector (or a few sectors) at a time.

First, it has to read the directory entry for the file, which holds key information like the date, the file size, and the location of the first sector of the file. If that's all the information you need, you don't actually HAVE to read the file data itself.

It's not read in parallel. Unix (and Linux for all I know) tends to 'save' data to be written back to the hard drive in memory buffers, and only actually 'write' it to the physical drive "when it has time to do so" -- which is why you need to "sync" your buffers before shutting down Unix. Reading is slightly faster if all the file data is written in contiguous sectors, but even that is not mandatory.

If you really have multiple hard drives, in theory you could issue a 'read' on each one while multi-tasking, and each task could 'pend' waiting for its data to be ready. Typically this is a form of "premature optimization", because it may not buy you much compared to the difficulty of implementing such a scheme. I mean, how will your program KNOW which subdirectories are mounted on which physical hard drives? And spawning multiple tasks (or even multiple threads) one for each physical drive has some overhead associated with that -- not to mention the Inter-Process Communication you'll need to implement so the multiple tasks/threads can collate their data somewhere.

AllanL5
Monday, December 08, 2008


@Adam
Thanks for the links.

@Jim
>> If it is spending most of its time on IO then how much of the whole is that?
About 75% of total time.

@AllanL5
Thanks for the info.
>> how will your program KNOW which subdirectories are mounted on which physical hard drives?
The user selects folder(s) before the program searches for files. I know which physical hard drive it is from the drive letter extracted from those folders, so I have found a solution for this.
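The drive-letter extraction described here can be done with the Python standard library; a sketch of one way to do it (with the caveat that a drive letter identifies a partition, which may share a physical drive with other letters):

```python
import ntpath  # Windows path rules, usable even when run on other OSes

def group_by_drive(folders):
    """Group user-selected folders by drive letter.
    Caveat: two letters can still live on one physical disk."""
    groups = {}
    for folder in folders:
        drive, _rest = ntpath.splitdrive(folder)
        groups.setdefault(drive.upper(), []).append(folder)
    return groups

# Example: C: appears twice (case-insensitively), D: once.
groups = group_by_drive([r"C:\docs", r"D:\photos", r"c:\music"])
```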

Thanks everyone for feedback and suggestion. I guess I have just to try it.

Sz
Monday, December 08, 2008


Okay, but it's quite possible for one physical drive to have several partitions. Each partition will have its own drive letter, yet will still be on the same physical drive.

That's for Windows. Unix is even more complex, since each physical drive is "mounted" to a "mount point", which looks to your application like simply another subdirectory name.

Still, these aren't show-stoppers, you might as well give it a try.

AllanL5
Tuesday, December 09, 2008


"And spawning multiple tasks (or even multiple threads) one for each physical drive has some overhead associated with that -- not to mention the Inter-Process Communication you'll need to implement so the multiple tasks/threads can collate their data somewhere."

Yes, but if you want high speed data streaming, this is the way to go.

We have a column-oriented data store product that stores each column in a separate file, with the option to store each file on a separate drive if needed. We used overlapped I/O with IOCP to read the raw data from the disk, a series of queues and threads to stage the data, and a final thread to assemble it into "records" before they're dumped to a socket. It's extremely fast. We can saturate a gigE channel with ease.
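The shape of that pipeline (one reader per column file feeding a queue, plus a final stage that zips the columns into records) can be sketched generically. This is a hedged stand-in using in-memory lists in place of the actual disk reads and IOCP machinery:

```python
import queue
import threading

def reader(column_chunks, out_q):
    """Reader thread: streams values from one column file
    (represented here by a plain list) into its queue."""
    for chunk in column_chunks:
        out_q.put(chunk)
    out_q.put(None)  # end-of-stream marker

def assemble(queues):
    """Final stage: take one value from each column queue and
    zip them into a record, until any column is exhausted."""
    records = []
    while True:
        row = []
        for q in queues:
            item = q.get()
            if item is None:
                return records
            row.append(item)
        records.append(tuple(row))

# Two "columns" feeding one assembler, one reader thread per column.
cols = [["alice", "bob"], [30, 25]]
qs = [queue.Queue(maxsize=4) for _ in cols]
readers = [threading.Thread(target=reader, args=(c, q))
           for c, q in zip(cols, qs)]
for t in readers:
    t.start()
records = assemble(qs)  # [('alice', 30), ('bob', 25)]
for t in readers:
    t.join()
```

The bounded queues (`maxsize=4`) give the same back-pressure a real staging pipeline needs, so a fast reader can't run unboundedly ahead of the assembler.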

There are applications that need to refresh very large in-memory data sets. Our data store improved performance by orders of magnitude: a task that took almost a day using an RDBMS takes 30 minutes with our data store.

anony
Tuesday, December 09, 2008


Very cool, "anony". Sounds like you have a very special purpose, high-volume data slinging application. Very nice.

I might add that Gig-E runs at 125 MBytes/second (probably a little less, given the overhead of Ethernet, but still darn fast), while the parallel bus to your hard drives runs quite a bit faster. It's still an impressive achievement, given the overheads associated with Unix/Linux or whatever other operating system you're using to access the disks.

My point above was that this is doable, but only worth the effort for a few applications -- clearly yours needed it, but I'm not sure about the OP.

AllanL5
Tuesday, December 09, 2008


AllanL5:

We load several hundred million 512-byte records into a memory cache spread across multiple blades (10 is typical - each blade runs Windows Server 2003 64-bit. 8 gigs ram/blade; 8 cores/blade is typical) that perform trillions of comparisons over a span of several days.

If the primary server goes down (usually for maintenance), we have to reload those several hundred million 512-byte records into the cache.

With the new data server, it's almost painless. When customers used an RDBMS, they would do almost anything to avoid the reload, as expected.

anony
Tuesday, December 09, 2008


@anony

It sounds like a system I learned about for fingerprint matching?

Data was stored in RAM or on special cards, and a full reload would take a full day... the card data could survive a reboot...

Francesco
Wednesday, December 10, 2008


Sweet. See, there ARE applications where an RDBMS is too slow.

AllanL5
Wednesday, December 10, 2008


Nothing like an exception to prove a rule.

So tired
Wednesday, December 10, 2008


Unless your files are really, really big, OS caching will defeat the purpose of this optimization. Here are a few pointers to speed up IO:

1) Try compression and decompression to reduce IO size and time. It is typically faster to read compressed data and expand it in memory than to read a huge chunk of uncompressed data.
2) When writing to files, don't commit on every single write; do the commit after all the writes are done, and you will make full use of OS caching.
3) Check if your FS is NTFS; if so, you can turn off your own logging, because NTFS is itself journaled. Otherwise you will be writing to disk four times for every single write: log + main file, done by you and again by the OS.
4) Write and read sequentially, using FILE_FLAG_SEQUENTIAL_SCAN. Random reads and writes are slower than reading one big contiguous chunk.
5) Don't have more than about 1,000 files in a folder. This is roughly the upper limit after which opening files slows down drastically. Use subfolders to speed up CreateFile.
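The first tip (trade CPU for IO via compression) can be sketched with Python's standard gzip module. A minimal illustration; whether it actually wins depends on how compressible your data is and how slow your disk is:

```python
import gzip
import os
import tempfile

# Write some highly compressible data both raw and gzipped.
data = b"file-info-record;" * 100_000  # ~1.7 MB of repetitive text
raw_path = os.path.join(tempfile.mkdtemp(), "records.raw")
gz_path = raw_path + ".gz"

with open(raw_path, "wb") as f:
    f.write(data)
with gzip.open(gz_path, "wb") as f:
    f.write(data)

# The compressed file is far smaller, so far less has to come off
# the disk; decompression happens in memory on the way back.
with gzip.open(gz_path, "rb") as f:
    restored = f.read()

smaller = os.path.getsize(gz_path) < os.path.getsize(raw_path)
```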

Other than these, having many threads competing for access to one HDD will just make the seek head go here and there, slowing down reads. Route all requests for a given HDD through a single thread. Keep your data drives separate from the Windows drive and the page-file drive, otherwise your thread will compete with Windows.

Above all, experiment, test, and time things to come to your own conclusions; what I have found may not apply to your situation.

dd
Sunday, December 14, 2008


Francesco, not fingerprint data, but you're warm. ;-) They tried Oracle and SQL Server and both were too slow, so they hired me to write the high-speed data store, and now they are the ONLY company on the market that can reload data in minutes as opposed to hours (which add up to a day or more in some cases).

anony
Monday, December 15, 2008


"...having many threads competing for HDD access will just make the seek head go here and there slowing down reads."

This is not really true. NCQ and other technologies perform optimization and arbitration to move the head in the most efficient manner.