7 Replies - 12959 Views - Last Post: 05 December 2012 - 04:20 PM

multithreaded hashmap

Posted 30 March 2012 - 01:59 PM

I had an interview today, and an interesting topic came up. This company deals with a lot of PDF files, each of which is scanned into a String. One operation they have to do is store every word from the file in a hashmap; when a word appears more than once, the count (value) stored against that word (key) is incremented. So say we had the words

then
did
wonder
did
how

it would be
then | 1
did | 2
wonder | 1
how | 1

Now if they have 10 PDF files, processing them one after another takes roughly ten times as long as processing a single file. So the way they improve efficiency is to use threading. The count of words is not per file, but across all files: if "did" appears in file 2 as well, the single shared hashmap would increase to did | 3. This means the threads cannot update the map at exactly the same time, because if two threads each read the word "did" in their own files simultaneously, they would both read the current count of did | 3 and both write back did | 4, when after both updates the result should be did | 5. So the updates need to be synchronised somehow: a thread must not modify the hashmap while another thread is in the middle of updating it.
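To make the lost-update problem concrete, here is a minimal sketch of the shared-map approach using `ConcurrentHashMap` (the class names and structure here are my own illustration, not from the interview). `merge()` performs the read-modify-write atomically, so two threads counting "did" at the same time cannot overwrite each other:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WordCounter {
    // Shared, thread-safe map. merge() updates a count atomically, so
    // two threads seeing "did" at the same moment cannot lose an update.
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // Each thread would call this with the text of its own file.
    public void countWords(String text) {
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // atomic read-modify-write
            }
        }
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}
```

With a plain `HashMap` you would instead have to wrap every update in a `synchronized` block, which serialises the threads at exactly the point where they do the most work.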

To me personally, this seems like a strange approach to take, although when a business needs efficiency, alternatives have to be considered. I was always under the impression that HashMaps are not thread safe, so performing an operation that requires this many boundaries to be set on the threads must be risky.

Just wondering what people think of this approach, and what other alternatives they might use. Remember though, the key is efficiency, so it is no good processing one file at a time.

Re: multithreaded hashmap

Posted 30 March 2012 - 03:24 PM

I looked into that, and it would definitely solve any issues with thread safety, so that's definitely a viable option. I am going to sleep on it and see if I can come up with another approach. For some reason, I don't really like threads, so I am trying to think of a way this could be avoided. But for now, this is the only thing I can think of that can perform concurrent operations, so it's the only option. I also don't know the exact details of how the different Collections perform, so I want to check whether a hashmap really is the best choice here. I doubt something like an ArrayList would be any good, because counting the words would require quite a few passes.
I will think about this, and see what I can come up with.

Re: multithreaded hashmap

Posted 31 March 2012 - 03:42 PM

I would go with a divide-and-conquer approach: for each PDF I'd spawn a separate thread with its own hashmap, then merge them all once each has completed its task. That way the file reads are all done in parallel, and the maps are merged together in series afterwards.
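A rough sketch of that idea (class and method names are mine, just for illustration): each task counts one document into its own private `HashMap`, so no locking is needed during counting, and the per-file maps are merged at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndConquerCount {
    // Count one document into a private map: no synchronisation needed,
    // because no other thread ever sees this map while it is being built.
    static Map<String, Integer> countOne(String text) {
        Map<String, Integer> local = new HashMap<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    static Map<String, Integer> countAll(List<String> docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (String doc : docs) {
            futures.add(pool.submit(() -> countOne(doc))); // one task per file
        }
        // Merge the per-file maps in series once every task is done.
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> f : futures) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        pool.shutdown();
        return total;
    }
}
```

The merge step only touches one entry per distinct word per file, which is usually far cheaper than locking the shared map on every single word.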

Re: multithreaded hashmap

Posted 31 March 2012 - 03:48 PM

The only problem with that approach is that although you are reading the files concurrently in order to produce a single result, what you are left with is still the contents of 10 files. So going through that combined result will take a similar amount of time as going through the files separately. In terms of efficiency, they would need to remain 10 individual files, analysed concurrently.

Re: multithreaded hashmap

Posted 09 November 2012 - 06:20 AM

No, his solution is the more efficient one. I have run this kind of code many times and it is the fastest approach. The time you save by minimising the hashmap locking while processing the different documents is greater than the cost of merging the hashmaps at the end. And the final merge can be done in parallel too.

Re: multithreaded hashmap

Posted 05 December 2012 - 04:20 PM

9 months late but anyway...

Is counting the words really the slow part? I would have thought it would be reading the files, maybe parsing them but certainly not counting words. Did anyone profile the application before jumping on the threading bandwagon?

Threading doesn't change the complexity class of the algorithm, only the constant. If I can come up with a faster algorithm then that's at least as good as multithreading -- better because you might also be able to multithread that!

The hash map approach implies a three-pass (maybe two-pass) process: one pass to convert the PDF to a String, one to extract words from the String, and another to calculate the hash function on each word. I guess the first two could be combined to make it a two-pass process.

Instead of counting with hash maps, how about an alpha tree structure (essentially a trie)? One pass to get a String from the PDF, another pass to iterate over the String while traversing the tree. If you hit a space (or other non-word character), increment the node you are on and jump back to the root node. You can make this a one-pass algorithm if you can apply it to a stream of characters from the PDF file. Then you can multithread it too, if you have to.
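A minimal sketch of that counting trie, assuming lowercase a-z words only (the class is hypothetical, just to show the walk-and-increment idea): walk down the tree on each letter, and on a word boundary bump the current node's counter and jump back to the root.

```java
public class CountingTrie {
    // One node per prefix; count is how many times the word
    // ending at this node has been seen.
    private static class Node {
        Node[] children = new Node[26];
        int count;
    }

    private final Node root = new Node();

    // Single pass over the text: descend on letters, increment on boundaries.
    public void countWords(String text) {
        Node cur = root;
        for (char c : text.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                int i = c - 'a';
                if (cur.children[i] == null) cur.children[i] = new Node();
                cur = cur.children[i];
            } else if (cur != root) {
                cur.count++;   // word boundary: count the word, restart at root
                cur = root;
            }
        }
        if (cur != root) cur.count++; // a word ending at end of input
    }

    // Look up the count for one word (assumes lowercase a-z input).
    public int countOf(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children[c - 'a'];
            if (cur == null) return 0;
        }
        return cur.count;
    }
}
```

Note this avoids hashing entirely, at the cost of more memory per distinct prefix; whether it actually beats a hash map is exactly the kind of thing profiling should decide.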

Of course, if you haven't profiled your application then most of this will be a waste of time anyway!