Flat File Search

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

Hi all, what's the fastest way to search a flat CSV file (27,000 rows, 5 fields)? I'm currently using BufferedReader and breaking the fields apart with StringTokenizer. How can I make searching this file faster? Thanks

Depends on what specifically you are searching for. Possibly regular expressions could help.

The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus

Mark Herschberg

Sheriff

Posts: 6037

posted 15 years ago

Can you be more specific? What are you looking for? What do you know, if anything, about its location in the file? What do you know about the other data? --Mark

Jim Yingst

Wanderer
Sheriff

Posts: 18671

posted 15 years ago

The StringTokenizer is creating 5 String objects for each line read. This is probably unnecessary overhead, especially if most lines don't have what you're looking for anyway. StringTokenizer is more for ease of use than efficiency. The same may well be true for regular expressions, but at least they're much more flexible. Some alternatives are: use String's indexOf() method to find a particular character, if it exists. Or, read a group of chars into a char[], and loop through the array to see what's in it. Avoid creating objects unless there's a good reason. As Ilja says though, the best solution will depend on what you're searching for. Give some more details, and we may have better recommendations.
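A minimal sketch of the indexOf() idea above, in modern Java syntax (the sample data and field layout are hypothetical, just for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineScan {
    // Scan lines, keeping only those containing the target substring.
    // No per-field String objects are created for the lines that don't
    // match -- which is most of them.
    static List<String> matches(BufferedReader in, String target) throws IOException {
        List<String> hits = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.indexOf(target) >= 0) { // cheap pre-filter, no tokenizing
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        String csv = "Miami,FL,33341,305,E\nZwolle,LA,71486,318,C\n";
        List<String> hits = matches(new BufferedReader(new StringReader(csv)), "Miami");
        System.out.println(hits); // [Miami,FL,33341,305,E]
    }
}
```

In real use the BufferedReader would wrap a FileReader over the CSV file rather than a StringReader.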

"I'm not back." - Bill Harding, Twister

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

I give the user the ability to search on one of the three: City, Area Code, Zip. The results that are displayed are "City, State, Zip, Area Code, Time Zone". Thanks Greg

Jim Yingst

Wanderer
Sheriff

Posts: 18671

posted 15 years ago

I'd try using String's indexOf(String) to see if a given line contains the substring you're looking for. Most of the time this will be no, so no need to spend any further time on that line. If it's a yes, then you can use a StringTokenizer to work through the commas and make sure the string is in the right place. There are faster ways, I think, but they'll be more complex to implement. A couple things to consider - should case be ignored? Does the user's search pattern need to allow wildcards? If either answer is yes, indexOf() will be hard to use effectively, and either a regex package, or a handbuilt parser based around a char[] array read, would be the way I'd go. If performance is still too poor with indexOf(), then it probably will be for regex too - work on reading a char[] array and finding the pattern you want in it.
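A small sketch of that two-step check -- indexOf() as the fast reject, StringTokenizer only on the rare line that passes (field layout here is hypothetical):

```java
import java.util.StringTokenizer;

public class FieldCheck {
    // Cheap pre-filter with indexOf, then tokenize only the matching
    // line to confirm the hit sits in the expected field.
    static boolean matchesField(String line, String target, int fieldIndex) {
        if (line.indexOf(target) < 0) return false; // fast reject for most lines
        StringTokenizer st = new StringTokenizer(line, ",");
        for (int i = 0; i < fieldIndex && st.hasMoreTokens(); i++) {
            st.nextToken(); // skip the fields before the one we care about
        }
        return st.hasMoreTokens() && st.nextToken().trim().equals(target);
    }

    public static void main(String[] args) {
        // Hypothetical field layout: City,State,Zip,AreaCode,TimeZone
        System.out.println(matchesField("Miami,FL,33341,305,E", "Miami", 0)); // true
        System.out.println(matchesField("Miami,FL,33341,305,E", "Miami", 2)); // false
    }
}
```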

"I'm not back." - Bill Harding, Twister

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

What about a HashMap? Would that be any use for this problem? Thanks Greg

Jim Yingst

Wanderer
Sheriff

Posts: 18671

posted 15 years ago

It depends how many times you expect to need to search for data, and whether you can afford to keep all the data in memory. You could read the whole file line by line, and for each line put a new entry in each of four hashtables. Each table would use a different field as its key, and would map to an object holding all the (other) data for that line. This would make the initial read-and-load of the file somewhat slower - but would allow you to look up data very quickly afterwards. So it's worthwhile if you've got the memory, and searching will be done often enough that the benefit of fast searches outweighs the initial detriment of the slow load time. Also, the hashtable will be no use if you want to search using wildcards, or substrings within a key. Only the exact value of the key will do. If any of this is a problem, you're probably better off with one of the other approaches mentioned.
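A minimal sketch of that load-once-search-many idea, using one map per searchable field (the field order shown is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class CsvIndex {
    // One map per searchable field; each maps its key to the full record.
    // Hypothetical field order: City,AreaCode,Zip,TimeZone
    final Map<String, String[]> byCity = new HashMap<>();
    final Map<String, String[]> byArea = new HashMap<>();
    final Map<String, String[]> byZip  = new HashMap<>();

    // Pay the parsing cost once, up front, for every line.
    void load(String[] lines) {
        for (String line : lines) {
            String[] f = line.split(",");
            byCity.put(f[0], f);
            byArea.put(f[1], f);
            byZip.put(f[2], f);
        }
    }

    public static void main(String[] args) {
        CsvIndex idx = new CsvIndex();
        idx.load(new String[] { "Miami,305,33341,E", "Zwolle,318,71486,C" });
        // Each lookup afterwards is a constant-time hash probe.
        System.out.println(idx.byArea.get("305")[0]); // Miami
    }
}
```

Note the exact-key limitation Jim mentions: `byCity.get("Mia")` returns null; only the full key matches.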

"I'm not back." - Bill Harding, Twister

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

Jim, thanks for your input. I'm going to try the indexOf(String) idea. I learned that you can search for values without first breaking the line into tokens. I'll see if it's any faster. Currently the search for Miami takes about 45 seconds. --Greg

Mark Herschberg

Sheriff

Posts: 6037

posted 15 years ago

Jim always beats me to the wise answers :-) He's right that you need to consider how often it'll be searched. If often, it should be loaded into memory. Another option is sorting (and possibly indexing, as well). Is your data sorted? If so, it can be searched much faster. Of course, you have to balance how often the data changes versus how often it's searched. If it is sorted, you may want to create an index to speed searching. --Mark
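To illustrate why sorted data searches so much faster: once the lines are in memory and sorted by their key field, a binary search needs only about 15 comparisons for 27,000 rows instead of scanning them all. A minimal sketch (sample lines are hypothetical, and each line is assumed to contain a comma after the key):

```java
public class SortedLookup {
    // Binary search over lines sorted by their first (key) field.
    // Assumes every line has a comma after the key.
    static int find(String[] sortedLines, String key) {
        int lo = 0, hi = sortedLines.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String line = sortedLines[mid];
            String midKey = line.substring(0, line.indexOf(','));
            int cmp = midKey.compareTo(key);
            if (cmp == 0) return mid;        // found the key's line
            if (cmp < 0) lo = mid + 1;       // key sorts after midKey
            else hi = mid - 1;               // key sorts before midKey
        }
        return -1; // not present
    }

    public static void main(String[] args) {
        String[] lines = { "Acme,318,71003,C", "Miami,305,33341,E", "Zwolle,318,71486,C" };
        System.out.println(find(lines, "Miami")); // 1
    }
}
```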

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

Mark/Jim, to speed the search the data is pre-sorted into 3 files: citydata, areadata and zipdata. The key field is the first field in the row. For example:
citydata: Miami, 305, 33341, E
zipdata: 33341, Miami, 305, E
areadata: 305, Miami, 33341, E
Is it possible to index this file? Thanks

Thomas Smets

Ranch Hand

Posts: 111

posted 15 years ago

I haven't got much experience with that, but maybe you could have a look at the Jakarta ORO matcher, or Steve's pat regular expression matcher (Regular Expressions in Java). I posted a question directly to Steven (the author of the javaregex library), and he gave me the following reply:

I don't have the time to make a login and reply right now, however... The String.indexOf is probably the best thing suggested on the board. A better answer would probably be to read the whole file into memory, then search it via a Boyer-Moore-Horspool algorithm. Package pat uses this if the pattern is a string literal. To avoid some of the overhead of regexes you can use the SkipBMH class contained within it. The BMH algorithm can find things faster than indexOf(). One key to speed is making fewer Strings. That is why I recommend reading the whole file into memory (if possible). Once you've found the item you're looking for, you can backtrack to find the beginning and end of the line -- then maybe you can use things like substring or StringTokenizer. Of course, this assumes you have a fairly normal CSV file. In principle, CSV allows you to put newlines in quotes, making them part of a field, and so on. This kind of processing can get quite complex. Cheers, Steve
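For reference, a minimal sketch of the Boyer-Moore-Horspool idea Steve describes -- this is not pat's actual SkipBMH code, just the bare algorithm over an in-memory char[]:

```java
import java.util.Arrays;

public class Horspool {
    // Minimal Boyer-Moore-Horspool substring search over a char[].
    // The skip table says how far the window may jump when the last
    // character of the window mismatches -- often the whole pattern length.
    static int indexOf(char[] text, String pat) {
        int m = pat.length(), n = text.length;
        if (m == 0) return 0;
        int[] skip = new int[Character.MAX_VALUE + 1]; // 64K ints; built once per pattern
        Arrays.fill(skip, m);
        for (int k = 0; k < m - 1; k++) {
            skip[pat.charAt(k)] = m - 1 - k;
        }
        int i = 0;
        while (i + m <= n) {
            int j = m - 1;
            while (j >= 0 && text[i + j] == pat.charAt(j)) j--;
            if (j < 0) return i;            // full match at offset i
            i += skip[text[i + m - 1]];     // jump by the window's last char's skip
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] text = "Acme,318,71003,C\nMiami,305,33341,E\n".toCharArray();
        System.out.println(indexOf(text, "Miami")); // 17
    }
}
```

Having found the offset, you can scan backwards and forwards for the surrounding newlines to recover the whole line, as Steve suggests.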

As I mentioned earlier, this reply is from Steven R. Brandt. You can contact him at javaregex.com. Hope this helps, [ May 04, 2002: Message edited by: Thomas SMETS ]

Thomas Smets
Just another developer

Mark Herschberg

Sheriff

Posts: 6037

posted 15 years ago

Originally posted by Greg Kearn: Mark/Jim, to speed the search the data is pre-sorted into 3 files: citydata, areadata and zipdata. The key field is the first field in the row. For example, citydata: Miami, 305, 33341, E. zipdata: 33341, Miami, 305, E. areadata: 305, Miami, 33341, E. Is it possible to index this file? Thanks

Well, if the data is sorted, you can keep track of indexes, like, say where area codes starting with each digit are located; then you can skip the earlier bytes of data. There are probably more complex alternatives, but this is the basic idea. --Mark
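One way to sketch that basic idea, assuming the lines are already loaded into memory and sorted on the key field (the sample data is hypothetical; Mark's digit index for area codes is the same idea with the keys '0'-'9'):

```java
import java.util.HashMap;
import java.util.Map;

public class FirstCharIndex {
    // For lines sorted on the key field, remember where each leading
    // character's block starts, so a search can skip straight to it
    // instead of scanning from the top.
    static Map<Character, Integer> build(String[] sortedLines) {
        Map<Character, Integer> index = new HashMap<>();
        for (int i = 0; i < sortedLines.length; i++) {
            index.putIfAbsent(sortedLines[i].charAt(0), i); // first line with that char
        }
        return index;
    }

    public static void main(String[] args) {
        String[] lines = { "Acme,318", "Albany,518", "Miami,305", "Zwolle,318" };
        System.out.println(build(lines).get('M')); // 2
    }
}
```

A file-based variant would store byte offsets instead of line numbers and seek to them with RandomAccessFile, but the skipping principle is the same.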

Greg Kearn

Greenhorn

Posts: 20

posted 15 years ago

How can an index be kept? The file has to be read every time starting at the top and working its way down looking for the matching value. Is there another way to search a 27,000 row csv file? I did try the indexOf search and it did save about 5 seconds off a long search. Thanks Greg

Mark Herschberg

Sheriff

Posts: 6037

posted 15 years ago

It really depends on your constraints. You could create a second file for holding the index, or you can put the index at the top of this file. Even if it's a sorted file without an index, you can speed your searching by only checking for a single-digit match at first, making the comparisons faster (although some JVMs may do this automatically). In cases like this you must balance where you put the costs. If you do more searching, this makes sense; if you do more data modifications, it may be more trouble than it's worth. --Mark

Frank Carver

Sheriff

Posts: 6920

posted 15 years ago

Also, if the data doesn't change very often you might consider loading it into a database: take that 45-second hit once when the data changes, then get much faster searching through JDBC. Installing and using a database doesn't need to be a big, complex undertaking; something like Hypersonic SQL is basically just a jar file and a config to tell it where to store its data. Using a database would mean that you can configure things like indexing in a high-level manner without worrying about how to code it yourself.

I gave Hypersonic SQL a try without success, so I tried PointBase and that did work. Time started when the button was pressed and ended when the data stopped scrolling. The time results are as follows:

City     Top/Bot Srch   PointBase Srch
Acme     :03            :09
Miami    1:10           :43
Zwolle   1:25           :09

Thanks for all your input! --Greg