Most efficient way to go through a file

Posted 06 January 2013 - 03:30 PM

I have an idea for a small program that will display random lines from a text file. A database or XML would be far too heavy for this project, but the file may be large. I want to read out a single line and display it, but it has to be random. For now it doesn't matter how many times a quote has been displayed, but in the future I may track that, so maybe SQLite would be good; I'm not sure. For simplicity a flat text file will work for now.

I am thinking loading the whole thing each time would be a bit much, but if I want to move between quotes I need to read it all out into a list. Would that be best? Read the entire file into a list and then choose randomly from the list, or query the file each time?

Re: Most efficient way to go through a file

from random import shuffle

with open(filename) as f:
    lines = f.readlines()

shuffle(lines)
print(lines.pop()) # prints a random line and removes it from the list of lines

EDIT:
From the requirements you listed, this should work. However, give us more detail on the application and I can probably give you a better solution :-P

I just want a small Windows application to pull random quotes from a file and display them. I think your solution will work to prevent the same quote from being displayed more than once per run of the application. However, I will need to test the length of the list and, if it's zero, reread the file. Or have two lists.
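Something like this rough sketch is what I have in mind, assuming one quote per line in a plain text file (the name quotes.txt is just a placeholder):

import random

def quote_picker(filename):
    """Yield quotes in random order, rereading the file once the list runs out."""
    while True:
        with open(filename) as f:
            lines = [line.rstrip("\n") for line in f if line.strip()]
        random.shuffle(lines)
        while lines:
            yield lines.pop()

# picker = quote_picker("quotes.txt")
# print(next(picker))  # one random quote per call, no repeats until the file is exhausted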

Re: Most efficient way to go through a file

Posted 06 January 2013 - 11:24 PM

Well no. To my knowledge, Python does not natively support going to the nth line, but if you're clever there are some workarounds. I can think of one off the top of my head... but I don't particularly like it. As you read this, keep in mind that by sharing it with you, I am in no way suggesting I like this approach or think it's a particularly good idea. I think an SQLite database is probably better for this sort of thing at this point. That being said, this would stay in line with your requirements:

find a random number between 0 and the length of the file in bytes
have the file object move its pointer to that spot
find the end of the current line
read the next line
example:
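A rough sketch of those steps (the file is opened in binary mode so seek works on arbitrary byte offsets; wrapping around to the first line when the random spot lands on the last line is an assumption, just to avoid returning an empty string):

import os
import random

def random_line(filename):
    """Jump to a random byte offset and return the line that starts after it."""
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:
        f.seek(random.randrange(size))   # random spot between 0 and the file size
        f.readline()                     # skip to the end of the current line
        line = f.readline()              # read the next line
        if not line:                     # landed on the last line: wrap to the first
            f.seek(0)
            line = f.readline()
    return line.decode().rstrip("\n")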

Re: Most efficient way to go through a file

Posted 07 January 2013 - 04:24 AM

Note that using that approach, lines that come after long lines will be more likely to be picked than lines coming after short lines. And the first line will never be picked (unless you change the logic so that if the random spot is on the last line, it chooses the first line instead of picking a new spot).

If you want to pick a random line with equal probability and you don't want to/can't read the file into memory, you'll probably have to do something like this:

Iterate over the lines to count them

Pick a random number between 0 and the number of lines (exclusive)

Iterate over the lines again with an index and return the current line when the index equals the random number

To avoid picking duplicate lines, you'd need to keep a list of lines that you have already picked. So every time you choose a line number, you check whether it's in that list. If it is, pick another one.
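A sketch of that two-pass idea, with the already-picked bookkeeping kept in a set rather than a list (the function names are just for illustration):

import random

def count_lines(filename):
    """First pass: count the lines without keeping them in memory."""
    with open(filename) as f:
        return sum(1 for _ in f)

def pick_line(filename, already_picked):
    """Second pass: pick an unseen random line number, then fetch that line."""
    total = count_lines(filename)
    if len(already_picked) >= total:
        already_picked.clear()          # everything has been shown once: start over
    while True:
        target = random.randrange(total)
        if target not in already_picked:
            break
    already_picked.add(target)
    with open(filename) as f:
        for index, line in enumerate(f):
            if index == target:
                return line.rstrip("\n")

# seen = set()
# print(pick_line("quotes.txt", seen))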

As an alternative to iterating over the file twice, you could store the starting index of each line in a list when you iterate the first time. Then after picking a random number, you can seek directly to the starting index of that line. This will consume more memory than the first approach (one integer per line in the file - still significantly less than reading the whole file, assuming the lines aren't ultra short), so you'd be spending memory to save runtime here.

Using this approach you can avoid picking duplicates by removing a line's entry from the index list after picking it.
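A sketch of that offset-index variant, kept in binary mode so the recorded byte positions line up with seek (names are again illustrative):

import random

def build_offset_index(filename):
    """One pass over the file, recording the byte offset where each line starts."""
    offsets = []
    with open(filename, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def pick_indexed_line(filename, offsets):
    """Seek straight to a random line; drop its offset so it can't repeat."""
    start = offsets.pop(random.randrange(len(offsets)))
    with open(filename, "rb") as f:
        f.seek(start)
        return f.readline().decode().rstrip("\n")

# offsets = build_offset_index("quotes.txt")
# while offsets:
#     print(pick_indexed_line("quotes.txt", offsets))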

PS: The most efficient way as far as runtime goes is to read the whole file into memory using readlines, as atraub initially suggested. Only if the file is too large to fit into memory do the other alternatives make sense.

Re: Most efficient way to go through a file

You can go to a random position in a file, but since your lines are variable length, it's hard to know where to go...

If you're going to read all the lines to count them, then read again for the seek, you're in the worst-case scenario for speed, but memory won't be clobbered by super large files.

If you read all the lines into memory, then pick from that list, memory is hit hardest, but you'll win on speed as long as you have the memory to spare.

I like to split the difference between the two: read all the lines, keeping a running count. However, for each line read, you do a random call based on the total read thus far. If you roll 0, keep the line. This should give you a uniform random distribution across the file.
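In code, that running-count trick (a reservoir sample with a reservoir of one) might look roughly like this:

import random

def reservoir_line(filename):
    """Single pass over the file; every line ends up equally likely to be kept."""
    chosen = None
    with open(filename) as f:
        for count, line in enumerate(f, start=1):
            if random.randrange(count) == 0:   # roll based on lines read so far
                chosen = line                  # "if you roll 0, keep the line"
    return chosen.rstrip("\n") if chosen is not None else None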

Best case for a flat file? Make an index file. The index file simply consists of the starting positions of each line in the data file. Read that into memory and randomly pick. Use the position to seek to the line in the data file. You can read the index once and remove values that you've used, for the no-repeat effect you want.
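A sketch of the index-file idea, with data_file and index_file as placeholder names:

import random

def write_index(data_file, index_file):
    """Store the starting byte offset of every line, one offset per line."""
    with open(data_file, "rb") as data, open(index_file, "w") as index:
        pos = 0
        for line in data:
            index.write("%d\n" % pos)
            pos += len(line)

def random_quote(data_file, index_file):
    """Read the small index, pick an offset at random, seek and read that line."""
    with open(index_file) as f:
        offsets = [int(line) for line in f]
    with open(data_file, "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline().decode().rstrip("\n")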

Re: Most efficient way to go through a file

Posted 07 January 2013 - 06:52 AM

baavgai, on 07 January 2013 - 02:16 PM, said:

I like to split the difference between the two: read all the lines, keeping a running count. However, for each line read, you do a random call based on the total read thus far. If you roll 0, keep the line. This should give you a uniform random distribution across the file.

When you say "keep the line", do you mean "select that line as your randomly chosen line"? If so, won't that always select the first line, or at least select early lines with a much higher probability than later ones? The total read so far after the first line will be 1, and randrange(1) will always give you 0, while randint(0, 1) would still give you a 50% chance of 0. And if the first line has a 50% chance of being picked, that's clearly not an equal distribution when there are more than 2 lines.

Re: Most efficient way to go through a file

If so, won't that always select the first line? Or at least select early lines with a much higher probability than later ones?

Granted, it's initially counterintuitive. Sort of. Rather than thinking of it as a connected set, think of each random choice as unconnected. Yes, the first item gets picked first. Then the next item has a 50% chance of knocking it off. The next, a 1/3 chance, and so on. Each item should have the same chance of being picked.

That is, I think. I don't have a mathematical proof handy, but I do have a computer.
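A quick empirical check (a sketch with ten lines and 100000 trials; each line should land near 10000 picks) backs that up, and so does the arithmetic: line i is kept with probability (1/i) * product over j > i of (1 - 1/j), which telescopes to 1/n.

import random
from collections import Counter

def running_count_pick(lines):
    """Line i (1-based) replaces the current pick with probability 1/i."""
    chosen = None
    for count, line in enumerate(lines, start=1):
        if random.randrange(count) == 0:   # "roll 0, keep the line"
            chosen = line
    return chosen

lines = ["line %d" % i for i in range(10)]
counts = Counter(running_count_pick(lines) for _ in range(100000))
print(counts)   # each of the ten lines should show up close to 10000 times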

Re: Most efficient way to go through a file

Posted 07 January 2013 - 08:47 AM

Alright, so I am using sqlite3 and I have a column named ID which stores a number, and the next column holds the quote. I am trying to figure out how to read the total number of rows once at the start so I can then pull out just the item I need, but I'm having no luck. Otherwise my random number could be outside the number of rows in the table.

Re: Most efficient way to go through a file

Posted 07 January 2013 - 12:02 PM

Here we go with my solution. I spent some time with Google trying to get this. I figure it will work well enough since I won't have that much data in this database for the time being, and it should be OK if I do add more. Critiques welcome.
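The general shape is something like this sketch (assuming a table named quotes with an integer ID column starting at 1 with no gaps, and a quote column, as described above):

import random
import sqlite3

def random_quote(db_path):
    """Count the rows once, then pull a single quote by a random ID."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM quotes")
        total = cur.fetchone()[0]
        if total == 0:
            return None
        target = random.randrange(total) + 1          # IDs assumed to run 1..total
        cur.execute("SELECT quote FROM quotes WHERE ID = ?", (target,))
        row = cur.fetchone()
        return row[0] if row else None
    finally:
        conn.close()

# print(random_quote("quotes.db"))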