I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so as an exercise, I wanted to write a program in Python to do it.

What I think I want the program to do is to find the size of the file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line break and write that, then close the output file, and so on. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem-related parts: getting the file size, reading and writing in chunks, and reading to a line break?

I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)
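A minimal sketch of the three pieces being asked about, assuming Python 3 (the function name `demo` and the 4096-byte default chunk size are only illustrative):

```python
import os

def demo(path, chunksize=4096):
    size = os.path.getsize(path)   # total file size in bytes
    with open(path, 'rb') as f:
        chunk = f.read(chunksize)  # read up to chunksize bytes
        rest = f.readline()        # then read up to and including the next b'\n'
    return size, chunk, rest
```

`f.read(n)` returns at most `n` bytes, and `f.readline()` after it finishes the current line, which is the trick for landing splits on line boundaries.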

Unwelcome suggestion: get a better text editor. :-) If you're on Windows, EmEditor is one I know of that will seamlessly edit files without having to load them completely into memory.
– bobince, Nov 15 '08 at 13:00

Thanks for the answer - your suggestions are working well so far for reading the file. When I've finished, I'll also try a binary version that doesn't read one line at a time.
– quamrana, Nov 15 '08 at 20:04


What is wrong with os.path.getsize(filename)?
– J.F. Sebastian, Nov 16 '08 at 18:02

I've written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it's any quicker.
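That binary version could be sketched roughly as follows (untested against a real 1GB file; the name `split_binary`, the `.000`-style suffixes, and the 1MB buffer size are illustrative choices):

```python
import os

def split_binary(infilepath, parts, bufsize=1 << 20):
    """Split infilepath into `parts` files named infilepath.000, .001, ...
    Each part is roughly size/parts bytes, extended to the next line
    break so no line is cut; the last part copies to end of file."""
    partsize = os.path.getsize(infilepath) // parts
    with open(infilepath, 'rb') as infile:
        for n in range(parts):
            with open('{}.{:03d}'.format(infilepath, n), 'wb') as outfile:
                if n == parts - 1:
                    # Last output file: just copy everything that is left.
                    for chunk in iter(lambda: infile.read(bufsize), b''):
                        outfile.write(chunk)
                else:
                    remaining = partsize
                    while remaining > 0:  # copy the bulk in buffered chunks
                        chunk = infile.read(min(remaining, bufsize))
                        if not chunk:
                            break
                        outfile.write(chunk)
                        remaining -= len(chunk)
                    outfile.write(infile.readline())  # finish the current line
```

Because it moves data in large binary chunks instead of line by line, this is the shape most likely to be faster, at the cost of slightly uneven part sizes.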

Yes, it is not Python, but why use a screwdriver to apply a nail?
– Svante, Nov 16 '08 at 1:05

Well it's not really a screwdriver vs. nail... Python often is a great way to accomplish simple tasks such as this. And I don't want to bash bash (pun intended), but that is not really... readable :)
– Agos, Feb 4 '10 at 23:22

It is very readable, you just need to know the language.
– Svante, Feb 5 '10 at 21:28

While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:

def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.', 1)
    i = 0
    with open(infilepath) as infile:
        while True:
            # Read the next chunk of lines; readline() returns '' at EOF.
            lines = [infile.readline() for _ in range(chunksize)]
            if not lines[0]:
                break  # EOF reached; avoids writing an empty trailing file
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                outfile.writelines(lines)
            i += 1

I had a requirement to split csv files for import into Dynamics CRM, since the file size limit for import is 8MB and the files we receive are much larger. This program allows the user to input FileNames and LinesPerFile, and then splits the specified files into files of the requested number of lines. I can't believe how fast it works!
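The program itself isn't posted, but the lines-per-file approach it describes can be sketched like this (the name `split_by_lines` and the `.000`-style suffixes are mine; repeating the CSV header row in each output file is omitted):

```python
import itertools

def split_by_lines(infilepath, lines_per_file):
    """Write infilepath back out as infilepath.000, .001, ...
    with at most lines_per_file lines in each output file."""
    with open(infilepath) as infile:
        for n in itertools.count():
            # islice on the open file pulls the next batch of lines lazily.
            lines = list(itertools.islice(infile, lines_per_file))
            if not lines:
                break
            with open('{}.{:03d}'.format(infilepath, n), 'w') as outfile:
                outfile.writelines(lines)
```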