Thursday, September 3, 2015

Working with text files in Python 2.7

Reading text files in Python 2.7 looks simple at first sight, but opening them incorrectly can lead to unexpected results.

For example, suppose you have a file that looks ASCII-encoded, and you read it line by line, saving each line number and its text. If some non-ASCII characters are hidden in the file, you will run into encoding problems. In Python this shows up as a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position ??: ordinal not in range(128)
Moreover, if the line endings are not uniform for a single platform (i.e. '\r\n' or '\r' occurs alongside '\n'), the line numbering can come out wrong: if you open the log file in a text editor, its line numbers will not match the ones your Python script reports.
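As a rough illustration of how this error arises, the sketch below writes a few bytes to a temporary file and reads them back as ASCII. The helper name try_read_ascii and the sample bytes are assumptions made for this example, not part of the original script; io.open is used so that the snippet also runs on Python 3.

```python
import io
import os
import tempfile


def try_read_ascii(raw_bytes):
    """Write raw_bytes to a temporary file and try to read it back as ASCII.

    Hypothetical helper, for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw_bytes)
        try:
            with io.open(path, 'r', encoding='ascii') as f:
                return f.read()
        except UnicodeDecodeError as exc:
            # exc.reason carries the tail of the familiar error message.
            return 'error: %s' % exc.reason
    finally:
        os.remove(path)


print(try_read_ascii(b'plain ascii'))   # pure ASCII decodes without trouble
print(try_read_ascii(b'caf\xc3\xa9'))   # the UTF-8 bytes of an accented letter do not
```

The second call fails on the first byte above 0x7F, which is the same situation as a log file that only looks ASCII-encoded.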
Let's prepare the test logfile:

and the output is:
4: line 5
6: line 5
Here codecs.open gave the same result as a simple open, but io.open gave the expected result.
The newline argument of io.open makes it possible to enable universal newlines mode and read lines regardless of the line ending format (Windows, Unix, Mac OS up to version 9), i.e. lines can end with '\r\n', '\n', or '\r'. This mode is enabled by default (when newline=None). Lines are therefore split correctly, and the line numbers match those shown in text editors.
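A minimal sketch of the effect of the newline argument, assuming a small made-up sample with mixed line endings (the SAMPLE bytes and the read_lines helper are hypothetical and are not the test logfile used above):

```python
import io
import os
import tempfile

# Hypothetical sample: four lines with Unix, Windows and old Mac endings mixed.
SAMPLE = b'line 1\nline 2\r\nline 3\rline 4\n'


def read_lines(raw_bytes, newline):
    """Write raw_bytes to a temporary file and return the lines io.open yields."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw_bytes)
        with io.open(path, 'r', encoding='ascii', newline=newline) as f:
            return f.readlines()
    finally:
        os.remove(path)


universal = read_lines(SAMPLE, None)   # '\n', '\r\n' and '\r' all end a line
unix_only = read_lines(SAMPLE, '\n')   # only '\n' ends a line

print(len(universal))   # 4
print(len(unix_only))   # 3
```

With newline=None every ending is also translated to '\n', so the four lines come back uniform. With newline='\n' the lone '\r' is not treated as a line break, so "line 3" and "line 4" are returned as a single line and every later line number is off by one.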
In Python 3.x io.open is the default interface to access files and streams.

Both codecs.open and io.open take an encoding argument that specifies the encoding to use for the file. They also take an errors argument, an optional string that specifies how encoding and decoding errors are handled. The difference lies in line endings: codecs.open has no newline argument and, once an encoding is given, opens the underlying file in binary mode, so it does not translate line endings.
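To make the two arguments concrete, here is a small sketch that reads the same bytes with different encoding and errors values. It uses io.open only; the RAW bytes and the decode_file helper are assumptions made for the example.

```python
import io
import os
import tempfile

# Hypothetical sample: UTF-8 bytes for "caf" + e-acute + newline.
RAW = b'caf\xc3\xa9\n'


def decode_file(raw_bytes, encoding, errors='strict'):
    """Write raw_bytes to a temporary file and read it back as text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw_bytes)
        with io.open(path, 'r', encoding=encoding, errors=errors) as f:
            return f.read()
    finally:
        os.remove(path)


as_utf8 = decode_file(RAW, 'utf-8')              # correct encoding: decodes cleanly
replaced = decode_file(RAW, 'ascii', 'replace')  # each bad byte becomes U+FFFD
ignored = decode_file(RAW, 'ascii', 'ignore')    # bad bytes are silently dropped

print(repr(as_utf8))
print(repr(replaced))
print(repr(ignored))
```

With the default errors='strict' and encoding='ascii', the same call would raise UnicodeDecodeError instead. 'replace' and 'ignore' avoid the exception but lose data, so naming the correct encoding is usually the better fix.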
After adding these arguments: