Introduction

Once upon a time, a text file was just a simple file. But it is not that easy anymore. New lines can be written in three different ways: Windows/DOS uses characters 13 and 10, classic Macintosh uses just character 13, and Unix uses just character 10. Why it is like that has always puzzled me. Different character sets make reading and writing text files harder, so I'm glad that we have Unicode to use instead. But as you might know, writing text in Unicode can be done in several ways...

Text encoding

I wrote CTextFileDocument because I thought it was too complicated to read and write files using Unicode characters. I also wanted the class to handle ordinary 8-bit files. Since version 1.20, different codepages are supported when reading/writing ASCII files.

The encodings the classes can read and write are:

CTextFileBase::ASCII

Simple 8-bit files (different codepages are supported).

CTextFileBase::UTF_8

UTF-8 encoded files. A character may be written in one, two or three bytes.

CTextFileBase::UNI16_BE

Unicode, big-endian. Every character is written in two bytes. Most significant byte is written first.

CTextFileBase::UNI16_LE

Unicode, little-endian. Every character is written in two bytes. Least significant byte is written first.

Most of the code on this page is for Windows/MFC, but it should work on other platforms as well. The only major difference is that code-page selection via SetCodePage is only supported on Windows; on other platforms, you should use setlocale to specify which code-page to use. It's not necessary to use MFC on Windows.

Structure

CTextFileDocument consists of three classes:

CTextFileBase

This is the base class for the other two classes.

CTextFileWrite

Use this to write files.

CTextFileRead

Use this to read files.

There are some useful member functions in the base class:

class CTextFileBase
{
public:
    CTextFileBase();
    ~CTextFileBase();

    //Is the file open?
    int IsOpen();

    //Close the file
    void Close();

    //Return the encoding of the file (ASCII, UNI16_BE, UNI16_LE or UTF_8)
    TEXTENCODING GetEncoding() const;

    //Set which character should be used when converting
    //Unicode->multi-byte and an unknown character is found ('?' is default)
    void SetUnknownChar(const char unknown);

    //Returns true if data was lost
    //(happens when converting a Unicode string to a multi-byte string
    //and an unmappable character is found)
    bool IsDataLost() const;

    //Reset the data lost flag
    void ResetDataLostFlag();

    //Set codepage to use when working with non-Unicode strings
    void SetCodePage(const UINT codepage);

    //Get codepage to use when working with non-Unicode strings
    UINT GetCodePage() const;

    //Convert char* to wstring
    static void ConvertCharToWstring(const char* from,
                                     wstring &to, UINT codepage=CP_ACP);

    //Convert wchar_t* to string
    static void ConvertWcharToString(const wchar_t* from,
                                     string &to, UINT codepage=CP_ACP,
                                     bool* datalost=NULL, char unknownchar=0);
};

The first five functions are the most important ones, and I hope what they do is obvious. The rest are needed when working with different code-pages.

Document/View

If you are using Document/View, you probably want to save and read your files in the Serialize function. A problem with this is that you can't close the CArchive object; if you do, you will get an ASSERT error. So instead of using the constructors where you specify the file name, you should use the constructors that take a CFile pointer. When you do this, the file will not be closed when the object is deleted. The following sample is derived from CEditView; instead of the original code, which only reads ASCII files, it reads Unicode as well:
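A minimal sketch of such a Serialize override might look like this. CMyEditView is a hypothetical class name and UTF_8 is just one choice of output encoding; adapt both to your application:

```cpp
// Sketch only: Serialize in a hypothetical CEditView-derived class.
// Passing the CFile pointer (instead of a file name) means the file
// is NOT closed when the CTextFileWrite/CTextFileRead object is deleted.
void CMyEditView::Serialize(CArchive& ar)
{
    if (ar.IsStoring())
    {
        //Write the edit control's text with the chosen encoding
        CTextFileWrite file(ar.GetFile(), CTextFileBase::UTF_8);
        CString text;
        GetEditCtrl().GetWindowText(text);
        file << text;
    }
    else
    {
        //The encoding is detected from the BOM when reading
        CTextFileRead file(ar.GetFile());
        CString text;
        file.Read(text);
        GetEditCtrl().SetWindowText(text);
    }
}
```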

Code-pages/Character sets

I hope that most of the code you have seen so far is quite straightforward to use. It's a little bit more difficult when you want to work with different code-pages (or "character sets", I don't understand the difference).

Before Unicode, there was the problem of how to represent characters used in some parts of the world (a-z wasn't enough). For example, we who live in Sweden like the character 'å'. The character 'å' can be found in code-page 437, where it has character code 134. However, 'å' also exists in code-page 1252, but there it has character code 229! Does that sound complicated? Wait, it gets worse!

In some other countries, more complicated scripts are used, as in Korea. There, a single-byte table is too small for all characters, so to make it possible to represent them all, some characters must use two bytes. Code-page 949 has lots of multi-byte characters, like this one: 이 (code: C0CC = U+C774) (don't worry if you can't see the character). That character is represented by two bytes (192 and 204). If you open an ASCII file that uses this character in Notepad while using code-page 949, you will see the character correctly. But if you do the same thing using code-page 1252 instead, you will see two characters ("ÀÌ").

It is obviously quite hard to handle all the different code-pages; that's why Unicode was invented. The idea of Unicode is that only one character set should be used and that every character should have the same size (no more multi-byte solutions necessary).

So Unicode is great, but we still need to deal with files that use different code-pages. CTextFileDocument does this for you if you define which code-page to use (if you don't, it will use the code-page used by the system and that mostly works well).

If you read an ASCII file into a Unicode string (like wstring, or CString if _UNICODE is defined), the string will be converted using the code-page you have selected. The same thing happens (but in the other direction) if you write a Unicode string to an ASCII file.

Remember that the string will not be converted if you read/write an ASCII file to/from a non-Unicode string. Later I will show what to do if you want to convert text from one code-page to another.

When you convert a Unicode string to a multi-byte string, it can happen that some characters can't be converted. These characters are by default replaced with a question mark ('?'), but you can change this by calling SetUnknownChar(). If you want to know whether this has happened, call IsDataLost().

Some Windows-APIs

CTextFileDocument uses two Windows APIs to convert strings: MultiByteToWideChar and WideCharToMultiByte. When these functions are used, the code-page of the multi-byte string must be specified. By default, CTextFileDocument uses CP_ACP, which means that the system default code-page is used. If you want to use another code-page, call SetCodePage.

When you set which code-page to use, you must be sure that the code-page exists. You can check this by calling IsValidCodePage.

Example 1

OK, enough talk about code-pages; here is an example. The following code reads an ASCII file (with code-page 437) into a Unicode string, then creates a new ASCII file and writes the string with code-page 1252.

This is how you convert a string from one code-page to another: convert the multi-byte string to a Unicode string, then convert the Unicode string to a multi-byte string. If you don't want to write the string to a file, you can use ConvertCharToWstring and ConvertWcharToString, which are found in CTextFileBase.

//Make file reader. Read the file "ascii-437.txt"
CTextFileRead reader("ascii-437.txt");

//Define which code-page to use when we read the file.
//437 was very often used in DOS.
reader.SetCodePage(437);

//Read everything to a Unicode-string
wstring alltext;
reader.Read(alltext);

//Close file
reader.Close();

//Now we create a new ASCII-file
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

//Set which code-page to use.
//1252 is very often used in Windows
writer.SetCodePage(1252);

//Do the writing...
writer << alltext;

//Was data lost when the Unicode-string was converted to
//code-page 1252?
if(writer.IsDataLost())
{
    //Do something...
}

//Close the file
writer.Close();

Example 1b

As I said before, it should be possible to use CTextFileDocument on platforms other than Windows. If you do, you must know that code-pages are handled slightly differently. Instead of calling SetCodePage, you call setlocale to define which code-page to use. The following code does the same thing as the last example, but should work on every platform (I hope ;-)):

//Make file reader. Read the file "ascii-437.txt"
CTextFileRead reader("ascii-437.txt");

//Define which code-page to use when we read the file.
//437 was very often used in DOS.
//NOTE: Make sure setlocale doesn't return an empty
//string. If it does, you have probably tried to use
//a code-page that your system doesn't support
cout << setlocale(LC_ALL, ".437") << endl;

//Read everything to a Unicode-string
wstring alltext;
reader.Read(alltext);

//Close file
reader.Close();

//Now we create a new ASCII-file
CTextFileWrite writer("ascii-1252.txt", CTextFileBase::ASCII);

//Set which code-page to use.
//1252 is very often used in Windows
cout << setlocale(LC_ALL, ".1252") << endl;

//Do the writing...
writer << alltext;

//Was data lost when the Unicode-string was converted to
//code-page 1252?
if(writer.IsDataLost())
{
    //Do something...
}

//Close the file
writer.Close();

About the code

CTextFileDocument was originally written to use MFC, but now it's more platform-independent. To make this possible, there are some #defines in the code. The most important one is PEK_TX_TECHLEVEL, which defines which features to use. You shouldn't need to think about this; the code should define it correctly automatically. The table below explains the differences:

PEK_TX_TECHLEVEL = 0

This is used if you are running on a non-Windows platform. It uses fstream internally to read and write files. If you want to change the codepage, call setlocale.

PEK_TX_TECHLEVEL = 1

This is used on Windows if you don't use MFC. This calls Windows API directly to read and write files. If something couldn't be read/written, a CTextFileException is thrown. Codepages are supported. Unicode in filenames is supported.

PEK_TX_TECHLEVEL = 2

This is used if you are using MFC. This uses CFile internally to read and write files. If data can't be read/written, CFile will throw an exception. Codepages are supported. Unicode in filenames is supported. CString is supported.

Points of interest

Even if the classes are quite simple, they have been very useful to me. They have all the features I want, so I don't miss anything important. However, it would be nice if they supported more encodings, like UTF-32. Maybe I'll add this in the future. The performance is quite good, but if you know some way to make it faster, let me know :-).

One thing that would probably improve the performance is increasing the value of BUFFSIZE (defined in CTextFileBase). Another is improving the code in CTextFileRead::GuessCharacterCount. It should return the number of characters in the file; currently, it only works if you are using MFC, otherwise it returns 1 MB. GuessCharacterCount is only used when Read is called, not when ReadLine is called.

How many bytes is a wchar_t? That is compiler-dependent, and I think that could give me some problems in the future. On Windows, wchar_t is two bytes, but I think that on Unix, four bytes are used. Currently, this is not a problem, but if I add support for UTF-32 (four bytes for every character), some problems may occur.

Why isn't IsOpen() a const function? I think it should be, but that is impossible: fstream::is_open() is not const (well, it is in my VC6, but not in standard C++). Why it is like this is a mystery to me.

The classes expect the files to have a "byte order mark" (BOM) in the first bytes; these bytes tell which encoding is used. The first two bytes in a big-endian file are 0xFE and 0xFF; in a little-endian file, the order is reversed (0xFF, 0xFE). If the encoding is UTF-8, the first three bytes are 0xEF, 0xBB and 0xBF. If no BOM is found, the file is treated as an ASCII file.

You may wonder why I call these classes CTextFileDocument. The simple reason is that the name CTextFile was already taken... It was quite annoying to find that out just a couple of minutes before I wanted to upload the article :-).

And finally, thank you to all of you who have commented on, found bugs in, and created fixes for the code. These classes have been improved a lot thanks to this.

History

21 May, 2005 - Version 1.22.

Reading a line before reading everything could add an extra line break, fixed.

A member variable wasn't always initialized, could cause problems when reading single lines, fixed.

A smarter/easier algorithm is used when reading single lines.

10 April, 2005 - Version 1.21. If it was not possible to open a file in techlevel 1, IsOpen returned a bad result. Fixed.

15 January, 2005 - Version 1.20

Fix: Fixed some problems when converting multi-byte string to Unicode, and vice versa.

Improved conversion routines. It's now possible to define which code-page to use.

It's now possible to set which character to use when it's not possible to convert a Unicode character to a multi-byte character.

It's now possible to see if data was lost during conversion.

Better support for other platforms, it's no longer necessary to use MFC in Windows.

13 August, 2004 - Version 1.1. I'm sorry about the quick update. I have rewritten some parts of the code, so it's now a lot quicker than the previous version.

12 August, 2004 - Initial version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

PEK is one of the millions of programmers that sometimes program so hard that he forgets how to sleep (this is especially true when he has more important things to do). He thinks that there are not enough donuts in the world. He likes it when his programs work as they should, but dislikes it when his programs are more clever than he is.

Comments and Discussions

Thanks, the updated project seems to work fine with the same sample file. ;)
When I run into other problems, e.g. with complicated files containing trademark or registered-trademark symbols, or other Unicode formats, I will get back to you.

The performance is quite good; it's actually faster than using ifstream when reading ASCII files one line at a time.

I did a minor test (see code below) where I read a 2 MB text file (ASCII). First I ran in non-MFC mode (which means CTextFileRead used ifstream internally to read files). It took 23 seconds to read with ifstream, and 9 seconds with CTextFileRead (in debug mode; I got a similar result in release mode). Whether I used string or wstring to read the file didn't have any major impact on the result.

I did the same test in MFC mode (CTextFileRead using CFile internally) and got a similar result. Ifstream took 9-10 seconds, while CTextFileRead did the same work in 2 seconds if I read into a string, and 4 seconds if wstring was used (the slower performance with wstring is probably because I use MultiByteToWideChar to convert strings in MFC mode, and mbstowcs in non-MFC mode).

Thanks for your detailed explanation and example.
And I have another question:

In my experience, it becomes very slow to write a file when its size is larger than 2 GB under Windows.
Does CTextFileDocument have the same problem?
And are there other limitations on file size in CTextFileDocument?

2 GB? Wow… that is a lot! I hadn't actually tested writing such large files until today, and I didn't get any major difference in speed when 2 GB was passed. I wrote a 2 GB file at a speed of about 20 MB/sec, then I added 128 MB and got a similar speed (I didn't have enough disk space to write more). So writing large files doesn't seem to be a problem, and I think reading isn't a problem either.

Thanks for your experiment and reply ;P
My last task was to deal with the data of a data-warehouse project. The size of a text file was often larger than 2 GB, even 3 GB or so.
I really wish I'd found your code three months ago! God, it would have done me
a big favor!
Anyway, thanks for your work. I'll recommend it to my friends!
Happy New Year! ;)

I would like to use this class but can't, because I don't use MFC (i.e., I use C++ with the Win32 API).

I know it's possible to use the non-MFC version (i.e., ifstream) but, as your tests show, it's a lot slower than when the Win32 API file functions (i.e., the MFC version using CFile) are used.

Is there any chance of adding modifications so this class can be used in non-MFC Win32 apps, but still use the Win32 API file functions (or a non-MFC version of CFile, maybe one on this site)? This could be achieved by #defines, like you have already done to get an MFC or generic ANSI version.

Your code is probably better, but I'm not sure this is an actual problem in the current version. The number of characters in the converted string should always be the same as in the original string. Or am I wrong about this?

If you convert between true multi-byte characters (like Chinese) and Unicode, the lengths of the wide-char string and the multi-byte string are always different.

The reason we use wide chars (Unicode) instead of ASCII/multi-byte is that an 8-bit char cannot describe the characters of all languages on this planet, and too many encoding standards exist for different languages, even for a single language. Unicode means uniting all the different character encodings into one.

As a Chinese in the mainland, we use GB2312, or GBK (a GB2312 extension). Other Chinese, in Hong Kong and Taiwan, use Big5. These encoding standards are all multi-byte: two 8-bit characters describe one Chinese character (there are over 6 thousand Chinese characters in daily use out of about 60 thousand in total).

Bug one:
Found a bug in CTextFileRead::ReadCharLine and CTextFileRead::ReadWcharLine.
Synopsis:
If there are one or more leading CR/LFs at the beginning of the file,
each load-and-save action on the file may drop one CR/LF.