Parsing Data Files

Welcome to the next installment of the .NET Nuts & Bolts column. In the prior column, "Accessing Files and Directories," we looked at the basics of reading and writing files. I received a number of requests about how to apply this to reading comma-delimited or positional data files. In this column, we'll get more in depth with reading from files and examine the task of parsing data from a file. It will involve using classes in the System.IO namespace among others.

Data Files

For as long as computers have been used to store information, there has been a need to share the data with other systems. Data files have traditionally been the mechanism used to share information between systems. The necessary data is extracted from the originating system and written to a file, and the file is then sent to the other parties to be imported into their system(s). For the import process to work, the receiving parties must know the structure of the information contained within the file so that it can be read into the additional systems.

Data sample code

A Person object will represent the data we read from the files in each example.
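As a minimal sketch of such an object, the class below holds the five fields parsed later in this column. The property names are assumptions chosen for illustration, not names taken from the original sample.

```csharp
using System;

// Hypothetical Person class; the property names are assumptions based on
// the fields parsed in this column (last, first, middle, suffix, gender).
public class Person
{
    public string LastName { get; set; } = "";
    public string FirstName { get; set; } = "";
    public string MiddleName { get; set; } = "";
    public string Suffix { get; set; } = "";
    public char Gender { get; set; }

    // Simple text form, handy when checking what a reader produced.
    public override string ToString()
    {
        return LastName + ", " + FirstName + " " + MiddleName + " " + Suffix + " (" + Gender + ")";
    }
}
```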

Working with Positional Files

Positional files, also known as fixed-length files, contain lines of data where each line has the same format and length. Each line in the file represents a complete record of data. Each line is made up of multiple fields, and each field is located in the same place on every line in the file. Each field has a specific length and always occupies that length, regardless of the data, because shorter values are padded with extra characters. For example, a field containing a last name may have an assigned length of 20 characters. Using my last name, the field would contain Strawmyer (9 characters) padded with 11 spaces either before (prefix) or after (postfix) the actual data to reach a length of 20 characters. If the field were the first on the line, you would always read the first 20 characters of the line to read the last name.
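The padding idea can be sketched in a few lines. The FixedWidthField class and its method names are hypothetical helpers for illustration; they rely only on the standard String methods PadRight, Substring, and Trim.

```csharp
using System;

// Hypothetical helper illustrating how a fixed-length field is written and read.
public static class FixedWidthField
{
    // Pads the value with trailing spaces (or truncates it) so that it
    // occupies exactly `width` characters.
    public static string Write(string value, int width)
    {
        if (value.Length > width)
        {
            return value.Substring(0, width);
        }
        return value.PadRight(width);
    }

    // Reads `width` characters starting at `start` and removes the padding.
    // Trim handles spaces on either side, so prefix padding works as well.
    public static string Read(string line, int start, int width)
    {
        return line.Substring(start, width).Trim();
    }
}
```

Here FixedWidthField.Write("Strawmyer", 20) yields a 20-character field, and FixedWidthField.Read(line, 0, 20) recovers Strawmyer from the start of a line.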

Parsing a positional file sample code

In this example, we'll use the field order last name (20 characters), first name (10 characters), middle name (10 characters), name suffix (3 characters), and gender (1 character), for a total record length of 44 characters.

The first step in the process is to read the entire file and then split it by line. We read files this way in case they originate from other types of systems, such as Linux, UNIX, or a Macintosh OS, that mark the end of a line differently than Windows does (a line feed or a carriage return alone, rather than a carriage return followed by a line feed). Reading the entire file first gives us the opportunity to strip out or replace unwanted characters so that every file is read in a consistent manner. Thus, we'll read the file a block of bytes at a time and assemble it using a StringBuilder. Then we can replace whatever characters we want to and split the text according to the characters that denote the end of a line.
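A minimal sketch of that approach follows. The PositionalReader class and its method names are hypothetical, and the sketch reads blocks of characters through a TextReader rather than raw bytes. It returns the trimmed fields as strings; copying them into a Person object is left out to keep the example short.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical reader for the layout described above.
public static class PositionalReader
{
    // Field widths: last name 20, first name 10, middle name 10, suffix 3, gender 1.
    private static readonly int[] Widths = { 20, 10, 10, 3, 1 };
    private const int RecordLength = 44;

    // Reads the whole input a block at a time and assembles it with a StringBuilder.
    public static string ReadAll(TextReader reader)
    {
        var text = new StringBuilder();
        var buffer = new char[4096];
        int count;
        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            text.Append(buffer, 0, count);
        }
        return text.ToString();
    }

    // Converts "\r\n" and lone "\r" line endings to "\n", then splits on "\n".
    public static string[] SplitLines(string text)
    {
        return text.Replace("\r\n", "\n")
                   .Replace('\r', '\n')
                   .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Cuts one line into trimmed fields by position.
    public static string[] ParseLine(string line)
    {
        // Pad short lines so Substring cannot run past the end of the string.
        line = line.PadRight(RecordLength);
        var fields = new string[Widths.Length];
        int start = 0;
        for (int i = 0; i < Widths.Length; i++)
        {
            fields[i] = line.Substring(start, Widths[i]).Trim();
            start += Widths[i];
        }
        return fields;
    }
}
```

To use it against a file, wrap the file in a StreamReader, pass it to ReadAll, and call ParseLine on each entry returned by SplitLines.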

Working with Delimited Files

Delimited files are similar to positional files in that the fields are located in the same order on every line in the file. Rather than having a specific length for each field, the individual fields contain just the actual data and are separated from other fields by a chosen delimiter. A comma is a common delimiter. Using my last name as an example, the field would contain Strawmyer followed by the delimiter. Character data is sometimes enclosed in double quotes as well.

Reading delimited files is quite simple in comparison to a positional file. It follows the same pattern of reading a line of the file, but the string data type has a Split method that splits a line into an array based on the given delimiter. For simple files, this eliminates the need for specialized parsing code. Fields that contain the delimiter inside double quotes are the exception, because Split treats every occurrence of the delimiter as a field boundary.

Parsing a comma-delimited file sample code

In this example, we'll use a comma as the field delimiter and the field order last name, first name, middle name, name suffix, and gender.
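One way this could look is sketched below. The CommaDelimitedReader class and its method names are hypothetical. The sketch strips surrounding double quotes from each field, but it does not handle a comma embedded inside a quoted field.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical reader for simple comma-delimited lines.
public static class CommaDelimitedReader
{
    // Splits one line on commas, then removes surrounding whitespace and
    // double quotes from each field.
    // Limitation: a comma inside a quoted field is still treated as a delimiter.
    public static string[] ParseLine(string line)
    {
        string[] fields = line.Split(',');
        for (int i = 0; i < fields.Length; i++)
        {
            fields[i] = fields[i].Trim().Trim('"');
        }
        return fields;
    }

    // Reads every non-empty line from the reader and parses it.
    public static List<string[]> ReadRecords(TextReader reader)
    {
        var records = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0)
            {
                records.Add(ParseLine(line));
            }
        }
        return records;
    }
}
```

Each returned array holds the fields in file order, so index 0 is the last name and index 4 is the gender.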

Working with XML Files

Data files in XML have some advantages over positional and delimited files, as well as some disadvantages. XML has a more inherent structure and is less affected by changes to the order of the file. If you add or move a field in a positional or delimited file, your parsing program has to know about it; otherwise, the program is not likely to continue functioning properly. With XML, the physical ordering of the fields or the addition of other fields matters far less because the parser used to read the file abstracts the actual structure for you and lets you look up fields by name. This means your program is likely to still function properly, which is a distinct advantage over the other formats. The main disadvantage is the physical size of the file. Because of the descriptive tags, an XML file is almost always larger than the equivalent positional or delimited file. This can be a clear disadvantage when transmitting the data, and processing it typically requires more system resources.
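To illustrate the point about ordering, here is a small sketch that uses the XDocument class from System.Xml.Linq. The element names (person, lastName, firstName, middleName, suffix, gender) and the XmlPersonReader class are assumptions made for this example; the column does not define an XML layout.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical reader; the element names are assumptions for illustration.
public static class XmlPersonReader
{
    // Returns the five fields for each <person> element. Fields are looked up
    // by name, so their order in the file and any extra elements are ignored.
    public static List<string[]> Parse(string xml)
    {
        var records = new List<string[]>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement person in doc.Descendants("person"))
        {
            records.Add(new[]
            {
                (string)person.Element("lastName") ?? "",
                (string)person.Element("firstName") ?? "",
                (string)person.Element("middleName") ?? "",
                (string)person.Element("suffix") ?? "",
                (string)person.Element("gender") ?? ""
            });
        }
        return records;
    }
}
```

A missing element simply yields an empty string, and an element the reader does not know about is skipped, which is the tolerance to structural change described above.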

Possible Enhancements

You have now seen several ways in which the System.IO namespace can be used to deal with data files in positional and comma-delimited formats. Here are some ideas for how you could enhance this code to make it more meaningful for use within your applications:

Change the positional reader to be driven based on a stored configuration to allow for changing data formats. You could use a database table to store information about the positional fields that need to be read or assign attributes to the properties of the object to be populated that indicate how to read the file. This would allow you to have a generic positional reader that can be reused for other objects.

Change the comma-delimited reader to accept the delimiter as a parameter so that you can use it for multiple types of delimited files.

Change the delimited reader to be driven based on a stored configuration to allow for changing data formats. You could use a database table to store information about the fields that need to be read or assign attributes to the properties of the object to be populated that indicate how to read the file. This would allow you to have a generic reader that can be reused for other objects.

Change the reader to return an array of the desired object type. This will make the reader more usable because you'll have a reference to all of the data instead of having to use it inside the actual reader.
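Two of these ideas, accepting the delimiter as a parameter and returning an array of the desired type, could be combined roughly as follows. The DelimitedReader class, the ReadAll method, and the createRecord callback are hypothetical names. The configuration-driven and attribute-driven variations are not shown.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical generic reader combining two of the enhancements above.
public static class DelimitedReader
{
    // Splits each non-empty line on `delimiter`, hands the fields to
    // `createRecord`, and returns every resulting object as an array.
    public static T[] ReadAll<T>(TextReader reader, char delimiter, Func<string[], T> createRecord)
    {
        var records = new List<T>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
            {
                continue; // skip blank lines
            }
            records.Add(createRecord(line.Split(delimiter)));
        }
        return records.ToArray();
    }
}
```

A caller supplies the mapping from fields to objects, for example a lambda that builds a Person from the array, so the same reader can serve comma, tab, or pipe-delimited files.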

Future Columns

The topic of the next column is yet to be determined. If you have something in particular that you would like to see explained here, you could reach me at mstrawmyer@crowechizek.com.

About the Author

Mark Strawmyer, MCSD, MCSE, MCDBA, is a Senior Architect of .NET applications for large and mid-size organizations. Mark is a technology leader with Crowe Chizek in Indianapolis, Indiana. He specializes in the architecture, design, and development of Microsoft-based solutions. Mark was honored to be named a Microsoft MVP for application development with C#. You can reach Mark at mstrawmyer@crowechizek.com.