Download and Follow

We have two files with exactly the SAME data. One is CSV another is TXT – just the extension is different.

Notice that CSV shows Excel icon and TXT shows Notepad icon. This is how the extensions are usually mapped. This mapping happens at the time of installing Office.

The data

Open the TXT file in Notepad. There are two columns. First column is a date and second one is some text.

The first column is in MM DD YYY format. In order to determine the exact format, we had to look at any value where the first or second portion is above 12. That will obviously represent the DAY portion.

Look at the first few rows and understand which are the dates. These dates are 1st Jan, 1st Mar, 1 April 2011 and so on. Just remember this.

Now close the TXT file.

Import CSV and understand the problem

Double click on the CSV file. It opens immediately. No import dialog is shown. Look at the first column. The width is inadequate.

What will you do? Double click in between column one and two headers to make the Date column width fit all the values.

Right? WRONG!

NEVER auto-fit a Date column.
That way you lose a very important visual indication.

To see the problem, I am going to increase the column width and then – suddenly – you see lot of problems.

What is wrong?

Dates are internally stored as numbers. They look like dates because of formatting. But even then, they are numbers. Numbers are always right aligned in Excel. Text is left aligned. So what has happened here?

Dates up to row 9 are numeric and right aligned – which means Excel has apparently understood them as dates. Those from row 10 onwards are being considered as TEXT by Excel. That is why those are left aligned.

Question: You should be more worried about the text ones or the date dates?

Almost everyone will say that the text dates are the problem and the dates which were imported as proper dates is NOT the problem.

WRONG AGAIN!

If you are a regular reader of this blog, you will know that we have a simple rule when trying to improve efficiency. The rule is “The first thought which comes to mind while working on something, is almost always inefficient!”

As a human being, you can clearly understand the dates which EXCEL has not recognized. The A10 cell contains 13 Jan 2011 and A12 contains 18 Jan 2011. So even if Excel does not recognize those dates as proper dates, at least as far as interpretation is concerned, humans will not have any problems.

Now let us look at the first 9 rows which Excel seems to have interpreted as dates. We would never suspect that something is wrong here. But there is something SERIOUSLY WRONG.

Always show month in text

I applied the dd-mmm-yyy formatting for the entire column and see what happened.Now at least do you recognize the danger?

The text dates did not change at all (from row 10). But look at the first 9 rows. All the dates are in JAN. But we just saw in the text file that all these dates were from different months. These should have been 1st Jan, 1st March, 1st April and so on.

THIS IS THE PROBLEM

The problem is that nobody notices this problem – because these LOOK like valid dates! I have not done a formal survey, but it is quite obvious that this type of mistake must be very common worldwide. It is a mistake which happens automatically and is very easy to miss – even while auditing or cross-checking. This may be leading to lots of wrong decisions.

It is a miracle that the world is still running!!

The problem

What went wrong? The PC was configured to look at dates in DD MM YYY format and the incoming data in the CSV file had dates in MM DD YYY format.

Therefore, while importing the dates, Excel could import all dates which were up to 12th of the month. But while it could import the date, it reversed the month and day portion in the process – leading to misinterpretation of the actual date.

For those dates which were above 12th, Excel could not convert the days to month. Therefore, it kept the whole value as text.

The ROOT cause

All this happened because when you double click CSV file – Excel imports it directly. It provides NO CONTROL over the import process. That is why all the columns of data were imported using the default settings – which led to this serious problem.

Solution One: Works in all versions of Excel

It is simpler than you think. Just rename the CSV as TXT and import it into Excel. Of course, while importing it, we have to use the import dialog correctly.

As a text file – the import Wizard cannot be avoided – which is a good thing.

Also remember that if you are going to import a similar file repeatedly, it is possible to record a macro of just the import process. All the customization and configuration performed during the import process will be recorded in the macro. Next month you can change the file name in the macro or make sure that you rename the file itself. In either case, it becomes a reusable macro.

While opening as text, the first step is to choose Delimited or Fixed Width. This is obviously delimited (each column is separated by some symbol). Click Next.

In the next step choose Comma as the delimiter. Unselect Tab. Click Next

Now is the crucial step. Notice that the first column is highlighted and the import option shows General. This is where the problem is going to happen. General in this context means that Excel will decide the data type on a row by row basis depending upon its contents. This is where it is mixing the days and months.

We want to tell Excel to import it in the format WE WANT rather than leaving the decision to Excel.

So click in the first column and choose DATE. Right now the date format is shown as DMY.

Why? Because that Regional settings in the control panel is set to DD MM YYY format.

This is where we take the crucial step of changing the format to MM DD YYY.

Repair the second column also

Click in the second column and choose Text. This will prevent it from removing leading zeros (if any) and also retain the value 155e3 (in row 7, column 2) as a text.

Import now

Click Finish to import the data. After importing – make the date column wider than the dates. Notice that all dates are now right aligned – which indicates that all are understood as dates by Excel.

Now select the whole column and go to Format Cells – Number Formats – Custom and type DD-MMM-YYY and click Ok. (SHIFT CTRL 3 is a shortcut to apply this formatting).

Now you will notice that all the dates are interpreted accurately.

Job done. Finally !!

What did we learn

Importing of data should always be done using the IMPORT dialog.

NEVER import CSV. Rename as TXT and then import.

Each date column needs to be configured like this. Multiple columns can be selected only if they are next to each other.

Text columns also require to be identified as text.

Share this with your team

If you found this article useful and important from an accuracy and data quality point of view, please share it with your colleagues, bosses and friends.

If your IT team gives you CSV files, show them this article and ask them to deliver TXT files.

Put up this article on your intranet / Yammer group so that maximum people understand the problem and the solution.

What Next

Remember that Power Query can also be used in this situation. If you have a choice, import using Power Query. We will see this in the next article.