Haphazardly Posted Excel Information and Other Stuff

Category: File Operations

A few weeks ago I was struggling with fixed width text files. Remember 1995? Me neither, but I’m living it every day. I happened upon a stackoverflow.com post that dealt with using ADO for this purpose. I don’t remember the post, so no link. It did, however, lead me to this MSDN article, which was very helpful.

If you haven’t been following along, I wrote a post about creating a sample fixed-width file and one about importing said file. This post is about importing that file with ADO. You may remember ADO from such database objects as Connection, Command, and Recordset. But, like me, you may never have considered using it to import text files, or even known that it could.

Before I get into the specifics, there are two things that endeared me to this method. First, it allows me to import only the data I want and easily ignore headers, totals, and other non-record data. Second, it’s super fast. I had a huge text file that took several minutes to read using the Input$ function. I got it down to 90 seconds using ADO. Ninety seconds still stinks, but it beats having to get a coffee every time the code runs. The bonus third reason I love ADO is that I can replicate it for different text files easily. Usually, fixed width text files are not arranged as normalized data, so there are some challenges. But I went from setting this up for one specific report to a dozen reports very quickly. Let’s see how it’s done.

Go read the MSDN article if you want the specifics, but basically you need a file called Schema.ini that tells ADO what’s what in your file. I don’t take a crap without a class module, so we’ll be using class modules here. Didn’t this use to be a family friendly blog?

If you couldn’t tell, the text file in question contains transactions from a general ledger. That means I need a Transaction class to hold each of them. Using the column headers from the file, I create a CTransaction class module.

Visual Basic

Option Explicit

Public TransactionID As Long
Public Entry As String
Public Period As Long
Public PostDate As Date
Public GLAccount As String
Public Description As String
Public Srce As String
Public Cflow As Boolean
Public Ref As String
Public Post As Boolean
Public Debit As Double
Public Credit As Double
Public Alloc As Boolean

My VBHelpers add-in quickly converts those to properties and creates a CTransactions parent class. Next, I create an MEntryPoints standard module and insert the following code.

Visual Basic

Public Sub ImportGLTransactions()

    Dim clsTransactions As CTransactions
    Dim sh As Worksheet
    Dim sFile As String

    sFile = Application.GetOpenFilename("*.txt, *.txt")

    If sFile <> "False" Then
        Set clsTransactions = New CTransactions
        clsTransactions.FillFromFile sFile

        Set sh = Workbooks.Add.Sheets(1)
        clsTransactions.WriteToRange sh.Range("A1")
    End If

End Sub

I don’t have a FillFromFile method or a WriteToRange method, but I like to write my main procedure as if I already had those. You’ll need a reference to Microsoft ActiveX Data Objects 2.8 Library (although any version close to that will do). The FillFromFile method is pretty simple. It creates an ADO Connection and an ADO Recordset, then loops through the recordset, adding CTransaction instances as it reads them in. It’s treating our text file as if it were a database, with field names and everything.
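A sketch of what a FillFromFile method along those lines might look like — the GetSql helper, the MakeSchema call, and the Add method are assumptions here, not necessarily the post’s actual code:

```vb
Public Sub FillFromFile(ByVal sFile As String)

    Dim cn As ADODB.Connection
    Dim rs As ADODB.Recordset
    Dim clsTransaction As CTransaction
    Dim sPath As String

    'Schema.ini has to live in the same folder as the text file
    sPath = Left$(sFile, Len(sFile) - Len(Dir$(sFile)))
    MakeSchema sPath & "Schema.ini", Me.Schema

    Set cn = New ADODB.Connection
    cn.Open Join(GetConnectionString(sPath), ";")

    Set rs = New ADODB.Recordset
    rs.Open GetSql(sFile), cn, adOpenForwardOnly, adLockReadOnly

    'one CTransaction per record, just like reading a database table
    Do While Not rs.EOF
        Set clsTransaction = New CTransaction
        clsTransaction.FillFromRecordset rs
        Me.Add clsTransaction
        rs.MoveNext
    Loop

    rs.Close
    cn.Close

End Sub
```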

We’ll hold off on how to create the Schema file for now. The Connection string is created with this little utility. You pass in the path and it returns an array of strings ready to be joined.

Visual Basic

Function GetConnectionString(ByVal sPATH As String) As Variant

    Dim aConn(1 To 3) As String

    aConn(1) = "Provider=Microsoft.Jet.OLEDB.4.0"
    aConn(2) = "Data Source=" & sPATH
    aConn(3) = "Extended Properties=""text;HDR=No;FMT=FixedLength"""

    GetConnectionString = aConn

End Function

I’ve recently started using arrays and Joins to concatenate strings of any length. I find it makes the code much more readable and manageable once you get used to it. Let’s talk about that SQL statement. In Schema.ini, I’ve defined column names and column widths. We’ll look at it in a moment. The SQL statement selects all the columns from the text file based on some criteria. The first section of the SQL statement is the SELECT keyword. For the second section, I have a property that returns an array of columns. You could just as easily use “SELECT * FROM”, but I was recently shown the benefit of following the never-select-asterisk rule, so I’m trying to be good.

Just an array of column names used in the SQL statement and in Schema.ini. The third section of the SQL statement is the FROM keyword followed by the file name in brackets. The Dir$ function strips the path out of the fullname and returns only the file name. You don’t need the path in the SQL statement because Schema.ini is in the same directory as the text file. It has to be, so it’s not looking anywhere else.
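Put together, the four sections might be built something like this — GetSql and the Columns property are assumed names for the helpers described above:

```vb
Function GetSql(ByVal sFile As String) As String

    Dim aSql(1 To 4) As String

    aSql(1) = "SELECT"
    aSql(2) = Join(Me.Columns, ", ")    'same names as in Schema.ini
    aSql(3) = "FROM [" & Dir$(sFile) & "]"
    aSql(4) = "WHERE PostDate Like ""[0-1][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]"""

    GetSql = Join(aSql, " ")

End Function
```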

The final section of the SQL statement is the WHERE clause. This is where you have to get a little creative. As I scan down my text file, I need to find some characteristic of “good” rows that is not present in “bad” rows. For this example, it was pretty easy. Every row that I want has a real date in PostDate and every row that I don’t want doesn’t. They aren’t all that easy. Would you like to see some examples of WHERE clauses I’ve used? Well, would you?

Visual Basic

aSql(4) = "WHERE (Tax Like ""__/__/____"" And Not (IsNull(Vendor))) Or Account Like ""[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"""
aSql(4) = "WHERE PaymentDate Like ""__/__/____"" And Not DiscountAmount Like ""%[ ][ ].00"""
aSql(4) = "WHERE PostDate Like ""__/__/____"" Or GLAccount Like ""[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"""
aSql(4) = "WHERE TranDate Like ""__/__/____"" Or (LotNo Like ""[A-Z][A-Z]%"" And TranDate Like "" [A-Z]%"" And Not LotNo Like ""Item ID%"")"

Some of those are pretty gnarly. Our example file has lots of options, not just PostDate. We have three Yes/No fields and we could use any of those. It’s pretty unlikely that header information or totals rows are going to have a Yes or No in that same position. The idea is to get only the rows you want. For some of the dates I used "__/__/____" and I think it’s pretty safe. But for this example, I used "[0-1][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]" which is a little more specific.

With a good connection string and SQL statement, I open the connection, open the recordset, and start looping. There’s not much to filling the CTransaction class via the FillFromRecordset method. It uses an Nz function that I wrote to avoid Null problems and return a default. For numbers and dates, I specify that I want to return a zero in place of a Null. For strings, it automatically returns vbNullString. The FillFromRecordset procedure is a method in CTransaction. The Nz function is in a standard module.

Visual Basic

Public Sub FillFromRecordset(ByRef adRs As ADODB.Recordset)

    Me.Entry = Nz(adRs.Fields("Entry"))
    Me.Period = Nz(adRs.Fields("Period"), 0)
    Me.PostDate = Nz(adRs.Fields("PostDate"), 0)
    Me.GLAccount = Nz(adRs.Fields("GLAccount"))
    Me.Description = Nz(adRs.Fields("Description"))
    Me.Srce = Nz(adRs.Fields("Srce"))
    Me.Cflow = Nz(adRs.Fields("Cflow")) = "Yes"
    Me.Ref = Nz(adRs.Fields("Ref"))
    Me.Post = Nz(adRs.Fields("Post")) = "Yes"
    Me.Debit = Nz(adRs.Fields("Debit"), 0)
    Me.Credit = Nz(adRs.Fields("Credit"), 0)
    Me.Alloc = Nz(adRs.Fields("Alloc")) = "Yes"

End Sub

Function Nz(fldTest As ADODB.Field, _
    Optional vDefault As Variant) As Variant

    If IsNull(fldTest.Value) Then
        If IsMissing(vDefault) Then
            Select Case fldTest.Type
                Case adBSTR, adGUID, adChar, adWChar, adVarChar, adVarWChar
                    Nz = vbNullString
                Case Else
                    Nz = 0
            End Select
        Else
            Nz = vDefault
        End If
    Else
        Nz = fldTest.Value
    End If

End Function

Now on to Schema.ini, at long last. I have a MakeSchema procedure in a standard module that simply creates the file where it’s supposed to. One of the arguments to MakeSchema is a string for the contents of the file. That comes from the Schema property of the CTransactions class (shown as Me.Schema in the FillFromFile method above). The Schema property takes the columns from the Columns property and puts them together with column widths to create the string.
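For this file, the finished Schema.ini would look something like this. The format comes from the MSDN article; the widths here are illustrative (yours come from counting characters), and the dates are declared Text so the Like comparisons in the WHERE clause work:

```ini
[FixedWidthExample.txt]
ColNameHeader=False
Format=FixedLength
Col1=Entry Text Width 7
Col2=Period Long Width 5
Col3=PostDate Text Width 11
Col4=GLAccount Text Width 12
Col5=Description Text Width 40
Col6=Srce Text Width 5
Col7=Cflow Text Width 4
Col8=Ref Text Width 10
Col9=Post Text Width 4
Col10=Debit Text Width 15
Col11=Credit Text Width 15
Col12=Alloc Text Width 4
```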

The widths array is simply how many characters wide each column is. The lines in my file are 132 characters long. Counting them is a pain. Usually, I grab a couple of representative lines from the text file and put them in a spreadsheet. Here’s how I came up with the column widths for this file.

Lines 2-5 are Courier New 9pt and the first two lines are typed – no fancy formula to get those numbers. I like to get a couple of representative lines so I don’t miss anything. Then I go put pipes where I want the column breaks to be and put this formula in A7

=FIND("|",$A$4,A6+1)

and fill down until I get an error. Column B is just the difference. Finally, the MakeSchema utility takes that string and puts it into a file.

I’ve been working with fixed width text files quite a bit lately. The built-in method for importing these files is terrible. First, it’s yet another wizard and I think it would be better as a single form (but then I think everything is better as a single form). The worst part, by far, is the tiny window that shows a preview of your data.

After a few days of using this wizard, it was pretty obvious that I was going to write my own, if for no other reason than to make that window bigger. My importer would be significantly different from Excel’s. Microsoft has to make their method flexible and universal, but I don’t. I know certain things about my text files and can build in some assumptions to make things better. For one, my files have repeating page headers. While Excel allows you to start your import at something other than the first row to skip all that crap, it doesn’t do me any good because there’s just more crap to come.

Step 2 of the wizard allows you to add, delete, and move the dividing lines that delineate your columns. I find that Excel does a pretty poor job at placing these lines. But I grant that I haven’t used this on a wide variety of files so it’s entirely possible that their algorithm is the best – it’s just not the best for the types of files I’m using.

Did you think I was going to blow by this step without commenting on the lack of keyboard support? Not a chance. To move the lines, click them and drag them. To add a line, single click. To delete a line, double click. Actually, to delete a line, double click next to the line you want to delete, then double click that new line to delete it, then carefully double click on the original line to delete it. You know what I’m talking about. Click, click, click. Where’s the keyboard love?

The lack of a large enough preview window really hurts on this step. I’ll discuss determining where the column lines should go later in this post. Normally after step 2, I just hit Finish. But let’s take a look at the last step anyway.

I don’t use this step much because the defaults are really good. As much as I don’t like wizards, I have to give MS props for putting a Finish button on step 2 so I can skip step 3. Now that you’ve defined your columns, this step allows you to specify a data type. The General type works well for most situations, but if you have some text that happens to look like a number, it’s best to set the column to the Text format.

Under the Advanced button, you can switch what decimals and commas mean and, most importantly to me, tell it how to handle trailing minus signs.

So back to how nothing’s ever good enough and I can do everything better. I’m only dealing with fixed width files and I’m always starting on the first row, so step 1 of the wizard is gone. Steps 2 and 3 could be combined, I think. It should draw a combobox over each column that lets you choose the format. When you add or delete columns, it redraws the boxes. The best use of resources is getting the columns right in the first place. If you don’t have to move columns because it guesses so well, then the whole thing becomes a breeze.

Let’s look at this sample file in terms of columns. In order to get all the numbers to show, this chart is kind of big.

I wrote a little macro to analyze the file and report how many characters are in each column.

Visual Basic

Sub ReadTextFile()

    Dim sFile As String
    Dim lFile As Long
    Dim sInput As String
    Dim vaLines As Variant
    Dim i As Long, j As Long
    Dim aChars() As Long
    Dim bLow As Boolean

    sFile = "C:\Users\dkusleika\Dropbox\Excel\FixedWidthExample.txt"

    lFile = FreeFile
    Open sFile For Input As lFile
    sInput = Input$(LOF(lFile), lFile)
    Close lFile

    vaLines = Split(sInput, vbNewLine)
    ReDim aChars(1 To Len(vaLines(0)) + 1, 1 To 1)

    For i = LBound(vaLines) To UBound(vaLines)
        For j = 1 To Len(vaLines(i))
            If Mid$(vaLines(i), j, 1) <> Space$(1) Then
                aChars(j, 1) = aChars(j, 1) + 1
            End If
        Next j
    Next i

    Sheet1.Range("B2").Resize(UBound(aChars, 1), 1).Value = aChars

End Sub

Then I charted them. It seems pretty clear that there’s a break around 7, 12, 23, and 35. The points at 41 and 45 are a little less clear, but starting at 47, you see a clear downward trend. This is the tell-tale sign of left-justified text. Similarly, 87-104 is a right-justified number. The headers muddy up the waters a bit because they contain data that’s no good to me, but still adds to the character count. As I mentioned before, I’m not building a general-purpose solution and it just so happens I can remove the headers. So I did.

The ASCII code for the page break character is 12. All of my headers end in a string of 132 dashes. That’s damn convenient. Look what happens when I remove the headers.

Based on that, I think I could predict the column breaks perfectly. But there’s one aspect of this file that allows me to say that. I don’t have any truly variable length, left-justified text followed by any truly variable length, right-justified number. For instance, if the description column was immediately followed by the debit column, it might be difficult to determine exactly where to break it.

How did I solve that problem? I didn’t. This is as far as I got writing my own text import wizard. Out of nowhere, I read something about reading text files with ADO and all my problems were solved. I had used external data tables to read text files, but never ADO. I’ll post about how I’m importing text files with ADO in my next post.

I have a post or two in my head that deals with fixed width text files, something I’ve been dealing with quite a bit. Like me, you probably don’t run into fixed width text files in your personal life. It’s usually some crappy computer program at your job that forces you to deal with them. The problem is that I can’t simply throw around private company information, so I had to replicate a real fixed width file with obfuscated information. Before I post about all the lovely things I’ve been doing with these files, I thought I’d post about how I made the sample.

That procedure is a bit longer than I like, but it’s not something I wanted to spend a lot of time on. It reads in the legitimate file, then goes line-by-line through it. When it encounters a line that needs changing, it creates fake data to put in place of the real data. I want the totals to match so I have something to compare it to after I parse the text file (in a later post), so I have to keep track of the fake numbers I make up and total them appropriately.

The GL Account is a 10-digit number that starts with 1-5 and can have any other digits after that.
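The post’s actual generator isn’t shown, but a function meeting that description could look like this (the name GetGLAccount is my own):

```vb
Function GetGLAccount() As String

    Dim i As Long

    'first digit is 1 through 5
    GetGLAccount = CStr(Int(Rnd * 5) + 1)

    'nine more digits, each 0 through 9
    For i = 1 To 9
        GetGLAccount = GetGLAccount & CStr(Int(Rnd * 10))
    Next i

End Function
```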

The Yes/No generator gives me a Yes about 80% of the time. You have to stay positive, you know.

Visual Basic

Function GetYesNo() As String

    If Rnd < 0.8 Then
        GetYesNo = "Yes "
    Else
        GetYesNo = "No "
    End If

End Function

For the debits and credits, I wanted to stay somewhat realistic, so I kept the same scale as whatever number is there. To determine the scale, I remove all spaces, commas, and periods from the string, then take the length minus 2. A number like $9,453.65 will have a scale of 4 and will produce a number with up to four digits before the decimal. Rnd, as you know, generates a number between 0 and 1. I multiply Rnd by 10^Scale to get the right number of digits, then round it to two decimal places.
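In code, that scale logic might look like this (GetFakeAmount is an assumed name for the helper just described):

```vb
Function GetFakeAmount(ByVal sReal As String) As Double

    Dim lScale As Long

    'strip spaces, commas, and periods: "9,453.65" becomes "945365"
    sReal = Replace(Replace(Replace(sReal, " ", vbNullString), _
        ",", vbNullString), ".", vbNullString)

    'two of the remaining characters were decimal places,
    'so the scale is the length minus 2
    lScale = Len(sReal) - 2

    'same number of digits as the original, rounded to two decimals
    GetFakeAmount = Round(Rnd * (10 ^ lScale), 2)

End Function
```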

Finally, I have a Pad function to fill out spaces around numbers. First, Pad truncates the string to the right width in case it’s already too long. If it’s not too long, spaces are inserted in front of it to fill out the right width. Then there’s the sAfter argument that you’re probably wondering about. I don’t have trailing negatives in this report, but I have had them in some others. I need to be able to stick a negative or a space after the string and used the sAfter argument to do it. I believe they’re all null strings in this example.
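A Pad function along those lines might look like this — the exact signature is my guess from the description:

```vb
Function Pad(ByVal sText As String, ByVal lWidth As Long, _
    Optional ByVal sAfter As String = vbNullString) As String

    'truncate if it's already too long
    If Len(sText) > lWidth Then
        sText = Left$(sText, lWidth)
    End If

    'right-align with leading spaces, then tack on the trailing
    'character - a minus sign, a space, or nothing at all
    Pad = Space$(lWidth - Len(sText)) & sText & sAfter

End Function
```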

Several years ago I needed to open the newest CSV file from a particular directory. Now I’m faced with a similar problem. My accounting system produces PDFs in some kind of proprietary PDF reader, which doesn’t have anywhere near the feature set of Foxit, my preferred reader. Instead of fighting it, I now immediately save the PDF and open it in Foxit.

Drawing from that previous post, I made a function to find the name of the most recently created PDF.

Visual Basic

Function GetNewestPDFFileName() As String

    Dim fso As Scripting.FileSystemObject
    Dim fsoFile As Scripting.File
    Dim fsoFldr As Scripting.Folder
    Dim dtNew As Date, sNew As String

    Const sTYPE As String = " PDF "
    Const sFLDR As String = "\\99991-dc01\99991\dkusleika\My Documents\"

    Set fso = New Scripting.FileSystemObject
    Set fsoFldr = fso.GetFolder(sFLDR)

    For Each fsoFile In fsoFldr.Files
        If fsoFile.DateCreated > dtNew And InStr(1, fsoFile.Type, sTYPE) > 0 Then
            sNew = fsoFile.Path
            dtNew = fsoFile.DateCreated
        End If
    Next fsoFile

    GetNewestPDFFileName = sNew

End Function

Once I have the name, a simple FollowHyperlink method will get me where I want to go. Oh, except that hyperlinks are bad and Excel needs to show me a warning. That’s not going to work. Instead, I take the long way around. I create a batch file to open the PDF and run that.
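The post doesn’t show the batch-file step, but a sketch of it might look like this — the temp-folder location and file name are my assumptions:

```vb
Sub OpenNewestPDF2()

    Dim lFile As Long
    Dim sBatch As String

    'write a one-line batch file that starts the newest PDF
    sBatch = Environ$("TEMP") & "\OpenPDF.bat"
    lFile = FreeFile
    Open sBatch For Output As lFile
    Print #lFile, "start """" """ & GetNewestPDFFileName & """"
    Close lFile

    'running the batch file sidesteps the hyperlink warning
    Shell sBatch

End Sub
```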

Finally, I wanted a third method. fzz commented that I should use a console command because that’s what consoles are good at. I made a batch file following his example:

for /F %%a in ('dir /b/o-d "K:\*.pdf"') do (start %%a & exit)

No warnings, obviously, and even though the VBA above is lightning quick, I think we can all appreciate that this is the quickest and most direct way. I’m having a problem running from VBA though.

Visual Basic

Sub OpenNewestPDF3()

    Shell "K:\OpenPDF2.bat"
    'ShellExecute 0, "OPEN", "K:\OpenPDF2.bat", "", "", 0

End Sub

It got the file name right, but says it can’t find it. As you can see, I tried ShellExecute too. Same result. It doesn’t matter. I have a batch file, so I don’t need Excel. I put a shortcut to the batch file on my desktop and set the shortcut key. Now I can open it regardless of which applications are open or have the focus.

I love RSS. If you have a website and don’t have a feed, I don’t follow it. I monitor stuff via Google Reader and it’s a great way to keep up on a lot of sites. It should follow, then, that monitoring spreadsheet changes via RSS would be great too. Right? Well, not really, but that never stopped me before.

First some caveats. This is just an experiment and not meant for general use. There may actually be some bugs in it, if you can believe that. If you want to modify the code for your own use, be warned that if you monitor too many cells it might be slow. Or you might get so much information that it’s worthless. OK, now that that’s over.

An RSS feed is an XML file that sits on a web server. This blog has such an XML file. When I post this blog entry, WordPress will update the XML file with an entry for this post. Occasionally Google Reader will check the XML file and see if there’s anything new. If there is, it will display the new stuff for anyone who has subscribed to the feed.
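Stripped to its essentials, the RSS flavor of that XML file looks something like this — the title, link, and values here are made up for illustration:

```xml
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Spreadsheet Changes</title>
    <link>http://example.com/changes.xml</link>
    <description>Changes to watched cells</description>
    <lastBuildDate>Tue, 06 Mar 2012 21:28:01 CST</lastBuildDate>
    <item>
      <title>Sheet1!$B$2</title>
      <description>Sheet1!$B$2 was changed from 10 to 20</description>
    </item>
  </channel>
</rss>
```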

It’s fairly trivial to create an XML file, even from Excel. The potentially difficult part is putting that file on a web server. Oh, except for one little thing. I have a webserver right on my computer and you might too. Mine is called “C:\Users\dick\Dropbox\Public\”. Did I trick you? Dropbox has a Public folder and you can get a “public link” from any file in that folder, including the XML file we’re about to create. What a simple way to publish something to the intertubes. Here’s how you get that public link.

Alright, enough screwing around. Let’s get to the code.

I start with a class module called CChange (and its parent CChanges). CChange has the following read/write properties:

Address - the cell address we're watching for changes and a way to uniquely identify the instance.
OldValue - the value in the cell before it changed.
NewValue - the value in the cell after it changed.
Modified - a time stamp for when the change occurred.

I’ve named a range on the sheet called RSSWatch. When the workbook opens, a CChange object is created for every cell in that range.
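An Initialize method that does this might look like the following sketch — the Add method on CChanges is an assumption:

```vb
Public Sub Initialize()

    Dim rCell As Range
    Dim clsChange As CChange

    'one CChange instance per watched cell, keyed by
    'its sheet-qualified address
    For Each rCell In Sheet1.Range("RSSWatch").Cells
        Set clsChange = New CChange
        clsChange.Address = Sheet1.Name & "!" & rCell.Address
        clsChange.OldValue = rCell.Value
        Me.Add clsChange
    Next rCell

End Sub
```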

These CChange objects are just sitting out there waiting to record any changes. They all have a Modified date of 12:00:00 AM (because I didn’t set anything) and in this state they won’t be written to the XML file. In the ThisWorkbook module, I use the Workbook_SheetChange event to monitor my range for changes. You might notice that I switch pretty liberally between sheet-specific references and general references. For instance, in the above Initialize method, I limit my range to Sheet1. When I went to code the event, I thought that someday I would want this to monitor different ranges on different sheets, so I used a Workbook level event. It’s totally inconsistent, but it will be helpful if you just ignore it.

Private Sub Workbook_SheetChange(ByVal Sh As Object, ByVal Target As Range)

    Dim rRng As Range
    Dim rCell As Range
    Dim clsChange As CChange

    'the opening lines of this procedure are missing from the
    'original; setting rRng to the watched range is an assumption
    On Error Resume Next
    Set rRng = Sh.Range("RSSWatch")
    On Error GoTo 0

    If Not rRng Is Nothing Then
        If Not Intersect(Target, rRng) Is Nothing Then
            For Each rCell In Target.Cells
                Set clsChange = gclsChanges.Change(Sh.Name & "!" & rCell.Address)
                If Not clsChange Is Nothing Then
                    clsChange.NewValue = rCell.Value
                    clsChange.Modified = Now
                End If
            Next rCell
        End If
    End If

End Sub

For every cell that has changed, I find its CChange brother and change the NewValue and Modified properties. I haven’t written the XML file yet. I still just have a bunch of CChange objects, except that at least one of them has a Modified property that will make it eligible to be included in the file. Changes get made and recorded and only the last value and time are saved. Then, when the workbook is saved, any eligible CChange objects are written to the file.

Generally, I’m creating an XML file if it doesn’t exist, reading that file in, appending Items to it for any changes, and writing that file back out. The first thing I do is make sure there’s something to write. In line 10, I call the HasChanges property, which loops through all the CChange instances to see what’s eligible. If nothing has changed, there’s no need to create the XML file.

Line 20: If there isn’t already a file, I need to make one. I check to see if it’s out there.

You’ll need to set a reference to Microsoft XML, v6.0 or similar. CreateFile sets up everything in the file that’s not an Item, like the title, link, description, language, etc. The basics of XML generation are 1) create a new node and 2) append it to its parent node.
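That two-step pattern looks like this in MSXML — the element names and text here are placeholders for illustration:

```vb
Sub DemoCreateNode()

    Dim xmlDoc As MSXML2.DOMDocument
    Dim xmlRss As MSXML2.IXMLDOMElement
    Dim xmlChannel As MSXML2.IXMLDOMElement
    Dim xmlTitle As MSXML2.IXMLDOMElement

    Set xmlDoc = New MSXML2.DOMDocument

    '1) create a node, 2) append it to its parent
    Set xmlRss = xmlDoc.createElement("rss")
    xmlDoc.appendChild xmlRss

    Set xmlChannel = xmlDoc.createElement("channel")
    xmlRss.appendChild xmlChannel

    Set xmlTitle = xmlDoc.createElement("title")
    xmlTitle.Text = "Spreadsheet Changes"
    xmlChannel.appendChild xmlTitle

End Sub
```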

Line 40: I read in the existing file or the one I just created. The Load method populates the XMLDOMDocument with all the hierarchies and data. Lines 50 and 60 find specific nodes in the file that I’m interested in. I want the Channel node because I’ll be appending Items to it. I want the lastBuildDate node so I can compare that to CChange Modified properties and only write new changes.

Line 70: The RSS example I got off the web showed dates formatted like Tue, 06 Mar 2012 21:28:01 CST. Neither Excel’s CDate nor DateValue function can convert that to a date, so I have to strip off the day and the timezone, which I do in a separate function.
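That stripping function isn’t shown, but the idea can be sketched like this (the name RssDateToDate is my own):

```vb
Function RssDateToDate(ByVal sRssDate As String) As Date

    'e.g. "Tue, 06 Mar 2012 21:28:01 CST"
    'drop the day name before the comma...
    sRssDate = Trim$(Mid$(sRssDate, InStr(sRssDate, ",") + 1))

    '...and the time zone after the last space
    sRssDate = Left$(sRssDate, InStrRev(sRssDate, " ") - 1)

    RssDateToDate = CDate(sRssDate)

End Function
```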

Line 80: Ultimately I want to make my lastBuildDate in the XML file equal to the latest date of all the CChange objects. I’m initializing the maximum variable here. If I don’t, every cell’s Modified would be greater than this date.

Line 100: For every CChange, I check to see if it needs to be written to the file.

Public Property Get ShouldWrite() As Boolean

ShouldWrite = Me.Modified >= gclsChanges.LastBuildDate

End Property

If it’s been modified since the last time I wrote the file, it gets written this time. I really should also check to see if OldValue is different than NewValue. Right now if you change a cell and then change it back, it will still show up as a change. Sounds like a V2 enhancement.

The rest of the loop creates an Item, appends it to the Channel, and updates the maximum Modified date. The xmlItem property returns an object that can be appended.

Public Property Get xmlItem(xmlDoc As MSXML2.DOMDocument) As MSXML2.IXMLDOMElement

Not much to this – make a node and append it. The Description property is a read-only property that makes a nice English sentence describing what happened.

Line 160: I change the LastBuildDate to equal the max, then change that node in the XML file.

Line 180: This code I stole from VB Helper. It adds the line breaks and indentation that, while not necessary, is really helpful when debugging. No, I didn’t write this code perfectly the first time.

Finally I save the modified XML document. Dropbox publishes to the web and Google Reader reads it. Here’s what the last change looks like in the reader.

If I ever look at this again, here’s what I’d do in version 2:

Put all the changes in one Item rather than one per cell

Check to see if a cell was changed and changed back and exclude it

Get the timezone from the Windows API rather than hardcoding it in a constant

Modify to use multiple ranges on multiple sheets

Put the code in an add-in and look for workbooks to monitor

Change the Description to be easier to read

I like this Public Dropbox folder. I’m thinking of using it as a version control system to keep my add-ins up-to-date from multiple computers. It’s a shame that some companies block it. Oh well, your thoughts on the RSS code are welcome.