Payment matching done right

I use ledger to keep an accurate record of my finances. All those
financial actions of receiving, spending, saving, investing or moving
around money will be reflected by transactions in my ledger file. If
you have ever tried to do anything similar, you might think that
maintaining the ledger is cumbersome, time consuming and error-prone.
In reality, nothing could be further from the truth. You need proper
tooling, though.

Any half-decent bank will provide access to your account history as
some kind of downloadable CSV (or maybe an Excel file that can be
converted to CSV fairly easily). Each entry in the CSV has fields for
the transaction identifier, a date (or maybe two, if they separate the
value date from the transaction date), the amount and some text which
aims to identify what the transaction was or who the counterparty is.

As it turns out, the only real question when converting all this into
a ledger entry is finding the correct counter- (or payee-) account of
each transaction. I say counter-account, because the account itself is
known: it is the representation (typically an Asset in accounting
lingo) of the very account that the history pertains to.

Since I started keeping all my financial records in double entry
bookkeeping, I tried a few different solutions. First I used GnuCash,
but that proved to be a horrible experience when importing my account
statements. I basically had to choose the payee account from drop-down
menus in the import wizard on a per transaction basis. I also
disapproved of the XML file which looked unnecessarily complicated and
got quite large.

Then I migrated my books to ledger and never looked back. Finally I
felt in control of the whole flow: a simple text file is all I need to
maintain. So I adapted my importer scripts (by this time I had a few,
initially to convert CSV/XML to QIF that I could import into
GnuCash). However, the problem of finding the right account to write
the bookkeeping remained unsolved. I tried a few alternatives (ledger
--xact, reckon, etc.) but none seemed to work reliably enough.

This got me thinking about the underlying problem. What is the best
algorithm for matching a bank account transaction to a bookkeeping
account?

The matching algorithm

Assuming you have any sort of financial history in your ledger, you
have most likely already invented a number of bookkeeping accounts to
label the sources and sinks of your money. For example Income:Salary
is where your salary comes from, and Expenses:Rent is where each
month’s rent payment is booked. When adding new transactions, they
will most likely fall in one of the existing categories, since they
will be, for the most part, new instances of recurring transactions
you already have in your books (like paying the rent, or shopping once
again in your favourite grocery store). When considering a new
transaction to be added to your ledger, the fundamental question is
this: which of these bookkeeping accounts should be chosen as the
counterparty?

Preferably, this should be decided with no up-front configuration such
as regex rules or anything like that. I also don’t want the algorithm
to ask me any questions. I will adjust the generated entries anyway,
fixing anything I might not like, so guessing is fine as long as the
guesses are reasonably good. What’s more, I want the tool to
automatically adapt to any changes I might enact on the ledger. So if
I decide that from now on, the rent is bookkept under
Expenses:Housing (and run an auto-search-replace for that in my
whole ledger file), this change should be picked up by the algorithm.

So how to do this? I use the following naive Bayesian inference algorithm.

Step 1

Parse the ledger file. For each account, maintain a table of
{token→N} entries, increasing the tally for all the tokens
from the descriptor text in the table for that account.

Also, add each token to the set of all tokens T and each account to
the set of accounts A, so we have these sets ready for the subsequent
steps where we will iterate over them.

Let N(t,a) be the number of occurrences of token t in account a.

Step 2

When parsing a new transaction, scan the descriptor into tokens. Based
on the presence of each token, a vector of probabilities of the
transaction belonging to each possible account is calculated. Let
be the probability that a
transaction with the token t belongs to account a.

We define as:

Then, after we have
for all tokens in the descriptor for all accounts, we use the Bayes
theorem to calculate the combined probability of the transaction to
belong to the accounts.

Let P(a) be the probability of the descriptor to correspond to account
a. For each account, P(a) is defined as:

Then, the account with the greatest P(a) should be chosen.

Note that step 1 does not depend on the transaction to be decided on,
just the current ledger file. The results of reading the ledger file
may thus be precomputed and then only the remaining part of the
algorithm needs to be carried out on a per transaction basis.

The implementation

My implementation is done in Clojure, which I consider one of the
most promising new languages. Clojure is my personal favourite among
these. (My other favourite languages, Erlang, Common Lisp and C, are
not exactly new.)

You can find the repository on GitHub, it’s called banks2ledger.
Read the repository README for general instructions and usage.

One interesting point I want to make here is the way tokenizing is
done. It is super simple but improves the quality of the matches
quite a bit. When breaking the transaction’s description string into
tokens, by default every substring separated by spaces
(i.e. conventional words) will be a token. (The tokens are not case
sensitive, so everything is converted to uppercase.) The comma , and
the slash / are also taken as token separators.

What’s more, different forms of dates are replaced with degraded
forms. This is because the same recurring transaction will show up
with different dates in the ledger file; we want all these, and even
future dates, to weigh towards the same payee account (in terms of
Bayesian inference). So a substring like “20160317” will be replaced
by the generic (degraded) token “YYYYMMDD”. The same goes for dates
specified as “YY-MM-DD”. In practice, it is these two formats
pertaining to card transactions that I have come across.

One final quirk: when editing the newly added transactions in my
ledger file, I will sometimes add some notes after the end of the
original descriptor string. I separate these notes with a pipe |
character from the original content that came from the bank account
CSV. The notes contain additional information concerning the
transaction that I want to be able to look up later. But since this
part does not come from the original CSV, it is not useful to parse it
when tokenizing. So when descriptor lines are read from the ledger
file, any pipe character terminates the line.

Driving the flow

Each month I download the account history CSV (or XLS) from all the
banks I have accounts with. I save all these files according to simple
naming conventions (i.e. bankname_YYYY_MM.csv) and archive them
indefinitely for further reference.

After downloading the account files for the recently closed month, I
run make which will drive the conversion of all the account files
into ledger format. These files will be saved with the same names as
the originals, changing the extension to .dat.

Below is the Makefile I currently use. (Note that in two cases I
download the file as XLS or even XLSX and convert that to CSV via
ssconvert, a useful command line tool that comes with Gnumeric). The
JAR file run by make is the result of running lein uberjar in the
banks2ledger repo. It is a standalone executable version of my
Clojure application.

After the conversion is done, I go through the resulting .dat files
and manually edit them as I see fit. When that is done, I append them
to my main ledger file and eventually, if everything looks fine,
commit the changes to that file. The .dat files are also archived
along with the sources.

Executing this flow takes about half an hour each month – including
logging in to all webbanks and downloading the account history files,
running make which in turn runs banks2ledger on all of them, and
finally reviewing and editing the results, committing the updated
master ledger to git. In return, I get detailed records of all my
money movements, which I find to be an indispensable resource.