Helping Consumers with Data from Twenty Million Credit Cards

Bundle.com is a personal finance website with a mission to “help US consumers make smarter decisions with their money”. What really makes it stand out is the company’s unique access to detailed, anonymized transaction histories from 20 million Citibank credit cards.

This allows them to build consumer tools in the same vein as Mint, but with a deep foundation of information to compare to right from the first user. Last week I sat down with CEO Jaidev Shergill, CTO Phil Kim and data scientist Alex Hasha to learn more about what they’re doing with such a powerful data set.

The first question they wanted to address was the obvious one: how do they ensure privacy and security when dealing with such sensitive information? Everything is held in a secure data center, and no direct personally identifiable information is included in the histories – everything’s anonymized. The team also takes further steps, like identifying and removing healthcare-related payments. I asked them, though, whether it still makes people a bit uncomfortable. Their response was that their whole business is based around helping consumers, and that their investor Citi only shares the data under very strict conditions because it believes Bundle’s work will make customers’ lives better.

CTO Phil Kim laid out their philosophy:

Bundle takes great pains to protect the privacy of users. First, we hold ourselves to strict, bank-level information security standards, which means that sensitive data is held in a secure data center and access is heavily restricted, and that the Bundle application is heavily scrutinized for vulnerabilities on a regular basis, to prevent accidental or malicious leakage of user data. Second, much of the data analysis and synthesis work we do relies on data that has been sampled, modeled, flattened, or otherwise transformed — we rarely work with raw transaction data, and we never work with data that has a direct linkage back to a named customer. Last but not least, Bundle is very much focused on building tools to help consumers — remember that this data is not new… large companies use data just like this to market products and make business decisions — Bundle is simply trying to share this data with consumers.

Alex Hasha described how he’d worked in the finance industry as a quant, working in a team of over two hundred PhDs to analyze financial instruments. The attraction of Bundle.com for him was the chance to work on something that offered direct benefits to ordinary users, a refreshing change from the abstract world of high finance.

Users upload information from their own bank and credit card accounts onto the site, and in return they get back a score card showing how their spending compares to people like themselves. For example, you might discover that you’re spending a lot more on groceries than other people in your neighborhood, and you’d be better off switching to a cheaper supermarket.

The key to all of their work is the value that they’re able to extract from aggregate information, things like how much people in a particular zip code spend on particular categories such as eating out, groceries and transportation. Because this is the result of blending and averaging large numbers of different accounts, it helps reduce the risk that any sensitive information will leak out.
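To make the aggregation idea concrete, here is a minimal Python sketch of averaging spend per zip code and category. The record layout, the sample values and the `min_accounts` threshold are all assumptions made for illustration; Bundle did not describe their actual pipeline at this level of detail.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (account_id, zip_code, category, amount_usd).
transactions = [
    ("a1", "94117", "groceries", 120.0),
    ("a2", "94117", "groceries", 80.0),
    ("a3", "94117", "groceries", 100.0),
    ("a1", "94117", "eating out", 60.0),
    ("a4", "10001", "groceries", 200.0),
]

def average_spend(records, min_accounts=3):
    """Average per-account spend for each (zip, category) group.

    Groups with fewer than min_accounts distinct accounts are dropped,
    so that no single account's spending can be read off an "average".
    The threshold is an illustrative assumption, not Bundle's rule.
    """
    per_account = defaultdict(lambda: defaultdict(float))
    for account, zip_code, category, amount in records:
        per_account[(zip_code, category)][account] += amount
    return {
        key: mean(totals.values())
        for key, totals in per_account.items()
        if len(totals) >= min_accounts
    }

averages = average_spend(transactions)
```

With this sample data only the 94117 groceries group survives, because the other groups contain a single account each. A user's own monthly grocery total could then be compared against that figure to produce the kind of score card described above.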

What’s really impressive about the data they possess is its broad coverage. Almost every merchant in the U.S. will be represented, and it has the potential to offer the kind of deep customer analytics that website publishers are used to. I could imagine it being used by restaurant owners to spot when they’re losing previously loyal customers, for example. Bundle.com won’t speculate on where they will take their product in the future, but did want to emphasize how everything is driven by their mission to help consumers.

I spent a bit of time talking with them about the technical challenges of their work, too. Credit card systems are often 30 or 40 years old, and so the data they get back is often very messy. You know how you look at your statements and try to decipher what “MCDON 94117” could be? That’s one of their biggest obstacles: the names of the merchants are often incomplete and unclear, so they have a whole system devoted to making sense of this unstructured data. “MCDON”, “MCD” and “MCDONALDS” are all likely to refer to the restaurant, which allows them to categorize any such transactions as food purchases.
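As a rough illustration of that kind of cleanup, the Python sketch below maps messy statement descriptors onto a merchant and category using a small alias table. The table, the `categorize` function and the cleaning rule are hypothetical; the real system is surely far more involved than an exact-match lookup.

```python
import re

# Hypothetical alias table: descriptor fragment -> (merchant, category).
MERCHANT_ALIASES = {
    "MCDONALDS": ("McDonald's", "food"),
    "MCDON": ("McDonald's", "food"),
    "MCD": ("McDonald's", "food"),
}

def categorize(descriptor):
    """Return (merchant, category) for a raw descriptor, or None if unknown."""
    # Uppercase, then drop digits and punctuation such as zip codes,
    # store numbers and apostrophes, keeping only letters and spaces.
    cleaned = re.sub(r"[^A-Z ]", "", descriptor.upper())
    for token in cleaned.split():
        if token in MERCHANT_ALIASES:
            return MERCHANT_ALIASES[token]
    return None

known = categorize("MCDON 94117")
unknown = categorize("ACME HARDWARE 10001")
```

Here `known` resolves to the McDonald's entry, while `unknown` comes back as `None` and would need a fallback such as fuzzy matching or manual review.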

A large amount of their code is written in Perl, since they’re big fans of CPAN’s rich repository of libraries. The processing runs in-memory, so it’s not a classic big data problem. They also rely on R for some of their analysis, thanks to its rich toolkit of statistical functions.

The data mining of billions of credit card transactions is bound to raise a lot of questions, but it was clear to me that Bundle.com is serious in its mission to help consumers. Its product certainly seems to offer a lot more value to the wider world than anything that Wall Street’s quants have produced.