Scaling a DataFrame in JavaScript

In a previous article I proposed a thought experiment: one might model SQL execution plans in JavaScript, and use this to design a database engine that executes such plans. Such a system might allow new query languages to be built, or provide drop-in algorithms for a data processing application (for instance, one might want a “create index” operation for a LINQ query that actually runs against in-memory data).

In this article I will explore the problem from a different angle: a data structure that can handle massive amounts of data in memory.

In both R and Python/Pandas there is rich support for an object called a DataFrame, which bears strong resemblance to a table. If you load files into these, eventually you will find one large enough to run out of memory, even if you have a lot of RAM. Once this happens, you have to resort to other techniques – it would be preferable to keep the programming model, and have the tools handle data volume gracefully.

First, it is valuable to have a fairly large database table to test against. The Google Books Ngram data sets provide good examples: these are word and phrase counts from scanned books, grouped by year. There are many thousands of these files – I chose the file of two-word phrases starting with “th”, which is 30 GB of text.

The following Unix command splits the file into small blocks – I determined the block size experimentally, aiming for an average of around 64 KB, as this is a plausible disk block size in a relational database.

split -l 2500 -a 4 googlebooks-eng-all-2gram-20120701-th

This generates loads of files, named sequentially: xaaaa, xaaab, xaaac, xaaad, etc.

These files are small enough to load trivially into a data frame in R, whereas the full file is not.

A typical filesystem stores blocks in a B+ tree for quick insertion and lookup. These trees are usually short and wide, which reduces the number of hops needed to find a record, and their leaf nodes are linked to one another, which supports scanning ranges of keys.

The API for this implementation is pretty easy to follow. It has an interesting property: “seek” leaves the cursor at that location until the next seek, so you can traverse along the bottom of the tree if you wish:

t = new tree(3)
t.insert(1, "data1")
t.insert(2, "data2")
t.seek(2)

To populate this tree, we need to add our block numbers – this tree requires integers for the keys. Since the block files created above have suffixes like “aaaa”, “aaab”, etc., the names are effectively base-26 numerals, so we can convert them into numbers:
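As a sketch of that conversion (suffixToNumber is a name chosen here for illustration, and it assumes the leading “x” prefix that split adds has already been stripped):

```javascript
// Hypothetical helper: treat a split suffix like "aaab" as a base-26 number.
// Assumes the leading "x" prefix from split has already been stripped.
function suffixToNumber(suffix) {
  var n = 0;
  for (var i = 0; i < suffix.length; i++) {
    n = n * 26 + (suffix.charCodeAt(i) - "a".charCodeAt(0));
  }
  return n;
}
```

So “aaaa” maps to 0, “aaab” to 1, and so on, preserving the sort order of the file names.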

At first glance this seems a bit stupid – we’ve invented the world’s most complicated array. But we’ve also stumbled across a neat database feature by accident: index-ordered tables. This table happens to be pre-sorted by word and then by year, which benefits certain queries – for instance, a “count(distinct word)” could be computed without maintaining a large set of words in memory.
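To illustrate why pre-sorted data gives that memory saving, a distinct count over sorted input only needs to remember the previous value, never a full set (countDistinctSorted is an illustrative name, not from the original code):

```javascript
// Count distinct values in a pre-sorted sequence. Because duplicates are
// adjacent, we only need to compare each value to the previous one.
function countDistinctSorted(values) {
  var count = 0;
  var prev;
  for (var i = 0; i < values.length; i++) {
    if (values[i] !== prev) {
      count++;
      prev = values[i];
    }
  }
  return count;
}
```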

In a relational database, storage doesn’t necessarily keep rows in any particular order. Each insert may fill a slot in an existing page, so several sequential inserts may put data in scattered locations. Interestingly, it is possible to influence paging in Oracle during bulk operations: “TRUNCATE TABLE” immediately deallocates all of a table’s pages, skipping rollback, without inspecting their contents, and “INSERT /*+ append */” writes to freshly allocated pages, skipping the step of looking for empty slots in existing ones (good for a single massive insert, and very slow for many small inserts).

Note that I’ve left the above objects at each node empty – when a node is accessed, its block will be loaded into memory. Since I’m doing this in a browser, I load blocks using Ajax. Once a block has been loaded, the getData method replaces itself with a memoized copy of the data, which speeds up subsequent queries (this is why a query runs faster the second time you run it, and how one query can affect the performance of subsequent queries):
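The self-replacing memoization idea can be sketched as follows – makeNode and the loader parameter are illustrative names, with a plain callback standing in for the Ajax request:

```javascript
// Sketch of a lazily-loaded node. `loader` stands in for the Ajax call
// that fetches a block's text by name.
function makeNode(blockName, loader) {
  var node = {
    getData: function () {
      var data = loader(blockName); // expensive load (Ajax in the article)
      // Replace the method with a memoized version for later queries.
      node.getData = function () { return data; };
      return data;
    }
  };
  return node;
}
```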

This solves the problem of loading and caching data blocks, but we will still run out of RAM during a long query – we need a way to evict blocks from the system, reverting them to the ‘unloaded’ state.

This shows the first lines of the 1000th file. If you then load 11 blocks, the first block is evicted and eventually freed by the garbage collector (the limit is set low for testing purposes – obviously you would allow much more RAM).
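One simple eviction scheme can be sketched like this: keep loaded nodes in access order and unload the oldest when a limit is exceeded (BlockCache and the unload method are illustrative names; the original implementation may differ):

```javascript
// Minimal LRU-style eviction sketch: keep at most `limit` blocks loaded,
// reverting the least-recently-used node to its unloaded state.
class BlockCache {
  constructor(limit) {
    this.limit = limit;
    this.loaded = []; // nodes in least- to most-recently-used order
  }
  touch(node) {
    var i = this.loaded.indexOf(node);
    if (i !== -1) this.loaded.splice(i, 1); // move to most-recent position
    this.loaded.push(node);
    while (this.loaded.length > this.limit) {
      var victim = this.loaded.shift();
      victim.unload(); // revert getData to its unloaded state
    }
  }
}
```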

This, I suggest, is a good implementation of a DataFrame for large tables – it supports “show me the first 100 rows” queries common in investigation, and scales well to paging through troublesomely large data sets.

One could loop over this table to generate an index – a mapping of entries like {“the dog”: “xamym”} that tells you where to start looking for a specific record – and once this index was built, lookups would be quite fast. Properly implemented, such indexes could be used behind the R or Pandas DataFrame APIs (or a system like LINQ) to deliver dramatic performance improvements, without re-architecting a system that crumbles under load.
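A sparse index of this kind might be sketched as follows – buildIndex and firstKeyOf are hypothetical names; the idea is to record the first key of each block, then binary-search for the block where a scan should start:

```javascript
// Build a sparse index: one entry per block, recording the block's first key.
// `firstKeyOf` stands in for reading the first record of a block.
function buildIndex(blockNames, firstKeyOf) {
  var index = blockNames.map(function (name) {
    return { key: firstKeyOf(name), block: name };
  });
  // Return a lookup: the last block whose first key is <= the target,
  // i.e. the block where a scan for `target` should begin.
  return function lookup(target) {
    var lo = 0, hi = index.length - 1, best = index[0].block;
    while (lo <= hi) {
      var mid = (lo + hi) >> 1;
      if (index[mid].key <= target) {
        best = index[mid].block;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return best;
  };
}
```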


Comments

One correction that I think you will find useful: the routines do not require the keys to be integers. The demo screen does force it, but only so that the display of the results is more manageable – if character strings were entered as keys, the tree would easily scroll way off the screen. It also saves me fending off questions from people confused about collating sequences!

However, you can do t.insert('aaab', 'xyz') and t.seek('aaab'), so you don’t need the string-to-number routine.