Overview

The tarray extension implements a new Tcl collection data type, the typed array, along with the associated commands column and table. A typed array stores elements of a specific data type in native format. The primary motivation for the extension is efficient memory use and speed of certain operations in applications dealing with very large numbers of elements. This is achieved through native storage formats and parallelization of operations on multi-core CPUs. See the benchmarks below.

The tarray extension was inspired in part by Speed Tables and to a lesser extent by TclRal.

The philosophy behind tarray is to provide efficient facilities on top of which more sophisticated data structures, possibly customized for specific applications, can be easily scripted and experimented with. Therefore, unlike Speed Tables, tarray does not require creation and recompilation of a new extension for each table definition. Moreover, tarray provides value-based semantics so that columns and tables can be used as basic building blocks. Additional facilities that Speed Tables provides, like remote access, are expected to be implemented at the script level.

An accompanying package, tarray_ui, containing related Tk widgets is also under development.

Also see Xtal, a language built on top of Tarray for convenient programming with typed arrays.

Benchmarks

Note: the benchmarks in this section are out of date.

As expected, the benefits of native format and parallelization can be significant (more than two orders of magnitude for searches), as shown in the benchmarks below. The tests were run on a 64-bit Windows 8 notebook with an i5 CPU at 1.7 GHz.

Sorting

The lsort column is the baseline showing the performance of lsort for each data type (using the -real, -integer etc. options). The numbers in parentheses show performance relative to the lsort baseline. The next two columns show tarray performance without parallelization and parallelized across 2 threads.
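As a reminder of why the baseline uses the type-specific options, here is a minimal plain-Tcl sketch. The sample values are made up for illustration and are not the benchmark data. Without -integer or -real, lsort compares elements as strings, so both the resulting order and the cost per comparison differ.

```tcl
# Illustrative values only; not the benchmark data set.
set values {10 9 100 -5}

# Default lsort compares elements as strings (same as -ascii).
set as_strings [lsort $values]

# -integer compares elements numerically.
set as_ints [lsort -integer $values]

puts $as_strings   ;# -5 10 100 9
puts $as_ints      ;# -5 9 10 100

# The time command reports microseconds per iteration; this is one way
# a baseline like the one described above could be measured.
puts [time {lsort -integer $values} 1000]
```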

Real world example

The following example compares storing geographical data for cities from http://geonames.org as a list, as a sqlite in-memory database, and as a tarray table. The database has just over 142,000 records. The corresponding cities table definition is shown below.

The table is queried for all cities with more than a million people. This is a simplistic query, and it shows tarray in the best light speed-wise because the comparisons are all numeric. Better benchmarks are still to be done.
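For orientation, the list-based variant of such a query could look roughly like the following plain-Tcl sketch. The row layout ({name population}), the city names, the figures, and the proc name big_cities are all hypothetical; the actual benchmark script and the geonames fields it used are not reproduced here.

```tcl
# Hypothetical rows of the form {name population}; not real geonames data.
set cities {
    {Alphaville 2500000}
    {Betatown     40000}
    {Gammaburg  1200000}
    {Deltaport   999999}
}

# Linear scan: collect the names of rows whose population exceeds
# the threshold. Every row is visited once.
proc big_cities {rows threshold} {
    set result {}
    foreach row $rows {
        lassign $row name population
        if {$population > $threshold} {
            lappend result $name
        }
    }
    return $result
}

puts [big_cities $cities 1000000]   ;# Alphaville Gammaburg
```

Each comparison here goes through a Tcl value rather than a native integer, which is the overhead a typed column is meant to avoid.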

Memory usage and timings are shown below. They show that, even from the memory usage point of view alone, lists are not suitable for more than a few hundred thousand records. Note that tarray uses twice as much memory as sqlite but is two orders of magnitude faster on the search. This is not to say sqlite and tarray are comparable! sqlite is a database; tarray is not. Nevertheless, for use as an in-memory data structure, tarray can replace some uses of sqlite. APN is somewhat surprised by the difference in speed. Any hints on optimizing the sqlite query would be appreciated. It would have been nice to compare Speed Tables as well, but it does not build on Windows and I do not have a Linux benchmarking host.

Discussion

SEH -- It would be useful to be able to do dumps of raw binary data once a TArray has been constructed. Then one could, for example, use it with a reflected channel to duplicate the function of memchan. Or to do device I/O. (Is this already possible?)

APN Direct I/O between files/databases and typed arrays is on the to-do list but low down, for a couple of reasons. First, there are still a bunch of basic operations and optimizations that have to be implemented to make the package more useful as a building block. Second, it is not clear what the output/input format should be. Even plain binary dumps raise questions of endianness etc.
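The endianness question can be seen with nothing more than core Tcl. The sketch below is not tarray code; dump_ints and load_ints are hypothetical helper names that serialize a list of 32-bit integers with an explicit byte order using binary format and binary scan.

```tcl
# Serialize a list of 32-bit integers with an explicit byte order.
# i* is little-endian 32-bit, I* is big-endian 32-bit.
proc dump_ints {values {order little}} {
    set fmt [expr {$order eq "little" ? "i*" : "I*"}]
    return [binary format $fmt $values]
}

# Read the integers back; the caller must know the byte order used.
proc load_ints {data {order little}} {
    set fmt [expr {$order eq "little" ? "i*" : "I*"}]
    binary scan $data $fmt values
    return $values
}

set le [dump_ints {1 258} little]
set be [dump_ints {1 258} big]

binary scan $le H* lehex
binary scan $be H* behex
puts $lehex   ;# 0100000002010000
puts $behex   ;# 0000000100000102

# Reading with the wrong byte order silently yields different numbers.
puts [load_ints $le little]   ;# 1 258
puts [load_ints $le big]      ;# 16777216 33619968
```

A raw dump therefore needs either a fixed, documented byte order or a header recording the one that was used.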

SEH -- Since that package is pure Tcl, I was hoping a TArray-based solution would be faster.

AK -- What, then, makes this different from the original memchan?

SEH -- Nothing at all, except that the author states above that TArray is intended to be a modular base for a range of specific solutions. If the function of memchan could be duplicated, that would be one less purpose-specific package to maintain, replaced by a flexible multi-purpose tool. I think it would be healthier for the Tcl ecosystem to have fewer of the former and more of the latter, for an equivalent range of applications.

If I have time, I'll try switching to tarray. But to get a real benefit, I'd like to have binary search, so that I can search by time. It's also missing in metakit, so I have done it in Tcl, which is faster than searching with a metakit call.
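A script-level binary search of the kind mentioned above might look like the following sketch. It is not the poster's code: it assumes the timestamps are integers held in a plain Tcl list sorted in increasing order, and the proc name first_at_or_after is hypothetical.

```tcl
# Return the index of the first element >= t in a list of integer
# timestamps sorted in increasing order, or the list length if every
# element is smaller than t.
proc first_at_or_after {times t} {
    set lo 0
    set hi [llength $times]
    while {$lo < $hi} {
        set mid [expr {($lo + $hi) / 2}]
        if {[lindex $times $mid] < $t} {
            set lo [expr {$mid + 1}]
        } else {
            set hi $mid
        }
    }
    return $lo
}

set times {100 105 105 110 200}
puts [first_at_or_after $times 105]   ;# 1
puts [first_at_or_after $times 106]   ;# 3
puts [first_at_or_after $times 300]   ;# 5
```

For plain lists, Tcl 8.6 also offers lsearch -sorted -bisect, which returns the last index whose element is less than or equal to the key in an increasing list.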

APN There is no "byte array" tarray type so if you are storing frames, you have to use type any for the column which corresponds to a Tcl_Obj*. So you will not really see any savings in memory. Regarding binary search, you might want to see if you really need it as integer searches are pretty fast. I have thought of optimizing the column search for the sorted column case but have not gotten to it yet.