Searching ChEMBL in the browser

A previous post (see the slide deck from slide 40) described some of the work we have done on the development of fast substructure search, a project code-named Arthor. At the time, it ran about two orders of magnitude faster than any of the other programs benchmarked. Such speed makes interactive searches of large databases possible. That much is fairly obvious, so rather than discuss it here, here’s something a bit more novel: interactive substructure search of moderately sized datasets, entirely client-side in the browser.

It is important to note that this is not the first time substructure search has been implemented entirely in the browser: Peter Ertl and co. developed the Wikipedia Structure Explorer, which searches almost 15K structures from Wikipedia using the Actelion Java library compiled to JavaScript. However, with Arthor (also compiled to JavaScript), it is possible to search the whole of ChEMBL22_1 (1.68 million molecules) in the browser. It even works on my mid-range phone (Moto G 3rd gen, 2GB RAM), although there it is limited by memory constraints to 1.0 million molecules.

Time for the timings. Note that the times quoted for the native code do not include the use of a fingerprint screen. This keeps them like-for-like with the JavaScript, where it is not possible to use fingerprints for the whole of ChEMBL due to RAM constraints. The native and JavaScript times were measured on the same machine (Core i7 6900K CPU, 3.20GHz), and all are times to find the total number of hits (rather than the first 10 or 100 or whatever) using a single thread. Phone times are for 1.0 million molecules. All times are in ms unless otherwise stated.

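| Query | Hits | Native (1.68M mols) | JavaScript (1.68M mols) | Phone (1.00M mols) |
|---|---|---|---|---|
| c1ccccc1 | 1420663 | 419 | 663 | 3.24s |
| Br | 75132 | 113 | 197 | 819 |
| CCO | 754842 | 230 | 368 | 1.32s |
| OOO | 1 | 99 | 300 | 1.12s |
| [X5] | 160 | 102 | 186 | 817 |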
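For anyone wanting to repeat this kind of measurement, here is a minimal sketch of the timing approach described above: count every hit (no early exit) on a single thread and report the elapsed time in ms. The names `naiveMatch` and `timeCountAllHits` are made up for illustration and are not Arthor's API. The matcher is a plain substring test on SMILES text, which is not real substructure matching; it is only there so the harness has something to call.

```javascript
// Hypothetical stand-in for a substructure matcher: NOT Arthor's API.
// It treats the query as a plain substring of the SMILES text, which is
// not real substructure matching.
function naiveMatch(query, smiles) {
  return smiles.includes(query);
}

// Count every hit in the dataset (no early exit after the first N hits)
// on the calling thread, and report the elapsed wall-clock time in ms.
function timeCountAllHits(matchFn, query, molecules) {
  const start = Date.now();
  let hits = 0;
  for (const smiles of molecules) {
    if (matchFn(query, smiles)) hits += 1;
  }
  return { hits, elapsedMs: Date.now() - start };
}

// Tiny illustrative dataset; a real run would use the full database.
const molecules = ["c1ccccc1", "CCO", "BrCCO", "c1ccccc1Br"];
const result = timeCountAllHits(naiveMatch, "Br", molecules);
console.log(`${result.hits} hits in ${result.elapsedMs} ms`);
```

On a dataset this small the elapsed time will round to zero; the point is only the shape of the measurement, not the numbers.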

Imagine a future where the computationally expensive step of substructure searching no longer requires a server, but is done client-side. Impossible, or only a matter of time?

Perhaps I neglected to mention the 90MB JavaScript download. 🙂 But future browsers will hopefully cache even such large files, so this would be a once-off cost. The file is served gzipped and is 325MB unzipped. Once processed by the browser, the data structures take up a similar amount of memory again. It might be possible to avoid one of these memory duplications by pulling the Arthor database directly across the web rather than embedding it in the initial JavaScript file.
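To put rough numbers on that, here is a back-of-the-envelope calculation using the figures quoted above. It assumes that "a similar amount" means exactly the unzipped size, and that fetching the database separately would leave only one copy resident; neither has been measured, so treat the results as ballpark estimates.

```javascript
// Figures quoted in the post, in MB.
const gzippedMB = 90;
const unzippedMB = 325;

// Assumption: "a similar amount" is taken to be exactly the unzipped size.
const dataStructuresMB = unzippedMB;

// Database embedded in the JavaScript file: the unzipped source and the
// processed data structures both occupy memory.
const embeddedPeakMB = unzippedMB + dataStructuresMB;

// Assumption: with the database fetched separately, only one copy remains.
const separatePeakMB = dataStructuresMB;

// How much the gzip transfer encoding shrinks the download.
const compressionRatio = unzippedMB / gzippedMB;

console.log(`embedded: ~${embeddedPeakMB}MB, separate: ~${separatePeakMB}MB`);
console.log(`gzip ratio: ~${compressionRatio.toFixed(1)}x`);
```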

Another idea I’ve had is to provide successive enhancements by downloading bigger and bigger portions of the database over time, so that you could do the initial search on a subset of the database immediately. That way, you could be refining your query while the complete download is still proceeding.
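As a rough illustration of how that might look, here is a minimal sketch in which chunks of the database are added as they arrive and any search runs over whatever has been loaded so far. Everything here is hypothetical: `ProgressiveSearcher`, `addChunk` and `naiveMatch` are illustrative names, the chunks are handed over directly rather than downloaded, and the matcher is a plain substring test on SMILES text standing in for a real substructure search.

```javascript
// Hypothetical stand-in matcher (plain substring test on SMILES text);
// a real implementation would call a proper substructure matcher here.
function naiveMatch(query, smiles) {
  return smiles.includes(query);
}

class ProgressiveSearcher {
  constructor(totalMolecules, matchFn) {
    this.totalMolecules = totalMolecules; // size of the complete database
    this.matchFn = matchFn;
    this.loaded = []; // molecules received so far
  }

  // Called each time another portion of the database has arrived.
  addChunk(chunk) {
    for (const smiles of chunk) this.loaded.push(smiles);
  }

  // Search only what has arrived so far and report how much that covers.
  search(query) {
    const hits = this.loaded.filter((smiles) => this.matchFn(query, smiles));
    return {
      hits,
      searched: this.loaded.length,
      complete: this.loaded.length >= this.totalMolecules,
    };
  }
}

const searcher = new ProgressiveSearcher(4, naiveMatch);
searcher.addChunk(["c1ccccc1", "CCO"]);
const partial = searcher.search("CCO"); // runs before everything has arrived
searcher.addChunk(["BrCCO", "c1ccccc1Br"]);
const full = searcher.search("CCO"); // same query over the complete set
console.log(partial.hits.length, partial.complete, full.hits.length, full.complete);
```

This version re-searches everything on each call. A variation would be to remember the hits per query and only search newly arrived chunks, which matters more as the loaded portion grows.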