Category: data mining

For some amount of time (i.e., until the bandwidth costs add up 🙂) you can download the figures extracted from arXiv documents during the development and testing of the search API described in prior posts. If you have the AWS CLI installed, getting the figure metadata (from which you can create download URLs)… Continue Reading: got data?

I’m entering the final stages of a figure search engine, a nice wrapper for the new API method discussed below. It’s also a chance to properly release data mined directly from arXiv figures, and to take advantage of the Lambda + S3 processing pipeline I developed when first pushing the p2t algorithms to the cloud. Attached is… Continue Reading: figure meta data

We’re very grateful to Dr Piatetsky-Shapiro for the chance to publish an item in KDnuggets; check it out here. In the process of putting some examples together for the article, I think I’ve finally landed on a useful workflow and schema for the figure data search engine. I’m hoping to get that out ASAP; stay tuned… Continue Reading: kdnuggets

Some time ago I launched a little project mining data from arXiv; you can read about it in other blog posts. Specifically, I modeled about 500k figures as Gaussian mixture models in order to create features, so that figures might ultimately be represented as graphs for comparison. More ordinary methods might suffice too… Continue Reading: arxiv mining
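As a rough illustration of what modeling a figure as a Gaussian mixture can look like, here is a minimal sketch in plain Python. It is not the code used in the project: the function name `fit_spherical_gmm`, the toy points, the restriction to spherical (single-variance) components, and the choice of starting means are all assumptions made for the example.

```python
import math

def fit_spherical_gmm(points, means, n_iter=25, min_var=1e-6):
    """Fit a 2-D mixture of spherical Gaussians with plain EM.

    points: list of (x, y) tuples, e.g. "ink" pixel locations in a figure.
    means:  initial guesses for the component centres (hypothetical choice).
    """
    k, n = len(means), len(points)
    means = [tuple(m) for m in means]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x, y in points:
            dens = []
            for (mx, my), var, w in zip(means, variances, weights):
                d2 = (x - mx) ** 2 + (y - my) ** 2
                dens.append(w * math.exp(-d2 / (2 * var)) / (2 * math.pi * var))
            total = sum(dens) or 1.0  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, centre and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # leave an empty component unchanged
            mx = sum(r[j] * x for r, (x, _) in zip(resp, points)) / nj
            my = sum(r[j] * y for r, (_, y) in zip(resp, points)) / nj
            d2sum = sum(r[j] * ((x - mx) ** 2 + (y - my) ** 2)
                        for r, (x, y) in zip(resp, points))
            means[j] = (mx, my)
            variances[j] = max(d2sum / (2 * nj), min_var)  # 2 dimensions
            weights[j] = nj / n
    return weights, means, variances

# Toy data: two well-separated blobs standing in for two marks in a figure.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8), (9, 8), (8, 9), (9, 9)]
weights, means, variances = fit_spherical_gmm(points, [points[0], points[-1]])
```

The fitted weights, centres and variances are the kind of compact per-figure features the excerpt alludes to; a real pipeline would more likely use full covariance matrices and a library implementation.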

We’re edging closer to officially releasing the available API methods, including a core OCR method (text-lines) that allows for text extraction in the presence of extraneous objects such as embedded images. Image upload/download times combined with computation cost at the backend amount to several seconds, which isn’t too bad. Using curl to POST data,… Continue Reading: Mining text from document pages

We’re privileged to be presenting at the Collision beta track in New Orleans in a couple of weeks! The conference will host a stimulating and diverse range of companies, investors and speakers. Looking forward to hearing from D-Wave, among many others. If you’re going, stop on by and say hi; free swag and demos!… Continue Reading: Collision Conference 2017

I’ve finished mining ~1M figures from a venerable preprint server. It was a great learning experience, and I’ll write more shortly on the findings. I’m about to release the data, and to that end I’ve put together a boilerplate MEAN app on GitHub for serving up the details using AWS. Each figure from >100k papers is… Continue Reading: Figure Search Engine I

I’ve spent some time, over the last month or so, mining figures from a large document preprint server. The value of the information stored within is hard to overestimate, covering a huge cross section of Physics and Mathematics, and several other sciences. The goal of this work is to find figures, and create mixture models of… Continue Reading: Figure Mining at Scale

When extracting data from technical images, one is solving an inverse problem. The pixels in the image and their x,y locations are essentially a set of observations, from which we would like to discern the underlying inputs that produced them. An important initial step in processing technical documents is image feature extraction. These features need… Continue Reading: Image Feature Extraction
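To make the "pixels as observations" idea concrete, here is a small sketch in plain Python. The toy image and the helper names `foreground_points` and `mean_and_covariance` are invented for illustration and are not taken from the post. It turns a binary image into a list of (x, y) observations and summarises them by their centroid and covariance, one of the simplest features one could extract.

```python
# Toy binary image: 1 marks an "ink" pixel, 0 is background (made-up data).
IMAGE = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

def foreground_points(image):
    """Return the (x, y) location of every non-zero pixel."""
    return [(x, y)
            for y, row in enumerate(image)
            for x, value in enumerate(row)
            if value]

def mean_and_covariance(points):
    """Centroid and 2x2 (population) covariance of a set of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

points = foreground_points(IMAGE)
centroid, cov = mean_and_covariance(points)
```

Here the centroid locates the mark and the covariance describes its spread and orientation: the horizontal variance exceeds the vertical one because the block of ink is wider than it is tall.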