I am often faced with the task of extracting data from a published graph (usually a bitmap image in a paper). For example, I might want a list of the individual (x, y) coordinates of the points in a scatter plot.

One option is to ask the contact author for raw data. Most will do it, sometimes in nice ASCII format, sometimes in Excel files, sometimes in formats that I cannot open (chemists are fond of software like Origin or Igor Pro). Some authors never reply, or ask questions like “what do you want to do with it?”. In all cases, it takes time. Sometimes, it's not even possible (I can hardly email the author of a 1936 paper!).

The other option is to extract the data. I currently use g3data to do that, but for large scatter plots, having to click on every single point is tedious. Thus, I am looking for data extraction software that could recognize individual points automagically, and possibly filter them by point color or symbol used. Does such a thing even exist? What other tools can you recommend to work around this issue?

I don't think it'd be appropriate to have extra requirements on the software, so I'm happy with free or commercial solutions, running on any OS. Of course, if given the choice, I'd prefer open source software running on Linux and Mac OS.

The problem with extracting the data from a printed graph is that the process will introduce errors. Then what can you really say about the data you have?
– Dave Clarke, Feb 1 '13 at 8:46

@DaveClarke Yes, the process introduces some uncertainty, but if the graph resolution is good, the uncertainty can be low. Also, sometimes there is no choice: I recently digitized data from a 1936 paper, and I can hardly imagine emailing the author :)
– F'x, Feb 1 '13 at 8:52

An option you didn't mention in the question is to reproduce the experiment yourself. In some cases it is a waste of time you'd like to avoid, but depending on the nature of the experiment, it may be an interesting solution.
– T. Verron, Feb 1 '13 at 12:24

Edge detection in image processing is not easy; it gets harder if you have anything besides black and white. So the main difficulty is not in the "conversion to tabular" but the "finding the data points" part of the problem; you may have better luck asking on dsp.stackexchange.com
– Willie Wong, Feb 1 '13 at 16:00

7 Answers

We had a very similar problem at my old job: we had to scour a huge literature database containing literally thousands of papers for any data showing the solubility behavior of different species. A lot of this data was from the 1950s through the 1970s, and we could not reproduce it for a large number of reasons (time and present-day safety regulations being chief among them).

The colleague who was responsible for collecting all of this data used a package called DataThief to extract the data from graphs. It seemed to work well, but it is also (from what I recall) commercial software (or rather shareware, but still technically not free). It is cross-platform and written in Java, so it may satisfy a decent number of your criteria.

The latter is something I had not thought about, but might actually be useful for some teaching needs (analysis of motion from a video). My first experiences are good: the software is easy to use, includes a nice magnification UI, and automatic curve detection works fine if the graph is “clean”.
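
To give a rough idea of what automatic point detection on a “clean” bitmap involves, here is a minimal Python sketch. It is not how any of the tools named in this thread work internally; it simply keeps the pixels whose color is close to a chosen marker color, groups touching pixels into blobs, and reports each blob's centroid. The function name `find_markers`, the tolerance `tol`, and the tiny synthetic image are all assumptions made for illustration.

```python
from collections import deque


def find_markers(image, target, tol=40, min_pixels=2):
    """Return (col, row) centroids of connected blobs whose color is near `target`.

    `image` is a list of rows; each row is a list of (r, g, b) tuples.
    """
    height, width = len(image), len(image[0])

    def matches(pixel):
        # A pixel matches if every channel is within `tol` of the target color.
        return all(abs(c - t) <= tol for c, t in zip(pixel, target))

    seen = [[False] * width for _ in range(height)]
    centroids = []
    for row in range(height):
        for col in range(width):
            if seen[row][col] or not matches(image[row][col]):
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(row, col)])
            seen[row][col] = True
            blob = []
            while queue:
                r, c = queue.popleft()
                blob.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and not seen[nr][nc] and matches(image[nr][nc])):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Ignore blobs too small to be a marker (likely noise).
            if len(blob) >= min_pixels:
                cx = sum(c for _, c in blob) / len(blob)
                cy = sum(r for r, _ in blob) / len(blob)
                centroids.append((cx, cy))
    return centroids


# Tiny synthetic "scan": white background, two red 2x2 markers, one blue one.
WHITE, RED, BLUE = (255, 255, 255), (220, 30, 30), (30, 30, 220)
demo = [[WHITE] * 8 for _ in range(6)]
for r, c in ((1, 1), (1, 2), (2, 1), (2, 2)):
    demo[r][c] = RED
for r, c in ((3, 5), (3, 6), (4, 5), (4, 6)):
    demo[r][c] = RED
for r, c in ((0, 5), (0, 6), (1, 5), (1, 6)):
    demo[r][c] = BLUE

red_points = find_markers(demo, target=(255, 0, 0), tol=60)
blue_points = find_markers(demo, target=(0, 0, 255), tol=60)
```

The centroids are still in pixel coordinates; they would need an axis calibration step before they mean anything as data. Overlapping markers, grid lines, and anti-aliased scans would all need more care than this sketch takes.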

Slight correction: the program that you call "Digitizer" actually calls itself Engauge, or sometimes "Engauge Digitizer", and it's free software (GNU General Public License v2) rather than shareware. I have used it several times on Linux and found it very capable.
– Pont, Apr 14 at 6:10

I used DataThief years ago. From what I remember, it is not fully automated. You start by loading a digital image and identifying the axes, some tick marks, the axis limits and the scale (i.e., linear/log/polar). This lets it handle bad scans (e.g., rotation and warping). Once it knows the bounding box of the plot, you then tell it what to extract (curves, points, error bars, etc.).
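
The axis identification step boils down to a mapping from pixel positions to data values. Here is a minimal Python sketch of that idea for linear and log axes. It is not DataThief's code: the helper name `make_axis` and the pixel positions are made up for illustration, and it assumes the axes are aligned with the pixel grid (no rotation or warping).

```python
import math


def make_axis(pixel_a, value_a, pixel_b, value_b, scale="linear"):
    """Build a function mapping a pixel coordinate to a data value along one axis.

    (pixel_a, value_a) and (pixel_b, value_b) are two known tick marks.
    """
    if scale == "log":
        # On a log axis, pixel distance is proportional to log10 of the value.
        va, vb = math.log10(value_a), math.log10(value_b)
    elif scale == "linear":
        va, vb = value_a, value_b
    else:
        raise ValueError("scale must be 'linear' or 'log'")
    slope = (vb - va) / (pixel_b - pixel_a)

    def to_data(pixel):
        v = va + slope * (pixel - pixel_a)
        return 10 ** v if scale == "log" else v

    return to_data


# Hypothetical calibration: x tick 0 at pixel column 100, x tick 10 at column 500.
x_of = make_axis(100, 0.0, 500, 10.0)
# The y axis is logarithmic; pixel rows grow downward, so the larger value
# sits at the smaller row number.
y_of = make_axis(400, 1.0, 100, 1000.0, scale="log")

point = (x_of(300), y_of(250))
```

With these made-up ticks, the pixel at column 300, row 250 lands halfway along the x axis and halfway, in log terms, between 1 and 1000 on the y axis.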

It is written in Java, so it should run on most operating systems. I believe it is free as in beer (and it might be open source).

There's a now famous/infamous blog by a professor at Berkeley where he and his lab carefully read and dissect papers in bioinformatics. I saw him mention this tool. If this guy uses it, it's probably quality. liorpachter.wordpress.com/2014/02
– vector07, Sep 24 '14 at 1:40

I think the GRABIT m-file by Jiro Doke could be a solution for you. You can find it on MATLAB Central for free. I have copied the description of the code here for your convenience.

GRABIT Extracts data points from an image file.
GRABIT starts a GUI program for extracting data from an image file. It is capable of reading in BMP, JPG, TIF, GIF, and PNG files (anything that is readable by IMREAD). Multiple data sets can be extracted from a single image file, and the data is saved as an n-by-2 matrix variable in the workspace. It can also be renamed and saved as a MAT file.
Following steps should be taken:
1. Load the image file.
2. Calibrate axes dimensions. You will be prompted to select 4 points on the image.
3. Grab points by clicking on points. Right-click to delete a point. Image can be zoomed during this stage.
4. Multiple data sets will remain in memory so long as the GUI is open. Variables can be renamed, saved to file, or edited in Array Editor.
Panning is achieved by clicking and dragging on the image. In addition to using the zoom button, there are keyboard shortcuts for zooming:
<shift> - center view
<a> - zoom in
<b> - zoom out
<space> - reset view
This code will also work for extracting data points from a tilted or a skewed image (even upside-down or mirrored). The calibration stage ensures that the imperfect orientation or quality of the image is accounted for.
The types of files that will most likely work are BMP, JPG, TIF, GIF (up to 8-bit), and PNG files. Basically, any format supported by the IMREAD is accepted.
GRABIT(FILENAME) will start the GUI program and open the image file FILENAME.
Type GRABIT('-sample') to load a sample image.
Created in MATLAB® R13. Tested up to R2006a
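
To illustrate why a four-point calibration can cope with a tilted or skewed image, here is a minimal sketch in Python rather than MATLAB. It is not taken from GRABIT's source; the function name `make_calibration` and the sample coordinates are assumptions. It assumes the two x-calibration points lie on a line of constant y (for example, on the x axis), the two y-calibration points lie on a line of constant x, and the distortion is affine (rotation, skew, mirroring, but no warping).

```python
def make_calibration(px1, x1, px2, x2, py1, y1, py2, y2):
    """Return a function mapping a pixel (col, row) to data coordinates (x, y).

    px1, px2: pixel positions of two points with known x values x1, x2.
    py1, py2: pixel positions of two points with known y values y1, y2.
    """
    ax, ay = px2[0] - px1[0], px2[1] - px1[1]  # pixel vector along the x axis
    bx, by = py2[0] - py1[0], py2[1] - py1[1]  # pixel vector along the y axis
    det = ax * by - ay * bx
    if det == 0:
        raise ValueError("axes are parallel; calibration points are degenerate")

    def to_data(p):
        # Write p - px1 = t*a + s*b and solve for t (Cramer's rule);
        # t is the fraction of the way from x1 to x2.
        dx, dy = p[0] - px1[0], p[1] - px1[1]
        t = (dx * by - dy * bx) / det
        # Write p - py1 = t*a + s*b and solve for s;
        # s is the fraction of the way from y1 to y2.
        ex, ey = p[0] - py1[0], p[1] - py1[1]
        s = (ax * ey - ay * ex) / det
        return x1 + t * (x2 - x1), y1 + s * (y2 - y1)

    return to_data


# Made-up calibration clicks on a tilted, skewed scan.
to_data = make_calibration(
    px1=(50, 300), x1=0.0, px2=(90, 310), x2=10.0,   # two points on the x axis
    py1=(50, 300), y1=0.0, py2=(70, 240), y2=20.0,   # two points on the y axis
)
x, y = to_data((80, 275))
```

Because the data coordinates are read off along the two axis directions rather than along pixel rows and columns, the result does not depend on the image being upright.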

I think the OP explicitly wants something where he doesn't have to "Grab points by clicking on points".
– Willie Wong, Feb 1 '13 at 15:55

Yeah, the idea is exactly that I don't want to click for each and every point… as noted in my question, I currently use g3data which does the same thing, and I'm otherwise quite happy with it
– F'x, Feb 1 '13 at 16:15

Here I describe how to recover data from a vector graph in a PDF file with maximum exactness, and even how to estimate the error introduced by the recovery. I show how it can be done in Mathematica, but the method is basic and simple enough to be implemented easily in other systems.
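
For readers without Mathematica, here is a rough Python sketch of the general idea. It is not the linked method. It assumes the page's content stream has already been decompressed to text, that each marker is drawn as its own subpath using only the PDF `m` (move to) and `l` (line to) operators, and that no coordinate transformation (`cm`) is in effect. The function name and the sample stream are made up for illustration.

```python
import re


def subpath_centroids(content_stream):
    """Return the mean vertex position of each subpath in a PDF content stream.

    Only the `m` and `l` path operators are interpreted; every other
    operator is skipped.
    """
    token_re = re.compile(r"[-+]?\d*\.?\d+|[A-Za-z*]+")
    operands, current, centroids = [], [], []

    def flush():
        # Close the subpath being collected and record its centroid.
        if current:
            n = len(current)
            centroids.append((sum(x for x, _ in current) / n,
                              sum(y for _, y in current) / n))
            current.clear()

    for tok in token_re.findall(content_stream):
        if tok[0].isalpha():
            if tok == "m" and len(operands) >= 2:
                flush()  # a new "move to" starts a new subpath
                current.append((operands[-2], operands[-1]))
            elif tok == "l" and len(operands) >= 2:
                current.append((operands[-2], operands[-1]))
            operands.clear()  # operands are consumed by every operator
        else:
            operands.append(float(tok))
    flush()
    return centroids


# Made-up stream fragment: two diamond-shaped markers, each closed and filled.
stream = (
    "100 200 m 104 204 l 108 200 l 104 196 l h f\n"
    "300 150 m 304 154 l 308 150 l 304 146 l h f"
)
markers = subpath_centroids(stream)
```

The coordinates come straight from the numbers stored in the file, so no pixel-level uncertainty is involved; they still have to be mapped to data values using the positions of the axis ticks, which can be read from the same stream.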

Welcome to the site. Link-only answers are discouraged here. Can you give us some idea of what ScanIt is capable of doing, some benefits and drawbacks, etc?
– Fomite, Mar 14 at 7:43

@Fomite: This is not a link-only answer (see here). Nonetheless, it’s still a bad answer that can and should be improved by addressing your questions.
– Wrzlprmft, Mar 14 at 9:55

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
– Bob Brown, Mar 14 at 11:03

@Wrzlprmft It's close enough to a link-only answer that I'm perfectly comfortable asking the poster for more information.
– Fomite, Mar 15 at 8:56