Tilburg University (sponsor Chris Hartgerink, @chjh) has asked CM to collaborate on a project to extract data from funnel plots on a high-throughput basis and measure the results. The project will last ca. 1 month from now.

@chjh will provide a corpus of diagrams and @petermr is writing the extraction software by enhancing AMI (SVG) to extract data as CSV.

The project is OpenNoteBook (https://en.wikipedia.org/wiki/Open_notebook_science) and everything is public. There are no closed emails other than admin. We believe that the whole project should be replicable, including Open software, Open corpus, immediately published results and Open discussion. Anyone can read and comment.

Format

2-axes:

x observable (may be z-scores or domain-specific units)

y SEM or similar

The axes are normally:

(-a, 0, a) for x

(0, b) downwards for y

Here x is roughly symmetrical about the expected or observed mean and y runs downwards. This gives a very roughly triangular shape, or inverted funnel. Usually the plot is decorated with a triangle or a funnel (trumpet-like) outline, but these are ignored in this study.

Funnel plots

two-axis ("L") plot

calvin

This has the x- and y-axes meeting in the lower left (0.2, 0.18) with positive y downwards. The y-axis (Standard error) has its "origin" at the top, and the x-origin is dependent on the user domain. The vertical line in the middle is the author's estimate of the mean (I think) and the two triangular lines are approximate boundaries of the funnel. The points are plotted in an unusual manner (a small line from SW to NE).

The SVG file (the leading whitespace reflects its position on the original page, where all the text has been clipped):

sbarra

Development corpus

The primary corpus for analysis will be delivered by @chjh later this week. To help develop and test the software we have created a small (10-paper) corpus. These papers are all publicly visible on the Web (found by a Google search for "funnel plot", and not behind paywalls). I selected the first 12-15 hits and extracted the papers which appeared to have vector graphics plots.

The PDF papers will be posted to our repo under fair use and using the Hargreaves Exception for TDM. They will only be used for development of funnel plot extraction. The derived data in SVG will also be committed to the repo.

Software development

The (Open) software will be developed in response to the corpus. The algorithms have been developed in the svg package. In addition we have to test the traversal process (iterating over the CTrees in the CProject and, within each CTree, iterating over possibly multiple figures). The current design, which is informed by the Table extraction, is shown in the next post.

The current plan is:

create a set of archetypal figures to exemplify:

traversal/process of a single figure

traversal/process of several figures in a ctree

traversal/process of single figure within many ctrees

traversal/process of multiple figures within multiple ctrees
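The four traversal cases above reduce to one nested iteration. What follows is a minimal Python sketch of that idea, not the AMI/norma Java code. It assumes that a CProject is a directory whose child directories are CTrees, and that each extracted figure lives at figure<N>/figure.svg inside its CTree. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed layout: <cproject>/<ctree>/figure<N>/figure.svg
FIGURE_DIR = re.compile(r"figure(\d+)")

def list_ctrees(cproject: Path):
    """Return the CTree directories (immediate child directories) in name order."""
    return sorted(p for p in cproject.iterdir() if p.is_dir())

def list_figures(ctree: Path):
    """Return the figure.svg paths of one CTree in numeric figure order."""
    found = []
    for child in ctree.iterdir():
        m = FIGURE_DIR.fullmatch(child.name)
        svg = child / "figure.svg"
        if m and svg.is_file():
            found.append((int(m.group(1)), svg))
    return [svg for _, svg in sorted(found)]

def traverse(cproject: Path):
    """Return (ctree name, figure directory name) for every figure in the project."""
    return [(ctree.name, svg.parent.name)
            for ctree in list_ctrees(cproject)
            for svg in list_figures(ctree)]
```

A CTree with no figure directories simply contributes nothing, so the single-figure and multi-figure cases need no special handling.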

Algorithms

The algorithms are developed in the svg project of the ami-stack. Currently the code and tests are in: package org.xmlcml.graphics.svg.plot;

The tests in org.xmlcml.graphics.svg.plot.PlotBoxTest.java will process individual files by direct calling and write:

Repository structure for figures

current repo structure inspired by the Table Extraction

This is the current structure in the cm-ucl project as found in https://github.com/ContentMine/cm-ucl/tree/master/corpus-oa-pmr-v02 . This shows a single CTree (10.1016_j.amepre.2016.07.024) within a CProject (corpus-oa-pmr-v02). The CTree would often have the primary files (e.g. fulltext.pdf) as direct children and a set of sub-directories for specific components of the document. Since this is still at an early stage some details may change later. The only raw file is fulltext.pdf, from which all others are derived.

Since each table undergoes transformation we create a directory for each, and the same will be done for figures. NOTE: image will collect the raw images (e.g. *.png) while figure will refer to a structured object in the paper, usually in a box (explicit (lines) or implicit (whitespace)). Some figures contain only vectors (translated to SVG), some have only image/s, and some are mixed.

processed files

The PDFs were converted to SVG in a 2-step process:

create a CProject from the individual PDF files. (Note: it is useful to have reasonable names for the files, with no spaces, escape characters, etc.) The files should be in a directory (say publicPapers) and the commands should be run from its parent directory.

This identifies all files of the form foo.pdf and, for each one:

extracts the file root (foo)

creates a directory of that name (foo/)

renames foo.pdf as fulltext.pdf as a child of foo/

Note that the original file foo.pdf is no longer available under that name. If there were, say, 11 files *.pdf there will now be 11 subdirectories (known as CTrees) each with a single child fulltext.pdf.

transform the PDF to SVG, also running this from the parent directory of publicPapers.

This iterates over all the CTrees and looks for a child fulltext.pdf. If none is found we skip to the next. The transformer pdf2svg is applied to the PDF and creates 1 or more subdirectories (explained below).

extract the figures. We shall soon make this automatic, but at the moment we use manual editing in Inkscape (or another SVG editor). See below.
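The effect of the first step (turning each foo.pdf into a CTree foo/ with a child fulltext.pdf) can be illustrated with a few lines of Python. This is only a sketch of the resulting directory layout, not the norma command itself. The function name make_cproject is hypothetical.

```python
import shutil
from pathlib import Path

def make_cproject(project_dir: Path):
    """Turn every foo.pdf in project_dir into a CTree foo/ containing fulltext.pdf.

    Illustration only: mimics the directory layout described in the text.
    """
    created = []
    for pdf in sorted(project_dir.glob("*.pdf")):
        if not pdf.is_file():
            continue
        ctree = project_dir / pdf.stem      # file root "foo" becomes directory foo/
        ctree.mkdir(exist_ok=True)
        # After the move, foo.pdf no longer exists under its original name.
        shutil.move(str(pdf), str(ctree / "fulltext.pdf"))
        created.append(ctree)
    return created
```

Calling make_cproject(Path("publicPapers")) on a directory of 11 PDFs would leave 11 subdirectories, each with a single fulltext.pdf.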

Raw (+manual) extracted SVG

Here is the pdf2svg output for publicPapers1, with material that doesn't include figures removed. pdf2svg creates a subdirectory svg with the extracted pages fulltext-page\d+.svg. Vector figures have been extracted manually using Inkscape and are labelled figure1.svg, etc. (This should be changed to the figure1/figure.svg syntax.)
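When working through the page files by script, it helps to sort them by page number and not alphabetically (otherwise page10 sorts before page2). A small Python sketch, assuming only the fulltext-page\d+.svg naming given above; the helper names are hypothetical.

```python
import re
from pathlib import Path

PAGE_NAME = re.compile(r"fulltext-page(\d+)\.svg")

def page_number(name: str):
    """Return the page number encoded in a page file name, or None if it is not one."""
    m = PAGE_NAME.fullmatch(name)
    return int(m.group(1)) if m else None

def sorted_pages(svg_dir: Path):
    """Return the page SVG files of one svg/ directory in numeric page order."""
    pages = [p for p in svg_dir.iterdir() if page_number(p.name) is not None]
    return sorted(pages, key=lambda p: page_number(p.name))
```

Files such as figure1.svg in the same directory are ignored because they do not match the page pattern.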

high-level

A funnel plot is expected to have many attributes, some MUST (mandatory), some SHOULD (desirable) and some MAY (optional).

The Header and Footer usually contain information about the plot, while the *_axial and Plot_body components comprise the plot itself. Often the semantics of the *_axial blocks and the Plot_box are described in natural language in the Header/Footer. There may be 2, 3 or 4 axial boxes with meaningful content - if there is only a horizontal axis it is impossible to extract User coordinates.

Normally the only useful axes are BOTTOM and LEFT. Sometimes the TOP mirrors the BOTTOM - any other semantics are ignored. Sometimes the RIGHT axis mirrors the LEFT - any other semantics such as different units or more than one symbol type are ignored.

The system will analyse all axes, but the coordinate extraction is based on BOTTOM and LEFT and these are discussed below.

Structure of a Header and/or Footer

These may contain:

a Figure (maybe Fig.) label usually with a number or letter or combination of both. The numbers and/or letters are normally consecutive in the containing paper.

a FigureTitle, usually a single phrase or sentence. e.g. "Funnel plot of IQ variability against Standard Error of mean".

annotation. Further detailed description of the figure (contents, structure), e.g. "adults are denoted by filled circles and children by open circles". Occasionally the actual plot symbols are included.

Normally this text is L2R and unrotated, consisting of words separated by white space, and is often wrapped. Sometimes titles are rendered in italics or bold.

Axes

The plot will normally have lines to delimit the plot: LEFT and BOTTOM axial lines are mandatory (banded rectangles are out of scope).

Details of axial boxes

This plot has successful extractions of the x- and y- axes. The extracted components are in green annotation boxes.

For each axis there are normally the components:

tick marks on the axis.

These may be (a) outside, (b) inside, or (c) both inside and outside. Normally the ticks touch the axis. The commonest style is single tick marks regularly spaced, as here, but there are also logarithmic scales and major ticks every 5 or 10 minor ticks.

scale values

e.g. the x-axis has scale values (-5, 0, 5, 10, 15). Ideally every tick will have a scale value. However, often when the axial corner (e.g. where x meets y) lies on a tick point, the tick is omitted but not the scale value.

axial legend/title

These describe the values along that axis (e.g. "s.e. of MD_HA" and "MD_HA"). For funnel plots the Y-axis is usually some form of Standard Error or similar, while the X-axis is user-determined. Sometimes (e.g. "Z-scores") there is a natural origin, but often there is not.

Note that the TOP axis has no tick marks so any text is interpreted as a possible figure title.

Axial boxes

Tilburg do not require us to extract or analyze the labels.

The axial box normally contains the components:

tick marks (perpendicular to the axis). They can be inside, outside or both.

tick annotations (usually scale values, e.g. 5, 10, 15 ...)

axial labels (the quantity measured, the units and any other annotations)

Initial triage

PMR will inspect the 30 PDFs and indicate any which are out of scope (e.g. not vectors, no funnels, etc.). I expect this to be a very small number.

analysis and selection

SPK will create a norma CProject (tilburg20170615) from the 30 PDFs, recording what commands were issued and what resulted.

SPK will run the 30 PDFs through norma --transform pdf2svg and report on which converted successfully to SVG. PMR will indicate which of the diagrams (and sub-diagrams) are in scope (some papers may contain 10 or more plots and only 1 or 2 may be needed for development).

manual extraction of diagrams

SPK will manually edit the appropriate pages of the SVG output and trim/snip out anything other than the funnel plots. The result should conform as far as possible to the format above - e.g. boxes surrounding the plot should be trimmed out. Each funnel plot (even in a collection within one figure) should be in a separate *.SVG file (in "Plain SVG", not "Inkscape SVG").

Each diagram should have a unique number within the document running from 1-n for n extracted funnelSVGs. The Author's numbering scheme should NOT be used, but included as an annotation.

index

An index of the SVG plots should be included as a CSV file (rawPdf.csv). This should contain:

DOI / (URL)

serial number in repository folder (1-n)

authors' number
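The index could be written with Python's csv module, as in the sketch below. The column names (doi_or_url, serial, authors_number) and the helper functions are hypothetical; the text above fixes only which three pieces of information are recorded. Serial numbers run from 1 to n within one document, independently of the authors' own numbering.

```python
import csv

# Hypothetical column names for the three fields listed above.
FIELDS = ["doi_or_url", "serial", "authors_number"]

def index_rows(doi_or_url, authors_numbers):
    """Build one row per extracted plot, numbering plots 1..n within the document."""
    return [{"doi_or_url": doi_or_url, "serial": i, "authors_number": a}
            for i, a in enumerate(authors_numbers, start=1)]

def write_index(rows, handle):
    """Write the rows as CSV (with a header line) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

For example, rows for several documents can be concatenated and written in one call with open("rawPdf.csv", "w", newline="") as the handle.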

document analysis of the Funnel plots.

PMR will create an initial schema for plot structure and components. There will be ca 20-30 fields for each plot. Each funnel plot will be manually analysed (SPK) to describe the components present, and the data entered in a CSV file with columns describing the fields in the plot. Details will be posted shortly.

extraction of reference points

For each plot 4 reference points are selected to calibrate the user scales (i.e. the values are what the user reads off the axial scales, not the pixel/screen coordinates). This will enable us to detect any systematic or random errors in the extracted coordinates. (Generally we would expect extracted coordinates to be more accurate than measuring with a ruler.)

The number of points in the plot should also be manually extracted.
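One way the reference points might be used is sketched below in Python (an illustration, not the norma-svg implementation). For each axis, fit a straight line from SVG coordinate to user value by least squares, then inspect the residuals: a consistent offset suggests a systematic error, scattered residuals a random one. A linear (non-logarithmic) axis is assumed and the function names are hypothetical.

```python
def fit_axis(pixels, values):
    """Least-squares fit of value = scale * pixel + offset along one axis."""
    n = len(pixels)
    mean_p = sum(pixels) / n
    mean_v = sum(values) / n
    spp = sum((p - mean_p) ** 2 for p in pixels)
    if spp == 0:
        raise ValueError("reference points must differ in pixel position")
    spv = sum((p - mean_p) * (v - mean_v) for p, v in zip(pixels, values))
    scale = spv / spp
    return scale, mean_v - scale * mean_p

def residuals(pixels, values, scale, offset):
    """Return manual value minus fitted value for each reference point."""
    return [v - (scale * p + offset) for p, v in zip(pixels, values)]
```

Because y runs downwards in both the SVG and the funnel plot, the fitted y scale is normally positive; a negative scale would signal an axis drawn the other way up.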

initial processing to extract scatterplots

The current software will be run on the CProject to create CSV files of the extracted points and annotated SVG plots. The presence/absence of a component can be checked and the embedded values checked for syntax (e.g. numeric), consistent axial tick differences etc.
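The two checks mentioned (numeric syntax and consistent tick differences) could look like the following Python sketch. It covers evenly spaced linear scales only, the function names are hypothetical, and the replacement of the Unicode minus sign (U+2212) is an assumption about how negative scale values may appear in extracted text.

```python
import math

def parse_scale_values(labels):
    """Convert tick labels to floats; raises ValueError on a non-numeric label."""
    # Assumption: negative values may use the Unicode minus sign U+2212.
    return [float(s.strip().replace("\u2212", "-")) for s in labels]

def tick_spacing_consistent(values, rel_tol=1e-6):
    """Return True if successive scale values differ by the same non-zero step."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if not diffs:
        return True          # zero or one value: nothing to compare
    if diffs[0] == 0:
        return False         # repeated value, not a usable scale
    return all(math.isclose(d, diffs[0], rel_tol=rel_tol) for d in diffs)
```

A logarithmic axis would fail tick_spacing_consistent as written; it would need the same test applied to the logarithms of the values.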

software development

The current norma-svg plot analysis software was run on the development corpus. Initially most documents failed to produce a CSV file, and the software was iteratively updated to reduce the failing tests. There were and are many reasons for failure and the following analysis will show the different heuristics that had to be added.

The commit strategy is now fairly fine grained and the commit message log is a medium level account of the gradual introduction of the functionality: