Conversions from non-structured formats (PDF and PNG) to processable output.

These are all lossy methods. Sometimes they are quite good; for example, we can extract species names from images. It is useful to have redundant information so that errors can be detected (or at least the error rate measured).

AMI software stack (static)

The dependency tree of Java code in the AMI stack (i.e. excluding getpapers, quickscrape, canary, elastic). Third-party libraries (Apache, Guava, etc.) are not shown. This is current as of 2017-02.

At present the dependency chain is linear, so any rebuild of a library above ami triggers the Maven dependencies, and Jenkins or Travis will rebuild all lower ("downstream") libraries.

repo

status

svg

Library partially supporting SVG.

svg supports an XML DOM (XOM) for SVG. It differs from most other libraries in that it supports the construction of complex objects, e.g. creating rects from paths. It is not comprehensive and only covers static aspects of SVG, such as text and shapes. It also supports higher-level objects such as flowChart and arrow.

Features of svg include:

Reading SVG files directly into an SVG DOM

transformations of several kinds

semantification of graphical objects.

analysis of SVG diagrams into higher-level components
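
As an illustration of what "creating rects from paths" can involve, here is a minimal Java sketch. It is not the svg library's API: the class `PathToRectSketch`, the `Rect` record and the method `rectFromPath` are hypothetical names, and the sketch assumes a path that uses only absolute `M`, `L` and `Z` commands.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathToRectSketch {

    /** Hypothetical value type for this sketch; not a class from the svg library. */
    record Rect(double x, double y, double width, double height) {}

    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");
    private static final double EPS = 1e-6;

    private static boolean close(double a, double b) {
        return Math.abs(a - b) < EPS;
    }

    /** Returns a Rect if the path data describes an axis-aligned rectangle. */
    static Optional<Rect> rectFromPath(String d) {
        // Assumption: only absolute moveto/lineto/closepath; curves are rejected.
        if (!d.matches("[MLZ\\d\\s,.\\-]+")) {
            return Optional.empty();
        }
        List<Double> nums = new ArrayList<>();
        Matcher m = NUMBER.matcher(d);
        while (m.find()) {
            nums.add(Double.parseDouble(m.group()));
        }
        // Four corners, optionally followed by a repeat of the first corner.
        if (nums.size() != 8 && nums.size() != 10) {
            return Optional.empty();
        }
        int n = nums.size() / 2;
        double[] xs = new double[n];
        double[] ys = new double[n];
        for (int i = 0; i < n; i++) {
            xs[i] = nums.get(2 * i);
            ys[i] = nums.get(2 * i + 1);
        }
        if (n == 5 && !(close(xs[0], xs[4]) && close(ys[0], ys[4]))) {
            return Optional.empty();
        }
        // Edges must alternate between horizontal and vertical.
        boolean expectHorizontal = close(ys[0], ys[1]);
        for (int i = 0; i < 4; i++) {
            int j = (i + 1) % 4;
            boolean horizontal = close(ys[i], ys[j]);
            boolean vertical = close(xs[i], xs[j]);
            if (horizontal == vertical || horizontal != expectHorizontal) {
                return Optional.empty();
            }
            expectHorizontal = !expectHorizontal;
        }
        double minX = Math.min(Math.min(xs[0], xs[1]), Math.min(xs[2], xs[3]));
        double maxX = Math.max(Math.max(xs[0], xs[1]), Math.max(xs[2], xs[3]));
        double minY = Math.min(Math.min(ys[0], ys[1]), Math.min(ys[2], ys[3]));
        double maxY = Math.max(Math.max(ys[0], ys[1]), Math.max(ys[2], ys[3]));
        return Optional.of(new Rect(minX, minY, maxX - minX, maxY - minY));
    }

    public static void main(String[] args) {
        System.out.println(rectFromPath("M10 20 L110 20 L110 70 L10 70 Z"));
        System.out.println(rectFromPath("M0 0 L10 5 L10 10 L0 10 Z"));
    }
}
```

Real diagrams usually need tolerances larger than `EPS`, and paths with relative commands or curves would need a proper path parser.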

repo

status

html

Library partially supporting HTML.

html supports an XML DOM (XOM) for HTML. It differs from most other libraries in that it supports the construction of complex objects, e.g. creating words from text characters (this may also be done downstream). It is not comprehensive and only covers static aspects of HTML, such as div, ul, etc. It also supports the parsing of unclean HTML using various tidy programs.

Features of html include:

Reading HTML files directly into an HTML DOM

Tidying bad HTML

Many of the advanced functions are currently in the SVG2XML project but could be moved here later.

imageanalysis

imageanalysis transforms raw bitmaps (png, bmp) into semantic pixel maps, either monochrome ("binarized") or "posterized" with a small number of colours. This is very messy and heuristic. We have used it successfully for phylogenetic trees, and also for chemical structure diagrams. Currently (2017-02) we are working on X-Y plots.
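
To make the two kinds of pixel map concrete, here is a minimal Java sketch of binarization and posterization on a greyscale array. It is not the imageanalysis code: the class and method names are hypothetical, the fixed threshold is an assumption (real images generally need an adaptive one), and the input is assumed to be a non-empty rectangular array of values in 0..255.

```java
class BinarizeSketch {

    /** true marks a dark ("ink") pixel, i.e. a grey value below the threshold. */
    static boolean[][] binarize(int[][] grey, int threshold) {
        boolean[][] out = new boolean[grey.length][grey[0].length];
        for (int y = 0; y < grey.length; y++) {
            for (int x = 0; x < grey[y].length; x++) {
                out[y][x] = grey[y][x] < threshold;
            }
        }
        return out;
    }

    /** Maps each 0..255 grey value onto one of `levels` evenly spaced values (levels >= 2). */
    static int[][] posterize(int[][] grey, int levels) {
        int[][] out = new int[grey.length][grey[0].length];
        for (int y = 0; y < grey.length; y++) {
            for (int x = 0; x < grey[y].length; x++) {
                int bucket = grey[y][x] * levels / 256;   // 0 .. levels-1
                out[y][x] = bucket * 255 / (levels - 1);  // representative grey value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] grey = {{0, 200}, {130, 90}};
        System.out.println(java.util.Arrays.deepToString(binarize(grey, 128)));
        System.out.println(java.util.Arrays.deepToString(posterize(grey, 4)));
    }
}
```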

After binarization (or posterization) the diagrams consist of pixelIslands which can be analyzed heuristically. Methods include:

thinning

erosion

OCR for characters (mainly through Tesseract)
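
The island and erosion steps can be sketched as follows. This is a hedged illustration in Java, not the imageanalysis implementation: `PixelIslandSketch`, `erode` and `countIslands` are hypothetical names, the erosion uses a plus-shaped neighbourhood, and islands are taken to be 8-connected, both of which are assumptions. The input is assumed to be a non-empty rectangular array.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PixelIslandSketch {

    /** One erosion pass: a pixel survives only if it and its 4 neighbours are set. */
    static boolean[][] erode(boolean[][] img) {
        int h = img.length;
        int w = img[0].length;
        boolean[][] out = new boolean[h][w];  // border pixels stay false
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                out[y][x] = img[y][x] && img[y - 1][x] && img[y + 1][x]
                        && img[y][x - 1] && img[y][x + 1];
            }
        }
        return out;
    }

    /** Counts 8-connected groups of set pixels using breadth-first flood fill. */
    static int countIslands(boolean[][] img) {
        int h = img.length;
        int w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        Deque<int[]> queue = new ArrayDeque<>();
        int islands = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!img[y][x] || seen[y][x]) {
                    continue;
                }
                islands++;
                seen[y][x] = true;
                queue.add(new int[] {y, x});
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int ny = p[0] + dy;
                            int nx = p[1] + dx;
                            if (ny < 0 || ny >= h || nx < 0 || nx >= w) {
                                continue;
                            }
                            if (img[ny][nx] && !seen[ny][nx]) {
                                seen[ny][nx] = true;
                                queue.add(new int[] {ny, nx});
                            }
                        }
                    }
                }
            }
        }
        return islands;
    }

    public static void main(String[] args) {
        boolean[][] img = {
            {true,  true,  false, false, false},
            {true,  true,  false, false, true},
            {false, false, false, false, true},
        };
        System.out.println("islands: " + countIslands(img));
    }
}
```

Thinning is deliberately left out: it needs an iterative algorithm that preserves connectivity, which is more than a short sketch can show.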

The results are held in an SVG DOM based on svg. This means that, in principle, the results can be transformed into higher-level objects such as tables, chemistry, flowCharts, etc. The main problems are:

fuzzy diagrams (JPG, antialiasing)

unclear semantics ("l" vs "1", etc.)

overlap

However, with diagrams of simple to medium complexity it is often possible to extract "almost all" information.

repo

status

It's possible that some components will be refactored between nodes in the hierarchy. For example, the command-line processing in cproject may be moved higher, and the same holds for the cproject file structure. Alternatively, cproject might be moved to the end of the chain as a runner of modules.

repo

status

ami

A framework for running discipline-specific calculations/transformations/analyses on part of cproject.

Historically ami managed all fact extraction, including bag-of-words, regex, and dictionary search. Some of these are also managed by elastic search. The division of work is fluid, and a given extraction may occur in either framework.
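
For readers unfamiliar with the three kinds of extraction named above, here is a minimal Java sketch of each. It does not show ami's API or its command line: the class and method names are hypothetical, the abbreviated-binomial pattern is only an example regex, and the dictionary search is a naive case-insensitive substring test.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FactExtractionSketch {

    /** Bag-of-words: lower-cased token -> number of occurrences. */
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Regex search: every match of the pattern, in document order. */
    static List<String> regexMatches(String text, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    /** Dictionary search: terms found in the text (naive; also matches inside words). */
    static List<String> dictionaryMatches(String text, Collection<String> terms) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (haystack.contains(term.toLowerCase(Locale.ROOT))) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Samples of E. coli and H. sapiens were compared with E. coli controls.";
        // Example pattern only: abbreviated binomials such as "E. coli".
        Pattern abbreviatedBinomial = Pattern.compile("\\b[A-Z]\\. [a-z]+\\b");
        System.out.println(bagOfWords(text));
        System.out.println(regexMatches(text, abbreviatedBinomial));
        System.out.println(dictionaryMatches(text, List.of("H. sapiens", "Zika")));
    }
}
```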

There are certain operations for which ami and the cproject structure are essential, including: