The DARPA Memex project and NSF Polar Cyber Infrastructure project have been funding a ton of improvements in the Apache Tika framework. Apache Tika is a content detection and analysis toolkit that supports file type identification (MIME identification) for over 1,200 types of files; extraction of text, metadata, and language information from those files; and even translation!

Though Tika supports all those file types, its support for extraction from images and videos has been lacking. Via the Memex and NSF projects, we have expanded Tika to extract text from images (using Tesseract OCR), and we are actively integrating other analyses (visual sentiment analysis, geolocation using toolkits like GDAL, and analyses of scenes and objects).
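
To give a rough feel for what this looks like in practice, here is a minimal sketch (not the talk's own demo) that sends a file to a running Tika server over its REST interface. It assumes a tika-server instance is listening at its default address, http://localhost:9998, and that Tesseract is installed on that machine so image uploads go through OCR. The helper names (`build_request`, `send`, `detect_type`, `extract_text`) are hypothetical and exist only for illustration.

```python
import sys
import urllib.request

# Assumption: a tika-server instance is running locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_request(endpoint, data, headers=None):
    """Build (but do not send) a PUT request for a tika-server endpoint."""
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint}",
        data=data,
        headers=headers or {},
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def detect_type(data):
    """Ask the server for the MIME type of the given bytes."""
    return send(build_request("detect/stream", data))


def extract_text(data, ocr_language=None):
    """Ask the server for plain text; images are OCR'd if Tesseract is available."""
    headers = {"Accept": "text/plain"}
    if ocr_language:
        # Optional hint for the OCR language, e.g. "eng".
        headers["X-Tika-OCRLanguage"] = ocr_language
    return send(build_request("tika", data, headers))


if __name__ == "__main__":
    # Usage: python tika_sketch.py path/to/file
    with open(sys.argv[1], "rb") as handle:
        payload = handle.read()
    print(detect_type(payload))
    print(extract_text(payload, ocr_language="eng"))
```

Run against a scanned image, the second call should return whatever text Tesseract can recognize; on a machine without Tesseract the server would fall back to metadata-only handling for images.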

I'll tell you all about how to install and use these improvements and even illustrate them in a cool example from Memex and NSF Polar.

Chris Mattmann has a wealth of experience in software design, and in the construction of large-scale data-intensive systems. His work has infected a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting...