Extract tables from PDFs and scrape the web

Business Publishing

Maximising subscriber retention
with unique, accurate data feeds

With the rise of the web, the publishing industry is in a time of great change. Customers now expect data relevant to their businesses to be completely up to date and beautifully presented – whether that's in print, on the web, or on their phones.

Now, more than ever, it's the fine balance between insightful editorial content and rock-solid data that makes or breaks a business publication. ScraperWiki has been working with the publishing and media sector for years, helping them achieve this exact goal.

Informa's Agra product area needed to collect pricing and volume data for different commodities from various source websites. The data was stored in Excel files, web pages or PDF files, and their analysts and journalists were spending too much time copying, pasting and cleaning data in spreadsheets, and not enough time providing insights for their customers.

You can use the PDF technology we developed at PDFTables.com to get data from your own PDFs.

As a business we deal with a vast array of datasets in various formats. Large PDFs, for example, are particularly problematic for our reporters and so it was a joy using ScraperWiki’s Table Xtract tool for the first time: within seconds the data was in a format we could download, analyse and visualise.

Ian Hart
Content Director, Informa Agra

Case Study: Collecting health spending data for EMAP

The Health Service Journal (HSJ) from EMAP provides the most comprehensive and insightful coverage of what is really going on in the NHS, from both policy and practical perspectives, and follows the key figures within the healthcare sector.

HSJ wanted to analyse spending activity above £25K across the many Department of Health organisations. Whilst the raw data had been published on the internet, each organisation had a different way of presenting the data. ScraperWiki automated the collation of the spending data from the internet and then created sophisticated scrapers to extract, clean and categorise the content. This allowed HSJ to discover some unique insights that they could share with their subscribers.
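The collation step described above can be sketched in miniature. Each organisation publishes its spending report in a different layout, so a scraper has to normalise the columns, clean the amounts, and keep only transactions above the £25K threshold. The column names, sample rows and helper functions below are purely illustrative assumptions, not the actual scrapers ScraperWiki built.

```python
import csv
import io

# Hypothetical excerpt from one organisation's spending report.
# In practice each Department of Health organisation used a
# different layout, so each needed its own scraper.
RAW_CSV = """Date,Supplier,Expense Area,Amount
2013-04-02,Acme Medical Ltd,Estates,"£31,250.00"
2013-04-05,Paper Supplies Co,Office,"£1,200.50"
2013-04-09,BuildCo,Estates,"£98,000.00"
"""

THRESHOLD = 25000  # HSJ analysed spending activity above £25K


def clean_amount(raw):
    """Strip the currency symbol and thousands separators, return a float."""
    return float(raw.replace("\u00a3", "").replace(",", ""))


def transactions_over_threshold(csv_text, threshold=THRESHOLD):
    """Yield cleaned, categorised rows whose amount exceeds the threshold."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        amount = clean_amount(row["Amount"])
        if amount > threshold:
            yield {
                "date": row["Date"],
                "supplier": row["Supplier"],
                "category": row["Expense Area"],
                "amount": amount,
            }


big_spends = list(transactions_over_threshold(RAW_CSV))
```

Running this over the sample keeps the two Estates transactions and drops the £1,200.50 office purchase; the real pipeline did the same filtering and categorisation at scale, across every organisation's feed.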

We knew there was a goldmine of data in the spending reports but didn't have the time or skills to extract and analyse the data. The team at ScraperWiki used their platform and data science skills to automate the data collection, cleaning and analysis, freeing us up to share the insights with our subscribers.