ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

The Sensible Code Company is our new name
Tue, 09 Aug 2016
https://blog.scraperwiki.com/2016/08/the-sensible-code-company-is-our-new-name/

For a few years now, people have said “but you don’t just do scraping, and you’re not a wiki, why are you called that?”

Our other main product, PDFTables, converts PDFs into spreadsheets. You can try it out for free.

We’re also working on a third product – more about that when the time is right.

You’ll see our company name change on social media, on websites and in email addresses over the next day or two.

It’s been great being ScraperWiki for the last 6 years. We’ve had an amazing time, and we hope you have too. We’re sticking to the same vision, to make it so that everyone can make full use of data easily.

We’re looking forward to working with you as The Sensible Code Company!

Remote working at ScraperWiki
Tue, 02 Aug 2016
https://blog.scraperwiki.com/2016/08/remote-working-at-scraperwiki/

We’ve just posted our first job advert for a remote worker. Take a look, especially if you’re a programmer.

Throughout ScraperWiki’s history, we’ve had some staff working remotely – one for a while as far away as New York!

Sometimes staff have had to move away for other reasons, and have continued to work for us remotely. At other times they are from the north of England, but just far enough away that it is only practical to come into the office for a couple of days a week.

Collaborative tools are better now – when we first started coding ScraperWiki in late 2009, GitHub wasn’t ubiquitous, and although we had IRC it was much harder to get bizdevs to use than Slack is. It’s hard to believe, but you had to pay for Skype group video calls, and Hangouts hadn’t launched yet.

We love Liverpool, and we can tell you the advantages and help you move here if that’s what you want.

If it isn’t though, and if you’ve always wanted to work for ScraperWiki as a software engineer remotely, now’s your chance.

QuickCode is the new name for ScraperWiki (the product)
Thu, 14 Jul 2016
https://blog.scraperwiki.com/2016/07/quickcode-is-the-new-name-for-scraperwiki-the-product/

Our original browser coding product, ScraperWiki, is being reborn.

We’ve found that the most popular use for QuickCode is to increase coding skills in numerate staff, while solving operational data problems.

What does that mean? I’ll give two examples.

The Department for Communities and Local Government run clubs for statisticians and economists to learn to code Python on QuickCode’s cloud version. They’re doing real projects straight away, such as creating an indicator for availability of self-build land. Read more

Office for National Statistics save time and money using a special QuickCode on-premises environment, with custom libraries to get data from spreadsheets and convert it into the ONS’s internal database format. Their data managers are learning to code simple Python scripts for the first time. Read more

Why the name change? QuickCode isn’t about just scraping any more, and it hasn’t been a wiki for a long time. The new name is to reflect its broader use for easy data science using programming.

We’re proud to see ScraperWiki grow up into an enterprise product, helping organisations get data deep into their soul.

Does your organisation want to build up coding skills, and solve thorny data problems at the same time?

Learning to code bots at ONS
Tue, 12 Jul 2016
https://blog.scraperwiki.com/2016/07/learning-to-code-bots-at-ons/

The Office for National Statistics releases over 600 national statistics every year. They came to ScraperWiki to help improve their backend processing, so they could build a more usable web interface for people to download data.

We created an on-premises environment where their numerate staff learnt a minimal amount of coding, and now write short scripts to transform data they previously didn’t have the resources to handle.

Matthew Jukes, Head of Product, Office for National Statistics said:

Who knew a little Python app spitting out CSVs could make people so happy but thank you team @ScraperWiki – great stuff 🙂

Spreadsheets

The data the team were processing was in spreadsheets laid out for people to read, with headings and sub-headings arranged around the observations.

They needed to turn them into a standard CSV format used internally at the ONS. Each spreadsheet could have tens of thousands of observations in it, each becoming a row in the output file.

We created an on-premises ScraperWiki environment for the ONS, using standard text editors and Python. Each type of spreadsheet needs one short recipe – just a few lines of Python expressing the relative relationship of headings, sub-headings and observations.
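The real recipes used ONS-specific libraries we can’t reproduce here, so this is only a hedged sketch of the idea, with a made-up grid and column names: pair each observation with the heading above it and the heading to its left, then emit flat CSV rows.

```python
import csv
import io

def flatten(grid):
    """Pair each observation with its column heading (row 0)
    and its row heading (column 0), yielding one flat row each."""
    years = grid[0][1:]                    # column headings, e.g. years
    for row in grid[1:]:
        region, observations = row[0], row[1:]
        for year, value in zip(years, observations):
            yield [region, year, value]

def to_csv(grid):
    """Write the flattened rows out as a simple CSV file."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["region", "year", "observation"])
    writer.writerows(flatten(grid))
    return buf.getvalue()

# Hypothetical human-readable layout: years across, regions down.
grid = [
    ["",           "2014", "2015"],
    ["North West", "1.2",  "1.3"],
    ["London",     "4.5",  "4.7"],
]
print(to_csv(grid))
```

A real recipe would also cope with merged cells, sub-headings and footnotes, but the core is the same: a few lines relating cell positions to one another.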

The environment included a coloured debugger for checking that headings and cells were correctly matched.

Most of the integration work involved making it easy to code scripts which could transform the data ONS had – coping with specific ways numbers are written, and outputting the correct CSV file format.

Training

As part of the deployment, we gave a week of hands-on script development training to 3 members of staff. Numerate people learning some coding is, we think, vital to improving how organisations use data.

Before the training, Darren Barnes (Open Datasets Manager) said learning to code felt like crossing a “massive chasm”.

Within a couple of hours he was writing scripts that were then used operationally.

He said it was much easier to write code than to use the data applications with complex graphical interfaces that he often has to work with.

Conclusion

Using graphical ETL software, it took two weeks for an expert consultant to make the converter for one type of spreadsheet. With staff in the business coding Python in ScraperWiki’s easy environment themselves, it takes a couple of hours.

This saves the ONS time on the initial conversion of each type of spreadsheet. When new statistics come out in later months, those spreadsheets can easily be converted again, with any problems fixed quickly and locally, saving even more time.

The ONS have made over 40 converters so far. ScraperWiki has been transformational.

Running a code club at DCLG
Wed, 08 Jun 2016
https://blog.scraperwiki.com/2016/06/running-a-code-club-at-dclg/

The Department for Communities and Local Government (DCLG) has to track activity across more than 500 local authorities and countless other agencies.

They needed a better way to handle this diversity and complexity of data, so decided to use ScraperWiki to run a club to train staff to code.

Martin Waudby, data specialist, said:

I didn’t want us to just do theory in the classroom. I came up with the idea of having teams of 4 or 5 participants, each tasked to solve a challenge based on a real business problem that we’re looking to solve.

The business problems being tackled were approved by Deputy Directors.

Phase one

The first club they ran had 3 teams, and lasted for two months so participants could continue to do their day jobs whilst finding the time to learn new skills. They were numerate people – statisticians and economists (just as in our similar project at the ONS). During that period, DCLG held support workshops, and “show and tell” sessions between teams to share how they solved problems.

As ever with data projects, lots of the work involved researching sources of data and their quality. The teams made data gathering and cleaning bots in Python using ScraperWiki’s “Code in Browser” product – an easy way to get going, without anything to install and without worrying about where to store data, or how to download it in different formats.

Here’s what two of the teams got up to…

Team Anaconda

The goal of Team Anaconda (they were all named after snakes, to keep the Python theme!) was to gather data from Local Authority (and other) sites to determine intentions relating to Council Tax levels. The business aim is to spot trends and patterns, and to pick up early on rises which don’t comply with the law.

Local news stories often talk about proposed council tax changes.

The team in the end set up a Google alert for search terms around council tax changes, and imported that into a spreadsheet. They then downloaded the content of those pages, creating an SQL table with a unique key for each article talking about changes to council tax.

They used regular expressions to find the phrases describing a percentage increase / decrease in Council Tax.
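The team’s actual expressions aren’t shown in the post; as an illustrative sketch only (the pattern and sample text below are invented, not the team’s code), a regular expression for such phrases might look like:

```python
import re

# Find phrases describing a percentage rise or fall in council tax.
PATTERN = re.compile(
    r"council tax\D{0,40}?"                 # "council tax" then a short gap
    r"(rise|increase|cut|fall|freeze)"      # direction of the change
    r"\D{0,20}?"                            # short gap before the figure
    r"(\d+(?:\.\d+)?)\s*(?:%|per ?cent)",   # the percentage itself
    re.IGNORECASE,
)

def find_changes(text):
    """Return (direction, percentage) pairs found in an article."""
    return [(m.group(1).lower(), float(m.group(2)))
            for m in PATTERN.finditer(text)]

article = ("The council confirmed a council tax rise of 1.99% from April, "
           "despite calls for a freeze.")
print(find_changes(article))   # [('rise', 1.99)]
```

Each match could then be stored against the article’s unique key in the SQL table.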

The team liked using ScraperWiki – it was easy to collaborate on scrapers there, and easier to get into SQL.

The next steps will be to restructure the data to be more useful to the end user, and improve content analysis, for example by extracting local authority names from articles.

Team Boa Constrictor

It’s Government policy to double the number of self-built homes by 2020, so this team was working on parsing sites to collect baseline evidence of the number being built.

The team wrote code to get data from PlotBrowser, a site which lists self-build land for sale, and analysed that data using R.

They made scripts to get planning application data, for example in Hounslow, although they found the data they could easily get from within the applications wasn’t enough for what they needed.

They liked ScraperWiki, especially once they understood the basics of Python.

The next step will be to automate regular data gathering from PlotBrowser, and count when plots are removed from sale.
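Counting when plots are removed from sale could be as simple as a set difference between the plot IDs seen in successive scrapes — a hedged sketch with hypothetical IDs:

```python
# Compare plot IDs from two successive scrapes of a listings site
# to find plots that have been removed from sale (IDs are made up).

def removed_plots(previous_ids, current_ids):
    """Plots present in the last scrape but missing from this one."""
    return sorted(set(previous_ids) - set(current_ids))

yesterday = {"plot-101", "plot-102", "plot-103"}
today = {"plot-102", "plot-104"}

print(removed_plots(yesterday, today))   # ['plot-101', 'plot-103']
```

Run on a schedule, this would give a running count of plots sold or withdrawn — part of the baseline evidence the team was after.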

Phase two

At the end of the competition, teams presented what they’d learnt and done to Deputy Directors. Team Boa Constrictor won!

The teams developed a better understanding of the data available, and the level of effort needed to use it. There are clear next steps to take the projects onwards.

DCLG found the code club so useful, they are running another, more ambitious one. They’re going to have 7 teams, and are extending their ScraperWiki licence so everyone can use it. A key goal of this second phase is to really explore the data that has been gathered.

We’ve found at ScraperWiki that a small amount of coding skills, learnt by numerate staff, goes a long way.

As Stephen Aldridge, Director of the Analysis and Data Directorate, says:

ScraperWiki added immense value, and was a fantastic way for team members to learn. The code club built skills at automation and a deeper understanding of data quality and value. The projects all helped us make progress at real data challenges that are important to the department.

Highlights of 3 years of making an AI newsreader
Wed, 06 Apr 2016
https://blog.scraperwiki.com/2016/04/highlights-of-3-years-of-making-an-ai-newsreader/

We’ve spent three years working on a research and commercialisation project making natural language processing software to reconstruct chains of events from news stories, representing them as linked data.

If you haven’t heard of Newsreader before, our one year in blog post is a good place to start.

We recently had our final meeting in Luxembourg. Some highlights from the three years:

Papers: The academic partners have produced a barrage of papers. This constant, iterative improvement to knowledge of Natural Language Processing techniques is a key thing that comes out of research projects like this.

Open data: As an open data fan, I like the new components that came out of the project and will be of permanent use to anyone in NLP. For example, the MEANTIME corpus of news articles in multiple languages annotated with their events, for use in training.

Open source: Likewise, as an open source fan, Newsreader’s policy was to produce open source software, and it made lots. As an example, the PIKES Knowledge Extraction Suite applies NLP tools to a text.

Exploitation: via integration into existing commercial products. All three commercial consortium members are working on this in some way (often confidentially for now). Originally at ScraperWiki, we thought it might plug into our Code in Browser product. Now our attention is more on using PDFTables with additional natural language processing.

Simple API: A key part of our work was developing the Simple API, making the underlying SPARQL database of news events accessible to hackers via a simpler REST API. This was vital for the Hackdays, and for making the technology more accessible.

Hackdays: We ran several across the course of the project (example). They were great fun, working on World Cup and automotive related news article datasets, drawing a range of people from students to businesses.

Thanks Newsreader for a great project!

Together, we improved the quality of news data extraction, began the process of assembling that into events, and made steps towards commercialisation.

Saving time with GOV.UK design standards
Thu, 04 Feb 2016
https://blog.scraperwiki.com/2016/02/draft-design-standards-save-time/

While building the Civil Service People Survey (CSPS) site, ScraperWiki had to deal with the complexities of suppressing data to avoid privacy leaks, and with making technology to process tens of millions of rows in a fraction of a second.

In this blog post I talk through specific things where the standards saved us time and increased quality.

This is useful for managers – it’s important to know some details to avoid getting too distant from how projects really work. If you’re a developer or designer who’s about to make a site for the UK Government, there are lots of practical links!

Header and footer

To style the header and the footer we used the govuk_template, via a mustache version which automatically updates itself from the original. This immediately looks good.

It also reduces maintenance. The templates are constantly updated, and every now and again we quickly update the copy of them that we include. This keeps our design up to date with the standard, fixes bugs, and ensures compatibility with new devices.

If you’d like to find out more, doing a web search for site:service.gov.uk is a great way to start exploring.

Open Data Camp 2
Tue, 13 Oct 2015
https://blog.scraperwiki.com/2015/10/open-data-camp-2/

I’m back from Open Data Camp 2, and I’m finding it difficult to make a coherent whole of it all.

Perhaps that’s in the nature of the unstructured format of an un-conference. Maybe the different stakeholders in the open data community throughout the various hierarchies have a common aim but different levers to pull: the minister with the will to make changes; the digital civil servants with great expectations but not great budgets; the hacker who tries to consume the open data in their spare time, and creates new standards and systems for data in their day job.

There seemed to be a few themes which echoed through the conference:

Skills

There’s a recognition that improving people’s skills and recognising the skills people already have is critical, whether it’s people crafting Linked Data in Microsoft Word and wondering why it doesn’t work or getting local authorities to search internally for their invisible data ninjas.

Sometimes those difficulties occur due to differences in the assumed culture for different types of data — it seems everyone working in GIS would know what is meant by an .asc file and how to process it, but this information isn’t obvious to someone fresh to the data. Is there a need for improved documentation, linked to from datasets? Or the ability to ask other people interested in the same datasets questions about interpretation and processing, in comments?

Feedback

How do you know if your data is useful to people? Blogs have a useful feature called pingback – the referencing blog sends a message to the linked blog to let them know they’ve been linked to. There was quite a bit of discussion as to whether similar functionality would be useful for data, particularly for informing people when breaking changes to the data might occur.

Also, when data sits around not being used, people don’t notice problems with it. When things break noisily and publicly — like taking down a cafeteria’s menu system — it’s a bit embarrassing, but it does get the problem fixed quickly!

Core Reference Data

One of the highlights of the weekend was a talk on the Address Wars: the financial value of addresses, the fight to monetise them and their locations, and the problems caused for the 2001 census by not being able to afford a product from the Royal Mail and Ordnance Survey, both of which were wholly government owned at the time.

It highlighted how much core reference data — lists of names and IDs of things — is critical as the glue which allows different data to be joined and understood. Apparently there are 20 different definitions of ‘Scotland’ and 13 different ways of encoding gender (almost all of which are male or female). There’s no definitive list of hospitals, and seven people claim to be in charge of the canonical list of business names and addresses. Hence there’s a big push from GDS at the moment to create single canonical registers.

But other items need standardised encodings too. The DCLG have been working on standardised reasons for why bins don’t get emptied – one of the most common interactions people have with their council. There’s a lot more work to be done across the myriad things government does, and it’s not quite clear where it should happen: councils are looking for leadership from central government, while central government wants councils to work together on it, possibly with the Local Government Association. This only gets more complicated when dealing with devolved matters or finding appropriate international standards to use.

Meeting people

I’m also really happy to have met Chris Gutteridge who was showing off some of the things he’s been working on.

Equipment.data.ac.uk brings together equipment held by various UK universities in a federated, discoverable fashion, by making use of well-known URLs that point to well-formatted data on each individual website. Each organisation stays in control of its data and is the authoritative source for it, building on having a single place to start discovering linked data about an organisation. It’s the first time I’ve actually seen linked data in the wild joining across the web like Tim Berners-Lee intended!

6 lessons from sharing humanitarian data
Tue, 13 Oct 2015
https://blog.scraperwiki.com/2015/10/6-lessons-from-sharing-humanitarian-data/

This post is a write-up of the talk I gave at Strata London in May 2015 called “Sharing humanitarian data at the United Nations”. You can find the slides on that page.

The Humanitarian Data Exchange (HDX) is an unusual data hub. It’s made by the UN, and is successfully used by agencies, NGOs, companies, Governments and academics to share data.

There are lots of data hubs which are used by one organisation to publish data, far fewer which are used by lots of organisations to share data. The HDX project did a bunch of things right. What were they?

Here are six lessons…

1) Do good design

HDX started with user needs research. This was expensive, and was immediately worth it because it stopped a large part of the project which wasn’t needed.

The user needs led to design work which has made the website seem simple and beautiful – particularly unusual for something from a large bureaucracy like the UN.

2) Build on existing software

When making a hub for sharing data, there’s no need to make something from scratch. Open Knowledge’s CKAN software is open source; this stuff is a commodity. HDX has developers who modify and improve it for the specific needs of humanitarian data.

3) Use experts

HDX is a great international team – the leader is in New York, most of the developers are in Romania, there’s a data lab in Nairobi. Crucially, they bring in specific outside expertise: frog design do the user research and design work; ScraperWiki, experts in data collaboration, provide operational management.

4) Measure the right things

HDX’s metrics are about both sides of its two-sided network. Are users who visit the site actually finding and downloading data they want? Are new organisations joining to share data? They’re avoiding “vanity metrics”, taking inspiration from tech startup concepts like “pirate metrics”.

5) Add features specific to your community

There are endless features you can add to data hubs – most add no value, and end up as a cost to maintain. HDX adds specific things valuable to its community.

For example, much humanitarian data is in “shape files”, a standard for geographical information. HDX automatically renders a beautiful map of these – essential for users who don’t have ArcGIS, and a good check for those that do.

6) Trust in the data

The early user research showed that trust in the data was vital. For this reason, not just anyone can come along and add data. New organisations have to apply, proving either that they’re known in humanitarian circles, or that they have quality data to share. Applications are checked by hand. It’s important to get this kind of balance right – being too ideologically open or closed doesn’t work.

Conclusion

The detail of how a data sharing project is run really matters. Most data in organisations gets lost, left in spreadsheets on dying file shares. We hope more businesses and Governments will build a good culture of sharing data in their industries, just as HDX is building one for humanitarian data.

This post is about the government Contracts Finder website. This site has been created with a view to helping SMEs win government business by providing a “one-stop-shop” for public sector contracts.

Government has been doing some great work transitioning their departments to GOV.UK and giving a range of online services a makeover. We’ve been involved in this work, in the first instance scraping the departmental content for GOV.UK, then making some performance dashboards for content managers on the Performance Platform.

More recently we’ve scraped the content for databases such as the Air Accident Investigation Board, and made the new Civil Service People Survey website.

As well as this we have an interest in other re-worked government services such as the Charity Commission website, data.gov.uk and the new Companies House website.

Getting back to Contracts Finder – there’s an archive site, which lists opportunities posted before 26th February 2015, and a live site, the new Contracts Finder website, which has opportunities after 26th February 2015. Central government departments and their agencies were required to advertise contracts over £10k on the old Contracts Finder website. The wider public sector could also advertise contracts there, but weren’t required to (although on the new Contracts Finder they are required to for contracts over £25k).

The confusingly named Official Journal of the European Union (OJEU) also publishes calls to tender. These are required by EU law over a certain threshold value, depending on the area of business in which they are placed. Details of these thresholds can be found here. Contracts Finder also lists opportunities over these thresholds, but it is not clear that this must be the case.

The interface of the new Contracts Finder website is OK, but there is far more flexibility to probe the data if you scrape it from the website. For the archive data this is more a case of downloading the CSV files provided, although it is worth scraping the detail pages linked from the downloads in order to get additional information, such as the supplier to which work was awarded.

The headline data published in an opportunity is the title and description, the name of the customer with contact details, the industry (a categorisation of the requirements), a contract value, and a closing date for applications.

We run the scrapers on our Platform, which makes it easy to download the data as an Excel spreadsheet or CSV, which we can then load into Tableau for analysis. Tableau allows us to make nice visualisations of the data, and to carry out our own ad hoc queries free from the constraints of the source website. There are about 15,000 entries on the new site, and about 40,000 in the archive.

The initial interest for us was just getting an overview of the data: how many contracts were available, in what price range? As an example, we looked at proposals in the range £10k–£250k in the Computer and Related Services sector. The chart below shows the number of opportunities in this range grouped by customer.
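The same kind of grouping can be sketched in plain Python — the column names and figures below are invented for illustration; the real analysis was done in Tableau over the scraped data:

```python
import csv
import io
from collections import Counter

# Hypothetical extract of scraped opportunities (column names made up).
SAMPLE = """customer,industry,value
Leeds City Council,Computer and Related Services,120000
Leeds City Council,Computer and Related Services,45000
MOD,Computer and Related Services,900000
Shrewsbury Council,Food,1000000000
"""

def count_by_customer(rows, sector, low=10_000, high=250_000):
    """Count opportunities in a sector and value range, per customer."""
    counts = Counter()
    for row in rows:
        if row["industry"] == sector and low <= float(row["value"]) <= high:
            counts[row["customer"]] += 1
    return counts

rows = csv.DictReader(io.StringIO(SAMPLE))
print(count_by_customer(rows, "Computer and Related Services"))
```

The out-of-range MOD contract and the Food-sector tender are filtered out, leaving a per-customer count like the chart’s.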

These opportunities are actually all closed. How long were opportunities open for? We can see in the histogram below. Most adverts are open for 2–4 weeks; however, a significant number have closing dates before their publication dates – it’s not clear why.
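The check behind that histogram is simple date arithmetic — a sketch with made-up rows showing how opportunities with negative open durations stand out:

```python
from datetime import date

def days_open(published, closes):
    """Number of days between publication and closing date."""
    return (closes - published).days

# Hypothetical adverts: title, publication date, closing date.
adverts = [
    ("Widget tender", date(2015, 3, 2), date(2015, 3, 23)),
    ("Odd tender",    date(2015, 4, 10), date(2015, 4, 1)),
]

for title, published, closes in adverts:
    n = days_open(published, closes)
    flag = "  <-- closes before publication!" if n < 0 else ""
    print(f"{title}: open {n} days{flag}")
```

Anything with a negative duration is one of the puzzling adverts whose closing date precedes its publication date.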

There is always fun to be found in a dataset of this size. For example, we learn that Shrewsbury Council would appear to have tendered for up to £1bn worth of fruit and vegetables (see here). With a kilogram of carrots costing less than £1, that is a lot of veg – or maybe a mis-entry in the data!

Closer to home, we discover that Liverpool Council spent £12,000 on a fax service for 2 years! There is also a collection of huge contracts for the MOD, which appears to do its contracting from Bristol.

Getting down to more practical business, we can use the data to see what opportunities we might be able to apply for. We found the best way to address this was to build a search tool in Tableau which allowed us to search and filter on multiple criteria (words in the title, description, customer name, contract size) and view the results grouped together. So it is easy, for example, to see that Leeds City Council has tendered for £13 million in Computer and Related Services, the majority of which went on a framework contract with Fujitsu Services Ltd. Or that Oracle won a contract for £6.5 million from the MOD for their services. You can see the austere interface we have made to this data below.

Do you have some data which you want exploring? Why not get in touch with us!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!