After a morning spent trying to run before we could walk, the team I’m in (Mike Beardmore, Dominic Clay, Matt Holmes and I) has discussed putting together something simple.

Matt had used C# (C sharp) to pull out all the #uksnow tweets, and we plan to combine them with the Highways Agency RSS feed to build a regularly updated mash-up map.

The first step was removing all the tweets that did not contain a postcode.
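A minimal sketch of that filtering step, assuming each tweet is available as a plain string and that a postcode means either a full UK postcode or just its outward part (such as `M1` or `SW1A`). The pattern is a simplified approximation rather than the official postcode rules, and the function names are made up for illustration.

```python
import re

# Simplified UK postcode pattern: outward code (e.g. "SW1A", "M1"),
# optionally followed by the inward code (e.g. "1AA").
# This is an approximation and will accept some invalid codes.
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)(?:\s*([0-9][A-Z]{2}))?\b",
    re.IGNORECASE,
)


def find_postcode(tweet):
    """Return the first postcode-like token in the tweet, or None."""
    match = POSTCODE_RE.search(tweet)
    if match is None:
        return None
    outward, inward = match.group(1), match.group(2)
    return f"{outward} {inward}".upper() if inward else outward.upper()


def keep_postcode_tweets(tweets):
    """Drop every tweet that carries no postcode."""
    return [t for t in tweets if find_postcode(t) is not None]
```

Passing the raw list of tweets through `keep_postcode_tweets` leaves only those that can be placed on a map.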

Mike has also suggested mashing the #uksnow data with a rewritten scraper on Scraperwiki that holds details of Harvester restaurants in the UK.

However, we had an issue with the Twitter feed because too much information was coming in at once. It’s snowing, #uksnow is a popular hashtag, and the API couldn’t deal with it.

Mike took the Twitter feed for #uksnow, extracted the postcode from each tweet, took the Scraperwiki feed of Harvesters and extracted the postcodes from those, and created a data file formatted as XML so it would show up on Google Maps.
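One way the XML step might look, sketched with Python’s standard `xml.etree.ElementTree`. The `markers`/`marker` layout, the attribute names and the idea that each postcode has already been turned into a latitude and longitude are all assumptions for illustration; the post does not say which XML schema Mike actually used.

```python
import xml.etree.ElementTree as ET


def build_markers_xml(points):
    """Build an XML document with one <marker> element per point.

    `points` is an iterable of dicts with the keys "name", "postcode",
    "lat", "lng" and "kind" (for example "snow" or "restaurant").
    The element and attribute names are illustrative, not a fixed schema.
    """
    root = ET.Element("markers")
    for point in points:
        ET.SubElement(
            root,
            "marker",
            name=point["name"],
            postcode=point["postcode"],
            lat=f'{point["lat"]:.5f}',
            lng=f'{point["lng"]:.5f}',
            kind=point["kind"],
        )
    return ET.tostring(root, encoding="unicode")
```

The `kind` attribute is there so the map can draw snow reports and restaurants with different pointers.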

The plan is for a pointer showing the location of a restaurant in the snow, a snow hole, providing warmth.

Mike has managed to get it to work on his own because he’s very capable with the code, and has produced a map showing areas where heavy snow is reported and the location of the Harvester restaurants nearby.
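The matching of snow reports to nearby restaurants could be approximated by grouping both datasets on the outward part of the postcode (the `M1` in `M1 4BT`). This is a rough stand-in for real distance calculations, not a description of Mike’s code, and the field names below are hypothetical.

```python
from collections import defaultdict


def outward_code(postcode):
    """Return the outward part of a postcode, e.g. "M1 4BT" -> "M1"."""
    cleaned = postcode.strip().upper()
    if " " in cleaned:
        return cleaned.split()[0]
    # No space: a full postcode ends in a three-character inward code.
    return cleaned[:-3] if len(cleaned) > 4 else cleaned


def restaurants_near_snow(snow_postcodes, restaurants):
    """Map each snowy district to the restaurants in the same district.

    `snow_postcodes` is an iterable of postcode strings taken from tweets;
    `restaurants` is an iterable of dicts with "name" and "postcode" keys.
    """
    by_district = defaultdict(list)
    for restaurant in restaurants:
        district = outward_code(restaurant["postcode"])
        by_district[district].append(restaurant["name"])

    snowy = {outward_code(p) for p in snow_postcodes}
    return {
        district: by_district[district]
        for district in sorted(snowy)
        if district in by_district
    }
```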

The potential future for this map would be to show a wide variety of restaurants, service areas and places offering shelter to people who find themselves trapped by snow while travelling.

We were quite a large group to start with, so we’ve ended up splitting in two. One group is working on scraping details of registered care homes, and I’m in a group working on information already gathered, creating an interesting and informative visual.

Our first battle was making sure Scraperwiki could read our data so we could work with it.

First of all I uploaded it to Google Docs, but the comma-separated values (CSV) scraper didn’t like it. Then, when the spreadsheet was published as a web page, as suggested, it still wasn’t happy because it wanted to be signed in to Google.

Matt suggested putting the CSV onto his server, so I exported it and sent it over to him.
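Once the CSV sits on an ordinary web server, reading it needs nothing beyond the Python standard library. A minimal sketch, assuming the file has a header row and is UTF-8 encoded; the function names are invented for this example.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv_text(text):
    """Turn CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url):
    """Download a CSV file and parse it into a list of dicts."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    return parse_csv_text(text)
```

Calling `fetch_csv` with the address of the file (for instance a placeholder such as `http://example.com/data.csv`) returns one dict per row, keyed by the column headers.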

Francis Irving also suggested scraping What Do They Know, because it was Freedom of Information data.

After much fiddling Matt managed to pull out the raw data by popping (pulling from the top of the list) and using a Python scraper.
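For readers unfamiliar with popping: in Python, `list.pop()` removes and returns an item from a list, the last one by default or the first with `pop(0)`. The sketch below assumes the raw data is a list of rows whose first entry is a header; that layout is a guess, not something confirmed by Matt’s scraper.

```python
def split_header(rows):
    """Pop the first row off a list of rows and return (header, rest).

    Works on a copy so the caller's list is left untouched.
    """
    remaining = list(rows)
    header = remaining.pop(0)  # pop(0) removes and returns the first item
    return header, remaining
```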

It turned out the data we had was so unstructured it wasn’t possible to work with it.

Take the Gulf oil spill. You can find a list of oil fields around the UK, but it’s all in a strange lump.

He shows a piece of Python code that reads the oil field pages and turns them into structured data.
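To give a flavour of what such a scraper does, here is a sketch that turns the rows of an HTML table into Python lists using only the standard library’s `html.parser`. It is not the code shown in the talk; the class and function names are invented, and real pages usually need more cleaning than this.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._cell is not None:  # cell left open by sloppy HTML
                self._row.append("".join(self._cell).strip())
                self._cell = None
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows
```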

It’s quite simple to make a map view, and you can also write code to make more complicated views.

Scraperwiki is automatic data conversion.

Scrape internet pages, parse them, then organise, collect and model the results into a view. The scraper keeps running, so the dataset is constantly kept up to date.
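The collecting step can be sketched as a small function that saves records into SQLite, so a scheduled rerun refreshes the dataset instead of duplicating it. This uses Python’s built-in `sqlite3` rather than Scraperwiki’s own storage, and the table and column names are hypothetical.

```python
import sqlite3


def save_records(conn, records):
    """Insert or update records keyed on "id", so reruns stay idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id TEXT PRIMARY KEY, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, name, value) "
        "VALUES (:id, :name, :value)",
        records,
    )
    conn.commit()


def load_records(conn):
    """Return every stored record as a dict, ordered by id."""
    cursor = conn.execute("SELECT id, name, value FROM records ORDER BY id")
    return [dict(zip(("id", "name", "value"), row)) for row in cursor]
```

Because each record is keyed on `id`, running the scraper again overwrites rows that changed and adds any new ones.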

There are two kinds of journalism you can do with the data: you can make tools, or you can find a story.

A project in Belfast took a list of historic houses in the UK. The scraper looked through a host of websites; it was written in Python, but you can also use Ruby.
There are a multitude of visuals available. The Belfast project showed a spike in 1979, which was explained by a political sectarian issue.

Answering a question, Francis confirms you can scrape more than one website at a time.

Francis would like to see more linked data and merging datasets together.

Asked about licensing for commercial use, Francis says it’s mainly used for public data. Scraperwiki blocks scraping Facebook because it’s private data, but the code can be adjusted.

The tag may be related to a journalism or social media conference I’m interested in, or to a trending topic; it changes.

Yesterday (Friday, November 26) there was a great deal of activity on the #demo2010 tag as students started occupying more universities, and tweets were full of pictures and videos from demonstrations on Wednesday, November 24.

After I updated colleagues on which of their old unis had been taken over by students, one asked me: “How do you know this stuff and find it on Twitter?”

Then I explained how I followed the hashtag. It’s a simple way to find everything posted on a particular theme, topic or event.

26/11/2010

Jonathon Shuler has published a post exploring the News Diamond from my Model for a 21st Century Newsroom. As part of that he’s added an extra layer to the diamond showing which areas professional journalists should focus on, and which ones they should let go.

It used to be the case you had to check the division list to find out how MPs voted.

A web scraper was created to pull out the information, and that became The Public Whip, which shows how MPs voted.

You have to be a parliament nerd to understand it, even when it’s broken down.

They Work For You simplifies the information even more; it tells you something about your MP.

Bring the division information together: take a list from The Public Whip and create a summary of how each MP voted.
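Bringing division results together into a per-MP summary could look something like the sketch below. The input format (one `(mp, division, vote)` tuple per vote) and the vote labels are assumptions for illustration; the real data and scoring used by The Public Whip are more involved.

```python
from collections import Counter, defaultdict


def summarise_votes(votes):
    """Count how often each MP voted each way.

    `votes` is an iterable of (mp, division, vote) tuples, where vote is
    a label such as "aye", "no" or "absent".
    Returns {mp: {vote_label: count}}.
    """
    summary = defaultdict(Counter)
    for mp, _division, vote in votes:
        summary[mp][vote] += 1
    return {mp: dict(counts) for mp, counts in summary.items()}
```

A summary like this is the raw material for a plain-language line about an MP’s record on a given issue.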

Checking how one MP voted on the Iraq War: they voted with the majority in favour of the war on three votes, and abstained on the first and then the final three. It’s almost a deal with the electorate.

One MP asked to have “voted moderately” removed because they found it misleading. A number of MPs have complained, but the votes have been checked.

Richard Pope, founder of Scraperwiki, created the Planning Alerts.com website after the demolition of his local pub (a fine-looking establishment called The Queen).

It helps people access information from outside the immediate catchment area. He wrote lots of web scrapers, for example for different councils’ planning application systems.

Scraperwiki is like Wikipedia but for data. It’s a technical product for use when you’re not technical. You can look at different data scrapers and copy what others are doing without learning Perl or Python.

Planning Alerts is being moved over to Scraperwiki. Can tag it on Scraperwiki and find information. Can find stories and in-depth information.

Can request a dataset and have something built for you.

Francis was asked, is it legal? In the UK if it’s public data, not for sale, you can reuse it. Would take things down if asked, but it’s open stuff.

Could it be stopped? Would be ill-advised to stop people, and journalists, reading public information.

The Public Whip and They Work For You look at numerous votes.

Looking at ways to fund it such as private scrapers, or scrapers in a cocoon. Looking at white label for intranet use. There’s a market for data and developers who want to give data. Want to match developers with data. Currently funded by Channel 4. Want to remain free for the public.

Does it make people lazy? No, it’s already published but it makes it easier. Movement of people trying to get publishers of data to change. Always a need to pull out in a variety of formats.

Running Hacks and Hackers days working together finding stories and hunting around.

Former digital content and social media editor for six newspaper websites in West Sussex.
Experienced journalist and sub-editor.
Seeker of knowledge and general internet enthusiast.
My opinions are my own.