Senate and committee data

I’m listing this here mostly because I’ve received a lot of requests for it, and indeed both are crucial parts of the legislative process. But I’m reluctant to add new scraped data sources: they mean more things that can go wrong and more ongoing maintenance (and then a dead site when I don’t have the time or will to keep up with it). That’s especially true now that Parliament plans to release XML versions of its data, which will hopefully include Senate and committee transcripts. That said, if you’re interested and willing to put in the scraping and parsing work, we should discuss.

Video

This is a fun and difficult one. You can pull House video off of the Parliament and CPAC streams, and it’s archived at Mycelium. Automatically matching that video with the statement transcript is tricky but totally possible. We have an approximate (±5 minutes) timestamp for each statement. The Question Period (QP) video almost always shows an onscreen banner with the name of the speaker. I suspect the video resolution is too low to do OCR on the name, but the name is accompanied by a color-coded party banner, and detecting that color bar and mapping it to a party name is possible. So, given a video, you could get a timestamped list of which parties spoke when. (I have rough proof-of-concept code for this somewhere.) That list could then be paired with the Hansard via some kind of approximate sequence-matching algorithm.
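To make the idea concrete, here is a minimal sketch of the last two steps: turning sampled banner colors into a party sequence, then aligning that sequence with the order of parties in the Hansard. This is not the proof-of-concept code mentioned above. The reference colors, the distance threshold, and the function names (`nearest_party`, `collapse_runs`, `align`) are all made up for illustration, and it assumes something else has already extracted an average RGB value for the banner region of each sampled frame. Python's standard `difflib.SequenceMatcher` stands in for "some kind of approximate sequence-matching algorithm".

```python
from difflib import SequenceMatcher

# Hypothetical banner colors (RGB). Real values would have to be
# sampled from actual video frames.
PARTY_COLORS = {
    "Conservative": (0, 40, 120),
    "Liberal": (200, 20, 30),
    "NDP": (240, 130, 0),
    "Bloc": (60, 170, 220),
}


def nearest_party(rgb, max_distance=100.0):
    """Return the party whose reference color is closest to rgb,
    or None if no reference color is within max_distance."""
    best, best_dist = None, max_distance
    for party, ref in PARTY_COLORS.items():
        dist = sum((a - b) ** 2 for a, b in zip(rgb, ref)) ** 0.5
        if dist < best_dist:
            best, best_dist = party, dist
    return best


def collapse_runs(samples):
    """Turn per-frame (seconds, party) samples into (start_seconds, party)
    segments. Frames with no banner (party is None) are skipped, so a
    banner that flickers off and back on stays one segment."""
    segments = []
    for seconds, party in samples:
        if party is not None and (not segments or segments[-1][1] != party):
            segments.append((seconds, party))
    return segments


def align(video_segments, hansard_parties):
    """Pair Hansard statement indexes with video start times.

    Returns a list of (statement_index, start_seconds) for every
    statement that falls inside a matching block of the two sequences.
    """
    video_parties = [party for _, party in video_segments]
    matcher = SequenceMatcher(None, video_parties, hansard_parties,
                              autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.b + k, video_segments[block.a + k][0]))
    return pairs


# Made-up frame samples: (seconds into the video, mean banner RGB).
frames = [
    (0, (198, 25, 28)),
    (5, (202, 18, 33)),
    (10, (5, 45, 118)),
    (15, (255, 255, 255)),  # no banner on screen
    (20, (238, 128, 4)),
]
segments = collapse_runs([(t, nearest_party(rgb)) for t, rgb in frames])

# Party of each Hansard statement, in order. The video missed the Bloc one.
hansard = ["Liberal", "Conservative", "Bloc", "NDP"]
matches = align(segments, hansard)
```

In a real pipeline the approximate statement timestamps could also be used to discard matches that land more than a few minutes from where the Hansard says they should be.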

Statistics

There are all manner of fun and informative statistics that could be derived from our dataset. If you know statistics (bonus points for knowing how to compute stats in Python) and can suggest things to do with the data I have, get in touch.
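As a trivial example of the kind of thing I mean, here is a sketch that summarizes statement length by party. The record layout (a list of dicts with `party` and `text` keys), the sample data, and the function name `length_stats_by_party` are assumptions for illustration, not the site's actual data model.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up sample records; the real data would come from the Hansard.
statements = [
    {"party": "Liberal", "text": "Mr. Speaker, I rise today to respond."},
    {"party": "Liberal", "text": "We will table the report."},
    {"party": "NDP", "text": "When will the minister answer the question?"},
]


def length_stats_by_party(statements):
    """Return per-party statement counts and word-count summaries."""
    word_counts = defaultdict(list)
    for statement in statements:
        word_counts[statement["party"]].append(len(statement["text"].split()))
    return {
        party: {
            "statements": len(counts),
            "mean_words": mean(counts),
            "median_words": median(counts),
        }
        for party, counts in word_counts.items()
    }


stats = length_stats_by_party(statements)
```

Anything more interesting than this is exactly where suggestions would help.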