4 September 2016

Thing 21 is about dirty data and some strategies and tools for fixing data issues.

Having been involved in implementing data systems at work that required data migration and feeds from other systems with transformations (e.g. building an organisation code structure in a new system from partial strings in a payroll system, and sourcing person records from two separate systems and deduplicating people who were both staff and students), the pitfalls of dirty data are quite familiar to me. The problems soon appeared during the testing phase, particularly in report generation and in business processes that relied on choosing a specific record.

One of the difficulties was individuals who had name variations between the two systems but were in fact the same person. Sometimes the only way these were found was through someone knowing that a staff member had changed their name, or had used a diminutive in their student record. This led to changes in some business processes to help match persons between the two systems.
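A rough first pass at spotting such variations can be automated with fuzzy string matching. A minimal sketch using Python's standard library (the names and the 0.7 threshold are illustrative, not from our actual systems):

```python
from difflib import SequenceMatcher

def likely_same_person(name_a, name_b, threshold=0.7):
    """Flag two name strings as a possible match for human review."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold

# Compare a staff list against a student list for candidate duplicates.
staff = ["Elizabeth Smith", "Robert Jones"]
students = ["Liz Smith", "Robert Jones", "Mary Chen"]
candidates = [(s, t) for s in staff for t in students if likely_same_person(s, t)]
```

Diminutives such as "Liz" for "Elizabeth" score lower than exact matches, so any threshold only shortlists candidates; a person still has to confirm them.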

This thing talks about using Google Spreadsheets and a scraping extension to gather tabular data from websites. In the past, when websites used `<table>` tags in the HTML, it was relatively easy to import tables directly into Excel using the method in this video. I was hoping to try it again, but could not find a suitable table to play with. (Websites mostly seem to use tables for ads now, and alternative methods for laying out tabular data.)

The feature to do this is available in Excel 2016 on the Data ribbon.
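Where neither Excel nor a spreadsheet add-on is handy, the same kind of extraction can be scripted. A minimal sketch using only Python's standard library (the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect cell text from the <table> rows in an HTML page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], [], None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

html = "<table><tr><th>Games</th><th>Gold</th></tr><tr><td>2016</td><td>8</td></tr></table>"
parser = TableExtractor()
parser.feed(html)
# parser.rows -> [['Games', 'Gold'], ['2016', '8']]
```

In practice the HTML would come from fetching the page, but the parsing step is the same.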

This is my first time trying Google Sheets for scraping data. So here is a table from the Wikipedia page on Australia at the Olympics.

Medals by Summer Games

In the Wikipedia page the column "Totals" has bold text. In the scraped data the wiki encoding for bold has been captured as asterisks surrounding each value - a prime candidate for some cleansing.
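Even without OpenRefine, this particular cleanup is one line of string handling. A quick sketch (the sample values are made up, not the actual Wikipedia figures):

```python
# Strip the stray bold markers that the scrape captured around each value.
def clean_cell(value):
    return value.strip("*").strip()

scraped_totals = ["**23**", "**164**", "** 512 **"]
cleaned = [clean_cell(v) for v in scraped_totals]
# cleaned -> ['23', '164', '512']
```

OpenRefine would do the same job with a transform on the column, which scales better once there are many columns to fix.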

I was going to have a go with OpenRefine, but it was downloaded on a different computer and I can't be bothered shifting gears to finish this on the other one.

20 August 2016

I viewed the list of apps for gathering research data suggested in thing 19.

It struck me that most of these are just online forms, which have been around for years and are often marketed for integration into websites for all sorts of purposes. I use Wufoo as a contact form on this blog; it is another that could be used to build an online form. Google Forms is another option. It really depends on whether the research requires something specific when choosing among the many options.

Specific needs for research, as opposed to many general website forms, might be:

ability to attach a file to the form data, e.g. a photo or audio recording

ability to complete the form multiple times while offline and then submit the data when internet/network access is available
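The offline requirement in particular amounts to a local queue that is flushed once a connection returns. A minimal illustration in Python (the file name and the submit callable are hypothetical, not any real form service's API):

```python
import csv
import os

QUEUE_FILE = "pending_responses.csv"  # hypothetical local store used while offline

def queue_response(fields):
    """Append a form response to a local CSV until a connection is available."""
    with open(QUEUE_FILE, "a", newline="") as f:
        csv.writer(f).writerow(fields)

def flush_queue(submit):
    """Send every queued response via the supplied submit() callable, then clear."""
    if not os.path.exists(QUEUE_FILE):
        return 0
    with open(QUEUE_FILE, newline="") as f:
        rows = list(csv.reader(f))
    for row in rows:
        submit(row)
    os.remove(QUEUE_FILE)
    return len(rows)

queue_response(["2016-08-20", "site A"])  # recorded while offline
sent = []
delivered = flush_queue(sent.append)      # later, when back online
```

A real app would also need to handle a submit that fails partway through, so queued data is not lost.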

Another valuable app for integrating web services is IFTTT. It enables users to set up a series of steps to push data, images, and social media posts to other web services. This is definitely worth a look for those doing research into social media (but should be considered for other types of research too). A 'recipe' could be created to poll Twitter and Instagram for a hashtag and then make a record in a Google spreadsheet every time a post matches the criteria.
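The filter-and-record half of such a recipe can be sketched in a few lines. The posts and hashtag below are made up and the polling step is stubbed out; this shows only the matching logic, not any real Twitter, Instagram, or IFTTT API:

```python
import csv
import io

def record_matching_posts(posts, hashtag, out):
    """Write one CSV row per post that mentions the hashtag."""
    writer = csv.writer(out)
    matched = 0
    for post in posts:
        if hashtag.lower() in post["text"].lower():
            writer.writerow([post["user"], post["text"]])
            matched += 1
    return matched

# Made-up posts standing in for the result of polling a social network.
posts = [
    {"user": "alice", "text": "Field notes, day 3 #23research"},
    {"user": "bob", "text": "Lunch photo, no hashtag"},
]
sheet = io.StringIO()  # stands in for the Google spreadsheet
hits = record_matching_posts(posts, "#23research", sheet)
```

IFTTT hides all of this behind its trigger-and-action interface, which is the appeal: no code needed for the common cases.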