My Blog about Data

Month: May 2015

I needed to extract mean pitch values from audio recordings of human speech. I wanted to automate the process and make my analyses easy to recreate, so I wrote a couple of scripts that do it much faster than working by hand.

Here is a recipe for extracting pitch from voice recordings.

Cleaning audio files

My audio files were stereo recordings of a participant saying /a/ while hearing (near) real-time pitch shifts in their own productions. The left channel contains the shifted pitch (heard by participants) and the right channel contains the original speech productions.

The first step is to examine the audio recordings for any non-speech sounds. I used Audacity for that. Any grunts or sighs can distort the output of the scripts used in the analysis. Irrelevant parts of the audio track can be silenced (CTRL+L in Audacity). Once the audio track is cleaned, I split the channels and save them in separate wav files.
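The channel split can also be scripted instead of done by hand in Audacity. Below is a minimal Python sketch, assuming an uncompressed PCM stereo wav file. The function name split_stereo and the file names in the usage note are made up for illustration and are not part of the original scripts.

```python
import wave


def split_stereo(stereo_path, left_path, right_path):
    """Write the left and right channels of a stereo wav to two mono wavs."""
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Each stereo frame is one left sample followed by one right sample.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

For the recordings described above, a call such as split_stereo("recording.wav", "shifted.wav", "original.wav") would put the shifted pitch (left channel) and the original speech (right channel) into separate files.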

The acoustic signal used in the analysis. The highlighted part shows noise that should be removed.

Splitting continuous recordings using SFS

My pitch-extracting scripts expect each utterance to be saved in a separate wav file, so I need to split the continuous recordings. This could be done manually, but for longer recordings it’s cumbersome. Speech Filing System (SFS) has an option for splitting continuous files on silence.
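To show the general idea of splitting on silence (not how SFS implements it, and not a stand-in for its npoint setting), here is a minimal Python sketch. It assumes the samples of a mono recording are already loaded as a list of integers. The names find_utterances, threshold, and min_silence are hypothetical and chosen only for this example.

```python
def find_utterances(samples, threshold, min_silence):
    """Return (start, end) index pairs of non-silent stretches.

    A stretch ends once at least `min_silence` consecutive samples have
    an absolute amplitude below `threshold`. `end` is exclusive.
    Both parameters are illustrative, not SFS settings.
    """
    regions = []
    start = None   # index where the current utterance began
    quiet = 0      # length of the current run of quiet samples
    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # Close the utterance just before the quiet run began.
                regions.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # The recording ended during an utterance; drop trailing quiet samples.
        regions.append((start, len(samples) - quiet))
    return regions
```

Each returned pair could then be used to slice the samples and write one wav file per utterance. Suitable values for the threshold and the minimum silence length depend on the recording and would need tuning by ear or by eye.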

Manual:

1. Load a sound file

2. Create multiple annotations

Tools > Speech > Annotate > Find multiple endpoints

Specify the values of npoint. More information can be found here. You don’t need to know the exact number of utterances, but a close approximation should work.

Visualise the results of automatic annotation:

Check if the annotations are correct. If not, then tweak the npoint settings to get the effect you need.

3. Chop the files on annotations

Tools > Speech > Export > Chop signal into annotated regions

This will save the files in the sfs format, but PraatR can’t work with these files. They need to be transformed into wav.

4. Convert sfs into wav files

Load the files you want to convert, highlight them, and go to:

File > Export > Speech

Automatic:

If you don’t want to spend hours doing what I’ve just described, a simpler solution is to use the batch script that follows the steps above (plus some extras).

I am using EEGLab to process my electroencephalographic data (i.e. the brain’s electrical activity), but I wanted an interactive visualisation showing how different filter settings change my data. I prefer R to Matlab, so I decided to create a Shiny app that would do just that.

Recently I read an article (PL) about the Polish police massaging their statistics. It made me wonder what kind of data is available on their website and whether any interesting patterns could be observed.

The website offers some data, but it is badly formatted, not very recent, and can only be downloaded as a PDF :O

I didn’t feel like scraping the page, so I manually copied and pasted the data from the website and did the initial preprocessing in Excel, extracting the numbers that follow the backslash.

I decided to focus on the dataset ‘Foreign – Crime’. Surprisingly enough, crime perpetrators and victims are lumped together in one table, separated by a backslash. As if that weren’t bad enough, someone decided to split the table in two, each part with a different number of rows and some missing values (marked as ‘bd’). Victims/suspects from countries not specified in the table were aggregated in the total values (Pl: ‘RAZEM’). I intentionally omitted these values from my analyses.
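The cell format described above can also be parsed in code rather than in Excel. The following Python sketch is only an illustration: parse_cell is a hypothetical helper, the example cell layout is an assumption based on the description, and the function deliberately does not assume which side of the backslash holds suspects and which holds victims.

```python
def parse_cell(cell):
    """Split one table cell at the backslash into two counts.

    Assumption: a cell looks like '12\\5', and 'bd' marks a missing value.
    Missing values are returned as None.
    """
    def to_count(text):
        text = text.strip()
        return None if text.lower() == "bd" else int(text)

    before, separator, after = cell.partition("\\")
    if not separator:
        # Assumption: a lone 'bd' means both values are missing.
        if before.strip().lower() == "bd":
            return None, None
        raise ValueError("no backslash in cell: " + repr(cell))
    return to_count(before), to_count(after)
```

Returning None for ‘bd’ keeps missing values distinct from real zeros, which matters when the counts are summed or plotted later.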

The original(-ish) data was in wide format, but I needed to turn it into long format. I used tidyr for that:
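The tidyr code itself is not reproduced on this page. As a language-neutral illustration of the same wide-to-long reshape (the kind of operation tidyr’s gather() performs), here is a minimal Python sketch. The function wide_to_long and the column names "variable" and "value" are hypothetical, and the real table’s column names are not known from the text.

```python
def wide_to_long(rows, id_column, key_name="variable", value_name="value"):
    """Turn every non-id column of each row into its own long-format row.

    `rows` is a list of dicts, one per wide-format row. All names here are
    illustrative assumptions, not the columns of the police dataset.
    """
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                key_name: column,
                value_name: value,
            })
    return long_rows
```

A wide row with one identifier column and N measurement columns becomes N long rows, each carrying the identifier, the former column name, and the value. Long format is what most plotting tools expect when one variable is mapped to colour or facets.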

Now it’s pretty obvious which country’s citizens were the most common crime victims in Poland, at least if you focus on the raw numbers registered by the police. This dataset doesn’t include any information about the number of visitors from other countries, so it’s hard to say how likely a foreigner is to become a crime victim in Poland.

I wanted to have some interactivity and I didn’t have much time so I made a dashboard in Tableau:

Tableau is a much faster way to create static or interactive plots, but they are harder to reproduce than plots made in R.

The 2015 General Election is coming, so I decided to compare the popularity of the main UK political parties on Wikipedia.

I gathered the data about Wikipedia page views using the wikipediatrend R package. Pretty plots were made with dygraphs and ggplot2. Wikipedia article traffic was collected from 1st March 2015 to 2nd May 2015, for the English-language version of the site only.

Wikipedia interest in Labour, the Conservatives, and UKIP is closely aligned. A second tier consists of the Liberal Democrats and the SNP, which both seem to get a similar number of page views. The Greens have the lowest total volume of the main parties, with a marked peak on 1st April.

All parties experienced a boost in traffic around the time of the major TV debates (2nd, 16th, and 30th April 2015).

Page views of articles about political parties decline at weekends. The pattern is fairly consistent and can be observed for all parties.
Here is how it looks for Labour:
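The weekend dip can also be checked numerically. This is a minimal Python sketch, not the analysis behind the plots (those were made in R). It assumes the daily page views are available as a mapping from dates to counts, and the function name weekday_weekend_means is hypothetical.

```python
def weekday_weekend_means(daily_views):
    """Return (mean weekday views, mean weekend views).

    `daily_views` maps datetime.date objects to page-view counts.
    """
    weekday, weekend = [], []
    for day, views in daily_views.items():
        # date.weekday() returns 5 for Saturday and 6 for Sunday.
        (weekend if day.weekday() >= 5 else weekday).append(views)
    if not weekday or not weekend:
        raise ValueError("need at least one weekday and one weekend day")
    return sum(weekday) / len(weekday), sum(weekend) / len(weekend)
```

If the weekend mean is clearly lower than the weekday mean for every party, that supports the pattern seen in the plots. A comparison of means like this ignores trends and debate-day spikes, so it is only a rough check.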

After reading the paper by Steen et al. (2013), I decided to play with the data that was published with the article. I thought that such an interesting topic deserved interactive plots, so I created them in Tableau.