A Data Science, NLP and personal blog by Matthew Ruttley


Overview

Zenko (“Good fox” in Japanese) is a reporting system (see code on GitHub here) I’ve created over the last couple of weeks at Mozilla. Basically my non-technical coworkers were getting so frustrated by Tableau (“what the heck is the difference between INNER JOIN and OUTER JOIN?”) that I decided to create a simple dashboard interface for them.

It’s a simple Bootstrap front-end to a database containing campaign stats for sponsored tiles. You can drill down to each tile or client/partner and pivot by things like locale, country and date.

Send the data back to the server, format it there, and then return it to the client via an iframe

Which is the best solution?

This is problematic because our sticking point is the query speed. The Redshift database is an OLAP, column-oriented database, and append-only. This means that it is insanely fast to add data to, but quite slow (often 6+ seconds) to query. Yes, it is dealing with billions of rows, so that’s excusable, but it’s not so great in terms of user experience to wait so long. The user doesn’t want to wait another 6 seconds for the analysis to rerun when they already have the data.

This sounds like it could just end up storing a lot of data on the client, but it could work quite well. In terms of security, though, I’m not sure the data should be lingering on the user’s PC unrequested.

This didn’t work out so well – in Firefox, the file is incorrectly named. In the future, I’d like to name the files according to the parameters of the analysis, e.g. <client>-<date>-<country>.xls

This is the weirdest solution, but it works! Flask is running locally, so it is actually very fast. There are no huge jQuery/JavaScript complications with file permissions, and the fact that you can manipulate the data easily on the server is nice too.

Solution 4

The process is as follows when the “Download for Excel” button is clicked:

Reference the HTML table using JavaScript and convert it to an array of arrays

Append an iframe to the DOM

Append a form with a POST action and hidden field to the iframe

Insert the table contents into the hidden field’s value

Submit the form

Let Flask receive the POST request and format the information as a CSV

Return an HTTP response with a file attachment containing the CSV

Let’s implement it

Add this function to grab the table content and put it in an array

JavaScript


function convert_table_to_array() {
    //convert the current table to a list of lists (i.e. array of arrays)
    var itable = document.getElementById("impressions_table"); //the table will always be called this in zenko
    var data = [];
    var col_count = itable.children[0].children[0].children.length; //number of cols
    for (var i = 0; i < itable.rows.length; i++) { //walk every row
        var row = [];
        for (var j = 0; j < col_count; j++) {
            row.push(itable.rows[i].cells[j].textContent); //collect each cell's text
        }
        data.push(row);
    }
    return data;
}
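That covers step 1. On the server side (steps 6 and 7), the Flask view just has to read the POSTed table out of the hidden field, turn it into CSV and send it back as an attachment. Here is a minimal sketch of what that view might look like; the route name ("/download_excel"), the field name ("table_data") and the JSON encoding of the rows are my own placeholder assumptions rather than Zenko's actual code, and the CSV writing is deliberately naive (no quoting or escaping):

import json

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/download_excel", methods=["POST"])  # placeholder route name
def download_excel():
    """Receive the table contents POSTed by the hidden iframe form
    and return them to the browser as a CSV file attachment."""
    # the hidden field is assumed to hold the array-of-arrays as JSON
    rows = json.loads(request.form["table_data"])
    # naive CSV: join cells with commas and rows with newlines
    csv_text = "\n".join(",".join(str(cell) for cell in row) for row in rows)
    return Response(
        csv_text,
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=report.csv"},
    )

The Content-Disposition header is what makes the browser treat the response as a download, and the filename could be built from the analysis parameters (client, date, country) as mentioned above.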

Web Workers allow you to run code in the background in browsers such as Firefox. This is how to build one into a Firefox Extension, which is slightly different from creating one as normal on a page. The documentation for doing this is basically non-existent, so hopefully you’ll find this useful.

Aha! The key line here is: ReferenceError: Worker is not defined. This is because Firefox Extensions use something called a ChromeWorker instead. We need to import this in main.js by pasting this at the top:

JavaScript


var {ChromeWorker} = require("chrome")

and changing the line that references the hello_world.js file to call a ChromeWorker instead:

JavaScript


//var worker = new Worker("hello_world.js"); //remove this

var worker = new ChromeWorker("hello_world.js"); //add this instead

Ok, let’s try running it again with cfx run. Wtf, another error?!

Shell


> cfx run
Using binary at '/Applications/Firefox.app/Contents/MacOS/firefox-bin'.
Using profile at '/var/folders/p1/zzdzcrrx6pq96hgsmy5xjqmh0000gp/T/tmpJJXeC4.mozrunner'.

The key line here is: Malformed script URI: hello_world.js. This cryptic error appears because Firefox can’t yet access anything in the /data/ folder. We have to use another part of the SDK to enable access to it.

Open main.js and put this at the top:

JavaScript


var self = require("sdk/self");

Now we can use the function self.data.url(). When you put a filename as the first argument, it will return a string like resource://jid1-zmowxggdley0aa-at-jetpack/test/data/whatever_file.js, which properly refers to the file in the case of extensions. Modify the worker import line as follows:

JavaScript


//let worker = new Worker("hello_world.js"); //remove this

let worker = new ChromeWorker(self.data.url("hello_world.js")); //add this

Now let’s run the extension again using cfx run:


> cfx run
Using binary at '/Applications/Firefox.app/Contents/MacOS/firefox-bin'.
Using profile at '/var/folders/p1/zzdzcrrx6pq96hgsmy5xjqmh0000gp/T/tmppvMjZp.mozrunner'.
console.log: test: Hello Matthew

Yay it works! The Worker returned the message “Hello Matthew”.

FAQ

What does this {notation} mean?

It is shorthand for:


var chrome = require("chrome")

var ChromeWorker = chrome['ChromeWorker']

Basically this means that
require("chrome") returns an Object, and we just need the value that is referenced by the key “ChromeWorker”. This is a very succinct way of extracting things from JavaScript Objects that will come in handy in the future.

Why is Worker now called ChromeWorker? Are we doing something with Google Chrome?

This is a naming coincidence and has nothing to do with Chrome the browser. “Chrome” in this case refers to Firefox add-on internals.

I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:


from multiprocessing import Pool

from pymongo import Connection

def how_many(server_number):
    """Returns how many documents in the collection"""
    c = Connection("192.168.0." + str(server_number))  # connects to remote DB
    return c.MyDB.MyCollection.count()

pool = Pool(processes=4)
servers = [1, 2, 3, 4]
result = pool.map(how_many, servers)  # map stage
pool.close()

# reduce stage
result = sum(result)
print "You have {0} docs across all MongoDB servers!".format(result)

Into a much nicer:


from multiprocessing import Value, Pool

from pymongo import Connection

def how_many(server_number):
    """Adds the collection's document count to the shared counter"""
    c = Connection("192.168.0." + str(server_number))  # connects to remote DB
    with shared_count.get_lock():  # avoid racy increments from several workers
        shared_count.value += c.MyDB.MyCollection.count()

# setup -- create the shared value before the pool so the workers inherit it
shared_count = Value("i", 0)
pool = Pool(processes=4)
servers = [1, 2, 3, 4]
result = pool.map(how_many, servers)  # map stage
pool.close()

print "You have {0} docs across all MongoDB servers!".format(shared_count.value)

Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple processes. There’s another possible shared datatype called Array, which, as it suggests, is a shared array. Note: one pitfall (that I fell for) is thinking that the "i" in Value("i",0) is the name of the variable. Actually, it’s a typecode which stands for “integer”.

There are other ways to do this, however, each of which has its own trade-offs:

1. Shared file. Advantages: easy to implement and easy to access afterwards. Disadvantages: very slow.

2. Shared MongoDB document. Advantages: easy to implement. Disadvantages: slow to constantly query for it.

3. Multiprocessing Value/Array (this example). Advantages: very fast, easy to implement. Disadvantages: works on one PC only, and can’t be accessed after the process is killed.

4. Memcached shared value. Advantages: the networked aspect is useful for big distributed databases, and a shared.set() function is already available.

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is the “ARFF” (Attribute-Relation File Format). This essentially creates a nicely structured, rich CSV file which can easily be used in Logistic Regression, Decision Trees, SVMs etc. To solve the problem of very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines such as:

f0  f1  f2  f3  …  fn
1   0   1   0   …  0

Into a more succinct version where you have a list of features and simply specify the feature’s index and value (if any):

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.
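In the actual sparse ARFF syntax, each entry in a data row is written as "index value" inside curly braces, with zero-valued features simply left out, so the row above would look roughly like:

{0 1, 2 1}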

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:


features = ['name', 'some_metric', 'an_attribute', 'some_boolean']

# Searching for the existence of a feature is O(n)
>>> 'some_metric' in features
True

# Retrieving the index of a feature in the list is also O(n)
>>> features.index('some_metric')
1

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items, so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, it disappointingly turns out not to contain any efficient way to calculate the index of a key:


from collections import OrderedDict

features = ['name', 'some_metric', 'an_attribute', 'some_boolean']

od = OrderedDict()
for f in features:
    od[f] = 0

# Searching for the existence of a feature is O(1)
>>> 'some_metric' in od
True

# Retrieving the index of a feature is still O(n) though!
>>> od.keys().index('some_metric')
1

# keys() has to create an entire list of all the keys in memory to
# retrieve the index. You could use iterkeys() to improve memory
# performance, but it's still pretty ridiculous.

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:


from collections import defaultdict

features = ['name', 'some_metric', 'an_attribute', 'some_boolean']

dd = defaultdict(lambda: len(dd))
for f in features:
    dd[f]  # no need to give it a value, it just needs to be accessed

# So now we can do lookups in O(1)
>>> 'some_metric' in dd
True

# And also get its index in O(1)
>>> dd['some_metric']
1

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:


dd = defaultdict(lambda: {'index': len(dd)})

# Then more information can be seamlessly added:
dd['some_metric']['info'] = 1
dd['some_attribute']['info'] = 2

# Whilst maintaining the O(1) lookup of the auto-generated index:
>>> dd['some_attribute']['index']
1

Limitations

This is a fun fix, but it doesn’t support full dictionary functionality: deleting items won’t reorder the index, and you can’t easily iterate through it in order. However, since creating this ARFF file requires neither deletions nor ordered iteration, that’s not a problem.
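To tie this back to the original ARFF problem, here is a minimal sketch (my own illustration, not WEKA or production code) of using the auto-indexing defaultdict to emit a sparse ARFF data row from a dict of feature values:

from collections import defaultdict

feature_index = defaultdict(lambda: len(feature_index))

def sparse_arff_row(example):
    """example is a {feature_name: value} dict; returns one sparse ARFF data line."""
    entries = []
    for name, value in example.items():
        if value:  # zero/empty values are simply omitted in sparse ARFF
            entries.append((feature_index[name], value))
    entries.sort()  # sparse ARFF entries must appear in ascending index order
    return "{" + ", ".join("{0} {1}".format(i, v) for i, v in entries) + "}"

row = sparse_arff_row({'name': 1, 'some_metric': 0, 'an_attribute': 1})
# row is now something like "{0 1, 1 1}" (indices depend on first-seen order)

Because the defaultdict hands out indices on first sight, the feature order is simply the order in which non-zero features are first encountered across the whole dataset.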

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs, and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve that it’s really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS would probably be the best example of seamless encryption, but this often fails when people either ignore, or are at a loss as to what to do about, HTTPS certificates flagged as invalid by the browser.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating oneself about this can require a fairly super-human effort and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

I’m part of an organization called Software Carpentry in NYC. It uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups of people. I previously taught at Columbia, and this past weekend it was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, as well as part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself being so without thinking about the effort and stress behind conveying a complex concept to a group of students all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last 2 SWC events:

We also had R. David Murray on hand to help alongside all the presentations, mainly with the trickier definition-oriented questions.

Things that worked well

Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.

Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at Scrabble, given the librarian-oriented audience. The dictionary brought up several amusing entries for searches like: grep ".*s.*s.*s.*s.*s.*s" words | less

Adding anecdotes to save people googling things. I reckon that a large amount of any programmer’s activity is simply finding someone who’s done what you want to do before and slightly modifying things, or connecting up the building blocks. So at the end of talking about the benefits of things like append() vs concatenating with plus signs like first + second, I mentioned things like deque() and format() (a couple of quick examples are sketched below).
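For instance, the sort of quick pointers I mean (toy examples of mine, not the actual class material):

from collections import deque

words = []
words.append("hello")                # appending to a list is cheap...
greeting = "Hello" + " " + "world"   # ...repeatedly gluing strings with + is not

sentence = "{0}, you are visitor number {1}".format("Matthew", 42)  # format() beats manual + gluing

recent = deque(maxlen=3)             # deque: fast appends/pops at both ends
for w in ["a", "b", "c", "d"]:
    recent.append(w)                 # the oldest item falls off the left automatically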

Things to remember for next time

Typing SLOWLY. I work a lot with MongoDB, so I end up typing from pymongo import Connection; c = Connection() 20+ times a day into the terminal. I can type it so fast that it can seem bewildering to newcomers.

Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

You’ll learn a lot from the instructors and the students’ questions

Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers

The on-hand analogies and anecdotes you develop while teaching come in handy when explaining technical matters to non-technical people, socially or business-wise.

You’ll meet many like-minded people, and it feels great to get involved in the community.

More about Git. I use SVN at work and thus don’t really submit anything to GitHub. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.

What “immutable” really means. I hear this thrown around quite a lot, and it basically means the object itself can’t be changed in place. E.g. myString.split() doesn’t modify myString; it returns a new list that you have to assign to another variable. Very simple.
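A toy example of the difference:

s = "to be or not to be"
words = s.split()    # split() returns a brand new list; s itself is untouched
words.append("?")    # lists are mutable, so this changes words in place
upper = s.upper()    # strings are immutable: methods always hand back a new string
# s is still "to be or not to be"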

Somewhere between an introduction and an encyclopedia, it gives fairly comprehensive overviews of each sub-field, including distinctions that I hadn’t previously thought of so clearly. The authors are mostly unafraid to explain the maths behind the subjects. It dips into some probability and linear algebra, admittedly with simplified notation. There’s no real mention of implementation (i.e. programming the examples) as one would usually expect with O’Reilly; but most competent readers will at least know what they’re “looking for”, perhaps in terms of packages to install, or if they want to try to implement a system from scratch. It is certainly written for the intelligent professional and is far from popular science.

Whilst it is very thorough and interesting, it could touch a nerve among Data Scientists: should the manager of a Data Scientist really have to read a book such as this? Surely someone in such a position of authority should know these techniques already (an extreme example is one footnote that even describes what Facebook is and what it is used for). Such unbalanced hierarchies are often the cause of unnecessary stress and complication in the workplace; but since they are common, perhaps the book will be useful in exactly that context.

I think, overall, I was hoping for a slightly different book, with more in-depth case studies of how to apply existing Data Science knowledge in business scenarios. Nevertheless, it’s an interesting, intelligent guide in an encyclopedic sense and fairly unique in its clarity of explanation and accessibility; I highly doubt I could write a better guide in that respect. Existing Data Scientists will find many clear analogies for explaining their craft to those less technical than themselves, and I reckon that by itself justifies taking a look.

One great function in Python is the ast (Abstract Syntax Tree) library’s literal_eval. This lets you read in a string version of a Python datatype:


>>> from ast import literal_eval
>>> # A sample dictionary with some trivial information
>>> myDict = "{'someKey': 1, 'otherKey': 2}"
>>> # Let's parse it using the function
>>> testEval = literal_eval(myDict)
>>> print testEval
{'someKey': 1, 'otherKey': 2}
>>> type(testEval)
<type 'dict'>

Importing a dictionary such as this is similar to parsing JSON using Python’s json.loads decoder. But it also comes with the shortcomings of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:
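The failing case isn't shown above, but it looks roughly like this; the key name is my own placeholder, and the exact exception text varies between Python versions:

>>> myDict = "{'someKey': 1, 'timestamp': datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)}"
>>> literal_eval(myDict)
Traceback (most recent call last):
  ...
ValueError: malformed node or string: <_ast.Call object at 0x...>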

So you might try to write some code to parse the dictionary datatype yourself. This gets very tricky, but eventually you could probably accommodate all common datatypes:


def read_dict(d):
    """Accepts a string containing a dictionary.
    Tries to parse this and returns the dictionary."""
    parsed_dict = {}  # The result
    # Remove the {} and split into chunks
    d = d[1:-1].split(', ')
    # iterate through the chunks and try to interpret them
    for kv_pair in d:
        # split up by the central colon
        kv_pair = kv_pair.split(": ")
        # interpret the key and value
        k = whatAmI(kv_pair[0])
        v = whatAmI(kv_pair[1])
        # add to the final parsed dictionary
        parsed_dict[k] = v
    return parsed_dict

def whatAmI(thing):
    """Simple attempt at interpreting a string
    of a datatype. Can deal with Strings and Ints"""
    # remove any inverted commas
    if thing.startswith("'") or thing.startswith('"'):
        thing = thing[1:-1]
    # Now check for data-types (there are way more than this though)
    if thing.isdigit():
        return int(thing)  # return the digit
    else:
        return thing  # return the string
    # if not recognized by either, then return an error
    return "Corrupted data"

But this still doesn’t truly fix our datetime object problem:


>>> read_dict(myDict)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 13, in read_dict
IndexError: list index out of range

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the arguments of datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime like:


>>> x = (2013, 8, 10, 21, 46, 52, 638649)
>>> parsedDate = datetime(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: an integer is required

But apparently not. The tuple must be unpacked, not with a lambda or a list comprehension, but with asterisk notation:


>>> parsedDate = datetime(*x)
>>> print parsedDate
2013-08-10 21:46:52.638649

The asterisk (*) unpacks an iterable such as x into positional arguments for the function. Simple!
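More generally (my own toy example), the single asterisk unpacks any iterable into positional arguments, and the double asterisk does the same for a dictionary of keyword arguments:

def describe(year, month, day):
    return "{0}-{1}-{2}".format(year, month, day)

args = (2013, 8, 10)
kwargs = {'year': 2013, 'month': 8, 'day': 10}

describe(*args)     # tuple unpacked into positional arguments -> "2013-8-10"
describe(**kwargs)  # dict unpacked into keyword arguments -> "2013-8-10"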