In the capstone, students will build a series of applications to retrieve, process and visualize data using Python. The projects will involve all the elements of the specialization. In the first part of the capstone, students will do some visualizations to become familiar with the technologies in use and then will pursue their own project to visualize some other data that they have or can find. Chapters 15 and 16 from the book “Python for Everybody” will serve as the backbone for the capstone. This course covers Python 3.

RG

Excellent certificate; the project was a bit too easy, but I guess its purpose was to show that something seemingly complex was the result of breaking things into simpler and more manageable pieces.

CE

Apr 20, 2017

★★★★★

Excellent, simply the best experience of online education I have had. The course is extremely complete and useful, the support of the staff of mentors is remarkable also, I am extremely grateful.

From this lesson

Spidering and Modeling Email Data

In our second required assignment, we will retrieve and process email data from the Sakai open source project. Video lectures will walk you through the process of retrieving, cleaning up, and modeling the data.

Taught By

Charles Russell Severance

Professor

Transcript

So, now we're going to do our last visualization, and it's interesting that we're kind of coming full circle: we're back to email. Instead of a few thousand lines of email, we're going to do a gigabyte of email, and you're going to spider a gigabyte. Now, actually, if you look at the ReadMe in gmane.zip, it tells you how you can get a head start by grabbing the first 675 megabytes with one statement and then filling in the details. The idea is that we have an API out there that will give us a mailing list, given a URL that we just hit over and over again, changing this little number. We're going to be pulling this raw data, then we'll have an analysis and cleanup phase, and then we're going to visualize this data in a couple of different ways.

Now, this is a lot of data, about a gigabyte, and it originally came from a place called gmane.org, and we have had problems because when too many students start using gmane.org to pull the data, we've actually kind of hurt their servers. They don't have rate limits; they're nice folks. If we hurt them, they're hurt; they're not defending themselves. Whereas Twitter and Google can defend themselves, gmane.org is just some nice people that are doing this, and so don't bug them, don't be uncool. I've got this http://mbox.dr-chuck.net that has this data, and it's on superfast servers that are cached all around the world using this thing called CloudFlare. So they're super awesome, and you can beat the heck out of dr-chuck.net and I guarantee you're not going to hurt it. You can't take it down. Good luck trying to take it down, okay, because it is a beast. So make sure that when you're testing, you use dr-chuck.net, don't use gmane.org. Even though it would work, please don't do that. I've got my own copy, and okay, enough said.

Okay. So, this is basically the data flow that's going to happen, and that is we go to this dr-chuck.net, which has got all the data. It's got an API, and the messages basically have a sequence number.
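That URL pattern can be sketched roughly like this. The path format ("start/stop") is my assumption for illustration, not the assignment's exact API; the ReadMe in gmane.zip documents the real one.

```python
# A minimal sketch of the sequence-number URL idea: the same base address
# hit over and over, changing only the little number at the end.
# The "/start/stop" path shape is an assumption, not the real API.
def message_url(base, msg_id):
    # Ask for the range [msg_id, msg_id + 1): one message at a time
    return base + str(msg_id) + '/' + str(msg_id + 1)

# Hitting the API repeatedly, changing only the number:
urls = [message_url('http://mbox.dr-chuck.net/sakai.devel/', n)
        for n in range(1, 4)]
```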
So, they're just message 1, message 2, message 3, and we know how much we've retrieved. So this program, when it starts up, it says how much is in the database, goes down, down, down, down, oh, okay, number 4. So then it calls the API to get message number 4, brings it back, and puts it in. Calls the API for message number 5, 6, 7, 8, 9, 100, 150, 200, 300, oh, crash. Again, this is a slow but restartable process, okay. So then you start it back up and it's like, oh, we're at 51. So we go 51, 52, 53, 54. If you're really going to spider this all, I think when I spidered it the first time, it took me like three days to get all of it, and so it's kind of fun, right, unless of course you're using a network connection you're paying for. Do not do that, because you're going to spend a lot of money on your network connection. If you're on an unlimited network or you're at a university that's got a good connection, then have fun. Run it for hours, watch what it does. It just grinds and grinds and grinds.

Now, what happens is, it turns out that this data is a little funky, and it's all talked about in the ReadMe, but this is like almost 10 years of data from the Sakai developer list, and people even change their email addresses, and so there's this little bit of extra patch-up code called gmodel that has some more data that configures it, and it reads all this stuff and cleans up the data. So, this ends up being really large, and if you recall from the database chapter, it's not well normalized. It's just raw; it's set up with an index; it's very raw. It's only there for spidering and making sure we can restart our spider. If you want to actually make sense of this data, we clean it up by running a process that reads this completely, wipes this out, and then writes brand new content. If you look at the size of this, this is really large and this is really small.
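The restartable loop I'm describing can be sketched roughly like this. The table layout and the function names are my illustration, not the actual gmane.py code; in the real spider, the `fetch` step would be a urllib request against the API URL.

```python
# A sketch of the restartable spidering pattern: look at how far the
# database got, then keep pulling the next message and committing it,
# so a crash loses nothing and a restart picks up where we stopped.
import sqlite3

def next_message_id(conn):
    """Find where we left off: one past the highest id already stored."""
    cur = conn.cursor()
    cur.execute('''CREATE TABLE IF NOT EXISTS Messages
                   (id INTEGER PRIMARY KEY, email TEXT)''')
    cur.execute('SELECT MAX(id) FROM Messages')
    row = cur.fetchone()
    return 1 if row[0] is None else row[0] + 1

def spider(conn, fetch, count):
    """Retrieve `count` messages starting wherever the database ends.

    `fetch(msg_id)` returns the raw text for one message; in the real
    spider it would call urllib.request.urlopen on the API URL.
    """
    cur = conn.cursor()
    start = next_message_id(conn)
    for msg_id in range(start, start + count):
        text = fetch(msg_id)
        cur.execute('INSERT INTO Messages (id, email) VALUES (?, ?)',
                    (msg_id, text))
        conn.commit()  # commit each message so a crash loses nothing
```

Because every retrieved message is committed immediately, running it again after a crash simply resumes from `next_message_id`.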
If you have the whole thing, it can take, depending on how fast your computer is, minutes to read this data because it's so big, and this is a good example of normalized data versus non-normalized data. So let's just say it takes two minutes to read this, because it reads slowly since it's not normalized. This one, on the other hand, is nicely normalized. It's using index keys and foreign keys and primary keys and all that stuff, all the stuff we taught you in the database chapter. That's here. So this is small, and if you look at the size of the file, it's got roughly the same information, but it's represented in a much more efficient way. So then this produces content.sqlite, and the rest of the things read content.sqlite, because this is the cleanup phase.

Now, what you can do is run this for a while, then blow that up, then run this, and that's fine, because every time this runs, it throws this away and rebuilds it. Maybe you look at some stuff and say, "Oh, I want to run some more," and that's okay, because now you can start this back up, and as soon as you're done with however far you went there, you stop that and then you do this again. So you do this, and it reads the whole thing and updates this whole thing. Once this data has been processed the right way, then you run gbasic.py, and it dumps out some stuff, but it's really just doing database analysis. Then, if you want to visualize it with a line graph, you run gline.py, which again loops through all the data and produces the data in gline.js, and then you can visualize this with an HTML file and d3.js. If you want to make a word cloud, you run gword.py, which loops through all that data and produces some JavaScript that is then combined with some more HTML to produce a nice word cloud. So, the ReadMe tells you all of this stuff, gets you through all this stuff, and tells you what to run, how to run it, and roughly how long it's going to take.
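The wipe-and-rebuild cleanup step works roughly like this sketch. The table names, column names, and the `rebuild` function are illustrative assumptions, not the actual gmodel.py schema; the idea it shows is the real one: throw the normalized database away each run, re-read the raw spider data, and map repeated sender strings to foreign keys.

```python
# A sketch of the cleanup phase: rebuild a small, normalized database
# from the large, raw spider database. Schema names are illustrative.
import sqlite3

def rebuild(raw_conn, clean_conn):
    rcur = raw_conn.cursor()
    ccur = clean_conn.cursor()
    # Throw the old normalized data away and start fresh every run
    ccur.execute('DROP TABLE IF EXISTS Messages')
    ccur.execute('DROP TABLE IF EXISTS Senders')
    ccur.execute('CREATE TABLE Senders (id INTEGER PRIMARY KEY, email TEXT UNIQUE)')
    ccur.execute('''CREATE TABLE Messages
                    (id INTEGER PRIMARY KEY, sender_id INTEGER, subject TEXT)''')
    for msg_id, sender, subject in rcur.execute(
            'SELECT id, sender, subject FROM Raw'):
        # Each distinct sender string is stored once; messages point at it
        ccur.execute('INSERT OR IGNORE INTO Senders (email) VALUES (?)',
                     (sender,))
        ccur.execute('SELECT id FROM Senders WHERE email = ?', (sender,))
        sender_id = ccur.fetchone()[0]
        ccur.execute('''INSERT INTO Messages (id, sender_id, subject)
                        VALUES (?, ?, ?)''', (msg_id, sender_id, subject))
    clean_conn.commit()
```

Storing each sender string once and referencing it by integer key is exactly why the normalized file ends up so much smaller than the raw one while holding roughly the same information.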
So, you can work your way through all of these things. In summary, with these three examples, we're really writing somewhat more sophisticated applications. I've given you most of the source code for these applications, but you can see what a more sophisticated application looks like, and based on these, you can probably build your own data-pulling and maybe even data visualization technique and adapt some of this stuff. So, thanks for listening, and we'll see you on the net.