Trying to work my own way through the informational jungle

Tag Archives: R

It’s been quite an intensive period recently. First, I was taking two parallel courses at Coursera – on data analysis and on statistics. Second, Irina Radchenko and I were preparing to launch a new Russian-language data expedition under our Datadrivenjournalism.ru project, and then we were actually coordinating it for two weeks (9 – 23 December). Third, I suddenly had a huge task at work with a really tough deadline, which ruined my plans a bit, but thankfully not all of them. So here’s a brief account of how it all turned out:

I had to drop the data analysis course after its sixth week. Due to that sudden workload I couldn’t afford to do the second assignment, which was somewhat upsetting. But on the other hand, I think I’ll be able to do it later, either on my own or within the next iteration of the course (I’m almost sure it’s going to be launched again soon). Anyway, I’m glad I’ve done at least something, because it turned out to be rather helpful, especially in terms of structuring things and my mind. And yes, the previous course, Computing for Data Analysis (on R), was extremely helpful. (For those who might be interested: the next iteration of this course starts on 6 January 2014.)

On the other hand, I triumphantly completed the Statistics One course, and that’s really cool. There are contradictory reviews of this course online. Some of them claim that the course is inconsistent in terms of difficulty: sometimes too easy and even boring, sometimes too complicated. Well, after completing it, I can’t say that I’ve digested all the material provided. But now I have a better idea of what statistics is like and how it approaches data. I can also apply some data analysis techniques with the help of R, but I wouldn’t claim I completely understand the mechanisms underlying some of these operations. Next I’m going to focus on OpenIntro Statistics, which is a great textbook, and revise the material in order to pack it into my head. To wrap up this segment, I’ll add that the material covered in that course by the middle of the semester was enough to complete assignment one in the Data Analysis course.

As to the data expedition, it was luckily completed yesterday. Its organisation was considerably different from the previous experience and demanded quite a bit of advance preparation, apart from the participation itself. Although I couldn’t participate in it myself as thoroughly as I would have liked to, I still have to admit that the result somewhat exceeded my expectations. I’ll be writing about it in greater detail after I analyse the whole picture. For now I can say that the timing was horrible. So the lesson is: never launch learning projects right before Christmas or the New Year. But nonetheless there are some very inspiring results and the participants were virtually great.

Lastly, they say a MOOC on data-driven journalism provided by Datadrivenjournalism.net is going to be launched in ‘early 2014’. I’m not sure I’ll be able to afford to participate, but it might be interesting.

A week ago, I completed Computing for Data Analysis by Prof. Roger Peng at Coursera. This course was described as an introduction to the R language. Well, this might have been somewhat confusing, because it was indeed an introductory course for those who were totally new to R, but not for those who were total newbies in programming in general, which wasn’t directly mentioned in the course description. Judging by the numerous complaints on the course discussion forum, some people with no programming experience whatsoever were having a really hard time trying to figure out where to start.

On the other hand, even a very distant familiarity with programming basics in Python made things a bit more tolerable for me than they would have been had I never seen things like an IDE or a for-loop before. So for me the course was rather challenging and even frustrating at times, but to my huge surprise I was able to complete the assignments. This doesn’t mean of course that I have perfectly understood, digested and mastered all the material provided. But after the course I really feel much more confident in the R environment. What is even more important, the course helped me to map my skills, so now I know what I need to learn better, where and how I can look for help and which parts of my knowledge I can rely on. All in all, I’m glad I took this course. Thanks to Dr. Peng and his wonderful teaching assistants, who did a huge amount of work explaining the course material so that even total newbies could keep up.

By the way, I think the course is still available as an archive at Coursera. Its video lectures are also available on YouTube.

Also, I must admit, I have developed Stockholm Syndrome, I mean, begun to like R.

And I’ve filled almost two notebooks on it, because I really feel more confident when I take notes along the way.

Now, as a follow-up, I played a bit with the dataset that was used for our last assignment, which focused on regular expressions. We worked with the homicide data from the Baltimore Sun site, which provides an interactive application to navigate these data, but doesn’t provide them in a downloadable format. So Dr. Peng simply copied them from the page source and pasted them into a text file. Here it is.

For our assignment we had to write two functions. One had to count the number of victims given the cause of death. The other had to count the number of victims of a given age.
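The two functions described above could be sketched roughly as follows. This is only an illustration in Python rather than the R used in the course, and it is not the assignment solution. The record format, the sample records and the names `SAMPLE_RECORDS`, `count_by_cause` and `count_by_age` are all assumptions made up for the example; the real file contains longer lines of page-source markup.

```python
import re

# Hypothetical, simplified records: the real file holds one long line of
# page-source markup per victim; only the two fields used here are imitated.
SAMPLE_RECORDS = [
    "<dd>black male, 17 years old</dd><dd>Cause: shooting</dd>",
    "<dd>black male, 26 years old</dd><dd>Cause: Shooting</dd>",
    "<dd>white female, 41 years old</dd><dd>Cause: asphyxiation</dd>",
    "<dd>black male, 17 years old</dd><dd>Cause: stabbing</dd>",
]


def count_by_cause(records, cause):
    """Count records whose 'Cause:' field matches the given cause, ignoring case."""
    pattern = re.compile(r"Cause:\s*" + re.escape(cause) + r"\b", re.IGNORECASE)
    return sum(1 for record in records if pattern.search(record))


def count_by_age(records, age):
    """Count records that mention a victim of exactly the given age in years."""
    # The leading \b keeps age 7 from matching inside "17 years old".
    pattern = re.compile(r"\b" + str(int(age)) + r" years? old\b")
    return sum(1 for record in records if pattern.search(record))


print(count_by_cause(SAMPLE_RECORDS, "shooting"))
print(count_by_age(SAMPLE_RECORDS, 17))
```

The case-insensitive match is a guess at the kind of inconsistency a hand-copied file might contain (for example ‘Shooting’ next to ‘shooting’).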

And well, I actually found out that the most common cause of violent death in Baltimore in the period from 2007 to 2012 was shooting, and that in 1126 of the 1245 observations the victim is male, so it looks like this:

Also, the only category in which female victims prevail is asphyxiation. So speaking about preferences in killing tools given gender, this chart might be more instructive.
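A breakdown like this boils down to a tally by cause and gender. Here is a minimal sketch of one way to build it, again in Python rather than R. The records are invented purely for illustration and do not reproduce the real counts; the record format and the names `DEMO_RECORDS`, `tally_by_cause_and_gender` and `causes_where_women_prevail` are assumptions, not part of the original analysis.

```python
import re
from collections import Counter

# Invented records in a simplified, assumed format (not the real data).
DEMO_RECORDS = [
    "<dd>black male, 17 years old</dd><dd>Cause: shooting</dd>",
    "<dd>black male, 26 years old</dd><dd>Cause: Shooting</dd>",
    "<dd>white female, 41 years old</dd><dd>Cause: asphyxiation</dd>",
    "<dd>black female, 30 years old</dd><dd>Cause: asphyxiation</dd>",
    "<dd>black male, 52 years old</dd><dd>Cause: asphyxiation</dd>",
    "<dd>black female, 22 years old</dd><dd>Cause: stabbing</dd>",
    "<dd>black male, 19 years old</dd><dd>Cause: stabbing</dd>",
]

# Capture the gender word and the text that follows "Cause:".
FIELDS = re.compile(r"\b(male|female)\b.*?Cause:\s*([A-Za-z ]+)", re.IGNORECASE)


def tally_by_cause_and_gender(records):
    """Return a Counter keyed by (cause, gender); unparseable records are skipped."""
    tally = Counter()
    for record in records:
        match = FIELDS.search(record)
        if match:
            gender = match.group(1).lower()
            cause = match.group(2).strip().lower()
            tally[(cause, gender)] += 1
    return tally


def causes_where_women_prevail(tally):
    """List causes with strictly more female than male victims, sorted by name."""
    causes = {cause for cause, _ in tally}
    return sorted(
        cause for cause in causes
        if tally[(cause, "female")] > tally[(cause, "male")]
    )


tally = tally_by_cause_and_gender(DEMO_RECORDS)
print(causes_where_women_prevail(tally))
```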

Well, for more sophisticated data analysis I’ve yet to learn loads of Statistics. By the way, as to Statistics, I’m still taking Statistics One by Prof. Andrew Conway at Coursera. Although it seemed a bit boring at the beginning, now it’s getting more and more interesting.

Also I have completed the Python course at Codecademy. And immediately started a course in JavaScript. Because I like Codecademy. And because I don’t have enough time right now to focus on learning to work with APIs in Python there. Never mind that I’m currently doing Introduction to Interactive Programming in Python at Coursera. I promise I’ll quit it as soon as it becomes too challenging to combine with Statistics and Data Analysis, which starts on October 28th.

All this stuff is supposed to be completed by January. I must say, now I feel a very strong urge to get down to something a bit more fundamental, like maths and computer science basics.

A new bunch of links to resources on statistics etc. that seem helpful to me:

Introduction to Statistics

This is an archive of an introductory statistics course at Coursera, Statistics: Making Sense of Data, by Alison Gibbs and Jeffrey Rosenthal (University of Toronto).

The authors of the course kindly provided a list of recommended literature. I don’t think it would be a crime to reproduce it here. So, they recommended three ‘traditional books’:

Introduction to the Practice of Statistics, by David S. Moore and George P. McCabe. (The book is currently in its fifth edition, but any edition will do.)

Stats: Data and Models, Canadian edition, by Richard D. De Veaux, Paul F. Velleman, David E. Bock, Augustin M. Vukov, and Augustine C.M. Wong. (The original version of the book, by the first three authors only, is also recommended.)

Statistics, by David Freedman, Robert Pisani, and Roger Purves.

And three online resources:

OpenIntro Statistics, by David M. Diez, Christopher D. Barr, and Mine Cetinkaya-Rundel. The cool thing about this one is that it’s not just a book, it’s a whole learning tool including labs and some instructions on using R.

OK, here I am, back after a gap of over a month, two weeks of which were holidays and the rest a huge lump of work, including tasks for my job as well as some work at Moscow Open Data School. But now I hope I’ll be able to afford to spend some time on just learning.

Unfortunately, I couldn’t finish my Python MOOC, because of that sudden workload again. But I’m totally going to get back to it as soon as I can. Following Zach Sims’ (Codecademy) recommendation, I’m simply trying to gradually do the tasks at Codecademy to refresh stuff in my mind and to keep digesting Python.

Right now though I’m focused on the Statistics course that has just begun at Coursera (by the way, those who are interested are welcome to join). I wonder how helpful it’s going to be, but there’s one thing I know for sure: I’ve got to learn how to process data in R. And well, the R course is actually integrated into this one, which is great.

While working on the first assignment, which was actually a very simple drill exercise to memorize some R commands, I ran into a problem: I couldn’t install and load a package in R (MS Windows 7) because of some trouble with administrator access to some saving functions (although I’m obviously the administrator). Or rather, R did download the package, but it would refuse to save it in the R directory. As far as I know, some students in the same course also had trouble at this point, but their problems were different. In my case the solution was very simple. I just manually moved the necessary package from where it was saved by default to where I needed it (namely, the R library). There’s also a way to install packages from .zip files manually downloaded from CRAN, through the menu (Packages > Install package(s) from local zip files). Well, at this stage this works perfectly well for me.