
Presses have rolled (to some extent). Word of mouth is spreading and will only grow. The launch is upon us and we are ready to go.

On Friday, Chris and Jojo presented at the GC Digital Initiative’s Media Res #1. It was exciting to see TANDEM rolling with the NYCDH crowd. Not only did we make useful contacts among the other presenters, but the event gave us a chance to see how our final presentation should go. In under five minutes, Chris was able to describe the basic objectives and accomplishments, and the proto-website easily stood up to the standard of the other presentations (even as a definite pre-rollout version). Chris also got firsthand experience with which areas need more or less explaining. Optimizing 10 speaking minutes will be a challenge, but the dry run was a productive and constructive exercise. It also allowed the team to make more focused and concise presentation slides.

Thanks to tips from Tim Owens and Zach Davis, we have fixed the bug that was causing unpredictable results. Some users would experience a session crash (an unhandled error), others would go through all the steps and get back an empty result set, while still others would get results with other projects’ data inexplicably mixed in. The solution was based on the ‘aha!’ realization that the web is a stateless environment: the programmer must take steps to ensure that a session spanning multiple requests and responses remains a single logical whole. The way to do that is to use an anonymous cookie to identify the session and store the project-id in that cookie. Using the project-id, the program looks up in the database all the information it needs, including the locations of the files and folders in use for that project. There is no doubt that we have achieved our MVP, and it is a good, reliable user experience. There is a tiny bit of styling left to be done, and the program is finished as far as version 0.5.
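The fix itself lives in our code, but the principle is easy to sketch in a few lines of plain Python (the names here are illustrative, not our actual code): the anonymous session cookie carries only a project-id, and every later request uses that id to recover the right project’s state, so results never mix across concurrent users.

```python
# Illustrative sketch of session state over a stateless protocol --
# not the actual TANDEM code. Dicts stand in for the cookie jar and DB.
import uuid

cookies = {}      # session_id -> {"project_id": ...}
projects_db = {}  # project_id -> per-project state (paths, results)

def start_session():
    """First request: issue an anonymous cookie and create a project record."""
    session_id = str(uuid.uuid4())
    project_id = str(uuid.uuid4())
    projects_db[project_id] = {"upload_dir": f"/data/{project_id}", "results": []}
    cookies[session_id] = {"project_id": project_id}
    return session_id

def handle_request(session_id):
    """Every later request: recover *this* user's project via the cookie."""
    project_id = cookies[session_id]["project_id"]
    return projects_db[project_id]
```

Because each request re-derives the project from the cookie rather than from any in-memory state, two users hitting the server at the same time can never see each other’s files.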

With our corpus defined and development goals set, the team is taking a two-pronged approach to reaching the final project. While Chris and Steve focus on continuing to develop and code the working project, Kelly and Jojo have turned their attention to the work to be done with the corpus. Equally important as building TANDEM is the ability to show a proof of concept and illustrate the value of the output TANDEM generates. While duties among the team will still overlap (design work may arise for Kelly, there is outreach to be done by Jojo, and theoretical questions for Chris and Steve to weigh in on), our focus is now much more pointed on the particular pieces of achieving a functioning and valuable tool and methodology.

DEVELOPMENT:

A key milestone was reached this week when the text processing backend coding was completed. It will need to be thoroughly tested, which the team expects to complete by 3/31. Additional work requires that the program be merged with the image processing code; this integration step is targeted for completion on 3/24. The current version of the program can be found on GitHub. The repo contains a number of test data files as well as documentation. The core program is TandemText.py.

The team decided to abandon the Flask web framework in favor of Django, primarily because there is much more local support (from the Digital Fellows) for Django. We were able to switch because we did not have a significant code base built in Flask and much of the work done on Flask may transfer well to Django. Optimistically, the team should be able to get a pilot “Hello World” application running under Django on the Reclaimhosting.com server (with help from Zach Davis and Tim Owens).
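For the record, a Django “Hello World” is only a few lines. The sketch below is illustrative, not our pilot code, and the module layout and URL syntax vary somewhat by Django version:

```python
# views.py -- a single view that returns plain text
from django.http import HttpResponse

def hello(request):
    return HttpResponse("Hello, World!")

# urls.py -- map the site root to that view
# (Django 1.x syntax shown; newer releases use django.urls.path instead)
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^$', views.hello),
]
```

Getting even this much running on the Reclaimhosting.com server will confirm that the deployment pipeline works before we port the real application over.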

Finally, on the development front, the team needs to envision and plan for how we will persist data on the website. Will persistence even be seen as a valuable feature by users? If so, how will we store and secure the data? How will we handle requests to amend or edit an existing result set? These decisions are pending and likely to be addressed at the 3/24 class.

OUTREACH:

This week TANDEM has maintained its Twitter activity. Jojo is also working on reaching out to new communities while developing useful skills: she has taken on work at the Tow Center for Digital Journalism and is exploring possible applications of TANDEM there, and she was accepted to Django Girls next weekend and assigned her team. She looks forward to meeting a number of people across disciplines and fields.

Not too long ago, Oracle acquired MySQL (through its 2010 purchase of Sun Microsystems), and there was, I know, a fair amount of concern within the open source community that Oracle wouldn’t support it very well, that they might even deliberately try to kill it or convert it into a profit-making product. Perhaps these concerns have come true? Does anybody have a sense of whether the open source community is moving away from MySQL, or whether Oracle has done a good job supporting this DBMS?

I tried to think about what I would bring to each of the roles. Here is my #skillset:

Project Management: I have 30 years of experience managing software development projects. This is such an obviously good fit that I think I would prefer a role that is new to me and forces me to learn. I would gladly help out in a project management capacity.

Developer: I am a “baby” Python programmer, which is to say that I know the basics, but have little experience. I know a little R and have old experience (can you say COBOL?) developing code. I am quite tenacious at problem solving and learning new technology and have a pretty broad background at the conceptual level. I would enjoy this role.

Design/UX: In my career, I have quite a bit of experience in this area as it pertains to software usability. I have seen a project fail when it met all the requirements, but was hard to use. I am not a graphically talented person, so making a project “look beautiful” is not something I would be good at. I would be glad to play this role focusing on “ease of use”.

Outreach: I have limited ability to use social media. I am a Twitter and Facebook dabbler. I am dubious that this would be the best use of my labor.

Statistics and R

I am pursuing two unrelated paths. The first is a collaborative path with Joy, who has identified some interesting birth statistics. The file we started with was a PDF downloaded from the CDC (I believe). I used a website called zamzar.com to convert the PDF to a text file. The result was a pretty big mess, because it included a lot of text in addition to the tabular data we are interested in.

Following techniques that Micki demonstrated in her Data Visualization workshop, I used Text Wrangler to cut out a single table and gradually clean it up. I eliminated commas in numeric fields and extra spaces, inserted line feeds, etc., until I had a pretty good tab-delimited text file, which imported very cleanly into Excel. There I did some additional cleaning and saved the table as a CSV file that would work well in R. The table reads into R very cleanly, so we can perform simple statistics on it such as median, min, and max.
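For what it’s worth, the same summary statistics are just as easy to compute in Python. Here is a sketch with invented numbers (the real CDC table’s columns and values differ):

```python
import csv
import io
import statistics

# A tiny stand-in for the cleaned, tab-delimited birth-statistics table;
# the column names and values here are invented for illustration.
tsv = "year\tbirths\n2010\t4000000\n2011\t3950000\n2012\t3900000\n"

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
births = [int(r["births"]) for r in rows]

# The same median/min/max we pulled out of R.
print(statistics.median(births), min(births), max(births))
```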

Text Analysis

My other data path is working with text, specifically Dickens’ “Great Expectations”. I have used no fewer than three different tools to open some windows onto the book. First, I loaded a text file version of the book into Antconc, “…a freeware tool for carrying out corpus linguistics research and data-driven learning.” I was able to generate word counts and examine word clusters by frequency. The tool is very basic, so until I had a more specific target to search, I set Antconc aside.

At Chris’s suggestion I turned to a website called Voyant-tools.org, which quickly creates a word cloud of your text/corpus. What it does nicely is provide the ability to apply a list of stop words, which eliminates many common, frequently used words such as ‘the’ and ‘to’. Using Voyant, I was able to very quickly create a word cloud and zero in on some interesting items.
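The mechanics behind the word cloud are simple enough to sketch. Here is a toy Python version with a tiny sample stop list (Voyant’s actual stop lists are far longer):

```python
import re
from collections import Counter

# A miniature version of what Voyant does: count word frequencies
# after filtering out stop words. This stop list is a tiny sample.
STOP_WORDS = {"the", "to", "and", "a", "of", "i", "in", "was", "that", "it"}

def word_cloud_counts(text, top_n=5):
    """Return the top_n non-stop-words and their frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(top_n)

sample = "Joe said to the boy that Joe was in the forge and the boy went to Joe."
print(word_cloud_counts(sample))  # 'joe' tops the list; 'the' and 'to' are gone
```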

The most frequently mentioned character is Joe (in fact, ‘Joe’ is the most frequent word), not Pip or Miss Havisham. That discovery sent me back to Antconc to understand the contexts in which Joe appears. Other words that loom large in the word cloud and will require further investigation are ‘come’ and ‘went’ as a pair, ‘hand’ and ‘hands’, ‘little’, and ‘looked’/‘looking’.
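The keyword-in-context (KWIC) listing that Antconc produces can be approximated in a few lines of Python; a bare-bones sketch:

```python
import re

def concordance(text, keyword, width=3):
    """A bare-bones keyword-in-context listing, in the spirit of what
    Antconc produces: each hit with a few words on either side."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{tok}] {right}")
    return hits

sample = "I looked at Joe and Joe looked at me"
for line in concordance(sample, "joe"):
    print(line)
```

Run over the full novel, this would surface whether ‘Joe’ clusters with speech verbs, with Pip’s narration, or somewhere unexpected.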

Lastly, I have run the text through the Mallet topic modeler, and while I don’t know what to make of it yet, the top ten topics proposed by Mallet make fascinating reading, don’t they?

miss havisham found left day set bed making low love

made wemmick head great night life light part day dark

mr pip jaggers pocket mrs young heard wopsle coming question

boy knew herbert dear moment side air began hair father

time long face home felt give manner half replied person

back thought house make ll pumblechook herbert thing told days

joe don mind place table door returned chair hope black

hand put estella eyes asked stood gentleman sir heart london

good round hands room fire gave times turned money case

man looked biddy sister brought held provis sat aged child

At this point the exploration needs to be fueled by more pointed questions. Answering them is what will drive the research. Up until now, the tools have been leading the way as I discover what they can do and which buttons to push to make them do it.

Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before they can be consumed by any of the tools.

I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded one small one that contains the average SAT scores for each high school. I was able to bring the dataset into R, too. It’s such a simple dataset (basically just a school ID plus the scores for each section of the SAT) that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, or if I could get demographic data for each school, I could correlate SAT performance with various demographic variables.
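The joining step I have in mind would look something like this sketch (all IDs, column names, and values are invented for illustration):

```python
# Sketch of enriching the SAT table with a second dataset keyed on the
# same school ID. Everything here is made up for illustration.
sat = {
    "SCHOOL_A": {"math": 500, "reading": 480, "writing": 470},
    "SCHOOL_B": {"math": 440, "reading": 450, "writing": 440},
}
demographics = {
    "SCHOOL_A": {"borough": "Manhattan", "enrollment": 650},
    "SCHOOL_B": {"borough": "Brooklyn", "enrollment": 420},
}

# Inner join on school ID; each merged record could then feed a map
# or a correlation of scores against demographic variables.
merged = {
    school: {**scores, **demographics[school]}
    for school, scores in sat.items()
    if school in demographics
}
print(merged["SCHOOL_A"])
```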

On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.

“Day 2”

So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website, www.zamzar.com, that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it. The best way to find out is to give it a go.

I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or Antconc to see what can be done that might be interesting.
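One wrinkle with Gutenberg’s TXT files is that they wrap the book in a license header and footer. A sketch of slicing those off before analysis (the exact marker wording varies between files, so this matches only on the shared prefixes):

```python
def strip_gutenberg_boilerplate(raw):
    """Keep only the body of a Project Gutenberg plain-text file by
    slicing between the START and END marker lines. The marker wording
    varies between files, so match only the shared '*** START' and
    '*** END' prefixes."""
    start = raw.find("*** START")
    end = raw.find("*** END")
    if start != -1:
        start = raw.find("\n", start) + 1  # skip past the marker line
    else:
        start = 0
    if end == -1:
        end = len(raw)
    return raw[start:end].strip()

sample = (
    "Header text...\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK GREAT EXPECTATIONS ***\n"
    "My father's family name being Pirrip...\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK GREAT EXPECTATIONS ***\n"
    "License text...\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Doing this first keeps the license text from skewing word counts in Antconc or Voyant.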

The Python workshop last night was SRO, with members of our class occupying about a third of the seats (including some who had been on the wait list). Given the complexity of the subject and the short time, the class barely scratched the surface of Python programming, but it was a start.

For those interested in other Python learning opportunities, I would mention several. First, the workshop instructor named two books: Python for Kids by Jason Briggs (good for those who have zero prior exposure to programming) and Learn Python the Hard Way by Zed Shaw for those who have some familiarity with coding.

There are four Python classes on Lynda.com. I have not taken any of them, but I have taken other Lynda.com classes and they have all been good.

I can also recommend Coursera, which offers a Python class.

Finally, classmate Chris Vitale shared with me a link to Codecademy, which I will pass along: http://www.codecademy.com