Tag Archives: TANDEM

We could not have gotten here without each and every one of us. This class has been a great challenge, and I like to think that rather than a Big Brother/Survivor style show, we have kicked no one off the island, but have come out of this challenge as a team.

Back when I was unsure of the name “TANDEM”, I made this video to get myself on board. Sometimes musical theater is all you need to bring it together.

I am sharing it here in the event that it might bring any one of you some positive spirit moving into our presentations.

The Digital GC: 2014-2015 Year-End Showcase

Please join us on May 19th 2015 for a special event at the Graduate Center showcasing the innovative and diverse digital projects initiated during the 2014-2015 academic year! Presentations will be given by: the Digital Praxis Seminar, the GC Digital Fellows, Provost’s Digital Innovation Grantees, the New Media Lab, the Interactive Technology and Pedagogy Certificate Program, the Futures Initiative, and the GC Library.

The Digital Praxis Seminar: Final Project Launches

Digital Humanities Praxis is a two-course sequence that introduces students to the landscape of digital humanities tools and methods through readings, discussion, lectures, and hands-on workshops. It culminates with students collaborating in groups over a single semester to build and launch working prototypes of Digital Humanities projects. The instructors for DH Praxis are Stephen Brier and Matthew Gold (Fall, 2014) and Amanda Hickman and Luke Waltzer (Spring, 2015).

Event hashtag: #digitalgc

Students in the Digital Humanities Praxis course at the CUNY Graduate Center will launch four new projects:

We are happy to announce that the initial version of our near-polished UI is up and functioning on http://dhtandem.com/. This development means that you can now go to the site and walk through uploading files as well as review some early versions of our documentation.

Immediate next steps for our team include replacing the text on the documentation pages with the more robust versions we have waiting in the wings while we finalize the connection of the front and back components of the app. We have been powering away at creating thorough documentation and user information for the final site. This also includes our exploration of the Mother Goose corpus, which is beginning to take shape (in part thanks to some TANDEM supporters and volunteers from the praxis class). Basically, we’re pushing our data set through various tools for discovery and analysis. These results will be incorporated into the Sample Data section on the TANDEM website, which is intended as an example of the app’s potential and as a learning tool for new users.

As we continue to work on bugs and high priority action items, such as fixing an error with zipping files that originated from a change in processing in this iteration, we are identifying areas that could use strengthening post-dhpraxis. Our functional May 19th MVP is so close we can taste it.

The zipping problem mentioned above may be related to another problem, which only happens on the server and cannot be replicated in a development environment. What appears to happen is as follows: when a user starts a new project, TANDEM builds three folders on the server. One is for the uploaded files and one is for the final output, which is subsequently zipped for download. The third folder is a staging or intermediate directory that contains files after any required pre-processing. For example, PDF files must be converted to JPG for our image analysis software to work. Likewise, the text must be extracted into TXT files via an OCR step for NLTK to be able to consume the content.

These new folders appear to be created successfully, and their locations are saved to global variables in the program. However, when it comes time to write files to the newly created folders, it seems that the files are being written to a previously used set of folders. The problem is intermittent. To make diagnosis more difficult, the zip step sometimes zips the older folder, which delivers content from multiple projects to the user. Other times the zip step zips the new folder, which is empty, delivering an empty file to the user. At still other times, the files are all read and written properly.
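One way to rule out stale shared state as a cause is to stop keeping folder locations in global variables and pass each project's paths around explicitly. The following is a minimal sketch of that idea, not the TANDEM code: `ProjectDirs`, `create_project_dirs`, `zip_output`, and the folder names are all hypothetical, and whether the globals are actually the culprit is an assumption on our part.

```python
import tempfile
import zipfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProjectDirs:
    """The three folders belonging to one project (hypothetical names)."""
    uploads: Path
    staging: Path
    output: Path


def create_project_dirs(base):
    """Create a uniquely named project folder holding the three subfolders."""
    root = Path(tempfile.mkdtemp(prefix="project_", dir=str(base)))
    dirs = ProjectDirs(root / "uploads", root / "staging", root / "output")
    for d in (dirs.uploads, dirs.staging, dirs.output):
        d.mkdir()
    return dirs


def zip_output(dirs):
    """Zip only this project's output folder and return the archive path."""
    archive = dirs.output.parent / "results.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(dirs.output.rglob("*")):
            if f.is_file():
                # Store paths relative to the output folder, not the server root.
                zf.write(f, f.relative_to(dirs.output))
    return archive
```

Because every project gets its own uniquely named folder and the zip step receives that folder as an argument, one project's run has no way to pick up another project's paths.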

Zipping issues aside, we are moving along. Given all the amazing progress we have made, it is not surprising that buzz for the launch is growing. (Also, Jojo invites anyone and everyone she speaks to.) With new details regarding presentations, we are ready to get this party started. The DH community at CUNY and in New York has been a part of these projects, whether actively or abstractly, and it seems a grand opportunity to celebrate.

WEEK 12 TANDEM PROJECT UPDATE:

This has been a week of accelerated achievement on all fronts for TANDEM. Thanks to Steve, we have a working MVP hosted on www.dhtandem.com/tandem. Further, we have also made huge strides on the front end with Kelly’s robust initial set of HTML/CSS pages for the site. While the two ends are not tied together just yet, they are within sight as of this weekend. Jojo continues to surprise the group with her intuitive mix of outreach and awesome, having sent out personalized invitations to key members in our contact list and people who have shown interest in the past few months. Keep reading for more detailed information about these and other developments.

OUTREACH:

Continuing to garner community support, Jojo attended a GC Digital Initiatives event Tuesday as well as the English department’s Friday Forum. Additionally, initial invites for the launch went out to the digital fellows and DH Praxis friends and family via paperless post. Digital Fellow Ex Officio Micki Kaufman has already replied that she wouldn’t miss it. I’m now working to organize outreach with the other teams.

The press release is coming along on the class wiki, too!!

Corpus:

With functionality ironed out, we continue to work with the dataset we have generated via TANDEM for the Mother Goose corpus. As part of our release, we will include work that we have done in both analysis and data visualization for the initial test corpus. If you have questions or points of interest in Mother Goose feel free to comment them below! We are interested in hearing the kinds of questions one might ask of a text/image corpus.

The code merge was completed and tested on two local machines and uploaded to the server at Reclaimhosting.com. According to Tim Owens at Reclaim, the necessary Python packages were loaded on the server, but the code cannot find three of them, so, as of this date, the code has not been run (Note: running this code on the server is an interim step to verify that the core logic of the text analysis and image analysis works properly). However, the server was built out so that the demonstration Django application launches successfully. Unfortunately, once it launches, some of the pages cause errors as does any attempt to write to the database. Our subject matter expert has been contacted to help debug these errors.
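A quick way to see which packages the server's interpreter can actually locate is to ask Python directly. This is only a diagnostic sketch: the package names below are assumptions for illustration (the three missing packages are not named above), and `missing_packages` is a hypothetical helper.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Assumed import names; substitute whatever the code really imports.
    required = ["nltk", "cv2", "pytesseract"]
    print("Interpreter:", sys.executable)
    print("Missing:", missing_packages(required))
```

It matters that this runs under the same interpreter the web server uses, since packages installed for a different Python would show up as missing even though they exist on the machine.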

On a separate development path, multiple members of the team are working on building the Django components we need to turn the analytics engine into an interactive web application. Steve is working on linking the core program to a template or view. Chris, Kelly and Jojo are working on designing and building the templates in a Django framework. Current UI/UX concerns involve potential upload sizes combined with processing time, button prompts that launch the analysis, and ways to convey best practice documentation so that it’s clear and concise and facilitates proactive troubleshooting. The next part of this process will be to address the presentation of the final page, where the user is prompted to download their file. This page has great potential to be underwhelming, but there are some simple features we can apply to jazz it up, such as data visualization examples and external links to next-step options.

On the outreach front, Jojo went to a Django hacknight Wednesday to get a handle on people building Django apps. She made contact with several new advocates in addition to garnering further support from Django Girls participants, web developers Nicole Dominguez and Jeri Rosenblum, as well as hacknight organizer Geoff Sechter. The new contacts include Michel Biezunski, who has used Django to upload and redistribute files for his app InstantPhotoAlbum. He could help when we work out our options for storing users’ data and giving it back.

Last but not least, Chris attended a meetup at DaniPad NYC Tech Coworking space in Queens, NY this past week. There, he met a handful of Python developers who had insight into working with Django-based web apps. The group brainstormed commercial uses for TANDEM-like tools, and people responded with interest in testing a prototype. Along with academic beta-testers, some of these people will be included in the contact list when TANDEM is deployed.

TANDEM 0.5 will be moving from its heavy development phase into a testing and forward-facing design phase this week. At the time of this posting, Steve and Chris are still working out the specifics of functioning unified code, but testing of the independent scripts has begun with a certain degree of success. Text and image values are easily generated via independent processes.

This week we also discussed the idea of data persistence in some depth. Simply put, would someone be able to access the data they generated at a later date via the TANDEM UI? At this iteration of the software, we agree that this is a valuable component, but not an essential feature for an MVP. That said, we are thinking about both the code needed to run it and the user-specific UI that would accompany such an application.

DEVELOPMENT:

We are working away at unifying TANDEM’s independently functioning image and text codebases. We are aligning the code in a single python script file. We vetted an idea to have two scripts, one for image and one for text, run simultaneously. The decision was made that for this first iteration of TANDEM, a single .py will suffice, and in fact may be more maintainable and more efficient.

The code merge has been slow due to Python versioning issues, which led to the code producing different results on different machines.

A call is scheduled for Tuesday with Tim at Reclaimhosting to work on configuring the server to run Django. Meanwhile the developer is working through the very thorough Django tutorial and also beginning to define appropriate class objects for a potential future version.

DESIGN:

Immediately following the code merge, we plan to begin implementing our user interface. A full size mockup of this is still under version control as we explore new grounds with user-specific views and the possibility of in-browser table views of the .csv data that is generated.

Additionally, Jojo spoke with Grant Wythoff, who reasserted TANDEM’s relevance to Bill Gleason’s project at the Cotsen Library. Jojo will reach out to Professor Gleason again this week. Grant also recommended we contact Natalie Houston at University of Houston regarding her Digital Victorian project on the visuality of poetry.

Jojo also attended the DjangoGirlsNYC event at the Stack Exchange. In addition to familiarizing herself with the framework in which TANDEM will eventually operate, she made useful contact with other django developers working in NYC.

Development

On the image processing side of things, Chris has identified the syntax for generating our key values. Now we are working toward stitching the pieces together in a way that makes sense for our output. The bare minimum of computer vision is accessible via OpenCV, and while the possibilities are tantalizing, we have continued to keep a direct focus on the key pieces we need to access for the MVP. TANDEM is still on track.
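To give a sense of what a "key value" for an image might be, here is a small sketch that averages brightness and saturation over a page's pixels. The choice of these two values is an assumption (they come up in our user stories, not in a final spec), and the sketch uses Python's standard `colorsys` module on plain RGB tuples rather than OpenCV, so it shows the idea rather than our actual image code. The function name is hypothetical.

```python
import colorsys


def brightness_and_saturation(pixels):
    """Return (mean brightness, mean saturation), each in the range 0..1.

    `pixels` is an iterable of (r, g, b) tuples with channels in 0..255.
    Brightness here means the V channel of HSV; saturation is the S channel.
    """
    total_v = 0.0
    total_s = 0.0
    count = 0
    for r, g, b in pixels:
        _h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total_s += s
        total_v += v
        count += 1
    if count == 0:
        raise ValueError("no pixels supplied")
    return total_v / count, total_s / count
```

For a page that is half pure red and half black, this gives a brightness of 0.5 and a saturation of 0.5.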

We have also begun to reevaluate our progress. To do so, we created a new list of dev tasks that range from bite-sized to larger steps so we can visualize how much further we have to go. Steve has been doing a great job of keeping track of progress and using git for version control of his scripts.

In addition we successfully implemented a routine to convert PDF to TXT. Input files are screened by type. If they are JPG, PNG or TIFF, they are passed to Tesseract for OCR processing. If they are PDF they are passed to a PDFMiner routine that extracts the text. In each case the program writes TXT files to “nltk_data/corpora/ocrout_corpus” with a name that matches the first order name of the input file. The latest version of the backend code is here: https://github.com/sreal19/Tandem
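The screening step can be sketched roughly as follows. This is not the code in the repository: the function names are hypothetical, the actual Tesseract and PDFMiner calls are left as caller-supplied functions, and we assume "first order name" means the file name without its extension.

```python
from pathlib import Path

OCR_TYPES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
PDF_TYPES = {".pdf"}


def choose_extractor(path):
    """Return "ocr", "pdf", or None depending on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_TYPES:
        return "ocr"
    if suffix in PDF_TYPES:
        return "pdf"
    return None


def output_name(path):
    """Name the TXT file after the input file, minus its extension."""
    return Path(path).stem + ".txt"


def extract_all(paths, out_dir, ocr_image, pdf_to_text):
    """Route each file to the right extractor and write one TXT per input.

    `ocr_image` and `pdf_to_text` are callables taking a path and returning
    text; they stand in for the Tesseract and PDFMiner routines.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, skipped = [], []
    for p in paths:
        kind = choose_extractor(p)
        if kind is None:
            skipped.append(p)  # unsupported type: report it, don't guess
            continue
        text = ocr_image(p) if kind == "ocr" else pdf_to_text(p)
        target = out_dir / output_name(p)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, skipped
```

Keeping a list of skipped files makes it possible to tell the user which inputs were not processed instead of silently dropping them.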

Web functionality remains problematic. Most effort this week has been merely trying to get through the Flask tutorial.

To end on a positive note, developmentally, good progress has been made with Text Analysis processing. We are computing the word count and average word length for a single page. The program also creates a complete list of words for each input file. In the very near future work will be completed to create a list of unique words and the count of each. The team must make a decision about whether to strip punctuation from the analysis, since many of the OCR errors are rendered as punctuation.
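The punctuation question is easy to try out on a small scale. Below is a minimal sketch using only the standard library rather than NLTK; `page_stats` is a hypothetical name, and splitting on whitespace is a simplifying assumption, not how our tokenizer necessarily works.

```python
import string
from collections import Counter


def page_stats(text, strip_punctuation=True):
    """Compute word count, average word length, and per-word counts."""
    words = text.split()
    if strip_punctuation:
        # Trim ASCII punctuation from both ends of each token, then drop
        # tokens that were nothing but punctuation (common OCR noise).
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    words = [w.lower() for w in words]
    total = len(words)
    average = sum(len(w) for w in words) / total if total else 0.0
    return {
        "word_count": total,
        "average_word_length": average,
        "unique_words": Counter(words),
    }
```

Running it on the same page with and without `strip_punctuation` shows how much stray punctuation inflates the counts. Note that `string.punctuation` only covers ASCII characters, so curly quotes and similar marks would need separate handling.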

Design/UI/UX

We’ve been working to identify the ideal UX functionalities for javascript. Most of this was fairly straight-forward, such as giving the user the ability to browse local folders & view a progress bar of the upload/analysis. It has been difficult to locate a script to produce error messages. Searching for anything involving “error” in the name retrieves a different type of request, and “progress” only gets to half of the need.

For instance, we had discussed letting users identify upload/analysis errors by file, either with a prompt on the final screen or with indicator text in the CSV output. Such a feature would let the user go back and fix the error for one file, versus having to comb through the entire corpus and re-upload it. An example of how this would look would be something like this, with text & visual cues that indicate which file needs review:
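For the CSV option, one possibility is a status column per file. The sketch below is only an illustration under that assumption: the column names (`file`, `status`, `message`) and both function names are hypothetical, not a settled output format.

```python
import csv
import io


def write_report(results, stream):
    """Write one row per file; `results` is a list of (filename, error) pairs.

    `error` is None when the file was processed cleanly, else a message.
    """
    writer = csv.DictWriter(stream, fieldnames=["file", "status", "message"])
    writer.writeheader()
    for name, error in results:
        writer.writerow({
            "file": name,
            "status": "error" if error else "ok",
            "message": error or "",
        })


def files_needing_review(csv_text):
    """Return the names of files whose status column says "error"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["file"] for row in reader if row["status"] == "error"]
```

With a column like this, the final screen could list just the files from `files_needing_review`, and the user would only need to fix and re-upload those.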

There is some documentation on JavaScript progress events and errors, but we need to discuss how it could be employed for TANDEM, and whether it’s necessary for the 0.5 version.

Outreach

Twitter continues to be the primary platform for outreach. While #picturebookshare continues to chime away, we are also now using it to generate research ideas for potential TANDEM users. Fun distant futures for TANDEM might involve the visual trajectories of various aspects of books: visuality of covers or book spines, as well as the visual history of education materials.

Jojo spoke with Carrie Hintz, who is starting a Childhood Studies track via the English Department, to see if she knew anyone studying illustrated books at the GC. She has no leads yet, but said come the fall she would have a better idea of people interested in TANDEM. Meanwhile, Long LeKhac, an English PhD at Stanford, was giving her a sense of the DH scene there and said he would ask around the DH community beyond Moretti’s lab. Jojo is in the process of devising outreach to text studies experts (Kathleen Fitzpatrick at MLA, Steve Jones) and folks in journalism (Nick Diakopoulos, NICAR and Jonathan Stray), per Amanda Hickman’s suggestion. Keep on keeping on, and keep the tweets t(w)eeming.

Team TANDEM is working fast and furiously on all fronts. We’ve hit a few snags but all told, we feel like we’ve got a handhold on the mountains we’re climbing. Here’s a brief overview of the ups and downs of the week:

Our hope that we might springboard off Lev’s tool proved to be something of a castle in the air. Lev’s feature extractor was coded in a day. When his team tried to run it again later, they couldn’t. Lev suggested we use OpenCV instead.

OpenCV seems to be a massive and constantly shifting morass of dependencies.

Jojo attended a couple of talks from Franco Moretti and spoke with him afterwards to see if anyone at Stanford was doing anything similar. While he acknowledged the validity of studying text as image, he seems to show no further interest. Bummer, but his loss.

The Details

In development we’ve got a working program for OCR and NLTK (Go Steve!), and we’re making strides in OpenCV (Go Chris!). Lev suggested that we have two different types of picture books for our test corpus — one that’s rich in color, another that’s more gray-scale with more text. These corpus variations will show the range of data values available to future users of TANDEM. Kelly’s working on scanning an initial test corpus now.

We also have our hosting set up with Reclaim thanks to Tim, as well as a forwarding email domain. Go ahead and send us an email to dhtandem@gmail.com.

In design/UI/UX Kelly has been working on variations of a brand identity, for color schemes, logo, web design elements… all of it (Go Kelly!). The UI is primed to go now that we have hosting for TANDEM. Kelly is currently working on identifying the code for the specific UI elements desired, for an “ideal world” situation. The next steps for design/UI/UX are to pick a final brand image, and apply it to all our outreach initiatives.

In outreach, TANDEM had a good meeting with Lev on Wednesday. He seems to think we’re doing something other DHers aren’t quite doing. We’re not yet convinced that it’s not just because it’s crazy hard. Either way, we’re up for it. Otherwise, Jojo has been working it hard (Go Jojo!) on all outreach fronts. This week we received interest from Dr. Bill Gleason at Cotsen Children’s Library at Princeton, where they’re working on ABC book digitization and seem especially interested in our image analysis. This response is proof of relevance in the field.

In regards to social media, we now have a proper Twitter handle, which we will admit happened in the middle of last week’s class thanks to some pressure from Digital HUAC already having one. You can follow us @dhTANDEM. More on Twitter: TANDEM had a couple of really useful retweets (hurray Alex Gil, massively connected Columbia DHer!) that generated some traffic on our website (Jetpack has us at 138 views so far, which is not a ton, but it’s a start!) and has won us some good DH followers: @NYCDH, @trameproject. We’ve transferred #picturebookshare to the @dhTANDEM account, and we are inviting our followers to participate, as well as to use it as a means to suggest additional items for our test corpus.

THE MINIMUM VIABLE PRODUCT (MVP)

MVP version #1

Because TANDEM is leveraging tools that already exist, one very basic minimum deliverable is that TANDEM makes OCR, NLTK, and OpenCV easy to use. Moreover, if TANDEM itself is not easy to use, there is no inherent advantage in using TANDEM over simply installing the existing tools and running them.

TANDEM as this minimum deliverable would solve the issue of having these tools in a web based environment, relieving users of the laborious headache of installing the component elements. Even after installing the component elements a user would likely have to write code to obtain the required output. TANDEM will shield the user from that need to be a programmer. At this minimum deliverable, TANDEM has not wrapped the three together into a single output.

MVP version #2

A second, more advanced minimum viable product would be to have a website on which a person could upload high resolution TIFF files, press a “run TANDEM” button, and receive a .CSV document containing the core (minimum) output.

The minimum output will consist of six NLTK values (average word length, word count, unique word count, word frequency (excluding stop words), bi-grams and tri-grams) and three image statistics for each input page provided by the user. We hope to expand the range of file types that we can support and to improve the quality of our OCR output, as well as build more elaborate modules for feature detection in both text and illustrations. However, we contend that demonstrating the comparative values of a couple of corpora of picture books will prove that there is relevant information to be found across corpora with heavy image content.
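To make the six text values concrete, here is a rough sketch of how they could be computed for one page from an already tokenized word list. It uses only the standard library in place of NLTK, the stop-word list is a tiny made-up stand-in for a real one, and `text_values` and `ngrams` are hypothetical names.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def text_values(words):
    """Compute the six text values for one page of lowercased words."""
    content_words = [w for w in words if w not in STOP_WORDS]
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "average_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        "word_frequency": Counter(content_words),
        "bigrams": ngrams(words, 2),
        "trigrams": ngrams(words, 3),
    }
```

For the five words of "the cat and the fiddle", this yields four bi-grams and three tri-grams, and the frequency table contains only "cat" and "fiddle" once the stop words are dropped.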

MVP #3

We are shooting for a single-featured MVP. A user comes to www.dhTANDEM.com and uploads a folder of image files following the printed instructions on the screen. They are prompted to hit an “analyze” button. After a few moments, a downloadable file is generated containing OCR-ed text, key data points from the OCR-ed text, and key feature descriptors from the overall image. This is purely for an early adopter looking to generate some useful data so that they can continue working on their story and/or data visualization.

Concerns

Overall things are going swimmingly. But of course there are concerns. This week’s concerns include:

Can we get the OpenCV to do what we need in the time we have available? This seems to be the element that people really want — the visuality of illustrated print.

Will we be able to scale the project to process the number of pages we would need for users to get the results that would prove TANDEM’s value?

These are, of course, huge questions. But to put it all in perspective: Stephen Zwiebel told Kelly this week that DH Box was held together “by tape” at the time of the final project presentations, and that it has had a lot more time in the past 9 months to become stable. Not to say that we aren’t looking to have a (minimally viable) product come May, but it’s a good feeling to know where other groups were last year. Should we be sharing that widely with the class? Well, we just did. 🙂

Chris is a Data Analyst for the Advertising department of XYZ Publishing. He has the banner ads from this year’s holiday campaign. He is interested in analyzing what generated the highest click-through rates for the company. Chris has previously downloaded and installed TANDEM on his desktop. Chris drags and drops his folder of ads onto the TANDEM interface. A progress bar appears. A .csv file is generated in the backend to store the output. The completion page gives Chris a downloadable CSV. Chris is directed to brief guides on how the data could possibly be used/visualized. Chris goes the basic route and opens Excel to explore his data. He compares the data to the click-through rates in the ad server and notices a trend in the relationship between brightness and saturation, along with the number of words on the advertisement, and how many users clicked the ad. The brightest ads with 10 words or fewer had the highest click-through rates. Chris is able to make a data-driven argument with the design team for brighter ads with minimal text in future campaigns.

Scholar Case:

Professor Plum is studying how advertising strategies have been affected by a significant historical event such as World War I. He has collected a corpus of print advertising materials spanning multiple product categories both before and after the event which is being studied. Plum wants to know what has changed and has developed theories regarding a number of features among which are the following questions:

Has the proportion of text to image changed? How?

Has the word usage changed? How?

Has the iconography changed? How?

How has the visual style changed? Are different colors being used? Are the images more contrasty?

Using a tool outside of TANDEM, Professor Plum scans the materials into a digital format such as JPG, TIFF, PDF or GIF. After the image files have been built, he downloads a copy of TANDEM from the Internet and installs it on his desktop computer. Plum launches TANDEM and starts the analysis process by inputting the name of the folder that contains the electronic documents being studied. TANDEM outputs OCR, NLTK and FeatureExtractor data into a database, which can be saved.

Professor Plum can now use TANDEM (or some other visualization tool) to produce visualizations or tables on the parameters that are of particular interest to the scholar. Based on the results of these visualizations, Plum may make some adjustments to the settings in TANDEM to produce a more useful result. He may choose to export the results database to another application for further work or study.

Educator Case:

An early childhood educator, Yasya Berezovskiy, wants to study the effects of children’s literature on neurological development, exploring factors such as narrative, image representations, and lexiles (or word complexity/reading level) together. To date, Berezovskiy has worked with empirical evidence and collected fieldwork data.

Berezovskiy will be analyzing a number of children’s books that vary by factors such as author collection, time published, and theme.

Using TANDEM Berezovskiy can upload page images or entire works to process the work’s text in comparison to the visual information. Once complete, Berezovskiy can visualize the processed files in split screen, with the original image beside the visualized data. From there, Berezovskiy can choose to isolate individual elements to analyze, such as opacity, density, text to image ratio, text to color ratio, shape to text ratio, and more. Alternately, Berezovskiy can download the raw processed data to analyze using a separate visualization program.

The processed data will be complementary to other observational research being done by Berezovskiy’s colleagues. Without TANDEM, the evidence from the children’s books would have been only descriptive. Further, without TANDEM it would have taken Berezovskiy multiple programs and more effort.

Fairy Tale Nerd Case:

The user, a woman interested in creating a datavisualization for a pop lit site like Toast.net — let’s say Ella, wants to look at Victorian illustrated fairy tale collections. Ella wants to analyze captions for art plates in all available published works. She wants a computer to process all available picture books to give her more information on the content of a work based on its visual properties as well as its textual content. She wants to get a computer to pull all the words included in the illustrations, as well as the ratio of those words in relation to what is written in the story (Are they direct quotes? Are they distinct?). She goes to the TANDEM interface. There, she sees a simple description of what files the application will yield. It’s so understandable! All the fields are so well explained! She clicks the upload button, finds the files on her computer, uploads the picture book scans, and runs the application. Once the TANDEM program has run, another window appears offering a number of file types. Each file type has a scroll over description of its applications and recommended datavis links. Once she has selected, she can download the data file (CSV or …. …..).

Ella takes it to her favorite datavis site and goes wild with joy at the new capabilities and bases for comparison. All her dreams have been answered. Thanks, TANDEM!