Thursday, June 26, 2014

(Note: This is continuing a series of posts about visualizations
created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course (then CS 795/895, now CS 725/825) since Fall 2011. Each semester, I assign an open-ended final project that asks students to create an interactive visualization of something they find interesting. Here's an example of the project assignment. In this series of blog posts, I want to highlight a few of the projects from each course offering. Some of these projects are still active and available for use, while others became inactive after their creators graduated.

The following projects are from the Fall 2011 semester. Both Sawood and Corren are PhD students in our group. Another nice project from this semester was done by our PhD student Yasmin AlNoamany and MS alum Kalpesh Padia. The project led directly to Kalpesh's MS Thesis, which has its own blog post. (All class projects are listed in my InfoVis Gallery.)

K-12 Archive Explorer
Created by Sawood Alam and Chinmay Lokesh

The K-12 Web Archiving Program was developed for high schools in partnership with the Archive-It team at the Internet Archive and the Library of Congress. The program has been active since 2008 and allows students to capture web content to create collections that are archived for future generations. The visualization helps to aggregate this vast collection of information. The explorer (currently available at http://k12arch.herokuapp.com/) provides users with a single interface for fast exploration and visualization of the K-12 archive collections.

The video below provides a demo of the tool.

We Feel Fine: Visualizing the Psychological Valence of Emotions
Created by Corren McCoy and Elliot Peay

This work was inspired by the "We Feel Fine" project by Jonathan Harris and Sep Kamvar. The creators harvested blog entries for occurrences of the phrases "I feel" and "I am feeling" to determine the emotion behind the statement. They collected and maintained a database of several million human feelings from prominent websites such as Myspace and Flickr. This work uses the "We Feel Fine" data to measure the nature and intensity of a person’s emotional state as noted in the emotion-laden sentiment of individual blog entries. The specific words in the blogs related to feelings are rated on a continuous 1 to 9 scale using a psychological valence score to determine the degree of happiness. This work also incorporates elements of a multi-dimensional color wheel of emotions popularized by Plutchik to visually show the similarities between words. For example, happy positive feelings are bright yellow, while sad negative feelings are dark blue. The final visualization method combines a standard histogram which describes the emotional states with an embedded word frequency bar chart. We refer to this visualization technique as a "valence bar" which allows us to compare not only how the words used to express emotion have changed over time, but how this usage differs between men and women.

The video below shows a screencast highlighting how the valence bars change for different age groups and different years.

I participated as an employee of the MITRE Corporation -- we help ATARC organize a series of collaboration sessions designed to identify and recommend solutions to big challenges in the federal government. I led a collaboration session between government, industry, and academic representatives on Big Data Analytics and Applications. The goal of the session was to facilitate discussion among the participants on applying big data in the federal government and preparing for big data's continued growth in importance. The targeted topics included access to data in disconnected environments, interoperability between data providers, parallel processing (e.g., MapReduce), and moving from data to decision in an optimal fashion.

Due to the nature of the discussions (protected by the Chatham House Rule), I cannot elaborate on the specific attendees or specific discussions. In a few weeks, MITRE will produce a publicly released summary and set of recommendations for the federal government based on the discussions. When it is released, I will update this blog with a link to the report. It will be in a similar format and contain information at a similar level as the 2013 Federal Cloud Computing Summit deliverable.

On July 8th and 9th, I will be attending the Federal Cloud Computing Summit where I will run the MITRE-ATARC Collaboration Sessions on July 8th and moderate a panel of collaboration session participants on July 9th. Stay tuned for another blog posting on the Cloud Summit!

Wednesday, June 18, 2014

In this blog post, we detail three short tests in which we challenge the Google crawler's ability to index JavaScript-dependent representations. After an introduction to the problem space, we describe each of the three tests, summarized below.

String and DOM Modification: we modify a string and insert it into the DOM. A crawler that cannot execute JavaScript on the client will not index the string.

Anchor Tag Translation: we decode an encoded URI and add it to the DOM using JavaScript. The Google crawler should index the decoded URI after discovering it from the JavaScript-dependent representation.

Redirection via JavaScript: we use JavaScript to build a URI and redirect the browser to the newly built URI. The Google crawler should be able to index the resource to which JavaScript redirects.

Introduction

JavaScript continues to create challenges for web crawlers run by web archives and search engines. To summarize the problem, our web browsers are equipped with the ability to execute JavaScript on the client, while crawlers commonly do not have the same ability. As such, content created -- or requested, as in the case of Ajax -- by JavaScript is often missed by web crawlers. We discuss this problem and its impacts in more depth in our TPDL '13 paper.

We wanted to investigate how well Google's crawler could index JavaScript-dependent representations, so we created a set of three extremely simple tests to gain some insight into how it operates.

Test 1: String and DOM Modification

To challenge the Google crawler in our first test, we constructed a test page with an MD5 hash string "1dca5a41ced5d3176fd495fc42179722" embedded in the Document Object Model (DOM). The page includes a JavaScript function that changes the hash string by performing a ROT13 translation on page load. The function overwrites the initial string with the ROT13-translated string "1qpn5n41prq5q3176sq495sp42179722".

Before the page was published, both hash strings returned 0 results when searched in Google. Now, Google shows the result of the JavaScript ROT13 translation that was embedded in the DOM (1qpn5n41prq5q3176sq495sp42179722) but not the original string (1dca5a41ced5d3176fd495fc42179722). The Google crawler successfully passed this test and accurately crawled and indexed this JavaScript-dependent representation.
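The page logic described above can be sketched as follows. This is a hypothetical reconstruction, not the actual test page; the element id and onload wiring are assumptions. ROT13 rotates each letter 13 places and leaves digits untouched, so only the hex letters a-f change.

```javascript
// ROT13: rotate letters 13 places; digits and punctuation pass through.
function rot13(s) {
  return s.replace(/[a-zA-Z]/g, function (c) {
    const base = c <= 'Z' ? 65 : 97; // 'A' or 'a'
    return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// On the real page, an onload handler would overwrite the string
// already present in the DOM, along these (hypothetical) lines:
//   window.onload = function () {
//     const el = document.getElementById('hash');
//     el.textContent = rot13(el.textContent);
//   };

console.log(rot13('1dca5a41ced5d3176fd495fc42179722'));
// -> 1qpn5n41prq5q3176sq495sp42179722
```

A crawler that indexes only the initial HTML sees the original string; one that executes the onload handler sees only the translated string, which is exactly the distinction the test exploits.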

Test 2: Anchor Tag Translation

Continuing our investigation with a second test, we wanted to determine whether Google could discover a URI to add to its frontier if the anchor tag is generated by JavaScript and only inserted into the DOM after page load. We constructed a page that uses JavaScript to ROT13 decode the string "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy" to get a decoded URI. The JavaScript inserts an anchor tag linking to the decoded URI. This test evaluates whether the Google crawler will extract the URI from the anchor tag after JavaScript performs the insertion or if the crawler only indexes the original DOM before it is modified by JavaScript.
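The insertion step can be sketched as below. This is a hypothetical reconstruction of the page logic; the function name and onload wiring are assumptions, and the real encoded URI from the test is deliberately not decoded here (only the scheme prefix "uggc" is used for illustration).

```javascript
// ROT13 decode (ROT13 is its own inverse).
function rot13(s) {
  return s.replace(/[a-zA-Z]/g, function (c) {
    const base = c <= 'Z' ? 65 : 97;
    return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// Hypothetical: decode the URI and append an anchor tag after load.
// (Uses the browser DOM; not invoked outside a browser.)
function insertDecodedLink(encodedUri) {
  const a = document.createElement('a');
  a.href = rot13(encodedUri);
  a.textContent = 'HIDDEN!';
  document.body.appendChild(a);
}
//   window.onload = function () { insertDecodedLink('uggc://...'); };

console.log(rot13('uggc'));
// -> http
```

Because the anchor exists only after the onload handler runs, a crawler that parses raw HTML never sees an `<a>` element at all; the URI can enter its frontier only if it executes the script first.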

The representation of the resource identified by the decoded URI contains the MD5 hash string "75ab17894f6805a8ad15920e0c7e628b". At the time of this blog posting's publication, this string returned 0 results in Google. To protect our experiment from contamination (i.e., linking to the resource from a source other than the JavaScript-reliant page), we will not post the URI of the hidden resource in this blog.

The text surrounding the anchor tag is "The deep web link is: " followed by the anchor tag with the target being the decoded URI and the text of "HIDDEN!". If we search for the text surrounding the anchor tag, we receive a single result which includes the link to the decoded URI. However, at the time of this blog posting's publication, the Google crawler has not discovered the hidden resource identified by the decoded URI. It appears Google's crawler is not extracting URIs for its frontier from JavaScript-reliant resources.

Test 3: Redirection via JavaScript

In a third test, we created two pages. The first, linked from my homepage, is called "Google Test Page 1". This page has an MD5 hash string "d41d8cd98f00b204e9800998ecf8427e" embedded in the DOM.

A JavaScript function changes the hash code to "4e4eb73eaad5476aea48b1a849e49fb3" when the page's onload event fires. In short, when the page finishes loading in the browser, a JavaScript function will change the original hash string to a new hash string. After the DOM is changed, JavaScript constructs a URI string to redirect to another page.
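The redirect step can be sketched as follows. This is a hypothetical reconstruction: the helper name, the base URI, and the page name are illustrative assumptions, not the actual values used in the test.

```javascript
// Hypothetical: build the redirect target URI from parts,
// normalizing the trailing slash so we never emit "//".
function buildRedirectUri(base, page) {
  return base.replace(/\/$/, '') + '/' + page;
}

// On the real page, the onload handler would first rewrite the hash
// in the DOM and then redirect, along these (hypothetical) lines:
//   window.onload = function () {
//     document.getElementById('hash').textContent = NEW_HASH;
//     window.location.href = buildRedirectUri(BASE_URI, TARGET_PAGE);
//   };

console.log(buildRedirectUri('http://example.com/', 'target.html'));
// -> http://example.com/target.html
```

Since the target URI never appears as a literal string in the original HTML, a crawler can reach the redirect destination only by executing the script, which is what this test probes.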