Tuesday, October 22, 2013

Every day we create more and more digital files that record our lives. We take selfies (with and without our loved ones). We record our baby's first step. We take pictures of things that we have or would like to have. The number of digital files and artifacts we create grows and grows, and the places where we can store them seem to have almost infinite capacity. A smartphone with 64 gigabytes of storage could hold almost 20,000 MP3 files (roughly 1,000 hours of listening time, or about 4 months of listening 8 hours a day). Amateur cameras can have the same amount of storage and, depending on image size and frames per second, can store days of continuous recordings or about 500,000 still images. We can and are creating more digital artifacts than we can manage. Being able to create so much means we don't care about what we create. We create because it is easy. We create because it is fun. We create because we have a new toy. We create because we can. There is a significant downside to this creation craze. How can we preserve our selfies for our children? How can we share our baby's first step with their babies? How can we show what we had when we were young, now that our hair is silver? How can we show unknown others in the future those things that were important to us in our youth? How do we preserve our selves?
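The storage arithmetic above can be checked with a quick back-of-the-envelope calculation (the per-file size and track length are rough assumptions, not measurements):

```python
# Rough check of the figures above; sizes are assumptions.
mp3_mb = 3.2                            # assumed average MP3 size (128 kbps)
storage_mb = 64 * 1024                  # a 64 GB phone, in MB
songs = storage_mb / mp3_mb             # ~20,000 files
minutes_per_song = 3                    # assumed average track length
hours = songs * minutes_per_song / 60   # ~1,000 hours of listening
days_at_8h = hours / 8                  # ~4 months at 8 hours a day
print(round(songs), round(hours), round(days_at_8h))  # 20480 1024 128
```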

We could foist the preservation responsibility for all that we create onto our children (which seems rather unkind). We could preserve our selves using a commercial or governmental institution, but that may not be much better. Another way to attack the problem is to rephrase the question. Instead of: how do we preserve digital artifacts and objects? Change the question to: how can digital objects preserve themselves? If we can imbue digital objects with directions to preserve themselves, and provide a benign environment where they can survive, then they should continue to be available long after we are gone. Long after our children are gone. Long after those that loved and cared for us are forgotten. Imbuing digital objects with preservation directives and providing a benign environment are at the heart of Unsupervised Small-World (USW) graphs.

At Old Dominion University, we created a demonstration USW environment composed of representative sample webpages, faux domains with supporting RESTful methods, and a robot to represent users as they wandered through the Internet viewing the representative webpages. We scraped parts of four domains (flickr.com, arXiv.org, RadioLab.org, and Gutenberg.com) to collect representative pages with different types of digital files.

As the human-facing portion of the USW graph, we mocked up a Preserve Me! button so that a webpage viewer could add the webpage to the USW graph.

There are two major parts of the benign infrastructure. The first is a set of servers that support two USW RESTful methods called "copy" and "edit." The "copy" method creates a copy of a foreign REM in the local domain, and "edit" updates selected REMs on the local domain. The second is an HTTP message server (the subject of Sawood Alam's Master's thesis), which provides a communication mechanism for exchanging actionable HTTP directives between the USW-imbued digital objects.
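The two RESTful methods might look something like this on the wire (a hypothetical sketch; the endpoint paths, parameter names, and payloads are my assumptions for illustration, not the demo's actual API):

```python
# Hypothetical request builders for the two USW RESTful methods.
# Endpoint paths and parameter names are illustrative assumptions.

def copy_request(local_domain, foreign_rem_uri):
    """Build an HTTP request asking `local_domain` to create a local
    copy of the REM found at `foreign_rem_uri`."""
    body = f"source={foreign_rem_uri}"
    return (f"POST /usw/copy HTTP/1.1\r\n"
            f"Host: {local_domain}\r\n"
            f"Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(body)}\r\n\r\n{body}")

def edit_request(local_domain, rem_path, patch):
    """Build an HTTP PATCH that updates a REM hosted on `local_domain`."""
    return (f"PATCH {rem_path} HTTP/1.1\r\n"
            f"Host: {local_domain}\r\n"
            f"Content-Length: {len(patch)}\r\n\r\n{patch}")

print(copy_request("gutenberg.com", "http://flickr.com/rems/josie"))
```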

As an example, we are going to talk about preserving a scanned image from the 1900s.

The image was uploaded to flickr.com and was scraped to become part of the ODU benign USW demonstration environment.

A robot was written to act as a human visiting the different pages in the ODU USW demonstration environment. The robot was written rather than having a human repeatedly press the Preserve Me! button on different pages. It is possible to watch the USW graph grow using the Preserve Me! Visualizer:

1. The "copy," "edit," and HTTP mailbox infrastructure components are represented by the three cyan colored icons in the center of the display.

2. Original USW REMs are in a concentric circle close to the infrastructure icons and are color coded. REMs from flickr have a magenta frame, those from RadioLab have a blue frame, and REMs from Gutenberg have a yellow frame.

3. Copy USW REMs are much further out from the center and have the same frame color as the domain they are hosted on, but the contents of the REMs are from their original domain.

4. Permanent connections between REMs (edges in the USW graph) are directional and colored white.

5. Activity between a REM and any of the infrastructure components is directional, red, and transient.

6. If a REM is removed from the system, a red slash is drawn through its icon.

7. Below the plotting area are VCR-like controls, including speed controls, toggling the background between black and white, capturing an image, and maximizing the display.

8. Placing the pointer over any of the icons will cause almost all other icons and edges to become hidden. The only things that will remain visible are the icon under the pointer, permanent edges originating at that icon, and icons that are pointed to by those permanent edges.

9. Clicking on an icon will show explanatory information about the icon.

10. A REM will try to make preservation copies on domains other than its own that it knows about.

Preserve Me! Viz replays a prerecorded JSON log of events. These events came from a scenario that the robot executed. Between the time the robot created the JSON log file and when you replay the visualization of the robot's actions, the USW graph created by the robot may no longer be in existence (caveat emptor).
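A log entry and replay loop might look something like the following (the JSON field names here are guesses for illustration; the visualizer's actual schema may differ):

```python
import json

# Hypothetical event records; field names are illustrative assumptions.
log = json.loads("""[
  {"event": 1, "time": 8.575, "action": "connect",
   "from": "flickr/kittens", "to": "gutenberg/pride-and-prejudice"},
  {"event": 526, "time": 1135.699, "action": "copy",
   "from": "flickr/josie", "to": "gutenberg/josie-copy"}
]""")

# A minimal replay loop: step through the events in time order.
for e in sorted(log, key=lambda e: e["time"]):
    print(f'{e["time"]:>9.3f}s  {e["action"]:<8} {e["from"]} -> {e["to"]}')
```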

The general events are:

1. REM #1 retrieves messages from its mailbox. (As indicated by the flashing red line from the REM to the HTTP mailbox icon.)

2. Based on the messages, REM #1 might execute HTTP patch directives (as indicated by the flashing red line from the REM to the edit icon), might create preservation copies of another REM (as indicated by the flashing red line from the REM to the copy icon and the creation of a preservation REM), or might take other actions.

A REM will never directly affect another REM. A REM will send requests and directives via the HTTP mailbox.
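The mailbox-driven behavior described above can be sketched as a simple polling loop (an illustration of the idea only, not the demo's actual code; the function and field names are my own assumptions):

```python
# Sketch of a REM's mailbox-driven behavior: a REM never touches
# another REM directly; it only reads messages and issues requests.
def process_mailbox(messages, known_domains, copies):
    """Dispatch each mailbox message; return requests to send out."""
    outgoing = []
    for msg in messages:
        if msg["directive"] == "patch":
            outgoing.append(("edit", msg["target"], msg["patch"]))
        elif msg["directive"] == "copy-me":
            # Prefer a domain other than the requester's home domain.
            for domain in known_domains:
                if domain != msg["home_domain"] and domain not in copies:
                    outgoing.append(("copy", domain, msg["target"]))
                    copies.add(domain)
                    break
    return outgoing

reqs = process_mailbox(
    [{"directive": "copy-me", "target": "flickr/josie",
      "home_domain": "flickr.com"}],
    ["flickr.com", "gutenberg.com"], set())
print(reqs)  # a copy request aimed at gutenberg.com
```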

The replay file shows 17 webpages across 4 domains creating preservation copies of themselves on domains different from the ones where they were created. Josie originated on the flickr domain (at the 6 o'clock position and framed in magenta), preserved a copy on the Gutenberg domain (at the 1:30 position and framed in yellow), and made USW connections to a REM originating on the Gutenberg domain (at the 12 o'clock position, framed in yellow) and to preservation copies on the flickr and RadioLab domains (framed in magenta and green respectively).

6 - 9 (8.575 - 14.290) The first USW REM connection is made from flickr's Kittens to Gutenberg's Pride and Prejudice.

10 - 277 (663.476) Additional REMs are added to the system and make connections to Gutenberg's Pride and Prejudice.

278 - 326 (664.884 - 770.372) Gutenberg's Pride and Prejudice begins to read messages from its mailbox.

327 - 525 (771.436 - 1134.216) Gutenberg's Pride and Prejudice creates reciprocal REM connections to other REMs, creates preservation copies on the Gutenberg domain, and sends messages back to requesting REMs.

526 (1135.699) A preservation copy of Josie is created on the Gutenberg domain.

527 - 1056 (2921.045) REMs continue to make preservation copies and
permanent edges as directed by messages from the HTTP mailbox.

1057 (2922.324) The first REM on the RadioLab domain is lost. The next few
events will show all REMs on the RadioLab domain as lost. These few
events simulate the total loss of the domain either through closing the
domain, terminating the domain's participation in the USW process, or
disconnection of the domain from the Internet.

1093 - 1735 (2999.098 - 5398.655) The remaining REMs continue to process messages from their respective mailboxes until all messages have been processed and no further communication is necessary. In effect, the USW system has reached a point of stability and does not have any growth or change opportunities.

Now Josie (my grandmother's sister) exists on two domains and, given a larger benign environment, could spread to more places, thereby increasing the likelihood of being around long after those that knew her have been forgotten.

On the picture's back: Josie McClure's Picture taken Feb. 30, 1907 at Poteau I. T. Fifteen years of age. When this was taken weighed 140 lbs.

Like Kalpesh and Yasmin, I have turned a semester project into a conference submission with a poster/demo accepted to IEEE VIS 2013: Graph-Based Navigation of a Box Office Prediction System. The impetus for this strangely out-of-topic (for this blog's theme) submission has roots in the IEEE Visual Analytics Science and Technology (VAST) Challenge, a competition where a large data set is supplied to contestants and a meaningful visual representation is created with each submission. Both Kalpesh and I had previously participated in the VAST Challenge in 2011 (see a summary of my Visual Investigator submission) yet neither of us attended the conference, so without further ado, the Trip Report.

I arrived on Wednesday morning, set up my poster, and headed off to the first session, which consisted of "Fast Forwards" of the papers. This summary session is akin to the "Minute Madness" at JCDL and allows conference attendees to get a glimpse of the many papers to be presented and to choose which concurrent session to attend. The one that piqued my interest the most was the InfoVis Papers session: Storytelling & Presentation.

With the completion of the Fast Forward Summaries, I headed over to the Atrium Ballroom of the Atlanta Marriott Marquis (the conference venue, pictured above) to first see Jessica Hullman of University of Michigan present "A Deeper Understanding of Sequence in Narrative Visualization" (full paper).

In the presentation she stated, "Users are able to understand data if they're seeing the same type of transition repeatedly." In her study, her group created over fifty presentational transitions using public data, varying type and cost (she describes the latter as a function in the paper). From the study, she found that low-cost transitions are preferred, temporal transitions are easy to grasp, and hierarchical transitions were the most difficult for the user.

She then created 6 visualizations with and without parallel structures and utilized them in a timed presentation given to 82 users. She then asked for the transitions to be compared and explained, and asked the users to recall the order of the content. With further studies on the topic she was able to confidently conclude that "Presentation order matters!" and that "Sequence affects the consumption of data by the user."

Following Jessica, Bongshin Lee (@bongshin) of Microsoft Research presented "SketchStory: Telling More Engaging Stories with Data through Freeform Sketching". SketchStory is a means of utilizing underlying data in interactive presentations, as is done on an interactive whiteboard. Bongshin demonstrated the product by showing that, just through gesturing, data can be immediately plotted or graphed in a variety of PowerPoint-esque figures to help a presenter explain data interactively to an audience. The system is capable of drawing icons and axes while utilizing the data on-the-fly, which makes it suitable for storytelling.

In a study of the product, Bongshin's group found that users enjoyed the presentations given with SketchStory more than PowerPoint presentations, felt more engaged with the presentations, and that the presenters felt the system was easy enough to learn. However, possibly due to previous familiarity with PowerPoint, most presenters felt that creating a presentation in SketchStory required more effort than doing so in PowerPoint.

In followup questions to the presentation, one audience participant (at the conference, not in the study) asked how engagement was measured in the study, to which Bongshin replied that questions were asked using a Likert scale. When another audience member asked where they could try out the software to evaluate it for themselves, Bongshin replied that it was not available for download and is only suitable for internal (Microsoft) use.

The next presentation was on StoryFlow, a tool (inspired by the work of Randall Munroe, illustration pictured) for creating storyline visualizations interactively. The authors determined that for a visualization to be more effective, the timeline plot needed to reduce the number of edge crossings and minimize whitespace and "wiggles", the latter referring to unnecessary movements for association in the graph.

The authors mathematically optimized the plots using quadratic programming to facilitate ordering, alignment, and compaction of the plots. Evaluation was done by comparing the generated plots against a genetic algorithm method and Randall's method. From their work, the authors concluded that their storyline visualization system was an effective hybrid approach to producing the graphs through being aware of the hierarchy needed based on the plots. Further, their system provided a method for interactively and progressively rendering the graphs should the user prefer a more visually pleasing layout.
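The edge-crossing count that such storyline layouts try to minimize reduces to counting order inversions between adjacent time steps, as in this sketch (my illustration of the metric, not the authors' code):

```python
from itertools import combinations

def crossings(order_a, order_b):
    """Count edge crossings for characters appearing in two adjacent
    time steps: a pair of lines crosses if its relative order flips."""
    pos_b = {c: i for i, c in enumerate(order_b)}
    shared = [c for c in order_a if c in pos_b]
    return sum(1 for x, y in combinations(shared, 2)
               if pos_b[x] > pos_b[y])

# Two characters swap positions between steps: exactly one crossing.
print(crossings(["alice", "bob", "carol"], ["bob", "alice", "carol"]))  # 1
```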

The fifth and last presentation of the Storytelling and Presentation session was "Visual Sedimentation", an interesting approach to showing data flow. "Data streams are everywhere but difficult to visualize," stated Samuel Huron (@cybunk).

Along with Romain Vuillemot (@romsson) and Jean-Daniel Fekete (@jdfaviz) of Inria, their early work started with visualizing political Twitter streams during the 2012 French presidential elections and social interactions during a TV show. Through an effect of compaction, old data is "merged" into the total value to escape visual clutter and provide an interesting accumulation abstraction. The video (below) gives a good overview of the work, but for those technically handy, the project is on GitHub.

After a short coffee break, I attended the next session, wherein Carlos Scheidegger (@scheidegger) presented "Nanocubes for Real-Time Exploration of Spatiotemporal Datasets". Nanocubes are "a fast datastructure for in-memory data cubes developed at the Information Visualization department at AT&T Labs – Research". Based on Data Cubes by J. Gray et al., along with many other works well known in the Vis community (e.g., by Stolte, Mackinlay, Kandel, and others), Carlos showed how they went about extracting the information necessary based on two location fields and a device field aggregated to create a summary record.

Carlos' summation of the project was that nanocubes enabled interactive visual interfaces for datasets that previously were much too large to visualize. Further, he emphasized that these data sets did not have massive hardware requirements; instead, the system was designed to allow exploration of the data sets from a laptop or cell phone. The project is open source, with the server back end written in C++11 and the front end written in C++11, OpenGL, JavaScript, D3, and a few other technologies.

After Carlos, Raja Sambasivan (@RS1999ent) of Carnegie Mellon University presented "Visualizing Request-Flow Comparison to Aid Performance Diagnosis in Distributed Systems". "Distributed systems are prone to difficult-to-solve problems due to scale and complexity," he said. "Request flows show client-server interaction."

After Raja, Michelle Borkin (@michelle_borkin) of Harvard presented "Evaluation of Filesystem Provenance Visualization Tools", in which she opened the talk by introducing file system provenance: the recording of relationships of reads and writes on the file system of a computer. The application of recording this information might lie in "IT help, chemistry, physics and astronomy", she said. Through a time-based node-grouping algorithm, data is broken up into groups by activity, versus being a whole-stream grouping or a simple marker for the start of activity.

She illustrated various methods for visualizing file provenance, showing a side-by-side of how a node-and-link diagram gets unwieldy with large data sets and expounding on radial graphs as a preferable alternative.

A running theme in the conference was the addition of a seemingly random dinosaur on the slides of presenters. The meme originated with Michelle's presentation on Tuesday titled "What Makes a Visualization Memorable?" (paper), in which she was quoted as saying, "What makes a visualization memorable? Try adding a dinosaur. If that’s not enough, add some colors." With this in mind, dinosaurs began popping up on the slide each author felt was the take-home of his/her presentation.

Following Michelle, Corinna Vehlow presented "Visualizing Fuzzy Overlapping Communities in Networks". "There are two types of overlapping communities: crisp and fuzzy.", she continued, "Analyzing is essential in finding out what attributes contribute to each of these types." Her group has developed an approach for utilizing undirected weighted graphs for clarifying the grouping and representing the overlapping community structure. Through their approach, they were able to define the predominant community of each object and allow the user of their visualization to observe outliers and identify objects with fuzzy associations to the various defined groups.

After Corinna, Bilal Alsallakh (@bilalalsallakh) presented "Radial Sets: Interactive Visual Analysis of Large Overlapping Sets". In his talk, he spoke about Euler diagrams' limited scalability and the concept he created called "Radial Sets" that allows association to be encoded using relative proximity. The interactive visualization he created allowed for interactivity wherein extra information could be accessed (e.g., set union, intersection) by holding down various keyboard modifiers (e.g., alt, control). By using a brushing gesture, sets could be combined and aggregate data returned to the user.

The conference then broke for a long lunch. Upon returning, a panel commenced titled "The Role of Visualization in the Big Data Era: An End to a Means or a Means to an End?" with Danyel Fisher (@FisherDanyel), Carlos Scheidegger (@scheidegger), Daniel Keim, Robert Kosara (@eagereyes), and Heidi Lam. Danyel stated, "The means to an end is about exploration. The ends to a means is about presentation". He noted that a lot of big data is under-explored. In 1975, he illustrated, big data was defined in VLDB's first year as 200,000 magnetic tape reels. He cited his own 2008 paper about Hotmaps as an exhibition that big data is frequently not suitably convertible for interactivity. "There were things I couldn't do quickly", he said, alluding to tasks like finding the most popular tile in the world in the visualization. He finished his portion of the panel by stating that visualization is both an end to a means and a means to an end. "They're complementary, not opposing", he concluded.

Carlos was next and stated that there are two kinds of big data: the type that is large in quantity and the type that is "a mess". "Big data is a means to a means," he said. "Solving one problem only brings about more questions. Technology exists to solve problems created by technology." He continued by noting that people did not originally expect data to be interactive. "Your tools need to be able to grow with the user so you're not making toy tools." He continued by saying that we need more domain-specific languages: "Let's do more of that!"

Heidi followed Carlos, noting that "When company profits are down, consider whether they've always been down," alluding to the causal aspect of the panel. She noted two challenges: first, figuring out what not to show in a visualization; second, that aggregated data is likely missing meaning only apparent when the full data set is explored, an issue with big data. She finished by describing Simpson's paradox, saying "Only looking at aggregate data and not slices might result in the wrong choice," referring back to her original "profits down" example.
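Simpson's paradox is easy to demonstrate with a toy data set (the classic kidney-stone treatment numbers, not anything from the panel): each slice favors treatment A, while the aggregate favors B.

```python
# Classic illustrative numbers: treatment A wins in every slice,
# yet loses in the aggregate. Tuples are (successes, trials).
a = {"small": (81, 87), "large": (192, 263)}
b = {"small": (234, 270), "large": (55, 80)}

rate = lambda s, n: s / n
for slice_ in ("small", "large"):
    assert rate(*a[slice_]) > rate(*b[slice_])    # A better in each slice

a_total = rate(sum(s for s, n in a.values()), sum(n for s, n in a.values()))
b_total = rate(sum(s for s, n in b.values()), sum(n for s, n in b.values()))
assert a_total < b_total                           # ...but worse overall
print(f"A overall: {a_total:.3f}, B overall: {b_total:.3f}")
```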

Robert spoke after Heidi by asking the audience, "What does big data mean? How do you scale from three hundred to three hundred thousand? Where should it live?" In reference to a tree map he asked, "Why would I want to look at a million items and how is this going to scale?" Juxtaposed to Heidi he stated that he cares about totals and aggregate data and likely not the individual data.

In the Q&A part of the panel, one panelist noted, "Visualization at big data does not work. Shneiderman's mantra does not work for big data." The good news, stated another panelist, is that automated analysis of big data does work.

Following the panel, the poster session commenced early as the last event of the day. There I presented the poster/demo mentioned earlier in this post, "Graph-Based Navigation of a Box Office Prediction System".

Thursday

The second day of my attendance at IEEE VIS started with a presentation by Ross Maciejewski titled "Abstracting Attribute Space for Transfer Function Design". In the paper, Ross inquired as to how to take 3D data and map the bits to color.
In his group's work, they proposed a modification to such a visualization in which the user is presented with an information metric detailing the relationship between attributes of the multivariate volumetric data instead of simply the magnitude of the attribute. "Why are high values interesting?", he asked, and replied with "We can see the skewness change rapidly in some places." His visualization gives a hint of where to start in processing the data for visualization and gives additional information metrics like mean, standard deviation, skewness, and entropy. Any of these values can then be plugged into the information metric of the visualization.

Carsten Görg followed Ross with "Combining Computational Analyses and Interactive Visualization for Document Exploration and Sensemaking in Jigsaw". "Given a collection of textual documents," he said, "we want to assist analysts in information foraging and sensemaking." Targeted analysis is a bottom-up approach, he described, while an open-ended scenario is top-down. He then proceeded to show an hourglass as an analogy of information flow in either scenario. His group did an evaluation study with four settings (paper, a desktop, an entity, and Jigsaw), each using four strategies: overview, filter, and detail; build from detail; hit the keyword; and find a clue and follow the trail. From a corpus of academic papers, he showed a demo wherein corresponding authors were displayed on-the-fly when one was selected.

Ross Maciejewski completed the sandwich around Carsten by presenting another paper after him titled "Bristle Maps: A Multivariate Abstraction Technique for Geovisualization". In the talk he first described four map types and some issues with each:

Point maps are cluttered

Choropleth maps exhibit the modifiable areal unit problem

Heat Maps are limited to one variable per map

Line maps allow two variables per map, but that's it.

"Bristle maps allow the user to visualize seven variables utilizing traits like color, size, shape, and orientation in the visualization." His group tried different combinations of encoding to see what information could be conveyed. As an example, he visualized crime data at Purdue and found that people were better at identifying values in the visualization with bristle maps than with a bi-variate color map.

After Ross' sandwich, Jing Yang presented "PIWI: Visually Exploring Graphs Based on Their Community Structure" (HTML). In the presentation she described the process of using Vizster and NodeXL to be able to utilize tag clouds, vertex plots, Boolean operations, and U-Groups (user-defined vertex groups).

Following Jing, Zyiyuan Zhang presented "The Five W's for Information Visualization with Application to Healthcare Informatics". "Information organization uses the 5 Ws scheme," he said: "Who (the patient), What (their problems), Where (location of What), When (time and duration of What)", conveniently leaving out the "Why". He encoded these questions into a format more navigable to doctors than the usual form-based layout healthcare professionals experience.

Following a break, Charles Perin (@charles_perin) presented "SoccerStories: A Kick-off for Visual Soccer Analysis". "Usually there's not enough data and only simple statistics are shown," Charles said. "If there's too much data, it's difficult to explore." His group developed a series of visualizations that allows each movement on the field to be visualized, using context-sensitive visualization types that are appropriate for the type of action on the field they're trying to describe. Upon presentation to a journalist, the reply was "My readers are not ready for this complex visualization", noting that a higher degree of visualization literacy would be required to fully appreciate the visualization's dynamics.

Following Charles, Rahul presented work on understanding inter-firm relationships in business ecosystems through interactive visualization, with these goals:

Visually explore the complexity of inter-firm relations in the mobile ecosystem

Discover the relation between current and emerging segments

Determine the impact of convergence on ecosystem structure

Understand a firm's competitive position

Identify inter-firm relation patterns that may influence their choice of innovation strategy or business models.

Following Rahul, Sarah Goodwin (@sgeoviz) presented "Creative User-Centered Visualization Design for Energy Analysts and Modelers". In the presentation she visualized the energy usage of individuals to provide insight into time-shifting their usage (a la Smart House) to off-peak times.

Christian Partl spoke after Sarah on his paper "Entourage: Visualizing Relationships between Biological Pathways using Contextual Subsets". His work expounded on Kono 2009, showing that biological processes can be broken down into pathways, and asked three questions:

How do we visualize multiple pathways at the same time?

How do we visualize relationships between pathways?

How do we visualize experimental data on the pathways?

To visualize multiple pathways, he connected the pathways by shared nodes, with "focus" pathways and "context" pathways. When focusing on a node, his visualization displays only the immediately surrounding nodes. Relationships can be visualized by connecting stubs and inferring which pathway each belongs to. A system called enRoute allows selection of a path within a pathway and can display it in a separate view to show experimental data.

Joel Ferstay came up after Christian with "Variant View: Visualizing Sequence Variants in their Gene Context". In their study they created a visualization for DNA analysts in an interactive and iterative fashion to ensure the visualization was maximally useful in allowing exploration and providing insights into the data. From a data source of DNA sequence variants (e.g., human versus monkey), their work helped to determine which variants are harmful and which are harmless. Their goal was to show all attributes necessary for variant analysis and nothing else. To evaluate their visualization, they compared it to MuSiC, a different variant visualization plot, and found that Variant View showed encodings on separate lanes and so did not have the disadvantage of variant overlap, which would hinder usefulness.

Sébastien Rufiange next presented "DiffAni: Visualizing Dynamic Graphs with a Hybrid of Difference Maps and Animation". In his presentation, he tried to resolve the trade-off that node-link diagrams scale badly while matrices are hard to read, by showing dynamic networks in small multiples and embedded glyphs with data at each point.

John Alexis Guerra Gómez (@duto_guerra) followed Sébastien with "Visualizing Change Over Time Using Dynamic Hierarchies: TreeVersity2 and the StemView" where he showed how to display categorical data as trees. The trees consisted of data with either fixed hierarchy, dynamic data (e.g., gender, ethnicity, age), or mixed (e.g., gender, state, city).

Following John, Eamonn Maguire presented "Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs". In the paper, they created macros in workflow visualization as a support tool to increase the efficiency of data curation tasks. Further, they discovered that the state transition information used to identify macro candidates characterizes the structural pattern of the macro and can be harnessed as part of the visual design of the corresponding macro glyph.

After Eamonn, Eirik Bakke (@eirikbakke) presented "Automatic Layout of Structured Hierarchical Reports". In their visualization, Eirik's group sought to overcome the form-based layout style normally supplied to those having to interface with a database. Using a nested-table approach allowed them to display data based on the screen real estate available and to adapt when the available space changed.

Tim Dwyer presented next with "Edge Compression Techniques for Visualization of Dense Directed Graphs", where he attempted to simplify dense graphs by creating boxes. His visualizations were created using power-graph compression through MiniZinc.

After a much-needed break (as evidenced by the length of my trip report notes), R. Borgo presented "Visualizing Natural Image Statistics", in which he, utilizing Fourier representations of images, noted that it's difficult to uniquely identify different images by sight. Further, he found that it was difficult to even define the statistical criteria for classifying these images. The examples he used were manmade versus natural images, wherein some degree of similarity existed between those of the same class but the distinction was insufficient. Using Gabor filters, four scales and eight orientations were used for the classification task.

Yu-Hsuan Chan presented next with "The Generalized Sensitivity Scatterplot". She had asked people to identify two functional trends from a scatterplot, determined the flow of the data independently, and measured how well the trends matched.

Michael Gleicher presented his paper next, "Splatterplots: Overcoming Overdraw in Scatter Plots". In his paper, he asked, "What happens when you have scatterplots with too many points?" He continued, "Data is unbounded, visual is bounded". His group utilized kernel density estimation to determine when to cluster data and utilized the GPU to ensure that the visualization remained interactive.
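The kernel-density idea can be sketched in one dimension: estimate density with a Gaussian kernel and treat points as part of a cluster only where the density is high (a toy sketch of the general technique, not the paper's GPU implementation; the bandwidth and threshold are arbitrary):

```python
import math

def kde(xs, x, bandwidth=0.5):
    """Gaussian kernel density estimate of the sample `xs` at point x."""
    return sum(math.exp(-((x - xi) / bandwidth) ** 2 / 2)
               for xi in xs) / (len(xs) * bandwidth * math.sqrt(2 * math.pi))

points = [0.0, 0.1, 0.2, 0.15, 5.0]   # a dense clump plus one outlier
threshold = 0.3                        # arbitrary density cutoff
for p in points:
    kind = "cluster" if kde(points, p) > threshold else "point"
    print(f"{p:4.2f}: draw as {kind}")
```

The clump around 0 exceeds the threshold and would be rendered as an aggregate region, while the outlier at 5.0 is drawn as an individual point.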

Ceci n'est pas une kelp.

Wouter Meulemans presented next with his paper "KelpFusion: a Hybrid Set Visualization Technique". He said, "Given a set of points, each point is part of a set. To find the structure, connect the nodes to form a minimum spanning tree." He went on to compare Kelp Diagrams with Bubble Sets and Line Sets. He touted KelpFusion as a means to interactively explore hybrid selection. He then went on to explore the various considerations and heuristics he used in strategically generating the areas between nodes to express relations beyond a simple node-and-link diagram while simultaneously retaining the context potentially provided on an underlying layer (see below).
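The spanning-tree step he described can be sketched with Prim's algorithm over Euclidean distances (my sketch of the general technique, not KelpFusion's implementation):

```python
import math

def mst_edges(points):
    """Prim's algorithm: edges of a Euclidean minimum spanning tree."""
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        # Cheapest edge from the tree to a point not yet in it.
        u, v = min(((u, v) for u in in_tree
                    for v in range(len(points)) if v not in in_tree),
                   key=lambda e: dist(*e))
        in_tree.add(v)
        edges.append((u, v))
    return edges

# Three collinear points: the MST links neighbors, never the far pair.
print(mst_edges([(0, 0), (1, 0), (2, 0)]))  # [(0, 1), (1, 2)]
```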

Ceci est une kelp.

The final presentation I attended was Sungkil Lee's "Perceptually-Driven Visibility Optimization for Categorical Data Visualization". The goal of Sungkil's study was to define a measure of perceptual intensity for categorical distances. They define class visibility as a metric for measuring the perceptual intensity of categorical groups. Without doing so, they found, the dominant structure suppresses small inhomogeneous groups.

Following the presentations, I headed back to Norfolk armed with new knowledge of the extent of visualization research that is currently being done. Had I simply perused proceedings or read papers, I am not sure I would have gotten the benefit of hearing the authors give insights into their work.

Tuesday, October 15, 2013

On October 2-5, I was thrilled to attend Grace Hopper Celebration of Women in Computing (GHC), the world's largest gathering for women in computing, and meet so many amazing and inspiring women in computing. This year, GHC was held in Minneapolis, MN. It is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together research and career interests of women in computing and encourage the participation of women in computing. GHC was held for the first time in 1994 in Washington DC. The theme of the conference this year was "Think Big - Drive Forward".

There were many sessions and workshops targeted at academics and industry. The Computing Research Association Committee on Women in Computing (CRA-W) offered sessions targeted towards academics. I had a chance to attend the Graduate Cohort Workshop last April, which was held in Boston, MA, and wrote a blog post about it.

The first day started with the program co-chairs, Wei Lin of Symantec Corporation and Tiffani Williams of Texas A&M University, welcoming newcomers. They expressed their happiness to be among 4,600 brilliant women in computing, and they highlighted that many experts and collaborators were eager to help and answer our questions.

Barb Gee, the vice president of programs for the Anita Borg Institute, spoke about ABI's global expansion and its successful experiment in India. Gee said, "we believe that if women are equally represented at the innovation table, the products will meet better satisfaction and solutions for many problems will be optimized".

Then came the plenary session, in which three amazing thought leaders had an enlightening conversation about "How we can think big, and drive forward": Sheryl Sandberg, the COO of Facebook and the founder of LeanIn.org; Maria Klawe, the president of Harvey Mudd College; and Telle Whitney, the President and CEO of the Anita Borg Institute. The conversation started with a question from Klawe to Sandberg about her reason for writing her book "Lean In". Sandberg started her answer with "because it turns out the world is still run by men, and I'm not sure it's going very well!".

Sandberg left all of us greatly inspired with her question: "What would you do if you were not afraid?"
Here are some quotes from their conversation:

"People who imagine and build technology are problem solvers. They look at what the world needs and they create it."

"We are here because we believe that each one of you has a potential to create a different future."

"Women who make up 51% of the population and are part of 80% of the purchasing decisions, only make up 23% of the computer science work force."

"Next time you see a little girl and someone is calling her bossy, take a deep breath, put a big smile on your face, and say, 'that little girl is not bossy, she has executive leadership skills.'"

"What would you do, if you were not afraid? When you leave GHC, whatever you want to do, go and do it!"

"Women inspire other women"

At the end, Whitney announced a partnership between the LeanIn.org Foundation and the Anita Borg Institute to create circles for women in computing.

To read more about the conversation, here are a blog post and an article:

After the opening keynote, we attended the scholarship lunch, sponsored by Walmart Labs, where Walmart staff gave short talks during the meal. After lunch, I attended the Arab Women in Computing meeting. According to Sana Odeh of New York University, the founder and chair of the organization, this was the first time the Arab Women in Computing organization had a real presence at GHC. Then I attended a couple of leadership workshops in which we formed circles and exchanged questions with expert senior women in computing, who answered questions about how to move our careers forward.

In the evening, I presented my poster entitled "Access Patterns for Robots and Humans in Web Archives" during the poster session. The poster contains an analysis of the user access patterns of web archives using the Internet Archive's Wayback Machine access logs. The full paper on this research appeared in the JCDL 2013 proceedings.

In the meantime, many famous companies, such as Google, Facebook, Microsoft, and IBM, were at the career fair. Each company had many representatives to discuss the different opportunities they have for women. A few men also attended the conference; for a man's perspective, Owyn Richen wrote a blog post titled "Grace Hopper 2013 – A Guy's Perspective", and there is another post on the Questionable Intelligence blog.

Thomson Reuters attracted many women's attention with a great promotion: they brought in caricature artists. Throughout the days of the career fair, I saw long queues of women waiting for a delightful drawing. They also had many representatives promoting the company and conducting interviews. I enjoyed being among all of these women in the career fair, which inspired me to think about how to direct my future in a way that contributes to computing and encourages many other women into computing. My advice to anyone going to GHC in future years: print many copies of your resume so you are prepared for the career fair.

On day 2, Telle Whitney gave an inspiring short talk before the second keynote began. She presented some statistics about the conference to show how fortunate we were to be among its 4,817 attendees. According to Whitney, 54 countries, 305 companies, and 402 universities were represented. She also presented the top 10 universities that brought the most students and the top 10 companies that brought the most participants to GHC 2013; the University of Minnesota led the universities and Microsoft led the companies. Here are some quotes from her talk:

"Think Big, because you can!"

"You cannot fight every battle or certainly cannot win every war, but you can stay true to who you are, by never giving up on yourself. Drive Forward."

Whitney talked about ACM's support and the partnership between ACM and ABI, then introduced John White of ACM for the opening remarks. Vint Cerf, the president of ACM, was supposed to attend but couldn't, so he created a video for the attendees about how important it is to be at GHC. He expressed his sadness that some colleagues treat women in computing badly, and he hoped to attend GHC 2014 in person to help encourage more women into the computing field.

Megan Smith, the Vice President of Google[x], gave a keynote titled "Passion, Adventure and Heroic Engineering". Before Smith came on stage, a short, inspiring video about moonshot thinking was shown; its most inspiring quote was "When you find your passion, you are unstoppable." Smith's presentation was image-oriented, letting the visuals carry her talk. She shared details about four Google[x] projects:

At the end, we were surprised by Nora Denzel, who gave an amazing talk in last year's GHC opening keynote; Dr. Michele Weigle wrote a blog post about it. Denzel talked briefly about Anita Borg's story and how that amazing woman started the organization to bring women in computing together and increase their numbers. She asked for donations to keep the Anita Borg Institute going so it can help many women every year.

I attended a couple of workshops after the break, but the highlight was an invitation-only Microsoft workshop. I had a great chance to meet senior women from many different projects at Microsoft and exchange knowledge on how to become successful leaders in our careers.

At the end of the day, the ABI award ceremony was held. Shikoh Gitau, the ABIE Change Agent Award winner, gave a very emotional talk. After that came the dancing party and entertainment. At the same time, a documentary video played about Anita Borg's life and her influence on the creation of the Anita Borg Institute and the Systers group, showing how she started these initiatives to bring women in computing together. Here is the documentary video about Anita Borg:

I spent most of the third day in the career fair. Grace Hopper not only gave me inspiration; happily, it also allowed me to meet many old friends and amazing new ones, and to discuss my research ideas with many senior women and get positive feedback. I'm pleased to have had this great opportunity to network and communicate with many great women in computing.

For more information about GHC, here are some articles and blog posts:

Monday, October 14, 2013

Last week LANL released Memento for Chrome, an extension that adds Memento capability for Chrome browsers. It represents such a leap in capability and speed that the prior MementoFox (Memento for FireFox) add-on should be considered deprecated.

It's not just a FireFox vs. Chrome thing either; Memento for Chrome features a subtle change in how it interacts with the past and present. MementoFox had a toggle switch for present vs. Time Travel mode that would trap and modify all outbound requests, from the current page and all subsequent pages until turned off, to go from the form of:

This involved some complicated logic to determine when you were getting a memento (i.e., archived web entity) vs. something from the live web. When you factored in native Memento archives vs. proxied Memento archives, things could get hairy (see the 2011 Code4Lib paper for a (dated) discussion of some of the issues). Due to differences in how they archive web pages, it was not possible to take an HTML page from archives like WebCite and Archive.is and modify all the links to go through the Memento aggregator.

Instead of a toggle switch, Memento for Chrome features a "right-click" model in which time travel is only for the next click and (from the client's point of view) is not sticky. Basically, you load the present version of "index.html", and the prior versions are accessed by right-clicking in the page or on the next link itself to pull up the option of traveling to some prior date (set separately via a calendar interface). This means the client only modifies a single request, and the subsequent requests are processed unfiltered.
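At the protocol level, "modifying a single request" means the client sends one request to a Memento TimeGate carrying an Accept-Datetime header (per Memento's datetime content negotiation); every later click goes out unmodified. Here is a minimal sketch of constructing that one request; the aggregator URL is illustrative and no network call is made:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def timegate_request(url, when,
                     aggregator="http://timetravel.mementoweb.org/timegate/"):
    """Build the single modified request a right-click triggers:
    a TimeGate URI for the target page plus an Accept-Datetime
    header in RFC 1123 format. The aggregator prefix here is an
    illustrative assumption, not necessarily what the extension uses."""
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return aggregator + url, headers

uri, headers = timegate_request(
    "http://techcrunch.com/",
    datetime(2011, 6, 20, tzinfo=timezone.utc))
assert headers["Accept-Datetime"] == "Mon, 20 Jun 2011 00:00:00 GMT"
```

The TimeGate then redirects to the memento closest to the requested datetime, after which the browser follows ordinary links inside the chosen archive.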

For most web archives, all the subsequent requests for embedded images, stylesheets, etc. will be rewritten to be relative to the archive of the parent HTML page. In other words, if you land inside the Internet Archive, then all the embedded resources will come from the Internet Archive, and all the subsequent links in the HTML to other pages will be rewritten to take you to pages inside the Internet Archive. This means sometimes you'll miss a resource that is present in another archive but not in the current archive or the target date can drift over many clicks (see Scott Ainsworth's JCDL 2013 paper on this topic), but this allows the client to run much faster. You can always choose to right-click (instead of just a regular click) to reapply the Memento time travel.

On other archives, like WebCite and Archive.is, as well as other systems like wikis, the links to other pages aren't rewritten to point back into the archive, and a regular click will pop you out of time travel mode and back to the present web. In this case, successive right-clicks are required to stay in time travel mode.

Herbert has prepared a very nice demo video that packs many features into 78 seconds. If you want to know why Memento for Chrome is really special, watch this video:

he starts at the current version of techcrunch.com, and then sets a datetime preference for June 20, 2011.

right-clicking in the current page, he chooses the option to time travel to a version near June 20, 2011 (in this case, he gets June 20, 2011 exactly, but that's not always possible)

he right-clicks on the link for gnonstop.com, and chooses to get an archived version (in this case, the archive delivers a close but not exact version of June 21, 2011).

note the archived pages for techcrunch.com and gnonstop.com both come from the Internet Archive.

to see the current version, he right-clicks and chooses "get at current time" and sees that the current version is unavailable.

from that page he right-clicks and chooses "get near current time", which is basically "get me the most recent archived version", which at the time of this video was July 8, 2011, and the archived page comes from a different Memento-enabled archive, Archive.is.

If the above is interesting to you, I recommend the longer (10 minute) video Herbert prepared with an earlier version of Memento for Chrome:

Some highlights include:

00:30 -- 1:30: a Google search is done in the present, but the first blue link is right-clicked to visit the prior version

02:15 -- 03:00: from the Google results page, a link is followed in the present, then the page is viewed in the past via a right-click

04:10 -- 05:03: shows how the client works with the SiteStory Transactional Web Archive

05:10 -- 07:00: an extended session about how it works with Wikipedia (i.e., Mediawiki)

07:20 -- 08:20: interacting with an archived 404 and resetting the date

Keep in mind this is an older version of the software, but there are enough interesting bits in the video that I think it still warrants viewing for those who care about various special cases.

P.S. Note that other Memento clients are still available, including iOS, Android, and the mcurl command line client. Though slowed by his PhD breadth exam and other obligations, Mat Kelly is still developing Tachyon, a Chrome extension with a toggle model similar to MementoFox, first developed by Andy Jackson.

Friday, October 11, 2013

Earlier this year, we were awarded an NEH Digital Humanities Start-Up Grant for our project "Archive What I See Now": Bringing Institutional Web Archiving Tools to the Individual Researcher.

We were invited to attend the NEH Office of Digital Humanities Project Directors' Meeting in early October, but due to the government shutdown, the meeting was cancelled. Here I'll give the quick overview of the project that I'd planned for that meeting. (Mat Kelly has already posted a nice description of the tools we've been developing, WARCreate and WAIL, at http://bit.ly/wc-wail.)

Our project is focused on helping people archive web pages. Since much of our cultural heritage is now published on the web, we want to make sure that important pages are archived for the future.

Since 1996, the Internet Archive and other archiving services have done great work preserving web pages. But, the Internet Archive can only do so much. What if you had a website that the Internet Archive doesn’t or can’t crawl or one that changes more frequently than they would crawl it? Until now, your solution was to archive the page yourself, either using ad-hoc methods like “Save Page As” or by attempting to install your own crawler and Wayback Machine instance.

Our partners in this project include church historians who want to allow individual churches to archive their own websites, artists who want to preserve their own sites, political scientists who want to archive conversations about elections in social media, and social scientists who want to archive conversations about disasters in social media.

There are a couple of problems here that we're addressing. First, if you want an archive of a webpage in the standard format, called a WARC, you have to install and configure some rather complex software. Second, if the webpages you want to archive are behind authentication, the crawler will not be able to access them. Third, crawl frequency is typically set ahead of time, so if you find a page that you want to archive before it changes soon, the crawl may be difficult to schedule.

So, we’ve built some tools that allow you to get around these problems. They let you “Archive”, “What I See”, “Now”. Essentially, what you see in the browser is what gets archived.

The two tools that we've developed are WARCreate and WAIL. WARCreate is a browser extension (right now for Chrome, but Firefox is coming soon) that lets you create a WARC of whatever page you’re viewing. It can be on social media, it can be a dynamic page, or it can be behind authentication. The WARC is created locally and saved on your local machine.
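For a sense of what a tool like WARCreate writes to disk, here is a deliberately simplified sketch of a single WARC/1.0 response record. Real records, per the ISO 28500 format, carry additional headers such as a payload digest, and this is not WARCreate's actual code:

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(uri, http_payload):
    """Serialize one simplified WARC/1.0 response record: WARC
    headers, a blank line, the captured HTTP response, and the
    record-terminating blank lines. Illustrative only."""
    body = http_payload.encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {datetime.now(timezone.utc):%Y-%m-%dT%H:%M:%SZ}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(body)}",
    ]
    return ("\r\n".join(headers)).encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"

record = warc_response_record(
    "http://example.com/",
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>")
assert record.startswith(b"WARC/1.0\r\n")
assert b"WARC-Type: response" in record
```

A WARC file is simply a concatenation of such records, which is why tools like wayback can replay a directory of them.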

So, now that you have a WARC, what do you do with it? Our second tool, WAIL, addresses this issue. It is a package that contains Heritrix, the Internet Archive's crawler, and wayback, the software behind the Wayback Machine. This package installs and configures the software in one click. Once WAIL is running, you can point it to a directory of WARCs that you created with WARCreate, and then you can access your archives locally using the Wayback Machine interface.

Right now, WARCreate can only archive a single page and just saves it locally. We are working on building in the ability to archive a set of pages, or a whole site, and the ability to upload the created WARC to a remote server, including a service like the Internet Archive’s Archive-It.

We hope that these two tools will be useful and can help non-IT experts archive important pages for the future.

The conference began with Herbert Van de Sompel and me giving a tutorial about ResourceSync. Attendees registered for all tutorials and were free to attend whichever one they preferred. We had as many as ten people in ours at one point, but more importantly we had some key people present who will be implementing ResourceSync in their organizations. We also received some feedback and will probably reorder the slide deck to focus more on particular cases instead of a reference list of all possible capabilities and their implementation.

The main conference began with an opening keynote from Chris Borgman reviewing the state of scholarly communication, "Digital Scholarship and Digital Libraries: Past, Present, and Future". The slides are already available, and I believe videos will eventually be posted on the TPDL Vimeo channel, but directly from her slides the best summary is: 1) Open scholarship is the norm, 2) Formal and informal scholarly communication are converging, 3) Data practices are local, and 4) Open access to data is a paradigm shift.

I had two papers in the "Aggregating and Archiving" session following the keynote, although Herbert helped me out and presented one of them. I first presented "On the Change in Archivability of Websites Over Time" (with Mat Kelly, Justin Brunelle, and Michele Weigle), and then Herbert presented "Profiling Web Archive Coverage for Top-Level Domain and Content Language" (with Ahmed AlSum, Michele Weigle, and Herbert Van de Sompel).

There was a single parallel session after lunch, followed by a panel on the EU Cooperation on Science and Technology (COST), and then the Minute Madness and poster session that evening. At the reception, they honored Ingeborg Sølvberg for her upcoming retirement. Ingeborg has been active in the community for quite some time, and Herbert and I were PC co-chairs with her for JCDL 2012.

On this topic is a great blog post from Max Kemman entitled "The Future of Libraries is in Linking". Max covers the entire TPDL from the point of view of linked data and I think he's spot on.

In the closing session, I presented two papers: "Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web" (with Hany SalahEldeen), and "Who and What Links to the Internet Archive" (with Yasmin AlNoamany, Ahmed AlSum, and Michele Weigle).

We were fortunate enough to have Yasmin's paper win "Best Student Paper"! Scott Ainsworth's paper was a nominee for this award at JCDL 2013, but this represents the first win for our research group! Congratulations to Yasmin and Ahmed!