Monday, July 27, 2015

On Wednesday, August 5, 2015 Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web". It will be held in the third floor E&CS conference room (r. 3316) at 11am. Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend. The abstract for his talk is:

A Perspective on Archiving the Scholarly Web

As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves. Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fulfillment of the archival function for web-native scholarship. Observations regarding the status quo, which largely consists of back-office processes that have their origin in paper-based communication, suggest the need for a change. The outlines of a different archival approach inspired by existing web archiving practices will be explored.

The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but by fully adopting its semantics as part of the protocol. Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now. Some highlights include:

Although Google existed, it was not the hegemonic force that it is today, and the contemporary search engines (e.g., AltaVista, Lycos) weren't that great, in terms of both precision and recall.

Related to the above, the focus in digital libraries was on repositories, not the web itself. Everyone was sitting on an SQL database of "stuff", and HTTP was seen merely as a transport with which to export the database contents. This meant that the gateway script (ca. 1999, probably written with Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").

The focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP. In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004. Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945), but... we didn't.
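As an illustration of that tunneling style: OAI-PMH expresses its protocol as "verbs" carried in query parameters of a single endpoint, rather than modeling records as distinct web resources. A minimal sketch (the `verb` and `metadataPrefix` parameters are real OAI-PMH protocol names; the repository URL and helper function are hypothetical):

```python
from urllib.parse import urlencode

def oai_pmh_request(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL: the protocol 'verb' is tunneled
    through HTTP query parameters against one gateway endpoint,
    instead of each record being its own addressable resource."""
    params = {"verb": verb, **kwargs}
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint; ListRecords/oai_dc are standard
# OAI-PMH values for harvesting Dublin Core records.
url = oai_pmh_request("http://repo.example.org/oai",
                      "ListRecords", metadataPrefix="oai_dc")
print(url)
```

The contrast with the REST/Web Architecture view is that here the interesting identifier is the *verb*, not the URI: every interaction funnels through one gateway script, which is exactly the 1999 mindset described above.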

The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport.

Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (reflecting their news syndication origins, which made them unsuitable for digital library applications).

Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links"). Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999! Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!"). Credit to CiteSeer for being the first digital library to make use of full-text (DL 1998).

Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place. It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps).

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community. As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed. I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community.

Thursday, July 23, 2015

Inspired by the "#icanhazpdf" movement and built upon the Memento service, I Can Haz Memento attempts to expand awareness of web archiving through Twitter. Given a URL (for a page) in a tweet with the hashtag "#icanhazmemento", the I Can Haz Memento service replies to the tweet with a link pointing to the archived version of the page closest to the time of the tweet. The consequence is that this archived version likely expresses the intent of the user at the time the link was shared.

Consider a scenario where Jane shares a link in a tweet to the front page of CNN about a story on healthcare. Given the fluid nature of the news cycle, at some point the healthcare story will be replaced by a fresh story; thus Jane's original link no longer represents her intent (the healthcare story), since it now points to the new story. This is where I Can Haz Memento comes into the picture. If Jane included "#icanhazmemento" in her tweet, the service would have replied to her tweet with a link representing:

An archived version (closest to her tweet time) of the front-page healthcare story on CNN, if the page had already been archived within a given temporal threshold (e.g., 24 hours), or

A newly archived version of the same page. In other words, if the page was not already archived, the service does the archiving and returns the link to the newly archived page.

Method 1: To use the service, include the hashtag "#icanhazmemento" in a tweet with the link to the page you intend to archive or retrieve an archived version of. For example, consider Shawn Jones' tweet below for http://www.cs.odu.edu:

Retrieve links from tweets with "#icanhazmemento": This is achieved using Tweepy's api.search method. The sinceIDValue is used to keep track of already visited tweets. Also, the application sleeps between requests in order to comply with Twitter's API rate limits, but not before retrieving the URLs from each tweet.
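A minimal sketch of this polling step, assuming the behavior described above (the hashtag and the since-ID bookkeeping come from the post; the helper names, regex, and sleep interval are my own, and the Tweepy `api.search` call is shown but not exercised here):

```python
import re
import time

URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(tweet_text):
    """Pull candidate URLs out of a tweet's text."""
    return URL_PATTERN.findall(tweet_text)

def poll_hashtag(api, since_id=None, delay=15):
    """Search for '#icanhazmemento' tweets newer than since_id via
    Tweepy's api.search, then sleep to respect Twitter's rate limits
    (but only after the URLs have been pulled from each tweet).
    Returns ({tweet_id: [urls]}, new_since_id)."""
    results = api.search(q="#icanhazmemento", since_id=since_id)
    urls = {t.id: extract_urls(t.text) for t in results}
    if results:
        since_id = max(t.id for t in results)  # skip already-visited tweets
    time.sleep(delay)  # comply with the API rate limit
    return urls, since_id
```

The since-ID value would be carried from one call to the next so each poll only sees tweets the service has not yet replied to.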

After the URLs in 1. have been retrieved, the following subroutine

Makes an HTTP request to the TimeGate API in order to get the Memento (an archived instance of the resource) closest to the time of the tweet (the tweet time is passed as a parameter for datetime content negotiation):
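Memento datetime negotiation works by sending the desired time in an `Accept-Datetime` request header (RFC 7089) and following the TimeGate's redirect to the closest Memento. A hedged sketch of this step, assuming the public Memento aggregator TimeGate (the function names are illustrative, not the service's actual code):

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request

# Public Memento aggregator TimeGate; the original URL is appended.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def accept_datetime_header(dt):
    """Format a datetime into the RFC 1123 form required by the
    Accept-Datetime header."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def closest_memento(url, tweet_time):
    """Ask the TimeGate for the Memento closest to tweet_time; the
    negotiated Memento URI is reached by following the redirect."""
    req = urllib.request.Request(
        TIMEGATE + url,
        headers={"Accept-Datetime": accept_datetime_header(tweet_time)},
        method="HEAD")
    with urllib.request.urlopen(req) as resp:  # urllib follows redirects
        return resp.url

# Example header value for a tweet sent at noon UTC on July 23, 2015:
print(accept_datetime_header(datetime(2015, 7, 23, 12, 0,
                                      tzinfo=timezone.utc)))
```

If no Memento exists within the threshold, the service would instead push the page to an on-demand archive and reply with that new URI, as described above.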

Gerhard Gossen started the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily at researchers and journalists. The demonstration illustrated how to search the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure basic crawling policies, and finally start or schedule the crawl.

Ian Milligan presented his short talk on "Finding Community in the Ruins of GeoCities: Distantly Reading a Web Archive". He introduced GeoCities and explained why it matters. He illustrated preliminary exploration of the data, such as images, text, and topic extraction. He announced plans for a Web Analytics Hackathon in Canada in 2016 based on Warcbase and is looking for collaborators. He expressed the need for documentation for researchers. Acknowledging the need for context, he said, "In an archive you can find anything you want to prove; you need to contextualize to validate the results."

Zhiwu Xie presented a short talk on "Archiving the Relaxed Consistency Web", focused on the inconsistency problem seen mainly in crawler-based archives. He described the illusion of consistency on distributed social media systems and the role of timezone differences. They found that newer content is more inconsistent; in a simulation, more than 60% of timelines were found to be inconsistent. They propose proactive redundant crawls and compensatory estimation of archival credibility as potential solutions to the issue.

Martin Lhotak and Tomas Foltyn presented their talk on "The Czech Digital Library - Fedora Commons based solution for aggregation, reuse, dissemination and archiving of digital documents". They introduced three main digitization areas in the Czech Republic - Manuscriptorium (early printed books and manuscripts), Kramerius (modern collections from 1801), and WebArchiv (digital archive of Czech web resources). Their goal is to aggregate all digital library content from the Czech Republic under the Czech Digital Library (CDL).

Todd Suomela presented "Analytics for monitoring usage and users of Archive-It collections". The University of Alberta has been using Archive-It since 2009, and now has 19 different collections, of which 15 are public. Their collections are proposed by the public, faculty, or librarians; each proposal then goes to the Collection Development Committee for review. Todd evaluated user activity (using Google Analytics) and the collection management aspects of the UA Digital Libraries.

After the lightning talks were over, workshop participants took a break and looked at the posters and demonstrations associated with the lightning talks above.

During the lunch break, awards were announced, and our WS-DL research group secured the Best Student Paper and the Best Poster awards. While some people were still enjoying their lunch, Dr. J. Stephen Downie presented the closing keynote on the HathiTrust Digital Library. I learned a lot more about the HathiTrust, its collections, how they deal with copyright and (not so) open data, and their mantra, "bring computing to the data", for the sake of fair use of copyrighted data. Finally, there were announcements about next year's JCDL conference, which will be held in Newark, NJ from June 19 to 23, 2016. After that we assembled again in the workshop room for the remaining sessions of WADL.

Robert Comer and Andrea Copeland together presented "Methods for Capture of Social Media Content for Preservation in Memory Organizations". They talked about preserving personal and community heritage, outlining the issues and challenges in preserving the history of social communities and the problem of preserving social media in general. They are working on a prototype tool called CHIME (Community History in Motion Everyday).

Mohamed Farag presented his talk on "Building and Archiving Event Web Collections: A focused crawler approach". Mohamed described the current approaches to building event collections: 1) manually, which leads to high quality but requires a lot of effort, and 2) via social media, which is quick but may result in potentially low-quality collections. Seeking a balance between the two approaches, they developed an Event Focused Crawler (EFC) that retrieves web pages similar to the curator-selected seed URLs with the help of a topic detection model. They have made an event detection service demo available.

Zhiwu Xie presented "Server-Driven Memento Datetime Negotiation - A UWS Case". He described the Uninterruptible Web Service (UWS) architecture, which uses Memento to provide continuous service even if a server goes down. He then proposed an amendment to the Memento protocol workflow, using server-driven content negotiation instead of the agent-driven approach, to improve the efficiency of UWS.

Luis Meneses presented his talk on "Grading Degradation in an Institutionally Managed Repository". He motivated his talk by saying that degradation in a data collection is like a library of books with missing pages. He illustrated examples from his testbed collection to introduce nine classes of degradation, from the least damaged to the worst: 1) kind of correct, 2) university/institution pages, 3) directory listings, 4) blank pages, 5) failed redirects, 6) error pages, 7) pages in a different language, 8) domain for sale, and 9) deceiving pages.

The last speaker of the session, Sawood Alam (your author), presented "Profiling Web Archives". I briefly described the Memento Aggregator and the need to profile the long tail of archives to improve the aggregator's efficiency. I described various profile types and policies, analyzed their cost in terms of space and time, and measured the routing efficiency of each profile. I also discussed the serialization format and scale-related issues such as incremental updates. I took advantage of being the last presenter of the workshop and kept the participants away from their dinner longer than I was supposed to.

Thanks Mat for your efforts in recording various sessions. Thanks Martin for the poster pictures. Thanks to everyone who contributed to the WADL 2015 Group Notes, it was really helpful. Thanks to all the organizers, volunteers and participants for making it a successful event.

Thursday, July 2, 2015

Large, Dynamic and Ubiquitous –The Era of the Digital Library

JCDL 2015 (#JCDL2015) took place at the University of Tennessee Conference Center in Knoxville, Tennessee. The conference was four days long, June 21-25, 2015. This year three students from our WS-DL CS group at ODU had their papers accepted, as well as one poster (see trip reports for 2014, 2013, 2012, 2011). Dr. Weigle (@weiglemc), Dr. Nelson (@phonedude_mln), Sawood Alam (@ibnesayeed), Mat Kelly (@machawk1) and I (@LulwahMA) went to the conference. We drove from Norfolk, VA. Four previous members of our group, Martin Klein (UCLA, CA) (@mart1nkle1n), Joan Smith (Linear B Systems, Inc., VA) (@joansm1th), Ahmed Alsum (Stanford University, CA) (@aalsum) and Hany SalahEldeen (Microsoft, Seattle) (@hanysalaheldeen), also met us there. The trip was around 8 hours. We enjoyed the mountain views and the beautiful farms. We also caught parts of a storm on our way, but it only lasted for two hours or so.

The first day of the conference (Sunday, June 21, 2015) consisted of four tutorials and the Doctoral Consortium. The four tutorials were: Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research; Introduction to Digital Libraries; Digital Data Curation Essentials for Data Scientists, Data Curators and Librarians; and Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories. Mat Kelly (ODU, VA) (@machawk1) covered the Doctoral Consortium.

The main conference started on Monday, June 22, 2015 and opened with Paul Logasa Bogen II (Google, USA) (@plbogen). He started by welcoming the attendees and mentioned that this year there were 130 registered attendees from 87 different organizations, 22 states, and 19 different countries.

Then the program chairs, Geneva Henry (University Libraries, GWU, DC), Dion Goh (Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore) and Sally Jo Cunningham (Waikato University, New Zealand), followed with announcements and the acceptance numbers for JCDL 2015: 18 full research papers (30%), 30 short research papers (50%), and 18 posters and demos (85.7%) were accepted. Finally, the nominees for best student paper and best overall paper were announced.

Then Unmil Karadkar (The University of Texas at Austin, TX) introduced the keynote speaker, Piotr Adamczyk (Google Inc., London, UK). Piotr's talk was titled “The Google Cultural Institute: Tools for Libraries, Archives, and Museums”. He presented some of Google's attempts to add to the online cultural heritage. He introduced the Google Cultural Institute website, which consists of three main projects: the Art Project, Archive (Historic Moments) and World Wonders. He showed us the Google Art Project (link on YouTube: Google Art Project) and then introduced an application to search museums, navigate, and look at art. Next, he introduced Google Cardboard (links on YouTube: ”Google Cardboard Tour Guide” and “Expeditions: Take your students to places a school bus can’t”), with which you can explore different museums by looking into a cardboard viewer that houses the user's electronic device. He mentioned that more museums are allowing Google to capture images of their galleries for exploration via Google Cardboard, and that Google would like to further engage with cultural partners. His talk was similar to a talk he gave in 2014 titled "Google Digitalizing Culture?".
Then we started off with two simultaneous sessions, "People and Their Books" and "Information Extraction". I attended the second session. The first paper was “Online Person Name Disambiguation with Constraints”, presented by Madian Khabsa (PSU, PA). The goal of this work is to map name mentions to real-world people. They found that 11%-17% of queries in search engines are personal names. He mentioned two issues that had not been addressed: adding constraints to the clustering process and adding data incrementally without re-clustering the entire database. A challenge they faced was redundant names. Constraints can be useful in a digital library, where users can make corrections. Madian described a constraint-based clustering algorithm for person name disambiguation.

Sawood Alam (ODU, VA) (@ibnesayeed) followed Madian with his paper “Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages”. He mentioned that general online book readers are not suitable for scanned dictionaries. They proposed an approach of indexing the scanned pages of a dictionary that enables direct access to the appropriate pages on lookup. He implemented an application called Dictionary Explorer, in which he indexed monolingual and multilingual dictionaries at a speed of over 20 pages per minute per person.

After the Research at Google-sponsored banquet lunch, Sally Jo Cunningham (University of Waikato, NZ) introduced the first panel, "Lifelong Digital Libraries", and its first speaker, Cathal Gurrin (Dublin City University, Ireland) (@cathal). His presentation was titled "Rich Lifelong Libraries". He started off with using wearable devices and information loggers to automatically record your life in detail. He gave examples of devices currently on the market that record every moment, such as Google Glass or Apple’s iWatch. He himself has gathered a digital memory since 2006 by using a wearable camera. The talk was similar to one he gave in 2012 titled "The Era of Personal Life Archives".

The second speaker was Taro Tezuka (University of Tsukuba, Japan). His presentation was titled "Indexing and Searching of Conversation Lifelogs". He argued that search capability is as important as storage capability in lifelog applications, and that clever indexing of recorded content is necessary for implementing useful lifelog search systems. He also showed LifeRecycle, a system for recording and retrieving conversation lifelogs: first record the conversation, then apply speech recognition, then store the result, and finally search and show the results. He mentioned that the challenges facing a person who consents to being recorded are security and privacy issues.

The last speaker of the panel was Håvard Johansen (University of Tromso, Norway). He started with definitions of lifelogs, then discussed the use of personal data for sports analytics and how to construct privacy-preserving lifelogging. After the third speaker, the audience asked about and discussed privacy issues that may concern lifelogging.

The third and fourth sessions were simultaneous as well. The third session was "Big Data, Big Resources". The first presenter was Zhiwu Xie (Virginia Tech, VA) (@zxie) with his paper “Towards Use And Reuse Driven Big Data Management”, which focused on integrating digital libraries and big data analytics in the cloud. He then described the system model and its evaluation.
Next, “iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling” was presented by Gerhard Gossen (L3S Research Center, Germany) (@GerhardGossen). iCrawl combines crawling of the Web and social media for a specified topic. The crawler collects web and social content in a single system and exploits the stream of new social media content for guidance. The target users for this web crawling toolbox are web-science and qualitative humanities researchers. The approach is to start with a topic and follow its relevant outgoing links.

G. Craig Murray (Verisign Labs) presented in place of Jimmy Lin (University of Maryland, College Park) (@lintool). The paper was titled “The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi”. Craig discussed how useful it is to have a Wikipedia that you can access without the Internet by connecting to a Raspberry Pi device via Bluetooth or WiFi. He passed the Raspberry Pi device around the audience and allowed them to connect to it wirelessly. The device is considered better than Google in that it offers offline search and full-text access, full control over the search algorithms, and private search. The advantage of the data being on a separate device instead of on the phone is that it is cheaper per unit of storage and offers a full Linux stack and hardware customizability.
The last presenter in this session was Tarek Kanan (Virginia Tech, VA), presenting “Big Data Text Summarization for Events: a Problem Based Learning Course”. Problem/project Based Learning (PBL) is a student-centered teaching method where student teams learn by solving problems. In this work, 30 students in 7 teams applied big data methods to produce corpus summaries. They found that PBL helped students in a computational linguistics class automatically build good text summaries for big data collections. The students also learned many of the key concepts of NLP.
The fourth session, which I missed, was "Working the Crowd"; Mat Kelly (ODU, VA) (@machawk1) recorded the session.

After that, the conference banquet was served at the Foundry on the fair site.

On Tuesday, June 23, 2015, after breakfast, the keynote was given by Katherine Skinner (Educopia Institute, GA). Her talk was titled “Moving the needle: from innovation to impact”. She discussed how to engage others to make use of digital libraries and archiving, getting out there and being the important factor to the community that we should be. She asked what digital libraries could accomplish as a field if we shifted our focus from "innovation" to "impact".
After that, there were two other simultaneous sessions, "Non-Text Collection" and "Ontologies and Semantics". I attended the first session, where one long paper and four short papers were presented. The first speaker in this session was Yuehan Wang (Peking University, China). His paper was “WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document”. The speaker discussed the challenges of extracting mathematical formulae from their different representations. They propose an upgraded mathematical information retrieval system named WikiMirs 3.0. The system can extract mathematical formulas from PDFs and supports typed queries. The system is publicly available at: www.icst.pku.edu.cn/cpdp/wikimirs3/.
Next, Kahyun Choi (University of Illinois at Urbana-Champaign, IL) presented “Topic Modeling Users’ Interpretations of Songs to Inform Subject Access in Music Digital Libraries”. Her paper addressed whether topic modeling can discover subjects from interpretations, and how to improve the quality of topics automatically. Their dataset was extracted from songmeanings.com and contained almost four thousand songs with at least five interpretations per song. Topic models were generated using Latent Dirichlet Allocation (LDA), and the normalization of the top ten words in each topic was calculated. For evaluation, a sample was manually assigned to six subjects, and they found 71% accuracy.

Frank Shipman (Texas A&M University, TX) presented “Towards a Distributed Digital Library for Sign Language Content”. In this work they try to locate content relating to sign language over the Internet. They describe the software components of a distributed digital library of sign language content, called SLaDL, which detects sign language content in videos.

The final speaker of this session was Martin Klein (UCLA, CA) (@mart1nkle1n), presenting “Analyzing News Events in Non-Traditional Digital Library Collections”. In this work they identified indicators relevant for building non-traditional collections. From two collections, an online archive of TV news broadcasts and an archive of social media captures, they found that there is an 8-hour delay between social media and TV coverage, which continues at a high frequency level for a few days after a major event. In addition, they found that news items have the potential to influence other collections.

The session I missed was "Ontologies and Semantics", the papers presented were:

After lunch, there were two other simultaneous sessions, "User Issues" and "Temporality". I attended the "Temporality" session, where there were two long papers. The first was presented by Thomas Bogel (Heidelberg University, Germany), titled “Time Will Tell: Temporal Linking of News Stories”. Thomas presented a framework to link news articles based on the temporal expressions that occur in them. This work recovers the arrangement of events covered in an article; in the big picture, a network of articles will be temporally ordered.
The second paper was “Predicting Temporal Intention in Resource Sharing”, presented by Hany SalahEldeen (ODU, VA) (@hanysalaheldeen). Pages linked from tweets can change over time and might no longer match the user's intention. In this work they enhanced a prior temporal intention model by adding linguistic feature analysis and semantic similarity, and by balancing the training dataset. The current model achieves 77% accuracy in predicting the intention of the user.

Next, there was a panel on “Organizational Strategies for Cultural Heritage Preservation”. Paul Logasa Bogen, II (Google, WA) (@plbogen) introduced the four speakers: Katherine Skinner (Educopia Institute, Atlanta), Stacy Kowalczyk (Dominican University, IL) (@skowalcz), Piotr Adamczyk (Google Cultural Institute, Mountain View) and Unmil Karadkar (The University of Texas at Austin, Austin) (@unmil). The panel discussed preservation goals, the challenges organizations face in practicing preservation, centralized versus decentralized preservation, and how to balance these approaches. In the final minutes there were questions from the audience regarding privacy and ownership in cultural heritage collections.

Following that was Minute Madness, a session where each poster presenter had two chances (60 seconds, then 30 seconds) to talk about their poster in an attempt to lure attendees to come by during the poster session.

Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author) presented “How Well Are Arabic Websites Archived?”. In this work we focused on determining whether Arabic websites are archived and indexed. We collected a sample of Arabic websites and discovered that 46% of the websites are not archived and that 31% are not indexed. We also analyzed the dataset and found that almost 15% had an Arabic country-code top-level domain and almost 11% had an Arabic geographical location. We recommend that if you want an Arabic webpage to be archived, you should list it in DMOZ and host it outside an Arabic country.

Then Jacob Jett (University of Illinois at Urbana-Champaign, IL) presented his paper “The Problem of “Additional Content” in Video Games". In this work they first discuss the challenges that video games face nowadays due to additional content such as modifications and downloadable content. They propose addressing these challenges by capturing this additional content.