About Matt Senate

Matt is passionate about Open Access and Access to Knowledge in general. As a supporter of and participant in the Open Science and Free Culture movements, Matt recognizes the importance of an alliance among all disciplines (humanities and social sciences included) to fulfill a moral imperative to share knowledge and information as far and wide as possible. At PLOS, Matt is a web developer on the web production team.

Protip: by default, PubMed retrieves articles indexed under a descriptor or any of its descendants. Prevent this with the mesh:noexp flag (“please don’t explode”).
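To see how the two queries differ, here is a minimal sketch that builds NCBI E-utilities esearch URLs for an exploded and an unexploded MeSH search (the endpoint and field-tag syntax are standard E-utilities usage; the example term is just an illustration):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term):
    """Build an E-utilities esearch URL for a PubMed query term."""
    return EUTILS + "?" + urlencode({"db": "pubmed", "term": term})

# Default: the descriptor is "exploded" -- retrieves articles indexed
# under Hypertension or any of its narrower descendant descriptors.
exploded = pubmed_search_url('"hypertension"[mesh]')

# With noexp, only articles indexed directly under the descriptor match.
unexploded = pubmed_search_url('"hypertension"[mesh:noexp]')

print(exploded)
print(unexploded)
```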

Synonymy and clustering are built into MeSH. Bodenreider explained that NLM is experimenting with automatic indexing to improve MeSH, which has historically been maintained manually. Approaches include a hybrid method that extracts concepts from the title and abstract and then maps them from UMLS to MeSH, as well as extracting MeSH descriptors from related citations. Read about the Medical Text Indexer, which includes a diagram of NLM’s process; about 3,600 new citations are processed every weeknight. A visualization of the vocabulary is also available.

Javier Lacasta discussed his work converting thesauri into ontologies, including work that supports search prototypes built on thesauri from the European Urban Knowledge Network and Urbamet. Mapping a thesaurus directly onto the WordNet vocabulary was not possible on its own; it required additional heuristics. For example, “Educational Building” relates to “Primary School” and “High School” in the thesaurus and remaps to “School” in WordNet. However, WordNet maintains roughly seven senses of “School”, including “a large group of fish”. One solution is to use context: when “Water sport” matches WordNet’s “Water sport”, a child of “Sport”, then a thesaurus sibling of “Water sport” such as “Winter sport”, which has no WordNet match of its own, can be mapped into the tree of “Water sport”, namely to “Sport”. This avoids the ambiguity of WordNet’s alternative senses of “Sport”, such as a mutant biological organism. Lacasta also used term properties to define relationships in the hierarchy. The work tested reasonably well, with low error rates, and resulted in a map based on Urbamet data.
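The context heuristic can be sketched roughly as follows. This is a toy Python illustration with hypothetical dictionaries, not Lacasta's actual implementation: when a thesaurus term has no direct WordNet match, it falls back to the WordNet hypernym of a matched sibling.

```python
# Thesaurus hierarchy: parent -> children (hypothetical toy data)
thesaurus = {"Sport": ["Water sport", "Winter sport"]}

# Direct thesaurus-term -> WordNet-sense matches found so far
wordnet_match = {"Water sport": "wordnet:Water_sport"}

# WordNet hypernyms: sense -> parent sense
wordnet_parent = {"wordnet:Water_sport": "wordnet:Sport"}

def map_term(term):
    """Map a thesaurus term to a WordNet sense, using siblings as context."""
    if term in wordnet_match:
        return wordnet_match[term]
    for parent, children in thesaurus.items():
        if term in children:
            for sibling in children:
                if sibling in wordnet_match:
                    # Anchor on the matched sibling's WordNet hypernym,
                    # sidestepping ambiguous standalone senses of the term.
                    return wordnet_parent[wordnet_match[sibling]]
    return None

print(map_term("Winter sport"))  # falls back via its sibling "Water sport"
```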

Ágnes Sándor investigated “knowledge-level claims” in scholarly research (with special credit to Frederique Lisacek, Simon Buckingham Shum, and Anna de Liddo). Specifically, Sándor searches for phrases, or “rhetorical formulas”, that denote contrast, open questions, emerging tendencies, and so on, such as “In contrast with previous hypotheses”, “but its function is poorly understood”, and “emerging as a promising approach”, respectively. Further, certain facts stated in an article may be more or less important, which is potentially discernible from such phrases. Sándor et al. use an incremental parser with pre-defined grammars to find these elements, for instance when detecting “paradigm shifts”, or “claimed knowledge updates”. For the European Educational Research Quality Indicators (EERQI) project, they applied the same tool to social science work in English, French, German, and Swedish. The team even compared human and machine annotation of open educational research, and one example of very high accuracy was given. As these analyses continue, we will hopefully have a chance to isolate both rhetorical tools and scientific facts. Sándor intends to launch an open web service on open.xerox.com so that others can use the tool.
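As a rough illustration only: Sándor's team uses an incremental parser over pre-defined grammars, but even a naive keyword pass conveys the idea of matching rhetorical formulas to categories (the cue lists below are invented for the example):

```python
import re

# Invented cue phrases for three of the rhetorical categories mentioned
# above; the real system uses grammars, not flat pattern lists.
CUES = {
    "contrast": [r"\bin contrast with previous\b", r"\bhowever\b"],
    "open_question": [r"\bpoorly understood\b", r"\bremains unclear\b"],
    "emerging_tendency": [r"\bemerging as a promising\b"],
}

def tag_rhetoric(sentence):
    """Return the rhetorical categories whose cue phrases the sentence contains."""
    found = []
    for category, patterns in CUES.items():
        if any(re.search(p, sentence, re.IGNORECASE) for p in patterns):
            found.append(category)
    return found

print(tag_rhetoric("In contrast with previous hypotheses, "
                   "its function is poorly understood."))
```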

Similar work has been used for summarization and citation analysis, among other projects. Luckily, scientific jargon is static enough that it was not difficult to create a fairly complete list of terms and phrases; the list does not seem to grow linearly with the corpus. During questions, Cameron Neylon asked about working in the reverse direction: rather than parsing existing texts, could the tool help authors generate works that follow a rhetorical methodology or grammar? For now, Sándor explained, they do not prescribe, though it is an interesting application.

Anita de Waard noted she is one of only four women presenters, and that she is incorrectly listed as “Dr.”, as she does not yet have her PhD. De Waard notes that there are “claimed knowledge updates”, as referenced in Sándor’s presentation (above), but these often cite data that may not be accessible or sufficient to justify the claim. There are many data preservation initiatives, but scientists’ workflow is simply insular: data passes hands all over the lab, changes formats (especially non-digital ones), and potentially gets lost along the way, especially in biology. Of the data behind perhaps one million papers per year, the majority (90%?) likely sits on local hard drives, perhaps 8% in large, generic data repositories, and some 1–2% in small, focused data repositories. The path is clear for her group at Elsevier: increase data digitization, improve data usability, improve repository interoperability, and develop sustainable models for the entire ecosystem. Elsevier launched a pilot program with the CMU Urban Legend app, turning a paper-based lab into an electronic lab for better data capture. Researchers are pulled in four different directions, asked to submit to a domain-specific data repository by their community, a local data repository by collaborators and research group members, a large data repository like Dryad by funding agencies, and an institutional data repository by universities – and they all want different metadata! There needs to be a better way.

Poster Sessions (See bottom).

This plenary covered Research Data. Dr. Wolfram Horstmann explored the complexities of data policies at every level. International organizations, governments, associations, universities (and departments), funders, publishers, and journals can all set policies establishing requirements for the expected results of academic research. Horstmann focused on the groups from whom we currently see policies: funders, publishers, and journals. He concludes that some policies are aspirational and are necessary “Zeitgeist” works, while others are practical, “Sea change” style policies that include more methodology, infrastructure, advice, and funding components.

Donatella Castelli shared her investigations into data interoperability, especially across the heterogeneous data sources required for different tasks. One of Castelli’s insights was that data infrastructures can benefit from shared solutions. Overall, she isolates two types of issues: (a) how to make data available and (b) how to provide tools to interact with the data. One of the major problems for repositories and repository managers is providing the proper formats for various scientists’ needs, such as for the different applications and services they require. Castelli offered iMarine as an example of trying to meet these needs.

Kevin Ashley spoke on data quality and curation. His point was that quality means very different things to different people, each with different needs: provenance, timing, accuracy (which is already complex). As an example, Ashley described a government dataset on companies that contains some invalid data; you probably don’t want data with dates like Feb 31 in it. If social scientists clean it up so it’s usable for their purposes, it doesn’t necessarily meet every researcher’s needs. If researchers want to analyze what data the government holds, you need a copy of the original set, and potentially a mapping of any changes and who made them. These needs can be very different. Data is usually focused on one kind of use case and one data provider, so how do you rethink questions of quality so you can provide what people want when their wants differ? The problems are around resources and costs, in addition to the trade-off between provenance and accuracy. We need machine-readable change information, and to cater more for provenance.

Tim Smith shared CERN’s experience and its swath of tools for managing a global information infrastructure that stores, transfers, analyzes, and archives massive amounts of data. Storing all detected data at CERN would mean writing about one petabyte per second. Few places can store even the data they do save, requiring distributed data management on the Worldwide LHC Computing Grid. Smith uses an acronym for their service: “AAA – Any data, Any where, Any time… Almost.” CERN’s data needs are so big that the network used for data transfer has become a resource that must be scheduled. The CERN Document Server runs on the open source software Invenio and already handles about 30 TB. Smith and colleagues at CERN write lots of open source software, including collaborating on Zenodo.

Paul Groth opened with a simple ask for publishers, and various content producers relevant to scholarly research, to add popular metadata formats to their pages: for instance Twitter, Open Graph, RDF, COinS, microformats, and various others. Curious about your page’s metadata? Run it through Any23 to preview the data it may already carry. Also, follow David Shotton’s work on his blog. Beyond a little metadata, Groth shared visions of an article-focused, web-developer-friendly, even journal-free future. Follow developments on data2semantics.org, including one project, LinkItUp.
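As a sketch of what that embedded metadata might look like, here is a hypothetical helper that emits Open Graph, Twitter Card, and Highwire-style citation tags for an article page (the tag names are real conventions; the helper and its inputs are illustrative):

```python
def article_meta_tags(title, url, doi):
    """Render <meta> tags for an article page.

    (attribute, key, value) triples: Open Graph uses property=, while
    Twitter Cards and Highwire citation tags use name=.
    """
    tags = [
        ("property", "og:type", "article"),
        ("property", "og:title", title),
        ("property", "og:url", url),
        ("name", "twitter:card", "summary"),
        ("name", "twitter:title", title),
        ("name", "citation_doi", doi),  # picked up by Google Scholar
    ]
    return "\n".join(f'<meta {attr}="{key}" content="{val}" />'
                     for attr, key, val in tags)

print(article_meta_tags(
    "An Example Article",
    "https://example.org/article/1",
    "10.1234/example.1",
))
```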

Robert Sanderson followed with updates on, and descriptions of, the W3C’s Open Annotation project. In particular, he highlighted the existing Open Annotation Community Group and encouraged others to join. Sanderson also shared several diagrams of the Open Annotation data model, with examples of its potential to represent annotations: comments and so on. He clarified that the intention behind Open Annotation is to use a simple JSON implementation to make the model practical and accessible. Learn more about his Memento project as well.
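For a sense of that simplicity, a minimal annotation in the Open Annotation JSON-LD serialization might look like the following (field names follow the community draft's body/target model; the target URL and comment text are hypothetical):

```python
import json

# A bare-bones Open Annotation: a textual body attached to a target URI.
annotation = {
    "@context": "http://www.w3.org/ns/oa-context-20130208.json",
    "@type": "oa:Annotation",
    "hasBody": {
        "@type": "cnt:ContentAsText",
        "chars": "A comment on the second paragraph.",
    },
    "hasTarget": "http://example.org/article1#paragraph2",
}

print(json.dumps(annotation, indent=2))
```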

Henry Thompson rounded out the plenary with an analysis of naming on the web. He noted how auspicious it is to discuss the future of the web for scholarly research miles from its birthplace at CERN. Thompson points out that the web is tragically limited at the link layer: we still use links that often break, with few alternatives for finding the source content (e.g. archive.org’s Wayback Machine). Interestingly, rather than pointing to purely technical options for improving naming on the web (like refactoring DNS), Thompson suggests codifying a sort of social contract for web linking: agree on a socially moderated, non-repudiable binding mechanism for scholarly references in URIs, and provide a companion fallback resolution process (i.e. w3c.org should never disappear due to non-renewal of its domain name).

Plenary 2 on Metrics covered analysis of scholarly work before and after publication

Johan Bollen stated that science is about “ideas not bricks”, with the assumption that not all ideas are equally valuable (he claims there are bad ideas that should not be communicated). Science is much like a gift economy, where sharing freely is positive and you are rewarded through social recognition of that sharing, as is clear in the case of citations. The impact factor is too simple to possibly suffice as a reasonable metric for the value of scholarly communication. Citation, usage, and social media data can constitute counts, normalized counts, social network metrics, and even trends; and the granularity expands beyond the journal to category, region, university, author(s), etc. Additionally, there are various algorithms for generating metric values from citations (e.g. random walks, PageRank, Eigenfactor). Overall, Bollen is nervous about applying these new metrics to science and wants to separate funding from the metrics of scientific impact.
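To make one of those algorithms concrete, here is a toy power-iteration PageRank over a tiny, invented citation graph (illustrative only, not Bollen's implementation):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {paper: [papers it cites]}; returns a score per paper."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - damping) / n for node in nodes}
        for node, cited in links.items():
            if cited:
                # A paper's rank flows equally to everything it cites.
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new[target] += share
            else:
                # Dangling node (cites nothing): spread its rank evenly.
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank

citations = {"A": ["C"], "B": ["C"], "C": []}
scores = pagerank(citations)
print(max(scores, key=scores.get))  # the paper cited by both others ranks highest
```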

Euan Adie is the founder of altmetric.com, a small London start-up funded by the Macmillan Group, owners of Nature Publishing Group. Altmetric the company is distinct from altmetrics, the broader field. Altmetric builds a big dataset of social media posts that mention academic works. There are some gaps in the data, including fringe cases where they cannot pull URLs from a PDF, etc. Altmetric’s overall goal is to collect the data, but they want other people to pull the data and build applications with it. Contact Euan, especially if you’re a non-profit, or browse their site to collaborate.

Jelte Wicherts reviews for a wide array of journals, mostly in his field of psychology. He points out that, especially for young Open Access journals, becoming stable or profitable is a primary motivating factor, which may drive them to accept more papers and in turn affect peer review. Wicherts shared his experiences of reviewing for journals without a transparent process: not knowing whether there were other reviewers, or who eventually approved the articles. BioMed Central’s policy of publishing peer review reports along with the paper works really well. Jeffrey Beall has been investigating predatory publishers and is worth following to learn more. OASPA and COPE share guidelines for OA journals.

Despite some technical difficulties (more on this below), the house was packed, our lightning talk speakers gave a great showing, and group discussions carried on, engulfing participants into the evening.

So what happened behind the scenes? What did it take to hold a good event?

Preparation

A handful of us on PLOS staff previously coordinated a few hackathons and a salon event on Open Access movement-building. This put us in a great position to organize another event. First-time coordination may take a bit longer, but it also offers the most eye-opening experiences and an opportunity to strengthen your team. Do what works for your community and your needs as organizers of any event–you’re experts at what you do, who you work with, and how to collaborate. Focus on the outcomes you’d like to see from the event, translate these into goals, isolate tasks to accomplish those goals, and divvy accordingly to complete them in time.

Bureaucracy Hacking

All institutions have bureaucracy–and it’s a good thing! Formal processes can help ensure there are pathways to get things done, while various needs are addressed along the way for collective benefit, ultimately storing insight and knowledge for later use as culture and routine. “Bureaucracy hacking” means starting the right conversations, bringing folks on-board, and getting through check-lists of needs and steps to get you to your goal as best (and quickly) as possible: hosting the event.

Many Hands Make for Light Work

Sometimes it is difficult to incorporate lots of folks into a process. However, with a little bit of elbow grease, it can be quite easy and rewarding. Reaching out to the whole office brought in co-organizers with different skills (great to have on-hand), new staffers with fresh perspectives, and a few lurkers with a bit too much on their plate to get involved (but good to have their eyes and ears around for feedback). We held a brief kick-off meeting, reviewed the opportunity of hosting the event, shared our thoughts about the potential costs and benefits, and came to consensus on a plan to move forward–including precise action items and corresponding owners for each item.

Vindicating the wide invitation to participate, a new staffer, Angela Melkisethian, brought a great boon to the effort. Aware of the various moving parts (i.e. that a few of us were charged with seeking speakers), Angela reached out to science journalist Annalee Newitz of io9 after attending a book signing event – and Annalee accepted. This expanded our view of available speakers and helped our group increase its capacity to communicate and coordinate. A few emails later, we had another lightning talk set.

Accountability and Progress

With our explicit list of To-Dos, it wasn’t hard to send quick reminders, ask for follow-up or help on individual items, and keep everyone abreast of overall progress. We met a second time as the event approached, and tackled new needs like sensitively finalizing the agenda (addressing the trade-off between listening to speakers and holding discussion), ensuring good estimates for food and beverages, marketing through social media, and inviting guests personally who would likely be interested in the topic area. A major factor that took the edge off marketing was hosting a meetup of an already-bustling group in the bay area–the Open Knowledge Foundation community. Demand was (and remains) high for these sorts of gatherings!

The Main Event

On event day, we were all set. There were a few final items to wrap-up: checking in about previous tasks, especially logistically weighty components like food and beverages (quantities and scheduling). Given our early planning and good collaboration, things were in working order. Everyone who was available to set up gathered to move furniture before the event began, reconfiguring the space to suit our needs.

Nothing is Perfect

At PLOS’s SF office, we have a large conference room with a glass wall facing a reception area. For staff meetings, some sit inside the room and we line up chairs outside to watch presentations through the glass. For a simple audio solution, we phone in to the room’s telephone and use a speakerphone to hear.

It’s not perfect, but sufficient for a presentation’s purposes. Unfortunately, we set the room up with a regular telephone instead of swapping out for a proper speakerphone.

Lesson 1: Always test your A/V equipment; a dry run will go a long way.

Lesson 2: A/V and IT needs are sometimes taken for granted. It’s always better to have at least one volunteer on hand, ready to troubleshoot problems as the show goes on.

Rolling with the Punches

When it came time to bring everyone together, start the talks, and get on with the evening, the audio constraint presented itself visibly but quietly (through glass). After a bit of trial and error, we made do with a few scratchy announcements over the phone, then opened doors to let some sound out. Ultimately, an audience member suggested the speakers stand in one of the doorways, straddling both crowds, permitting audible presentations with sufficient visibility – and it worked quite well (as you can see in the photo below)!

Overall, the event was a tremendous success, the likes of which we hope to see again at PLOS, at new places with new hosts in the bay area, and in Open Knowledge communities elsewhere. Mishaps, mistakes, and snafus are inevitable, but more importantly they are an opportunity to connect with participants, to win them over in a moment when the wall—between presenter/viewer, organizer/attendee—becomes both clear and reflective, like glass.

PLOS is now in its 10th year of striving to make science research available to the public. As a staff-written and -moderated blog, PLOS Tech seeks to share not only research, but the whats, hows, and whys of facilitating scholarly communication. This is possible through the wisdom and insights of our software development, web production, and product development teams—among many others.

Goals for PLOS Tech

Build a community around sharing insights, techniques, and tools for scholarly communication, especially for science.

Create an opportunity for deeper technical feedback from the wider scholarly community.

What is technology?

Our coverage of technology is inclusive, meaning everything from evaluation of software tools to practical use of theoretical frameworks, information process construction to best practices and clever efficiencies, vital discussion of trends and patterns to critical analysis of buzzwords and memes.

We aim to provide insight accessible to both the technical and the not-so-technical among us, and we encourage any and all feedback via the comment sections and the rest of the web.

Coinciding with the launch of PLOS Tech, the bay area Open Knowledge community is invited to a casual meetup with a few lightning talks and targeted discussion in relevant topic areas. Join us at the SF headquarters of PLOS; some snacks and beverages will be supplied. The evening will include updates from PLOS on California state legislation on Open Access (AB 609), updates from OA at Berkeley by Angelica Tavella and Mitar Milutinovic (http://oa.berkeley.edu), as well as a report back from Dario Amodei on the Vannevar series at Stanford.