As part of our project with the NCBO we have been curating expression experiments housed in NCBI’s GEO database and annotating a variety of rat-related records using the NCBO Annotator and, more recently, mining data from the NCBO Resource Index. The annotation pipelines and curation tools that we have built have demonstrated some strengths and shortfalls of automated ontology annotation. Similarly, our manual curation of these records highlights areas where human involvement could be improved to better address the fact that we are living in the Google era, where findability is King.

Speaker Bio:

Simon Twigger currently splits his time between being an Assistant Professor in the Human and Molecular Genetics Center at the Medical College of Wisconsin in Milwaukee and exploring the iPhone and iPad as mobile platforms for education and interaction. At MCW he has been an investigator on the Rat Genome Database project for the past 10 years; he has worked with the Gene Ontology project and has been active in the BioCuration community as co-organizer of the past three International BioCuration meetings. He is the former Director of Bioinformatics for the MCW Proteomics Center and was previously the Biomedical Informatics Key Function Director for the MCW Clinical & Translational Science Institute. He is a Semantic Web enthusiast and is eagerly awaiting the rapture of Web 3.0, when all the data will be taken up into the Linked Data cloud and its true potential realized.

Annotation, useful annotation anyway, is based on recognition of the subject of annotation. Should prove to be an interesting presentation.

Notes from the webinar:

(My personal notes while viewing the webinar in real time. The webinar controls in all cases of conflict. Posted to interest others in viewing the stored version of the webinar.)

Rat Genome Database: http://rgd.mcw.edu
interesting questions that researchers ask
where to find answers: PubMed, 20 million+ citations, almost 1 per minute
search is the critical thing – in all interfaces; “Being able to find information is of great importance to researchers.”
NCBO Annotator: www.bioontology.org/wiki/index.php/Annotator_Web_service
records annotated – curated the raw annotations – manual effort needed to track it down
rat strain synonyms have issues
workflow description
“mouse gut” maps to “course” (example of a mapping issue)
linking annotations to data
RatMine: faceted search + Lucene text indexing, interesting widgets
Driving “Biological” Problem, Part 2: 55.6% of researchers rarely use archival databases, 56.0% rarely use published literature
3rd International BioCuration meeting, Amos Bairoch: “trying to second guess what the authors really did and found.”
post-publication effort to make content findable – different from the academic model where a publication simply floats along
illustration of where the annotation path fails and the consequences of that failure
very cool visualization of how annotations can be visualized and the value thereof
authors put in keywords and don’t care about the paper being found
NCBO Resource Index could be a “semantic warehouse” of connections
websites: gminer.mcw.edu, github.com/mcwbbc/, bioportal.bioontology.org, simont -at- mcw.edu, @simon_t
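The synonym problems in the notes above (a rat strain abbreviation colliding with an ordinary word, “mouse gut” mapping to “course”) come down to dictionary-based term matching. Here is a minimal sketch of that style of annotation; the lexicon, the ontology IDs, and the bogus synonym are invented for illustration, not real BioPortal entries.

```python
# Toy dictionary-based annotator, sketching the kind of term matching an
# ontology annotation service performs. Lexicon entries are invented.

def annotate(text, lexicon):
    """Return (term, ontology_id, offset) for every lexicon term found in text."""
    hits = []
    lowered = text.lower()
    for term, ontology_id in lexicon.items():
        start = lowered.find(term.lower())
        if start != -1:
            hits.append((term, ontology_id, start))
    return sorted(hits, key=lambda h: h[2])

# Hypothetical entries: a rat strain ID plus an ambiguous synonym that
# produces the kind of false positive manual curation has to catch.
lexicon = {
    "SHR": "RS:0000722",      # spontaneously hypertensive rat (assumed ID)
    "course": "FMA:XXXX",     # bogus synonym match, like "gut" -> "course"
}

print(annotate("The SHR strain completed the course.", lexicon))
```

Both hits come back as annotations; only a curator (or a smarter matcher) can tell that the second one is noise, which is exactly the manual-effort point the talk makes.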

Today [Friday, May 27, 2011] marks our third milestone in the Neo4j 1.4 releases. We’ve spent the time since our last release listening to the community and adding to our APIs to help make working with the database even easier and more productive. Under the covers we’ve also built in some performance enhancements that we think you’ll appreciate. And our eye-candy, which you know as Webadmin, has also been extended and tweaked.

Welcome to the Semantic Web Conference Corpus – a.k.a. the Semantic Web Dog Food Corpus! Here you can browse and search information on papers that were presented, people who attended, and other things that have to do with the main conferences and workshops in the area of Semantic Web research.

We currently have information about

2133 papers,

5020 people and

1273 organisations at

20 conferences and

132 workshops,

and a total of 126886 unique triples in our database!

The numbers looked low to me until I read in the FAQ:

This is not just a site for ISWC [International Semantic Web Conference] and ESWC [European Semantic Web Conference] though. We hope that, in time, other metadata sets relating to Semantic Web activity will be hosted here — additional bibliographic data, test sets, community ontologies and so on.

This illustrates a persistent problem of the Semantic Web. This site has one way to encode the semantics of these papers, people, conferences and workshops. Other sources of semantic data on these papers, people, conferences and workshops may well use other ways to encode those semantics. And every group has what it feels are compelling reasons for following its choices and not the choices of others. Assuming they are even aware of the choices of others. (Discovery being another problem but I won’t talk about that now.)

The previous semantic diversity of natural language is now represented by a semantic diversity of ontologies and URIs. Now our computers can more rapidly and reliably detect that we are using different vocabularies. The SW seems like a lot of work for such a result. Particularly since we continue to use diverse vocabularies and more diverse vocabularies continue to arise.

The SW solution, using OWL Full:

5.2.1 owl:sameAs

The built-in OWL property owl:sameAs links an individual to an individual. Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same “identity”.

For individuals such as “people” this notion is relatively easy to understand. For example, we could state that two different URI references actually refer to the same person.

The owl:sameAs statements are often used in defining mappings between ontologies. It is unrealistic to assume everybody will use the same name to refer to individuals. That would require some grand design, which is contrary to the spirit of the web.

In OWL Full, where a class can be treated as instances of (meta)classes, we can use the owl:sameAs construct to define class equality, thus indicating that two concepts have the same intensional meaning. An example:

<footballTeam owl:sameAs us:soccerTeam />

One could imagine this axiom to be part of a European sports ontology. The two classes are treated here as individuals, in this case as instances of the class owl:Class. This allows us to state that the class FootballTeam in some European sports ontology denotes the same concept as the class SoccerTeam in some American sports ontology. Note the difference with the statement:

<footballTeam owl:equivalentClass us:soccerTeam />

which states that the two classes have the same class extension, but are not (necessarily) the same concepts.

Anyone see a problem? Other than requiring the use of OWL Full?

The absence of any basis for “…denotes the same concept as…”? I can’t safely reuse this axiom because I don’t know on what basis its author made such a claim. The URIs may provide further information that satisfies me the axiom is correct, but that still leaves me in the dark as to why the author of the axiom thought it to be correct. Overly precise for football/soccer ontologies, you say, but what of drug interaction ontologies? Or ontologies that govern highly sensitive intelligence data?

So we repeat semantic diversity, create maps to overcome the repeated semantic diversity and the maps we create have no explicit basis for the mappings they represent. Tell me again why this was a good idea?
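To make the “no explicit basis” point concrete, here is a toy sketch (not an OWL reasoner) of what an owl:sameAs assertion does operationally: everything said about either URI collapses onto one merged individual, while the reason for the identity claim is recorded nowhere. All URIs and properties below are invented.

```python
# Toy illustration of owl:sameAs semantics: union-find over sameAs pairs,
# then merge all property assertions onto the canonical representative.

from collections import defaultdict

def same_as_closure(same_pairs):
    """Union-find over owl:sameAs pairs -> canonical representative per URI."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for a, b in same_pairs:
        parent[find(a)] = find(b)
    return find

triples = [
    ("eu:FootballTeam", "hasPlayerCount", "11"),
    ("us:SoccerTeam", "playsIn", "MLS"),
]
same = [("eu:FootballTeam", "us:SoccerTeam")]   # the contested axiom

find = same_as_closure(same)
merged = defaultdict(set)
for s, p, o in triples:
    merged[find(s)].add((p, o))

# Both original assertions now hang off a single merged individual,
# with no record of *why* the identity claim was made.
print(dict(merged))
```

Once merged, there is no slot in the data model for the mapping author’s justification; that is the gap the post is complaining about.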

I would take seriously its suggestion to seek legal counsel if you have any doubts about data you want to use. IP (intellectual property) in any country is a field unto itself and international IP is even more complicated. Self-help, despite all the raging debates about licensing terms and licenses by non-lawyers, is not recommended.

Should not be a problem so long as you are using IP of a client for that client. Is a problem when you start using data from a variety of sources, some of which may not appreciate your organization of the underlying data. Or the juxtaposition of their data with other data, which places them in an unflattering light.

The 4th international workshop Social Data on the Web (SDoW2011), co-located with the 10th International Semantic Web Conference (ISWC2011), aims to bring together researchers, developers and practitioners involved in semantically-enhancing social media websites, as well as academics researching more formal aspects of the interactions between the Semantic Web and Social Web.

It is now widely agreed in the community that the Semantic Web and the Social Web can benefit from each other. On the one hand, the speed at which data is being created on the Social Web is growing at an exponential rate. Recent statistics showed that about 100 million Tweets are created per day and that Facebook now has 500 million users. Yet some issues still have to be tackled, such as how to efficiently make sense of all this data, how to ensure trust and privacy on the Social Web, how to interlink data from different systems, whether on the Web or in the enterprise, or, more recently, how to link social networks and sensor networks to enable Semantic Citizen Sensing.

Completely inadequate description but the interface constructs a mythic “single” reporter on any topic you choose from stories in the New York Times. The interface also gives you reporters who wrote stories on that topic. You can then find what “other” stories the mythic one reporter wrote, as well as compare the stories written by actual NYT reporters.

Raw survey data file in both SPSS and comma-delimited (.csv) formats. To protect the privacy of respondents, telephone numbers, county of residence and zip code have been removed from all public data files.

Survey instrument/questionnaire in Word format. The survey questionnaire provides question and response labels for the raw data file. It also includes all interviewer prompts and programming filters for outside researchers who would like to see how our questions are constructed or use our questions in their own surveys.

Topline data file in Word format that includes trend data to previous surveys in which we have asked each question, where applicable.

As far as I know, the use of topic maps with survey and other data to create “profiles” of particular communities remains unexplored. May not be able to predict the actions of any individual but probabilistic predictions about members of a group may be close enough. Interesting. Predicting the actions of any individual may be NP-Hard but also irrelevant for most purposes.

Posted in Data, Data Source | Comments Off on Pew Research raw survey data now available

“Geospatial Information” identifies, depicts or describes geographic locations, boundaries or characteristics of Earth’s inhabitants or natural or human-constructed features. Geospatial data include geographic coordinates that identify a specific location on the Earth; and data that are linked to geographic locations or have a geospatial component.

Why preserve geospatial data?

Preserving and reusing geospatial data can save time and money

Geospatial data provide important baseline information for later comparison and analysis

Geospatial data can increase the value of important historical and administrative records

Geospatial data are often useful for purposes not originally envisioned

Serious data acquisition, processing, storage and preservation folks. A good starting place to learn the community’s history, vocabulary, concerns and present preservation efforts.

In the realm of public domain software for record linkage and unduplication (aka. dedupe software), The Link King reigns supreme. The Link King has fashioned a powerful alliance between sophisticated probabilistic record linkage and deterministic record linkage protocols incorporating features unavailable in many proprietary record linkage programs. (detailed overview (pdf))

The Link King’s probabilistic record linkage protocol was adapted from the algorithm developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database Project. The deterministic record linkage protocols were developed at Washington State’s Division of Alcohol and Substance Abuse for use in a variety of evaluation and research projects.

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The data linking neophyte will appreciate the easy-to-follow instructions. The Link King’s artificial intelligence will assist in the selection of the most appropriate linkage/unduplication protocol. The technical wizard will appreciate the discussion of data linkage/unduplication issues in The Link King’s user manual, the variety of user-specified options for blocking and linkage decisions, and the powerful interface for manual review of “uncertain” linkages.
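For readers new to record linkage, the deterministic-plus-probabilistic combination described above can be sketched in a few lines. This is an illustration of the general technique, not The Link King’s actual algorithm; the fields, thresholds, and the use of difflib similarity (rather than a proper Jaro-Winkler/Fellegi-Sunter model) are my assumptions.

```python
# Minimal record linkage sketch: a deterministic rule (exact DOB match)
# gates a probabilistic name comparison, with an "uncertain" band that
# would be routed to manual review.

from difflib import SequenceMatcher

def link(rec_a, rec_b, hi=0.90, lo=0.75):
    """Classify a candidate pair as 'link', 'uncertain', or 'non-link'."""
    if rec_a["dob"] != rec_b["dob"]:          # deterministic blocking rule
        return "non-link"
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    if name_sim >= hi:
        return "link"
    return "uncertain" if name_sim >= lo else "non-link"

a = {"name": "Jonathan Smith", "dob": "1970-01-02"}
b = {"name": "Jon Smith",      "dob": "1970-01-02"}
c = {"name": "Mary Jones",     "dob": "1970-01-02"}

print(link(a, a), link(a, b), link(a, c))
```

The “uncertain” band is the interesting part: real tools earn their keep in how they size that band and how pleasant they make the manual review of it.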

Recommender systems are playing a key role in the next web revolution as a practical alternative to traditional search for information access and filtering. Most of these systems use Collaborative Filtering techniques in which predictions are solely based on the feedback of the user and similar peers. Although this approach is considered relatively effective, it has reached some practical limitations such as the so-called Magic Barrier. Many of these limitations stem from the fact that explicit user feedback in the form of ratings is considered the ground truth. However, this feedback has a non-negligible amount of noise and inconsistencies. Furthermore, in most practical applications, we lack enough explicit feedback and would be better off using implicit feedback or usage data.

In the first part of my talk, I will present our studies in analyzing natural noise in explicit feedback and finding ways to overcome it to improve recommendation accuracy. I will also present our study of user implicit feedback and an approach to relate both kinds of information. In the second part, I will introduce a radically different approach to recommendation that is based on the use of the opinions of experts instead of regular peers. I will show how this approach addresses many of the shortcomings of traditional Collaborative Filtering, generates recommendations that are better perceived by the users, and allows for new applications such as fully-privacy preserving recommendations.
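The expert-based idea in the second part of the talk can be roughly sketched as follows: predict a user’s rating from a small pool of expert ratings, weighted by how similar each expert’s rating profile is to the user’s. The data, and the choice of cosine similarity over co-rated items, are my assumptions for illustration, not Dr. Amatriain’s implementation.

```python
# "Wisdom of the few" sketch: similarity-weighted average of expert ratings.

from math import sqrt

def cosine(u, v):
    """Cosine similarity over the items both profiles have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den if den else 0.0

def predict(user, experts, item):
    """Predict user's rating for item from experts who rated it."""
    num = den = 0.0
    for ex in experts:
        if item in ex:
            w = cosine(user, ex)
            num += w * ex[item]
            den += abs(w)
    return num / den if den else None

user = {"A": 5, "B": 1}
experts = [{"A": 5, "B": 1, "C": 4},   # agrees with the user
           {"A": 1, "B": 5, "C": 1}]   # disagrees
print(predict(user, experts, "C"))
```

The prediction lands closer to the agreeing expert’s rating, and because the expert pool is small and curated, the noisy-ratings problem of peer-based CF is sidestepped rather than solved.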

Chris Anderson: “We are leaving the age of information and entering the age of recommendation.”

I suspect Chris Anderson must not be an active library user. Long before recommender systems, librarians have been making recommendations to researchers, patrons and children doing homework. I would say we are returning to the age of librarians, assisted by recommender systems.

Librarians use the reference interview so that based on feedback from patrons they can make the appropriate recommendations.

If you substitute librarian for “expert” in this presentation, it becomes apparent the world of information is coming back around to libraries and librarians.

Librarians should be making the case, both in the literature and to researchers like Dr. Amatriain, that librarians can play a vital role in recommender systems.

One of the more intriguing slides represented http/apps/dbs as a stack to show that while scaling the HTTP layer is well understood, scaling apps is more difficult but still doable, and scaling storage is the most expensive and difficult of all.

I mention that because scaling of databases I suspect has a lot in common with scaling of topic maps.

On the issue of consistency, the point was made that “expires” can be included in HTTP headers, which indicate a fact is good until some time. I wonder, could a topic have a “last merged” property? So that a user can choose the timeliness they need? So that “last merged” 7 days ago is public information, “last merged” 3 days ago is subscriber information and the most recent “last merged” is premium information.
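The “last merged” tiering idea could be sketched as follows; the thresholds are the ones suggested above (7 days = public, 3 days = subscriber, fresher = premium), while the function and tier names are hypothetical.

```python
# Sketch: map a merge snapshot's age to the cheapest tier allowed to see it.

from datetime import datetime, timedelta

def snapshot_tier(last_merged, now, subscriber_days=3, public_days=7):
    """Return the access tier for a snapshot with the given last-merged time."""
    age = now - last_merged
    if age >= timedelta(days=public_days):
        return "public"
    if age >= timedelta(days=subscriber_days):
        return "subscriber"
    return "premium"

now = datetime(2011, 6, 1)
print(snapshot_tier(now - timedelta(days=8), now))    # an old merge: public
print(snapshot_tier(now - timedelta(days=4), now))    # mid-age: subscriber
print(snapshot_tier(now - timedelta(hours=2), now))   # freshest: premium
```

This is the same pattern as HTTP `Expires`: the data carries a timestamp, and the consumer (or the billing layer) decides what freshness is worth.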

For example, instead of trying to regulate insider trading, the SEC could create a topic map of stocks and sell insider trading information, suitably priced to keep its “insider” character, except that for enough money, anyone could play. The SEC portion of the subscription + selling price could be used to finance other enforcement activities.

This presentation plus the Amazon paper make nice weekend reading/viewing.

Zanran doesn’t work by spotting wording in the text and looking for images – it’s the other way round. The system examines millions of images and decides for each one whether it’s a graph, chart or table – whether it has numerical content.

Admittedly you may have difficulty re-using such data but finding it is a big first step. You can then contact the source for the data in a more re-usable form.

From Hints & Helps:

Language. English only please… for now.
Phrase search. You can use double quotes to make phrases (e.g. “mobile phones”).
Vocabulary. We have only limited synonyms – please try different words in your query. And we don’t spell-check … yet.

From the website:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Are you a gamification expert[1] or interested in becoming one? Want to help solve a problem of epic proportions that could have a major impact on the world?

The SETI Institute and Gamify[2] together have created an EPIC Contest to explore possible ways to gamify SETI. We’re asking the most brilliant Earthlings to come up with ideas on how to apply gamification[3] to increase participation in the SETI program.

The primary goal of this social/scientific challenge is to help SETI empower global citizens to participate in the search for cosmic company and to help SETI become financially sustainable so it can live long and prosper. This article explains our problem and what we are looking to accomplish. We invite everyone to answer the question, “How would you gamify SETI?”.

To be more specific:

Can we create a fun and compelling app or set of apps that allow people to aid us in identifying signals?

Do you have any ideas to make this process a fun game, while also solving our problem, by applying game mechanics and game-thinking?

Can we incorporate sharing and social interaction between players?

Is monetization possible through virtual goods, “status short-cuts” or other methods popularized by social games?

Are there any angles of looking at the problem and gamifying that we have not thought of?

The scientific principles involved in this field of science can be very complicated. A conscious attempt has been made to explain the challenge we face with a minimum of scientific explanation or jargon. We wish to be able to clearly explain our unique problems and desired outcomes to the scientific and non-scientific audience.

It all started with the flu. In 2008, we found that the activity of certain search terms is a good indicator of actual flu activity. Based on this finding, we launched Google Flu Trends to provide timely estimates of flu activity in 28 countries. Since then, we’ve seen a number of other researchers—including our very own—use search activity data to estimate other real world activities.

However, tools that provide access to search data, such as Google Trends or Google Insights for Search, weren’t designed with this type of research in mind. Those systems allow you to enter a search term and see the trend; but researchers told us they want to enter the trend of some real world activity and see which search terms best match that trend. In other words, they wanted a system that was like Google Trends but in reverse.

This is now possible with Google Correlate, which we’re launching today on Google Labs. Using Correlate, you can upload your own data series and see a list of search terms whose popularity best corresponds with that real world trend. In the example below, we uploaded official flu activity data from the U.S. CDC over the last several years and found that people search for terms like [cold or flu] in a similar pattern to actual flu rates…
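At its core, the “Trends in reverse” idea is correlation ranking: given a real-world series, score every candidate search-term series by Pearson correlation and sort. Here is a sketch with made-up data; Google Correlate obviously does this at enormous scale over its own term database, and nothing below is its actual implementation.

```python
# Rank candidate search-term series by Pearson correlation with a target
# real-world series. All series are invented for illustration.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

flu_activity = [1, 2, 8, 9, 3, 1]            # e.g. weekly CDC-style estimates

term_series = {
    "cold or flu": [1, 3, 7, 9, 4, 1],       # tracks the target closely
    "sunscreen":   [9, 8, 3, 1, 6, 9],       # roughly inverse
    "lottery":     [5, 5, 5, 5, 5, 5],       # flat, uninformative
}

ranked = sorted(term_series,
                key=lambda t: pearson(flu_activity, term_series[t]),
                reverse=True)
print(ranked)
```

The flu-like term rises to the top; whether correlation implies anything useful is, of course, the researcher’s problem, not the tool’s.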

Breaking with the long-standing, respected and near-holy tradition of conference workshops (jet-lagged, caffeine-jagged, email-reading passive-aggressives half-listening to speakers who are not reading their email), the Balisage pre-conference workshop will focus on creating a useful deliverable.

This year after a short introduction to the topic, the goal, and the approach, the attendees will break out into work groups with writing assignments and will actively participate in the development of a white paper. As the day progresses, groups will work on assignments, report back to the whole, and receive new assignments.

The notes, text, lists, and stories created during the workshop will be turned over to an editor who will produce a White Paper from the work produced during the workshop.

We expect this to be an intense, interactive, and productive day.

Participate if you want to:

help draft a document that will meet the needs of many

influence the direction and content of this document

learn what some others think

work elbow to elbow with XMLers of different backgrounds for a day

throw yourself into an interactive group activity.

If this does not sound like your sort of day, if you are more comfortable in a more traditional conference environment, please join us for Balisage: The Markup Conference 2011, starting the following day.

Question: If this map is shown on an iPhone and can display more information about either the starting or ending location, is that a fragment?

I ask because it would be less than all the information the source contains. Which is one sense of “fragment.”

I don’t know that this representation (yet) can do that, but the delivery of the route made me think about the information being delivered. It is in some very real sense “complete” for purposes of navigating about Dublin. If I ask again, I will get another “complete” information set. And I have no trouble seeing relationships between those two sets of information.

Contributions can be either full research papers, Standard Enhancement Proposals, or a description of new Content Dictionaries, particularly ones that are suggested for formal adoption by the OpenMath Society.

IMPORTANT DATES (all times are GMT)

OpenMath 2011 does not have a submission deadline. Submissions will be accepted until July 10, with reviews and notifications on a rolling basis.

SUBMISSIONS

Submission is by e-mail to omws2011@googlegroups.com. Papers must conform to the Springer LNCS style, preferably using LaTeX2e and the Springer llncs class files.

Submission categories:

Full paper: 4-12 LNCS pages

Short paper: 1-8 LNCS pages

CD description: 1-8 LNCS pages; a .zip or .tgz file of the CDs should be attached.

Our view is that the new data intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Storage Platform is the result of a radical re-think of those assumptions; the result is high performance from low cost commodity hardware.

It includes the Acunu Storage Core which runs in the Linux kernel. On top of this core, we provide a modified version of Apache Cassandra. This is essentially the same as “vanilla” Cassandra but uses the Acunu Storage Core to store data instead of the Linux file system and is therefore able to take advantage of the performance benefits of our platform. In addition to Cassandra, there is also an object store similar to Amazon’s S3; we have a number of other more experimental projects in the pipeline which we’ll talk about in future posts.

Perhaps the start of something very interesting.

It took NoSQL a couple of years to flower into the range of current offerings.

We’ve all heard this story. All was fine until one day your boss heard somewhere that Hadoop and No-SQL are the new black and mandated that the whole company switch over whatever it was doing to the Hadoop et al. technology stack, because that’s the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.

So you threw away your old relational database back end and maybe all or part of your middle tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But as you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, but it’s not even scalable or reliable and your meetings are starting more and more to resemble the Hadoop Downfall parody.

An excellent post on problems to avoid with Hadoop!

Posted in Hadoop, Humor, NoSQL | Comments Off on Hadoop Dont’s: What not to do to harvest Hadoop’s full potential

With GraphStream you deal with graphs. Static and Dynamic.
You create them from scratch, from a file or any source.
You display and render them.

From Getting Started:

GraphStream is a graph handling Java library that focuses on the dynamic aspects of graphs. Its main focus is on the modeling of dynamic interaction networks of various sizes.

The goal of the library is to provide a way to represent graphs and work on them. To this end, GraphStream proposes several graph classes that allow modeling of directed and undirected graphs, 1-graphs or p-graphs (a.k.a. multigraphs: graphs that can have several edges between two nodes).

GraphStream allows storing any kind of data attribute on the graph elements: numbers, strings, or any object.

Moreover, GraphStream provides a way to handle the graph’s evolution in time. This means handling the way nodes and edges are added and removed, and the way data attributes may appear, disappear and evolve.
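GraphStream itself is a Java library, so as a language-neutral illustration of the core idea (a dynamic graph as an ordered stream of add/remove/change events), here is a minimal Python sketch. The class is invented; the two-letter event codes loosely mimic GraphStream’s DGS-style event names ("an" = add node, etc.), which is an assumption about that format rather than a use of its API.

```python
# Minimal dynamic-graph sketch: the graph's dynamics are captured as an
# ordered event log alongside the current node/edge state.

class DynamicGraph:
    def __init__(self):
        self.nodes = {}    # node id -> attribute dict
        self.edges = {}    # (a, b) -> attribute dict
        self.events = []   # the dynamics: an ordered event log

    def add_node(self, n, **attrs):
        self.nodes[n] = dict(attrs)
        self.events.append(("an", n))            # "add node"

    def add_edge(self, a, b, **attrs):
        self.edges[(a, b)] = dict(attrs)
        self.events.append(("ae", a, b))         # "add edge"

    def remove_node(self, n):
        # dropping a node also drops its incident edges
        self.edges = {e: v for e, v in self.edges.items() if n not in e}
        del self.nodes[n]
        self.events.append(("dn", n))            # "delete node"

g = DynamicGraph()
g.add_node("A", label="start")
g.add_node("B")
g.add_edge("A", "B", weight=2)
g.remove_node("B")
print(g.events)
```

Replaying the event log from the start reconstructs the graph at any point in time, which is what makes the stream representation useful for evolving networks.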

This isn’t so much surprising as it is disappointing. We now know the priority that “open” government has in U.S. government budgetary discussions.

I could go on at length about this decision, the people who made it, complete with speculation on their motives, morals and parentage. Unfortunately, that would not restore the funding nor would it be a useful exercise.

As an alternative, let me suggest that everyone select one or two of the data sets that are already available and do something interesting. Something that will catch the imagination of the average citizen. Then credit these government sites as the sources and gently point out that with more funding, there would be more data. And hence more interesting things to see.

Asking someone at the agencies that produce data could result in interesting suggestions. They may lack the time, resources, personnel to do something really creative but with their ideas and your talents…, well, the result could interest the agency and the public. These agencies are the ones fighting on the inside of the public budget process for funding.

What data sets and ideas for those data sets do you think would have the most appeal or impact?

Posted in Data Source, Dataset | Comments Off on Open government sites scrapped due to budget cuts

Architects look at thousands of buildings during their training, and study critiques of those buildings written by masters. In contrast, most software developers only ever get to know a handful of large programs well—usually programs they wrote themselves—and never study the great programs of history. As a result, they repeat one another’s mistakes rather than building on one another’s successes.

This book’s goal is to change that. In it, the authors of twenty-five open source applications explain how their software is structured, and why. What are each program’s major components? How do they interact? And what did their builders learn during their development? In answering these questions, the contributors to this book provide unique insights into how they think.

If you are a junior developer, and want to learn how your more experienced colleagues think, this book is the place to start. If you are an intermediate or senior developer, and want to see how your peers have solved hard design problems, this book can help you too.

I thought this might be of interest to the developer side of the topic map house.

Identifier persistence requires an organizational commitment. Persistence cannot be ensured by a few renegades in the skunk-works, nor can it be mandated from on high without the support of those who manage the identifiers or produce web resources. All individuals involved in the life-cycle of web resources must be committed to persistence in perpetuity if true persistence of identifiers is to be achieved.

No technology, no standard, no identifier scheme, no information architecture will get you persistence. Whether you choose native URIs, Handles, DOIs, PURLs, ARKs, UUIDs, or XRIs, you will never achieve identifier persistence without active management of your identifiers and web resources. This requires the aforementioned organizational commitment since such management cannot occur without sufficient resources. Management of web resources and identifiers requires time and due diligence and those don’t come for free.

It has been, what?, over 2,000 years without active management of identifiers and web resources, and “Cleopatra” still persists as an identifier.

And that is a fairly recent identifier in the great scheme of identifiers. There are those that are far older.

I don’t deny the convenience or utility of web identifiers. But in terms of persistence, where should we look for a digital Rosetta stone when the maintenance of opaque identifiers and 303 redirects has fallen into disuse? I have heard it mentioned that fifteen or twenty years counts as persistence for a web identifier. Perhaps so, but realize that the persistence of the identifier for Cleopatra that appears above is more than two orders of magnitude greater.

How would your business be different today if there were a cone of information darkness only fifteen or twenty years (an optimistic estimate) behind you? And with each passing year, another year drops into a digital abyss. Some things persist, others don’t. Usually the ones you want/need don’t. Or so it always seems.

Let’s write web identifiers using (in part) identifiers that are already meaningful in our professions, occupations and hobbies. Identifiers that are not dependent on particular resolution mechanisms or technologies. Identifiers that will persist long after their maintenance has failed. That is a step towards persistence.

Last week I discovered the SPARQL 1.1 Graph Store HTTP Protocol [1] and wondered if this wouldn’t be a good alternative to SDShare [2].

The graph store protocol uses no artificial technologies like Atom but uses REST and RDF consistently. The service uses an ontology [3] to inform the client about available graphs etc.

The protocol allows creation, deletion and updating of graphs, and discovery of graphs (through the service description).

The protocol is rather generic, so it’s usable for Topic Maps as well (graph == topic map).
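What the protocol’s graph-level CRUD looks like over plain HTTP can be sketched with the Python standard library. The requests below are built but not sent; the endpoint and graph URIs are invented, and the `?graph=` query parameter is the spec’s indirect graph naming.

```python
# Sketch of SPARQL 1.1 Graph Store HTTP Protocol requests (indirect graph
# identification). Requests are constructed but never sent.

from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/rdf-graph-store"   # invented endpoint

def graph_request(graph_uri, method, data=None, content_type="text/turtle"):
    """Build a Graph Store Protocol request for the named graph."""
    url = ENDPOINT + "?" + urlencode({"graph": graph_uri})
    headers = {"Content-Type": content_type} if data is not None else {}
    return Request(url, data=data, headers=headers, method=method)

get_req = graph_request("http://example.org/g1", "GET")      # fetch a graph
put_req = graph_request("http://example.org/g1", "PUT",      # replace a graph
                        data=b"<#a> <#b> <#c> .")
del_req = graph_request("http://example.org/g1", "DELETE")   # drop a graph

print(get_req.get_method(), put_req.get_method(), del_req.get_method())
```

For a Topic Maps back end the same verbs apply unchanged; only the payload serialization (and the graph == topic map reading) differs.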

The protocol provides no fragments/snapshots like SDShare, though. Adding this functionality to the protocol would be interesting, I’d think. I.e. each graph update would trigger a new fragment. Maybe this functionality would also solve the “push problem” [4] without inventing yet another syntax. The description of the available fragments should also be done with an ontology and not solely with Atom, though.

Anyway, I wanted to mention it as a good, *dogfooding* protocol which could be used for Topic Maps.

I created an implementation (Cassa) of the protocol at [5] (no release yet). The implementation supports Topic Maps and RDF, but it doesn’t provide the service description yet. And I haven’t translated the service description ontology to Topic Maps yet.