Author: julian

Linked Data Engineering

In general it is difficult to get at data, because it is spread over different databases and you need a different API for each of them: data islands. Semantic web technologies give you a standardized interface to access this data, which makes it easier to reuse and share. Tim Berners-Lee: “Value of data increases if data is connected to other sources.” There are four principles for linked data:

Use URIs as names (not only web pages, but also real objects, abstract concepts and so on)

Linked Data Programming

How do you publish data for the Semantic Web? The best way is a SPARQL endpoint, provided for example by OpenLink Virtuoso, Sesame or Fuseki. These endpoints are RESTful web services that you can query and that return results as JSON, XML and so on. Another way is a Linked Data endpoint (Pubby, Jetty); these are overlays on top of a SPARQL endpoint. A third way is a D2R server, which translates data from non-RDF databases into RDF data. A source for available datasets is datahub.io.
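To make the idea of querying a SPARQL endpoint over HTTP a bit more concrete, here is a minimal sketch that only builds the request and does not send it. The endpoint URL (DBpedia's public endpoint), the query and the function name build_sparql_request are my own assumptions for illustration; other endpoints may expect different parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: DBpedia's public endpoint; any other SPARQL endpoint URL could be used.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: ask for one English abstract of the resource "Pluto".
QUERY = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Pluto> <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
} LIMIT 1
"""

def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask the server for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that part is left out so the sketch runs without network access.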

Metadata and Semantic Annotation

Semantic Annotation: you attach semantic data to your source. Formally, an annotation consists of:

subject of the annotation (a book, represented by its ISBN)

object of the annotation (the author)

predicate that defines the type of relationship (that the author is the author of the book)

context, in which the annotation is made (who did the annotation and when?)
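The four parts listed above can be sketched as a small data structure. This is only an illustration: the class name Annotation, its field names and all example values (including the placeholder ISBN and the date) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Annotation:
    subject: str     # what is annotated, e.g. a book identified by its ISBN
    predicate: str   # the type of relationship
    obj: str         # the value of the annotation, e.g. the author
    annotator: str   # context: who made the annotation
    created: date    # context: when it was made

    def as_triple(self):
        """Drop the context and return the plain subject-predicate-object part."""
        return (self.subject, self.predicate, self.obj)

# Placeholder values, invented for this sketch.
example = Annotation(
    subject="urn:isbn:EXAMPLE",
    predicate="http://purl.org/dc/terms/creator",
    obj="George Orwell",
    annotator="julian",
    created=date(2014, 1, 1),
)
print(example.as_triple())
```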

Named Entity Resolution

When we do semantic annotation we want to get the meaning of a string, for example additional information (you annotate Neil Armstrong and get more information about him). The main problem is ambiguity: if you enter “Armstrong” in a search engine, you also get pictures of Lance Armstrong and Louis Armstrong. Context helps us to narrow the search and overcome this problem.

Resolution: mapping the word to a knowledge base in order to solve ambiguity
Recognition: locating and classifying entities into predefined categories like names, persons, organizations

Example: Armstrong landed the eagle on the moon
From that you build every possible combination of candidate entities, and if they co-occur in texts, you can find the best matches. Another way is to look at DBpedia, where you can see which of the possible candidates are connected to each other.
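The combination idea for the Armstrong example could look roughly like this. It is a toy sketch: the candidate lists, the entity identifiers and the LINKS set are invented and not taken from DBpedia. Each combination is scored by how many of its entity pairs are connected in the knowledge base.

```python
from itertools import combinations, product

# Hypothetical candidates per mention in "Armstrong landed the eagle on the moon".
CANDIDATES = {
    "Armstrong": ["Neil_Armstrong", "Lance_Armstrong", "Louis_Armstrong"],
    "eagle": ["Eagle_(bird)", "Apollo_Lunar_Module_Eagle"],
    "moon": ["Moon"],
}

# Hypothetical connections between entities in the knowledge base.
LINKS = {
    frozenset({"Neil_Armstrong", "Apollo_Lunar_Module_Eagle"}),
    frozenset({"Neil_Armstrong", "Moon"}),
    frozenset({"Apollo_Lunar_Module_Eagle", "Moon"}),
}

def coherence(entities):
    """Count how many pairs of the chosen entities are linked."""
    return sum(frozenset(pair) in LINKS for pair in combinations(entities, 2))

def resolve(candidates):
    """Try every combination of candidates and keep the most connected one."""
    mentions = list(candidates)
    best = max(product(*(candidates[m] for m in mentions)), key=coherence)
    return dict(zip(mentions, best))

resolved = resolve(CANDIDATES)
print(resolved)
```

Trying every combination grows quickly with the number of mentions, so this only works for short texts.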

Semantic Search

When you use a search engine, you will also get ambiguous results. With semantically annotated texts you can overcome this ambiguity. Based on this you can do entity-based IR, which is language independent. You could also include information from the underlying knowledge base or use content-based navigation and filtering (filter pictures vs. drawings).

You can use it for:
Query string refinement (like auto-completion, query enrichment)
Cross referencing (additional information for the user taken from the knowledge base)
Fuzzy search (give nearby results, helpful if you have very few results)
Exploratory search (visualize and navigate in the search space)
Reasoning (complement search results with implicitly given information)

Another example is entity-based search. You match a query against semantically annotated documents (simple entity matching). You can also use similarities, like the one between Buzz Aldrin and Neil Armstrong (similarity-based entity matching).
Relationship-based entity matching: You have the entities astronaut and Apollo 11. There are also relationships between astronaut, Apollo 11 and Neil Armstrong.
-> these results can complement your search!
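A small sketch of the difference between simple and relationship-based entity matching, assuming documents have already been annotated with entity identifiers. The document ids, the entity names and the RELATED table are invented for this example.

```python
# Hypothetical documents, each annotated with a set of entities.
DOCS = {
    "doc1": {"Neil_Armstrong", "Apollo_11"},
    "doc2": {"Buzz_Aldrin"},
    "doc3": {"Louis_Armstrong"},
}

# Hypothetical relationships taken from a knowledge base.
RELATED = {
    "Neil_Armstrong": {"Apollo_11", "Buzz_Aldrin"},
}

def simple_match(entity):
    """Return documents annotated with exactly the queried entity."""
    return sorted(doc for doc, entities in DOCS.items() if entity in entities)

def relationship_match(entity):
    """Also return documents annotated with entities related to the query."""
    wanted = {entity} | RELATED.get(entity, set())
    return sorted(doc for doc, entities in DOCS.items() if entities & wanted)

print(simple_match("Neil_Armstrong"))
print(relationship_match("Neil_Armstrong"))
```

Because matching works on entities and not on strings, the document about Louis Armstrong is never returned for this query.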

Another approach is to select named entities directly, so you click on the entity that you want. Examples are the articles at blog.yovisto.com

Exploratory Search

Extension of traditional search and Semantic Search

Retrieval: You look for something specific (like a book) and know how to specify it

Exploration: You have already read “1984” and want to read a book that is close to this one. In a library you would ask the librarian, who would tell you what to read next. We want to have this in our search system as well! In a traditional library you can also look at the shelves and maybe find another book that is similar.

For whom is it made?

People that are unfamiliar with the domain

People who are unsure about the ways to achieve their goals

People who are unsure about their goals: you want to find something, but you cannot specify it

You can build graphs from the semantic information you have in order to give the user more information about the original result (more books by one author). You could also get broader results (you read a book by Jules Verne and get as a recommendation books by H.G. Wells, who was influenced by Jules Verne). Another example: start with Neil Armstrong -> Apollo 11 and the other crew members -> Apollo 11 is part of the Apollo program and you find other Apollo missions -> you find Apollo 13 and find out that there was an accident -> you find the crash of the space ship “Challenger”.
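The Armstrong chain above can be read as a walk through a graph of related entities. The following sketch uses a hand-made graph (the GRAPH contents and the function name explore are assumptions) and collects everything reachable within a given number of hops, which is roughly what an exploratory interface would offer as next steps.

```python
# Hypothetical fragment of a knowledge graph as an adjacency list.
GRAPH = {
    "Neil_Armstrong": ["Apollo_11"],
    "Apollo_11": ["Neil_Armstrong", "Buzz_Aldrin", "Michael_Collins", "Apollo_program"],
    "Apollo_program": ["Apollo_11", "Apollo_13"],
    "Apollo_13": ["Apollo_program"],
}

def explore(start, max_hops):
    """Return {entity: hops} for every entity reachable within max_hops."""
    seen = {start: 0}
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in GRAPH.get(node, []):
                if neighbour not in seen:
                    seen[neighbour] = hop
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen

print(explore("Neil_Armstrong", 3))
```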

Crowdsourcing via Amazon Mechanical Turk or games with a purpose. In short there are three steps: term extraction – conceptualization – evaluation. Current challenges in Ontology Learning:

Heterogeneity

Uncertainty: the quality is low, you cannot be sure whether the information is right or not

Consistency: you need it because otherwise you cannot do reasoning

Scalability: make sure that it is scalable

Quality: you need to evaluate it and make sure it is right

Interactivity: you need to involve users to help you improve the ontologies

Ontology Alignment
What is it? You try to find similarities between ontologies in order to combine them. But: an ontology only models reality, it is NOT the reality. The problems are similar to those of natural language: you run into ambiguities. You can also have problems with different conventions (time in seconds vs. time in time points), different granularities and different points of view.

You have differences on the syntactical, terminological, semantical or semiotic (pragmatic) level

Ontology Evaluation

Ontology evaluation assesses the quality of an ontology with respect to a particular criterion. There are two basic principles:

Verification: are the encoding and the implementation correct? (more the formal side)

Validation: how good is the model and how well does it match reality?

Criteria for validation:

correctness:

Accuracy (precision and recall)

completeness

consistency

quality:

Adaptability

Clarity

computational efficiency

organizational fitness (how well does it integrate into my software/organisation)
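For the accuracy criterion, precision and recall can be computed by comparing the statements of a learned ontology with a reference ("gold") ontology. This is a sketch under that assumption; the two small statement sets are invented.

```python
def precision_recall(learned, gold):
    """Compare two sets of statements and return (precision, recall)."""
    true_positives = len(learned & gold)
    precision = true_positives / len(learned) if learned else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical reference ontology and a learned one with one correct statement.
gold = {
    ("Dog", "subClassOf", "Animal"),
    ("Cat", "subClassOf", "Animal"),
    ("Animal", "subClassOf", "LivingThing"),
}
learned = {
    ("Dog", "subClassOf", "Animal"),
    ("Cat", "subClassOf", "Plant"),
}

precision, recall = precision_recall(learned, gold)
print(precision, recall)
```

Precision tells you how much of what was learned is right, recall how much of the reference was found.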

The lecture deals with artificial intelligence, which in this context means that we want to program computers that act logically. We do not want them to act like humans, and we also do not want to build an artificial brain in order to understand how human thinking works.

The first problems are search problems, like finding the shortest path to a point in a grid. For this you can use two types of search:

Breadth-First-Search: You try out every point around you, and if you do not reach the goal, you try the points next to these points and so on. The good thing about this is that you find the shortest way; the bad thing is that it takes very long.

Depth-First-Search: You follow one path until the end (for instance, you always take a left turn), and if this does not work, you try another path. The good thing about this is that you will probably find a solution faster than with Breadth-First-Search, but you may not find the best solution.
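The two strategies differ only in which path is taken from the frontier next: the oldest one (queue) or the newest one (stack). Here is a minimal sketch on a small open grid; the grid layout, the neighbour order and the function names are my own choices.

```python
from collections import deque

# "S" is the start, "G" the goal, "." a free cell.
GRID = [
    "S.G",
    "...",
    "...",
]

def find(grid, symbol):
    """Return the (row, column) of the first cell containing symbol."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not in grid")

def neighbours(pos, grid):
    """Yield the free cells right, below, left of and above pos."""
    r, c = pos
    for nr, nc in ((r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)):
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            yield (nr, nc)

def search(grid, breadth_first):
    """Return a path from S to G; FIFO order gives BFS, LIFO order gives DFS."""
    start, goal = find(grid, "S"), find(grid, "G")
    frontier = deque([[start]])
    visited = set()
    while frontier:
        path = frontier.popleft() if breadth_first else frontier.pop()
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbours(node, grid):
            if nxt not in visited:
                frontier.append(path + [nxt])
    return None

bfs_path = search(GRID, breadth_first=True)
dfs_path = search(GRID, breadth_first=False)
print(len(bfs_path) - 1, len(dfs_path) - 1)  # number of steps for each strategy
```

Both calls reach the goal, but only the breadth-first path is guaranteed to be the shortest one; the depth-first path depends on the neighbour order and may be longer.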

In real-world problems you sometimes also have costs that you have to add to the problem. For example, if you want to find the shortest path between two cities by train, you probably want the shortest distance and not the connection with the fewest steps (edges in the graph).
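One way to take costs into account is a search that always expands the cheapest path found so far (often called uniform-cost search or Dijkstra's algorithm). The rail network below is invented: the direct connection from A to D has the fewest edges, but the route via B and C is shorter.

```python
import heapq

# Hypothetical rail network with distances between cities.
RAIL = {
    "A": {"B": 2, "D": 10},
    "B": {"A": 2, "C": 2},
    "C": {"B": 2, "D": 2},
    "D": {"A": 10, "C": 2},
}

def cheapest_path(graph, start, goal):
    """Return (total cost, path) of the cheapest route, or None if unreachable."""
    frontier = [(0, start, [start])]  # priority queue ordered by cost so far
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, distance in graph[node].items():
            if nxt not in settled:
                heapq.heappush(frontier, (cost + distance, nxt, path + [nxt]))
    return None

print(cheapest_path(RAIL, "A", "D"))
```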

A real improvement can be made if you add a heuristic to your model, for example an estimate of the distance to your goal. You can combine this heuristic with Breadth-First-Search in order to find the best solution faster, because you know when your search takes you further away from your goal.
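A common way to combine the cost so far with a distance estimate is A* search. The sketch below is not necessarily the exact variant from the lecture; it assumes a grid with unit step costs and uses the Manhattan distance as heuristic. The grid itself is made up.

```python
import heapq

# "." is a free cell, "#" is a wall.
GRID = [
    "....",
    ".##.",
    "....",
]

def manhattan(a, b):
    """Heuristic: grid distance that ignores walls, so it never overestimates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, goal):
    """Return the number of steps on a shortest path, or None if unreachable."""
    frontier = [(manhattan(start, goal), 0, start)]  # (estimate, cost so far, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best_cost[node]:
            continue  # a cheaper way to this cell was found in the meantime
        r, c = node
        for nxt in ((r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == "#":
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                estimate = new_cost + manhattan(nxt, goal)
                heapq.heappush(frontier, (estimate, new_cost, nxt))
    return None

print(a_star(GRID, (0, 0), (2, 3)))
```

Cells that lead away from the goal get a higher estimate and are looked at later, which is where the speed-up comes from.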

The first week covers the basic principles of the technologies for the semantic web, especially RDF, which is one of the languages you can use for encoding information semantically. The basic building block behind these technologies is the triple, which consists of subject, predicate and object. You encode all your knowledge in that way, for example: Pluto – discovered – 1930.
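To illustrate the triple idea, here is a tiny in-memory store with pattern matching, where None acts as a wildcard. It is a simplification: real RDF uses URIs and typed literals, not plain strings, and the names Triple, STORE and match are hypothetical.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Plain strings stand in for URIs and literals in this sketch.
STORE = [
    Triple("Pluto", "discovered", "1930"),
    Triple("Pluto", "type", "DwarfPlanet"),
    Triple("Neptune", "discovered", "1846"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given parts; None matches anything."""
    return [
        t for t in store
        if subject in (None, t.subject)
        and predicate in (None, t.predicate)
        and obj in (None, t.obj)
    ]

print(match(STORE, subject="Pluto", predicate="discovered"))
```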

One problem is that, because of the syntax, the expressions tend to be very long. To shorten them you can use abbreviations with namespaces, as in XML or Turtle.
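The namespace abbreviation works by replacing a long common URI part with a short prefix. A rough sketch follows; the prefix table uses the prefixes commonly seen for DBpedia, but the property in the example and the function names are illustrative assumptions.

```python
# Prefix table: short name -> namespace URI.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
}

def shorten(uri, prefixes):
    """Abbreviate a full URI to prefix:local if a namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"  # no matching prefix: keep the full URI in angle brackets

def expand(name, prefixes):
    """Turn prefix:local back into the full URI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local

print(shorten("http://dbpedia.org/resource/Pluto", PREFIXES))
print(expand("dbo:discovered", PREFIXES))
```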

Mr. Sack claims that with Semantic Web technologies you can go one step further, towards a web of data, because it becomes very easy to create data that is machine-readable. He gives a lot of examples using DBpedia. This site also provides a good interface where you can download data in different machine-readable formats like XML or JSON. Example page for Pluto.

Named Entities are nouns that – simply speaking – refer to something in the real world. An example would be the noun Los Angeles, which refers to a city in the US, unlike the noun apple, which describes a fruit. For tasks in information retrieval it is very useful to know whether a noun refers to a named entity or not because it is a common task in search to find named entities, for example if you want to make a trip to Los Angeles. It will then be important because people do not want to find information about the two words los and angeles, but information about this particular city.

So how do you recognize them? There are several techniques that are used and combined. One way is to analyze parts of speech and try to detect when a certain pattern occurs, for example two nouns in a row (like USB device). Another way is to look in the sentences for keywords that may refer to named entities and then analyze whether there are named entities in this part of the sentence. For instance, in patent retrieval new inventions are described in a way that does not make clear what they really are, in order to make the patent claims broader. For instance, a floppy disc drive can be

In the end you can combine these techniques with machine learning: you mark the named entities in a data set and let an algorithm learn which of the nouns are named entities.
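The part-of-speech pattern mentioned above (two nouns in a row) can be sketched as follows. The sketch assumes the sentence has already been tagged; the tags are hand-assigned in Penn Treebank style, and a real system would get them from a tagger.

```python
def noun_phrases(tagged):
    """Return runs of two or more consecutive noun tokens as candidate entities."""
    phrases, run = [], []
    # The sentinel at the end flushes a noun run that ends the sentence.
    for word, tag in tagged + [("", "END")]:
        if tag.startswith("NN"):
            run.append(word)
        else:
            if len(run) >= 2:
                phrases.append(" ".join(run))
            run = []
    return phrases

# Hand-tagged example sentence (the tags are an assumption, not tagger output).
TAGGED = [
    ("Plug", "VB"), ("the", "DT"), ("USB", "NNP"), ("device", "NN"),
    ("into", "IN"), ("the", "DT"), ("laptop", "NN"),
]

print(noun_phrases(TAGGED))
```

The single noun "laptop" is not reported, because the pattern asks for at least two nouns in a row.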

Another difficulty is the mapping of named entities. For example, you have a text about politics in Germany that talks about the chancellor of Germany. You can use this information, but you still do not know whether it means Angela Merkel or one of her predecessors. You need more information to figure out which person the text is talking about, like the date when the article was written. Another nice example is Java, which is both an island and a programming language. There is also a book that plays with this ambiguity. It is named Java ist auch eine Insel (Java is also an island).

You can find more information about this topic in Marrero et al., for example, and more general information on Wikipedia.

In IR you have a query, and for this query you get a result. But how good is this result? One way to measure this is to calculate the clarity of the result. Clarity means, generally speaking, how much the retrieved results differ from each other. You can measure this by looking at the result set and finding out how much the words in the retrieved documents differ. Query clarity can tell you how much ambiguity there is in your query.

Of course there are different ways to calculate the query clarity. The basic model is the one introduced by Cronen-Townsend et al. Others are the Improved Clarity Score by Hauff et al. and the Simplified Clarity Score by He and Ounis.
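As a rough illustration, here is a sketch in the spirit of the simplified clarity score: it compares how often a term occurs in the query with how often it occurs in the whole collection, and sums the result over the query terms. The toy collection, the function name clarity and the handling of unseen terms are my own assumptions; the original clarity score builds its query model from the retrieved documents instead.

```python
import math
from collections import Counter

# Toy collection, invented for this sketch.
COLLECTION = [
    "armstrong walked on the moon",
    "armstrong played the trumpet",
    "armstrong won the tour",
    "the moon orbits the earth",
]

def clarity(query, documents):
    """Sum over query terms of P(w|query) * log2(P(w|query) / P(w|collection))."""
    coll_counts = Counter(word for doc in documents for word in doc.split())
    coll_total = sum(coll_counts.values())
    query_counts = Counter(query.split())
    query_total = sum(query_counts.values())
    score = 0.0
    for word, count in query_counts.items():
        p_query = count / query_total
        p_coll = coll_counts[word] / coll_total
        if p_coll == 0:
            continue  # assumption: skip unseen terms; real systems would smooth
        score += p_query * math.log2(p_query / p_coll)
    return score

print(clarity("armstrong", COLLECTION))
print(clarity("trumpet", COLLECTION))
```

In this toy collection the query "trumpet" scores higher than "armstrong", because the term is rarer in the collection and therefore more specific.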