Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns.

HTML allows us to place metadata in the head of the document. The metadata can be both properties (as a string) and relationships to other documents.

HTML also allows us to put metadata in the body of the document, using @rel and @rev on anchors.

RDFa extends the @rel/@href technique to allow licenses to be attached to images. Say we have a list of images -- perhaps from a Flickr search -- here we see that we can attach a license to each of them.

HTML allows relationships (the @rel/@href combination) to be used in both the head and the body, but text properties can only be added in the head (via @content on <meta>).

RDFa extends the use of @content to the body. Note a small twist -- we have to use @property instead of @name, since the latter attribute is already used for other purposes. The key thing here is that we've moved the machine-readable data closer to its human-readable version, which makes it a lot easier to publish.

Why would we do this? Well, first of all it's much easier to control the generation of the machine-readable data if it's close to the human-readable data. But second, once you put it close to the human-readable data, there are many situations where the human-readable version will also suffice for the machine-readable one, and so we can avoid duplication. Note that using @content for the date illustrates a different point; in that case we preserve the distinction between the human- and machine-readable forms, because the machine-readable version is very precise.
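To make the choice between visible text and @content concrete, here is a minimal Python sketch of how an extractor might read @property. It is not a conforming RDFa parser: the class name `PropertyExtractor`, the sample markup, the author name and the date are all made up for illustration, and elements carrying @property are assumed to contain no nested elements.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from elements carrying @property.

    Simplification: elements with @property are assumed not to contain
    nested elements, which real RDFa does allow.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # property whose text content we are collecting
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # Explicit machine-readable value: the visible text is ignored.
            self.pairs.append((prop, attrs["content"]))
        else:
            # No @content: the visible text doubles as the value.
            self._open, self._text = prop, []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


# Hypothetical page fragment; the author name and date are made up.
SAMPLE = (
    '<p>Written by <span property="dc:creator">Jane Doe</span> on '
    '<span property="dc:date" content="2009-06-01">1 June 2009</span>.</p>'
)

extractor = PropertyExtractor()
extractor.feed(SAMPLE)
extractor.close()
pairs = extractor.pairs
```

The author's name is written once and serves both readers and machines, while the date keeps a friendly visible form alongside a precise machine-readable one.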

Actually I cheated a little in the last slide. There is no such property as 'author' or 'created'; they just happen to have been used in <head> over the years by a sort of convention. @rel="license" does exist, however, and there are a few other relationship values ('next', 'prev', and so on). But essentially, for other relationship values, and all property values, we need to use CURIEs. The advantage of this is that there are many pre-existing vocabularies that can immediately be used. Also, anyone can create a new vocabulary without having to ask anyone. Commontags was devised a few weeks ago, for example, and they didn't have to ask anyone's permission.

Recall that we added the relationship attributes to an image, so that we can specify license information...

...we can also add properties to the image.

HTML already supported relationships and properties that apply to the document, and we've seen how RDFa adds relationships and properties for images. Now let's look at how RDFa lets us add relationships and properties for anything. Let's say we have a link to a SlideShare presentation.

We know that if we put the @rel attribute onto the <a> tag as normal, it implies that the current document has a license, and that the presentation itself is the license. So this is no good.

The answer is to firstly create a link to the desired license...

...and then to indicate that this license is attached to the presentation. We still use @rel, but now we're using it with the new attribute that RDFa adds -- @about.

And of course, we can also add properties.

Using @about sets the context for any further RDFa, not just on the current element.

Once you are in the new context, then everything works exactly as normal, so compare this to the previous slide; the only difference is that the previous slide uses @about to set the context, whilst this example has the 'current document' to set the context.
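The way @about changes the context can be sketched in a few lines of Python. This is an illustration only, not the RDFa processing rules: the class `SubjectTracker`, the page URI and the SlideShare URL are hypothetical, and the markup is assumed to close every element explicitly.

```python
from html.parser import HTMLParser

BASE = "http://www.example.org/page"  # hypothetical URI of the current document
SLIDES = "http://www.slideshare.net/example/presentation"  # hypothetical


class SubjectTracker(HTMLParser):
    """Emit (subject, rel, href) triples.

    @about sets the subject for an element and everything nested in it;
    without @about the subject is inherited, ultimately from the document.
    Assumes every element is explicitly closed (no void elements).
    """

    def __init__(self, base):
        super().__init__()
        self.stack = [base]  # innermost subject is last
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        subject = attrs.get("about", self.stack[-1])
        self.stack.append(subject)
        if "rel" in attrs and "href" in attrs:
            self.triples.append((subject, attrs["rel"], attrs["href"]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()


SAMPLE = f"""
<p>This page is licensed under
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>.</p>
<div about="{SLIDES}">
  The slides are licensed under
  <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>.
</div>
"""

tracker = SubjectTracker(BASE)
tracker.feed(SAMPLE)
tracker.close()
triples = tracker.triples
```

The two links are written identically; only the enclosing @about decides whether the license is attached to the page or to the presentation.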

We've gone into a lot of detail on the basics of RDFa to show how it builds upon HTML's existing semantic features, but there are many more features. The main thing to emphasise is that HTML already had some useful semantic features, but what they meant was never formalised; RDFa did that. RDFa also adds to these features, but does so by applying the same approach.

There is much more we could have said, but we suggest that interested readers look at the RDFa Primer, and other tutorials and articles. In passing, we would say though that RDFa supports all of RDF's more advanced features too, such as datatypes of literals, rdf:type, bnodes, XML literals, and so on. Advanced RDFa also allows quite elaborate chaining of statements, allowing people to be connected to companies, reviews to businesses, and so on.

As Vish discussed, SearchMonkey is all about building richer, more useful search results. Here are a few examples of Enhanced Results.

And it allows the user to add the movie directly to their online movie rental queue.

[will be animated]


SW: Representing and reasoning with structured data on the Web; both a relational and a graph view on information.
IR: Aggregating information at a document level based on ad-hoc information needs.
DB: Representing and querying information in a relational model.
NLP: From text to information.

Results are good, but consider the ads: First ad says: Virgins. Looking for virgins? Find exactly what you want today. Ebay.com Second ad: Virgins. …Find cheap tickets for Virgins. Third ad: Adspam… these people buy Yahoo! traffic and sell it to Google.

4.
Yahoo! by numbers (April, 2007) <ul><li>There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007). </li></ul><ul><li>Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data). </li></ul><ul><li>There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007) </li></ul><ul><li>Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007). </li></ul><ul><li>Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data). </li></ul><ul><li>Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006). </li></ul>

5.
Agenda <ul><li>Part 1 </li></ul><ul><ul><li>Publishing content on the Semantic Web </li></ul></ul><ul><ul><ul><li>Intro to RDF and the Semantic Web </li></ul></ul></ul><ul><ul><ul><li>Six ways to publish data on the Semantic Web </li></ul></ul></ul><ul><ul><ul><li>History of embedded metadata on the Web </li></ul></ul></ul><ul><ul><ul><li>RDFa, best practices and tools </li></ul></ul></ul><ul><ul><ul><li>Exercise </li></ul></ul></ul><ul><li>Part 2 </li></ul><ul><ul><li>Semantic Web in use </li></ul></ul><ul><ul><ul><li>SearchMonkey </li></ul></ul></ul><ul><ul><ul><li>BOSS and YQL </li></ul></ul></ul><ul><ul><ul><li>Semantic Search and Navigation </li></ul></ul></ul><ul><li>Part 3 </li></ul><ul><ul><li>Research in Semantic Search </li></ul></ul>

8.
Basic RDF <ul><li>RDF has two basic types of entities: resources and literals </li></ul><ul><ul><li>Roughly objects and built-in types in Object Oriented Programming </li></ul></ul><ul><ul><li>Resources are identified by a URI; a resource without a URI is called a blank node </li></ul></ul><ul><ul><ul><li>URIs are a generalization of URLs </li></ul></ul></ul><ul><ul><ul><li>Notation: <http://www.example.org/Person> or ex:Person </li></ul></ul></ul><ul><ul><li>Literals have an optional language and datatype (string, integer etc.) </li></ul></ul><ul><ul><ul><li>Datatypes are identified by URIs, e.g. XML Schema datatypes </li></ul></ul></ul><ul><ul><ul><li>Two literals are the same if their components are the same </li></ul></ul></ul><ul><ul><ul><li>Notation: “Joe B.” or Joe@en^^http://…#string </li></ul></ul></ul>

9.
RDF models <ul><li>A triple aka a statement is a tuple of (subject, predicate, object) </li></ul><ul><ul><li>Example: (Joe, loves, Mary) </li></ul></ul><ul><ul><li>Each triple gives the value of a property for a given resource or relates two objects to one another </li></ul></ul><ul><ul><li>A predicate is always a resource with a URI </li></ul></ul><ul><ul><li>A triple is also called a statement </li></ul></ul><ul><li>An RDF model is a set of triples </li></ul><ul><ul><li>Ordering of statements in an RDF document is irrelevant (unlike XML) </li></ul></ul>
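As a rough illustration of the slide above, a triple can be modelled as a Python tuple and an RDF model as a set of such tuples. The `EX` namespace and the `values` helper are hypothetical names chosen for this sketch; a real application would use an RDF library.

```python
# A triple is a (subject, predicate, object) tuple; a model is a set of them.
EX = "http://www.example.org/"  # hypothetical namespace

model_a = {
    (EX + "Joe", EX + "loves", EX + "Mary"),  # relates two resources
    (EX + "Joe", EX + "name", "Joe B."),      # gives a property value (a literal)
}
model_b = {
    (EX + "Joe", EX + "name", "Joe B."),
    (EX + "Joe", EX + "loves", EX + "Mary"),
}

# Sets ignore ordering, so the two models are the same model.
same_model = model_a == model_b


def values(model, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return {o for (s, p, o) in model if s == subject and p == predicate}


loved = values(model_a, EX + "Joe", EX + "loves")
```

Because the model is a set, stating the same triple twice changes nothing, and the order in which statements appear in a document carries no meaning.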

11.
Ontologies <ul><li>Ontologies are collections of classes and properties used to describe objects in a particular domain </li></ul><ul><ul><li>Ontologies themselves are described in RDF or OWL (the Web Ontology Language), an extension of RDF </li></ul></ul><ul><ul><li>Example: the Friend-Of-A-Friend (FOAF) ontology for personal profiles </li></ul></ul><ul><li>Classes can be described by sub- and superclasses, required properties </li></ul><ul><ul><li>Class membership in RDF is expressed using the rdf:type property </li></ul></ul><ul><ul><li>An instance can have multiple classes (types) </li></ul></ul><ul><ul><li>A class can have multiple superclasses </li></ul></ul><ul><li>Properties can be described by their domain, range, cardinalities, etc. </li></ul>

12.
Advanced topic: Resources vs Literals <ul><li>Resources are objects, Literals are strings </li></ul><ul><li>Resources are instances of classes, Literals have datatypes </li></ul><ul><li>Whether something is a resource or literal sometimes depends on the detail of modeling </li></ul><ul><ul><li><meta property="myvocab:knows">Paris Hilton</meta> </li></ul></ul><ul><ul><li><item rel="foaf:knows"> </li></ul></ul><ul><ul><ul><li><meta property="foaf:name">Paris Hilton</meta> </li></ul></ul></ul><ul><ul><li></item> </li></ul></ul><ul><li>You cannot make statements about literals (literals are always the object in a triple) </li></ul><ul><li>Resources can carry a globally unique identifier, literals have no identity </li></ul><ul><li>Web resources such as documents and images are resources </li></ul><ul><ul><li><item rel="rdfs:seeAlso" resource="http://www.some.related.page.com/"/> </li></ul></ul><ul><ul><li><item rel="foaf:img" resource="http://photosite.example.org/photo.jpg"/> </li></ul></ul><ul><li>When in doubt: it’s a resource </li></ul>

13.
Advanced Topic: Informational resources vs. Conceptual resources <ul><li>Informational resource: an HTML document, image, any other file on the Web </li></ul><ul><ul><li>Retrievable in its entirety from the Web </li></ul></ul><ul><ul><li>Retrieving it can return a 200 OK </li></ul></ul><ul><li>Conceptual (non-informational) resource: a person, an event, a place, etc. </li></ul><ul><ul><li>A description of it may be retrievable from the Web </li></ul></ul><ul><ul><li>When identified by a URL, retrieving it should return a 303 See Other redirect </li></ul></ul><ul><li>Never confuse a webpage with what it describes! </li></ul><ul><ul><li>You are not your Facebook profile: one is a document, the other is a person. A document has properties such as byte-size, media-type etc, a person has name, age, etc. </li></ul></ul><ul><ul><li>Make sure you don’t use the URL of an existing webpage as the URI of a resource </li></ul></ul>

14.
RDF is designed for distributed systems <ul><li>URIs provide web-wide global identification across documents </li></ul><ul><ul><li>A resource may be described by multiple documents </li></ul></ul><ul><ul><li>We know it’s the same resource because the same URI is used or through reasoning (advanced topic…) </li></ul></ul><ul><ul><li>URIs are intended to be reused </li></ul></ul><ul><ul><li>Unique, but not single identifiers: two URIs may denote the same thing </li></ul></ul><ul><li>URIs are dereferenceable (can be retrieved) </li></ul><ul><ul><li>A well-behaved URI returns a description of the resource </li></ul></ul><ul><ul><li>Provides authority: the definition of foaf:Person lives at that URI </li></ul></ul><ul><li>Ontologies can be looked up as well </li></ul><ul><ul><li>Typically at the root of the URIs, also known as the namespace </li></ul></ul><ul><ul><li>Example: http://xmlns.com/foaf/0.1/Person redirects to the specification </li></ul></ul>
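A small sketch of why shared URIs matter: two documents that use the same URI for a resource can be merged without any coordination. The person URI, the photo URL and the `describe` helper below are hypothetical; only the FOAF namespace comes from the slide.

```python
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"
PETER = "http://www.example.org/people#peter"  # hypothetical resource URI

# Two independently published documents describing the same resource.
doc1 = {(PETER, FOAF + "name", "Peter Mika")}
doc2 = {(PETER, FOAF + "img", "http://www.example.org/photo.jpg")}

# Merging models is plain set union; the shared URI ties the facts together.
merged = doc1 | doc2


def describe(model, subject):
    """Group everything the model says about one subject by predicate."""
    description = defaultdict(set)
    for s, p, o in model:
        if s == subject:
            description[p].add(o)
    return dict(description)


peter = describe(merged, PETER)
```

Had the two documents used different URIs for the same person, the merge would still succeed but produce two unrelated descriptions, which is where reasoning would have to step in.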

25.
Option 5: XSLT <ul><li>Publish the transformation from HTML to structured data </li></ul><ul><ul><li>GRDDL is a standard for linking an HTML page to a transformation that produces RDF data </li></ul></ul><ul><li>Advantages </li></ul><ul><ul><li>No change to the page </li></ul></ul><ul><li>Disadvantages </li></ul><ul><ul><li>Transformation needs to be executed to get to the data </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>Intel MashMaker </li></ul></ul><ul><ul><li>Dapper </li></ul></ul><ul><ul><li>Glue API from AdaptiveBlue </li></ul></ul>

36.
Example: Creative Commons <ul><li>Current: rel attribute (HTML4) </li></ul>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>. <ul><li>Use of the “rel” attribute for semantic annotation is the birth of the microformat… </li></ul>

37.
Microformats (μf) <ul><li>Community centered around microformats.org </li></ul><ul><ul><li>Specifications and discussions are hosted there </li></ul></ul><ul><li>Agreements on the way to encode certain kinds of metadata in HTML </li></ul><ul><ul><li>Reuse of semantic-bearing HTML elements </li></ul></ul><ul><ul><li>Based on existing standards </li></ul></ul><ul><ul><li>Minimality </li></ul></ul><ul><li>Microformats exist for a limited set of objects </li></ul><ul><ul><li>hCard (persons and organizations) </li></ul></ul><ul><ul><li>hCalendar (events) </li></ul></ul><ul><ul><li>hResume </li></ul></ul><ul><ul><li>hProduct </li></ul></ul><ul><ul><li>hRecipe </li></ul></ul><ul><li>Varying degrees of support and stability </li></ul><ul><ul><li>hCard and rel-tag are widely supported </li></ul></ul>

40.
Microformats vs. RDFa <ul><li>Choose microformats when you find a microformat that fits your needs and is supported by Yahoo! </li></ul><ul><ul><li>Microformats are the first option because they are simple </li></ul></ul><ul><ul><li>We support all major microformats, see the documentation </li></ul></ul><ul><ul><li>It’s a common misconception that RDFa requires XHTML: it doesn’t </li></ul></ul><ul><li>If you find none that perfectly fits your needs then you need RDFa </li></ul><ul><ul><li>Microformats have a fixed schema: you cannot add your own attributes </li></ul></ul><ul><li>Example: a social networking site with user profiles </li></ul><ul><ul><li>VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections </li></ul></ul><ul><ul><li>You either live without this, or go with RDFa </li></ul></ul><ul><li>The rest of this presentation is about RDFa, which is more powerful, but also more complex </li></ul><ul><ul><li>We will focus on the concepts that are hard to grasp </li></ul></ul>

41.
Keep an eye on HTML5 <ul><li>Currently under standardization at the W3C </li></ul><ul><ul><li>Last Call this fall, keep an eye on it </li></ul></ul><ul><li>Introduces Microdata </li></ul><ul><ul><li>Similar to microformats </li></ul></ul><ul><ul><ul><li>Some predefined vocabularies with central registration </li></ul></ul></ul><ul><ul><li>Some of the flexibility of RDFa </li></ul></ul><ul><ul><li>Introduce new terms using reverse domain names or full URIs </li></ul></ul><ul><li>Semantic HTML elements such as <time>, <video>, <article>… </li></ul>

44.
What does RDFa look like? <ul><li>There are some metadata features in HTML already... </li></ul><ul><li>...so we give them an RDF interpretation... </li></ul><ul><li>...then we generalise them... </li></ul><ul><li>...and then we add a few more. </li></ul>

52.
CURIEs, or Compact URIs <ul><li>Named after Marie Curie, who was the first person to receive two Nobel prizes, one for physics and one for chemistry. </li></ul><ul><li>CURIEs allow a full URI to be expressed in a simple prefix:suffix form. </li></ul><ul><li>The 'suffix' part is looser than in XML namespaces, supporting formulations such as abc:123. </li></ul>
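Expanding a CURIE is simple string concatenation once the prefix mappings are known. The sketch below assumes a hard-coded `PREFIXES` table and a hypothetical `abc` namespace; in a real page the mappings come from the markup itself.

```python
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "abc": "http://www.example.org/abc#",  # hypothetical namespace
}


def expand_curie(curie, prefixes=PREFIXES):
    """Turn prefix:suffix into a full URI using the given prefix mappings."""
    prefix, sep, suffix = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand {curie!r}: unknown or missing prefix")
    return prefixes[prefix] + suffix


person = expand_curie("foaf:Person")
numeric = expand_curie("abc:123")  # a suffix that XML namespaces would not allow
```

Note that `abc:123` expands without complaint, which is the looser suffix rule the slide refers to.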

63.
Advanced RDFa <ul><ul><li>use of @datatype to set the data type of @content; </li></ul></ul><ul><ul><li>use of @typeof to set rdf:type; </li></ul></ul><ul><ul><li>support for bnodes; </li></ul></ul><ul><ul><li>support for XML literals; </li></ul></ul><ul><ul><li>ability to chain statements together. </li></ul></ul><ul><li>Note that since RDFa supports all of the features you'll find in RDF, then it means that you can even mark-up OWL documents in HTML. </li></ul>

64.
RDFa pitfalls <ul><li>Validation problems can stop us from extracting data </li></ul><ul><ul><li>Use the W3C validator </li></ul></ul><ul><ul><li>Use the right DOCTYPE declaration if using XHTML </li></ul></ul><ul><ul><li>Set the encoding of your page properly (using HTTP headers or XML declaration) </li></ul></ul><ul><li>Prefixes need to be defined using the xmlns attribute </li></ul><ul><li>Unless you are making statements about the document, set the subject using the about attribute </li></ul><ul><li>Do not include HTML elements in literal values </li></ul><ul><ul><li>Incorrect: <div property="foaf:name"><b>Peter Mika</b></div> </li></ul></ul><ul><li>Use absolute URIs as the value of the resource attribute </li></ul><ul><ul><li>Or make sure you specify HTML base </li></ul></ul>

66.
More pitfalls: the typeof attribute <ul><li>Typeof does two things at once: it creates a new subject resource and assigns the type to it </li></ul><ul><li>BAD example: </li></ul><ul><ul><li><div about="#id"> </li></ul></ul><ul><ul><li><span property="foaf:name">Peter Mika</span> </li></ul></ul><ul><ul><li><span rel="foaf:img" resource="http://www.example.org/photo.jpg"> </li></ul></ul><ul><ul><li><span typeof="foaf:Image"> </li></ul></ul><ul><ul><li> <span property="dc:format">jpg</span> </li></ul></ul><ul><ul><li></span> </li></ul></ul><ul><ul><li></span> </li></ul></ul><ul><ul><li></div> </li></ul></ul><ul><li>To correct, you have to repeat the resource attribute on the span node with the typeof </li></ul>

68.
More pitfalls: breaking up descriptions <ul><li>You cannot break up a description like this: </li></ul><ul><li><span rel="foaf:knows"> <span property="foaf:name">Peter Mika</span> </span> …. </li></ul><ul><li><span rel="foaf:knows"> <a rel="foaf:mbox" href="mailto:pmika@yahoo-inc.com" /> </span> </li></ul><ul><li>This is not the same as: </li></ul><ul><li><span rel="foaf:knows"> <span property="foaf:name">Peter Mika</span> </li></ul><ul><li> <a rel="foaf:mbox" href="mailto:pmika@yahoo-inc.com" /> </li></ul><ul><li></span> </li></ul><ul><li>In the first case there are two related resources, with one attribute each, in the second case there is a single related resource with two attributes. </li></ul>
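The difference between the two cases becomes visible once the triples are written out. The sketch below does not parse any markup; it only models the effect that each span with rel="foaf:knows" introduces its own new blank node. The page URI and the `knows` helper are hypothetical, and the mailbox is expressed with foaf:mbox, FOAF's mailbox property.

```python
import itertools

FOAF = "http://xmlns.com/foaf/0.1/"
PAGE = "http://www.example.org/page"  # hypothetical document URI

_ids = itertools.count(1)


def knows(subject, blocks):
    """Model a run of <span rel="foaf:knows"> elements.

    Each block is the list of (predicate, object) pairs found inside one
    span; every span introduces its own new blank node.
    """
    triples = set()
    for block in blocks:
        node = f"_:b{next(_ids)}"
        triples.add((subject, FOAF + "knows", node))
        for predicate, obj in block:
            triples.add((node, predicate, obj))
    return triples


name = (FOAF + "name", "Peter Mika")
mbox = (FOAF + "mbox", "mailto:pmika@yahoo-inc.com")

split = knows(PAGE, [[name], [mbox]])  # two spans, one property each
joined = knows(PAGE, [[name, mbox]])   # one span holding both properties


def known_nodes(triples):
    """Return the distinct resources that the subject is said to know."""
    return {o for (_, p, o) in triples if p == FOAF + "knows"}
```

In `split` the page knows two anonymous resources, one with a name and one with a mailbox; in `joined` it knows a single resource that has both.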

69.
Tips <ul><li>Hiding information from being displayed </li></ul><ul><ul><li>Links without content will not be rendered </li></ul></ul><ul><ul><li>Use <span property="foaf:name" content="Peter Mika"/> </li></ul></ul><ul><li>Use datatypes to provide the expected type of a literal. </li></ul><ul><ul><li>This helps validation because any tool can check whether the literal is indeed of that type. </li></ul></ul>
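The kind of check a tool could run on a typed literal might look like the following sketch. It covers only two XML Schema datatypes and a simplified date form; the `conforms` function and the `CHECKS` table are names invented for this illustration.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def _is_integer(text):
    return re.fullmatch(r"[+-]?[0-9]+", text) is not None


def _is_date(text):
    # Only the plain YYYY-MM-DD form is checked; xsd:date also permits a
    # timezone suffix, which this sketch does not handle.
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


CHECKS = {XSD + "integer": _is_integer, XSD + "date": _is_date}


def conforms(lexical_form, datatype):
    """Return True or False for known datatypes, None when we cannot tell."""
    check = CHECKS.get(datatype)
    return None if check is None else check(lexical_form)


ok = conforms("2009-06-01", XSD + "date")
bad = conforms("2009-13-01", XSD + "date")  # there is no 13th month
```

Without the datatype, both strings would be equally acceptable plain text and the mistake would go unnoticed.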

70.
Choosing a vocabulary <ul><li>Look at SearchMonkey objects </li></ul><ul><ul><li>Video, Games, Presentations, Events, News, Businesses, Products, Discussion </li></ul></ul><ul><li>Search the Web or ask for advice on mailing lists </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>[email_address] org </li></ul></ul><ul><li>Wikis </li></ul><ul><ul><li>semanticweb.org </li></ul></ul><ul><ul><li>vocamp.org </li></ul></ul><ul><li>Beware of people who claim to have the vocabulary of everything </li></ul><ul><ul><li>Preferably you want something small and targeted </li></ul></ul><ul><li>Never a 100% fit: you will need to introduce vocabulary terms (classes and properties) </li></ul><ul><ul><li>Do not introduce new classes/properties in existing namespaces </li></ul></ul><ul><ul><li>Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list. </li></ul></ul>

71.
Advanced topic: creating a vocabulary <ul><li>Get advice on methodology </li></ul><ul><ul><li>vocamp.org and semanticweb.org </li></ul></ul><ul><li>Choose a namespace and a prefix </li></ul><ul><ul><li>Give sensible names, e.g. name it after your site, but don’t call it searchmonkey </li></ul></ul><ul><ul><li>Namespace ends either with a slash or a hash </li></ul></ul><ul><li>Create an RDF or OWL document describing your classes and properties </li></ul><ul><ul><li>Use an ontology editor such as Protégé 4.0 </li></ul></ul><ul><ul><li>Follow naming conventions </li></ul></ul><ul><li>Publish your vocabulary </li></ul><ul><ul><li>Make sure the URIs of your properties and classes are resolvable </li></ul></ul><ul><ul><ul><li>E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam </li></ul></ul></ul><ul><li>Convince others to adopt your vocabulary </li></ul><ul><ul><li>If you are in fishing, convince other fishing businesses </li></ul></ul>

72.
The process of annotating with RDFa <ul><ul><li>Invest in familiarizing yourself with the RDFa syntax by reading the RDFa Primer </li></ul></ul><ul><ul><ul><li>It is also highly recommended that you read the RDF Primer. RDF is the data model used by RDFa. </li></ul></ul></ul><ul><ul><li>Choose a vocabulary from the SearchMonkey documentation that fits your needs </li></ul></ul><ul><ul><ul><li>A vocabulary describes a set of types and attributes within a given domain </li></ul></ul></ul><ul><ul><ul><li>If you don’t find a good candidate, extend an existing one or create a new one </li></ul></ul></ul><ul><ul><li>Annotate your page. </li></ul></ul><ul><ul><ul><li>Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa. </li></ul></ul></ul><ul><ul><ul><li>No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting. </li></ul></ul></ul><ul><ul><ul><li>Use the RDFa Distiller to check which data can be extracted from your page. </li></ul></ul></ul><ul><ul><ul><li>If you fancy, use the RDF Validator to visualize the resulting RDF graph. </li></ul></ul></ul><ul><ul><li>Put the annotated page online. The data will be extracted the next time your page is crawled </li></ul></ul><ul><ul><ul><li>No need to explicitly submit anything </li></ul></ul></ul><ul><ul><ul><li>No notification when your site is crawled </li></ul></ul></ul><ul><li>See http://rdfa.info/rdfa-implementations for new tools and APIs </li></ul>

75.
Microsearch <ul><li>Metadata is out there </li></ul><ul><ul><li>Just how much data is out there? </li></ul></ul><ul><ul><li>What is the quality? </li></ul></ul><ul><li>Idea: bring metadata to the surface of search </li></ul><ul><li>How does it work? </li></ul><ul><ul><li>User enters query </li></ul></ul><ul><ul><li>Metadata is extracted dynamically </li></ul></ul><ul><ul><li>Entity reconciliation </li></ul></ul><ul><ul><li>Metadata is used to display </li></ul></ul><ul><ul><ul><li>rich abstracts, </li></ul></ul></ul><ul><ul><ul><li>related pages </li></ul></ul></ul><ul><ul><ul><li>spatial, temporal visualization </li></ul></ul></ul><ul><li>Microsearch prototype </li></ul>

80.
Lessons <ul><li>More metadata than we expected </li></ul><ul><ul><li>53% of unique queries have at least one metadata-enabled page in top 10 (n=7848) </li></ul></ul><ul><li>Performance is poor </li></ul><ul><ul><li>Metadata needs to come from the index for performance </li></ul></ul><ul><li>‘Metacrap’ does exist </li></ul><ul><ul><li>Users have to see metadata to spot mistakes in their markup, warn others </li></ul></ul><ul><li>RDF templating (Fresnel) adds complexity </li></ul><ul><ul><li>Abstract needs to be customized to the particular site, query </li></ul></ul>

100.
Google’s Rich Snippets <ul><li>Shares a subset of the features of SearchMonkey </li></ul><ul><ul><li>Encourages publishers to embed certain microformats and RDFa into webpages </li></ul></ul><ul><ul><ul><li>Currently reviews, people, products, business & organizations </li></ul></ul></ul><ul><ul><li>These are used to generate richer search results </li></ul></ul><ul><li>SearchMonkey is customizable </li></ul><ul><ul><li>Developers can develop applications themselves </li></ul></ul><ul><li>SearchMonkey is open </li></ul><ul><ul><li>Wide support for standard vocabularies </li></ul></ul><ul><ul><li>API access </li></ul></ul>

116.
Hard searches <ul><li>Ambiguous searches </li></ul><ul><ul><li>Paris Hilton </li></ul></ul><ul><li>Multimedia search </li></ul><ul><ul><li>Images of Paris Hilton </li></ul></ul><ul><li>Imprecise or overly precise searches </li></ul><ul><ul><li>Publications by Jim Hendler </li></ul></ul><ul><ul><li>Find images of strong and adventurous people (Lenat) </li></ul></ul><ul><li>Searches for descriptions </li></ul><ul><ul><li>Search for yourself without using your name </li></ul></ul><ul><ul><li>Product search (ads!) </li></ul></ul><ul><li>Searches that require aggregation </li></ul><ul><ul><li>Size of the Eiffel Tower (Lenat) </li></ul></ul><ul><ul><li>Public opinion on Britney Spears </li></ul></ul><ul><li>Queries that require a deeper understanding of the query, the content and/or the world at large </li></ul><ul><ul><li>Note: some of these are so hard that users don’t even try them any more </li></ul></ul>

122.
Study: metadata analysis <ul><li>What vocabularies are being used? </li></ul><ul><li>What microformats should we support? </li></ul><ul><li>How much vocabulary reuse/extension is there? </li></ul><ul><ul><li>Is there a convergence? </li></ul></ul><ul><li>What is the quality of metadata? </li></ul><ul><ul><li>Datatype conformance </li></ul></ul><ul><ul><li>Logical consistency </li></ul></ul><ul><ul><li>Conformance to common use wrt common attributes </li></ul></ul><ul><li>How much spam is there? </li></ul><ul><ul><li>Distribution of spamicity scores </li></ul></ul><ul><ul><li>Do spamicity scores transfer to metadata? </li></ul></ul><ul><li>Are there new schemas emerging through the combination of existing vocabularies? </li></ul><ul><li>What is the metadata coverage in terms of queries? </li></ul><ul><ul><li>What percentage of queries from query logs would result in metadata? </li></ul></ul><ul><ul><li>How many would result in metadata that could answer the query? (by some approximation) </li></ul></ul>

123.
Study: Semantic Search Assist <ul><li>Observation: the same type of objects often have the same query context </li></ul><ul><ul><li>Users asking for the same aspect of the type </li></ul></ul><ul><li>Could we make query suggestions based on the type of the entity? </li></ul><ul><ul><li>Improvement for infrequent queries </li></ul></ul>Example queries: apple ipod nano review; sony plasma tv review; jerry yang biography; biography tim berners lee; tim berners lee blog; peter mika yahoo; britney spears shaves her head

124.
Study: evaluation of semantic search <ul><li>Analysis of user needs </li></ul><ul><ul><li>How are these needs aligned with data on the Web? </li></ul></ul><ul><ul><li>How do the vocabularies differ? </li></ul></ul><ul><li>Analysis of query types </li></ul><ul><ul><li>Object queries? Object-attribute queries? Relationship queries? </li></ul></ul><ul><li>What does it mean for an object or a set of triples to be relevant to a query? </li></ul><ul><ul><li>Show me the answer and only the answer </li></ul></ul><ul><ul><li>Put me near the answer in the graph </li></ul></ul><ul><ul><li>Show me the justification (or at least the source) of the answer </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Semantic Search evaluation campaign planned for 2010 </li></ul>

125.
Challenges <ul><li>Future work in Semantic Web </li></ul><ul><ul><li>(Semi-)automated ways of metadata creation </li></ul></ul><ul><ul><ul><li>How do we go from 5% to 95%? </li></ul></ul></ul><ul><ul><li>Data quality </li></ul></ul><ul><ul><ul><li>We allow providing metadata for other people’s sites! </li></ul></ul></ul><ul><ul><li>Reasoning </li></ul></ul><ul><ul><ul><li>To what extent is reasoning useful? </li></ul></ul></ul><ul><ul><ul><li>For example, how much would entity resolution or taxonomic reasoning help? </li></ul></ul></ul><ul><ul><li>Scale </li></ul></ul><ul><ul><ul><li>How do we exploit cluster computing techniques? </li></ul></ul></ul><ul><ul><ul><li>What is between databases and IR engines? </li></ul></ul></ul><ul><ul><li>Fostering social agreements </li></ul></ul><ul><ul><ul><li>How do we get people to reuse vocabularies? </li></ul></ul></ul>