Abstract

This paper describes a SKOS-based approach to indexing large ontologies, aimed at overcoming the scalability and responsiveness issues that hinder an efficient, general-purpose autocompletion system. Unlike autocompletion systems whose underlying vocabularies are limited, flat and easily indexable with classical text indexing techniques, the issues addressed in this paper mainly concern large, highly unpredictable, linked data spaces such as those inspired by the Linked Data philosophy.

1. Introduction

The term autocompletion, often used in the field of user interfaces and, more generally, in human-machine interaction technologies, refers to the capacity of a system to predict what the user is currently typing in order to complete the string automatically. Autocompletion systems have found important applications in several programming IDE tools and in some enhanced mobile software, where the user's capabilities are limited by non-QWERTY keyboards [2].

Common application fields are characterized by well-defined and limited underlying vocabularies, where the words in the lexicon have different leading characters; under these constraints autocompletion prediction is more feasible [3]. Autocompletion could, however, find an interesting application when, on the contrary, the underlying vocabulary is rich, highly unpredictable and heterogeneous, like the one represented by the Linking Open Data cloud [1]. In fact, apart from the obvious benefits that a generic user could get from such systems, such as fewer misspellings and errors, autocompletion offers a feasible way to improve term disambiguation, enabling the user to identify exactly the concept he/she wants to refer to [5].

Classical autocompletion systems, like the well-known Google Suggest, which shows the user a list of possible search keywords ordered by their occurrence value, are traditionally based on the idea of matching input strings against a list of usable words in a vocabulary. A system aiming at term disambiguation instead needs to be based on an underlying ontology, in order not only to correctly complete the partial text written by the user with the rest of the string, but also to match it with a concept of that ontology. To achieve this, the autocompletion choices presented to the user need to be categorized according to the concept they are instances of, or by another mechanism like the one offered by SKOS.

In this paper an approach based on a SKOS-indexed snapshot of an RDF ontology is presented. For prototyping purposes the DBpedia RDF ontology is used, more specifically a relational view of it, in order to reduce the complexity of the real-time computation of the autocompletion choices presented to the user. This section continues with a subsection describing the formalization of the semantic autocompletion problem and ends with a quick overview of other efforts already present in the literature. Section 2 presents the main aspects of our solution, together with the algorithms used. Conclusions and acknowledgements close the paper.

1.1 The Semantic Web autocompletion: problem statement

One of the best attempts to formally define the semantic autocompletion problem is provided in [5], where the authors devise a real-time mechanism that prunes the graph, made of all the facets of a certain node, according to the potential matching of the label with the user-typed string. Even if the authors assure the applicability of their solution providing three different deployment scenarios, it is our opinion that this solution, not providing as far as we can see an indexing phase on the target ontology, could be unsuitable from a scalability point of view when dealing with large data spaces.

Hildebrand et al. [6] provide a serious investigation of an autocompletion system that is also deployed in real scenarios. Their adoption of SKOS, which demonstrates its suitability, deeply influenced the present work.

The rest of this section provides a formal statement of the problem as we modeled it. This formalization could be helpful to understand the algorithm we devised to produce the relational snapshot of the whole DBpedia dataspace, as described in the following section.

Informally, the problem of achieving an autocompletion service based on the linked data paradigm could be expressed as follows:

    given a string s, retrieve all the instances with an rdfs:label property value that starts with the string s, grouped by the most representative SKOS subjects

where the key issue is the meaning of the "most representative SKOS subject". Since every node could be associated with several different SKOS subjects, as depicted in Figure 1, the good behavior of the autocompletion service is strictly influenced by the SKOS subject chosen to represent every resource.

Figure 1. Kevin Spacey example

As shown in Figure 1, the resource with the rdfs:label "Kevin Spacey" is associated with several different SKOS categories, each of them related to a different aspect of the "Kevin Spacey" resource. From the autocompletion point of view, presenting the resource "Kevin Spacey" to the user under the category "American Actors" is much more convenient than presenting it under "American expatriates in the United Kingdom", since it enables the user to provide an effective disambiguation.

Choosing the narrowest category a resource has, or the most consistent one, are solutions that, even if characterized by a straightforward computational complexity, often lead to undesirable results. Such solutions do not take into consideration how the resource is linked with the rest of the data space, losing the opportunity to discover new relationships and links between different instances and concepts. In a word, they lose the opportunity offered by such linked knowledge.

2 The relational snapshot solution

What we call here the relational snapshot is essentially an index built on an RDF ontology, where each tuple of the relation refers to a resource R in the ontology and contains the following fields:

    (URI of R, rdfs:label of R, a skos:subject S of R, rdfs:label of S)

where the skos:subject S plays the role of the category coupled with the instance R presented to the user during the autocompletion phase.

Figure 2. A relational snapshot

The adoption of a solution based on an index natively stored in a relational model is motivated mainly by the following considerations:

• Responsiveness of the adopted solution: since an autocompletion system finds its natural usage in a Web-based environment, and since the service can be accessed only remotely, responsiveness raises serious issues around the scalability of the adopted solution. A consolidated, 30-year-old technology is still the most appropriate one to address the scalability problems that come up when dealing with substring matching over large data sets, instead of directly using the native SPARQL substring matching constructs [4].

• Transparency: the relational view allows the end users of the autocompletion service to access the indexed ontology regardless of how it is stored.

• Index adaptability: every change in the indexed ontology should be reflected in the index itself. The adopted solution allows the relational view to be modified simply by accessing the stored tuples with SQL. The removal or insertion of RDF resources can be resolved trivially by a single SQL delete or update on the index.

2.1 A SKOS based indexing

Even if the identification of the most representative SKOS subject of a certain node seems too closely related to the perception that a user could have of that node, several useful assumptions can be made around the concept of pertinence of a resource to one SKOS subject, measured as a degree of similarity between two resources. More precisely,

• given two resources, the degree of similarity is the number of SKOS subjects that they have in common, and

• the most linked resource of a given SKOS subject is the resource with the highest number of links from other resources in the dataspace.

With these two roughly defined properties it is possible to state that the most representative SKOS subject of a given resource is the one that has the highest similarity degree between its most linked resource and the initial resource to be categorized. An attempt to formalize the algorithm that computes the most representative SKOS subject of a given resource is provided in the rest of this section, using the SPARQL syntax when needed.

2.2 The Most representative SKOS subject identification algorithm

Here follows the formal description of the algorithms that are the foundation of our indexing technique. Even if a formal analysis of the computational complexity is not provided, this formalization is necessary to make the discussion more readable and concrete. However, the computational complexity of the algorithm strongly depends on the underlying algorithm used to resolve the SPARQL queries on the targeted ontology. In the SPARQL queries below, <r>, <subject> and <resource> stand for the URI passed as input.

    Input: a resource URI r
    Output: a SKOS subject URI s
    subjects ← SkosSubjects(r);
    integer similarity ← 0;
    SKOS subject URI s ← subjects[0];
    foreach subject ∈ subjects do
        mostLinked ← MostLinkedResource(subject);
        actualSimilarity ← ResourceSimilarity(r, mostLinked);
        if similarity ≤ actualSimilarity then
            similarity ← actualSimilarity;
            s ← subject;
        end
    end
    return s

Algorithm 1: The most representative SKOS subject identification algorithm

    Input: a resource URI r
    Output: a set of SKOS subject URIs subjects
    subjects ← ExecuteSPARQLQuery("SELECT DISTINCT ?uri WHERE { <r> skos:subject ?uri }")
    return subjects

Algorithm 2: The SPARQL query to retrieve the SKOS subjects of a resource

    Input: a SKOS subject URI subject
    Output: a resource URI resource
    subjectResources ← ExecuteSPARQLQuery("SELECT DISTINCT ?uri WHERE { ?uri skos:subject <subject> }");
    mostLinked ← subjectResources.first();
    linkedSize ← GetLinkedResources(mostLinked).size();
    foreach resource ∈ subjectResources do
        actualLinkedSize ← GetLinkedResources(resource).size();
        if linkedSize ≤ actualLinkedSize then
            linkedSize ← actualLinkedSize;
            mostLinked ← resource;
        end
    end
    return mostLinked

Algorithm 3: The function that retrieves the most linked resource of a given SKOS subject

    Input: a resource URI resource
    Output: a set of resource URIs resources
    resources ← ExecuteSPARQLQuery("SELECT DISTINCT ?uri WHERE { ?uri ?prop <resource> }")
    return resources

Algorithm 4: The SPARQL query that retrieves all the instances linked to a given resource

    Input: a resource URI uri1, a resource URI uri2
    Output: an integer N (0 ≤ N ≤ |SkosSubjects(uri1)|)
    similarity ← 0
    foreach subject ∈ SkosSubjects(uri1) do
        if subject ∈ SkosSubjects(uri2) then
            similarity++
        end
    end
    return similarity

Algorithm 5: The function that measures the degree of similarity between two instances

Using the MRSS (most representative SKOS subject) algorithm defined above, it is therefore possible to couple each resource URI with its rdfs:label property value and its MRSS URI. The autocompletion process can then resolve each match with an SQL query such as the following, where <substring> is matched against the beginning of the label of an instance coupled with its MRSS, and the results are returned ordered by category so that they can be presented in groups:

    SELECT label, category FROM index WHERE label LIKE '<substring>%' ORDER BY category

3 Conclusion

This paper presented work that arose from the need to enable the users of a profiling system to uniquely and explicitly identify instances of an ontology that represent concepts of interest. In this sense the Linking Open Data URI-based approach reveals all its potential, on condition that a suitable user interface takes into account all the issues related to term disambiguation. Even if a lot of emphasis was given to the relational structure of the proposed indexing solution, alternative approaches, based on classical text retrieval techniques applied to SKOS, are under investigation, together with a concrete application of the currently adopted approach to the DBpedia dataspace.

References

[1] T. Berners-Lee. Linked data. http://www.w3.org/DesignIssues/LinkedData.html, July 2006. Accessed 12 May 2009.

[2] J. Hasselgren, E. Montnemery, P. Nugues, and M. Svensson. HMS: A predictive text entry method using bigrams. In Workshop on Language Modeling for Text Entry Methods, 10th Conference of the European Chapter of the Association of Computational Linguistics, pages 43–49, 2003.

[3] E. Hyvönen and E. Mäkelä. Semantic autocompletion. In R. Mizoguchi, Z. Shi, and F. Giunchiglia, editors, ASWC, volume 4185 of Lecture Notes in Computer Science, pages 739–751. Springer, 2006.

[4] J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. In International Semantic Web Conference, pages 30–43, 2006.

[5] R. Sinkkilä, E. Mäkelä, E. Hyvönen, and T. Kauppinen. Combining context navigation with semantic autocompletion to solve problems in concept selection. In K. Belhajjame, M. d'Aquin, P. Haase, and P. Missier, editors, SeMMA, volume 346 of CEUR Workshop Proceedings, pages 61–68. CEUR-WS.org, 2008.

[6] J. Wielemaker, M. Hildebrand, J. Ossenbruggen, and G. Schreiber. Thesaurus-based search in large heterogeneous collections. In ISWC '08: Proceedings of the 7th International Conference on The Semantic Web, pages 695–708, Berlin, Heidelberg, 2008. Springer-Verlag.
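
As an illustration of Algorithms 1 to 5 and of the SQL lookup over the relational snapshot, the following Python sketch runs the whole pipeline on a toy data space. It is not the implementation used in this work: the SPARQL queries are replaced by two small in-memory dictionaries filled with made-up data, the table name snapshot and all function names are assumptions introduced here, labels are simply derived from the shortened URIs, and SQLite stands in for whatever relational engine is actually adopted.

```python
import sqlite3

# Hypothetical toy data space; URIs are shortened, made-up stand-ins.
SUBJECTS = {  # resource -> its skos:subject values
    "Kevin_Spacey": ["American_actors", "American_film_producers",
                     "American_expatriates_in_the_UK"],
    "Tom_Hanks": ["American_actors", "American_film_producers"],
    "Steven_Spielberg": ["American_film_producers", "American_film_directors"],
    "Madonna": ["American_expatriates_in_the_UK", "American_singers"],
}
INCOMING = {  # resource -> resources linking to it through any property
    "Kevin_Spacey": ["a"],
    "Tom_Hanks": ["a", "b", "c"],
    "Steven_Spielberg": ["a", "b", "c", "d"],
    "Madonna": ["a", "b"],
}


def skos_subjects(r):
    """Algorithm 2, with a dictionary lookup instead of SPARQL."""
    return SUBJECTS.get(r, [])


def linked_resources(r):
    """Algorithm 4, with a dictionary lookup instead of SPARQL."""
    return INCOMING.get(r, [])


def most_linked_resource(subject):
    """Algorithm 3: the member of `subject` with the most incoming links."""
    members = [r for r, subs in SUBJECTS.items() if subject in subs]
    most_linked = members[0]
    linked_size = len(linked_resources(most_linked))
    for resource in members:
        actual = len(linked_resources(resource))
        if linked_size <= actual:
            linked_size, most_linked = actual, resource
    return most_linked


def resource_similarity(uri1, uri2):
    """Algorithm 5: number of SKOS subjects shared by two resources."""
    other = set(skos_subjects(uri2))
    return sum(1 for s in skos_subjects(uri1) if s in other)


def mrss(r):
    """Algorithm 1: most representative SKOS subject of r."""
    subjects = skos_subjects(r)
    similarity, best = 0, subjects[0]
    for subject in subjects:
        actual = resource_similarity(r, most_linked_resource(subject))
        if similarity <= actual:
            similarity, best = actual, subject
    return best


def label(uri):
    # Assumption: a label is the shortened URI with spaces for underscores.
    return uri.replace("_", " ")


def build_snapshot(conn):
    """Store one (URI, label, MRSS URI, MRSS label) tuple per resource."""
    conn.execute("CREATE TABLE snapshot "
                 "(uri TEXT, label TEXT, category_uri TEXT, category TEXT)")
    rows = [(r, label(r), mrss(r), label(mrss(r))) for r in SUBJECTS]
    conn.executemany("INSERT INTO snapshot VALUES (?, ?, ?, ?)", rows)


def autocomplete(conn, prefix):
    """Return (label, category) pairs whose label starts with `prefix`."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (prefix.replace("\\", "\\\\")
                     .replace("%", "\\%").replace("_", "\\_"))
    cur = conn.execute(
        "SELECT label, category FROM snapshot "
        "WHERE label LIKE ? ESCAPE '\\' ORDER BY category, label",
        (escaped + "%",))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_snapshot(conn)
print(mrss("Kevin_Spacey"))
print(autocomplete(conn, "Kev"))
```

In this toy data the most linked resource of each of Kevin Spacey's categories is a different resource, so the similarity scores do not tie and the category shared with the most similar of them is selected. When scores do tie, the `<=` comparison taken from Algorithm 1 makes the last subject examined win, which is worth keeping in mind when the order of the subjects is not fixed.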