Abstract

In this article, an indexing scheme that includes the named entity category for each indexed term is proposed. Based on this scheme, two methods are proposed: the first infers the semantics of an XML element from its data content, expressed as the confidence value of the element, and the second computes the proximity scores of the query terms. The confidence value of an element is obtained from the probability of a named entity category in the data content of the underlying XML element. The proximity score measures the proximity and ordering of the query terms within an XML element. The article then shows how a ranking function uses the confidence value of an XML element and the proximity score to mitigate the impact of high-frequency terms and compute the relevance between a keyword query and an XML fragment. Finally, a keyword search system is introduced, and experiments show that the proposed system outperforms existing approaches in search quality and achieves higher efficiency.

1. Introduction

In recent years, Extensible Mark-up Language (XML) has become one of the most widely used mark-up languages for information representation and exchange over the Internet. Many documents are now represented and stored as XML documents on the web. Thus, the need for effective and user-friendly search systems for XML documents cannot be overemphasised. There are two fundamental methods for searching XML documents: structured queries and keyword search. Structured queries are composed using query languages such as XQuery and XPath. Although these queries are effective, they generally return an unordered set of results, that is, the results are not ranked (Cohen et al., 2003; Kim et al., 2009). Keyword queries are generally more user-friendly, since users need not learn a query language or remember the schema of the XML data in order to compose them. However, keyword queries are inherently ambiguous, and users cannot clearly specify their intentions. As a result, keyword search engines inevitably generate a large number of results, and these systems need to return relevant results early in the result list. This implies that keyword search systems with relevance-oriented ranking functions are needed.

Several keyword search systems for XML retrieval with different result ranking capabilities have been proposed, among them query structuring systems (Hummel et al., 2011; Li et al., 2010; Petkova et al., 2009; Li et al., 2009). A query structuring system converts a user keyword query into a set of structured queries and selects the best structured query or queries that match the given input query. However, existing query structuring systems either do not consider relevance ranking or use traditional text IR relevance ranking techniques that favour XML fragments with higher term frequencies. For example, Hummel et al. (2011) has no ranking function, while Li et al. (2009) has a ranking scheme that computes the relevance between a keyword query and an XML fragment based on the tightness of the XML elements and a tf-idf score, which favours elements with higher term frequencies. The scheme does not take the semantics of XML documents into account. It can therefore return misleading results, because it cannot recognise irrelevant results that have high term frequencies, which limits its performance.

To address these problems, first, a ranking function called NEBTOP is proposed. Specifically, the concept of a confidence value is introduced. The confidence value represents the weight of an XML node with respect to a query keyword and is computed from the data value of the node in question. To compute it, each keyword in the data value of a node is converted into its corresponding named entity category (NEC). The NEC of a keyword is Person, Organization, or Others. The confidence value of a leaf node with respect to an NEC is the probability of that NEC in the node. Then, a function is proposed that computes query term proximity scores and rewards a node more highly if it contains the query terms in the order in which they appear in the query. The confidence value and the term proximity score are combined and used by NEBTOP to normalise the impact of higher term frequencies in the existing ranking scheme. The existing approaches lack this boost score and are therefore unable to recognise irrelevant results that have high term frequencies. Second, an XML Keyword Query Structuring System (XKQSS) is proposed, which uses NEBTOP as its ranking function in order to improve retrieval performance.
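As an illustration of the confidence value described above, the following is a minimal sketch that assumes the "probability of an NEC in a node" can be read as the fraction of the node's terms that fall into that category. The lookup table `TOY_NEC` and the function names `nec_of` and `confidence_values` are hypothetical stand-ins; the preview does not specify which named entity recogniser is used or the exact estimator.

```python
from collections import Counter

# Hypothetical lookup standing in for a real named entity recogniser.
# The text names three categories: Person, Organization and Others.
TOY_NEC = {
    "alice": "Person",
    "bob": "Person",
    "acme": "Organization",
}


def nec_of(term):
    """Map a term to its named entity category (toy lookup, an assumption)."""
    return TOY_NEC.get(term.lower(), "Others")


def confidence_values(node_text):
    """For each category, return the fraction of the node's terms in it."""
    terms = node_text.split()
    if not terms:
        return {}
    counts = Counter(nec_of(t) for t in terms)
    return {nec: n / len(terms) for nec, n in counts.items()}


# A node holding only person names gets full confidence for Person,
# while a mixed-content node spreads its mass across categories.
author = confidence_values("Alice Bob")
title = confidence_values("Alice joins Acme today")
```

Under this reading, a query keyword recognised as a Person would weight the `author` node more heavily than the `title` node, even if the `title` node happened to repeat the name more often.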

The contributions of this paper can be summarised as follows:

• An index scheme is proposed that stores the named entity category of each indexed term, in addition to the usual term frequencies and term positions.

• A field-based ranking function (NEBTOP) is proposed, which allows the term proximity score and the nodes' confidence values to be incorporated into the BM25F scoring formula. Specifically, the concept of the confidence value of a node is first introduced, which is the probability of a named entity category in the data content of the node. Then, the classical BM25TP is extended and a new term proximity score for each query term is proposed. This score considers how the query terms appear in the underlying node.

• NEBTOP is included in the XKQSS search system, and an experiment is conducted to compare the effectiveness of the enhanced XKQSS system with some state-of-the-art systems.
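The idea of a proximity score that is sensitive to term order can be sketched as follows. This is not the NEBTOP formula, which the preview does not give. It is a toy score under stated assumptions: each adjacent pair of query terms contributes an inverse-squared-distance weight (a common choice in proximity scoring), and the hypothetical parameter `order_bonus` multiplies the weight when the pair appears in query order.

```python
def ordered_proximity(query_terms, node_terms, order_bonus=2.0):
    """Toy proximity score: closer pairs score higher, in-order pairs get a bonus."""
    # Record every position at which each term occurs in the node.
    positions = {}
    for i, term in enumerate(node_terms):
        positions.setdefault(term, []).append(i)

    score = 0.0
    # Look at each adjacent pair of query terms, in query order.
    for first, second in zip(query_terms, query_terms[1:]):
        best = 0.0
        for p in positions.get(first, []):
            for q in positions.get(second, []):
                if p == q:
                    continue  # same occurrence (repeated query term)
                weight = 1.0 / (q - p) ** 2
                if q > p:
                    weight *= order_bonus  # pair appears in query order
                best = max(best, weight)
        score += best
    return score


query = ["xml", "keyword", "search"]
in_order = ordered_proximity(query, "xml keyword search".split())
reversed_order = ordered_proximity(query, "search keyword xml".split())
```

Both nodes contain the same terms at the same distances, so a frequency-only score would not separate them; here the node that preserves the query order scores higher.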