Subscribe to the latest research through IGI Global's new InfoSci-OnDemand Plus

InfoSci®-OnDemand Plus, a subscription-based service, provides researchers the ability to access full-text content from over 100,000 peer-reviewed book chapters and 26,000+ scholarly journal articles covering 11 core subjects. Users can select articles or chapters that meet their interests and gain access to the full content permanently in their personal online InfoSci-OnDemand Plus library.

When ordering directly through IGI Global's Online Bookstore, receive the complimentary e-books for the first, second, and third editions with the purchase of the Encyclopedia of Information Science and Technology, Fourth Edition e-book.

InfoSci®-Journals Annual Subscription Price for New Customers: As Low As US$ 5,100

This collection of over 175 e-journals offers unlimited access to highly-cited, forward-thinking content in full-text PDF and HTML with no DRM. There are no platform or maintenance fees and a guarantee of no more than 5% increase annually.

Abstract

Since this research consists of two main parts, the XML compressor and the vague query processor, this chapter discusses the main XML compression techniques in its first part. It will highlight the advantages and disadvantages of these techniques and discusses the differences between them. The second part of this chapter will focus on the vague query processors used to retrieve information from XML documents.

1.1 Xml Compression Techniques

Recently, large numbers of XML compression techniques have been proposed. Each of which has different characteristics. This section discusses the differences between these compressors and their main features.

XML compressors can be classified into two classes either to be XML-blind or XML-conscious compressors. XML-blind or general purpose compressors deal with the XML document as a traditional text document ignoring its structure and apply the general purpose text compression techniques to compress them. These techniques can be classified into two main classes (Salomon, 2007), either to be statistical or dictionary based compressors (Augeri et al., 2007; Augeri, 2008). The statistical or the arithmetic compressors represent each string of characters using a fixed number of bits per character. PPM, CACM3, and PAQ are examples of this kind of compressors (Cleary and Witten, 1984; Moffat., 1990; Alistair et al., 1998). On the other hand, dictionary compression techniques substitute each string in the input by its reference in a dictionary maintained by the encoder. WinZip, GZIP, and BZIP2 are examples of this compression class (WinZip, 1990; GZip, 1992; BZip2, 1996).

On the other hand, XML-conscious compressors try to utilize the structural behaviour of XML documents in order to achieve better compression ratio and less time in comparative with the XML-blind type. Table 1 sets the main differences between the two aforementioned compressors types.

Table 1.

The main differences between XML-conscious and XML-blind compressors

XML-conscious compressors

XML-blind compressors

Information about XML documents is usually available in schema which can be optimized by XML-conscious compressors to get better compression.

Cannot take advantage of the schema to get useful information about the file.

They utilize the structure of XML document and the type of the data inside.

They do not take in consideration the entire file structure or data types.

Some of them abridge the original XML tree in a summary or compact tree for better ratio.

They cannot exploit redundancies in the XML tree structure.

Most of them are powerful in compressing small or large files.

They do not efficiently compress small files that can be used in transactions for e-business. (Hung, 2009)

The main theory of data compression, which described in (Shannon, 1948), is the formulation of the entropy rate (H) which indicates the limit to lossless data compression. The value of (H) depends on the probability of each symbol in the information source. The most popular entropy value is (Shannon, 1948):