VLDB Endowment

Displaying 1-40 of 251 results

The release of Hardware Transactional Memory (HTM) in commodity CPUs (Central Processing Units) has major implications on the design and implementation of main-memory databases, especially on the architecture of high performance lock-free indexing methods at the core of several of these systems. This paper studies the interplay of HTM and...

In this paper, the authors introduce Trill - a new query processor for analytics. Trill fulfills a combination of three requirements for a query processor to serve the diverse big data analytics space: query model: Trill is based on a tempo-relational model that enables it to handle streaming and relational...

First-generation streaming systems did not pay much attention to state management via ACID transactions. S-Store is a data management system that combines OLTP (OnLine Transaction Processing) transactions with stream processing. To create S-Store, the authors begin with H-Store, a main-memory transaction processing engine, and add primitives to support streaming. This...

Multi-tenancy and resource sharing are essential to make a Database-as-a-Service (DaaS) cost-effective. However, one major consequence of resource sharing is that the performance of one tenant's workload can be significantly affected by the resource demands of co-located tenants. The lack of performance isolation in a shared environment can make DaaS...

Newly-released web applications often succumb to a "Success Disaster," where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database system, while useful for agile development, only exacerbates the problem by hiding potentially expensive queries under...

To support "Database as a service" (DaaS) in the cloud, the database system is expected to provide similar functionalities as in centralized DBMS such as efficient processing of ad hoc queries. The system must therefore support DBMS-like indexes, possibly a few indexes for each table to provide fast location of...

The generation of RDF data has accelerated to the point where many data sets need to be partitioned across multiple machines in order to achieve reasonable performance when querying the data. Although tremendous progress has been made in the Semantic Web community for achieving high performance data management on a...

MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature...

Occasional corruption of stored data is an unfortunate byproduct of the complexity of modern systems. Hardware errors, software bugs, and mistakes by human administrators can corrupt important sources of data. The dominant practice to deal with data corruption today involves administrators writing ad hoc scripts that run data-integrity tests at...

Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in the context of databases, set intersections are used in the context of various forms of data mining, text analytics, and evaluation of conjunctive predicates. They...

With the availability of very large databases, an exploratory query can easily lead to a vast answer set, typically based on an answer's relevance (i.e., top-k, tf-idf) to the user query. Navigating through such an answer set requires huge effort and users give up after perusing through the first few...

The Web contains a significant volume of structured data in various domains, but a lot of data are dirty and erroneous, and they can be propagated through copying. While data integration techniques allow querying structured data on the Web, they take the union of the answers retrieved from different sources...

Increasingly complex databases need ever more sophisticated tools to help users understand their schemas and interact with the data. Existing tools fall short of either providing the "Big picture," or of presenting useful connectivity information. In this paper, the authors define summary graphs, a novel approach for summarizing schemas. Given...

In this paper, the authors present a technique for building a High-Availability (HA) DataBase Management System (DBMS). The proposed technique can be applied to any DBMS with little or no customization, and with reasonable performance overhead. Their approach is based on Remus, a commodity HA solution implemented in the virtualization...

Given a query object q, a Reverse Nearest Neighbor (RNN) query in a common certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper, the authors consider Probabilistic Reverse Nearest Neighbor (PRNN) queries, which return the...

Location-based queries are widely employed to retrieve useful information based on the user's geographical position. For example, a tourist that walks around a city may seek points of interest (e.g., restaurants) in her vicinity that satisfy her preferences (e.g., cheap and highly-rated). A top-k query defined by the user preferences...

Database systems serving cloud platforms must serve large numbers of applications (or tenants). In addition to managing tenants with small data footprints, different schemas, and variable load patterns, such multitenant data platforms must minimize their operating costs by efficient resource sharing. When deployed over a pay-per-use infrastructure, elastic scaling and...

The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than traditional relational databases to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc), but existing systems do not...

Index tuning, i.e., selecting the indexes appropriate for a workload, is a crucial problem in database system tuning. In this paper, the authors solve index tuning for large problem instances that are common in practice, e.g., thousands of queries in the workload, thousands of candidate indexes and several hard and...

OLTP (On-Line Transaction Processing) is an important business system sector in various traditional and emerging online services. Due to the increasing number of users, OLTP systems require high throughput for executing tens of thousands of transactions in a short time period. Encouraged by the recent success of GPGPU (General-Purpose computation...

The proliferation of imprecise data has motivated both researchers and the database industry to push statistical techniques into Relational DataBase Management Systems (RDBMSes). The authors study strategies to maintain model-based views for a popular statistical technique, classification, inside an RDBMS in the presence of updates (to the set of training...

In this paper, the authors present the design of a scalable, distributed stream processing system for RFID tracking and monitoring. Since RFID data lacks containment and location information that is key to query processing, they propose to combine location and containment inference with stream query processing in a single architecture,...

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this paper the authors present a novel non-parametric, self-tunable, approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law...

The authors present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables one to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based...

The current widespread use of location-based services and GPS technologies has revived interest in very fast and scalable shortest path queries. The authors introduce a new shortest path query type in which dynamic constraints may be placed on the allowable set of edges that can appear on a valid shortest...

Top-k spatial preference queries return a ranked set of the k best data objects based on the scores of feature objects in their spatial neighborhood. Despite the wide range of location-based applications that rely on spatial preference queries, existing algorithms incur non-negligible processing cost resulting in high response time. The...

In recent years, there has been considerable research on information extraction and constructing RDF knowledge bases. In general, the goal is to extract all relevant information from a corpus of documents, store it into an ontology, and answer future queries based only on the created knowledge base. Thus, the original...

There has been an increasing interest in deploying a storage system on Cloud to support applications that require massive scalability and high throughput in storage layer. Examples of such systems include Amazon's Dynamo and Google's BigTable. Cloud storage systems are designed to meet several essential requirements of data-intensive applications: manageability,...

The tremendous growth of the Internet has significantly reduced the cost of obtaining and sharing information about individuals, raising many concerns about user privacy. Spatial queries pose an additional threat to privacy because the location of a query may be sufficient to reveal sensitive information about the querier. In this...

An increasing amount of personal data is automatically gathered and stored on servers by administrations, hospitals, insurance companies, etc. Citizen themselves often count on internet companies to store their data and make them reliable and highly available through the internet. However, these benefits must be weighed against privacy risks incurred...

Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specification of suitable information extraction rules requires considerable skill and standard practice is to refine rules iteratively, with substantial effort. In this paper, the authors show that techniques developed in...

Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data. In this...

To enable smart environments and self-tuning data centers, the authors are developing the Aspen system for integrating physical sensor data, as well as stream data coming from machine logical state, and database or Web data from the Internet. A key component of this system is a query processor optimized for...

Many massive web and communication network applications create data which can be represented as a massive sequential stream of edges. For example, conversations in a telecommunication network or messages in a social network can be represented as a massive stream of edges. Such streams are typically very large, because of...

Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real world applications domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g.,...

Sensor nodes constitute inexpensive, disposable devices that are often scattered in harsh environments of interest so as to collect and communicate desired measurements of monitored quantities. Due to the commodity hardware used in the construction of sensor nodes, the readings of sensors are frequently tainted with outliers. Given the presence...

XMorph is a new, shape polymorphic, domain-specific XML query language. A query in a shape polymorphic language adapts to the shape of the input, freeing the user from having to know the input's shape and making the query applicable to a wide variety of differently shaped inputs. An XMorph query...

The analysis of many real-world event based applications has revealed that existing Complex Event Processing technology (CEP), while effective for efficient pattern matching on event stream, is limited in its capability of reacting in real-time to opportunities and risks detected or environmental changes. The authors are the first to tackle...

The authors demonstrate an efficient LINQ to SQL provider and its significant impact on the runtime performance of LINQ programs that process large data volumes. This alternative provider is based on Ferry, compilation technology that lets relational database systems participate in the evaluation of first-order functional programs over nested, ordered...

The workflow models have been essentially operation-centric for many years, ignoring almost completely the data aspects. Recently, a new paradigm of data-centric workflows, called business artifacts, has been introduced by Nigam and Caswell. The authors follow this approach and propose a model where artifacts are XML documents that evolve in...

There is growing interest in query language extensions for pattern matching over event streams and stored database sequences, due to the many important applications that such extensions make possible. The push for such extensions has led DBMS vendors and DSMS venture companies to propose Kleene-closure extensions of SQL standards, building...

The authors present Linkage Query Writer (LinQuer), a system for generating SQL queries for semantic link discovery over relational data. The LinQuer framework consists of LinQL, a language for specification of linkage requirements; a web interface and an API for translating LinQL queries to standard SQL queries; an interface that...

Commercial databases compete for market share, which is composed of not only net-new sales to those purchasing a database for the first time, but also competitive win-backs" and migrations. Database migration, or the act of moving both application code and its underlying database platform from one database to another, presents...

One of the key tasks of a database administrator is to optimize the set of materialized indices with respect to the current workload. To aid administrators in this challenging task, commercial DBMSs provide advisors that recommend a set of indices based on a sample workload. It is left for the...

Database systems have a large number of configuration parameters that control memory distribution, I/O optimization, costing of query plans, parallelism, many aspects of logging, recovery, and other behavior. Regular users and even expert database administrators struggle to tune these parameters for good performance. The wave of research on improving database...

The need to improve a suboptimal execution plan picked by the query optimizer for a repeatedly run SQL query arises routinely. Complex expressions, skewed or correlated data, and changing conditions can cause the optimizer to make mistakes. For example, the optimizer may pick a poor join order, overlook an important...

Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data. In this...

Column-oriented database systems (column-stores) have attracted a lot of attention in the past few years. Column-stores, in a nutshell, store each database table column separately, with attribute values belonging to the same column stored contiguously, compressed, and densely packed, as opposed to traditional database systems that store entire records (rows)...

OLTP (On-Line Transaction Processing) is an important business system sector in various traditional and emerging online services. Due to the increasing number of users, OLTP systems require high throughput for executing tens of thousands of transactions in a short time period. Encouraged by the recent success of GPGPU (General-Purpose computation...

The proliferation of imprecise data has motivated both researchers and the database industry to push statistical techniques into Relational DataBase Management Systems (RDBMSes). The authors study strategies to maintain model-based views for a popular statistical technique, classification, inside an RDBMS in the presence of updates (to the set of training...

In this paper, the authors present the design of a scalable, distributed stream processing system for RFID tracking and monitoring. Since RFID data lacks containment and location information that is key to query processing, they propose to combine location and containment inference with stream query processing in a single architecture,...

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this paper the authors present a novel non-parametric, self-tunable, approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law...

The authors present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables one to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based...

The authors describe an automatic database design tool that exploits correlations between attributes when recommending Materialized Views (MVs) and indexes. Although there is a substantial body of related work exploring how to select an appropriate set of MVs and indexes for a given workload, none of this work has explored...

Keyword search over entity databases (e.g., product, movie databases) is an important problem. Current techniques for keyword search on databases may often return incomplete and imprecise results. On the one hand, they either require that relevant entities contain all (or most) of the query keywords, or that relevant entities and...

The authors show how Recursive Markov Chains (RMCs) and their restrictions can define probabilistic distributions over XML documents, and study tractability of querying over such models. They show that RMCs subsume several existing probabilistic XML models. In contrast to the latter, RMC models: Capture probabilistic versions of XML schema languages...

A very important class of spatial queries consists of Nearest-Neighbor (NN) query and its variations. Many studies in the past decade utilize R-trees as their underlying index structures to address NN queries efficiently. The general approach is to use R-tree in two phases. First, R-tree's hierarchical structure is used to...

Given a set of users, their friend relationships, and a distance threshold per friend pair, the proximity detection problem is to find each pair of friends such that the Euclidean distance between them is within the given threshold. This problem plays an essential role in friend-locator applications and massively multiplayer...

Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, the authors develop a novel technique to extract concepts from...

The authors propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. They start by populating a seed database with records extracted from a few initial sites. The authors then identify values within the pages of each new site that...

Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specification of suitable information extraction rules requires considerable skill and standard practice is to refine rules iteratively, with substantial effort. In this paper, the authors show that techniques developed in...

Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank uncertain datasets in recent years. In this paper, the authors address the problem of...

Complex event detection is an advanced form of data stream processing where the stream(s) are scrutinized to identify given event patterns. The challenge for many Complex Event Processing (CEP) systems is to be able to evaluate event patterns on high-volume data streams while adhering to real-time constraints. To solve this...

Query co-processing on Graphics Processors (GPUs) has become an effective means to improve the performance of main memory databases. However, this co-processing requires the data transfer between the main memory and the GPU memory via a low-bandwidth PCI-E bus. The overhead of such data transfer becomes an important factor, even...

Large flash disks, or Solid State Drives (SSDs), have become an attractive alternative to magnetic hard disks, due to their high random read performance, low energy consumption and other features. However, writes, especially small random writes, on flash disks are inherently much slower than reads because of the erase-before-write mechanism....

Predicate selectivity estimates are subject to considerable run-time variation relative to their compile-time estimates, often leading to poor plan choices that cause inflated response times. The authors present here a parametrized family of plan generation and selection algorithms that replace, whenever feasible, the optimizer's solely cost conscious choice with an...

The authors propose the k-representative regret minimization query (k-regret) as an operation to support multi-criteria decision making. Like top-k, the k-regret query assumes that users have some utility or scoring functions; however, it never asks the users to provide such functions. Like skyline, it filters out a set of interesting...

Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real world applications domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g.,...

Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real world applications domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g.,...

Recently, there has been increasing interest in extending relational query processing to include data obtained from unstructured sources. A common approach is to use stand-alone Information Extraction (IE) techniques to identify and label entities within blocks of text; the resulting entities are then imported into a standard database and processed...

A RkNN query returns all objects whose nearest k neighbors contain the query object. In this paper, the authors consider RkNN query processing in the case where the distances between attribute values are not necessarily metric. Dissimilarities between objects could then be a monotonic aggregate of dissimilarities between their values,...

Cloud computing is an extremely successful paradigm of service oriented computing and has revolutionized the way computing infrastructure is abstracted and used. Three most popular cloud paradigms include: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The concept however can also be...

The authors propose a modeling of the problem of privacy-compliant data publishing that captures confidentiality constraints on one side and visibility requirements on the other side. Confidentiality constraints express the fact that some attributes, or associations among them, are sensitive and cannot be released. Visibility requirements express requests for views...

Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying...

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. The authors propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords...

Tables are a universal idiom to present relational data. Billions of tables on Web pages express entity references, at-tributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured, free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from "Organic" Web...

Replication is a key mechanism to achieve scalability and fault-tolerance in databases. Its importance has recently been further increased because of the role it plays in achieving elasticity at the database layer. In database replication, the biggest challenge lies in the trade-o between performance and consistency. A decade ago, performance...

The representation of multidimensional points and objects, and the development of appropriate indexing methods that enable them to be retrieved efficiently is a well-studied subject. Most of these methods were designed for use in application domains where the data usually has a spatial component which has a relatively low dimension....

Because of high volumes and unpredictable arrival rates, stream processing systems are not always able to keep up with input data - resulting in buffer overflow and uncontrolled loss of data. To produce eventually complete results, load spilling, which pushes some fractions of data to disks temporarily, is commonly employed...

Submit Your Content

Get your content listed in our directory for free!

Our directory is the largest library of vendor-supplied technical content on the web. It's also the first place IT decision makers turn to when researching technology solutions. Our members are already finding your competitors' papers here - shouldn't they find yours too?