For over twenty-five years, many works have focused on integrating the time
dimension into databases (DBs). However, the SQL3 standard still does not
allow temporal DBs to be easily defined, manipulated, and queried. In this
paper, we study how querying and manipulating temporal facts in SQL3 can be
simplified using a model that integrates time natively. To this end, we
propose new keywords and syntax to define temporal versions of many
relational operators and functions used in SQL. It then becomes possible to
perform various queries and updates appropriate to temporal facts. We
illustrate these proposals with many examples drawn from a real application.
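
To make this concrete, here is a minimal sketch, in Python rather than in the
paper's proposed SQL3 syntax, of the valid-time overlap test that such
temporal operators must evaluate internally; the function name and the
half-open period representation are ours, purely for illustration.

    from datetime import date

    def overlaps(start1, end1, start2, end2):
        # Two half-open valid-time periods [start, end) overlap iff
        # each one starts before the other one ends.
        return start1 < end2 and start2 < end1

    # Two versions of an employee's salary history:
    print(overlaps(date(2020, 1, 1), date(2021, 1, 1),
                   date(2020, 6, 1), date(2022, 1, 1)))  # True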

In this work we establish and point out connections between the notion of
query-answer causality in databases and database repairs, model-based diagnosis
in its consistency-based and abductive versions, and database updates through
views. The mutual relationships among these areas of data management and
knowledge representation shed light on each of them and help to share notions
and results they have in common. In one way or another, these are all
approaches to uncertainty management, which becomes even more relevant in the
context of big data that have to be made sense of.
Comment: On-line Proc. First International Workshop on Big Uncertain Data
(BUDA 2014), co-located with ACM PODS 2014.

Document databases are becoming popular, but how to express complex document
queries that extract useful information from documents remains an important
topic of study. In this paper, we describe the design issues of a
pattern-based document database query language named JPQ. JPQ uses various
expressive patterns to extract and construct document fragments following a
JSON-like document data model. It adopts tree-like extraction patterns with a
coherent pattern composition mechanism to extract data elements from
hierarchically structured documents while maintaining the logical
relationships among the elements. Based on these relationships, JPQ deploys a
deductive mechanism to declaratively specify data transformation requests,
and it also supports data filtering on hierarchical data structures. We use
various examples to show the features of the language and to demonstrate its
expressiveness and declarativeness in expressing complex document queries.
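
JPQ's concrete pattern syntax is defined in the paper; as a rough,
hypothetical illustration of what a tree-like extraction pattern computes,
the following Python sketch extracts (title, author) pairs from a JSON-like
document while preserving the hierarchical relationship between each title
and its authors.

    doc = {"store": {"books": [
        {"title": "A", "authors": [{"name": "X"}, {"name": "Y"}]},
        {"title": "B", "authors": [{"name": "Z"}]},
    ]}}

    def extract(doc):
        # Walk the hierarchy top-down, pairing each title with the
        # authors nested beneath it.
        for book in doc["store"]["books"]:
            for author in book["authors"]:
                yield (book["title"], author["name"])

    print(list(extract(doc)))  # [('A', 'X'), ('A', 'Y'), ('B', 'Z')]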

In this paper, we emphasize the need for data cleansing when clustering
large-scale transaction databases and propose a new data cleansing method
that improves clustering quality and performance. We evaluate the method
through a series of experiments, in which clustering quality and performance
improved significantly, by up to 165% and 330%, respectively.

A boolean expression is in read-once form if each of its variables appears
exactly once. When the variables denote independent events in a probability
space, the probability of the event denoted by the whole expression in
read-once form can be computed in polynomial time (whereas the general problem
for arbitrary expressions is #P-complete). Known approaches to checking the
read-once property seem to require putting these expressions in disjunctive
normal form. In this paper, we tell a better story for a large subclass of
boolean event expressions: those that are generated by conjunctive queries
without self-joins and on tuple-independent probabilistic databases. We first
show that given a tuple-independent representation and the provenance graph of
an SPJ query plan without self-joins, we can, without using the DNF of a result
event expression, efficiently compute its co-occurrence graph. From this, the
read-once form can already, if it exists, be computed efficiently using
existing techniques. Our second and key contribution is a complete,
efficient, and simple-to-implement algorithm for computing read-once forms
(whenever they exist) directly, using a new concept, that of the co-table
graph, which can be significantly smaller than the co-occurrence graph.
Comment: Accepted in ICDT 2011.
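
For background, once an expression is in read-once form and its variables
denote independent events, its probability can be evaluated bottom-up in
linear time; the sketch below (our own illustration, not the paper's
algorithm) shows why.

    # P(a AND b) = P(a)P(b); P(a OR b) = 1 - (1-P(a))(1-P(b)),
    # valid because in a read-once form no variable is shared.
    def prob(expr, p):
        op, *args = expr
        if op == "var":
            return p[args[0]]
        if op == "and":
            result = 1.0
            for a in args:
                result *= prob(a, p)
            return result
        if op == "or":
            result = 1.0
            for a in args:
                result *= 1.0 - prob(a, p)
            return 1.0 - result
        raise ValueError(op)

    # P((x AND y) OR z) for independent tuples x, y, z:
    e = ("or", ("and", ("var", "x"), ("var", "y")), ("var", "z"))
    print(prob(e, {"x": 0.5, "y": 0.4, "z": 0.1}))  # 0.28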

Among many existing distance measures for time series data, Dynamic Time
Warping (DTW) distance has been recognized as one of the most accurate and
suitable distance measures due to its flexibility in sequence alignment.
However, DTW distance calculation is computationally intensive. Especially in
very large time series databases, a sequential scan through the entire
database is impractical, and even random access that exploits index
structures suffers, since the high dimensionality of time series data incurs
extremely high I/O cost. More specifically, a sequential structure incurs a
high CPU cost but a low I/O cost, while an index structure incurs a low CPU
cost but a high I/O cost. In this work, we therefore propose a novel indexed
sequential structure called TWIST (Time Warping in Indexed Sequential
sTructure) which benefits from both sequential access and indexing. When a
query sequence is issued, TWIST calculates lower-bounding distances between
groups of candidate sequences and the query sequence, and then determines the
data access order in advance, thereby avoiding a great number of both
sequential and random accesses. Impressively, our indexed sequential
structure achieves a speedup in query processing of a few orders of
magnitude. In addition...
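
For reference, the quadratic dynamic program that makes DTW expensive is the
textbook formulation below (with squared point distances); TWIST's
lower-bounding and access-ordering machinery, the paper's contribution, is
built on top of computations like this one.

    def dtw(a, b):
        # d[i][j]: cost of the best warping path aligning a[:i] with b[:j].
        n, m = len(a), len(b)
        inf = float("inf")
        d = [[inf] * (m + 1) for _ in range(n + 1)]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (a[i - 1] - b[j - 1]) ** 2
                d[i][j] = cost + min(d[i - 1][j],      # insertion
                                     d[i][j - 1],      # deletion
                                     d[i - 1][j - 1])  # match
        return d[n][m]

    print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0: warping absorbs the repeat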

The digital world is growing very fast and becoming more complex in volume
(terabytes to petabytes), variety (structured, unstructured, and hybrid), and
velocity (high speed of growth). This is referred to as Big Data, a global
phenomenon. It is typically considered to be a data collection that has grown
so large that it cannot be effectively managed or exploited using
conventional data management tools: e.g., classic relational database
management systems (RDBMS) or conventional search engines. To handle this
problem, traditional RDBMS are complemented by a rich set of specifically
designed alternative DBMS, such as NoSQL, NewSQL, and search-based systems.
The motivation of this paper is to provide a classification, characterization,
and evaluation of NoSQL databases for Big Data analytics. The report is
intended to help users, and especially organizations, obtain an independent
understanding of the strengths and weaknesses of various NoSQL database
approaches to supporting applications that process huge volumes of data.

To extract physics results from the recorded data, the LHC experiments are
using Grid computing infrastructure. The event data processing on the Grid
requires scalable access to non-event data (detector conditions, calibrations,
etc.) stored in relational databases. The database-resident data are critical
for the event data reconstruction processing steps and often required for
physics analysis. This paper reviews the LHC experience with database
technologies for Grid computing. Topics include: database integration with
the Grid computing models of the LHC experiments; the choice of database
technologies; examples of database interfaces; distributed database
applications (data complexity, update frequency, data volumes, and access
patterns); and the scalability of database access in the Grid computing
environment of the LHC experiments. The review describes areas in which
substantial progress has been made and remaining open issues.
Comment: Invited talk presented at the IV International Conference on
"Distributed computing and Grid-technologies in science and education"
(Grid2010), JINR, Dubna, Russia, June 28 - July 3, 2010.

Causality has recently been introduced in databases to model, characterize,
and possibly compute causes for query results (answers). Connections between
query causality, consistency-based diagnosis, and database repairs (w.r.t.
integrity constraint violations) have been established in the literature. In
this work we establish connections between query causality, abductive
diagnosis, and the view-update problem. The unveiled relationships allow us
to obtain new complexity results for query causality (the main focus of our
work) and also for the two other areas.
Comment: To appear in Proc. UAI Causal Inference Workshop, 2015.

Many studies have sought an efficient solution for subgraph similarity search
over certain (deterministic) graphs, due to its wide application in many
fields, including bioinformatics, social network analysis, and Resource
Description Framework (RDF) data management. All these works assume that the
underlying data are certain. In reality, however, graphs are often noisy and
uncertain due to various factors, such as errors in data extraction,
inconsistencies in data integration, and privacy-preserving purposes.
Therefore, in this paper, we study subgraph similarity search on large
probabilistic graph databases. Unlike previous works, which assume that edges
in an uncertain graph are independent of each other, we study uncertain
graphs where edge occurrences are correlated. We formally prove that subgraph
similarity search over probabilistic graphs is #P-complete; thus, we employ a
filter-and-verify framework to speed up the search. In the filtering phase,
we develop tight lower and upper bounds on the subgraph similarity
probability based on a probabilistic matrix index, PMI. PMI is composed of
discriminative subgraph features associated with tight lower and upper bounds
on the subgraph isomorphism probability. Based on PMI...

We present a survey of existing approaches to relational division in
rank-aware databases, discuss issues of the present approaches, and outline
generalizations of several types of classic division-like operations. We work
in a model which generalizes the Codd model of data by considering tuples in
relations annotated by ranks, indicating degrees to which tuples in relations
match queries. The approach utilizes complete residuated lattices as the basic
structures of degrees. We argue that, unlike in the classic model, relational
divisions are fundamental operations which cannot, in general, be expressed
by means of other operations. In addition, we compare the existing and proposed
operations and identify those which are faithful counterparts of universally
quantified queries formulated in relational calculi. We introduce Pseudo Tuple
Calculus in the ranked model which is further used to show mutual definability
of the various forms of divisions presented in the paper.
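
For intuition, a natural graded counterpart of Codd's division in this
setting (our notation; the paper compares several variants) assigns to each
tuple t the degree

    (r \div s)(t) = \bigwedge_{u} ( s(u) \rightarrow r(tu) ),

where \bigwedge is the infimum and \rightarrow the residuum of the complete
residuated lattice: t belongs to r \div s to the degree to which, for every
u, membership of u in s implies membership of the joined tuple tu in r. With
degrees restricted to {0, 1}, this collapses to the classic division.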

World-set algebra is a variable-free query language for uncertain databases.
It constitutes the core of the query language implemented in MayBMS, an
uncertain database system. This paper shows that world-set algebra captures
exactly second-order logic over finite structures, or equivalently, the
polynomial hierarchy. The proofs also imply that world-set algebra is closed
under composition, a previously open problem.

Different ways of entering data into databases result in duplicate records
that increase database size, a fact that cannot easily be ignored. Several
methods have been used to address this problem. In this paper, we try to
increase the accuracy of duplicate detection by using cluster similarity
instead of direct similarity of fields: clustering is performed on the fields
of the database, and the similarity degree of records is derived from the
resulting field clusters. By using the information already present in the
database, this method obtains a more meaningful similarity for records with
deficient information; overall, the cluster-similarity method improves
results by 24% compared with previous methods.
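
As a simplified illustration of the idea (our own sketch, not the paper's
exact procedure): field values are clustered first, and two records are then
compared by whether their field values fall into the same cluster rather than
by exact equality.

    def record_similarity(rec1, rec2, cluster_of):
        # Fraction of fields whose values land in the same cluster.
        same = sum(cluster_of[f][rec1[f]] == cluster_of[f][rec2[f]]
                   for f in rec1)
        return same / len(rec1)

    clusters = {"city": {"NYC": 0, "New York": 0, "Boston": 1}}
    r1, r2 = {"city": "NYC"}, {"city": "New York"}
    print(record_similarity(r1, r2, clusters))  # 1.0 despite different strings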

Within the research area of deductive databases three different database
tasks have been deeply investigated: query evaluation, update propagation and
view updating. Over the last thirty years various inference mechanisms have
been proposed for realizing these main functionalities of a rule-based system.
However, these inference mechanisms have rarely been used in commercial DB
systems until now. One important reason for this is the lack of a uniform
approach well-suited for implementation in an SQL-based system. In this
paper, we present such a uniform approach in the form of a new version of the
soft consequence operator. Additionally, we present improved
transformation-based approaches to query optimization, update propagation,
and view updating, all of which use this operator as the underlying
evaluation mechanism.
Comment: To appear in the Proceedings of the 19th International Conference on
Applications of Declarative Programming and Knowledge Management (INAP 2011).
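
The classic evaluation mechanism that the soft consequence operator refines
is the bottom-up fixpoint of the immediate consequence operator T_P; a
minimal sketch for ground (propositional) rules, in Python for brevity:

    def fixpoint(facts, rules):
        # rules: list of (head, [body atoms]); apply until nothing changes.
        db = set(facts)
        changed = True
        while changed:
            changed = False
            for head, body in rules:
                if all(b in db for b in body) and head not in db:
                    db.add(head)
                    changed = True
        return db

    rules = [("anc(a,c)", ["par(a,b)", "par(b,c)"])]
    print(fixpoint({"par(a,b)", "par(b,c)"}, rules))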

Several proposals have been made to provide concurrency control adapted to
object-oriented databases. However, most of these proposals miss the fact
that considering solely read and write access modes on instances may lead to
less parallelism than in relational databases! This paper copes with that
issue, and the advantages are numerous: (1) commutativity of methods is
determined a priori and automatically by the compiler, without measurable
overhead; (2) run-time checking of commutativity is as efficient as for
compatibility; (3) inverse operations need not be specified for recovery;
(4) this scheme does not preclude more sophisticated approaches; and, last
but not least, (5) relational and object-oriented concurrency control schemes
with read and write access modes are subsumed by this proposal.
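
The run-time side of such a scheme amounts to a compatibility check over a
commutativity relation; in the paper this relation is derived automatically
by the compiler, whereas the table below is a hand-written, hypothetical
example.

    # A method may run on an object if it commutes with every method
    # currently active on that object.
    commutes = {("deposit", "deposit"): True,
                ("deposit", "withdraw"): False,
                ("withdraw", "deposit"): False,
                ("getBalance", "getBalance"): True}

    def may_run(method, active):
        return all(commutes.get((method, m), False) for m in active)

    print(may_run("deposit", {"deposit"}))   # True: deposits commute
    print(may_run("withdraw", {"deposit"}))  # False: must wait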

We model the algorithmic task of geometric elimination (e.g., quantifier
elimination in the elementary field theories of real and complex numbers) by
means of certain constraint database queries, called geometric queries. As a
particular case of such a geometric elimination task, we consider sample point
queries. We show exponential lower complexity bounds for evaluating geometric
queries in the general and in the particular case of sample point queries.
Although this paper is of a theoretical nature, its aim is to explore the
possibilities and complexity limits of computer-implemented query evaluation
algorithms for Constraint Databases, based on the principles of the most
advanced geometric elimination procedures and their implementations, such as
the software package "Kronecker".
Comment: This paper represents work in progress and is not aimed at
publication in its present form.

Database management systems (DBMSs) carefully optimize complex multi-join
queries to avoid expensive disk I/O. As servers today feature tens or hundreds
of gigabytes of RAM, a significant fraction of many analytic databases becomes
memory-resident. Even after careful tuning for an in-memory environment, a
linear disk I/O model such as the one implemented in PostgreSQL may choose
multi-join query plans whose response time is up to 2X slower than that of
the optimal plan over memory-resident data. This paper introduces a memory
I/O cost
model to identify good evaluation strategies for complex query plans with
multiple hash-based equi-joins over memory-resident data. The proposed cost
model is carefully validated for accuracy using three different systems,
including an Amazon EC2 instance, to control for hardware-specific differences.
Prior work in parallel query evaluation has advocated right-deep and bushy
trees for multi-join queries due to their greater parallelization and
pipelining potential. A surprising finding is that the conventional wisdom from
shared-nothing disk-based systems does not directly apply to the modern
shared-everything memory hierarchy. As corroborated by our model, the
performance gap between the optimal left-deep and right-deep query plan can
grow to about 10X as the number of joins in the query increases.
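
A toy memory-centric cost function (the weights and structure are ours, not
the paper's calibrated model) already exposes the effect: in a pipeline of
hash equi-joins, every probe tuple visits every hash table, so costs must be
charged in memory accesses rather than disk pages.

    BUILD_COST, PROBE_COST = 2.0, 1.0  # relative memory accesses per tuple

    def pipeline_cost(build_sizes, probe_stream_size):
        # Build one hash table per join, then pipeline the probe stream
        # through all of them (as in a right-deep plan).
        cost = sum(BUILD_COST * s for s in build_sizes)
        cost += PROBE_COST * probe_stream_size * len(build_sizes)
        return cost

    # A 1M-tuple fact table probed through three dimension hash tables:
    print(pipeline_cost([10_000, 20_000, 5_000], 1_000_000))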

Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches in estimating the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
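
For reference, when the prior information is expressed as expected-value
constraints E[f_i(D)] = c_i, the MaxEnt distribution has the standard
exponential-family form

    p(D) = \frac{1}{Z(\lambda)} \exp\Bigl( \sum_i \lambda_i f_i(D) \Bigr),
    \qquad Z(\lambda) = \sum_D \exp\Bigl( \sum_i \lambda_i f_i(D) \Bigr),

with the Lagrange multipliers \lambda_i fitted so that the constraints hold;
this closed form is what makes such models explicitly representable.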

A new type of log, the command log, is being employed to replace the
traditional data log (e.g., the ARIES log) in in-memory databases. Instead of
recording how tuples are updated, a command log only tracks the transactions
being executed, thereby effectively reducing the size of the log and
improving performance. Command logging, on the other hand, increases the cost
of recovery, because all the transactions in the log after the last
checkpoint must be completely redone in case of a failure. In this paper, we
first extend the command logging technique to a distributed environment,
where all the nodes can perform recovery in parallel. We then propose an
adaptive logging approach that combines data logging and command logging. The
ratio of data logging to command logging becomes an optimization parameter
that balances transaction-processing performance against recovery performance
to suit different OLTP applications. Our experimental study compares the
performance of the proposed adaptive logging, ARIES-style data logging, and
command logging on top of H-Store. The results show that adaptive logging can
achieve a 10x speedup in recovery and a transaction throughput comparable to
that of command logging.
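
A stylized version of the per-transaction trade-off (the decision rule below
is our simplification, not the paper's optimizer) weighs the smaller log
entry of command logging against its expected re-execution cost at recovery
time.

    def choose_log(cmd_size, data_size, reexec_cost, p_failure,
                   write_cost=1.0):
        # Command logging: tiny log entry, but the transaction must be
        # completely redone if a failure occurs before the next checkpoint.
        command = write_cost * cmd_size + p_failure * reexec_cost
        # Data logging: larger log entry, but replay at recovery is cheap.
        data = write_cost * data_size
        return "command" if command <= data else "data"

    # A long transaction: large data log, expensive re-execution.
    print(choose_log(cmd_size=1, data_size=50, reexec_cost=200,
                     p_failure=0.01))  # 'command'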

One of the current challenges in the use of cloud services is the design of
specialized data management systems. This is especially important for hybrid
systems in which data are located in both public and private clouds. Query
monitoring, scheduling, and processing functions must be properly
implemented, as they are an integral part of the system. To provide these
functions, the use of object-relational mapping (ORM) is proposed. This
article presents an approach to designing databases for information systems
hosted in a hybrid cloud infrastructure and provides an example of the
development of an ORM library.
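
As a minimal illustration of the ORM idea (class, table, and column names are
hypothetical), a class is mapped to a table and its attribute values to a
row, so application code stays the same whether the table lives in a public
or a private cloud.

    import sqlite3

    class Sensor:
        table = "sensors"

        def __init__(self, sensor_id, location):
            self.sensor_id, self.location = sensor_id, location

        def save(self, conn):
            # Persist the object as one row of its mapped table.
            conn.execute("INSERT INTO sensors VALUES (?, ?)",
                         (self.sensor_id, self.location))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensors (sensor_id INTEGER, location TEXT)")
    Sensor(1, "private-cloud").save(conn)
    print(conn.execute("SELECT * FROM sensors").fetchall())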