Abstract: Users are usually interested in querying data over a relatively small subset of the entire attribute set at a time. A potential solution is to use lower-dimensional indexes that accurately represent the user access patterns. If the query pattern changes, the query response under a physical database design that was developed from a static snapshot of the query workload may degrade significantly. To address these issues, we introduce a parameterizable technique to recommend indexes based on index types that are frequently used for high-dimensional data sets and to dynamically adjust indexes as the underlying query workload changes. We incorporate a query pattern change detection mechanism to determine when the access patterns have changed enough to warrant a change in the physical database design.

Online Index Recommendations for High-Dimensional Databases Using Query Workloads
5. Content
5.1 Abstract:
High-dimensional databases pose a challenge with respect to efficient access. Users are usually interested in querying data over a relatively small subset of the entire attribute set at a time. A potential solution is to use lower-dimensional indexes that accurately represent the user access patterns. We therefore design a tool to address these issues: we introduce a parameterizable technique to recommend indexes based on index types that are frequently used for high-dimensional data sets and to dynamically adjust indexes as the underlying query workload changes. In the first step, we mine the frequent item sets to find the frequently requested items. We then apply association rules to reduce the size of the index, and after that we apply a histogram to the index, which reduces its size further. If the users' query pattern changes, the index is adjusted automatically. To do that, we incorporate a query pattern change detection mechanism to determine when the access patterns have changed enough to warrant a change in the physical database design. We perform experiments with a number of data sets, query sets, and parameters to show the effect that varying these characteristics has on analysis results.
5.2 Introduction:
An increasing number of database applications, such as business data warehouses and scientific data repositories, deal with high-dimensional data sets. As the number of dimensions/attributes and the overall size of data sets increase, it becomes essential to efficiently retrieve specific queried data from the database in order to effectively utilize the database. Indexing support is needed to effectively prune out significant portions of the data set that are not relevant for the queries. Multidimensional indexing, dimensionality reduction, and Relational Database Management System (RDBMS) index selection tools could all be applied to the problem. However, for high-dimensional data sets, each of these potential solutions has inherent problems.
To illustrate these problems, consider a uniformly distributed data set of 1,000,000 data objects with several hundred attributes. Range queries are consistently executed over five of the attributes. The query selectivity over each attribute is 0.1, so the overall query selectivity is 1/10^5 (that is, the answer set contains about 10 results). An ideal solution would allow us to read from the disk only those pages that contain matching answers to the query. We could build a multidimensional index over the data set so that we can directly answer any query by using only the index. However, the performance of multidimensional index structures is subject to Bellman’s curse of dimensionality and rapidly degrades as the number of dimensions increases. For the given example, such an index would perform much worse than a sequential scan. Another possibility would be to build an index over each single dimension. The effectiveness of this approach is limited by the amount of search space that can be pruned by a single dimension (in the example, the search space would only be pruned to 100,000 objects).
5.2.1 High Dimensional Indexing:
A number of techniques have been introduced to address the high-dimensional indexing problem, such as the X-tree [5] and the GC-tree [6]. Although these index structures have been shown to increase the range of effective dimensionality, they still suffer performance degradation at higher index dimensionality.
5.2.2 Feature Selection
Feature selection techniques are a subset of dimensionality reduction targeted at finding a set of untransformed attributes that best represent the overall data set. These techniques are also focused on maximizing data energy or classification accuracy rather than query response. As a result, selected features may have no overlap with queried attributes.
5.2.3 Index Selection
The index selection problem has been identified as a variation of the Knapsack Problem, and several papers have proposed designs for index recommendations based on optimization rules. These earlier designs could not take advantage of the query optimizers of modern database systems. Currently, almost every commercial RDBMS provides users with an index recommendation tool that is based on a query workload and uses the query optimizer to obtain cost estimates. A query workload is a set of SQL data manipulation statements. The query workload should be representative of the types of queries that an application supports.
5.2.4 Automatic Index Selection
The idea of a database that can tune itself by automatically creating new indexes as queries arrive has been proposed. In one such proposal, a cost model is used to identify beneficial indexes and to decide when to create or drop an index at runtime. Costa and Lifschitz propose an agent-based database architecture to deal with automatic index creation. Microsoft Research has proposed a physical-design alerter to identify when a modification to the physical design could result in improved performance.
5.3 Literature survey:
5.3.1 Index Selection
Index selection is a method of artificial selection in which several useful traits are selected simultaneously. First, each trait that is going to be selected is assigned a weight that reflects its importance. For example, when selecting for both height and coat darkness in dogs, a breeder who considers height more important would assign it the higher weight: the weight for height could be ten and the weight for coat darkness could be two. Each weight is then multiplied by the observed value in each individual animal, and the weighted scores for all characteristics are summed for each individual. The result is the index score, which can be used to compare the worth of each organism being selected. Only those with the highest index scores are selected for breeding via artificial selection.
This method has an advantage over other methods of artificial selection, such as tandem selection, in that traits can be selected simultaneously rather than sequentially. No useful trait is excluded from selection at any one time, so none will start to reverse while the breeder concentrates on improving another property of the organism. Its major disadvantage is that the weights assigned to each characteristic are inherently hard to calculate precisely and so require some trial and error before they become optimal for the breeder.
5.3.2 Query Access pattern:
The advantage of using data access objects is the relatively simple and rigorous separation between two important parts of an application which can and should know almost nothing of each other, and which can be expected to evolve frequently and independently. Changing business logic can rely on the same DAO interface, while changes to persistence logic do not affect DAO clients as long as the interface remains correctly implemented.
In the specific context of the Java programming language, Data Access Objects can be used to insulate an application from the particularly numerous, complex, and varied Java persistence technologies, which could be JDBC, JDO, EJB CMP, Hibernate, or many others. Using Data Access Objects means the underlying technology can be upgraded or swapped without changing other parts of the application.
5.4 System Analysis:
5.4.1. Existing System
• Query response does not perform well if query patterns change, because the existing system uses a static query workload.
• Its performance may degrade as the database size increases.
• Traditional feature selection techniques may offer little or no data pruning capability for the given query attributes.
5.4.2 Proposed System:
• We develop a flexible index selection framework to achieve static index selection and dynamic index selection for high-dimensional data.
• A control feedback technique is introduced for measuring the performance.
• Through this feedback, a database could benefit from an index change.
• The index selection minimizes the cost of the queries in the workload.
• Online index selection is motivated by the case in which the query pattern changes over time.
• By monitoring the query workload and detecting when there is a change in the query pattern, the system is able to maintain good performance as query patterns evolve.
6. System Requirements:
Hardware:
PROCESSOR : Pentium III or higher
RAM : 256 MB DDR RAM
HARD DISK : 10 GB
Software:
Front End : J2EE(JSP)
Back End : MS SQL 2000
Web server : Tomcat 5.0
Operating System : Windows XP
7. System design:
7.1 Data flow diagram
Dynamic index analysis framework :
7.2 Module Description
Modules
• Initialize the abstract Representation
• Calculate the Query Cost
• Index Selection loop
• Calculate the performance
Description:
Module 1: Initialize the abstract Representation:
In this module we monitor the user queries and initialize the abstract representation. We collect the user transactions and, from them, find the frequently selected items. By applying association rules, we calculate the relationships between the records and find the support and confidence. Based on these, we initialize the abstract representation. The initialization step uses a query workload and the data set to produce a set of potential indexes P, a query set Q, and a multidimensional histogram H according to the support, confidence, and histogram size specified by the user. The outputs and how they are generated are described as follows.
Potential index set P. This is a collection of attribute sets that could be beneficial as an index for the queries in the input query workload. This set is computed using traditional data mining techniques. Considering the attributes involved in each query from the input query workload to be a single transaction, P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. Formally, the support of a set of attributes A is defined as
support(A) = |{ i : A ⊆ Q_i, 1 ≤ i ≤ n }| / n,
where Q_i is the set of attributes in the ith query, and n is the number of queries. For instance, if the input support is 10 percent and attributes 1 and 2 are queried together in more than 10 percent of the queries, then a representation of the set of attributes {1, 2} will be included as a potential index. Note that because a subset of an attribute set that meets the support requirement will also necessarily meet the support, all subsets of attribute sets meeting the support will also be included as potential indexes (in the example above, both sets {1} and {2} will be included). As the input support is decreased, the number of potential indexes increases. Note that our particular system is built independently of a query optimizer, but the sets of attributes appearing in the predicates from a query optimizer log could just as easily be substituted for the query workload in this step.
If a set occurs nearly as often as one of its subsets, an index built over the subset will likely not provide much benefit over the query workload if an index is built over the attributes in the set. Such an index will only be more effective in pruning data space for those queries that involve only the subset’s attributes. In order to enhance analysis speed with limited effect on accuracy, the input confidence is used to prune the analysis space. Confidence is the ratio of a set’s occurrence to the occurrence of a subset. While mining the frequent attribute sets in the query workload to determine P, we also maintain the association rules for disjoint subsets and compute the confidence of these association rules. The confidence of an association rule is defined as the ratio at which the antecedent (left-hand side of the rule) and the consequent (right-hand side of the rule) appear together in a query, given that the antecedent appears in the query. Formally, the confidence of an association rule {set of attributes A} → {set of attributes B}, where A and B are disjoint, is defined as
confidence(A → B) = |{ i : (A ∪ B) ⊆ Q_i }| / |{ i : A ⊆ Q_i }|.
In our example, if every time attribute 1 appears, attribute 2 also appears, then the confidence of {1} → {2} is 1.0. If attribute 2 appears without attribute 1 as many times as it appears with attribute 1, then the confidence of {2} → {1} is 0.5. If we have set the confidence input to 0.6, then we will prune the attribute set {1} from P, but we will keep attribute set {2}. We can also set the confidence level based on the attribute set cardinality. Since the cost of including extra attributes that are not useful for pruning increases with increased indexed dimensionality, we want to be more conservative with respect to pruning attribute subsets.
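The support and confidence computations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the brute-force enumeration (used here in place of Apriori for brevity), the choice of `>=` for the confidence test, and the four-query workload `example_queries` are all assumptions made for illustration. The workload is chosen to mirror the example in the text, where attribute 2 always accompanies attribute 1 but also appears alone equally often.

```python
from itertools import combinations


def support(attrs, queries):
    """Fraction of queries whose attribute set contains every attribute in attrs."""
    attrs = frozenset(attrs)
    return sum(1 for q in queries if attrs <= q) / len(queries)


def confidence(antecedent, consequent, queries):
    """Among queries containing the antecedent, the fraction that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    with_a = sum(1 for q in queries if a <= q)
    with_both = sum(1 for q in queries if both <= q)
    return with_both / with_a if with_a else 0.0


def potential_indexes(queries, min_support, min_confidence):
    """Enumerate frequent attribute sets, then prune subsets by rule confidence."""
    attributes = sorted(set().union(*queries))
    frequent = set()
    for size in range(1, len(attributes) + 1):
        level = {frozenset(c) for c in combinations(attributes, size)
                 if support(c, queries) > min_support}
        if not level:
            break  # no larger set can be frequent if no set of this size is
        frequent |= level
    pruned = set()
    for big in frequent:
        for small in frequent:
            # A subset that almost always occurs with the rest of a larger
            # frequent set adds little as an index of its own (assumed test: >=).
            if small < big and confidence(small, big - small, queries) >= min_confidence:
                pruned.add(small)
    return frequent - pruned


# Hypothetical workload: one attribute set per query.
example_queries = [frozenset({1, 2}), frozenset({1, 2}), frozenset({2}), frozenset({2})]
P = potential_indexes(example_queries, min_support=0.1, min_confidence=0.6)
```

With this workload, the rule {1} → {2} has confidence 1.0 and {2} → {1} has confidence 0.5, so a confidence input of 0.6 removes {1} from P and keeps {2} and {1, 2}. The enumeration is exponential in the number of attributes and is only suitable for small query attribute sets.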
The confidence could take on a value that is dependent on the set cardinality. Although the Apriori algorithm was appropriate for the relatively small query attribute sets in our domain, a more efficient algorithm such as the FP-Tree [24] could be applied if the attribute sets associated with queries are too large for the Apriori technique to be efficient. Although it is desirable to avoid examining a high-dimensional index set as a potential index, another possible solution in the case where a large number of attributes are frequent together would be to partition a large closed frequent item set into disjoint subsets for further examination. Techniques such as CLOSET [25] could be used to arrive at the initial closed frequent item sets.
[Table: Index Selection Notation List]
Query set Q. This is the abstract representation of the query workload. It is initialized by associating the potential indexes that could be beneficial for each query with that query. These are the indexes in the potential index set P that share at least one common attribute with the query. At the end of this step, each query has an identified set of possible indexes for that query.
Multidimensional histogram H. An abstract representation of the data set is created in order to estimate the query cost associated with using each query’s possible indexes to answer that query. This representation is in the form of a multidimensional histogram H. A single bucket represents a unique bit representation across all the attributes represented in the histogram. The input histogram size dictates the number of bits used to represent each unique bucket in the histogram. These bits are designated to represent only the single attributes that met the input support in the input query workload. If a single attribute does not meet the support, then it cannot be part of an attribute set appearing in P. There is no reason to sacrifice data representation resolution for attributes that will not be evaluated. The number of bits that each of the represented attributes gets is proportional to the log of that attribute’s support. This gives more resolution to those attributes that occur more frequently in the query workload. Data for an attribute that has been assigned b bits is divided into 2^b buckets. In order to handle data sets with uneven data distribution, we define the ranges of each bucket so that each bucket contains roughly the same number of points. The histogram is built by converting each record in the data set to its representation in bucket numbers. As we process data rows, we only aggregate the count of rows with each unique bucket representation, because we are just interested in estimating the query cost. Note that the multidimensional histogram is based on a scalar quantizer designed on data and access patterns, as opposed to just data in the traditional case. A higher accuracy in representation is achieved by using more bits to quantize the attributes that are more frequently queried. For illustration, Table 2 shows a simple multidimensional histogram example. This histogram covers three attributes and uses 1 bit to quantize attributes 2 and 3, and 2 bits to quantize attribute 1, assuming that it is queried more frequently than the other attributes.
In this example, for attributes 2 and 3, values from 1 to 5 quantize to 0, and values from 6 to 10 quantize to 1. For attribute 1, values 1 and 2 quantize to 00, 3 and 4 quantize to 01, 5, 6, and 7 quantize to 10, and 8 and 9 quantize to 11. The periods in the “Value” column denote attribute boundaries (that is, attribute 1 has 2 bits assigned to it). Note that we do not maintain any entries in the histogram for bit representations that have no occurrences. Thus, we cannot have more histogram entries than records, and we will not suffer from an exponentially increasing number of potential multidimensional histogram buckets for high-dimensional histograms.
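The initialization of the query set Q and the histogram H described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the system's actual code: the function names, the dotted bit-string key format, and the five sample `records` are hypothetical. The bucket boundaries in `BOUNDARIES` are hard-coded to match the quantization ranges given in the text, while `equi_depth_boundaries` shows one simple way such boundaries might be derived so that buckets hold roughly equal numbers of points.

```python
from bisect import bisect_left
from collections import Counter


def candidate_indexes(query_attrs, potential_indexes):
    """Entry of Q for one query: potential indexes sharing at least one attribute with it."""
    return [p for p in potential_indexes if p & query_attrs]


def equi_depth_boundaries(values, bits):
    """Pick 2**bits - 1 upper boundaries so that buckets hold roughly equal counts."""
    ordered = sorted(values)
    buckets = 2 ** bits
    return [ordered[max((k * len(ordered)) // buckets, 1) - 1] for k in range(1, buckets)]


# Upper bucket boundaries (inclusive) matching the ranges in the text.
BOUNDARIES = {
    1: [2, 4, 7],  # 1-2 -> 00, 3-4 -> 01, 5-7 -> 10, 8-9 -> 11
    2: [5],        # 1-5 -> 0, 6-10 -> 1
    3: [5],
}
BITS = {1: 2, 2: 1, 3: 1}  # bits assigned to each attribute


def quantize(record):
    """Map a record {attribute: value} to a dotted bit string such as '01.1.0'."""
    parts = []
    for attr in sorted(BOUNDARIES):
        bucket = bisect_left(BOUNDARIES[attr], record[attr])
        parts.append(format(bucket, "0{}b".format(BITS[attr])))
    return ".".join(parts)


def build_histogram(records):
    """Count rows per bucket representation; empty buckets are never stored."""
    return Counter(quantize(r) for r in records)


# Hypothetical records, keyed by attribute number.
records = [
    {1: 1, 2: 3, 3: 7},
    {1: 2, 2: 5, 3: 9},
    {1: 4, 2: 8, 3: 2},
    {1: 6, 2: 1, 3: 1},
    {1: 9, 2: 10, 3: 6},
]
histogram = build_histogram(records)
```

Because only occupied buckets are counted, the histogram here has four entries for five records: the first two records fall into the same bucket, `00.0.1`.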