Abstract

Clustering is one of the most important techniques in machine learning and data mining. Clustering techniques group similar documents together, and similarity measures are used to determine relationships between transactions. Hierarchical clustering models produce tree-structured results, while partition-based clustering produces its outcome in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input to the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen. Document features are automatically partitioned into two groups: discriminative words and nondiscriminative words. Only discriminative words are useful for grouping documents; including nondiscriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the structure of the document collection and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents. The DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input. The discriminative word identification process is improved with a labeled document analysis mechanism. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for dimensionality reduction. System development is planned with the Java language and an Oracle relational database.
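
The clustering property of the Dirichlet Process mentioned above can be illustrated with the Chinese Restaurant Process, in which each new document joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The following is a minimal sketch of that prior only, not the DPMFP method itself: it ignores the data likelihood, the word partition, and variational inference. The class name `CrpSketch`, its method names, and the concentration value `alpha = 1.0` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; names and parameter values are not from the paper.
class CrpSketch {

    // Draws cluster labels for n items from the Chinese Restaurant Process prior.
    // alpha is the DP concentration parameter (an assumed value, chosen by the caller).
    static int[] sampleAssignments(int n, double alpha, long seed) {
        Random rng = new Random(seed);
        int[] assignment = new int[n];
        List<Integer> clusterSizes = new ArrayList<>();

        for (int i = 0; i < n; i++) {
            // i items are already placed, so the total weight is i + alpha.
            double u = rng.nextDouble() * (i + alpha);
            int chosen = clusterSizes.size(); // default: open a new cluster
            double cumulative = 0.0;
            for (int k = 0; k < clusterSizes.size(); k++) {
                cumulative += clusterSizes.get(k);
                if (u < cumulative) { // existing cluster k wins with weight size_k
                    chosen = k;
                    break;
                }
            }
            if (chosen == clusterSizes.size()) {
                clusterSizes.add(1);
            } else {
                clusterSizes.set(chosen, clusterSizes.get(chosen) + 1);
            }
            assignment[i] = chosen;
        }
        return assignment;
    }

    // Labels are assigned contiguously from 0, so the count is max label + 1.
    static int countClusters(int[] assignment) {
        int max = -1;
        for (int a : assignment) {
            max = Math.max(max, a);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        int[] z = sampleAssignments(100, 1.0, 7L);
        System.out.println("clusters used: " + countClusters(z));
    }
}
```

Note that the number of clusters is never passed in: it emerges from the sampling, which is the property that lets a DPM-based method avoid a fixed K. A full model would weight each choice by how well the document fits the cluster.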
