Tag: big data project titles

Identification of Spiders and Crawlers: Spiders are small web programs that harvest information for search engines. These spiders track websites. In some ways they are useful, in that they help websites show up quickly. These programs follow certain links on the web and gather information. You can also explicitly instruct a robot not to follow any of the links on a page. Alongside the good spiders there are bad ones, known as spam spiders, which try to harvest your email address. Some spiders…
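As a small illustration of instructing a robot not to follow links, the sketch below checks a page for a robots meta tag carrying the `nofollow` directive, which is how a well-behaved crawler might decide whether to follow the page's links. The class and function names (`RobotsMetaParser`, `may_follow_links`) and the sample page are hypothetical and do not come from the project described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def may_follow_links(html):
    """Return False if the page asks robots not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return not ({"nofollow", "none"} & parser.directives)


# Hypothetical sample page used only for illustration.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    '<body><a href="/a">a</a></body></html>'
)
print(may_follow_links(page))
```

A spam spider can simply ignore this directive, which is one reason identifying such crawlers is a separate problem from instructing them.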

Sentiment Analysis Using Twitter Data – Hadoop Project In today’s world, opinions and reviews accessible to us are one of the most critical factors in formulating our views and influencing the success of a brand, product or service. With the advent and growth of social media in the world, stakeholders often take to expressing their opinions on popular social media, namely Twitter. While Twitter data is extremely informative, it presents a challenge for analysis because of its humongous and disorganized nature. This paper is a thorough effort to dive into…

FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial…
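To make the idea of mining frequent itemsets in a map/reduce style concrete, here is a minimal single-process sketch. It is a brute-force support count and not FiDoop itself: it does not use the frequent items ultrametric tree or the three-job pipeline the abstract mentions. All names (`map_phase`, `reduce_phase`, `frequent_itemsets`) and the sample transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-item combination in one transaction."""
    items = sorted(set(transaction))
    for itemset in combinations(items, k):
        yield itemset, 1


def reduce_phase(pairs):
    """Sum the counts emitted for each itemset."""
    counts = Counter()
    for itemset, count in pairs:
        counts[itemset] += count
    return counts


def frequent_itemsets(transactions, k, min_support):
    """Return the k-itemsets that occur in at least min_support transactions."""
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    counts = reduce_phase(pairs)
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy data set.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
frequent_pairs = frequent_itemsets(transactions, k=2, min_support=2)
print(frequent_pairs)
```

On a real cluster the map and reduce steps would run on different nodes, and enumerating every combination quickly becomes too expensive, which is the kind of cost that compressed tree structures are meant to avoid.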

HConfig: Resource Adaptive Fast Bulk Loading in HBase NoSQL (Not only SQL) data stores have become a vital component in many big data computing platforms due to their inherent horizontal scalability. HBase is an open-source distributed NoSQL store that is widely used by many Internet enterprises to handle their big data computing applications (e.g. Facebook handles millions of messages each day with HBase). Optimizations that can enhance the performance of HBase are of paramount interest for big data applications that use HBase or Bigtable-like key-value stores. In this paper…

Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters The MapReduce framework and its open source implementation Hadoop have become the de facto platform for scalable analysis on large data sets in recent years. One of the primary concerns in Hadoop is how to minimize the completion length (i.e., makespan) of a set of MapReduce jobs. The current Hadoop only allows static slot configuration, i.e., fixed numbers of map slots and reduce slots throughout the lifetime of a cluster. However, we found that such a static configuration may lead to low…

Frequent Itemset Mining for Big Data in social media using ClustBigFIM algorithm Abstract – A tremendous amount of data, called Big Data, is being explored through the IoT (Internet of Things) from a variety of sources such as sensor networks, social media feeds, and internet applications. Big Data cannot be handled by conventional tools and techniques. Social networks are becoming dominant in communication over the internet. Big Data mining is essential for extracting value from massive amounts of data, which could give better insights using efficient techniques. Association Rule mining and…

FastRAQ: A Fast Approach to Range-Aggregate Queries in Big Data Environments Range-aggregate queries apply a certain aggregate function to all tuples within given query ranges. Existing approaches to range-aggregate queries are insufficient to quickly provide accurate results in big data environments. FastRAQ, a fast approach to range-aggregate queries, is proposed for big data environments. FastRAQ first divides big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains the result directly by…
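To illustrate what a range-aggregate query computes, the sketch below evaluates one exactly by scanning every tuple. This is only a naive baseline for the definition given above, not FastRAQ's partition-and-sketch estimation, whose details are cut off in the excerpt. The function name `range_aggregate`, its parameters, and the sample rows are assumptions for illustration.

```python
def range_aggregate(rows, range_column, low, high, value_column, aggregate=sum):
    """Apply `aggregate` to `value_column` over rows whose `range_column`
    falls within the closed interval [low, high]."""
    selected = [
        row[value_column]
        for row in rows
        if low <= row[range_column] <= high
    ]
    return aggregate(selected)


# Hypothetical sample tuples.
rows = [
    {"age": 23, "spend": 10.0},
    {"age": 35, "spend": 20.0},
    {"age": 41, "spend": 5.0},
    {"age": 58, "spend": 7.5},
]

# Total spend for ages 30 to 50, and the number of matching tuples.
total = range_aggregate(rows, "age", 30, 50, "spend")
matches = range_aggregate(rows, "age", 30, 50, "spend", aggregate=len)
print(total, matches)
```

A full scan like this touches every tuple on every query, which is the cost that approximate, per-partition approaches aim to avoid on very large data sets.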