3.
Big data
 “Big data” refers to datasets whose size is beyond the ability
of typical database software tools to capture, store, manage
and analyze.
 This definition can vary by sector depending on what kinds
of software tools are commonly available and what sizes of
datasets are common there.
 As technology advances over time, the size of datasets that
qualify as Big data will also increase.
 With these caveats, Big data will range from a few dozen
terabytes to multiple petabytes (thousands of terabytes).

4.
Big data—a growing torrent
 $600 to buy a disk drive that can store all of the world’s music
 5 billion mobile phones in use in 2010.
 30 billion pieces of content shared on Facebook every month.
 40% projected growth in global data generated per year vs. 5%
growth in global IT spending.
 235 terabytes of data collected by the US Library of Congress by
April 2011.
 15 out of 17 sectors in the US have more data stored per
company than the US Library of Congress.

7.
Big Data vs. DWH-DM
• Areas like genomics, astronomy, military surveillance and
RFID technology are also contributing to the explosive
growth of the field.
• A jet engine’s sensors send terabytes of data every hour,
which can be used to build predictive models for repair
cycles. Understanding when repairs should be done, instead
of doing traditional preventive maintenance at certain set
intervals, could be worth billions of dollars.
• The challenge in big data analytics is to dig deeply, quickly,
and widely.
• DWH-DM
– Structured data
– Off-line algorithms

8.
Challenges of Large Scale Social Network
Analysis
 Social networking sites like Facebook, YouTube, Orkut and
Twitter are among the most popular sites on the internet.
 Users of these sites form a social network (SN), which provides
a powerful means of sharing, organizing, and finding content
and contacts.
 However, the rate at which SNs are growing poses many latent
challenges in maintaining the stability of their underlying
systems and the members associated with them.

9.
Challenges of Large Scale Social Network
Analysis
• Social Networks (SNs) are living networks that generate data
traces daily, which can reach exabytes in volume.
• For example, Facebook produces more than a petabyte of data
per day. Even its logging data exceeds 25 terabytes per day.
• Google now creates as much information (social blogs and Orkut)
in two days as we did from the dawn of man through 2003, i.e.,
one exabyte of data.
• Analysts need to analyze this vast volume of SN data in limited
time to support system management activities.

10.
Big data and Big Brother
• Perhaps one of the biggest contributors to big data, however,
is social networking.
• People themselves have become contributors of information
as they increasingly use services such as Facebook and
LinkedIn to connect with each other.
• “LinkedIn is a particularly interesting target, given the
professional nature of its audience. By analyzing LinkedIn
network information, we can learn a lot about individuals and
the people that they know”

11.
• While it may be difficult to manipulate big data at a grand
scale, it is relatively easy, given the right tools and techniques,
to analyze small subsets (such as personal networks of
contacts) for potentially useful results.
• We can do this at a micro-analytic level, where we mine
profiles for snippets of information, and at the macro-analytic
level, where we look at patterns in the data.
• “Even when people are not part of your network, a
properly filled-out profile reveals their job title, where
they worked in the past, and where they were
educated.”

12.
Where does it come from??
 In the global marketplace, businesses, suppliers and customers are
creating and consuming vast amounts of information.

13.
Cont… Big Data
 Gartner predicts that enterprise data in all forms will grow
650% over the next 5 years.
 According to IDC, the world's volume of data doubles every
18 months.
 This flood of data is referred to as “information overload,”
“data deluge,” and “big data.”
 Big data creates a challenge for business leaders.

14.
NoSQL Databases
 Most of the organizations that have built data platforms have
found it necessary to go beyond the relational database model
to tackle big data, because relational databases become
ineffective at this scale.
 Managing sharding and replication across a horde of database
servers is difficult and slow.
 To store huge datasets effectively, a new breed of databases has
been developed. These databases are called NoSQL databases, or
non-relational databases.
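
As an illustration of the sharding mentioned above, the following is a
minimal sketch (plain Python, not any particular database's API) of
hash-based sharding: record keys are routed to one of several servers.
The server names and keys are made up for illustration.

import hashlib

# Hypothetical list of database servers (illustrative names only).
SERVERS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(key: str) -> str:
    """Pick the server responsible for a record key.

    A stable hash (MD5 here) keeps the key-to-shard mapping identical
    across processes and restarts; the modulo picks one shard.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(shard_for("user:12345"))  # this key always routes to the same node
print(shard_for("user:67890"))  # a different key may land on another node

Splitting data this way is what makes horizontal scaling possible, but it
is also why managing rebalancing and replication across many servers is
difficult and slow.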

15.
NoSQL Databases
Many of the NoSQL databases are the logical descendants of
Google’s BigTable and Amazon’s Dynamo.
These are designed to be distributed across many nodes, to
provide eventual (rather than strict) consistency, and to have
very flexible schemas.

16.
Popular NoSQL databases
Cassandra:
 Developed at Facebook, in production use at Twitter,
Rackspace, Reddit, and other large sites.
 Cassandra is designed for high performance, reliability,
and automatic replication. It has a very flexible data
model. A new startup, Riptano, provides commercial
support.
HBase:
 Part of the Apache Hadoop project, and modeled on
Google’s BigTable.
 Suitable for extremely large databases (billions of rows,
millions of columns), distributed across thousands of
nodes. Along with Hadoop, commercial support is
provided by Cloudera.
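
To make the BigTable-style model concrete, the toy sketch below (plain
Python, not the HBase API) stores values under a row key and a
“family:qualifier” column name; rows need not share the same columns,
which is what the flexible data model means here. All names and values
are made up.

from collections import defaultdict

# Toy wide-column table: table[row_key][column] = value.
# Column names follow the BigTable/HBase "family:qualifier" convention.
table = defaultdict(dict)

table["user#1001"]["profile:name"] = "Alice"
table["user#1001"]["profile:city"] = "Addis Ababa"
table["user#1002"]["profile:name"] = "Bob"
table["user#1002"]["activity:last_login"] = "2011-04-01"  # only this row has it

def get(row_key, column):
    # Missing columns are simply absent, not NULL-filled as in a fixed schema.
    return table[row_key].get(column)

print(get("user#1001", "profile:city"))         # Addis Ababa
print(get("user#1001", "activity:last_login"))  # None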

17.
Prevalence of Big Data
 Big data is not limited to big companies like Facebook and
Google.
 According to a McKinsey Global Institute study in 2011:
 Most investment firms in the U.S. with more than 1,000
employees have 3.8 petabytes of data stored.
 Companies in all sectors have at least 100 terabytes stored.

19.
Big data Technologies
 Big data technologies describe a new generation of
technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of
data, by enabling high velocity capture, discovery, and/or
analysis.
 The above definition incorporates all types of data (e.g.,
real-time, analytic) managed by next-generation systems.

20.

• The MapReduce approach is basically a divide-and-conquer
strategy for distributing an extremely large problem across
an extremely large computing cluster.
• In the “map” stage, a programming task is divided into a
number of identical subtasks, which are then distributed
across many processors.
• The intermediate results are then combined by a single
“reduce” task.
• MapReduce provides a solution to Google’s biggest problem,
i.e., creating large-scale searches.
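
As a minimal sketch of these two stages, the classic word-count example
below runs map and reduce in a single process; a real framework such as
Hadoop would run many map tasks in parallel across the cluster and
shuffle the intermediate pairs to the reduce tasks. The documents are
made up.

from collections import defaultdict
from itertools import chain

def map_task(document):
    # "Map" stage: each identical subtask emits intermediate (key, value) pairs.
    return [(word, 1) for word in document.split()]

def reduce_task(pairs):
    # "Reduce" stage: combine all intermediate values that share a key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data big cluster", "data beats intuition"]
intermediate = chain.from_iterable(map_task(d) for d in documents)
print(reduce_task(intermediate))
# {'big': 2, 'data': 2, 'cluster': 1, 'beats': 1, 'intuition': 1}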

21.
 MapReduce has proven to be widely applicable to many large
data problems, ranging from search to machine learning.
 The most popular open source implementation of MapReduce is
the Hadoop project.

22.
Applications of Big data Analysis
 Facebook and LinkedIn use patterns of friendship
relationships to suggest other people you may know, or
should know, with frightening accuracy.
 Amazon saves your searches, correlates what you search for
with what other users search for, and uses it to create
surprisingly appropriate recommendations.
 Medical researchers sift through the health records of
thousands of people to try to identify useful correlations
between medical treatments and health outcomes.
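
The friendship-based suggestions in the first point can be sketched,
very roughly, as ranking non-friends by how many mutual friends they
share with a user. The graph below is made up, and real systems use far
richer signals than mutual-friend counts.

from collections import Counter

# Hypothetical friendship graph (undirected), stored as adjacency sets.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave":  {"bob", "carol"},
    "erin":  {"carol"},
}

def people_you_may_know(user, top_n=3):
    # Count, for each friend-of-a-friend, the mutual friends shared with `user`.
    mutual = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                mutual[candidate] += 1
    return mutual.most_common(top_n)

print(people_you_may_know("alice"))  # [('dave', 2), ('erin', 1)]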

24.
 As data volumes are growing exponentially, so is the concern
over data preservation, access, dissemination, and usability.
Many agencies have taken initiatives to research areas such as
automated analysis techniques, data mining, machine learning,
privacy, and database interoperability, and these will help to
identify how big data can enable science in new ways and at new
levels.