In most enterprises, public or private, there is typically a mountain of data, structured and unstructured, that contains potential insights into how to serve customers better, engage with them more effectively, and make processes run more efficiently. Consider this:

Data is seen as a resource that can be extracted, refined, and turned into something powerful. It takes a certain amount of computing power to analyze the data and pull out those insights. That's where new tools like Hadoop, NoSQL, in-memory analytics, and other enablers come in.

What business problems are being targeted?

Why are some companies in retail, insurance, financial services and healthcare racing to position themselves in Big Data, in-memory data clouds while others don’t seem to care?

As a result, a new BI and Analytics framework is emerging to support public and private cloud deployments.

The excitement is that Big Data capabilities fundamentally change the core premise of BI and analytics – the ability to have end-users (and even machines) perform ad-hoc analysis and reporting tasks over large and continuously growing amounts of structured and unstructured information such as log files, sensor data, streaming data, sales transactions, emails, research data and images collectively known as ‘big data.’

Technology Innovation around Big Data

Big Data is a hot topic because it represents the first time in about 30 years that people are rethinking databases and data management. Literally, since about 1980 the enterprise database market has consolidated around 3 vendors – Oracle, IBM and Microsoft.

Hardware architectures have changed — people want to scale horizontally like Google.

Innovation around Big Data is also happening on other fronts from the core (e.g., analytics and query optimization), to the practical (e.g., horizontal scaling), to the mundane (e.g., backup and recovery).

New Tools

So if you have not heard of these tools – Hadoop, NoSQL, MongoDB, Cassandra, HBase, Columnar databases, Data Appliances – then it’s time for a quick primer.

NoSQL stands for Not Only SQL. NoSQL databases do not use the popular SQL (Structured Query Language) to create tables and insert, delete or update data. Many NoSQL deployments handle data that simply can't be handled well by a relational database, such as sparse data, text, and other forms of unstructured content. Unstructured content includes social media/network feeds, Internet text and documents, call detail records, photo and video archives, and web logs. Industry-specific unstructured data includes RFID; large-scale eCommerce catalogs; sensor networks; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; and medical records.
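To make the sparse-data point concrete, here is a minimal sketch (plain Python with made-up field names) of why document-style storage suits records whose fields vary: each document carries only its own fields, whereas a single relational table holding both would need a NULL-heavy column for every field that appears anywhere.

```python
import json

# Hypothetical documents: each record keeps only its own fields.
# A relational table holding both would need a column for every
# field that appears anywhere, with NULLs for the rest.
docs = [
    {"_id": 1, "type": "tweet", "text": "Big Data!", "hashtags": ["bigdata"]},
    {"_id": 2, "type": "cdr", "caller": "555-0100", "duration_sec": 42},
]

# No shared schema is required to serialize or store them.
serialized = [json.dumps(d, sort_keys=True) for d in docs]
```

The same flexibility is what lets NoSQL stores absorb logs, sensor readings and social feeds without upfront schema design.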

Cassandra was developed at Facebook and open sourced in 2008; it eventually became an Apache project. Cassandra is influenced by the Google BigTable model, but also uses concepts from Amazon's Dynamo distributed key-value store. The database is used by Facebook, Digg and Twitter.

HBase is an open-source, column-oriented NoSQL store modeled on Google's BigTable system. HBase is an Apache project and part of the Hadoop ecosystem. See this presentation on how Facebook uses HBase in production.

Hadoop – Apache Hadoop is a popular open-source software framework for distributed/grid-computing environments that enables applications to analyze very large data sets. Relational database systems are good at data retrieval and queries, but tend to struggle when new data arrives at very high volume and velocity. Hadoop and other tools get around this and allow data ingestion at incredibly fast rates.

Hadoop, built initially by Doug Cutting while he was at Yahoo, first became prominent in unstructured data management and cloud computing.

Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. But Hadoop requires additional tools such as Hive, which provides SQL-like queries, or Pig, which provides data-flow scripts, to retrieve the data.

Technically, Hadoop, a Java-based framework, consists of two elements: reliable, low-cost storage of very large data sets using the Hadoop Distributed File System (HDFS), and a high-performance parallel/distributed data-processing framework called MapReduce.

Hadoop builds on the MapReduce algorithm. MapReduce, first introduced by Google in 2004, consists of two functions – Map and Reduce. Map takes large computational problems, breaks them down into smaller subproblems and distributes those to worker nodes, which solve the problem and pass the answer back to the master node. The Reduce function consolidates the answers from the Map function to produce the final output. Search algorithms (public cloud) are often designed in this fashion.
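The two functions can be sketched in a few lines of Python. This is a single-process teaching sketch of the Map/Reduce contract described above, not Hadoop's actual Java API; the "splits" stand in for the data blocks handed out to worker nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: consolidate the emitted pairs into per-word totals."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "worker node" maps its own split; the reducer consolidates the answers.
splits = ["big data big insights", "data beats opinion"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
word_counts = reduce_phase(mapped)
# word_counts -> {'big': 2, 'data': 2, 'insights': 1, 'beats': 1, 'opinion': 1}
```

In real Hadoop, the map outputs are shuffled across the network so that all pairs for a given key reach the same reducer; the toy version above collapses that into one in-memory pass.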

Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing. It can deliver data — and can run large-scale, high-performance processing batch jobs — in spite of system changes or failures.

Data Appliances

Data appliances are one of the fastest-growing categories in Big Data. They combine database, processing, and storage in a single integrated system optimized for analytical processing and designed for flexible growth. The architecture is based on the following core principles:

Processing close to the data source

Appliance simplicity (ease of procurement; limited consulting)

Massively parallel architecture

Platform for advanced analytics

Flexible configurations and extreme scalability

A number of vendors are going down the path of appliance and quasi-appliance offerings which have some preconfiguration of hardware and software, cloud-supporting deployments, and reference configurations.

SAP HANA, which debuted at Sapphire 2011, is SAP's rough equivalent of Exadata. HANA is based on a fundamental computer-science principle: when operating on large data sets where fast response times matter, do not move data from disk unless absolutely necessary. Separate OLAP (BI data) from OLTP (transaction data), keep the OLAP data in memory, and the dashboards, reporting and analytics speed up.
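The in-memory, column-oriented idea behind appliances like HANA can be illustrated with a toy example (plain Python, made-up data): an aggregate such as revenue by region scans only the columns it needs, rather than dragging every field of every row through memory.

```python
# Row store: each record carries every field; an aggregate touches all of them.
rows = [
    {"region": "EMEA", "product": "A", "revenue": 120.0},
    {"region": "APAC", "product": "B", "revenue": 80.0},
    {"region": "EMEA", "product": "B", "revenue": 50.0},
]

# Column store: each field lives in its own in-memory array, so a SUM or
# GROUP BY reads only the columns it actually uses.
columns = {
    "region": [r["region"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}

def revenue_by_region(cols):
    """GROUP BY region, SUM(revenue) over the two relevant columns only."""
    totals = {}
    for region, rev in zip(cols["region"], cols["revenue"]):
        totals[region] = totals.get(region, 0.0) + rev
    return totals

totals = revenue_by_region(columns)
# totals -> {'EMEA': 170.0, 'APAC': 80.0}
```

Real columnar engines add compression and vectorized scans on top of this layout, but the core win is the same: analytical queries touch a fraction of the data.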

MongoDB

MongoDB is an open source database, combining scalability, performance and ease of use, with traditional relational database features such as dynamic queries and indexes. It has become the leading NoSQL database choice, with downloads exceeding 100,000 per month. Thousands of customers including Fortune 500 enterprises and leading Web 2.0 companies are developing large-scale applications and performing real-time “Big Data” analytics with MongoDB. For more information, visit www.mongodb.org or www.10gen.com. 10gen develops MongoDB, and offers production support, training, and consulting for the database.
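To illustrate the dynamic queries MongoDB is known for, here is a toy matcher (plain Python, not the real pymongo driver or query engine) that mimics a small fragment of MongoDB's query language: equality and the `$gt` operator, applied to in-memory documents.

```python
def matches(doc, query):
    """Tiny illustration of MongoDB-style query matching.
    Supports equality and the $gt operator only; a teaching sketch,
    not the real query engine."""
    for field, cond in query.items():
        if isinstance(cond, dict):
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:
            return False
    return True

users = [
    {"name": "ana", "logins": 42},
    {"name": "bo", "logins": 7},
]

# Dynamic query, MongoDB shell style: db.users.find({"logins": {"$gt": 10}})
active = [u for u in users if matches(u, {"logins": {"$gt": 10}})]
# active -> [{'name': 'ana', 'logins': 42}]
```

The point is that the query is just data (a document itself), so it can be composed at runtime; no table schema or prepared statement is involved.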

There are many new database directions appearing on the landscape today. These include schema-less DBMSs ("NoSQL"), cloud databases, highly distributed databases, small-footprint DBMSs, and in-memory databases (IMDB). The business applications of these are driven by high performance, low latency and efficiency of deployment. All are driven by the premise that insight into data requires more than tabular analysis.

Google’s LevelDB – NoSQL

In May 2011, Google open-sourced a BigTable-inspired key-value database library called LevelDB under a BSD license. It was created by Jeff Dean and Sanjay Ghemawat of the BigTable project at Google, and a recent blog post from Google made the project more widely known. It's available for Unix-based systems, Mac OS X, Windows, and Android.

According to the announcement: “LevelDB may be used by a web browser to store a cache of recently accessed web pages, or by an operating system to store the list of installed packages and package dependencies, or by an application to store user preference settings. We designed LevelDB to also be useful as a building block for higher-level storage systems. Upcoming versions of the Chrome browser include an implementation of the IndexedDB HTML5 API that is built on top of LevelDB. Google’s Bigtable manages millions of tablets where the contents of a particular tablet are represented by a precursor to LevelDB.”
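LevelDB's defining trait is that keys are kept in sorted order, which makes prefix and range scans cheap. The sketch below (pure Python, in-memory; real LevelDB persists sorted tables to disk) mimics its put/get/range-scan interface to show why a use case like the package list above fits a sorted key-value store.

```python
import bisect

class TinyOrderedKV:
    """Sketch of LevelDB's core interface: put/get plus ordered range
    scans. Real LevelDB persists sorted string tables to disk; this
    toy keeps everything in memory."""
    def __init__(self):
        self._keys = []   # kept sorted, like LevelDB's key order
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, key):
        return self._vals.get(key)

    def scan(self, start, end):
        """Iterate keys in [start, end) in order: LevelDB's range query."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

db = TinyOrderedKV()
db.put("pkg:zlib", "1.2")
db.put("pkg:curl", "7.21")
db.put("pref:theme", "dark")
installed = db.scan("pkg:", "pkg;")   # all keys with the "pkg:" prefix
# installed -> [('pkg:curl', '7.21'), ('pkg:zlib', '1.2')]
```

Namespacing keys by prefix ("pkg:", "pref:") and scanning a range is the standard idiom for modeling tables on top of an ordered key-value store.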

Big Data Use Cases

E-tailing – E-Commerce – Online Retailing

Recommendation engines — increase average order size by recommending complementary products based on predictive analysis for cross-selling.

Cross-channel analytics — sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).

Event analytics — what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).

Retail/Consumer Products

Merchandizing and market basket analysis.

Campaign management and customer loyalty programs.

Supply-chain management and analytics.

Event- and behavior-based targeting.

Market and consumer segmentations.

Financial Services

Compliance and regulatory reporting.

Risk analysis and management.

Fraud detection and security analytics.

CRM and customer loyalty programs.

Credit risk, scoring and analysis.

High-speed arbitrage trading.

Trade surveillance.

Abnormal trading pattern analysis.

Web & Digital Media Services

Large-scale clickstream analytics.

Ad targeting, analysis, forecasting and optimization.

Abuse and click-fraud prevention.

Social graph analysis and profile segmentation.

Campaign management and loyalty programs.

Government

Fraud detection and cybersecurity.

Compliance and regulatory analysis.

Energy consumption and carbon footprint management.

New Applications

Sentiment Analytics

Mashups – Mobile User Location + Precision Targeting

Machine-generated data, the exhaust fumes of the Web

Health & Life Sciences

Health insurance fraud detection.

Campaign and sales program optimization.

Brand management.

Patient care quality and program analysis.

Supply-chain management.

Drug discovery and development analysis.

Telecommunications

Revenue assurance and price optimization.

Customer churn prevention.

Campaign management and customer loyalty.

Call Detail Record (CDR) analysis.

Network performance and optimization.

Mobile user location analysis.

Smart meters in the utilities industry. The rollout of smart meters as part of the Smart Grid adoption by utilities everywhere has resulted in a deluge of data flowing at unprecedented levels. Most utilities are ill-prepared to analyze the data once the meters are turned on.

As I speak to customers, it is becoming clearer to me that there is going to be a growing push toward an elastic/adaptive infrastructure for data warehousing and analytics. With the increasing focus on mobility and faster decision making, the business is going to push for this faster than corporate IT can react.

Bottomline

What’s next? That’s a simple question to ask, but it’s not so simple to answer.

Big Data is an umbrella phrase for a set of technologies, skills, methods and processes, some new, some not, for gaining insight from mountains of data. It is essentially defined by the combination of the three V's: volume, velocity and variety.

I am seeing the following trends:

The enterprise IT roadmap is going to divide into a Compute Cloud AND Data Clouds.

The Compute Cloud (private/public/hybrid) is being driven from the virtualization/resource side.

The Data Cloud (in-memory, data appliances) is being driven from the mobility and decision-making side.

A prediction from some circles: half of the world's data will be stored in Apache Hadoop within five years.

The opportunity that startups like Cloudera are pursuing: grow the Apache Hadoop ecosystem by making Apache Hadoop easier to consume, and profit by providing training, support and certification.


Defining Business Analytics

What is Business Analytics? Business Analytics is the intersection of business and technology, offering new opportunities for a competitive advantage. Business analytics unlocks the predictive potential of data analysis to improve financial performance, strategic management, and operational efficiency.

What is BI? BI is the "computer-based techniques used in spotting, digging-out, and analyzing 'hard' business data, such as sales revenue by products or departments or associated costs and incomes. Objectives of BI implementations include (1) understanding of a firm's internal and external strengths and weaknesses, (2) understanding of the relationship between different data for better decision making, (3) detection of opportunities for innovation, and (4) cost reduction and optimal deployment of resources." (Business Dictionary). The most widely used BI tool is Microsoft Excel.
-----------------------------
What is Big Data? Big data refers to data scenarios that grow so large (petabytes and more) that they become awkward to work with using traditional database management tools. The challenge stems from data volume + flow velocity + noise-to-signal conversion. Big data is spawning new tools that are a mix of significant processing power, parallelism, and statistical, machine learning, or pattern recognition techniques.
----------------------------

Corporate performance management software and performance management concepts, such as the balanced scorecard, enable organizations to measure business results and track their progress against business goals in order to improve financial performance.
-----------------------------

Data visualization tools, including mashups, executive dashboards, performance scorecards and other data visualization technology, are becoming a major category.
-----------------------------

BI platforms provide a range of capabilities for building analytical applications. Examples are Oracle OBIEE and SAP BusinessObjects 4.0. There are many choices and combinations of BI platforms, capabilities and use cases, as well as many emerging BI technologies such as in-memory analytics, interactive visualization and BI-integrated search. The idea of standardizing on one supplier for all of one's BI capabilities is difficult to realize. Increasingly, standardization is more about managing a portfolio of tools used for a set of capabilities and use cases.
----------------------------
Data integration tools and architectures in support of BI continue to evolve. Extract-Transform-Load (ETL) tools make up a big segment of this category, in addition to data-mapping tools. Organizations must now support a range of delivery styles, latencies, and formats.
----------------------------
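A minimal ETL round trip can be sketched in a dozen lines (Python, with in-memory SQLite standing in for the warehouse; the data is made up): extract raw CSV, transform the types, load into a table and query it.

```python
import csv
import io
import sqlite3

# Extract: read raw CSV (from an in-memory string for this sketch).
raw = "date,amount\n2011-06-01, 120.5\n2011-06-02, 80.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and normalize whitespace.
clean = [(r["date"].strip(), float(r["amount"])) for r in rows]

# Load: insert into the warehouse table (an in-memory SQLite stand-in).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (date TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
# total -> 200.5
```

Production ETL tools wrap the same three steps with scheduling, lineage tracking and error handling, and increasingly must do so at a range of latencies, from nightly batch to near real time.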
BI is about "sense and respond." Analytics is about "anticipate and shape" models.

About

Business Analytics 3.0 blog is meant for decision makers and managers who are trying to make sense of the rapidly changing technology landscape and build next generation solutions. It is aimed at helping business decision makers navigate the "Raw Data -> Aggregate Data -> Intelligence -> Insight -> Decisions" chain.