Sunday, January 12, 2014

Big Data is one of the biggest buzzwords around at the moment, and big data will definitely change the world. Big Data refers to data sets that are too large to be processed and analyzed by traditional IT technologies. The Big Data universe is changing right before our eyes and beginning to explode. Big data has the potential to change the way governments, organizations, and academic institutions conduct business and make discoveries, and it is likely to change how everyone lives their day-to-day lives.

In the next five years, humankind will generate more data than it generated in the previous 5,000 years. Records and data exist in electronic digital form, generated by everything from mobile communications and surveillance cameras to emails, web sites and transaction receipts; they can be combined with daily news, social media feeds and videos.

What is big data?

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

According to IBM, 80% of the data captured today is unstructured, and all of this unstructured data is Big Data.

In other words, big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information (VALUE) derivable from analysis of a single large set of related data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

What does Hadoop solve?

Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.

However, since 80% of this data is "unstructured", it must be
formatted (or structured) in a way that makes it suitable for data
mining and subsequent analysis.

Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.

In 2004, Google published a paper on a process called MapReduce. The MapReduce framework provides a parallel processing model and an associated implementation for processing huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was incredibly successful, so others wanted to replicate the algorithm. An implementation of the MapReduce framework was therefore adopted by an Apache open source project named Hadoop. The original paper is "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat.
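To make the Map and Reduce steps concrete, here is a minimal single-process sketch of a word count. It is only an illustration of the programming model, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real cluster would run the map calls on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, the work done between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    intermediate = []
    for doc in documents:
        # On a cluster, each split would be mapped on a different node.
        intermediate.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop processes big data"]))
```

The point of the split is that map_phase looks at one piece of input at a time and reduce_phase looks at one key at a time, so both steps can be spread over many machines.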

Big data spans four dimensions, the 4 Vs that characterize big data:

Volume – the vast amounts of data generated every second. Examples: terabytes, records, transactions, tables and files.

Velocity – the speed at which new data is generated and moves around (credit card fraud detection is a good example, where millions of transactions are checked for unusual patterns in almost real time). Examples: batch, near real time, real time and streams.

Variety – the increasingly different types of data (from financial data to social media feeds, from photos to sensor data, from video capture to voice recordings). Examples: structured, unstructured, semi-structured, and a mix of all three.

Veracity – the messiness of the data (just think of Twitter posts with hash tags, abbreviations, typos and colloquial speech).

How the Big Data Explosion Is Changing the World?

Big data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information. Big data can mean comparing utility costs with meteorological data to spot trends and inefficiencies. It can mean comparing ambulance GPS information with hospital records on patient outcomes to determine the correlation between response time and survival. It can also be the tiny device you wear to track your movement, calories and sleep for your own personal health and fitness.

Our daily lives generate an enormous collection of data. Whether you're surfing the Web, shopping at the store, driving your smart car around town, boarding an airplane, visiting a doctor, or attending class at university, each day you are generating a variety of data. The benefit of the data depends on where you are and whom you're talking to, but a lot of the ultimate potential is in the ability to discover connections and to predict outcomes in a way that wasn't really possible before.

With more data than ever available in digital form, progressively inexpensive data storage, and more advanced computers at the ready to help process and analyze it all, companies believe that big data has the power to drive practical insights that just weren't possible before. It's about managing all that data and providing tools that enable everyone to answer questions, including questions they might not have even known they had.

IBM CEO Ginni Rometty says big data and predictive decisions will reshape organizations, and computers that learn, like Watson, will be tech's next big wave. It's a vision of the future: a hospital uses rapid gene sequencing to stop an outbreak of antibiotic-resistant bacteria, saving lives. A railroad company gets an alert from a train's sensor that a preventative fix is needed, saving the cost and time of removing the train from the tracks later. A university notices that a student's activity level has started to drop to a level consistent with dropouts, and reaches out to assist.

Classic use cases and their implementation in real-time scenarios:

1) Retailers can exploit the data to track sales and consumer behavior, in store and online.

2) Health professionals and epidemiologists trying to predict the spread of disease combine data from health services, border agencies and a variety of other sources.

3) The finance sector seeks to exploit one of the most valuable mother lodes of data through powerful tools that can make sense of patterns in news, trading activities and other more esoteric sources.

4) India's Unique Identification project (the Aadhaar project), spearheaded by Nandan Nilekani, will collect and process billions of data points to provide identification for each resident across the country, and would be used primarily as the basis for efficient delivery of welfare services. It would also act as a tool for effective monitoring of various programs and schemes of the Government.

5) Predicting crime: Chicago is designing a predictive software platform to identify crime patterns. Beyond the public safety uses, the platform could also help officials make better decisions for city services like restaurant inspections, snow plowing or garbage collection.

Data scientists are building specialized systems that can read through billions of bits of data, analyze them via self-learning algorithms and package the insights for immediate use.

In the next few years, millions of big data-related IT jobs will be created worldwide, and there is a major shortage of the “analytical and managerial talent necessary to make the most of big data.” The United States alone faces a shortage of more than 140,000 workers with big data skills, as well as up to 1.5 million managers and analysts needed to analyze and make decisions based on big data findings.

Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

There is a curious story behind the name, and no one knows it better than Doug Cutting, chief architect of Cloudera. When he was creating the open source software that supports the processing of large data sets, Cutting knew the project would need a good name. Fortunately, he had one up his sleeve, thanks to his son. Cutting's son, then 2, was just beginning to talk and called his beloved stuffed yellow elephant "Hadoop" (with the stress on the first syllable). The son, who is now 12, is frustrated with this. He's always saying, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this." :)

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers. Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, and others.

For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces such as Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts.
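As a rough illustration of the Hadoop Streaming idea, the sketch below writes the "map" and "reduce" parts as plain Python functions that consume and produce text lines in a tab-separated key/value format, with the reducer relying on its input being sorted by key. The function names are chosen for this example only; in a real Streaming job each function would sit in its own script, reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local stand-in for the framework: map, sort by key, then reduce.
    mapped = sorted(mapper(["Big data", "big cluster"]))
    for line in reducer(mapped):
        print(line)
```

Because the two functions only exchange lines of text, the same logic could be written in any language that can read and write a stream, which is the reason Streaming is language-neutral.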

HDFS & MapReduce:
There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. Both are open source projects inspired by technologies created inside Google.

Hadoop distributed file system:

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode, and a cluster of datanodes forms the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication. Clients use remote procedure calls (RPC) to communicate with the cluster.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append. HDFS added high-availability capabilities in release 2.x, allowing the main metadata server (the NameNode) to be failed over manually to a backup in the event of failure; automatic fail-over is also available.
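The "two on the same rack, one on a different rack" rule for the default replication value can be sketched as follows. This is a simplified illustration under assumed inputs (a dictionary mapping hypothetical rack names to node names), not the placement policy code that HDFS actually runs.

```python
import random


def place_replicas(nodes_by_rack, rng=random):
    """Pick three distinct nodes for one block: two on one rack, one on another."""
    big_racks = [rack for rack, nodes in nodes_by_rack.items() if len(nodes) >= 2]
    if not big_racks:
        raise ValueError("no rack has two nodes for the same-rack replicas")
    local_rack = rng.choice(big_racks)
    remote_racks = [
        rack for rack, nodes in nodes_by_rack.items()
        if rack != local_rack and nodes
    ]
    if not remote_racks:
        raise ValueError("need a second rack for the off-rack replica")
    replicas = rng.sample(nodes_by_rack[local_rack], 2)
    replicas.append(rng.choice(nodes_by_rack[rng.choice(remote_racks)]))
    return replicas


if __name__ == "__main__":
    topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
    print(place_replicas(topology))
```

Keeping one copy on a second rack means a whole-rack failure (for example a failed switch) still leaves a readable replica, while keeping two copies on one rack limits cross-rack write traffic.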

The HDFS file system includes a so-called secondary namenode, which misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple name-spaces served by separate namenodes.
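The checkpointing role of the secondary namenode can be pictured with a toy model: a namespace image (a dictionary from path to block ids) plus a journal of edits, where a checkpoint folds the journal into the image so that a restart only has to replay edits made since then. The operation names and data shapes below are assumptions made for this sketch, not the real on-disk formats.

```python
def apply_edits(image, edits):
    """Replay journal entries on top of a namespace image; returns a new image."""
    namespace = dict(image)
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]  # args[0] is the list of block ids
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0] is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(image, edit_log):
    """Fold the edit log into the image and start a fresh, empty log."""
    return apply_edits(image, edit_log), []


if __name__ == "__main__":
    image = {"/a": ["blk_1"]}
    log = [("create", "/b", ["blk_2"]), ("rename", "/a", "/c"), ("delete", "/b")]
    image, log = checkpoint(image, log)
    print(image, log)
```

After a checkpoint, restarting a failed namenode means loading the saved image and replaying only the short log written since, instead of the whole history.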

An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems this advantage is not always available. This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.

HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations. Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.
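The node A / node B example of data awareness can be written down as a small scheduling function. This is only a sketch of the idea, using a made-up block_locations dictionary; the real job tracker obtains block locations from the namenode and takes more factors into account.

```python
def schedule_by_locality(block_locations, blocks):
    """Assign each block's task to a node that already stores that block."""
    assignments = {}
    for block in blocks:
        holders = [node for node, stored in block_locations.items() if block in stored]
        if not holders:
            raise LookupError(f"no node holds block {block!r}")
        # Among the nodes holding the block, prefer the one with the fewest tasks so far.
        load = lambda node: sum(1 for n in assignments.values() if n == node)
        assignments[block] = min(holders, key=load)
    return assignments


if __name__ == "__main__":
    locations = {"A": {"x", "y", "z"}, "B": {"a", "b", "c"}}
    print(schedule_by_locality(locations, ["a", "x", "b"]))
```

Every task in this toy schedule reads its input from the local disk, so no block has to cross the network before the map work starts.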

File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.

JobTracker and TaskTracker: the MapReduce engine:

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few seconds to check its status. The JobTracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.
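The placement preference described here (the node with the data first, then a node in the same rack, then anywhere) can be sketched as a small function. The names pick_tracker and rack_of, and the locality labels, are assumptions for this illustration rather than JobTracker internals.

```python
def pick_tracker(data_nodes, free_trackers, rack_of):
    """Return (tracker, locality), preferring node-local, then rack-local, then any."""
    for tracker in free_trackers:
        if tracker in data_nodes:
            return tracker, "node-local"
    data_racks = {rack_of[node] for node in data_nodes}
    for tracker in free_trackers:
        if rack_of[tracker] in data_racks:
            return tracker, "rack-local"
    if free_trackers:
        return free_trackers[0], "off-rack"
    return None, "none"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_tracker({"n1"}, ["n3", "n2"], racks))
```

Only the last tier sends block data across the backbone between racks, which is why it is tried last.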

The Hadoop 1.x MapReduce system is composed of the JobTracker, which is the master, and the per-node slaves, the TaskTrackers.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Known limitations of this approach in Hadoop 1.x are:

The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.

If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
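The effect of one slow TaskTracker, and of speculative execution, can be shown with a little arithmetic. The sketch below is a deliberately simplified model with made-up task names and times: every time is a finish time measured from the start of the job, and a speculated task counts as done when its first attempt finishes.

```python
def job_completion_time(task_finish_times):
    """A job is finished only when its slowest task has finished."""
    return max(task_finish_times.values())


def with_speculation(task_finish_times, backup_finish_times):
    """Each task finishes when the earlier of its original or backup attempt finishes."""
    return {
        task: min(finish, backup_finish_times.get(task, finish))
        for task, finish in task_finish_times.items()
    }


if __name__ == "__main__":
    times = {"map-1": 10, "map-2": 12, "map-3": 95}  # map-3 runs on a slow node
    print(job_completion_time(times))
    print(job_completion_time(with_speculation(times, {"map-3": 20})))
```

In this model the job time drops from the straggler's finish time to the backup attempt's finish time, at the cost of running the same task twice.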

Apache Hadoop NextGen MapReduce (YARN):

MapReduce underwent a complete overhaul in hadoop-0.23, and the result is what is called MapReduce 2.0 (MRv2) or YARN.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software
Foundation introduced in Hadoop 2.0 that separates the resource
management and processing components. YARN was born of a need to enable a
broader array of interaction patterns for data stored in HDFS beyond
MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more
general processing platform that is not constrained to MapReduce.

Architectural view of YARN

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
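The division of labour between the three roles can be sketched with a toy model: a ResourceManager that hands out containers from the free memory reported for each NodeManager, and an ApplicationMaster that asks for one container per task. The class and method names mirror the roles in the text but are otherwise invented for this sketch (first-fit by memory only); they are not the YARN API.

```python
class NodeManager:
    """Per-node agent; here it only tracks how much memory is still free."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Toy global scheduler: grants containers first-fit from free node memory."""

    def __init__(self, node_managers):
        self.nodes = list(node_managers)

    def allocate(self, app_id, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"app": app_id, "node": node.name, "memory_mb": memory_mb}
        return None  # no capacity right now


class ApplicationMaster:
    """Per-application coordinator that negotiates containers for its tasks."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def run(self, task_memory_requests):
        granted, pending = [], []
        for memory_mb in task_memory_requests:
            container = self.rm.allocate(self.app_id, memory_mb)
            if container:
                granted.append(container)
            else:
                pending.append(memory_mb)
        return granted, pending


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
    print(ApplicationMaster("app-1", rm).run([2048, 2048, 2048, 1024]))
```

Note that the ResourceManager in this sketch knows nothing about map or reduce tasks; it only arbitrates resources, which is what lets other kinds of applications share the same cluster.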

Overview of Hadoop 1.0 and Hadoop 2.0

As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management layer. Many organizations are already building applications on YARN in order to bring them into Hadoop.

A next-generation framework for Hadoop data processing

When enterprise data is made available in HDFS, it is important to have multiple ways to process that data. With Hadoop 2.0 and YARN, organizations can use Hadoop for streaming, interactive and a world of other Hadoop-based applications.

What YARN Does

YARN enhances the power of a Hadoop compute cluster in the following ways:

Scalability. The processing power in data centers continues to grow quickly. Because the YARN ResourceManager focuses exclusively on scheduling, it can manage those larger clusters much more easily.

Compatibility with MapReduce. Existing MapReduce applications and users can run on top of YARN without disruption to their existing processes.

Improved cluster utilization. The ResourceManager is a pure scheduler that optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and SLAs. Also, unlike before, there are no named map and reduce slots, which helps to better utilize cluster resources.

Support for workloads other than MapReduce. Additional programming models such as graph processing and iterative modeling are now possible for data processing. These added models allow enterprises to realize near real-time processing and increased ROI on their Hadoop investments.

Agility. With MapReduce becoming a user-land library, it can evolve independently of the underlying resource manager layer and in a much more agile manner.

How YARN Works

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:

a global ResourceManager

a per-application ApplicationMaster

a per-node slave NodeManager

a per-application Container running on a NodeManager

The ResourceManager and the NodeManager form the new, and generic, system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to the various running applications, according to constraints such as queue capacities, user limits, etc. The scheduler performs its scheduling function based on the resource requirements of the applications.

The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.

Each ApplicationMaster has the responsibility of negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system
perspective, the ApplicationMaster runs as a normal container.

It is believed that artificial intelligence will take a long time to function like the human brain, but it is not clear how long we will need to wait for this revolution. There is no doubt that computers have brought a revolution in human life. Nowadays computers are taking over many human activities; they can "think" and have problem-solving capabilities. These factors lead some to believe that computers are likely to replace human beings in the future.

IBM announced a major new initiative aimed at accelerating progress in the era of cognitive computing. Big Blue is using the human brain as a template for breakthrough designs. Imagine a supercomputer that is cooled and powered by "electronic blood", with a natural healing system and thousands of I/O activities :)

In the era of real-time Big Data, the “old number crunching” computers are not sufficient anymore. New computers are required that can interact with us the way we want to interact with each other, and that can visualize data the way we humans interact with the world.

Although the computers to date are capable of handling vast amounts of data, they still do that with separated memory and processing, performing all the steps in sequential order. IBM is developing a new type of computer, the cognitive computer, that can be trained with artificial intelligence and machine-learning algorithms to become more like humans and deal with data the way humans do.

Cognitive computing will bring a level of fluidity and appropriateness to the way we interact with computers. The idea of the new cognitive computer that IBM is developing is to facilitate human cognition beyond the current barriers created by the ever-increasing volumes of data.

His name is Watson. He's bad with puns. Great at math. And he won the game show "Jeopardy!" against real, live, breathing, thinking humans (Brad Rutter and Ken Jennings, two of Jeopardy's champions). The top prize for the Watson showdown was $1 million, with $300,000 for second place and $200,000 for third. Jennings and Rutter planned to donate half their winnings to charity. Watson won the $1 million, and all of its winnings were donated to charity.

Those game shows were reminiscent of IBM's "Deep Blue," a chess-playing computer that, in 1996 and 1997, was pitted against world champion Garry Kasparov. Kasparov beat the first version of Deep Blue in 1996, but was defeated by a revamped program in 1997, with Deep Blue scoring two wins, one loss and three draws in a six-game match. Deep Blue relied heavily on mathematical calculations, while Watson has to interpret human language, a far more difficult task. Its combination of machine-learning strategies and an ability to process natural language, or ordinary speech, allowed it to defeat the human contestants on the TV quiz show.

Watson's software runs on IBM Power 750 servers (a cluster with 2,880 POWER7 processor cores and 15 terabytes of memory) and, according to developers, is optimized to process complex questions and render answers quickly. The question now: can it defeat the complexities of the real world? IBM is confident that it can. It will also combine Watson with other “cognitive computing” technologies and invest a further $1 billion into a business it says will define the future of how companies use data.

The IBM Watson Group is to be headquartered in New York City's Silicon Alley. The organization is unique within IBM, integrating research, software, systems design, services and industry expertise. IBM expects this to revolutionize everything from cancer care to call centers. Among IBM's biggest plans for Watson has been creating a system that can read medical records and recommend treatments, particularly for cancer patients. Watson is still a medical student, or about to complete its internship :). Watson is going to work with doctors, helping oncologists treat patients.

Only 20 percent of the knowledge physicians use
to make diagnosis and treatment decisions today is evidence based. The
result? One in five diagnoses are incorrect or incomplete and nearly 1.5
million medication errors are made in the US every year. Given the growing complexity of medical decision making, how can health care providers address these problems?

The information medical professionals need to support improved decision
making is available. Medical journals publish new treatments and
discoveries every day. Patient histories give clues. Vast amounts of
electronic medical record data provide deep wells of knowledge. Some
would argue that in this information is the insight needed to avoid
every improper diagnosis or erroneous treatment.
In fact, the amount of medical information available is doubling every
five years and much of this data is unstructured - often in natural
language. And physicians simply don't have time to read every journal
that can help them keep up to date with the latest advances - 81 percent
report that they spend five hours per month or less reading journals.
Computers should be able to help, but the limitations of current systems
have prevented real advances. Natural language is complex. It is often
implicit: the exact meaning is not completely and exactly stated. In
human language, meaning is highly dependent on what has been said
before, the topic itself, and how it is being discussed: factually,
figuratively or fictionally - or a combination.

What Watson can do, given the right data, is pull up relevant literature and consistently recommend the same course of treatment that is suggested in the written medical guidelines that doctors consult. Following guidelines is also something that less sophisticated software can do, so duplicating a guideline recommendation is the easy part. The goal is a machine that doctors can turn to as an adviser and colleague. That system will be able to make recommendations for treating several cancers based on manually organized inputs (structured data) and will also interpret text notes for two cancers, lung and breast, with reasonable accuracy. This is the right time to move forward with a bigger investment.

How can Watson address healthcare challenges?

Watson uses natural language capabilities, hypothesis generation, and evidence-based learning to support medical professionals as they make decisions. For example, a physician can use Watson to assist in diagnosing and treating patients. First the physician might pose a query to the system, describing symptoms and other related factors. Watson begins by parsing the input to identify the key pieces of information. The system supports medical terminology by design, extending Watson's natural language processing capabilities.

Watson then mines the patient data to find relevant facts about family history, current medications and other existing conditions. It combines this information with current findings from tests and instruments and then examines all available data sources to form hypotheses and test them. Watson can incorporate treatment guidelines, electronic medical record data, doctor's and nurse's notes, research, clinical studies, journal articles, and patient information into the data available for analysis. Watson will then provide a list of potential diagnoses along with a score that indicates the level of confidence for each hypothesis.

The ability to take context into account during the hypothesis generation and scoring phases of the processing pipeline allows Watson to address these complex problems, helping the doctor and the patient make more informed and accurate decisions.

Preparing Watson for Moon Shots:

The University of Texas MD Anderson Cancer Center in Houston ranks as one of the world's most respected centers focused on cancer patient care, research, education and prevention. MD Anderson's Moon Shots Program is an unprecedented and highly concentrated assault against cancer.

IBM’s Watson technology is expected to play a key role within APOLLO, a technology driven “adaptive learning environment” that MD Anderson is developing as part of its Moon Shots program. APOLLO enables iterative and continued learning between clinical care and research by creating an environment that streamlines and standardizes the longitudinal collection, ingestion and integration of patient’s medical and clinical history, laboratory data as well as research data into MD Anderson’s centralized patient data warehouse. Once aggregated, this complex data is linked and made available for deep analyses by advanced analytics to extract novel insights that can lead to improved effectiveness of care and better patient outcomes.

One of the richest sources of valuable clinical insight trapped within this patient data is the unstructured medical and research notes, and test results, for each cancer patient. Watson's cognitive capability has been shown to be a powerful tool to extract valuable insight from such complex data, and MD Anderson's Oncology Expert Advisor capability can generate a more comprehensive profile of each cancer patient. This will help physicians better understand the patient's data in the evaluation of a patient's condition.

By identifying and weighing data-driven connections between the attributes in a patient's profile and the knowledge corpus of published medical literature and guidelines in Watson, MD Anderson's Oncology Expert Advisor can provide evidence-based treatment and management options that are personalized to that patient, to aid the physician's treatment and care decisions. These options can include not only standard approved therapies, but also appropriate investigational protocols.

“One unique aspect of the MD Anderson Oncology Expert Advisor is that it will not solely rely on established cancer care pathways to recommend appropriate treatment options,” explained Lynda Chin, M.D., professor and chair of Genomic Medicine and scientific director of the Institute for Applied Cancer Science at MD Anderson. “The system was built with the understanding that what we know today will not be enough for many patients. Therefore, our cancer patients will be automatically matched to appropriate clinical trials by the Oncology Expert Advisor. Based on evidence as well as experiences, our physicians can offer our patients a better chance to battle their cancers by participating in clinical trials on novel therapies.”

The MD Anderson Oncology Expert Advisor is expected to help physicians improve the future care of cancer patients by enabling comparison of patients based on a new range of data-driven attributes, previously unavailable for analysis. For example, MD Anderson’s clinical care and research teams can compare groups of patients to identify those patients who responded differently to therapies and discover attributes that may account for their differences. This analysis will then inform the generation of testable hypotheses to help researchers and clinicians to advance cancer care continually.

Finally, why did they name it Watson?

It is named after Thomas J. Watson, IBM's first CEO, who is often described as the company's founder.

The possibilities that Watson's breakthrough computing capabilities
hold for building a smarter planet and helping people in their business
tasks and personal lives have stunned everyone!

Sunday, January 5, 2014

Apache Hadoop 2/YARN/MR2 Multi-node Cluster Installation for Beginners:
In this blog, I will describe the steps for setting up a distributed,
multi-node Hadoop cluster running on Red Hat Linux/CentOS Linux distributions. By now we are comfortable with installing and running MapReduce applications on a single node in pseudo-distributed mode (see the earlier post for details on single-node installation). Let us move one step forward and deploy a multi-node cluster.

Hadoop Cluster: A Hadoop cluster is designed for distributed processing of large data sets across a group of commodity machines (low-cost servers). The data can be unstructured, semi-structured or structured. The cluster is
designed to scale up to thousands of machines with a high degree of
fault tolerance, and the software has the intelligence to detect and handle
failures at the application layer.

There are 3 types of machines in a Hadoop cluster environment, based on their specific roles:

1] Client machines :
- Load the data (input files) into the cluster
- Submit jobs (in our case, a MapReduce job)
- Collect the results and view the analytics

2] Master nodes :
- The NameNode coordinates the data storage function (HDFS) and keeps the metadata.
- The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application.

3] Slave nodes : The major part of the cluster consists of slave nodes, which perform the computation. The NodeManager manages each node within a YARN cluster. It provides per-node services within the cluster, from managing a container over its life cycle to monitoring resources and tracking the health of its node.

A container represents an allocated resource in the cluster. The ResourceManager is the sole authority for allocating containers to applications. An allocated container is always on a single node, has a unique containerID, and has a specific amount of resource allocated to it. Typically, an ApplicationMaster receives the container from the ResourceManager during resource negotiation and then talks to the NodeManager to start or stop the container.

A Resource models a set of computer resources. Currently it models only memory (other resources such as CPU may be added in the future).
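The container rules described above can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions, not YARN's actual code or API: the class names mirror the roles in the text, but the first-fit placement policy, the simplified container ID format and the memory sizes are all made up for illustration. The hostnames node1 and node2 are the ones used later in this post.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Models a set of resources; only memory, as described in the text."""
    memory_mb: int


@dataclass(frozen=True)
class Container:
    container_id: str   # unique within this toy cluster (simplified format)
    node: str           # a container always lives on exactly one node
    resource: Resource  # the specific amount of resource allocated


class ResourceManager:
    """Toy scheduler: the only component that hands out containers."""

    def __init__(self, capacity_mb: Dict[str, int]) -> None:
        self._free = dict(capacity_mb)  # free memory per node
        self._ids = count(1)

    def allocate(self, request: Resource) -> Optional[Container]:
        # Assumption: first-fit placement. Real YARN schedulers are pluggable
        # and considerably more sophisticated.
        for node, free in self._free.items():
            if free >= request.memory_mb:
                self._free[node] = free - request.memory_mb
                return Container(f"container_{next(self._ids):04d}", node, request)
        return None  # no single node can satisfy the request

    def release(self, container: Container) -> None:
        self._free[container.node] += container.resource.memory_mb


class NodeManager:
    """Toy per-node agent that starts and stops containers on its own node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running: Dict[str, Container] = {}

    def start(self, container: Container) -> None:
        if container.node != self.name:
            raise ValueError("container was allocated on a different node")
        self.running[container.container_id] = container

    def stop(self, container_id: str) -> Container:
        return self.running.pop(container_id)


if __name__ == "__main__":
    # Hypothetical memory sizes for the two nodes.
    rm = ResourceManager({"node1": 2048, "node2": 1024})
    node_managers = {name: NodeManager(name) for name in ("node1", "node2")}

    # An ApplicationMaster would negotiate with the ResourceManager first,
    # then ask the NodeManager on the chosen node to start the container.
    granted = rm.allocate(Resource(memory_mb=1536))
    if granted is not None:
        node_managers[granted.node].start(granted)
        print(granted)
```

Keeping the free-memory bookkeeping inside the toy ResourceManager reflects the point made above that it is the sole authority for allocation; the NodeManager only starts or stops what has already been granted.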

Step 1: First, establish a network between the master node and the slave node.
Assign an IP address to the eth0 interface of node1 and node2, and add those IP addresses and hostnames to the /etc/hosts file as shown here.