incubator-general mailing list archives

I would like to call a vote on accepting Howl as an Incubator
project. The proposal is available at
http://wiki.apache.org/incubator/HowlProposal
and the discussion from the proposal thread is at
http://tinyurl.com/5w7y9p9
Alan.
----------------------
Abstract
Howl is a table and storage management service for data created using
Apache Hadoop.
Proposal
The vision of Howl is to provide table management and storage
management layers for Apache Hadoop. This includes:
• Providing a shared schema and data type mechanism.
• Providing a table abstraction so that users need not be concerned
with where or how their data is stored.
• Providing interoperability across data processing tools such as
Pig, Map Reduce, Streaming, and Hive.
Background
Data processors using Apache Hadoop have a common need for table
management services. The goal of a table management service is to
track data that exists in a Hadoop grid and present that data to users
in a tabular format. Such a table management service needs to provide
a single input and output format to users so that individual users
need not be concerned with the storage formats that are chosen for
particular data sets. As part of having a single format, the data will
need to be described by one type of schema and have a single datatype
system.
Additionally, users should be free to choose the best tools for their
use cases. The Hadoop project includes Map Reduce, Streaming, Pig, and
Hive, and additional tools exist such as Cascading. Each of these
tools has users who prefer it, and there are use cases best addressed
by each of these tools. Two users on the same grid who need to share
data should not be constrained to use the same tool but rather should
be free to choose the best tool for their use case. A table management
service that presents data in the same way to all of the tools can
alleviate this problem by providing interfaces to each of the data
processing tools.
There are also a few other features a table management service should
provide, such as notification of when data arrives.
A couple of developers at Yahoo! started the project. It is based on
the Hive MetaStore component. Yahoo!, Facebook, LinkedIn, and others
have expressed a good amount of interest in such a service. We
are therefore proposing to place Howl in the Apache incubator and to
build an open source community around it.
Rationale
There is a strong need for a table management service, especially for
large grids with petabytes of data, and where the data volume is
increasing by the day. Hadoop users need to find data to read and have
a place to store their data. Currently users must understand the
location of data to read, the storage format, compression techniques
used, etc. To write data they need to understand where on HDFS their
data belongs, the best compression format to use, how their data
should be serialized, etc.
Most users do not want to be concerned with these issues. They want
these managed for them.
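As a rough illustration of the kind of lookup such a service would
manage, here is a minimal Python sketch. It is not Howl's API:
TableInfo, TableCatalog, the web_logs table, and its path and columns
are all hypothetical names made up for this example. The point is only
that a user asks for a table by name and the catalog supplies the
location, storage format, compression, and schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical catalog entry; not Howl's actual data model."""
    location: str        # where the data lives on HDFS
    storage_format: str  # e.g. "rcfile" or "text"
    compression: str     # e.g. "gzip"
    schema: tuple        # (column name, type name) pairs


class TableCatalog:
    """Toy in-memory catalog mapping table names to storage details."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        # Record the storage details once, when the table is created.
        self._tables[name] = info

    def describe(self, name):
        # Readers and writers look a table up by name only.
        if name not in self._tables:
            raise KeyError(f"unknown table: {name}")
        return self._tables[name]

    def columns(self, name):
        # Column names in schema order, independent of storage format.
        return [col for col, _type in self.describe(name).schema]


catalog = TableCatalog()
catalog.register(
    "web_logs",
    TableInfo(
        location="/data/web_logs",
        storage_format="rcfile",
        compression="gzip",
        schema=(("user_id", "string"), ("url", "string"), ("ts", "bigint")),
    ),
)
```

A tool built on top of such a catalog would call describe("web_logs")
and pick its reader from the returned storage format, so the user never
has to name the path or the format directly.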
Being an Apache open source project will benefit Howl greatly by
giving it access to the large community that currently uses Hadoop and
the other products built around Hadoop (like Pig, Hive, etc.). Users
of the Hadoop ecosystem can influence Howl’s roadmap and contribute to
it. Conversely, we believe having Howl as part of the Hadoop ecosystem
will be a great benefit to the current Hadoop/Pig/Hive community.
Current Status
Meritocracy
Our intent with this incubator proposal is to start building a diverse
developer community around Howl following the Apache meritocracy
model. We have wanted to make the project open source and encourage
contributors from multiple organizations from the start. We plan to
provide plenty of support to new developers and to quickly recruit
those who make solid contributions to committer status.
Community
Howl is currently being used by developers at Yahoo!, and LinkedIn and
Facebook have expressed interest. Yahoo! also
plans to deploy the current version of Howl in production soon. We
hope to extend the user and developer base further in the future. The
current developers and users are all interested in building a solid
open source community around Howl.
To work towards an open source community, we have started using the
GitHub issue tracker and mailing lists at Yahoo! for development
discussions within our group.
Core Developers
Howl is currently being developed by four engineers from Yahoo! -
Devaraj Das, Ashutosh Chauhan, Sushanth Sowmyan, and Mac Yang. All the
engineers have deep expertise in Hadoop and the Hadoop Ecosystem in
general.
Alignment
The ASF is a natural host for Howl given that it is already the home
of Hadoop, Pig, HBase, Cassandra, and other emerging cloud software
projects. Howl was designed to support Hadoop from the beginning in
order to solve data management challenges in Hadoop clusters. Howl
complements the existing Apache cloud computing projects by providing
a unified way to manage data.
Known Risks
Orphaned Products
The core developers plan to work full time on the project. There is
very little risk of Howl getting orphaned since large companies like
Yahoo! are planning to deploy this in their production Hadoop
clusters. We believe we can build an active developer community around
Howl (companies like Facebook and LinkedIn have also expressed
interest).
Inexperience with Open Source
All of the core developers are active users and followers of open
source. Devaraj Das is an Apache Hadoop committer and Apache Hadoop
PMC member, and has experience with the Apache infrastructure and
development process. Ashutosh Chauhan is an Apache Pig committer and
Apache Pig PMC member. Sushanth Sowmyan and Mac Yang have contributed
to the Apache Hive and Apache Chukwa projects.
Homogeneous Developers
The current core developers are all from Yahoo! However, we hope to
establish a developer community that includes contributors from
several corporations, and we are starting to work towards this with
Facebook and LinkedIn.
Reliance on Salaried Developers
Currently, the developers are paid to do work on Howl. However, once
the project has a community built around it, we expect to get
committers and developers from outside the current core developers.
Companies like Yahoo! are invested in Howl being a solution to the
data management problem in Hadoop clusters, and that is not likely to
change.
Relationships with Other Apache Products
Howl is going to be used by users of Hadoop, Pig, and Hive. See the
Initial Source section below for more information about Howl's
relationship to Hive.
An Excessive Fascination with the Apache Brand
While we respect the reputation of the Apache brand and have no doubts
that it will attract contributors and users, our interest is primarily
to give Howl a solid home as an open source project following an
established development model. We have also given reasons in the
Rationale and Alignment sections.
Documentation
Information about Howl can be found at
http://wiki.apache.org/pig/Howl. The following sources may be useful
to start with:
• The GitHub site: https://github.com/yahoo/howl
• The roadmap: http://wiki.apache.org/pig/HowlJournal
Initial Source
Howl has been under development since summer 2010 by a team of
engineers at Yahoo!. It is currently hosted on GitHub under an Apache
license at https://github.com/yahoo/howl.
The initial development of Howl has consisted of:
• maintaining a branch of the entire Hive codebase
• getting Howl-related patches committed to Hive
• developing Howl-specific plugins and wrappers to customize Hive
behavior
At runtime, Howl executes Hive code for metastore and CLI+DDL,
disabling anything related to Hadoop map/reduce execution. It also
makes use of the RCFile storage format contained in Hive.
This approach was taken as a first step in order to validate the
required functionality and get a production version working. However,
in the long term, maintaining a clone of Hive is undesirable. One
possible resolution is to factor the metastore+CLI+DDL components out
of Hive and move them into Howl (making Hive dependent on Howl).
Another possible resolution is to remove the copy of Hive from Howl
and do the build/release engineering necessary to make Howl depend on
Hive. As part of the incubation process, we plan to work towards
resolution of these issues.
External Dependencies
The dependencies all have Apache-compatible licenses.
Cryptography
Not applicable.
Required Resources
Mailing Lists
• howl-private for private PMC discussions (with moderated
subscriptions)
• howl-dev
• howl-commits
• howl-user
Subversion Directory
https://svn.apache.org/repos/asf/incubator/howl
Issue Tracking
JIRA Howl (HOWL)
Other Resources
The existing code already has unit tests, so we would like a Hudson
instance to run them whenever a new patch is submitted. This can be
added after project creation.
Initial Committers
• Devaraj Das
• Ashutosh Chauhan
• Sushanth Sowmyan
• Mac Yang
• Paul Yang
• Alan Gates
A CLA is already on file for Sushanth.
Affiliations
• Devaraj Das (Yahoo!)
• Ashutosh Chauhan (Yahoo!)
• Sushanth Sowmyan (Yahoo!)
• Mac Yang (Yahoo!)
• Paul Yang (Facebook)
• Alan Gates (Yahoo!)
Sponsors
Champion
Owen O’Malley
Nominated Mentors
• Olga Natkovich (Pig PMC member and Apache VP for Pig)
• Alan Gates (Pig PMC member)
• John Sichi (Hive PMC member)
Sponsoring Entity
We are requesting the Incubator to sponsor this project.