Apache Hadoop is one of the most popular open-source tools used to harness clusters of computers to process, analyze or learn from massive amounts of data. Whether you are new to Hadoop or an experienced user, this is a great opportunity to improve your knowledge and network with others in the Baltimore computing technology community.

The first meeting will be held from 7:00pm to 9:30pm on Thursday, 19 February 2015 at AOL/Advertising.com at 1020 Hull St #100, Baltimore, MD (map). Join the group here.

UMBC's Center for Hybrid Multicore Productivity Research, an NSF Industry & University Cooperative Research Center, is holding its Industry Advisory Board meeting at UMBC 12-14 June. Students from UMBC and UCSD will present tutorials on a number of the technologies underlying ongoing CHMPR projects in a session from 1:00-5:00 on Wednesday June 12 in ITE 456. The tutorial session is free and open to the public.

China’s Tianhe-1A is being recognized as the world’s fastest supercomputer. It has 7168 NVIDIA Tesla GPUs and achieved a Linpack score of 2.507 petaflops, a 40% speedup over Oak Ridge National Lab’s Jaguar, the previous top machine. Today’s WSJ has an article:

“Supercomputers are massive machines that help tackle the toughest scientific problems, including simulating commercial products like new drugs as well as defense-related applications such as weapons design and breaking codes. The field has long been led by U.S. technology companies and national laboratories, which operate systems that have consistently topped lists of the fastest machines in the world.

But Nvidia says the new system in Tianjin—which is being formally announced Thursday at an event in China—was able to reach 2.5 petaflops. That is a measure of calculating speed ordinarily translated into a thousand trillion operations per second. It is more than 40% higher than the mark set last June by a system called Jaguar at Oak Ridge National Laboratory that previously stood at No. 1 on a twice-yearly ranking of the 500 fastest supercomputers.”

The NYT and HPCwire also have good overview articles. The HPCwire article points out that the Tianhe-1A has a relatively low Linpack efficiency compared to the Jaguar.

“Although the Linpack performance is a stunning 2.5 petaflops, the system left a lot of potential FLOPS in the machine. Its peak performance is 4.7 petaflops, yielding a Linpack efficiency of just over 50 percent. To date, this is a rather typical Linpack yield for GPGPU-accelerated supers. Because the GPUs are stuck on the relatively slow PCIe bus, the overhead of sending calculations to the graphics processors chews up quite a few cycles on both the CPUs and GPUs. By contrast, the CPU-only Jaguar has a Linpack/peak efficiency of 75 percent. Even so, Tianhe-1A draws just 4 megawatts of power, while Jaguar uses nearly 7 megawatts and yields 30 percent less Linpack.”
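To make the efficiency arithmetic concrete, here is a small Python sketch that uses only the figures quoted above. Jaguar's Linpack score is not stated directly in these excerpts, so the code backs it out from the "40% speedup" figure. That derived value, the rounded power numbers and the function names are assumptions for illustration, not official TOP500 data.

```python
# Figures quoted in the articles above (petaflops and megawatts).
TIANHE_RMAX_PF = 2.507   # measured Linpack score
TIANHE_RPEAK_PF = 4.7    # theoretical peak
TIANHE_POWER_MW = 4.0    # "just 4 megawatts"

# Assumption: Jaguar's Linpack score is derived from the "40% speedup"
# figure rather than quoted directly, so it is only approximate.
JAGUAR_RMAX_PF = TIANHE_RMAX_PF / 1.40
JAGUAR_POWER_MW = 7.0    # "nearly 7 megawatts"


def linpack_efficiency(rmax_pf, rpeak_pf):
    """Fraction of theoretical peak actually achieved on Linpack."""
    return rmax_pf / rpeak_pf


def petaflops_per_megawatt(rmax_pf, power_mw):
    """Linpack performance delivered per megawatt of power drawn."""
    return rmax_pf / power_mw


if __name__ == "__main__":
    eff = linpack_efficiency(TIANHE_RMAX_PF, TIANHE_RPEAK_PF)
    print(f"Tianhe-1A Linpack efficiency: {eff:.1%}")
    print(f"Tianhe-1A PF/MW: {petaflops_per_megawatt(TIANHE_RMAX_PF, TIANHE_POWER_MW):.3f}")
    print(f"Jaguar PF/MW (approx.): {petaflops_per_megawatt(JAGUAR_RMAX_PF, JAGUAR_POWER_MW):.3f}")
```

On these numbers the GPU-accelerated machine wastes more of its peak but still delivers more than twice the Linpack performance per megawatt, which is the trade-off the HPCwire excerpt describes.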

The (unofficial) “official” list of the fastest supercomputers is TOP500, which seems to be inaccessible at the moment, due no doubt to the heavy load caused by the news stories above. The TOP500 list is due for a refresh next month.

Hadoop has become one of the most popular frameworks to exploit parallelism on a computing cluster. You don’t actually need access to a cluster to try Hadoop, learn how to use it, and develop code to solve your own problems.

UMBC Ph.D. student Vlad Korolev has written an excellent tutorial, Hadoop on Windows with Eclipse, showing how to install and use Hadoop on a single computer running Microsoft Windows. It also covers the Eclipse Hadoop plugin, which enables you to create and run Hadoop projects from Eclipse. In addition to step-by-step instructions, the tutorial has short videos documenting the process.

If you want to explore Hadoop and are comfortable developing Java programs in Eclipse on a Windows box, this tutorial will get you going. Once you have mastered Hadoop and have developed your first project using it, you can go about finding a cluster to run it on.

We are early in the era of big data (including social and/or semantic) and more and more of us need the tools to handle it. Monday’s NYT had a story, Hadoop, a Free Software Program, Finds Uses Beyond Search, on Hadoop and Cloudera, a new startup that is offering its own Hadoop distribution designed to be easier to install and configure.

“In the span of just a couple of years, Hadoop, a free software program named after a toy elephant, has taken over some of the world’s biggest Web sites. It controls the top search engines and determines the ads displayed next to the results. It decides what people see on Yahoo’s homepage and finds long-lost friends on Facebook.”
…
Three top engineers from Google, Yahoo and Facebook, along with a former executive from Oracle, are betting it will. They announced a start-up Monday called Cloudera, based in Burlingame, Calif., that will try to bring Hadoop’s capabilities to industries as far afield as genomics, retailing and finance. The company has just released its own version of Hadoop. The software remains free, but Cloudera hopes to make money selling support and consulting services for the software. It has only a few customers, but it wants to attract biotech, oil and gas, retail and insurance customers to the idea of making more out of their information for less.

Cloudera’s distribution, currently based on Hadoop v0.18.3, uses RPM and comes with a Web-based configuration aid. The company also offers some free basic training in MapReduce concepts, using Hadoop, developing appropriate algorithms and using Hive.

Disco is a Python-friendly, open-source Map-Reduce framework for distributed computing with the slogan “massive data – minimal code”. Disco’s core is written in Erlang, a functional language designed for concurrent programming, and users typically write Disco map and reduce jobs in Python. So what’s wrong with using Hadoop? Nothing, according to the Disco site, but…

“We see that platforms for distributed computing will be of such high importance in the future that it is crucial to have a wide variety of different approaches which produces healthy competition and co-evolution between the projects. In this respect, Hadoop and Disco can be seen as complementary projects, similar to Apache, Lighttpd and Nginx.

It is a matter of taste whether Erlang and Python are more suitable for the task than Java. We feel much more productive with Python than with Java. We also feel that Erlang is a perfect match for the Disco core that needs to handle tens of thousands of tasks in parallel.

Thanks to Erlang, the Disco core remarkably compact, currently less than 2000 lines of code. It is relatively easy to understand how the core works, and start experimenting with it or adapt it to new environments. Thanks to Python, it is easy to add new features around the core which ensures that Disco can respond quickly to real-world needs.”

The Disco tutorial uses the standard word counting task to show how to set up and use Disco on both a local cluster and Amazon EC2. There is also homedisco, which lets programmers develop, debug, profile and test Disco functions on one local machine before running on a cluster. The word counting example from the tutorial is certainly nicely compact:
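The tutorial's own listing is not reproduced here. As a stand-in, the following is a minimal sketch of the same map/reduce word-counting idea in plain Python. It deliberately does not use the Disco API, and the function names (map_words, reduce_counts, word_count) are hypothetical names chosen for illustration. It runs on one machine and only shows the shape of the two steps that a framework like Disco or Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line of text."""
    for word in line.split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["massive data minimal code", "minimal code"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In a real framework the map calls would run in parallel on different chunks of the input, and the pairs would be grouped by key before being handed to the reducers.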

There’s a very interesting late addition to UMBC’s spring schedule — CMSC 491/691A, a special topics class on parallel programming. Programming multi-core and cell-based processors is likely to be an important skill in the coming years, especially for systems that require high performance such as those involving scientific computing, graphics and interactive games.

The class will meet Tu/Thr from 7:00pm to 8:15pm in the “Game Lab” in ECS 005A and will be taught by research professors John Dorband and Shujia Zhou. Both are very experienced in high-performance and parallel programming. Professor Dorband helped to design and build the first Beowulf cluster computer in the mid 1990s when he worked at NASA’s Goddard Space Flight Center. Shujia Zhou has worked at Northrop Grumman and NASA/Goddard on a wide range of projects using high-performance and parallel computing for climate modeling and simulation.

CMSC 491/691a Special Topics in Computer Science:
Introduction to parallel computing emphasizing the
use of the IBM Cell B.E.

There will be a free CloudCamp ‘unconference’ in Chantilly VA (outside DC) from 3pm to 9pm on Wednesday 12 November.

“CloudCamp is an unconference where early adopters of Cloud Computing technologies exchange ideas. With the rapid change occurring in the industry, we need a place we can meet to share our experiences, challenges and solutions. At CloudCamp, you are encouraged to share your thoughts in several open discussions, as we strive for the advancement of Cloud Computing. End users, IT professionals and vendors are all encouraged to participate.”

The UMBC Multicore Computation Center is hosting a free workshop on Frontiers of Multicore Computing 26-28 August 2008 at UMBC. The workshop will feature leading computational researchers who will share their current experiences with multicore applications. A number of computer architects and major vendors have also been invited to describe their road maps to near and long-term future system developments. The FMC workshop will focus on applications in the fields of geosciences, aerospace, defense, interactive digital media and bioinformatics. The workshop has no registration fees but you must register to attend. More information regarding hotel accommodations, tutorials, exhibits and access to the campus can also be found at the website.

Members of the UMBC ebiquity lab will make presentations on our current and planned use of multicore and cloud computing for research in exploiting Wikipedia as a knowledge base and also in extracting communities from very large social network graphs.

My colleague Marc Olano recently blogged about the new Larrabee chip from Intel, which will be described in a SIGGRAPH paper in a session he is chairing. This chip, with multiple old Pentium-type cores running at 1GHz, seems a logical culmination of the recent multi/many core trend. IBM’s plans with the Cell/BE, and perhaps with the newer generation Power chips, are also headed in a similar direction. Short of material scientists doing some magic with high K dielectrics or airgaps or CNFETs or whatever, the trend seems to be away from a single CPU with more transistors running faster and faster to multicored chips not clocked very fast. There’s a good reason for it (heat), as anyone who’s had a high-end laptop and actually put it on their lap can testify. Further down the road, even more complex parallel architectures are proposed, with MCMs on chip connecting optically, and perhaps even memory stacked on top of the CPU layer talking optically back and forth! In other words, a few years down the road, the default box on which a system builder will write code will be something other than a single-core CPU. Bernie Meyerson from IBM discusses such issues in his talks; I can’t lay my hands on a publicly available PowerPoint, but some of the ideas are discussed in a recent interview.

Do these developments mean that we should be rethinking Programming 1 and 2, especially for CS majors? Do students now need to think about parallel or multi-threaded programming from day one? Can that be done without first doing standard imperative programming? Given the less than ideal state of high school CS education, is it realistic to expect that students will get Programming 1 (and maybe 2) in high school? In our department, we’re offering a class on programming the Cell/BE, and a course related to GPU programming, but those are typically meant for seniors. How about courses further upstream? Should data structures and algorithms change? Maybe concepts like transactional memory need to be introduced. Should OS courses change to talk much more about virtualization, and about redoing virtual memory when ample NVRAM is available and accessible from a core?

Cloud computing is a hot topic this year, with IBM, Microsoft, Google, Yahoo, Intel, HP and Amazon all offering, using or developing high-end computing services typically described as “cloud computing”. We’ve started using it in our lab, like many research groups, via the Hadoop software framework and Amazon’s Elastic Compute Cloud services.

Bill Poser notes in a post (Trademark Insanity) on Language Log that Dell has applied for a trademark on the term “cloud computing”.

It’s bad enough that we have to deal with struggles over the use of trademarks that have become generic terms, like “Xerox” and “Coke”, and trademarks that were already generic terms among specialists, such as “Windows”, but a new low in trademarking has been reached by the joint efforts of Dell and the US Patent and Trademark Office. Cyndy Aleo-Carreira reports that Dell has applied for a trademark on the term “cloud computing”. The opposition period has already passed and a notice of allowance has been issued. That means that it is very likely that the application will soon receive final approval.

It’s clear, at least to me, that ‘cloud computing’ has become a generic term in general use for “data centers and mega-scale computing environments” that make it easy to dynamically focus a large number of computers on a computing task. It would be a shame to have one company claim it as a trademark. On Wikipedia a redirect for the Cloud Computing page was created several weeks before Dell’s USPTO application. A Google search produces many uses of cloud computing in news articles before 2007, although it’s clear that its use didn’t take off until mid 2007.

An examination of a Google Trends map shows that searches for ‘cloud computing’ (blue) began in September 2007 and have increased steadily, eclipsing searches for related terms like Hadoop, ‘map reduce’ and EC2 over the past ten months.

“… must file a ‘Statement of Use’ or ‘Extension Request’ within 6 months (by January 8, 2009) in order to proceed to registration, and thereafter must enforce the trademark to prevent removal for ‘non-use’. This may be used to prevent other vendors (eg Google, HP, IBM, Intel, Yahoo) from offering certain products and services relating to data centers and mega-scale computing environments under the cloud computing moniker.”

Sandia and Oak Ridge national laboratories have established the Institute for Advanced Architectures to work toward computers that are a million times faster than today’s supercomputers.

“An exaflop is a thousand times faster than a petaflop, itself a thousand times faster than a teraflop. Teraflop computers (the first was developed 10 years ago at Sandia) currently are the state of the art. They do trillions of calculations a second. Exaflop computers would perform a million trillion calculations per second.” (link)
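The scale relationships in that quote are easy to check with a few lines of Python. This is only an illustrative sketch, and the constant and function names are my own.

```python
# Floating-point operations per second at each scale.
TERAFLOP = 10**12   # a trillion calculations per second
PETAFLOP = 10**15   # a thousand teraflops
EXAFLOP = 10**18    # a thousand petaflops


def speedup(target_flops, baseline_flops):
    """How many times faster the target rate is than the baseline rate."""
    return target_flops // baseline_flops


if __name__ == "__main__":
    print("exaflop vs petaflop:", speedup(EXAFLOP, PETAFLOP))
    print("petaflop vs teraflop:", speedup(PETAFLOP, TERAFLOP))
    print("exaflop vs teraflop:", speedup(EXAFLOP, TERAFLOP))
```

The last ratio is the "million times faster" goal: an exaflop machine performs a million trillion operations per second, against the trillions per second of a teraflop machine.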

Initial funding of $7.4M is provided by congressional mandate from the National Nuclear Security Administration and the Department of Energy’s Office of Science.