Archive for May 2010

Basically, the GNU Go app is GPL-licensed, which is not compatible with the terms of service of Apple’s App Store. This is the wrap-up from the FSF’s post:

That’s the problem in a nutshell: Apple’s Terms of Service impose restrictive limits on use and distribution for any software distributed through the App Store, and the GPL doesn’t allow that. This specific case involves other issues, but this is the one that’s most unique and deserves explanation.

We would’ve liked to see Apple do the right thing and remove these limits, but it looks like that’s not going to happen. Apple has removed GNU Go from the App Store, continuing their longstanding habit of preventing users from doing anything that Apple doesn’t want them to do. As we said in our initial announcement, this is disappointing but unsurprising; Apple made this choice a long time ago. We just need to make sure everybody else gets the message: if you value your independence and creativity, you should be aware that Apple doesn’t. Take your computing elsewhere.

I am a firm believer in FOSS, but this is nonsense from the FSF. This position is micro-focused and blind to the larger picture. The App Store provides both free (zero-cost) and paid downloads. App developers are able to provide open source or proprietary apps. Developers have the freedom to choose. What the FSF wants Apple to do is remove the mechanisms in place that protect the distribution of proprietary apps. My guess is that proprietary apps account for over 98% of all iPhone and iPad apps. The FSF wants Apple to abandon those developers (>98%) in favor of the developers who believe in the Free Software philosophy (<2%).

The FSF’s position is that, by rights, software is fundamentally free and that intellectual property is bad. The FSF doesn’t like the fact that people create proprietary software. They don’t even like shareware. They think anyone creating software should give it away for free and provide the source code too. The FSF values your creativity – they value it so much they think you don’t have the right to own your creation.

Last week, on the same day, both Pentaho and IBM made announcements about Hadoop support. There are several interesting things about this:

IBM’s announcement is a validation of Hadoop’s functionality, scalability and maturity. Good news.

Hadoop, being Java, will run on AIX and on IBM hardware. In fact, Hadoop hurts the big-iron vendors, and it also, to some extent, competes with IBM’s existing database offerings. But the announcement was made by IBM’s professional services group, not by their hardware or AIX groups. For IBM this is a services play.

IBM announced their own distro of Hadoop. This requires a significant development, packaging, testing, and support investment for IBM. They are going ‘all in’, to use a poker term. The exact motivation behind this has yet to be revealed. They are offering their own tools and extensions to Hadoop, which is fair enough, but that is possible without providing a full distro. Only time will tell how they maintain their internal fork or branch of Hadoop, and whether any generic code contributions make it out of Big Blue into the Hadoop projects.

IBM is making a play for Big Data, which, in conjunction with their cloud/grid initiatives, makes perfect sense. When it comes to cloud computing, the cost of renting hardware is gradually converging with the price of electricity. But with the rise of the cloud, an existing problem is compounded. Web-based applications generate a wealth of event-based data. This data is hard enough to analyze when you have it on-premises, and it quickly eclipses the size of the transactional data. When this data is generated in a cloud environment, the problem is worse: you don’t even have the data locally, and moving it will cost you. IBM is attempting a land-grab: cloud + Hadoop + IBM services (with or without IBM hardware, OS, and databases). They recognize that running apps in the cloud and storing data in the cloud are the easy parts; analyzing that data is harder, and therefore more valuable.

Pentaho’s announcement was similar in some ways and different in others:

Like IBM, we recognize the needs and opportunities.

Technology-wise, Pentaho has a suite of tools, engines, and products that are much better suited to Hadoop integration, being pure Java and designed to be embedded (there’s a minimal sketch of this after these points).

Pentaho has no plans to release our own distro of Hadoop. Any changes we make to Hadoop, Hive, etc. will be contributed back to Apache.

And lastly, but no less importantly, Pentaho announced first. ;-)
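As a side note on that embeddability point: here is a minimal sketch of what running a PDI (Kettle) transformation from plain Java looks like, written against the PDI 4.x embedding API. The .ktr path is a placeholder, and error handling is trimmed for brevity – treat it as an illustration of the pattern, not production code.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class EmbedKettle {
    public static void main(String[] args) throws Exception {
        // Bootstrap the PDI runtime (plugin registry, logging, etc.)
        KettleEnvironment.init();

        // Load a transformation definition; the path is a placeholder
        TransMeta transMeta = new TransMeta("/etl/load_weblogs.ktr");

        // Execute it and block until all steps have finished
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors");
        }
    }
}
```

The same handful of calls works anywhere a JVM runs, which is what makes pushing Kettle logic into other environments – Hadoop included – practical.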

When it comes to other players:

Microsoft is apparently making Hadoop ready for Azure, but Hadoop is currently not recommended for production use on Windows. It will be interesting to see how these facts resolve themselves.

Oracle/Sun has the ability to read from the Hadoop file system and has a proprietary Map/Reduce capability, but no compelling Hadoop support yet. In direct conflict with Hadoop’s scale-out mentality, Larry Ellison talked about Oracle’s new hardware in a recent Wired interview:

The machine costs more than $1 million, stands over 6 feet tall, is two feet wide and weighs a full ton. It is capable of storing vast quantities of data, allowing businesses to analyze information at lightning fast speeds or instantly process commercial transactions.

HP, Dell, etc. are probably picking up some business providing the commodity hardware for Hadoop installations, but they don’t yet have a discernible vision.

Dan has been at EMC for a number of years and knows a lot about data. He is dead-on when he talks about the metadata and dimensionality of Map/Reduce and NoSQL data stores. These environments are rich in data, but the metadata can be very sparse or non-existent. This makes reporting and analysis of the data harder.

The thing is, taken in combination, Pentaho is the only technology that satisfies all of these points.

You can see a few of the upcoming integration points in the demo video; they are just a sample of the many we are going to deliver.

Most recently I’ve been working on integrating the Pentaho suite with the Hive database. This enables desktop and web-based reporting, integration with the Pentaho BI platform components, and integration with Pentaho Data Integration. Between these use cases, hundreds of different components and transformation steps can be combined in thousands of different ways with Hive data. I had to make some modifications to the Hive JDBC driver, and we’ll be working with the Hive community to get these changes contributed. These changes are the minimal ones required to get some of the Pentaho technologies working with Hive. Currently the changes live in a local branch of the Hive codebase – more specifically, a ‘SHort-term Rapid-Iteration Minimal Patch’ fork: a SHRIMP Fork.
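To give a feel for what this looks like from the client side, here is a minimal sketch of querying Hive through the stock JDBC driver of that era (driver class org.apache.hadoop.hive.jdbc.HiveDriver, jdbc:hive:// URLs). The host, port, and table name are placeholders – and note that the stock driver left many JDBC methods unimplemented, which is exactly the kind of gap the modifications mentioned above address.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the (circa 2010) Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Hive server on the default Thrift port; host/db are placeholders
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");

        Statement stmt = con.createStatement();
        // 'weblogs' is a hypothetical table used for illustration
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(1) FROM weblogs GROUP BY page");

        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}
```

Once the driver behaves like a proper JDBC driver, everything upstream of it – reporting tools, the BI platform, PDI steps – can treat Hive like any other SQL database.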

Technically, I think the most interesting Hive-related feature so far is the ability to call an ETL process from within a SQL statement (as a Hive UDF). This enables all kinds of complex processing and data manipulation inside a Hive query.
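To make the idea concrete, here is the rough shape such a UDF could take. The Hive UDF API (org.apache.hadoop.hive.ql.exec.UDF with an evaluate method) is the standard one; everything else – the class name, the transformation path argument, and the parameter name – is hypothetical, a sketch of the concept rather than Pentaho’s actual implementation.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunEtlUDF extends UDF {
    // Called by Hive per row: runs the given .ktr transformation,
    // feeding it the column value. Illustrative only.
    public Text evaluate(Text ktrPath, Text value) {
        try {
            // Initialize the Kettle runtime once per JVM
            if (!KettleEnvironment.isInitialized()) {
                KettleEnvironment.init();
            }
            TransMeta meta = new TransMeta(ktrPath.toString());
            // "INPUT_VALUE" is a hypothetical named parameter in the .ktr
            meta.setParameterValue("INPUT_VALUE", value.toString());

            Trans trans = new Trans(meta);
            trans.execute(null);
            trans.waitUntilFinished();

            return new Text(trans.getErrors() == 0 ? "OK" : "ERROR");
        } catch (Exception e) {
            return new Text("FAILED: " + e.getMessage());
        }
    }
}
```

Registered with CREATE TEMPORARY FUNCTION run_etl AS 'RunEtlUDF', something like this could then be invoked as SELECT run_etl('/etl/clean.ktr', raw_line) FROM logs – an ETL process firing from inside an ordinary Hive SQL statement.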

There are many more Hadoop-related ETL and BI features and tools to come from Pentaho. It’s gonna be a big summer.