Just in case you don’t know, yet: Pentaho Community Meeting PCM17 celebrates its 10th anniversary this November (10th through 13th) with a 3-days event full of presentations and networking.And it-novum has the honor to host the event together with Pentaho. Read more over here: PCM17 and FAQ

So, back to PCM17: It will take place in Mainz where everything started in 2008. I’m very excited about this, since it will be the same location, just much bigger(!) rooms. And I’m happy to assist and do things here and there, just as I did back in 2008. The difference is: Back in 2008, I organized it almost all on my own and this time we have so many helping hands, it’s amazing!

Here is a short status: We already have the agenda filled up with talks and, this is the first time ever, we added technical short talks on Friday together with the Hackathon. There are so many nice things to tell, we will make it possible! And, we got the highest number of attendees ever. Stay tuned for the final numbers. We will make it happen, anyone will be in.

So there is no Call for Presentations anymore, sorry… but, there is a Call for Collectibles. I explain why and what this is:

Since about two years, I’m a member of the non-profit organization GO for Tansania e.V. supporting The Brilliant Pre-Primary and Primary School in Tanzania. This school teaches children in the poorest families of Tanzania, many of whom are orphans. It is financed mainly through donations of the annual fundraising run at the Gymnasium Mainz-Oberstadt school and further donations.

To support this non-profit organization, we will do an auction on Saturday evening with a lot of nice Pentaho collectibles. BTW: Since Pentaho is now part of Hitachi Vantara, this might be even more interesting to get some of these 🙂

Here are two highlights what we already have for the auction:

#1 The Pentaho table-cloth that you see on the picture of the first Pentaho community meeting in 2008:

#2 The Pentaho flag that was on the entrance of the first event:

Furthermore, we have scratch books, blank DVDs all with the nice Pentaho logo.

Call for Collectibles: Please bring what you have and what you want to donate via the auction to support children in Tanzania.

If you have something but are not able to attend the event, it would be great when you could send it over to me. Please contact me. We will make sure your name will be on the „Walk of fame“ of donors 🙂

Thanks a lot and stay tuned for more …

And just in case you are not registered, yet for the 10th Pentaho Community Event, here you find everything you need to know, check it out: PCM17 and FAQ

The Filter Rows step is a special MDI scenario, since it has a nested structure of filter conditions (this applies accordingly to Join Rows step).

Example: a OR b

Example: a OR (x AND y)

The MDI condition is given in XML notation, for example:

The condition XML has the same format as we store the transformation meta data in a KTR file in XML format. We do not have a DTD (Document Type Definition) for the KTR XML format, nor the condition. But it is very easy to get to a XML condition:

Create a sample Filter step with the different conditions you need. This gives you all the information, for example the value for function (in our example &#x3d; represents the equal function)

Select the step and Copy it to the clipboard. Paste it into a text editor. Alternatively you can store the KTR and open the KTR in a text editor.

Find the <condition> element and their nested elements and modify it accordingly to use it in your MDI scenario.

Hint to remove the CR and LF within the XML in case you need a single line:

Use Notepad++ or a similar text editor

Replace \r\n and leave the „Replace with“ field empty

Make sure to switch to Extended Search Mode

Different MDI Architectures

In general, a MDI process can:

„Just“ inject metadata and call the template transformation –> we call this the MDI Standard use case

Additionally, the data can get pushed (streamed) from the main to the template transformation and also pulled back. This is needed when the template processes dynamic data from the main transformation. –> we call this the MDI Data Flow use case

For big data use cases, we can fill the template transformation with metadata and store it (phase 1). Then this stored transformation can be copied and executed on multiple nodes (phase 2).–> we call this the MDI 2 Phase Processing use case, also used in our Onboarding Blueprint for Big Data: Filling the Data Lake

And last but not least: If you miss a feature, let us know in JIRA (when you are a community user) or contact Pentaho Customer Support (when you are a Pentaho customer) to let us know about your feature requests and use cases. The more we know what is needed, the more we can help. Join us, either as a customer or community member, to leverage the power of Pentaho.

As I mentioned yesterday, Metadata Injection is my favorite part of this release. Now I get to explain why.

Metadata Injection (MDI) gives you the ability to modify transformations at execution time.

By dynamically passing source metadata to PDI at run time, IT teams can drive hundreds of data ingestion and preparation processes through just a few actual transformations. This promotes reusability and productivity, accelerating the creation of complex data onboarding processes.

How does Pentaho 6.1 and data onboarding relate back to the analytic data pipeline?Organizations face challenges scaling their data pipelines to accommodate exploding data variety, volume, and complexity. In particular, it can take considerable time and resources to engineer and prepare data for the following types of enterprise use cases:

The modern data onboarding necessary in these projects is more than just ‘connecting to’ or ‘loading’ data. Rather, it involves managing a scalable, repeatable process to ingest an array of changing data sources.

Traditional static ETL or hand-coding approaches require repetitive, time-consuming manual design that can increase the risk of human error. Meanwhile, this leads to staffing challenges and increased backlog. IT labor hours are tied up in risky, long term development projects when there are other potentially more valuable projects that could be worked on instead (opportunity cost).

What is metadata injection and why is it useful?

Recall that metadata is “a set of data that describes and gives information about other data.” In the context of a simple PDI transformation, examples of metadata include the field names, field types (string, number, etc), and field lengths (5 characters or 50?) from a CSV file input step in PDI. Normally, this metadata is accessed manually when a user creates the transformation, such as by hitting the Get Fields button in the CSV input step.

Metadata injection refers to the dynamic passing of metadata to PDI transformations at run time in order to control complex data integration logic. The metadata (from the data source, a user defined file, or an end user request) can be injected on the fly into a transformation template, providing the “instructions” to generate actual transformations. This enables teams to drive hundreds of data ingestion and preparation processes through just a few actual transformations, heavily accelerating time to data insights and monetization. In data onboarding use cases, metadata injection reduces development time and resources required, accelerating time to value. At the same time, the risk of human error is reduced.

In Pentaho 6.1, we have hardened and enhanced our metadata injection capabilities. This includes enabling metadata injection with new steps, providing new documentation and examples on help.pentaho.com, and making other standardizations and improvements.

Here is a list of PDI steps that support metadata injection as of PDI 6.1:

Sample: We also added a tutorial for MDI with a step-by-step documentation to create a solution to parse different Excel files that contain invoice data from suppliers to process and unify:

This is a simplified example for illustration to store the data in a text file. It can even be used in all sorts of use cases and can store the data in SQL, NoSQL databases or Big Data. You have all the capabilities of PDI – sometimes I hear the term Swiss Army Knife 🙂

The journey continues and our engineering teams already work with full steam ahead to add more and more capabilities in the upcoming release – stay tuned!

And last but not least: If you miss a feature, let us know in JIRA (when you are a community user) or contact Pentaho Customer Support (when you are a Pentaho customer) to let us know about your feature requests and use cases. The more we know what is needed, the more we can help. Join us, either as a customer or community member, to leverage the power of Pentaho.

I am very excited to share that today we announced Pentaho 6.1 (see also the German press release). There is so much to share that I am dividing this into two blog posts. Today I will share a summary of the new features and improvements of Pentaho 6.1. Stay tuned for tomorrow when I will go into detail about my personal favorite new feature in this release – Metadata Injection.

New Features and Improvements in Pentaho 6.1

Pentaho Data Services

This powerful feature has been around since PDI 5.4 (known as Thin Kettle JDBC driver) and it solves a couple of really nice use cases that the following picture illustrates:

What we did in 6.0: Within 6.0, the usability and performance have been improved by Caching and Push-Down-Optimization. We are not finished, yet: There are remaining use cases to continue these improvements and harden this feature for enterprise use.

And we worked on these in 6.1:

Added Parameter Pushdown Optimization for Data Services – You can improve the performance of your Pentaho data service through the new Parameter Pushdown optimization technique. This technique is helpful if your transformation contains any step that should be optimized, including input steps like REST where a parameter in the URL could limit the results returned by a web service.

Driver Download for Data Services in Pentaho Data Integration – When connecting to a Pentaho Data Service from a non-Pentaho tool, you previously needed to manually download a Pentaho Data Service driver and install it. Now in 6.1, you can use the Driver Details dialog in Pentaho Data Integration to download the driver.

Pentaho Data Service as a Build Model Source Edit section – You can use a Pentaho Data Service as the source in your Build Model job entry, which streamlines the ability to generate data models when you are working with virtual tables.

Further PDI improvements

Improved JSON Input Performance – The JSON library used by Pentaho Data Integration has been replaced. The new library increases speed by up to 10 times or more, and can now handle file of up to hundreds of MBs in size. Although the library has been updated, the user interface to this functionality is fully backward compatible.

New Run Options Window – The Execute a transformation and Execute a job windows in Pentaho Data Integration have been replaced with new Run Options windows. These new windows streamline your interaction while running a transformation or job, especially with the Always show dialog on run option. With this option disabled, the Run Options window does not appear the next time you run your transformation or job. The options currently set are reused until you access the Run Options window again.

New Run Context Menu – Besides the Action menu or by pressing F8 in Pentaho Data Integration, you can access the Run Options window through the new Run context menu. In this menu, select Run to run your transformation or job, and select Run Options to change any of options before running.

Related to Analyzer and Data Model

Hide a Calculated Measure in the Data Model Edit section – You can hide a calculated measure in the data model. In the Annotate Stream step, select the Hide this calculated measure in the model check box to include the measure as part of the model, but to hide it from users when the data source is opened in Analyzer.

Publish a Data Model with Hidden Attributes Edit section – Hide attributes and calculated measures in Analyzer when creating them on a data model in the Annotate dialog box by setting the Hidden property to ‚True‘ to hide the level in Analyzer. The Hidden property is useful for attributes needed to build a proper data model, but not needed for analytic purposes.

Edit Calculated Measures in Analyzer Edit section – You can now update the properties on a calculated measure created via inline model editing within Analyzer, such as if you want to rename the measure or adjust the MDX formula on the calculation. Also, you can now easily identify calculated measures in the Available fields list by the icon, ‚f(x)‘, which only displays next to calculated measures created in Analyzer via inline model editing.

Show and Hide Available Fields Edit section – You can select to hide or show fields in the list of Available fields for a report in Analyzer. Hiding fields is helpful when you want a clear view of only those fields you are interested in for your report. When you hide a field, it is no longer available for selection in the report.

Related to Plattform, CTools & Documentation

Platform Improvements – We continued with the foundational improvements to the Pentaho Platform first introduced in Pentaho 6.0. Highlights include continued Karaf hardening and Jackrabbit performance improvements.

CTools Documentation Added to Help – We’ve added CTools documentation to the Help. You can now learn the basics of the Community Dashboard Editor (CDE) with the CDE Quick Start Guide and its companion articles: CTools Overview, Activate CDE, CDE Dashboard Overview, CDE Quick Start Guide

CDF API Documentation – A full set of API documentation has been created for OEMs who want to use JavaScript APIs to build custom dashboards with the Community Dashboard Framework (CDF).

Help Site Improvements – Our Help site has been upgraded to a more responsive version of MindTouch. This upgrade improves content readability and simplifies navigation, along with a better interaction on a mobile device.

And last but not least: If you miss a feature, let us know in JIRA (when you are a community user) or contact Pentaho Customer Support (when you are a Pentaho customer) to let us know about your feature requests and use cases.

Every time when there is something new coming on the horizon, it’s a good time to look back. The intention of Hitachi Data Systems to acquire Pentaho is such a remarkable move that it opens up a great new chapter in Pentaho’s history.

As I just researched the details of the journey, I was very happy that you can still find the referenced postings on the Internet after all these years. All the other details came from my memories, so hope they are all accurate, especially the time of the day! And please consider that this is from my personal view and that I’m Kettle addicted, so European and German filter rules together with the priority on Pentaho Data Integration (aka Kettle) topics might apply here and there…

Sep 8, 2004 (right after dinner) – I read Matt Casters post #30 on the BIRT mailing list about Kettle version 2.0. BTW: The BIRT project was just at the beginning and it was even before the BIRT project got officially accepted by the Eclipse Foundation on Oct 6th, 2004.

At this time Kettle was closed source and I got a trial license code from Matt to play with it. Since our intention at Proratio (my employer at this time) was to use it to load the companies data warehouse from our ERP with an extreme complex condition system, the needed combination of steps would be a night mare. I asked Matt if there is some kind of plug-in possibility for Java code and his answer was about this: „This is a good idea, but since I’m doing this part time and on the weekends, I don’t know when it will be finished.“

In the meantime I developed a connector to SAP (some of the root bits and bytes went into the godesys SAP-Connector, formerly ProERPConn by Proratio) and thought it would be a good time to check with Matt again. And surprisingly he just finished the plug-in possibility. Sometime in summer 2005 we ended up sitting together in Mainz and managed to integrate the SAP connector within Kettle in a very short time (between breakfast and lunch).

Early Dec 2005 – Matt Casters released Kettle to the open-source with a LGPL license (moved to Apache 2.0 later on). For those who would like to know more about the history of Kettle (Matt started 2001 to work on it), have a look at the Pentaho forum post Project road map, history of kettle. It contains also a zip file of the very first Java Kettle version ever and a screen shot of version 1.0. You can also read a bit more about the story in the book Pentaho Kettle Solutions.

Dec 12, 2005 – Matt Casters‘ first forum post at Pentaho: „[…] Therefor, I would like to convince you to include a powerfull ETL tool that allows users to quickly build and maintain a full data warehouse. For me this means including slowly changing dimensions. Copying a couple of tables and hoping that the reporting engine will be able to cope just won’t do. Please feel free to consider including Kettle. Kettle is an ETL tool that turned LGPL about 10 days ago. […]“

Apr 4, 2006 – The press release went out: Pentaho Acquires Kettle Project – Fastest Growing Open Source BI Project Strengthens Portfolio With Best-in-Class Data Integration. This was also the time that Kettle was known as Pentaho Data Integration.

Jun 2007 – I joined Pentaho as employee #3 in Europe. #1 was Thomas Morgner (Founder and Chief Architect of the Pentaho Reporting engine) and #2 Matt Casters. Since we had a limited head count (not only in Europe), I had multiple heads on: working on the code base, documentation, QA, support, training, prove of concepts, presales, community, even Guerilla-Marketing and may be something else that I forgot.

Dec 2007 – Pentaho won it’s first German customer – a large health insurance company

Jun 2008 – The first Pentaho community event in Mainz was a great success, bringing people together with their ideas, projects and to know each other personally and last but not least setting a foundation stone to continue and grow the community over the next years.

This acquisition builds on an existing OEM relationship between the two companies and is a core component of the HDS strategy to accelerate its Social Innovation business and become a leader in IoT. Social Innovation is the unifying strategy across Hitachi businesses to deliver solutions that enable healthier, safer and smarter societies. The Pentaho vision to create transformational value from data generated and interconnected across people and things is brought to life by a big data orchestration platform to power embedded analytics.

A lot of exiting things happened at PentahoWorld 2014, you can read a lot about it on Twitter #PWorld2014. It is amazing what momentum we reached and that over 400 attendees have been at this conference. BTW: Did you know that Pentaho celebrated it’s 10th anniversary last week, too?

PentahoWorld 2014

For me, as a „PDI/Kettle addicted person“ 😉 one of the major topics was also the release of Pentaho 5.2 (BA & PDI). You can read about all the great new features over here: Upgrade Existing Pentaho Systems (the press release is not out, yet by the time of writing this article, but will come soon over here: Pentaho Press Releases).

PDI 5.2

Here are the highlight of PDI 5.2:

Pentaho Data Integration 5.2 delivers many exciting and powerful features that help you quickly and securely access, blend, transform, and explore data.

New Streamlined Data Refinery Feature

The Streamlined Data Refinery (SDR) is a simplified, ad hoc ETL refinery composed of a series of PDI jobs that take raw data, augment and blend it through the request form, and then publish it to the BA Server for report designers to use in Analyzer.

R Script Executor Step Improvements

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transforms PDI into a powerful, predictive analytics tool. The R Script Executor step allows you to incorporate R scripts in your transformation so that you can include R-based statistical programming in your data flow. In PDI Version 5.2 you can now „plug and play“ R scripts, without extra customization. Now you can pass incoming field metadata to the output field metadata, use a more intuitive user interface to run scripts by rows or by batches, and test scripts.

New DI Server Administration Features

Porting content from one environment to another and performing general DI Repository maintenance is easier with the introduction of the new Purge Utility. The Purge Utility permanently purges the repository of versions of shared objects, such as database connection information, jobs, and transformations. You can also turn DI Repository versioning and comment capturing capabilities on and off.

Kerberos Security Support for CDH 5.1 and HDP 2.1

If you are already using Kerberos to authenticate access to a Cloudera Distributed Hadoop 5.1 or Hortonworks Data Platform 2.1 cluster, with a little extra configuration, you can also use Kerberos to authenticate Pentaho DI users who need to access those clusters.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions. Pentaho Marketplace is a home for community-developed plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

LookupTimeDimensionStep: Looks up and creates an entry on a data warehouse dimension time table and returns the ID.

Probabilistic Row Distributions: Contains a collection of Row Distribution plugins for PDI that use probabilistic methods for determining the distribution of rows.

PDI Groovy Console: Adds a Groovy console to the Help menu that has helper methods and classes that interact with the PDI environment.

I’m very happy to see PentahoWorld coming soon on October 8 through 10, 2014. It is our first(!) worldwide conference for Users, Advocates and Partners of Pentaho.

After many years of Pentaho community meetings, this is a next logical step in boosting the great Pentaho story. I initiated and organized the first Penaho community meeting. It was conducted in my hometown Mainz, Germany back in 2008 and we had a great audience, a lot of information, talks and fun. Since then, it succeeded every year and will continue in the future: the next community meeting is scheduled for November 14 through 16 in Antwerp, Belgium.

BTW: By writing this, a lot of history come to my mind, so I will continue to write a bit more about history that formed and forms our future in IT, Pentaho and Big Data in the next couple of blog posts, stay tuned via my Twitter account or RSS feeds of this blog.

Training and Interactive Sessions: Three days of rich content to fuel the data-driven business

Learn how to apply the latest Pentaho innovations, see how customers use Pentaho to solve real-world problems, gain insight into product roadmaps and strategic plans, and take away practical knowledge and best practices.

Pentaho 5.1 delivers code-free analytics directly on MongoDB, simplifies data preparation for data scientists and adds full YARN support. See the full list of Pentaho Business Analytics 5.1 features and technical specifications.

Here are some more details about the new PDI 5.1 release:

Native (YARN) Hadoop integration

PDI includes support for YARN capabilities including enhanced scalability, compatibility with MapReduce, improved cluster use, and greater support for agile development processes. YARN also provides support for non-MapReduce workloads.

YARN for Carte Kettle Clusters

The Start a YARN Kettle Cluster and Stop a YARN Kettle Cluster entries make it possible to to execute carte transforms in parallel using YARN. Carte clusters are implemented using the resources and data nodes of a Hadoop cluster, which optimizes resources and speeds processing.

Cloudera, Hortonworks, and MapR Hadoop Distribution Support

Use Pentaho’s innovative Big Data Adaptive Layer to connect to more Hadoop Distributions, including Cloudera 5, Hortonworks 2.1, and MapR 3.1. These certified and tested YARN-based distributions allow you to use PDI to build scalable solutions that are optimized for performance. Pentaho supports over 20 different Hadoop Distribution versions from vendors such as Apache, Hortonworks, Intel, and MapR. Pentaho also supports Cloudera distributions and is a certified Cloudera partner.

Data Science Pack with Weka and R

The R Script Executor, Weka Forecasting, and Weka Scoring steps form the core of the Data Science Pack and transforms PDI into a powerful, predictive analytics tool. The R Script Executor step, which is new for 5.1, lets you include R scripts in your transformations and jobs. You can customize random seed sampling, limit the batch and reservoir size, adjust logging level messages, and more. You can also choose to load the script from a file at runtime, enabling you to have more flexibility in transformation design.

Security Enhancements

PDI security has been enhanced to include support for more standard security protocols and specifications.

New Execute Permission: You can now choose whether to grant permission to execute transformations and jobs by user role. This provides more finely-tuned access controls for different groups and can be useful for auditing, deployment, or quality assurance purposes.

Kerberos Security Support for Hadoop & MongoDB: If you are already using Kerberos to authenticate access a data source, with a little extra configuration, you can also use Kerberos to authenticate DI users who need to access your data.

Teradata and Vertica Bulkloaders

There are two new bulkloaders steps: Teradata TPT Bulkloader and Vertica Bulkloader. Also, newer versions of Teradata and Vertica are now supported.

New Marketplace Plugins

Pentaho Marketplace continues to grow with many more of your contributions. As a testimonial to the power of community, Pentaho Marketplace is a home for your plugins and a place where you can contribute, learn, benefit from, and connect to others. New contributions include:

Additional to the release of the Enterprise Edition (EE), Pentaho released the stable build of 5.0 Community Edition (CE).

If you can’t wait to get to the download, have a look at our new community.pentaho.com. It hosts all information and the download link. And for the download of a free 30-day trial of Pentaho EE, visit pentaho.com/download.

Pentaho Instaview is the fastest way to start using Pentaho Data Integration to analyze and visualize data. Instaview uses templates to manage the complexities of data access and preparation. You can focus on selecting and filtering the data you want to explore, rather than spending time creating source connections and identifying measure and dimension fields. Once the data has been selected, Instaview automatically generates transformation and metadata models, executes them, and launches Pentaho Analyzer. This allows you to explore your data in the Analyzer desktop user interface.
As your data requirements become more advanced, you have the ability to create your own templates and use the full power of Pentaho Data Integration (PDI).
Watch this video and see the Getting Started with Pentaho Data Integration Instaview Guide to understand and learn more about Pentaho Instaview or

PDI Operations Mart

The PDI Operations Mart enables administrators to collect and query PDI log data into one centralized data mart for easy reporting and analysis. The operations mart has predefined samples for Pentaho Analyzer, Interactive Reporting, and Dashboards. You can create individualized reports to meet your specific needs.

Sample inquiries include

How many jobs or transformations have been successful compared to how many failed in a given period?

How many jobs or transformations are currently running?

What are the longest running jobs or transformations in a given period?

What is the highest failure rate of job or transformations in a given period?

How many rows have been processed in a particular time period? This enables you to see a trend of rows or time in time series for selected transformations.

The operations mart provides setup procedures for MySQL, Oracle, and PostgresSQL databases. Install instructions for the PDI Operations Mart are available in the Pentaho InfoCenter.

Concat Fields Step

The Concat Fields step is used to join multiple fields into one target field. The fields can be delimited by a separator and the enclosure logic is completely compatible with the Text File Output step.

This step is very useful for joining fields as key/value pairs for the Hadoop MapReduce Output step.

The EDI to XML step converts EDI message text, which conforms to the ISO 9735 standard, to generic XML. The XML text is more accessible and enables selective data extraction using XPath and the Get Data From XML step.

New Pentaho Data Integration Software Development Kit (PDI SDK)

Extending and Embedding Pentaho Data Integration enables developers to utilize PDI beyond the out-of-the box functionality. This guide explains the mechanics of extending PDI plugins. It also explains embedding PDI functionality directly into Java applications. Pentaho provides sample code for all plugin types and embedding scenarios. More details can be found in the Pentaho Infocenter Embedding and Extending Pentaho Data Integration.

The complete PDI 4.4 change log: Now with JIRA Components and PDI Sub-Components

This is the first release with very detailed information about what has changed on a more granular level, so you could query JIRA by components and PDI Sub-components like steps, job entries and specific database topics. This helps in the upgrade process and finding already existing issues in JIRA. More details can be found in the Pentaho Wiki page JIRA PDI Components, Sub-components and Labels and the complete PDI 4.4 change log can be found in JIRA. You need to be logged into JIRA to include the fields in your default report, but you can also download the complete list when you select „View“ / „Excel (All fields)“.

Some background about versioning and compatibility for Kettle core

From the Kettle core perspective, 4.3.0 and 4.4.0 are still bug fix releases for 4.2.x but with significant new features. As such, we have decided to call the release version 4.4.0 rather than 4.3.1. As always, please see the PDI Upgrade Guide for specific topics when upgrading.

Carte is a simple web server that allows you to execute transformations and jobs remotely and execute transformations clustered. Carte is normally started with a .bat file within Windows environments but there are some use cases to run Carte as a Windows Service, e.g.:

When Carte instances are running using a command window, anyone by mistake could close the instance and Carte will go down.

The Carte.bat command window is tied to the user session that called the batch file and needs to be kept logged in.

With a Windows Service you can start the Carte service at machine startup and also configure it to restart after a crash.

After you completed the instructions on the Pentaho Wiki, you are able to get Carte running as a Windows service like this:

It is also possible to get a system tray

to get the console output of Carte like this after the initial start of the Windows service:

The console gets the same information as you get via the browser when you logged into carte, e.g. after the execution of the sample transformation:

Many thanks for the developers of YAJSW (Yet Another Java Service Wrapper) for this great solution to wrap Java programs as a service and to Dan to bring this to my attention.

On May 23rd 2012 Pentaho and 10gen are jointly announcing a partnership to provide direct integration between Pentaho Business Analytics and MongoDB.

MongoDB is a scalable, high-performance, open source NoSQL database featuring document-oriented storage, auto-sharding for horizontal scalability, rich document-based queries and fast in-place updates. MongoDB is designed with both scalability and developer agility in mind. Instead of storing your data in tables and rows as you would with a relational database, in MongoDB you store JSON-like documents with dynamic schemas. The goal of MongoDB is to bridge the gap between key-­value stores (which are fast and scalable) and relational databases (which have rich functionality).

Working together, Pentaho and 10gen offer the first MongoDB-based big data analytics solution to the market. This solution combines MongoDB with Pentahos visual interfaces for high-performance data input, output and manipulation, as well as data discovery, visualization and predictive analytics. This makes it easy and productive for IT staff, developers, data scientists and business analysts to operationalize, integrate and analyze both big data and traditional data sources.

MongoDB and further BigData and NoSQL connectors are available in all editions of Kettle including Community, Basic, Professional and Enterprise since Version 4.3.

Kettle makes it really easy to extract data from multiple big data and traditional sources, and integrate it into a data mart or warehouse. The visual interfaces for big data will accelerate development and maintenance of MongoDB-based solutions a lot!

It was already possible to profile your data in an easy way with Kettle: Open the Database Explorer, chose a table and right click in the context menu on Data Profile. The result was basic information about the data like Min, Max, Count all for strings and additional information for numeric data but these were only basic metrics about your data. We have a much more better solution now:

Human Inference (DataCleaner) and Pentaho (Kettle) worked together to integrate their tools and the result is a nice and seamless integration of DataCleaner into Kettle. A sample for introduction and FAQ can be found at Kettle Data Profiling with DataCleaner

You can right click on any step within your transformation and profile your data. It is also possible to clean or harmonize your data and check the result directly within your transformation.

Here are some screen shots:

Number Analyzer

String Analyzer

Pattern Analyzer (e.g. you see how many single words with first capital letters etc. exist in your data):

And there are much more Analyzers you can choose from: Matching & Deduplication, Boolean, Character Set Distribution, Data Gap, Date/Time, Reference Data Matcher, Value & Weekday DistributionAnalyzers

All of Pentaho’s big data capabilities will be available as open source in the new Pentaho Kettle 4.3 release: Big data capabilities include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores: Cassandra, Hadoop HDFS, Hadoop MapReduce, Hadapt, HBase, Hive, HPCC Systems and MongoDB.

The ExeBatLauncher is a simple way of calling .bat files as .exe files in creating an .exe command that calls the same .bat command.

For example you want to call the Spoon.bat file by Spoon.exe, simply rename the ExeBatLauncher.exe (that is found in the distrib folder) to Spoon.exe and copy it to the same directory and you are done.

If you want to have a polished Spoon.exe with the right icon to test with, just download it from the PDI-6949 attachment.

The reason why I created this was a Pentaho Data Integration JIRA case: PDI-6949 – „While in PDI-2925 the feature of kettle.exe was removed, you cannot add a .bat file to the quick launch of Windows 7. The workaround is to add a cmd.exe to the quick launch and add /c <path to .bat> to modify it. This is due to microsoft security in Win 7.“

I also thought a lot of launching Kettle just by clicking on a .ktr or .kjb file and accomplished this by a wrapper .bat (calling the target bat with the /file: option). There are also nasty ways of launching .bat files by creating self extracting exe archives (nice trick), but this is just too dirty to be professional. But I tried also this and it worked even with the Spoon icon, but some limitations like having to extract the.bat file from the self extracting archive into the distribution folder are not a good solution. Further investigation into existing solutions of calling Java from .exe brought some nice projects but that is not the solution we want at this time.

I wanted to have a pretty simple .exe that just calls the .bat with the same name and here is the code for this:

As you see it supports:

calling the .exe/.bat from every folder

passing all arguments from the caller

returning the return code to the caller

If you want to compile it, follow the instructions in the _README.txt file in the attachment: ExeBatLauncher.zip

Today was our first German Customer Meeting that Bruno initiated and it took place at one of our customer locations @Wirecard AG (many thanks!).

A nice (and squeezed ) agenda and around 20 attendees discussed the presented customer solutions, their experiences and what Pentaho can do better, but also: what really works well and this is a lot and proven in medium and large deployments. Here are some examples in addition to „normal“ BI:

Application and data row level role based security in a multi-tenant solution using special role management via a Mondrian plug-in

Highly performant solutions with load balancing servers and high availability (99.95%), one of them with a failover with two complete separate locations

Pentaho integrates very well in existing IT infrastructures

Event triggered ETL by an ESB (Enterprise Service Bus) / SOA architecture: JBoss ESB triggers PDI Jobs for application integration (JMS, File, HTTP, SMTP, FTP etc.). At this time via the Kettle API but it is planned to use Kettle Web-Services

PDI is used as a data source for reporting, just a nice EAI use case

Pentaho is integrated into JBoss EAP (the customer version of JBoss AS, everything in this platform is secured by default)

3 of the attendees are using Infobright

Report bursting was accomplished by 3 different methods: 1) using xActions, 2) using ant scripts, 3) using Kettle

Major discussed improvements:

More integrated and consistent security throughout all products (its possible but special requirements are tricky to implement)

Report bursting with scheduling should be more standardized and user friendly

Since Pentaho can be extended and embedded very well via extension APIs, the upgrade to next versions should be more seamless by special dedicated documentation for OEM and stable API extension points [this will be addressed in the road map by the SDK]

Upgrade process: upgrade tools would be a huge benefit for supporting it, also when some releases were not installed (e.g. jump from 3.x to 4.1 instead of 4.0)

As part of the Pentaho Product Management team, I presented the product road map and gave some quick insights in the Kettle Star Modeler.

I look forward to the next event of this type for knowledge sharing between Pentaho customers and users. We found already one volunteer (Guy) to take it from here 🙂

BTW: This event was only in German language (at this time!) and thanks again to Wirecard, Rob & Bruno for organizing this and to the attendees for their presentations and active participation!

Here are some impressions….

Veröffentlicht in General|Kommentare deaktiviert für First German Pentaho Customer Meeting in Munich

This step provides the ability to read data from any type of XML file using the StAX parser. The existing Get Data from XML step is easier to use but uses DOM parsers that need in memory processing and even the purging of parts of the file is not sufficient when these parts are very big.

The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data stuctures and the need for very fast data loads: Since Kettle has so many own steps to process data in different ways, the processing logic has been moved more into the transformation and the step itself provides the raw XML data stream together with additional and helpful processing information.

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples at the Kettle Wiki for illustrations of the usage.

Note: In almost all use cases, a Set/Reset functionality was needed. At this time it can be accomplished by the Modified Java Script Value step or the User Defined Java Class step where the latter one is recommended and much more faster. An own Kettle step with Set/Reset functionality is one the road map to solve these and other similar use cases, see PDI-6389 for more details.

Choose this step, whenever you have limitations with other steps or when you are in need of parsing XML with the following conditions:

Very fast and independend of the memory regardless of the file size (GBs and more are possible due to the streaming approach)

Very flexible reading different parts of the XML file in different ways (and avoid parsing the file many times)

Here is an example of parsing the following XML file with 2 main sample data blocks (Analyzer Lists & Products):

A preview on the step may look like this (depending on the selected fields):

You see you really get almost the original streaming information with Elements and Attributes from the XML file together with helpful other fields like the element level.

Since the processing logic of some XML files can sometimes be very tricky, a good knowledge of the existing Kettle steps is recommended to use this step. Please see the different samples of this step for illustrations of the usage.

The transformation looks like this:

The end result for the Analyzer List block:

The end result for the Products block (splitted for example into two separate data streams for the end system):

And there are a lot more options in the step to help to solve your needs:

Kettle is used more and more in enterprises where the standard obfuscation of credentials is not sufficient enough. There are requirements to use strong encryption methods and even to store internal data encrypted (covered in PDI-6168 and PDI-6170). The above use cases inspired me to create some simple transformations to test and play around with encryption.

Let’s start with creating a key by the cryptographyCreateSecretKey transformation:

The generateKey step uses the User Defined Java Class step and implements sample code for AES, the Advanced Encryption Standard is a symmetric-key encryption standard, see also http://java.sun.com/developer/technicalArticles/Security/AES/AES_v1.html. The key serialization to file is a little trick to obfuscate the key. Other methods can be included instead of the clear text file output.

Now that we have the key file, we can encrypt our secret data:

With the transformation cryptographyEncrypt:

We keep it simple and assume the key is available in each row (accomplished by the Join Key).

The encrypted result looks like this:

Let’s decrypt it with the transformation cryptographyDecrypt:

The result is correct but only when the key file is the same and the encrypted data was not modified. You can test it yourself and see what error messages come up or the resulting files look like when the key file or data was modified.

Instead of storing the decrypted data to a file there are a lof of other options, e.g.:

use the decrypted data as credentials in subsequent steps or transformations

put the decrypted data into variables visible in a limited scope (e.g. parent job) and use them as credentials for databases, repository etc. (see PDI-6168)

I will not discuss about the buzz about the iPhone tracking – all that needs to be discussed is already out. That iPhone is storing locations in the consolidated.db was known a long time ago, but now we got a prove of concept by Pete Warden and Alasdair Allan and this inspired me to dig into this a bit more.

After an amazing short time of less than one hour I made it possible to read in my own iPhone consolidated.db with Kettle. [It takes more time to write about this… and the other real hard thing was to locate this file in my file system.] When we can read this file with Kettle we are open and can look up any additional data, filter by any criteria, store this in another database, use the Pentaho BI Suite to create maps and anything else that can a human being imagine. Just have a look at the Pentaho Sandbox or more general to the Pentaho site.

Here are my findings and steps to reproduce:

1) Locate the consolidated.db in your filesystem.

This is a little bit tricky, since the filename is mapped to an ID. I found it very easy by just searching for the string „CellLocation“ in the iTunes backup folder, described over here for Windows:

Within Kettle you have the Normalizer and Denormalizer steps to help with pivoting (a simplified use case for pivoting is changing the row/column axis, see also transpose). Imagine you would like to have a cross table, you may use a Kettle transformation to accomplish this. (In these days Pentaho Reporting’s cross tab functionality is on the road map but not implemented, yet. One solution would be to use a Kettle Transformation as a data source.).

Another solution for a cross tab is to use the SQL CASE WHEN construct. Here is an example:

The result of the following sample query select status, year_id, sum(totalprice) as totalprice from orderfact group by status, year_id looks like this:

When we want to put the years (2003, 2004, 2005) into separate columns, we can use the SQL CASE WHEN construct:

In another use case, we store actual and planning data (e.g. for turnover, quantity, weights etc.). To make it a bit more challenging, planning data is often stored as a budget and multiple forecasts or different plan versions (scenarios).

One solution for this problem is to store these data in separate columns, e.g. have columns for actual and other columns for planning data. Having multiple measures (see above) needs to have multiple columns for each type of the measures, e.g. turnover_actuals, quantity_actuals etc. and have this also for all types of planning scenarios. Thus you would get a big amount of columns and the biggest issue is: What happens when you need a new planning scenario? You would need to add more columns and your data model is not flexible enough and needs to be changed.

Another solution is to store the definition of an actual or planning type as an extra TYPE_ID in our table. This means each row is marked whether it is an actual or any kind of planning type.

But, when you sum up the whole table, the result would be wrong since it contains a mix of actual and planning data.

This can be solved also with the CASE WHEN construct even within a Mondrian schema definition:

With this solution you are flexible with your data model and the result will be correct.

Another more complex use case is to combine this with a reporting solution for project costing or contribution accounting. When you have a large number of different costs and would store these costs in separate columns you will reach a non manageable amount of columns and your data model is not flexible enough. It’s better to store these different types of costs in rows and you are much more flexible.

By all the flexibility with the data model you need to consider performance aspects of your database, e.g. if column based databases behave different with the CASE WHEN (I have not tested this) and different approaches should be tested. Any comments and findings are much appreciated.

Veröffentlicht in General|Kommentare deaktiviert für How the SQL CASE WHEN construct can help you in pivoting data?

There are a lot of different options to connect with a SAP system from Kettle. It was back in 2009 were I tested the functionality to access a SAP BI system via Webservices (XML/A) and here are my (historic) findings:

To analyze the result of the MDX query with Kettle is complicated (you need for every number of columns a separate transformation) but is possible. In these days other Pentaho frontend tools are able to process this direct via OLAP4J. I tested this with the Webservices step (using Basic Authentication) and for specific usecases this would be an option. Further development brought the OLAP Input step and this simplifies a lot and was tested successfully.

Here are some further information that might be of help for setting this up:

For the XML/A interface the following webservice should be switched on in transaction SICF:
/default_host/sap/bw/xml/soap/xmla

XMLA has two methodsDiscover – This method is used to query metadata and master data (this corresponds to the BAPI MDDataProviderBW).Execute – You can execute MDX commands with this method to receive the corresponding result set (this corresponds to the BAPI MDDataSetBW).

Description of the service can be reached by: <Protocol>://<Server>:<Port>/sap/bw/xml/soap/xmla?wsdl

For XMLA with SAP BI you have to create MDX commands generated from the metadata you receive from the system.

Below is transaction MDXTEST in SAP BI to test MDX query on a SAP BI infoprovider or query:SELECT
{[Measures].[0HDCNT_LAST],
[Measures].[ZKFHRFTE]} ON COLUMNS,
[0EMPLOYEE].MEMBERS ON ROWS
FROM [$ZMCHRPA1]

Some syntax templates are available in the MDX test editor.

Another service that is available is to get data directly from a query or query view. The service has one method: GetQueryViewData
Go to SICF and activate this web service: /sap/bc/srt/rfc/sap/QUERY_VIEW_DATA

The output will be in format of Datasets: AXIS_INFO (Metadata description of the characteristics, attributes, and structures on the axes), AXIS_DATA (Characteristic values, attribute values etc.), CELL_DATA (Value cells: The cells are arranged by row first, and then by column), TXT_SYMBOLS (Text elements for the query).

When you want to check if your Kettle jobs and transformations execute correct, you can use some operational patterns like

Health Checks for the JVM

Health checks for clustered environments and failover of a master server

Analyze log entries

Use restartable solutions

and others that are described in more detail in the Pentaho PDI training class

When you need a different approach for more safety and to measure additional values, then you can use the concept of a Watchdog and combine this with the normal patterns. A use case is e.g. you are building a very critical system (imagine a rocket flying to space) and engage different software developers, use different measuring units and different computers to test if everything is correct. When one of the units fails, you can signal an event and e.g. abort the launch. Well, the Kettle world is most times not such critical but it is used more and more in systems that need a very high reliability and 24×7 with high and continuously guaranteed throughput.

With a Kettle solution you may need many components to check, e.g. the Data Integration Server, Carte and clustered environments, Kitchen/Pan jobs and transformations, detect dead locks and also for external components like the JVM (e.g. memory, used CPUs), the server (e.g. test with a network ping) and the databases (up and running, capacity etc.).

You also need a signal to noise detection: Define what is normal (noise) and notify on exceptions (e.g. set thresholds  absolute or relative eventually by average), define what is unusual and notify on these events and define events (e.g. actions, notifications & alerts). By all these checks you have a primary constraint: You need to minimise the footprint and impact to the system by the measurements.

Why a Watchdog can help here?

One solution for checking if everything is on track is to use the concept of a Watchdog with tasks and events:

A software crash might go undetected by conventional watchdog strategies when you have multitasking (e.g. many PDI jobs and cluster nodes). The success lies in weaving the watchdog into the fabric of all of the system’s tasks, which is much easier than it sounds:

Build a watchdog task

Create a data structure (database table) that has one entry per task

When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops, e.g. every 10,000 rows

As the job or transformation runs the number of counts for each task advances.

Infrequently but at regular intervals the watchdog runs.

The watchdog scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value.

If the counts are unreasonable, halt and let the watchdog timeout and fire an event. If everything is OK, set all of the counts to zero and exit.

This is a derived concept from Jack Ganssle (2004): Great Watchdogs (especially the section WDTs for Multitasking).

An example implementation with Kettle

Most of the Kettle health checks can be accomplished with this Watchdog concept. An example implementation with Kettle is task oriented, not server oriented. This means it will check, if the task (a Transformation or Job) is running as expected independently in what environment (e.g. clustered or not).

Since Kettle has a wide variety of transformation steps and job entries I tend often not to program in a classical programming language but solve this solely with Kettle.

Here is an example of the event handling that could be altered very flexible by adding new events:

And here is an example of the above described logic of the Watchdog step. Sooner or later this could be simplified by building some Kettle steps that encapsulate these logics but for now it just works: