Sunday, September 25, 2011

Be Aware

Denial of Service attacking (DoS), IP Spoofing, Comment Spamming and Malware programming... are malicious activities designed to disrupt services used by many people and organizations. If you are taking advantage of the internet to run your business, create an awareness of a product or service or simply keep in touch with friends and family, your systems are at risk at becoming a target.

Successful internet "intrusions" can cost you money and even steal your identity. DoS attacks can prevent internet sites from running efficiently and in most cases can take them down. IP Spoofing, frequently used in DoS attacks, is a means to "forge" the IP address and make it appear that the internet request or "attack" is coming from some other machine or location. And Comment Spamming, oh brother...where programs or people flood your site with random nonsense comments and links with an attempt to raise their site's search engine ranking or increase internet traffic to their sites:

"Nice informations for me. Your posts is been helpful. I wish to has valuable posts like yours in my blog. How do you find these posts? Check mind out [link here]"

Huh? - LOL

You may already have defensive measures in place to address some if not all of these things. There are programs, filters and services that you can use to look up, track and prevent this sort of activity. However, with the continuous stream of unique and newly produced malware, those programs and services are only as good as the latest "malicious" activity that is captured. No matter what, it will eventually cause headaches for many people and organizations around the globe. Being able to monitor when something is "just not right" is a great step in the right direction.

Analyze

In September of 2010, I introduced the Pentaho Evaluation Sandbox. It was designed as a tool to assist with Pentaho evaluations as well as showcase many examples of what Pentaho can do. There have been numerous unique visitors to this site, both legitimate and some as I soon discovered...not. Prior to the site's launch, using Pentaho's Reporting, Dashboard and Analysis capabilities, I created a simplistic Web Analytic Dashboard that would highlight metrics and dimensions of the Sandbox's internet traffic. It was a great example to demonstrate Pentaho Web Analytics embedded in a hosted application. Upon my daily review of the Site Activity dashboard which includes a real-time visit strip chart monitor, I noticed an unusually large spike in page views that occurred within a 1 minute time-frame.

Now that spike can be normal, providing a number of different people are surfing the site at the same time. However it caught my attention as "unusual" due to what I knew was normal. The dashboard quickly alerted me of something I should possibly take action on. So I clicked on the point at the peak to drill-down into the page visit detail at that time. The detail report revealed that who or whatever was accessing the Sandbox was rapidly traversing the site's page map and directories looking for holes in the system. I also notice that all the page views were accessed by the same IP address within under 1 minute. Hmmm, I thought. "That could be a shared IP, a person or even a bot ignoring my robots.txt rules." But..as I scrolled down I further discovered there were attempts to access the .htaccess and passwd files that protect the site. I immediately clicked on the IP address data value in the detail report (in my admin version of the report) which linked me to an IP Address Blacklist look-up service. The Blacklist Look-up program informed me that the IP address has been previously reported and was listed as suspicious for malicious activity. BINGO! Goodbye whoever you are!

Take Action

I quickly took action on my findings by banning the IP address from the system to prevent any further attempts to access the site. I then began to think of some random questions I needed to ask of the data. I switched gears and turned to Pentaho Analysis. Upon further analysis of the site's data using Pentaho Analyzer Report - I was able to see evidence of IP Spoofing and even Comment Spamming coming form certain IP address ranges. The action I took next was to block certain IP address ranges that have been accessing the site in this manner. In addition I created a contact page for those who may be accessing the site legitimately but may have gotten blocked if their IP falls in that range.

Wow, talk about taking action on your data huh?

It is not a question of if, but when an unwarranted attempt will occur on your systems. Make sure you take the appropriate steps to protect them by using the appropriate software and services that will make you aware of problems. My experience may be an oversimplification but it is a great example of how I used Pentaho to make me aware of a problem and take that raw data and turn it into actionable information.

Special thanks to Marc Batchelor, Chief Engineer and Co-Founder of Pentaho for helping me explore the corrective actions to take to protect the Pentaho Evaluation Sandbox.

Monday, September 19, 2011

You have questions. How do you get your answers? The methods and the tools used to help get those answers to business questions will vary per organization. For those without established BI solutions; using desktop database query and spreadsheet tools are...all too common. And...If there is a BI tool in place, usage and its longevity are dependent on its capabilities, costs to maintain it and ease of use for both development staff and business users. Decreased BI tool adoption, due to rising costs, lack of functionality and complexity may increase dependencies on technical resources and other home grown solutions to get answers. IT departments have numerous responsibilities. Running queries and creating reports may be ancillary, which can result in information not getting out in a timely manner, questions going unanswered and decisions being delayed. Therefore, the organization may not be leveraging its BI investment for what it was originally designed to do...empower business user to create actionable information.

The BI market is saturated with BI tools, from the well known proprietary vendors to the established commercial open source leaders and niche players. There are choices that include the "Cloud", on premise, hosted (SaaS) and even embedded. Let's face it and not complicate things...most, if not all, of the BI tools out there can do the same thing in some form or fashion. They are designed to access, optimize and visualize data that will aid in the answering of questions and tracking of business performance. Dashboards, Reporting and Analysis fall under a category I refer as "Content Delivery". These methods of delivering information are the foundation of a typical BI solution. They provide the most common means for tracking performance and identifying problems that need attention. But..did you know, there is usually some sort of prep work to be done, before that chart or traffic light is displayed on your screen or printed in that report. That prep work can range from simple ETL scripting to provisioning more robust Data Warehouse and Metadata Repositories.

Data Integration

Content Delivery should begin first with some sort of Data Integration. In my 15 years in the BI space I have not seen one customer or prospect challenge me on this. They all have "data" in multiple silos. They all have a "need" to access it, consolidate it, extrapolate it and make it available for analysis and reporting applications. Whether they use it already as second-hand data, loaded into an Enterprise Data Warehouse for historical purposes, or produce Operational Data Stores, they are using Data Integration. Whether they are writing code to access and move the data, using a proprietary utility or even some ETL tool, they are using Data Integration. It is important to realize that not all data needs to be "optimized" out of the gate, as it is not only the data that is important. It is how it will be used in the day to day activities supporting the questions that will be asked. This requires careful planning and consideration of the overall objectives that the BI tools will be supporting.

Well, How do I know what tools to use? - Stay Tuned

With so many tools available, how will you know what is right for the organization? Thorough investigation of the tools through RFIs, RFPs, self evaluation and POCs are a good start. However, make sure you are selecting tools based on the ability to solve your specific current AND future needs and not solely because it looks cool and provides only the "sex and sizzle" the executives are after. The typical need is always Reporting, Analysis, Dashboards. Little realize that there is a lot more to it than those three little words. In the next part of this article I will cover a few of the most common "BI Profiles" that are in almost every organization. In each profile I will cover the Pains, Symptoms and Impacts that plague organizations today as well as the solution strategies and limitations you should be aware of when looking at Pentaho.

In the PDI 4 Cookbook you will find over 70 "recipes" that will not only answer the most common ETL questions, when working with Pentaho Data Integration, but also guide you through each exercise. Whether you are new to ETL or new to Pentaho, you will find that the Pentaho Data Integration 4 Cookbook accommodates many skill sets, from the novice to the expert. It is a great addition to the growing series of Pentaho books published by PACKT and Wiley.

Chapter 1 introduces you to working with databases and covers step by step how to connect PDI to your data so you can begin extracting, transforming and loading with ease. The chapter even shows you how to work with parameters...a very powerful feature of PDI.

Each section in the book clearly identifies the steps taken to perform the tasks with headers marked "Getting Ready", "How to do it" and then follows up with "How it Works". Very nice for those who need to understand what is happening inside the PDI ETL engine and behind the scenes.

The book continues with topics on working with files, XML, using Lookups, Data Flows, Jobs and goes into integrating Pentaho Data Integration transformations and jobs with the rest of the Pentaho BI Suite, leveraging such things as Pentaho Reports, Pentaho Action Sequences and the Community Dashboard Framework. I especially like the topics that are covered in Chapter 8; using Pentaho Data Integration with CDA (Community Data Access) and CDE (Community Dashboard Editor). This topic depicts greater extensibility of the Pentaho software by working with powerful Pentaho plug-ins contributed by Webdetails.

The book concludes with Chapter 9, which helps you get the most out of Pentaho Data Integration, by explaining how to work with PDI logging, JSON, custom programs and sample data generators.

If you are exploring the world of Pentaho, I would highly suggest picking up this book. It is great for beginners and for those experts (myself included) who thought they knew everything there was to know about Pentaho Data Integration and were pleasantly surprised by the additional knowledge gained.