Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.

Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve -- rather than rely on tools to think for you.

Use graphics to describe data with one, two, or dozens of variables

Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments

Mine data with computationally intensive methods such as simulation and clustering

Make your conclusions understandable through reports, dashboards, and other metrics programs

Philipp K. Janert

After previous careers in physics and software development, Philipp K. Janert currently provides consulting services for data analysis, algorithm development, and mathematical modeling. He has worked for small start-ups and in large corporate environments, both in the U.S. and overseas. He prefers simple solutions that work to complicated ones that don't, and thinks that purpose is more important than process. Philipp is the author of "Gnuplot in Action - Understanding Data with Graphs" (Manning Publications), and has written for the O'Reilly Network, IBM developerWorks, and IEEE Software. He is named inventor on a handful of patents, and is an occasional contributor to CPAN. He holds a Ph.D. in theoretical physics from the University of Washington. Visit his company website at www.principal-value.com.

I think that this is overall a really good book on data analysis. The author draws on his experience, which really adds a lot of value. I also like how he covers a lot of ground on different topics and really tries to clarify the concepts within the text. Overall, I'd definitely recommend this book to someone looking for a good introduction to all those topics. This isn't a reference book or technical manual.

Weak point would be some of the examples, but then I don't think that is the main focus of the book. He does cover a lot of different technologies. If a reader is interested in a specific case, I'd say grab an additional book that is more detailed on the subject (like an R or Python book, which can add value).

I think this book is most useful to someone with a basic understanding of statistics and programming who is familiar with, or has at least heard about, the concepts in the book. Initially, 2-3 years ago, I read this book and didn't understand it that well. Now after being a developer for some time and having studied statistics on my own, I can really appreciate this text more.

This book is the best I have seen for giving a concise and useful introduction to data analysis for developers. It gives practical examples of solutions to important problems. Now that the author has made code available on-line, the few complaints that others have raised have been resolved.

I have plenty of books with all the math. And there are plenty of books that cover these issues at a fluffy level. This book strikes a balance that is likely to make it much more useful to many more people. Frankly, I think it is better than either of the books I have written.

This book is good enough that I have not only recommended it to others, I have given it to others. Yeah... it is good enough to put my own money down.

This book is very much in line with what an O'Reilly text is all about. Clear, concise explanatory narrative with examples to illustrate and aid complete understanding. Not only is the author good, but the editors seem to have disciplined the flow of the book well also. I have nothing but praise for the way the author presents difficult things in easy terms. The author has a very clear understanding of the subjects discussed, so his explanations are clear and have made some things that were very difficult for me to comprehend for many years immediately obvious. Thank you, thank you all.

This book is enlightening. Not only does it discuss the techniques in sufficient detail, but it also gives a clear idea of what is behind them and how to use them. It really enables one to extract meaning out of the data.

As an example, chapter 9 about statistics is not your usual enumeration of p-values, t-tests and the like. Instead, it says everything that is NOT in your usual textbook. This chapter starts with this: "[here], I want to explain what classical statistics does, why it is the way it is, and what it is good for". Then the author goes on to explain the design goals of the usual methods. From those design goals, he draws conclusions about the situations in which the methods are useful, and about their limitations. This gives much more insight than what I had read before.

And the whole book is this way: full of rare but very useful information, and wisdom. It's also relatively easy to read (some of the most difficult parts are optional and marked as such) and contains many references.

However, it is not so much about software tools, though there are examples of how to use some of them.

Very good and comprehensive exercise by Mr Janert on how to produce "readable" graphs (read information) on top of massive data volumes, all with open source tools. This is not a book showing samples or "how-to" code that you can run easily on your app (HTML- or OS-based). Instead it goes much deeper than that, explaining the math that supports the data analysis, lots of the statistical theory underlying the data analysis processes, etc. This is not a book to read during commutes or trips in ebook format, but one to be read whilst at home or in a quiet place, giving it the care and attention it deserves.

There are also many good things about it: the Workshops provided are very good step-by-step descriptions of the process taken by Mr Janert to solve them. Given that the subject of the book is dense, this seems like the best way to help readers understand what has been talked about.

Many kinds of graphs (like jitter plots, scatter plots, mosaic plots, Kohonen maps, etc.) and the logic underlying them (logarithms, Pareto, regression, estimations, Monte Carlo simulation, etc.) are covered in this book. So I find it a great source of information that can be used as a superb reference book when developing projects requiring graphical analysis tools on big volumes of data.

I used this book as a reference--but it's not a 'Cookbook'. In fact, there are plenty of recipes for any of the techniques in this book available on the Net. What I don't often find there are well-written explanations of these techniques in sufficient detail to allow you to sit down and code your own implementation from start to finish. This is the real value of DAWOST. The explanations of Kohonen Maps and Discrete-Event Simulation are particularly excellent.

This is a book about how to design a strategy to understand the organization's data collected using statistical, graphical, analytical and reporting methods and open source tools. This book explains the major concerns about how to extract the information that the data tries to show about products, finances, processes and others. For that purpose, every information engineer should consider:

• The underlying properties of data
• The ways to represent the current status of the data
• The criteria to select relevant data and attributes
• The algorithms to analyze the selected data and attributes
• The ways to report the conclusions of the performed data analysis.

The author Philipp K. Janert takes a designer approach rather than an implementer approach. That means that you will gain important suggestions and tips to propose a plan for data analysis, instead of how to build an entire or partial information infrastructure using open source tools like Python, R, PostgreSQL and Weka.

Then, for some developers the lack of full programming constructs may be disappointing. However, I feel that Philipp K. Janert's main goal is to share with us his own professional experiences in real world enterprise analytical projects from a requirements perspective. In fact, many reference and recipe books cover deeply the aforementioned open source technologies so you can start to build a data analysis subsystem from zero, but without this book, you can lack the enterprise's point of view, something much more related to data architecture and data policies.

Although the implementer approach is not fully covered, you'll be able to understand how the analytical demands can be satisfied using specifically the programming languages Python and R, given their speed of execution, numerical analysis capabilities and cross-platform support. Each chapter contains both Philipp K. Janert's professional experience and the core programming snippets that make such concepts a programming asset.

In conclusion, if it is true that this book will not guide you to develop a data analysis tool with all the specific programming details of Python and R, it is also true that you will gain worthy professional experiences to design strategies, architectures and policies for data analysis.

This review was written for the O'Reilly Bloggers Review Program (oreilly.com/bloggers).

Data Analysis with Open Source Tools is an excellent primer for those who need an overview of the field of Data Analysis, along with pointers to some of the most popular free and open source tools. The book does have its shortcomings. There are places where the text is either confusing to read, or just wrong, and they should have a website with links to all of the tools presented, along with the data used with the examples, in order to make it easier for the reader to follow along with the examples and jump-start them playing with the tools presented. The presentation of a set of analytic techniques, followed by a workshop where one of the tools is introduced to work through a real example, is a strong point of this book. It both gives an opportunity to see the techniques in action and provides some pointers for those wanting more hands-on experience with a set of techniques. I also appreciated the informal description of classical statistics to help provide context to the subject. This is something that should accompany any introductory text on classical statistics. Overall, I would recommend this as the best primer to Data Analysis techniques that I have come across, but it still falls short of being an excellent book.

Bottom Line: Yes, I would recommend this to a friend

Merchant response: The author has provided some data sets and example files at http://examples.oreilly.com/9780596802363/

For data sets that are publicly available, he didn't replicate the data set, but included the URLs where you can find those data sets. Also, he explained that not all figures in the book have an attached data set; many figures are function plots or otherwise dynamically generated.

If you have any questions, feel free to post them in the forum (http://forums.oreilly.com/category/107/Data-Analysis-with-Open-Source-Tools/) or email booktech@oreilly.com, our book technical support.

If you are expecting a book filled with examples of NoSQL databases like Hadoop and Cassandra, you are definitely going to be disappointed. The key with this book is to look at the cover. Data analysis is the main point of this book, and open source tools are really just a nice sidebar. The data analysis information is fairly solid and ranges from some basic methods, through statistics and eventually getting to some machine learning methods like clustering and categorization. The open source tools portions of the book are based on examples of the various analysis methods, but do not delve too deeply into how the tools work.

Some people may think that some important statistical methods were missing, but the author follows each chapter with recommended resources. These recommendations end up being a huge collection of excellent books that you could review for deeper treatment of the various topics.

The appendices are fantastic. The first talks more about programming tools, the second gives a nice overview of some of the calculus used in the book, and the third talks about where to get the data you want to work with and how to work with it, like cleaning the data and normalizing it. In some cases, you may even want to start with the appendices before getting into the meat of the book.

This book has a bunch of examples of using various tools to try to learn something useful from large sets of data.

I often find myself in a situation where I have some millions of records of something, and I have to figure out what is going on. Much of the book is examples drawn from projects that the author worked on, but I find his ideas useful in my work.

The book is not a step-by-step cookbook, teaching you how to use various tools. It is much more about giving you ideas of what you might want to do, and then at the end of the chapter giving a short discussion on what tools the author has experience with for that type of work.

I would use this book to answer the question: "What type of analysis might be useful?" but not the question "How exactly do I perform the analysis using [GNUPlot or R or whatever]?"