Supports programming languages such as C++, Python, Java, or Perl rather than an internal, ad-hoc language (even though an ad-hoc language might be more efficient) (*)

Learning curve (*)

Easy to upgrade (*)

Possibility to work on the cloud, or use map reduce and NoSQL features

Real time features (can be integrated in a real time system, such as auction bidding)

Other criteria?

In some cases, three products might be needed: one for visualization, one with a large number of functions (neural networks, decision trees, constrained logistic regression, time series, simplex) and one that will become your "workhorse" to produce the bulk of heavy-duty analytics.

What are your top criteria? Related questions:

When do you decide to upgrade or purchase an additional module to an existing product - e.g. SAS Graph or SAS Access on top of SAS Base?

Who is responsible, in your organisation, for making the final decision on the choice of analytic software? End users? Statisticians? Business people? CTO?

Replies to This Discussion

Your list is comprehensive, and I largely agree with your starred items. However, my focus as an enterprise architect is on bringing things into production inside businesses quickly and effectively, and making business users as independently creative as possible (at least until the point where they risk becoming a danger!)

So I would make these modifications:

9. Capability to fetch data on the Internet, or from database (SQL supported).

This will become really important, very quickly, as people start to use curated open data sets. Look at http://www.freebase.com/ and you'll see what I mean. This reduces preparation work for "well known" datasets and starts to achieve convergence between practitioners and will help achieve cross-industry standardisation.

This is important for an enterprise production environment. Otherwise we're going to be frustrated by "process": initiatives will founder (due to compliance, MDM focus, and auditor sign-off challenges) until we make major advances in this area.

25. Possibility to work on the cloud, or use map reduce and NoSQL features

You will increasingly need this for curated open data sets (see above), as well as the inevitable social network side of things. It may be a staging area for data preparation (prior to combining it with your in-house data) but you definitely need it.

26. Real time features (can be integrated in a real time system, such as auction bidding)

Again, this is going to be so important. In production things will change steadily (and sometimes suddenly), and you can't take the time to build a new model at your leisure, stop the production environment, introduce the model, and re-start things. Most tools today are static: you load everything into memory at the beginning of your session, and then work with what you've got. That won't cut it in these circumstances. This is all about out-performing your competition - Colonel John Boyd's OODA Loop principle (http://en.wikipedia.org/wiki/OODA_loop) is going to become increasingly important to businesses.
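One common way to avoid the stop-and-restart cycle is to keep the scoring model behind a small holder object, so that a freshly built model can replace the old one while requests keep flowing. The following is only a minimal sketch of that idea in Python; the `ModelHolder` class and its method names are hypothetical, and the "models" are assumed to be plain callables rather than objects from any particular analytic product.

```python
import threading


class ModelHolder:
    """Serve predictions from the current model and allow it to be replaced live.

    Hypothetical illustration: a "model" here is any callable that maps
    features to a score.
    """

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, features):
        # Take a reference to the current model under the lock...
        with self._lock:
            model = self._model
        # ...but score outside it, so a swap never waits on a slow prediction.
        return model(features)

    def swap(self, new_model):
        # Replace the model; calls that start after this see the new one.
        with self._lock:
            self._model = new_model


if __name__ == "__main__":
    holder = ModelHolder(lambda bid: bid * 2)   # placeholder "old" model
    print(holder.predict(3))
    holder.swap(lambda bid: bid + 1)            # placeholder "new" model
    print(holder.predict(3))
```

Requests already in flight finish on the model they started with, and nothing has to be shut down to introduce the new one. A real-time bidding system would need much more than this (validation of the new model, rollback, monitoring), so treat it as a starting point only.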

In my case, the first criterion is price. I work for a small consulting group that (supposedly) can't afford a large investment in an expensive analytic tool. Also, our business model involves a small group of full-time employees supplemented by (well-paid) hourly consultants who supply their own technology. So, if I interview a new consultant, I can't say "by the way we use a technology that will cost you $1000 upfront, so go ahead and buy a license".

A second criterion is delivery mode. For the moment, our client is likely to receive something that they can view; so far, that has been PDF files converted from MS Word. That means simple visual tasks that can be completed in Excel are strong candidates where automation is not an issue.

Another criterion is learning curve. I tend to hire graduating statisticians (M.S.) with a programming background. With that said, the programming can't be too difficult. In other words, I'm better off sticking with a technology they may have seen during a previous job or in school. So, R, SAS, or Minitab is a good place to start. However, only R meets my company's "low-budget" requirement.

That's about it. There are two other candidates:

Python because it's used for data processing (free) and has an analytic suite associated with it. (There are probably alternatives in Ruby and Perl if I were to research them.)

Any database that has inline analytics. If your data is going to be stored in a database anyway, it's worth considering just doing the analytics there and exporting the results for visualization in Excel. At the moment, we use Oracle 11g (mostly the free Express edition). I'm assuming that MySQL or PostgreSQL would do just as well. They appear to install a bit more easily and I just picked up a book on the latter that might motivate me to test it out. http://www.manning.com/obe/
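To make the "do the analytics in the database, then export for Excel" idea concrete, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module purely as a stand-in for Oracle, MySQL, or PostgreSQL, and the `orders` table, its columns, and the rows in it are all made up for illustration.

```python
import csv
import io
import sqlite3


def summarise_in_database(conn):
    """Let the database do the aggregation and return one row per region."""
    query = """
        SELECT region,
               COUNT(*)    AS n_orders,
               AVG(amount) AS mean_amount,
               SUM(amount) AS total_amount
        FROM orders
        GROUP BY region
        ORDER BY region
    """
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    return header, cursor.fetchall()


def to_csv(header, rows):
    """Serialise the summary as CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def demo_connection():
    """Build a throwaway in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 100.0), ("east", 300.0), ("west", 50.0)],
    )
    return conn


if __name__ == "__main__":
    header, rows = summarise_in_database(demo_connection())
    print(to_csv(header, rows))
```

Only the small summary table leaves the database, which is the point: the raw rows stay where they are stored, and Excel only has to chart a handful of lines. Against a production database you would swap the connection for that database's own driver, and more advanced in-database analytics would depend on what that product offers.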

Well. I'm sure that's more information than you wanted to know. I hope I at least answered your question. At least it's something that I and every analyst who works with me give serious thought to. We second-guess ourselves weekly as no choice is ever perfect and that other tool always seems so much better!

Ah, isn't this a fundamental question in Data Science! Brian has added many practical priorities that I reword here: licensing and renewal cost, skills training of existing staff and screening potential hires. Some others are "contractual obligations" (you'd be surprised how many contracts say "Excel" or "SAS"), legacy systems (some mandated by law), cost of inertia, lack of resources and lack of understanding. A key variable is the "state" of the business. Anyone familiar with Credit Scoring models (to take one well-established vertical) knows that there is much standardization, even at the API level, because this is a well-worn path and the number of informative fields or dimensions is low. Many businesses along these lines happily use SAS and perhaps Matlab. For such an organization, continuity is a must; however, the cost in terms of technical debt grows quickly. Those organizations are often unsuccessful in moving to the "next generation" toolkits, and my focus is NOT on those organizations.

With Big Data, most people talk only of simple aggregation, as in Hadoop. As Hadoop was designed for cheap storage, the nodes in its cluster are not really powerful (because that precisely was the idea!). For that reason, when you try to build models where the training data is distributed over 60 nodes with weak computing power and RAM, many algorithms cannot be parallelized well enough at this time. However, many useful algorithms can still be made to work by careful message passing and advanced memory management techniques.

Now I've illustrated two extremes: entrenched business practice and headstrong database developers. Both parties are reluctant to throw out their hard-earned expertise in particular products. For a smaller and growing company, the question is a fundamental one. As computing becomes ever more important, the application of analytic techniques cannot stay insulated from the nitty-gritty of computing. And although the cloud now allows for much cheaper access to computing and exponential scaling, more thought is needed before diving willy-nilly into it.

The solution? For my teams, I have drawn a line between "Runtime" (handled by Perl or Python) and the algorithms. The former is for data munging (and yes, you can create an API for it); the latter could be in SAS or Java or C++, where computational power is needed and not everyone needs to know the output of every step.

In the past year, in my personal time, I have implemented Regression models and clustering in pure C++ (nothing needing licensing), using only one linear algebra library, Eigen. Getting Linear Regression and Ridge Regression to work well, and to produce output identical to R up to 10 digits, did not take that long. What's the benefit, one might ask? In this way, I distilled my practitioner knowledge into my software and paved the way for automatically building Regression models and doing clustering. Until we do that, we would NOT know where the sticking point is. I talk specifically of Least Squares regression here because many algorithms build on it.
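For readers who want to see how little is needed for the core of least squares and ridge regression, here is a minimal sketch. It is not the C++/Eigen implementation described above: it is plain Python with no libraries, it solves the normal equations with a hand-written Gaussian elimination, and the function names `solve_linear_system` and `ridge_fit` are made up for this illustration. It also assumes the intercept is left unpenalised, which is one common convention among several.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def ridge_fit(rows, y, lam=0.0):
    """Return [intercept, slope_1, ...] minimising ||y - Xb||^2 + lam * ||slopes||^2.

    lam = 0 gives ordinary least squares. The intercept is not penalised
    (an assumption of this sketch).
    """
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(x[0])
    # Normal equations: (X^T X + lam * D) b = X^T y, with D = diag(0, 1, ..., 1).
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(p)]
    for i in range(1, p):
        xtx[i][i] += lam
    return solve_linear_system(xtx, xty)


if __name__ == "__main__":
    rows = [[0.0], [1.0], [2.0], [3.0]]   # toy data lying exactly on y = 1 + 2x
    y = [1.0, 3.0, 5.0, 7.0]
    print(ridge_fit(rows, y))             # ordinary least squares
    print(ridge_fit(rows, y, lam=5.0))    # slope shrunk towards zero
```

The normal equations are the easiest route to explain but not the most numerically robust one. R's `lm` fits via a QR decomposition, so a sketch like this should not be expected to agree with R to 10 digits on ill-conditioned data; a library such as Eigen provides the decompositions needed for that.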

In summary, understanding your business and strategy is absolutely essential. No one tool or two will solve all your problems today and tomorrow _unless_ it has also been solving your problems for the past year.

@Vincent and @Sandeep - great post, and great points! There is an awful lot to consider when choosing an analytic tool, and the wide variety of options can be overwhelming.

Prior to going down your list of things to consider, the person making the decision needs a clear definition of the requirements for the project: how much data, what sort of response times are needed, which systems does it need to integrate with, and what is the business product of the project (i.e. what sort of analysis needs to be done, and how often does it need to be done)? These requirements will help narrow down the set of tools and systems to consider.

Back when I was COO at an ecommerce company, we made technology decisions frequently, and used Total Cost of Ownership (TCO) over 3 years to compare solutions.

In TCO, the goal is to look at all the costs for each solution: the software license of course, but also the hardware costs, training, cost of integration and implementation, and also cost of ongoing maintenance. For example, if you're going with a solution that requires dozens of Hadoop servers, you need to include the monthly costs, such as power and IT support (e.g. for patching the OS regularly) if the servers are on premises, or the monthly cost if they are in the cloud (and those EC2 instances can get expensive quickly!).

For training, implementation, integration, and operations, in addition to the fees you may pay to outside vendors and consultants, you should assign a cost per hour for any employees who will be doing this work as well. Including the cost of your employees' time is a shorthand way to address opportunity costs, which can be difficult to estimate. An opportunity cost is the cost of not doing things because your people are busy with this project - so the more of your people's time it takes, the fewer other things they can get done.
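The TCO comparison described above boils down to simple arithmetic, which can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `Solution` fields, the split into one-off and recurring costs, the hourly rate, and all of the figures, which are invented placeholders rather than real prices.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    """Cost inputs for one candidate tool; every figure is a made-up placeholder."""
    name: str
    license_per_year: float            # software license or subscription
    hardware_upfront: float            # servers bought once (0 if cloud-only)
    hosting_per_month: float           # power and IT support on premises, or the cloud bill
    vendor_fees: float                 # one-off fees to outside vendors and consultants
    employee_hours: float              # internal hours for training, integration, implementation
    maintenance_hours_per_year: float  # internal hours for ongoing operations


def total_cost_of_ownership(s: Solution, hourly_rate: float, years: int = 3) -> float:
    """Sum every cost category over the comparison window."""
    one_off = s.hardware_upfront + s.vendor_fees + s.employee_hours * hourly_rate
    recurring_per_year = (
        s.license_per_year
        + 12 * s.hosting_per_month
        + s.maintenance_hours_per_year * hourly_rate
    )
    return one_off + years * recurring_per_year


if __name__ == "__main__":
    candidates = [
        Solution("commercial suite", 20000, 5000, 200, 10000, 100, 50),
        Solution("open-source stack", 0, 0, 1500, 5000, 400, 200),
    ]
    # Cheapest option first, with employee time priced at a placeholder rate.
    for s in sorted(candidates, key=lambda c: total_cost_of_ownership(c, hourly_rate=80)):
        print(f"{s.name}: {total_cost_of_ownership(s, hourly_rate=80):,.0f}")
```

Pricing employee hours at an explicit rate is what brings the opportunity cost into the comparison: a tool with no license fee can still come out more expensive once the hours your own people spend on it are counted.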