As data science evolves into a separate and distinct scientific and business discipline, there is talk about the death of traditional statistics. It is true that today's large data sets are unlike the ones we analyzed in graduate statistics classes. It is also true that big data sets have different properties than small data sets. It is very true that one can lie with statistics and present an illusion of reality. It is certainly true that many professional statisticians lack business acumen and communication skills.

Yet, while the sins of statistics are well known, assertions of its death are premature. In fact, the proper use of statistics can help us understand massive quantities of data and help clients make better decisions. Statistics professionals are among the smartest folks I know and - properly used - should be part of a well-rounded team of data scientists.

Data science uses a myriad of tools and techniques, including but not limited to: math, statistics, computer science, hacking, business acumen, storytelling and business communication skills. Statistics is part of the tool kit and very important.

Prudent use of statistics can be very useful for finding meaning in messy, large data sets. Misuse or intentional abuse of statistics can mislead and present a false view of reality. At best, statistics helps us simplify complexity to make better decisions faster. At worst, statistics can define and measure the wrong things, create dangerous illusions of reality and cause us to make bad decisions.

One of my favorite books is "How to Lie with Statistics" (1954) by Darrell Huff. Huff demonstrated that you can intentionally lie with statistics or make unintentional errors, and he discussed specific methods used to fool rather than to inform. He showed how statistical methods are used in reporting massive amounts of data (e.g., hard and social science, economics, business conditions, polls, census), yet without honest, clear communication and real understanding the result is semantic nonsense at best and dangerous at worst.

"The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify," warned Huff. An modern example is Wall Street risk management models known as "Value at Risk" (VAR) prior to the 2008 financial meltdown. The various VAR models were complex and very precise - yet the assumptions embedded in the models were dangerously wrong and created an illusion of reality. Decision makers relying on VAR had a false sense of security that caused inaccurate conclusions - putting not only their own firms at risk but the entire global financial system.

Statisticians resent the old Mark Twain adage "Lies, Damned Lies and Statistics", yet the best and brightest statisticians on Wall Street (called "Quants") created a false view of reality that caused serious economic damage. If we are not careful, data science can also produce an illusion of reality that causes serious harm.

I argue the proper use of statistics can help us better understand complicated reality, make better decisions and make life better. Good quality data, scientific methods and the right statistical tools can help us find valuable, actionable insights in large data sets. Data scientists should have a solid grounding in statistical analysis - including concepts such as inference, correlation, causation and regression analysis. It is useful to understand - among other concepts - the mean, how the median is less influenced by outliers, standard deviation, how the weighting of index components affects results, correlation vs. causation, inflation-adjustment, precision vs. accuracy, the importance of using the appropriate unit of analysis, statistical vs. operational significance, and how performance data is sometimes manipulated.
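To make the outlier point concrete, here is a small numeric illustration - the salary figures are made up - of why the median is less influenced by outliers than the mean, using only Python's standard library.

```python
# Made-up salary figures illustrating mean vs. median sensitivity to outliers.
from statistics import mean, median, stdev

salaries = [52_000, 54_000, 55_000, 58_000, 60_000]
print(mean(salaries), median(salaries))        # 55800 55000

# Add one extreme outlier: the mean shifts dramatically, the median barely moves.
salaries_with_outlier = salaries + [2_000_000]
print(mean(salaries_with_outlier))             # 379833.33...
print(median(salaries_with_outlier))           # 56500.0
print(round(stdev(salaries_with_outlier)))     # a huge spread driven by one value
```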

We should always be skeptical and understand how the biased or careless can manipulate or misrepresent data. Further, it is crucial to understand the distinction between "precision" and "accuracy." Precision means the state or quality of exactness and the ability of a measurement to be consistently reproduced - for example, saying my office is 8.4 miles from my home rather than 8 miles. Accuracy means a faithful measurement or representation of the truth - for example, saying my office is 8.4 miles west of my home. A problem arises when I tell you my office is 8.4 miles east of my home - precise but not accurate. Statistical analysis will sometimes be precise but not accurate - like the Wall Street VAR models.
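The same distinction can be put in numbers. The sketch below is a made-up illustration (the readings and the 8.4-mile "true" distance are hypothetical): bias measures accuracy, while spread measures precision.

```python
# Toy illustration of precision vs. accuracy; all numbers are hypothetical.
from statistics import mean, stdev

true_distance = 8.4

precise_but_inaccurate = [9.21, 9.20, 9.22, 9.21]   # tightly clustered, wrong center
accurate_but_imprecise = [8.1, 8.7, 8.3, 8.5]       # scattered, but centered near 8.4

for name, readings in [("precise/inaccurate", precise_but_inaccurate),
                       ("accurate/imprecise", accurate_but_imprecise)]:
    bias = mean(readings) - true_distance            # accuracy: distance from the truth
    spread = stdev(readings)                         # precision: reproducibility
    print(f"{name}: bias={bias:+.2f} miles, spread={spread:.2f} miles")
```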

Beware and be skeptical of key assumptions embedded in statistical models. Are you defining and measuring the right thing(s) to obtain understanding?

Most important, we need clarity about what we are attempting to define, measure, describe or explain. The paramount goal is clear, simple communication to the consumers of data science - the decision makers - so they understand the analysis and can make optimal decisions.

Smart process applications are a new and emerging category of applications designed to improve human-based knowledge and business processes.

The opportunities to use technology to automate processes through the elimination of manual work are dwindling. Humans are essential elements in the remaining business processes and activities. Smart process apps and vendors help make these people smarter and more effective.

In a transactional process app, the end goal is as little human involvement as possible. The ideal is a fully automated process. People may of course initiate the transaction (such as a purchase) or be a recipient of the results of a transaction system. They may also be involved in handling exceptions, though the goal there is to minimize that over time. Examples would be applications for core human resource management, eCommerce, sales force automation, invoice automation and procurement, core financial management, and the like.

In a smart process app, people are an inherent and desired part of the process or activity. The end goal is to make people more effective and productive participants in a business process, not to reduce or eliminate their involvement. These human-based processes or activities range from relatively simple cases involving one to three people in handling and resolving a case, to service delivery situations involving similar numbers of people handling less predictable and structured service problems, to some or many people working on a project over time, to many people working on a complex operation in unstructured conditions. Software to improve this range of human-based activities or processes is what we include in smart process apps.

In reality, there is no bright line that divides transactional process apps from smart process apps. That’s because there is a spectrum of business processes that ranges from those with minimal human involvement to those with intense human participation.

Straight-through processes at one end. At this end of the spectrum, processes and applications like order processing or vendor-managed inventory that provide straight-through processing are transactional applications. Processes where human involvement is limited to actions like making a purchase, or to receiving output such as the reports of a financial management system, are also transactional apps. Transactional apps also include applications where human involvement is limited to handling exceptions the system could not handle, because a design goal for these applications is to reduce the number of exceptions that require human involvement.

Collaborative activities at the other. At the other end sit highly collaborative activities like the operations of firefighters at a fire scene, a medical team responding to a disaster, or teachers and school administrators preparing a high school class schedule. Applications to support services delivery or projects would also be collaborative applications.

A gray area in between, where the boundary line can be fuzzy. The gray area is for applications like sales force automation or talent management where people are involved as both initiators and recipients of information on a case-by-case basis. Sales force automation is a transactional application, because the human involvement is almost always one person putting information in (e.g., a salesperson entering the results of a sales call) and one person pulling information out (e.g., the same salesperson looking to see all interactions with a client in the past month). Talent management applications have those aspects as well, but they also involve collaboration between employee and manager to discuss ratings and potential improvements, as well as collaboration between managers to assess top performers and their career paths.

Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology. Consuming applications may include business intelligence, analytics, CRM, enterprise resource planning, and more, across both cloud computing platforms and on-premises systems.

Data Virtualization Benefits:

Decision makers gain fast access to reliable information

Operational efficiency improves through more flexible, agile integration, because virtual data stores can be created in short cycles without touching the underlying sources

Data virtualization abstracts, transforms, federates and delivers data from a variety of sources and presents itself as a single access point to a consumer regardless of the physical location or nature of the various data sources.

Data virtualization is based on the premise of abstracting the data contained within a variety of data sources (databases, applications, file repositories, websites, data services vendors, etc.) in order to provide a single point of access to that data. Its architecture is based on a shared semantic abstraction layer, as opposed to limited-visibility semantic metadata confined to a single data source.
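As an illustration of that abstraction layer, here is a minimal, hypothetical sketch in Python: one access interface in front of two very different stores (an in-memory SQLite table and a CSV export). The class and method names are illustrative and not drawn from any particular vendor's product.

```python
# A minimal, hypothetical data virtualization sketch: one logical view over two sources.
import csv
import io
import sqlite3

class VirtualCustomerView:
    """Presents 'customers' as a single logical source, hiding where rows live."""

    def __init__(self, sql_conn, csv_text):
        self.sql_conn = sql_conn
        self.csv_text = csv_text

    def rows(self):
        # Source 1: relational store, accessed through SQL.
        for cid, name in self.sql_conn.execute("SELECT id, name FROM customers"):
            yield {"id": cid, "name": name, "source": "warehouse"}
        # Source 2: flat file, accessed through the csv module.
        for rec in csv.DictReader(io.StringIO(self.csv_text)):
            yield {"id": int(rec["id"]), "name": rec["name"], "source": "crm_export"}

# Consumers query the view; they never see the underlying storage details.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
view = VirtualCustomerView(conn, "id,name\n2,Globex\n")
print([r["name"] for r in view.rows()])   # ['Acme Corp', 'Globex']
```

A consumer of the view sees one logical "customers" source; whether a row came from the warehouse or the CRM export stays hidden behind the interface.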

Data virtualization is an enabling technology which provides the following capabilities:

Data Science Group Event: University of Colorado Denver - Tuesday, May 21, 2013

RECOMMENDATION ENGINES - ABSTRACT

Recommendation Engines (REs) are software tools and techniques that provide item suggestions to a user. The massive growth and variety of information can often overwhelm, leading to poor decisions. While choice is good, more choice is not always better. REs have proved in recent years to be a valuable means of coping with the information overload problem.

In their simplest form, personalized recommendations are offered as ranked lists of items. In performing this ranking, REs try to predict the most suitable products or services for a user, based on that user's preferences and constraints. To complete this computational task, REs collect preferences from users, which are either explicitly expressed (e.g., as ratings for products) or inferred by interpreting user actions. For instance, an RE may treat navigation to a particular product page as an implicit sign of preference for the items shown on that page. Amazon's RE, for example, relies on a basic formula (collaborative filtering) that suggests products to you based on your viewing history, your purchase history and which related products other customers bought. (A minimal sketch of this approach appears after the event details below.)

BIO

Tom Rampley is a data scientist with a background in finance and psychology. He received his MBA from Indiana University's Kelley School of Business in 2012, with concentrations in finance and business analytics. Since graduation, he has been working within the Viewer Measurement group at Dish Network LLC on customer segmentation models, the development of recommendation engines, and the implementation of big data IT platforms. He prefers R to SAS, Python to any other scripting language, and while trained as a frequentist currently considers himself Bayes-curious. Outside of work he is married with no kids (yet), a lifelong martial artist, and endlessly nostalgic for the days when he played lead guitar in his grad school rock band. This is his first Data Science meetup presentation.

ACCUMULO - SQRRL NOSQL DATABASE - ABSTRACT

Apache Accumulo is an open-source, highly secure NoSQL database created in 2008 by the National Security Agency. It easily integrates with Hadoop, can securely and cost-effectively handle massive amounts of structured and unstructured data at scale, and enables users to move beyond traditional batch processing and conduct a wide variety of real-time analyses. Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is a system built on top of Hadoop, ZooKeeper and Thrift. Written in Java, Accumulo has cell-level access labels and a server-side programming mechanism.

Accumulo offers "Cell-Level Security" - extending the BigTable data model by adding a new element to the key called "Column Visibility". This element stores a logical combination of security labels that must be satisfied at query time in order for the key and value to be returned as part of a user request. This allows data with varying security requirements to be stored in the same table, and allows users to see only those keys and values for which they are authorized.

Sqrrl Enterprise, developed by Sqrrl Data, is the operational data store for large amounts of structured and unstructured data. It is the only NoSQL solution that scales elastically to tens of petabytes of data and has fine-grained security controls. Sqrrl Enterprise enables development of real-time applications on top of Big Data.
Sqrrl uses HDFS for storage, Accumulo for security and speed of access, and the Thrift API for interactivity, and it works with map/reduce, visualizations, third-party software, and existing schema-explored databases. This presentation reviews Accumulo and Sqrrl Enterprise.

BIO

John Dougherty is CIO for Viriton, a consulting and systems integration organization. He is the organizer for Big Data for Business, helping to apply Big Data concepts to C-suite perspectives. He began applying technology-driven strategies in the early nineties, and has continued to incorporate blue blood technologies in forward-thinking solutions.
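The collaborative filtering idea mentioned in the recommendation engine abstract above can be sketched in a few lines. This is a toy item-based example with made-up users, items and ratings; it is not Amazon's formula or anything presented at the talk.

```python
# A toy item-based collaborative filtering sketch (illustrative only; the data is hypothetical).
from math import sqrt

# user -> {item: rating} (explicit preferences; implicit signals could be treated as ratings too)
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "camera": 4.0},
    "bob":   {"book_a": 4.0, "book_b": 4.0},
    "carol": {"book_b": 2.0, "camera": 5.0, "tripod": 4.0},
    "dave":  {"camera": 4.0, "tripod": 5.0},
}

def item_vectors(ratings):
    """Pivot the user->item ratings into item -> {user: rating} vectors."""
    items = {}
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[k] * v[k] for k in common)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

def recommend(user, ratings, top_n=3):
    """Score unseen items by their similarity to the items the user already rated."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vec in items.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(cosine(vec, items[liked]) * r for liked, r in seen.items())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend("bob", ratings))   # suggests "camera" ahead of "tripod" for bob
```

Implicit signals such as page views can be folded in by treating them as low-weight ratings, which is how the inferred preferences described in the abstract typically enter the computation.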
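Accumulo's cell-level security can also be illustrated conceptually. The snippet below is not the Accumulo API; it is a simplified Python stand-in that evaluates a flat visibility label expression (OR of AND terms, no parentheses) against a user's authorizations, which is the essence of the "Column Visibility" check described above.

```python
# Conceptual stand-in for Accumulo column visibility; real expressions also allow parentheses.
def visible(expression, authorizations):
    """Return True if the user's authorizations satisfy the label expression.

    The expression is treated as OR-of-AND terms, e.g. "admin&audit|public"
    means (admin AND audit) OR public.
    """
    if not expression:          # unlabeled cells are visible to everyone
        return True
    for or_term in expression.split("|"):
        if all(label.strip() in authorizations for label in or_term.split("&")):
            return True
    return False

# A cell is returned only if the scanner's authorizations satisfy its label.
cells = [
    ("row1", "cf:cq1", "admin&audit", "salary=100k"),
    ("row1", "cf:cq2", "public", "name=Jane"),
]
user_auths = {"public", "admin"}
print([value for _, _, label, value in cells if visible(label, user_auths)])
# -> ['name=Jane']  (the admin&audit cell is filtered out: the user lacks 'audit')
```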

Business intelligence (BI)/analytics investment will address many technology gaps for the CFO, according to a joint study by Gartner, Inc., and Financial Executives Research Foundation (FERF), the research affiliate of Financial Executives International (FEI). The study shows that 15 of the top 19 business processes that CFOs have identified as requiring improved technology support are largely addressed by BI, analytics and performance management technologies.

"Responses to the 2013 Gartner Financial Executives International (FEI) CFO Technology Study are consistent with prior years, with the emphasis on business intelligence/analytics and business applications as the top areas for investment and focus," said John van Decker, Gartner research vice president. "With over 20 areas of choices, all of the top 12 that were chosen by CFOs can be addressed and/or improved with investments in BI and analytics."

The survey results showed that the top business process area that needs technology investment is facilitating analysis and decision making (59 percent, up from 57 percent in 2012), followed by the ongoing monitoring of business performance (50 percent), and then collaboration and knowledge management (45 percent, down from 52 percent in 2012). These results are consistent with those of the last five years, which show that organizations are still struggling to make progress with BI and analytics. Many IT organizations have made initial investments, but these tend to be tactically focused and don't address the more fundamental issues of data quality and consistency, which require CFOs and finance teams to work closely with BI specialists in IT.

From an enterprise perspective, BI and business applications continue to dominate the CFO's IT investment desires, although they are somewhat behind where they were in 2012. Gartner believes that this is due to the increasing importance of nexus technologies, as those selections have increased significantly in 2013.

"The survey findings would seem to suggest that the CFO prioritizes business applications higher than the CIO does," said Bill Sinnett, senior director, research at FERF. "If the CIO does not understand this, then there's a chance the CFO will sponsor his or her own initiatives, and not coordinate them with the IT organization. This demonstrates the trend that BI is becoming less of a CIO responsibility and more of a CFO and line-of-business responsibility."

Corporate performance management (CPM) projects are the highest on the CFO's BI initiatives list, according to the survey. The top four priorities in this area are addressed by CPM suites, including performance scorecards; budgeting, planning and forecasting; financial consolidation; and profitability management.

The survey also found that CFOs' understanding of the Nexus of Forces is affecting CFO investment priorities. Enterprises are being challenged to adapt as the nexus of social, mobile, cloud and information - and the data that results from their adoption - expands exponentially. Social scored low in terms of technology initiatives, but mobile, cloud (including software as a service [SaaS]) and information are priorities. "CFOs have a strong interest in cloud and mobile technologies. SaaS (and cloud-based delivery) is starting to affect business applications. Many CFOs use mobile devices and would be interested in getting access to key business information using these tools," said Mr. Van Decker. "CIOs should use this interest to show how wider investments in cloud and mobile technology could deliver benefits across the organization."

Although these nexus capabilities will be more of a concern in 2014 and beyond, IT organizations must communicate how more-effective business platforms can be leveraged to deliver better architectures for the business applications that are "top of mind" for the CFO. For example, it would be sensible to include the CFO in mobile device deployment to allow him or her to access finance information and analytics. At the same time, CFOs are clearly skeptical about the potential of social technologies, so any investments in this area must be clearly related to business strategies and realizable benefits.

"The CFO's influence over IT is consistent and, in many organizations, is growing. We have seen in the study that a large percentage of CFOs own the IT function. This year's responses show that 39 percent of IT organizations currently report to the CFO," said Mr. Van Decker. "This high level of reporting to the CFO demonstrates the need for companies to ensure that their CFOs are educated in technology, and underscores just how critical it is that CIOs and CFOs have a common understanding of how to leverage enterprise technology."

More detailed analysis is available in the report "Survey Analysis: CFOs' Top Imperatives From

Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget?

Robin Bloor, Ph.D., Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc., present the findings of their three-month research project focused on the evolution of database technology. They offer practical advice on the best way to approach the evaluation, procurement and use of today's database management systems. Bloor and Madsen clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.

Smart data scientists use data virtualization to integrate data from many diverse sources, presenting it logically and virtually for on-demand consumption by different analytical applications. For example, data virtualization is used to address challenges such as rogue data marts, business intelligence apps, enterprise resource planning and content systems, and portals.

At the 2013 TDWI World Conference and BI Executive Summit in Las Vegas, speakers and attendees chewed over some of the meatiest trends and hottest technologies in the business intelligence and data warehousing market. One discussion thread centered on strategies for finding business value in big data through the use of technologies such as Hadoop. Another focused on the need to make BI applications more enticing to business users, with more eye-catching designs and interactive elements. Operational intelligence, self-service BI and mobile BI were also on the agenda, as were Agile BI and data warehousing, enterprise information management and big data management.

Humans have a powerful capacity to process visual information, a skill that dates far back in our evolutionary lineage. And since the advent of science, we have employed intricate visual strategies to communicate data, often utilizing design principles that draw on these basic cognitive skills. In a modern world where we have far more data than we can process, the practice of data visualization has gained even more importance.

From scientific visualization to pop infographics, designers are increasingly tasked with incorporating data into the media experience. Data has emerged as such a critical part of modern life that it has entered into the realm of art, where data-driven visual experiences challenge viewers to find personal meaning from a sea of information, a task that is increasingly present in every aspect of our information-infused lives.