Monday, November 17, 2014

Machine Learning (ML) is one of the most popular fields in computer science, but it is also one of the most feared by developers. The fear stems primarily from ML being considered a scientific field that requires deep mathematical expertise, which most of us have forgotten. In today's world, ML has two disciplines: ML and Applied ML. My goal is to make Machine Learning easier to understand for developers through simple applications; in other words, to bridge the gap between a developer and a data scientist. In this blog, I will provide you with a step-by-step guide for building a Linear Regression model in AzureML to predict the price of a car. You will also learn the basics of AzureML along the way, as well as how to apply it in the real world by creating a Windows Universal client app.
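Before diving into AzureML, it helps to see what a linear regression actually computes. The sketch below fits a one-feature model (engine size to price) by ordinary least squares in plain Python; the engine sizes and prices are invented for illustration and are not the automobile dataset used in the experiment.

```python
# A minimal local analog of the AzureML linear-regression experiment:
# fit price = a * engine_size + b by ordinary least squares.
# All numbers below are made up for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

engine_size = [97, 109, 130, 164, 209]      # hypothetical feature values
price       = [7500, 9000, 13500, 17500, 23500]

a, b = fit_line(engine_size, price)
predicted = a * 150 + b   # predicted price for an engine size of 150
print(round(predicted))
```

AzureML does the equivalent fit (over many features) visually, then lets you publish the trained model as a web service.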

What is AzureML?

AzureML is meant to democratize Machine Learning and build a new ecosystem and marketplace for monetizing algorithms. You can find more information about AzureML here.

Why AzureML?

Because it is one of the simplest tools to use for Machine Learning. AzureML reduces the barriers to entry for anyone who wants to try out Machine Learning. You don’t have to be a data scientist to build Machine Learning models anymore.

Logical Machine Learning Flow

The figure below illustrates a typical machine learning process with the end result in mind.

...

...

Conclusion

AzureML is a new and highly productive tool for Machine Learning. It may be the only tool that lets you publish a machine learning web service directly from your design environment. Machine Learning is a vast topic, and the Linear Regression model discussed in this article only scratches the surface. In this article, I used a stale dataset to showcase AzureML as a predictive analytics tool, but you can apply the same procedures and components to Classification and Clustering models. Finally, my goal in writing this was to focus on Applied Machine Learning. I am not a Data Scientist, but with all of today's productive tools, I feel that I can put to work some of the great algorithms that scientists have already invented.

Thursday, September 25, 2014

Recently, SQL community feedback from Twitter prompted me to look, in vain, for SQL Server 2014 versions of the AdventureWorks sample databases we’ve all grown to know & love.

I searched Codeplex, then used the bing & even the google in an effort to locate them, yet all I could find were samples on different sites highlighting specific technologies, an incomplete collection inconsistent with the experience we users had learned to expect. I began pinging internally & learned that an update to AdventureWorks wasn’t even on the road map.

2. Where it makes sense to do so, consolidate the DBs (e.g., showcasing Columnstore likely involves a separate DW DB)

3. Documentation to support experimenting with these features

As Microsoft Senior SDE Bonnie Feinberg stated, “I think it would be great to see an AdventureWorks for SQL 2014. It would be super helpful for third-party book authors and trainers. It also provides a common way to share examples in blog posts and forum discussions, for example.”

Wednesday, September 03, 2014

You love Q&A sites like StackOverflow.com and DBA.StackExchange.com, but sometimes it’s hard to find interesting questions that need to be answered. So many people just sit around hitting refresh, knocking out the new incoming questions as soon as they come in. What’s a database person to do?

Use the power of the SQL.

Data.StackExchange.com lets you run real T-SQL queries against a recently restored copy of the StackExchange databases. Here’s my super-secret 3-step process to find questions that I have a shot at answering.

Step 1. Find out how old the restored database is....

...

Step 2. Find questions everybody’s talking about....

...

Step 3. Find questions that people keep looking at....

...

...
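The elided steps above boil down to filtering and sorting the Posts table. As a rough local analog, here is step 3 in miniature; the column names (ViewCount, AnswerCount, CreationDate) follow the public Stack Exchange schema, but the sample rows are invented.

```python
# Step 3 in miniature: unanswered questions that people keep looking at,
# most-viewed first. Rows are invented; columns follow the public schema.
from datetime import date

posts = [
    {"Id": 1, "Title": "Index rebuild vs reorganize?", "ViewCount": 900,
     "AnswerCount": 0, "CreationDate": date(2014, 9, 1)},
    {"Id": 2, "Title": "Why is tempdb growing?", "ViewCount": 150,
     "AnswerCount": 2, "CreationDate": date(2014, 9, 20)},
    {"Id": 3, "Title": "Deadlock on parallel insert", "ViewCount": 400,
     "AnswerCount": 0, "CreationDate": date(2014, 9, 18)},
]

hot_unanswered = sorted(
    (p for p in posts if p["AnswerCount"] == 0),   # no answers yet
    key=lambda p: p["ViewCount"],                  # most-viewed first
    reverse=True,
)
for p in hot_unanswered:
    print(p["Id"], p["ViewCount"], p["Title"])
```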

Why run a web query when you can just SQL your way through StackExchange? I don't know about you, but I often dream in SQL (no lie... sigh), so this approach to StackExchange struck a chord with me. Now, if only I were actually smart enough to provide good answers... :O

Tuesday, April 29, 2014

Data mining and data analysis: these are two terms that very often give the impression of being very hard to understand, complex, as if you’re required to have the highest level of education in order to understand them.

...

By learning from these books, you will quickly uncover the ‘secrets’ of data mining and data analysis, and hopefully be able to make better judgements about what they do and how they can help you in your own projects, both now and in the future.

I just want to say that, in order to learn these complex subjects, you need to have a completely open mind and be open to every possibility, because that is usually where all the learning happens. And no doubt your brain is going to set itself on fire, multiple times.

...

Learn Data Science from Free Books

There is no better way to learn than from books, and then going out into the world and putting that newfound knowledge to the test; otherwise, we’re bound to forget what we actually learned. This is a beautiful list of books that every aspiring data scientist should take note of and add to their list of learning materials.

What books have you read to help begin your own journey in data mining and analysis? I’m sure the community would love to hear more, and I’m eager to see what I may have let slip through my fingers myself.

Why Visualizations Matter

While a list of items is great for entering or auditing data, visualizations are a great way to distill information down to what matters most, in a form that can be understood quickly.

...

Visualizations in Productivity Apps

We have the privilege of having the largest community of users of productivity applications in the world. Thanks...

...

Faster Creation of Visualizations

Excel 2007 introduced the ability to set the style of a chart with one click and leverage richer graphics such as shadows, anti-aliased lines, and transparency.

Office 2013 was one of our most ground-breaking releases.

...

Richer Interactivity

Part of my role at Microsoft involves presenting on various topics to stakeholders, and increasingly most of these include data visualizations. Only a few years back, I remember ...

...

Visualizations on All Data

In addition, both data volumes and the types of data customers want to visualize have expanded as well.

Excel 2013 also introduced the Data Model, opening the door for workbooks that contained significantly larger datasets than before, with richer ways to express business logic directly within the workbook.

Increasingly, we have access to geospatial data, and the recently introduced Power Map brings a new 3D visualization tool for mapping, exploring, and interacting with geographical and temporal data to Excel, enabling people to discover and share new insights such as trends, patterns, and outliers in their data over time...

...

We are very excited to have introduced Power Q&A as part of the Power BI launch. This innovative experience makes it even easier to understand your data by providing a natural language experience that interprets your question and immediately serves up the correct answer on the fly in the form of an interactive chart or graph. These visualizations change dynamically as you modify the question, creating a truly interactive experience with your data.

...

Visualizations Everywhere

As customers create insights and share them, we have also invested in ensuring SharePoint 2013 and Office 365 provide the same full-fidelity rendering as the desktop client, so workbooks remain beautiful wherever they’re consumed.

What’s Next?

...

Power Q&A looks interesting. I'd love to be able to provide that kind of thing in my apps. But let's see how it plays out over a version or two...

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.

The Easy Way to Query StackOverflow.com

Point your browser over to Data.StackExchange.com and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:

...

The Hard Way to Query StackOverflow.com

First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:
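If you go the hard way, the dump files are plain XML, with one attribute-per-column row element per post. A minimal standard-library sketch follows; the sample document is invented, but the attribute names follow the published dump schema.

```python
# Reading a Stack Exchange XML dump with the standard library.
# Each post is a <row> element whose column values are attributes.
import xml.etree.ElementTree as ET

sample = """<posts>
  <row Id="4" PostTypeId="1" Score="358" ViewCount="24247"
       Title="When setting a form's opacity should I use a decimal or double?" />
  <row Id="6" PostTypeId="1" Score="156" ViewCount="11840"
       Title="Percentage width child in absolutely positioned parent" />
</posts>"""

root = ET.fromstring(sample)
questions = [
    (int(row.get("Id")), int(row.get("Score")), row.get("Title"))
    for row in root.findall("row")
    if row.get("PostTypeId") == "1"   # 1 = question, 2 = answer
]
for qid, score, title in questions:
    print(qid, score, title)
```

The real Posts.xml is gigabytes, so in practice you would stream it with `ET.iterparse` rather than load it whole.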

That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real StackOverflow.com data to discover tag patterns, interesting questions, and helpful users.

A great resource, both Brent's post and, of course, the data itself, for when you need some "safe" data in a large enough volume to be meaningful...

Tuesday, November 19, 2013

Word clouds have become inescapable, and it is easy to see why: many people find such a blending of text and visual information easy to understand. But how, exactly, can you generate one of these content confections? Smashing Apps shares its collection of “10 Amazing Word Cloud Generators.”

...

VocabGrabber is different. It doesn’t even make a particularly pretty picture. As the name implies, VocabGrabber uses your text to build a list of vocabulary words, complete with examples of usage pulled directly from the content. This could be a useful tool for students, or for anyone learning something new that comes with specialized terminology. If your learning materials are digital, a simple cut-and-paste can generate a handy list of terms and in-context examples. A valuable find in a list full of fun and useful tools.

In this session, we are presenting 10 amazing word cloud generators for you. A word cloud can be defined as a graphical representation of word frequency, whereas word cloud generators are simply tools to map data, such as words and tags, in a visual and engaging way. These generators come with different features, including different fonts, shapes, layouts, and editing capabilities.

Without any further ado, here we are presenting a fine collection of 10 amazing and useful word cloud generators for you. Leave us a comment and let us know what you think of the proliferation of design inspiration in general on the web. Your comments are always more than welcome. Let us have a look. Enjoy!
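Whatever the generator, the first step under the hood is always the same: count word frequencies. A minimal sketch in plain Python (the stop-word list is abbreviated for illustration):

```python
# Every word cloud starts with a word-frequency count.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "it"}

def word_frequencies(text):
    """Count words, ignoring case, punctuation, and common stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

text = ("Word cloud generators map word frequency: the bigger the word, "
        "the more often it appears in the text.")
for word, count in word_frequencies(text).most_common(3):
    print(word, count)
```

A generator then just scales each word's font size by its count.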

Tuesday, November 12, 2013

Understanding the DNA of Data Science

Data Science is the competitive advantage of the future for organizations interested in turning their data into a product through analytics. Industries from health, to national security, to finance, to energy can be improved by creating better data analytics through Data Science. The winners and the losers in the emerging data economy are going to be determined by their Data Science teams.

Booz Allen Hamilton created The Field Guide to Data Science to help organizations of all types and missions understand how to make use of data as a resource. The text spells out what Data Science is and why it matters to organizations as well as how to create Data Science teams. Along the way, our team of experts provides field-tested approaches, personal tips and tricks, and real-life case studies. Senior leaders will walk away with a deeper understanding of the concepts at the heart of Data Science. Practitioners will add to their toolboxes.

In The Field Guide to Data Science, our Booz Allen experts provide their insights in the following areas:

Start Here for the Basics provides an introduction to Data Science, including what makes Data Science unique from other analysis approaches. We will help you understand Data Science maturity within an organization and how to create a robust Data Science capability.

Take Off the Training Wheels is the practitioner's guide to Data Science. We share our established processes, including our approach to decomposing complex Data Science problems, the Fractal Analytic Model. We conclude with the Guide to Analytic Selection to help you select the right analytic techniques to conquer your toughest challenges.

Life in the Trenches gives a first hand account of life as a Data Scientist. We share insights on a variety of Data Science topics through illustrative case studies. We provide tips and tricks from our own experiences on these real-life analytic challenges.

Putting it All Together highlights our successes creating Data Science solutions for our clients. It follows several projects from data to insights and shows the impact Data Science can have on your organization.

...

When I first saw this title, I thought it was going to be one of those make-my-brain-hurt kinds of books, but heck, even I can read it! It's actually not dry, and it's kind of entertaining! If you have "data" (and who doesn't anymore?), this free ebook might be a good read for you. And really, it won't make your brain explode...

Friday, October 04, 2013

What an awesome way to grok my home town's budget. While you'd think "budget = boring," this site makes it actually fun to look at, explore, and spelunk the budget. It's very eye-opening to see where all the money is going...

Thursday, September 26, 2013

Developers, take this course to get an overview of Microsoft Big Data tools as part of the Windows Azure HDInsight and Storage services. As a developer, you'll learn how to create map-reduce programs and automate the workflow of processing Big Data jobs. As a SQL developer, you'll learn how Hive can make you instantly productive with Hadoop data.

Added to the billion and one things I need to learn ASAP. When I find the time and the "want to," this series looks like a great way to get started. I've done a tiny bit of Hadoop, and I already know I'm going to need all the help I can get up this learning curve...

APIs / Feeds / Data

Welcome to Metro’s developer site – this is a website for technical individuals and entities who are using transportation and multi-modal data in interesting ways. Since first releasing our transit data in the summer of 2009, numerous developers have incorporated our data into their applications — you can see a list of featured applications here.

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.

Supported Operating System

Windows 7, Windows Server 2008, Windows Vista

Preinstalled Software (Prerequisites): Microsoft Excel 2010

...

Sounds like something I might be able to use... Now it would be even better if this were a .Net assembly that I could use. Will have to look at this and see what my programming options are...
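In the meantime, the core idea behind the add-in, scoring string pairs by similarity after normalizing them, is easy to sketch in plain Python. The normalization and scoring below are illustrative only, not what Fuzzy Lookup actually uses internally.

```python
# Toy fuzzy matching: normalize names, then score pairs by similarity.
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation and titles, and sort the name tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if t not in {"mr", "mrs", "ms", "dr"}]
    return " ".join(sorted(tokens))

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

pairs = [("Mr. Andrew Hill", "Hill, Andrew R."),
         ("Mr. Andrew Hill", "Andy Hill")]
for a, b in pairs:
    print(a, "~", b, "=", round(similarity(a, b), 2))
```

Sorting tokens is what makes "Hill, Andrew R." line up with "Andrew Hill" despite the reordered surname.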

“Providing free and open access to the U.S. Code in XML is another win for open government,” said Speaker John Boehner and Majority Leader Eric Cantor, in a statement posted to Speaker.gov. “And we want to thank the Office of Law Revision Counsel for all of their work to make this project a reality. Whether it’s our ‘read the bill’ reforms, streaming debates and committee hearings live online, or providing unprecedented access to legislative data, we’re keeping our pledge to make Congress more transparent and accountable to the people we serve.”

House Democratic leaders praised the House of Representatives Office of the Law Revision Counsel (OLRC) for the release of the U.S. Code in XML, demonstrating strong bipartisan support for such measures.

“OLRC has taken an important step towards making our federal laws more open and transparent,” said Whip Steny H. Hoyer, in a statement.

...

“Just this morning, Josh Tauberer updated our public domain U.S. Code parser to make use of the new XML version of the US Code,” said Mill. “The XML version’s consistent design meant we could fix bugs and inaccuracies that will contribute directly to improving the quality of GovTrack’s and Sunlight’s work, and enables more new features going forward that weren’t possible before. The public will definitely benefit from the vastly more reliable understanding of our nation’s laws that today’s XML release enables.” (More from Tom Lee at the Sunlight Labs blog.)

...

“Last year, we reported that House Republicans had the transparency edge on Senate Democrats and the Obama administration,” he said. “(House Democrats support the Republican leadership’s efforts.) The release of the U.S. Code in XML joins projects like docs.house.gov and beta.congress.gov in producing actual forward motion on transparency in Congress’s deliberations, management, and results.

For over a year, I’ve been pointing out that there is no machine-readable federal government organization chart. Having one is elemental transparency, and there’s some chance that the Obama administration will materialize with the Federal Program Inventory. But we don’t know yet if agency and program identifiers will be published. The Obama administration could catch up or overtake House Republicans with a little effort in this area. Here’s hoping they do.”

USC in XML

Each update of the United States Code is a "release point". This page contains links to downloadable files for the most current release point. The available formats are XML, XHTML, and PCC (photocomposition codes, sometimes called GPO locators). Certain limitations currently exist. Although older PDF files (generated through Microcomp) are available on the Annual Historical Archives page, the new PDF files for this page (to be generated through XSL-FO) are not yet available. In addition, the five appendices contained in the United States Code are not yet available in the XML format.

Wednesday, July 17, 2013

The Library of Congress is crowdsourcing an initiative to make it easier for software programs around the world to read, understand and categorize federal legislation.

The library is offering a $5,000 prize to the Challenge.gov contestant whose entry best fits U.S. legislation into Akoma Ntoso, an internationally-developed framework that aims to be the standard for presenting legislative data in machine-readable formats.

The Library of Congress, at the request of the U.S. House of Representatives, is utilizing the Challenge.gov platform to advance the exchange of legislative information worldwide.

Akoma Ntoso (www.akomantoso.org) is a framework used in many other countries around the world to annotate and format electronic versions of parliamentary, legislative and judiciary documents. The challenge, "Markup of U.S. Legislation in Akoma Ntoso", invites competitors to apply the Akoma Ntoso schema to U.S. federal legislative information so it can be more broadly accessed and analyzed alongside legislative documents created elsewhere.

"The Library works closely with the Congress and related agencies to make America’s federal legislative record more widely available through Congress.gov," said Robert Dizard Jr., Deputy Librarian of Congress. "This challenge will build on that accessibility goal by advancing the possibilities related to international frameworks. American legislators, analysts, and the public can benefit from international standards that reflect U.S. legislation, thereby allowing better comparative legislative information. We are initiating this effort as people around the world are working to share legislative information across nations and other jurisdictions."

Utilizing U.S. bill text, challenge participants would attempt to mark up the text into electronic versions using the Akoma Ntoso framework. Participants will be expected to identify any issues that appear when applying the Akoma Ntoso schema to U.S. bill text, recommend solutions to resolve those issues, and provide information on the tools used to create the markup.
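For a feel of what such markup might look like, here is a toy fragment built with Python's ElementTree. The element names follow the general Akoma Ntoso style, but this fragment is purely illustrative and has not been validated against the actual schema.

```python
# Building a toy, Akoma-Ntoso-style bill fragment with the standard library.
# Element names are illustrative, not schema-validated.
import xml.etree.ElementTree as ET

akn = ET.Element("akomaNtoso")
bill = ET.SubElement(akn, "bill")
body = ET.SubElement(bill, "body")
section = ET.SubElement(body, "section", eId="sec_1")
ET.SubElement(section, "num").text = "SEC. 1."
ET.SubElement(section, "heading").text = "Short title."
ET.SubElement(section, "content").text = (
    "This Act may be cited as the Example Act of 2013.")

xml_text = ET.tostring(akn, encoding="unicode")
print(xml_text)
```

The point of the challenge is exactly the hard part this toy skips: mapping real, messy bill text onto a structure like this consistently.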

The challenge, which opened today and closes Oct. 31, 2013, is extended to participants 18 years of age or older. For the official rules and more detailed information about the challenge or to enter a submission, visit akoma-ntoso-markup.challenge.gov.

The competition’s three judges are experts in either U.S. legislation XML standards or the Akoma Ntoso legal schema. The Library of Congress will announce the winner of the $5,000 prize on Dec. 19, 2013.

Akoma Ntoso XML schemas make “visible” the structure and semantic components of relevant digital documents so as to support the creation of high value information services to deliver the power of ICTs to increase efficiency and accountability in the parliamentary, legislative and judiciary contexts.

The human mind’s affinity for making sense of the objects it sees can be explained by a theory called Gestalt psychology. Gestalt psychology, also referred to as gestaltism, is a set of laws that accounts for how we perceive or intuit patterns and conclusions from the things we see.

Gestalt laws originate from the field of psychology. Today, however, this set of laws finds relevance in a multitude of disciplines and industries like design, linguistics, musicology, architecture, visual communication, and more.

These laws provide us a framework for explaining how human perception works.

Understanding and applying these laws within the scope of charting and data visualization can help our users identify patterns that matter, quickly and efficiently.

None of the Gestalt laws work in isolation, and in any given scenario, you can find the interplay of two or more of these laws.

Let us cover some of the Gestalt laws that are relevant to enhancing data visualization graphics.

Let us help the Stack Exchange guys suggest questions to a user that he can answer, based on his answering history, much like the way Amazon suggests products based on your previous purchase history. If you don’t know what Stack Exchange does: they run a number of Q&A sites, including the massively popular Stack Overflow.

Our objective here is to see how we can analyze the past answers of a user to predict questions that he may answer in the future. Stack Exchange’s current recommendation logic may work better than ours, but that won’t prevent us from helping them, for our own learning purposes.
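Mahout's recommender job does the heavy lifting at scale, but the underlying item-based idea can be sketched in a few lines of Python: questions sharing tags with the user's answering history score higher. All users, tags, and questions below are invented for illustration.

```python
# Toy tag-based recommendation: score candidate questions by how much
# their tags overlap the user's answering history. Data is invented.
from collections import Counter

answered_tags = ["sql-server", "t-sql", "sql-server", "indexing"]  # history

candidate_questions = {
    101: {"sql-server", "backup"},
    102: {"python", "pandas"},
    103: {"t-sql", "indexing", "sql-server"},
}

profile = Counter(answered_tags)   # how often the user answered each tag

def score(tags):
    """Sum the user's historical weight for each overlapping tag."""
    return sum(profile[t] for t in tags)

ranked = sorted(candidate_questions,
                key=lambda q: score(candidate_questions[q]),
                reverse=True)
print(ranked)   # best candidates first
```

Mahout's distributed item-similarity job generalizes this co-occurrence counting across millions of users and items.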

Conclusion

In this example, we did a lot of manual work to upload the required input files to HDFS and to trigger the Recommender Job by hand. In fact, you could automate this entire workflow by leveraging the Hadoop for Azure SDK, but that is for another post; stay tuned. Real-life analysis involves much more, including writing map/reduce jobs for extracting and dumping data to HDFS, automating the creation of Hive tables, and performing operations using HiveQL or Pig. However, we have just examined the steps involved in doing something meaningful with Azure, Hadoop, and Mahout.

You may also access this data in your mobile app or ASP.NET web application, either by using Sqoop to export it to SQL Server, or by loading it into a Hive table as I explained earlier. Happy Coding and Machine Learning!! Also, if you are interested in scenarios where you could tie your existing applications to HDInsight to build end-to-end workflows, get in touch with me.

Just the article I've been looking for. It provides a nice start to finish view of playing with HDInsight and Mahout, which is something I was pulling my hair out over a few months ago...

Father’s Day is approaching and you might be thinking about a good place to have a nice lunch with your Dad… We would like to show you how Data Explorer and Geoflow can help you gather some insights to make a good decision.

In order to achieve this, we will look at publicly available data about Food Establishment Inspections for the past 7 years and we will also leverage the Yelp API to bring ratings and reviews for restaurants. For the purpose of this post, we will focus on the King County area (WA) but you can try to find local data about Food Establishment inspections for your area too.

You will need to sign up for an API key before being able to connect. Please also make sure to read the Yelp API Terms of Use before using this API.

What you will learn in this post:

Import data from the Yelp Web API (JSON) using Data Explorer.

Import public data about Food Establishment Inspections from a CSV file.

Reshape the data in your queries.

Parameterize the Yelp query by turning it into a function, using the Data Explorer formula language, so you can reuse it to retrieve information about different types of restaurants as well as different geographical locations.

Invoke a function given a set of user-defined inputs in an Excel table.

Combine (Merge) two queries.

Load the final query into the Data Model.

Visualize the results in Geoflow.
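The Merge step, in particular, is just a join. Here is a plain-Python sketch of it; the restaurant names, ratings, and inspection results are all made up, whereas in the real walkthrough the left side comes from the Yelp API and the right side from the King County inspection CSV.

```python
# Merging two "queries" the way Data Explorer's Merge step does:
# a left join of Yelp-style ratings with inspection results by name.
# All data here is invented for illustration.
yelp = [
    {"name": "Pho Corner", "rating": 4.5},
    {"name": "Grill House", "rating": 3.0},
]
inspections = {
    "Pho Corner": "Satisfactory",
    "Grill House": "Unsatisfactory",
}

merged = [
    {**r, "inspection": inspections.get(r["name"], "No record")}
    for r in yelp
]
for row in merged:
    print(row["name"], row["rating"], row["inspection"])
```

In Data Explorer the same join is a couple of clicks; the sketch just makes the underlying operation explicit.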

...

You know you want to play with this... Just admit it. Makes me want to install Office 2013 just so I can... :)

The White House marked the one-year anniversary of its digital government strategy Thursday with a slate of new releases, including a catalog of government APIs, a toolkit for developing government mobile apps and a new framework for ensuring the security of government mobile devices.

Those releases correspond with three main goals for the digital strategy: make more information available to the public; serve customers better; and improve the security of federal computing.

That list of APIs and projects just blows my mind... I mean... wow. If you're looking to wander through some code, there HAS to be something here that you'll find interesting. There's something for every language, platform, and interest, I think...