Month: April 2013

Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It is for data what Wikimedia Commons is for media files: it centralizes access to and management of structured data, such as interwiki references and statistical information. Wikidata contains data in all languages for which there are Wikimedia projects.


We start our Knowledge Discovery Journey on Cloud Computing with a few simple steps. The first step is to find some of the cloud computing terms. It is very easy for humans to just look at a document and quickly identify relevant terms. However, that assumes you have some knowledge of the topic. Since we want to automate this process as much as possible, we will use some simple tools.

The first tool is Google Search. We simply search on the term “Cloud Computing” and the first entry happens to point to this Wikipedia page. For this stage of the experiment, we will take that page to be a reasonable representation of the current information about cloud computing.

How do we know that this information is current? A look at the history of edits shows that it is being updated almost daily (this is one of the benefits of sources like Wikipedia).

We will parse this page and find the top 20 most frequent bigrams (pairs of adjacent words). We do this with a simple Python program that uses the Natural Language Toolkit (NLTK) library (there are other ways of doing this as well).
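As a rough illustration of the bigram counting step, here is a minimal sketch. It is not the program used for this post: it relies only on the Python standard library instead of NLTK, the function name `top_bigrams` is made up for the example, and it does no stop-word filtering, so its counts will differ from the ones listed in this post.

```python
import re
from collections import Counter


def top_bigrams(text, n=20):
    """Return the n most frequent pairs of adjacent words in text."""
    # Lowercase the text and keep alphabetic runs only (a crude tokenizer).
    words = re.findall(r"[a-z]+", text.lower())
    # Pair every word with the word that follows it.
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair) for pair in pairs).most_common(n)


# A tiny made-up sample; in practice the text would come from the parsed page.
sample = "Cloud computing relies on cloud services. Cloud computing is growing."
top = top_bigrams(sample, 3)
print(top)
```

Because the tokenizer ignores sentence boundaries, pairs such as "services cloud" that straddle two sentences are counted too. That is one source of the noise terms you see in a crude list.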

We pick a few of the more interesting terms. In the list below, the first column represents the term and the second column the number of times the term occurs in the document.

cloud computing          156
cloud services           14
public cloud             14
private cloud            13
hybrid cloud             10
cloud providers          8
cloud applications       7
cloud based              7
cloud cloud              7
cloud infrastructure     7
cloud service            7
category cloud           6
heterogeneous cloud      6
use cloud                6
cloud clients            5
cloud environment        5
cloud storage            5
cloud symbol             5
cloud user               5
data cloud               5
software cloud           5

Now we have a very crude version of the vocabulary of cloud computing. This gives us a good starting point for further searches. Before we do that, we will eliminate some of the terms (like “cloud cloud”).
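One possible way to automate that cleanup is sketched below. The rules are assumptions made for illustration: drop any term that repeats the same word, and drop terms containing words from a small hand-picked list. The `drop_words` default is hypothetical, not a list used for this post.

```python
def clean_terms(term_counts, drop_words=frozenset({"category", "use"})):
    """Filter (term, count) pairs, removing obvious noise terms."""
    kept = []
    for term, count in term_counts:
        words = term.split()
        # A term like "cloud cloud" repeats a word and carries no meaning.
        if len(set(words)) < len(words):
            continue
        # Hypothetical rule: skip terms containing a word we consider noise.
        if any(word in drop_words for word in words):
            continue
        kept.append((term, count))
    return kept


# A few entries taken from the list in this post.
raw = [("cloud computing", 156), ("cloud cloud", 7), ("category cloud", 6), ("cloud storage", 5)]
print(clean_terms(raw))
```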

We can improve this process in several ways.

We can look at more than one page or document. A good candidate is NIST’s Cloud Computing Definition document, which is listed as one of the references on the Wikipedia page. There may be others. If we use multiple documents, we may use tf-idf (term frequency-inverse document frequency) or some other metric.
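To make the tf-idf idea concrete, here is a minimal sketch using only the standard library. It uses the plain textbook weighting (term frequency times the log of the inverse document frequency); there are several other variants, and the function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score terms per document. docs is a list of token lists."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = share of the document taken up by the term;
            # idf = log(number of documents / documents containing the term).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["cloud", "computing", "cloud"], ["cloud", "storage"]]
scores = tf_idf(docs)
print(scores)
```

A term that appears in every document (here "cloud") scores zero, which is the point of the metric: it highlights terms that distinguish one document from the others.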

We can repeat the term frequency program to include trigrams (triple words like “cloud computing platforms”) and add them to the list.
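Extending the counting from pairs to triples is a small change. The sketch below generalizes to word sequences of any length; as before, it is a standard-library illustration with a made-up function name (`top_ngrams`), not the program used for this post.

```python
import re
from collections import Counter


def top_ngrams(text, size=3, n=20):
    """Return the n most frequent runs of `size` adjacent words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Zip the word list against shifted copies of itself:
    # size=2 yields bigrams, size=3 yields trigrams, and so on.
    grams = zip(*(words[i:] for i in range(size)))
    return Counter(" ".join(gram) for gram in grams).most_common(n)


text = "cloud computing platforms and cloud computing platforms"
print(top_ngrams(text, size=3, n=1))
```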

There are other (better) ways to get the terms and we will reserve that option for the future. A web search for “cloud terminology” or “cloud ontology” reveals some interesting sources like this one – a dictionary of cloud terms.

Our quest, however, is to come up with simple methods of generating these terms ourselves. There are two reasons for doing this. One is that we may need to research topics that are not as popular as cloud computing for which the terminology may not exist. The second reason is that if we know how to automate and refine these terms, we can keep them updated as frequently as we want.

Meta:

If you want the (really crude) Python program I used to derive these terms, you can find it here.

I used another experimental tag cloud generator we built to visualize these tags.

The greatest experiment is nearly always a solo. The individual, seeking to learn, tries something new but only tries it on himself. If he fails, he has hurt only himself. If he succeeds he has made a discovery many people can use. Experiment only with your own time, your own money, your own labor. That’s the honest, sincere type of experiment. It’s rich.

In light of the prosperity of online social media, Web users are shifting from data consumers to data producers. To catch the pulse of this rapidly changing world, it is critical to transform online social media data to information and to knowledge.

This dissertation centers on the issue of modeling the dynamics of user communities, trending stories, topics and user interests in online social media. However, knowledge discovery and management in online social media is challenging because: 1) social media data arrive in the form of continuous streams; 2) the volume of social media data is potentially infinite; and 3) more importantly, social media data is very complex, consisting of network, text, tag, click and other information.

Conceptual and Meta Knowledge

In addition to bits of information, you also get higher levels of knowledge, if you take the time to analyze it.

You can glean inter-relationships and structure by analyzing followers, lists, retweets and other referral formats. For example, by looking at the people and lists that thought leaders (like Tim O’Reilly) follow, you can get some sense of the information relationships. By looking at the lists Tim is in, you can also understand a lot more about his following.

You can identify influencers and experts in various topics and industries. Klout tries to do this a bit. You can look at the reach and network effect of certain people on Twitter. You need to augment this analysis by looking beyond Twitter, but Twitter gives you some great starting points.

Everyone on Twitter is reachable. For example, if you want me to notice something, you can just add @dorait to your tweet to draw my attention. If the information you share makes sense to my audience or appeals to me, I may retweet it. Guy Kawasaki once mentioned that he looks at tweets where he is tagged.

By analyzing the retweet patterns of experts, you can understand their areas of interest and spheres of influence. You can do it with a few open source tools.

You can understand how information propagates (what, why, how and when) by analyzing tweets. Organizations like InfoChimps and DataSift can provide you with a large body of tweets you can use for research. You can make intelligent guesses based on the velocity of propagation (the speed at which topics trend).

You can use several techniques to create your watch signals in a specific space (market, industry segment, geographic region etc.).
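Two of the ideas above, retweet patterns and velocity of propagation, can be sketched in a few lines. This is only an illustration under assumptions: the record layout (`original_author`, `text`, `created_at`), the function names and the sample data are all made up, and real tweets obtained from a data provider would need to be mapped into this shape first.

```python
import re
from collections import Counter
from datetime import datetime


def retweet_profile(retweets):
    """Count whom an account retweets and which hashtags those retweets carry."""
    authors = Counter(rt["original_author"] for rt in retweets)
    hashtags = Counter(
        tag.lower() for rt in retweets for tag in re.findall(r"#(\w+)", rt["text"])
    )
    return authors, hashtags


def mentions_per_hour(timestamps):
    """Bucket tweet times by hour; a steep rise between buckets hints at a trend."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(buckets.items())


# Made-up sample data, for illustration only.
sample_retweets = [
    {"original_author": "alice", "text": "New post on #cloud pricing",
     "created_at": datetime(2013, 4, 1, 9, 5)},
    {"original_author": "alice", "text": "#Cloud and #BigData notes",
     "created_at": datetime(2013, 4, 1, 9, 40)},
    {"original_author": "bob", "text": "Thoughts on #bigdata",
     "created_at": datetime(2013, 4, 1, 10, 15)},
]

authors, hashtags = retweet_profile(sample_retweets)
hourly = mentions_per_hour(rt["created_at"] for rt in sample_retweets)
print(authors.most_common(1), hashtags.most_common(2), hourly)
```

The author counts suggest whose content an expert amplifies, the hashtag counts suggest areas of interest, and the hourly buckets give a crude measure of how fast a topic is spreading.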

If you are interested in this area, please contact me at dorai (at) infoassistants.com. I will be happy to answer any questions, elaborate on some of these ideas and have a chat.

We are building and testing tools for research. Our (experimental) tools help us gather information and help us in incremental knowledge discovery.

Let us start with a simple experimental topic that is also a rapidly emerging trend – “Cloud Computing”. Our goal is to gain as much knowledge about Cloud Computing as possible. We will use two different approaches.

6W Framework

What – A set of what questions. What is cloud computing? What is the difference between public, private, hybrid clouds? What is the difference between the Web and the Cloud?

Why – A set of why questions. Why should your business care? Why now?

Who – A set of who questions. Who is driving it? Who is adopting it?

When – A set of when questions. When is it appropriate for your business? When is it not?

Where – A set of where questions. Where is it being adopted? Where is it successfully being used?

How – A set of how questions. How do you get started? How do you evaluate it? How do you find the ROI for your business?

To gain some useful knowledge, we need to look at several aspects. These include (in no particular order):

Technologies

Vendors

Market Segments

Products

People (Experts, Influencers)

Trends

Topics (Vocabulary and Ontology)

Research

Patents

Investors and funding

Applications

Adoption

Drivers – that support the trend

Barriers to adoption

Intersections (with other emerging trends)

Events

Publications

Communities

Knowledge Bases

Opportunities

We will piece together this knowledge about cloud computing step by step (over several posts).

The Art of Explanation is built on my years of experience in creating explanations for organizations and educators. My company, Common Craft, is known around the world for making complex ideas easy to understand in the form of short videos. Through projects with companies such as Google, LEGO, Intel, and Ford Motor Company and the creation of our own library of video explanations, we have been students of explanation for many years. We have experimented and studied the explanation process and seen what is possible. Our videos have been viewed more than 50 million times online, and no other brand is better known for explanations (http://commoncraft.com/videos).

This book, however, is not a series of case studies and exercises or an academic exploration of “the science of explanation.” More than anything, it is a manifesto based on our experiences as professional explainers. We believe deeply in the power of explanation and see this book as an invitation to recognize that power by looking at explanation from a new perspective. When you do, you will see that it represents an unexplored part of your communications, a skill you can understand, practice, and improve.

The various ideas, approaches, and models I provide in these pages are secondary to a simple, higher-level goal: to make explanation a priority. This means thinking about how you explain ideas and how you can put explanations to work to accomplish your goals. It requires that you use explanation as a strategy in problem solving. You must also introduce others to the idea that explanations can create positive change.

I am a big fan of Common Craft’s In Plain English videos. I use them in my talks and recommend them to others. It is nice to see a book that explains the art of explanation. I am looking forward to reading it and will be back with some notes and lessons learned in a future post.

I’m a Python developer at 10gen, the company that makes MongoDB. I help maintain the standard Python driver for MongoDB (PyMongo), and I’m the author of a non-blocking driver called Motor. Both are open source. Coders at 10gen wear lots of bonnets: I do customer support, blogging, consulting, and speaking, and I spend a lot of time making open source contributions and working with people who contribute to our projects.

My third book, Hacking Secret Ciphers with Python, is finished. It is free to download under a Creative Commons license, and available for purchase as a physical book on Amazon for $25 (which qualifies it for free shipping). This book is aimed at people who have no experience programming or with cryptography. The book goes through writing Python programs that not only implement several ciphers but also can hack these ciphers.