human computer interactive learning (http://clockworkchaos.com/project7)
PROJECT7: creating technology to help people make decisions

Social Media Naturally Creates Fake News
http://clockworkchaos.com/project7/?q=social-media-fake-news
<p>What is fake news? How do we define what is real? The construction of truth is a critical topic for analysts of all sorts. However, there is now more content being created on the internet every second than any person could reasonably consume in their lifetime.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/sm_fake_news.jpg" alt="Social media, monetization, fake news, and censorship" /><br />
While we may not worry about critically examining the claims made in cat videos, we certainly do care about establishing the truth of statements on topics with personal and social implications. Such topics are often political in nature, so these examinations frequently become entangled with politics. As analysts, even when our goal is to remain objective, we often find ourselves positioned to critically examine these topics via data mining, statistical analysis, or citizen journalism. However, if we are to scale our efforts, we need to better understand the means by which fallacious claims are perpetuated. </p>
<p>Concern over this topic has been taken up by social media giants such as Google, Facebook, etc., using approaches such as Snopes partnerships, social feedback, and source monitoring. However, I find it ironic that such companies position themselves as independent curators. While they desire to come across as unbiased, the very platforms these companies have created systematically support both the cultivation and propagation of fallacious ideas. Perhaps it is for this reason that the creators of digital platforms have begun attempting to right the wrongs their networks have enabled. </p>
<p><b><i>How does social media cultivate fallacious ideas?</i></b><br />
There are primarily three ways that digital platforms perpetuate fallacies:</p>
<ol><li><i>Illusory perception of consensus.</i></li>
<li><i>Inappropriate application of social constructionist frameworks. </i></li>
<li><i>Reinforced perpetuation of incredulity. </i></li>
</ol><p>This is probably not an exhaustive list; feel free to add to it in the comments. Also, I am only going to briefly provide some examples of how each of these concepts applies, and I may add more over time.</p>
<p><b><i>1. Illusory perception of consensus.</i></b><br />
Essentially this takes place for two reasons. First, on average, your friends have more friends than you do. This may seem contradictory, but it is true because the mean and median number of friends on a social network diverge substantially. Such divergence arises in exponential or power-law distributions, where a few individuals account for a majority of the social connections. This effect is accentuated on networks that do not force reciprocity, such as Twitter and Instagram. It is also accentuated by local biases in a user's community; as users accrete connections they tend to do so in homogeneous fashion. The short simulation below illustrates the friendship paradox. </p>
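<p>As a quick illustration, here is a minimal sketch (my own toy example, with arbitrary parameters) that grows a network by preferential attachment and then compares each user's friend count with the average friend count among that user's friends:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%">import random

# Toy sketch with arbitrary parameters: grow a network by preferential
# attachment, then compare each user's friend count with the average
# friend count among that user's friends (the friendship paradox).
def preferential_attachment(n=10000, m=3):
    friends = {0: {1}, 1: {0}}
    targets = [0, 1]                  # sampling pool, weighted by degree
    for new in range(2, n):
        chosen = set(random.choices(targets, k=m))
        friends[new] = chosen
        for t in chosen:
            friends[t].add(new)
        targets.extend(chosen)        # high-degree nodes get more tickets
        targets.append(new)
    return friends

net = preferential_attachment()
deg = {u: len(f) for u, f in net.items()}
mean_deg = sum(deg.values()) / len(deg)
mean_friend_deg = sum(
    sum(deg[f] for f in fs) / len(fs) for fs in net.values()
) / len(net)
print(mean_deg, mean_friend_deg)      # friends' mean degree exceeds your own
</pre></div>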
<ul><li> Feld, Scott L. (1991), "Why your friends have more friends than you do", American Journal of Sociology, 96 (6): 1464–1477 </li>
<li> Hodas, Nathan; Kooti, Farshad (May 2013). "Friendship Paradox Redux: Your Friends are More Interesting than You" </li>
</ul><p><b><i>2. Inappropriate application of social constructionist frameworks. </i></b><br />
Ways of thinking that have practical value typically find application in the real world. For example, an effective way of assessing causality creates value for a business and employment for the individual who maintains that way of thinking. In my experience, people with more effective worldviews tend to find themselves increasingly strapped for free time, since the opportunity cost of their time has increased. So the ability of people with successful worldviews to contribute to conversations on social media is diminished in proportion to their lack of free time, and the content found on social media is confounded with practicality. This manifests itself in three primary ways:</p>
<ul><li>There is less support for practical ideas on social media than is present in the real-world. </li>
<li> Many practical ideas present in the real-world often do not appear as options on social media. This is accentuated on smaller social networks where the tendency for an echo chamber increases due to the lack of diversity in thought.</li>
<li> People with practical worldviews have often honed a level of diplomacy that causes their ideas, when presented, to come across less forcefully. Individuals with high opportunity costs for their time have more to lose, so it is more important that they communicate their ideas in a less combative fashion.</li>
</ul><p><b><i>3. Reinforced perpetuation of incredulity. </i></b><br />
In addition to the biases in audience composition and engagement mentioned above, there are also biases in content creation. It is harder to define quality metrics than reach metrics, so digital platforms primarily monetize content based upon reach. Engagement, if it affects monetization at all, often does so only as an afterthought (a second-order factor). This reinforces content that evokes an emotional response in platform users. Content that causes surprise, anger, fear, arousal, or disgust is much more potent than content that encourages critical thought. This encourages polarization in content creation: content becomes increasingly focused either upon in-group norms (i.e., virtue-signaling) or upon incredulity (i.e., click-baiting).</p>
<p>Where do we go from here? As analysts it is useful to be able to identify fallacious content. However, it is even more helpful if we can identify the mechanisms behind the creation of such content. If we can do the latter, then we are one step closer to being able to identify this content automatically using, for example, algorithms. </p>
Tags: SNA, Social Media, Social Network, Social Network Analysis
Sat, 30 Dec 2017 01:15:40 +0000, by rakirk

Who governs the data?
http://clockworkchaos.com/project7/?q=data_governance
<p><b>When it comes to privacy, should we allow for secret handshakes to take place?</b> In case you haven't been paying attention to trends in data lately, data governance is a hot topic. As we strive to implement regulations and best practices to ensure the continued, privacy-preserving exchange of data in our industrial applications, we are engaging in a process of data governance.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/secret_handshake.jpg" alt="no more secret handshakes..." /><br />
The core idea here is to take governance policies and translate them into requirements for our applications and our data. A lot of approaches, in my opinion, tend to overcomplicate this idea. The goal of this post is to suggest some straightforward approaches that I believe we can all deploy today to better protect our data.</p>
<p>So perhaps you are following what I'm saying and want to come up with some data governance policies. How do you know which policies to go after? The answer comes from two sources. First, from the top down, we examine what our regulatory bodies and industry best practices suggest. Second, from the bottom up, we can use an anomaly detection framework to find peculiar examples of use within the access patterns for our data. From the top down, we engage in an external research process. For brevity, I will come back to the latter idea in a subsequent blog post.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/blissful_ignorance.jpg" alt="can we just be silent...." /><br />
External policies contain a wealth of knowledge we can build upon to inform the types of policies our businesses should maintain. These policies often come in the form of explicit guidelines from authoritative sources. For example, HIPAA governs the legal use of personal healthcare information. In addition to authoritative sources, we also have influential sources. These typically come from groups of businesses reaching consensus on best practices for a particular type of marketplace. For example, the automotive industry has data guidelines for the use of automotive data. Finally, each organization typically has its own standards that are either made explicit or are implicitly present in day-to-day operations. To discover these, talk to the experts in your organization. As you research these guidelines, keep track of the types of data they discuss, the access restrictions they recommend, and the scope of use they allow. </p>
<p>Once we have the guidelines, we need to figure out how to translate them into procedures we can implement in our data applications. Translating is impossible if we do not have a way to define the concept of scope. Scope relates to user access, to data type, and to use case. I think people are used to restricting data by user type and by use case. However, the concept of data type can actually be fairly complicated. Since data can change type based upon the way it has interacted with various other systems, we need a way to track the lineage of each data record. Operational metadata (OMD) refers to the tags we place on an atomic datum to record where it has been in the past. This allows us to create a data lineage for each record. The most obvious way this manifests is by keeping track of the source application, creation date, and creation location for each record. Once implemented, such lineage linking allows for a finer-grained ability to map business rules and heuristics to data types.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/building_governance.jpg" alt="building data governance is up to the experts...." /></p>
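<p>To make this concrete, here is a minimal sketch of how an application might stamp OMD onto each record and extend its lineage as other systems touch it (all of the names here are hypothetical):</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%">from datetime import datetime, timezone

# Hypothetical sketch: wrap each atomic record with operational metadata
# (OMD) so that its lineage can later be checked against governance policies.
def with_omd(record, source_app, location):
    return {
        "payload": record,
        "omd": {
            "source_app": source_app,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "created_in": location,
            "lineage": [source_app],      # grows as systems touch the record
        },
    }

def touch(record, system):
    # Append each system that processes the record to its lineage trail.
    record["omd"]["lineage"].append(system)
    return record

rec = with_omd({"patient_id": 42}, source_app="intake_form", location="us-east")
rec = touch(rec, "de_identifier")         # now traceable to both systems
</pre></div>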
<p>Now that we have policies that relate to users, use cases, and OMD types, we can start to define the defects in our systems. Defects are occurrences of policy violations. So far I have focused primarily upon explicit policies. These policies only tell us whether some type of access is absolutely allowed or disallowed. In such a system, we can test effectiveness by creating fake data access calls and then determining whether we detect defects for those calls. This synthetic transaction approach is very useful for testing the most important types of data access policies. However, in more complex systems such unit testing may not suffice, as the number of types of interactions increases combinatorially. In addition, such binary decisions may be too rigid for real-world applications. For example, sometimes privacy can relate to having an infrequent pattern of access. For these reasons the next post will discuss some probabilistic techniques we can use to further evaluate the presence of privacy defects in our applications. </p>
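<p>A minimal sketch of the synthetic transaction idea, with made-up roles, data types, and use cases:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"># Illustrative sketch (the rules are hypothetical): encode explicit policies
# as (role, data_type, use_case) rules, then fire synthetic access calls and
# confirm that disallowed calls are flagged as defects.
POLICIES = {
    ("analyst", "phi", "marketing"): "deny",
    ("clinician", "phi", "treatment"): "allow",
}

def check_access(role, data_type, use_case):
    # Default closed: any scope we have no rule for is denied.
    return POLICIES.get((role, data_type, use_case), "deny")

def test_synthetic_transactions():
    # A denied call that slipped through would be a policy defect.
    assert check_access("analyst", "phi", "marketing") == "deny"
    assert check_access("clinician", "phi", "treatment") == "allow"
    assert check_access("intern", "phi", "research") == "deny"

test_synthetic_transactions()
</pre></div>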
<p>Final words: while privacy is not dead, it is up to us to continue to protect and define it. It is worth mentioning that while this approach promises to control data access, it is not necessarily an approach to building a secure system. While these policies relate to the enforcement of various access criteria, they differ from the type of approach you would use to build security into an application. This goes back to the first debate in the last post, which discussed the similarities and differences between privacy and concepts such as security, anonymity, and ownership.</p>
Tags: Data Governance, Privacy, Security, Anomaly Detection
Wed, 25 May 2016 18:17:13 +0000, by rakirk

Is Privacy Dead?
http://clockworkchaos.com/project7/?q=dead_privacy
<p><b>Is privacy dead?</b> Vint Cerf, proclaimed as the <i>father of the internet</i>, was the <a href="https://www.ftc.gov/sites/default/files/documents/public_events/internet-things-privacy-security-connected-world/final_transcript.pdf">keynote speaker</a> for a recent FTC workshop on the internet of things (IoT). He stated in his address that privacy may have always been an illusion.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/anonymous_tinfoil.jpg" alt="I'll just put on my tinfoil hat now...." /><br />
If you consider the history of human culture, there has hardly been a time when any individual had private moments. In primordial cultures people dwelt together. In feudal eras only the nobility had any notion of private space. It was not until modern times that wealth and population growth combined to produce both personal space and a concentrated urban sense of anonymity. But what is privacy, really?</p>
<p><b>Is privacy the same as anonymity?</b> Just because I do not know who you are does not mean that you have privacy. I was reminded of this recently while hiking on a remote trail. I was enjoying the overlook when I noticed a couple taking pictures in my direction. There I was, captured in their image without my express consent. Was it anonymous? In the age of facial recognition and location-aware devices, it is likely that existing software (e.g., social media sites) would recognize me. Such software could tag me behind the scenes without my knowledge. Clearly obscurity is eroding. Our sense of personal boundaries is enlarging such that it encompasses our digital environments.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/curious_child.jpg" alt="Everywhere you go, they'll be watching you..." /><br /><b>Is privacy the same as security?</b> Perhaps anonymity doesn't matter if you live in a castle? The problem with this line of reasoning is that all attackers have to do is figure out how to climb the castle walls. We have seen this happen many times in modern industry when publicly exposed endpoints are compromised. When this happens, people's online presence is no longer private. But are there cases where neither anonymity nor security matter?</p>
<p><b>Is privacy the same as ownership?</b> There are plenty of examples where private places, assets, and ideas are neither anonymous nor secure. In such cases the owner shares their private resource in exchange for money, for notoriety, or for having the public manage their asset. For example, patents protect private ideas in an unsecured commons. Air travel allows passengers to pay for a share of a jet. Ungated private roads in housing developments benefit from public upkeep. But how do we collectively manage ownership of a digital ecosystem?<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/castle_walls.jpg" alt="Time to wall ourselves in!" /><br /><b>What is the role of government?</b> There is an entire field dedicated to answering this question. One of my favorite definitions claims that its role is to balance the public and private benefit of shared resources. For example, how do we balance the need to keep our environment healthy with corporations' need to produce products whose by-products pollute the environment? In this example pollution is a negative externality: it is a cost the public must pay while receiving very little benefit. Corporations can also receive positive externalities from the public. For example, the interstate highway system benefits shipping companies, who pay a minute share of the creation cost of this infrastructure. In these examples our collective governance regulates the exchange of private and public resources. </p>
<p><b>How does this apply to data?</b> Data governance, as with other forms of governance, can relate to the collective management of public and private data resources. We have well-established patterns for collective benefit from public resources. For example, the U.S. Census helps organizations learn more about their customers. What about the need for collective positive externalities from private data? Is there a data governance role to regulate the public use of private healthcare data? Is there a way we can use data such as these to help cure diseases and <a href="https://preinventedwheel.com/mobile-innovation-is-saving-lives/">save lives</a> while minimizing the exposure of private individuals to negative externalities? What if individuals retained ownership in the same way that private housing complexes retain ownership over their public roadways? Would it matter if this data were secure as long as it was made anonymous? As mentioned, it is increasingly possible to determine who people are based upon their digital signatures. In such a case, what if we only made individuals' digital archives part of such a record after they were deceased?<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/images/pensive_lincoln.jpg" alt="Reactionary leadership is easy. It takes intelligence to be proactive." /><br /><b>What is our long-term data legacy?</b> I think these problems become easier to think about on the scale of generations. For example, we commonly exploit the private data of deceased generations in publicly available tools such as maps, ancestry charts, and healthcare records. Rather than bemoaning the lack of privacy, I think it is more helpful to learn to manage this digital commons through a new form of data governance. Such a proactive approach would allow us to maximize the benefit of these records for generations to come. </p>
<p>No, privacy is not dead. Instead, it is a concept that we are all responsible for defining. What we need from the <i>father of the internet</i> is not to bemoan the death of privacy. What we need is advice on how to set open data standards that create common data structures, disclosure protocols, and criteria for using private data in public sets in exchange for money. What we have to gain should surpass what we have to lose. We need to find a form of collective data governance that allows for maximum public benefit with minimal individual exposure. Did I get this right?</p>
Tags: Data Governance, Privacy, Security
Wed, 13 Apr 2016 03:31:37 +0000, by rakirk

NFL Kickers Now 10x More Likely to Miss the Extra Point!
http://clockworkchaos.com/project7/?q=nfl_pat
<p>A recent NFL rule change moved the point after touchdown (PAT) from the 2-yard line back to the 15-yard line. So far, the chances of making the extra point are down to 95%. That may seem high, but it is actually 4.5 percentage points lower!<br /></p><center>
<iframe height="500" width="720" scrolling="no" frameborder="0" src="http://clockworkchaos.com/projects/nfl/pat/byTypeWithPrediction.html">
<p>Your browser does not support frames, consider using Mozilla/Chrome.</p>
</iframe><p></p></center><br />
Over the last five years the PAT completion rate had remained mostly unchanged; the drop coincides with the timing of the rule change. This is not due to a reduction in the skill of NFL kickers. Actually, we can see that NFL kickers have been consistently improving their ability to complete field goals, year after year. Overall, NFL kickers are 3.3% more likely to make a field goal in 2015 than they were over the previous five seasons.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/nfl/pat/fgTrend.PNG" alt="This chart shows the probability of an NFL kicker completing a field goal from 2010 through 2015. (Scaled 0-100)" /><br />
The chart above shows the probability of an NFL kicker completing a field goal. The chart below shows the same chart zoomed to the scale of the data:<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/nfl/pat/zoom_fgTrend.PNG" alt="This chart shows the probability of an NFL kicker completing a field goal from 2010 through 2015. (Scaled 75-100)" /><br />
We could argue that this increase in performance is due to improved coaching: kickers are not asked to attempt kicks that the coach is not confident they can complete. To test this, look at completions by yardage by year:<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/nfl/pat/successTable.PNG" alt="Field goal completion rates by year and by 20-yard bracket." /><br />
The table above shows the probability that an NFL kicker will complete a field goal, by year and by 20-yard bracket. It looks like kickers are still doing as well across the various yardage buckets. So if we compare the field goal completion rate to the PAT completion rate, what do we notice?<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/nfl/pat/byType_simple.PNG" alt="This chart shows the probability of an NFL kicker completing a field goal alongside the probability of completing a PAT from 2010 through 2015. (Scaled 0-100)" /><br />
Here is the same chart zoomed to the scale of the data. I also added a highlighted point that indicates the expected completion rate for PATs.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/nfl/pat/byTypeWithPrediction_simple.PNG" alt="This chart shows the probability of an NFL kicker completing a field goal alongside the probability of completing a PAT from 2010 through 2015. It also shows the expected 99.5% PAT completion rate. (Scaled 75-100)" /><br />
The extra point has gone from being a given (99.56%) to having a 19-in-20 chance. Perhaps this is why the NFL felt motivated to change the rule. Is it a large enough change? If you do the math, it turns out that kickers are now roughly 10x more likely to miss the extra point! Wow, time to adjust our strategies. Will more teams start going for two-point conversions after a touchdown?
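<p>The 10x figure is just the ratio of the two miss rates; a quick back-of-the-envelope check using the completion rates cited above:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"># Back-of-the-envelope check, using the completion rates cited above:
old_make, new_make = 0.9956, 0.95      # historical vs. post-rule-change PAT
old_miss, new_miss = 1 - old_make, 1 - new_make
print(round(new_miss / old_miss, 1))   # ~11.4, i.e. roughly 10x more misses
</pre></div>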
Tags: NFL, Analytics, analysis, Predictive Analytics, prediction
Wed, 30 Sep 2015 05:04:28 +0000, by rakirk

Data too big for your algorithm? Use more machines
http://clockworkchaos.com/project7/?q=distributed_monte_carlo
<p>Data science strives to build applications that help solve modern, complex business problems. Often these problems require solutions that scale through the use of distributed, parallel computing. However, many of our known techniques do not seem to directly scale this way. This post discusses how we can take advantage of the Central Limit Theorem to scale some of our more advanced analysis tools.<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/model/all_threads.png" alt="What is the point of all of these normal distributions?" /></p>
<p>A distributed algorithm needs to have a couple of important properties. First, we need to be able to break it down into smaller components. Something which was a single, serial process turns into many smaller processes. Second, it needs to be commutative. This means that it needs to be insensitive to the order in which each process takes place. Finally, it needs to be associative. This means that the process needs to be insensitive to the order in which we combine the final results. </p>
<p>
Okay, enough text already. Here is an example. Clearly we can distribute simple operations such as min, max, mean, and sum. But can we also use this for things such as significance tests or Monte Carlo simulations? How? Start with a single simulation that approximates the mean using 100,000 iterations:<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/model/single_thread.png" alt="The outcomes of a single larger simulation clearly surround a central point." /></p>
<p>
Now how about 100 smaller simulations to approximate the same mean?<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/model/all_threads.png" alt="The outcome of a averaging many smaller simulations clearly surrounds a central point." /></p>
<p>
They look like they are all over the place. But, what happens if I combine them?<br /><img align="middle" width="500" src="http://clockworkchaos.com/projects/model/multi_thread.png" alt="The outcomes of many simulations appear to surround a single shared center." /></p>
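<p>In code, the whole experiment might look something like this minimal sketch (the distribution parameters are made up, and <code>multiprocessing</code> stands in for any distributed backend such as Spark):</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%">import random
from multiprocessing import Pool

# Sketch: estimate a mean by Monte Carlo, either as one large simulation or
# as many small ones whose results are averaged. Because averaging
# equal-sized chunks is commutative and associative, the work can be
# scattered across machines and recombined in any order.
def simulate(args):
    seed, iterations = args
    rng = random.Random(seed)
    return sum(rng.gauss(82.13, 5.0) for _ in range(iterations)) / iterations

if __name__ == "__main__":
    single = simulate((0, 100_000))                  # one big simulation
    with Pool(4) as pool:
        parts = pool.map(simulate, [(s, 1_000) for s in range(100)])
    multi = sum(parts) / len(parts)                  # 100 small, combined
    print(single, multi)                             # both converge near 82.13
</pre></div>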
<p>Well, they both converged. But how did they do? Since the <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares (OLS)</a> method has the least error at solving this problem, I will compare each simulation to the OLS results. The result of the many smaller simulations is actually slightly more accurate than the single larger simulation. This makes sense: each of the smaller simulations had a different random starting location.</p>
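<p>For reference, the OLS baseline can be computed in closed form (this assumes <code>x</code> and <code>y</code> are the same sample arrays the simulations used):</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"># Closed-form OLS baseline used as the yardstick for both simulations.
def ols(x, y):
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    slope = cov / var
    return slope, my - slope * mx      # (slope, intercept)
</pre></div>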
<style>
<!--/*--><![CDATA[/* ><!--*/
table {
width:100%;
}
table, th, td {
border: 0px;
border-collapse: collapse;
}
th, td {
padding: 5px;
text-align: left;
}
table#t01 tr:nth-child(odd) {
background-color: #EBF0FF;
}
table#t01 tr:nth-child(even) {
background-color:#fff;
}
table#t01 th {
background-color: #4775FF;
color: white;
}
/*--><!]]>*/
</style><table id="t01"><tr><th></th>
<th>OLS Value</th>
<th>Single Simulation</th>
<th>Multi Simulation</th>
</tr><tr><td><i>Slope</i></td>
<td>82.1313</td>
<td>82.1326</td>
<td>82.1304</td>
</tr><tr><td><i>Intercept</i></td>
<td>0.2588</td>
<td>0.2580</td>
<td>0.2584</td>
</tr></table><p>You can distribute a variety of data science processes using this simple technique. The technique works because of the <a href="http://www.nist.gov/manuscript-publication-search.cfm?pub_id=823110">Central Limit Theorem</a>: averages of samples drawn from a distribution converge on the distribution's own mean, with approximately normal error. As data scientists, we intuitively use this principle in our daily lives. For example, we use it any time we build averages to approximate the typical rate at which something takes place, or when we estimate the typical size of something. We also take advantage of this principle when we test for significance. We can incorporate this with distributed approaches such as Parallel Python, Theano, or PySpark. </p>
<p> However, as our data gets larger and more complex, it gets harder to use process-intensive techniques. At some point even distributed processing will not suffice. Future posts will focus on ways to further simplify these techniques using approximation techniques. </p>
Tags: Simulation, Monte Carlo, python, Spark, Algorithm, Data Science, Machine Learning, Scalability, Distributed Computing
Fri, 25 Sep 2015 06:42:36 +0000, by rakirk

Evolution is not the best force for change over time
http://clockworkchaos.com/project7/?q=node/36
<p>We often overlook the importance of the word learning in the catch-phrase "machine learning". But what does it mean? Learning refers to a process of iterative improvement. While there are quite a few algorithms, there are really just a couple of common learning techniques. One of the most common, and simplest, is gradient descent.<br /><img align="center" width="750" src="http://clockworkchaos.com/projects/media/linear_learner_gradient.PNG" alt="Computational approach" /></p>
<p>I will walk through the algorithm step by step. Start by instantiating a class:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">class</span> <span style="color: #BB0066; font-weight: bold">Learn</span>():
<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">__init__</span>(<span style="color: #007020">self</span>):
<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">1</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>intercept<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">0</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>functor<span style="color: #333333">=</span> <span style="color: #008800; font-weight: bold">lambda</span> x: (<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">*</span>x)<span style="color: #333333">+</span><span style="color: #007020">self</span><span style="color: #333333">.</span>intercept
<span style="color: #007020">self</span><span style="color: #333333">.</span>r<span style="color: #333333">=</span><span style="color: #6600EE; font-weight: bold">0.01</span>
</pre></div>
<p>The important part is where I tell the algorithm how to change itself. For something such as linear regression, I can do it in just a few lines of code:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"> <span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">score</span>(<span style="color: #007020">self</span>,x,y):
n<span style="color: #333333">=</span><span style="color: #007020">len</span>(x)
grad_m,grad_b<span style="color: #333333">=</span><span style="color: #6600EE; font-weight: bold">0.</span>,<span style="color: #6600EE; font-weight: bold">0.</span>
m_fxn<span style="color: #333333">=</span><span style="color: #008800; font-weight: bold">lambda</span> x: <span style="color: #333333">-</span><span style="color: #6600EE; font-weight: bold">1.</span><span style="color: #333333">*</span>x
b_fxn<span style="color: #333333">=</span><span style="color: #008800; font-weight: bold">lambda</span> x: <span style="color: #333333">-</span><span style="color: #6600EE; font-weight: bold">1.</span>
<span style="color: #008800; font-weight: bold">for</span> i <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(n):
grad_base<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>functor(x[i])<span style="color: #333333">-</span>y[i]
grad_m<span style="color: #333333">+=</span>m_fxn(x[i])<span style="color: #333333">*</span>grad_base
grad_b<span style="color: #333333">+=</span>b_fxn(x[i])<span style="color: #333333">*</span>grad_base
norm<span style="color: #333333">=-</span><span style="color: #6600EE; font-weight: bold">2.</span><span style="color: #333333">/</span>n
<span style="color: #007020">self</span><span style="color: #333333">.</span>m<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>m<span style="color: #333333">-</span>grad_m<span style="color: #333333">*</span>norm<span style="color: #333333">*</span><span style="color: #007020">self</span><span style="color: #333333">.</span>r
<span style="color: #007020">self</span><span style="color: #333333">.</span>b<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>b<span style="color: #333333">-</span>grad_b<span style="color: #333333">*</span>norm<span style="color: #333333">*</span><span style="color: #007020">self</span><span style="color: #333333">.</span>r
</pre></div>
<p>I do not normally release code snippets, but I figured it would be alright to share something so central to machine learning. As you can see, I can implement this from scratch. The magic behind the gradient descent algorithm is its ability to change its parameters based upon the gradient of the error function with respect to each parameter. To solve the problem, I now just iterate until the solution converges. (If you try this technique and it diverges, you may need to solve a second set of equations to determine the learning rate.) </p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"> <span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">model</span>(<span style="color: #007020">self</span>,x,y,size<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">10</span><span style="color: #333333">**</span><span style="color: #0000DD; font-weight: bold">3</span>):
<span style="color: #008800; font-weight: bold">for</span> i <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(size):
<span style="color: #007020">self</span><span style="color: #333333">.</span>score(x,y)
</pre></div>
<p>While gradient descent may be obvious to some well-versed readers, I think it is important to point out the advantage of this approach over other approaches to incremental improvements. For example, people often talk about evolution as the most powerful mechanism for change in biological organisms. This is simply not true.</p>
<p>Random variation is a powerful tool for solving problems. When nature uses it we refer to it as evolution. The premise is that successful forms of variation tend to recur because they give the animals that carry them an advantage in their interactions with the environment. However, the limitation of evolution becomes evident when comparing it to other forms of improvement over time. For example, we now know that evolution is not the primary driver of change in more advanced organisms. To see what I mean, consider an evolutionary solution to the previous problem:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">class</span> <span style="color: #BB0066; font-weight: bold">Evolve</span>():
<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">__init__</span>(<span style="color: #007020">self</span>):
<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">1</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>intercept<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">0</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>wmin,<span style="color: #007020">self</span><span style="color: #333333">.</span>wmax<span style="color: #333333">=</span><span style="color: #6600EE; font-weight: bold">0.</span>,<span style="color: #0000DD; font-weight: bold">10</span><span style="color: #333333">**</span><span style="color: #0000DD; font-weight: bold">3</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>imin,<span style="color: #007020">self</span><span style="color: #333333">.</span>imax<span style="color: #333333">=</span><span style="color: #6600EE; font-weight: bold">0.</span>,<span style="color: #0000DD; font-weight: bold">10</span><span style="color: #333333">**</span><span style="color: #0000DD; font-weight: bold">3</span>
<span style="color: #007020">self</span><span style="color: #333333">.</span>r<span style="color: #333333">=</span><span style="color: #008800; font-weight: bold">lambda</span> mini,maxi: random<span style="color: #333333">.</span>randrange(mini,maxi)
<span style="color: #007020">self</span><span style="color: #333333">.</span>functor<span style="color: #333333">=</span> <span style="color: #008800; font-weight: bold">lambda</span> x: (<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">*</span>x)<span style="color: #333333">+</span><span style="color: #007020">self</span><span style="color: #333333">.</span>intercept
<span style="color: #007020">self</span><span style="color: #333333">.</span>error<span style="color: #333333">=</span><span style="color: #007020">float</span>(<span style="background-color: #fff0f0">'inf'</span>)
<span style="color: #007020">self</span><span style="color: #333333">.</span>m<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>weight
<span style="color: #007020">self</span><span style="color: #333333">.</span>b<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>intercept
<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">score</span>(<span style="color: #007020">self</span>,x,y):
<span style="color: #007020">self</span><span style="color: #333333">.</span>m<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>weight
<span style="color: #007020">self</span><span style="color: #333333">.</span>b<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>intercept
<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>r(<span style="color: #007020">self</span><span style="color: #333333">.</span>wmin,<span style="color: #007020">self</span><span style="color: #333333">.</span>wmax)
<span style="color: #007020">self</span><span style="color: #333333">.</span>intercept<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>r(<span style="color: #007020">self</span><span style="color: #333333">.</span>imin,<span style="color: #007020">self</span><span style="color: #333333">.</span>imax)
n<span style="color: #333333">=</span><span style="color: #007020">len</span>(x)
error<span style="color: #333333">=</span><span style="color: #007020">sum</span>([(y[i]<span style="color: #333333">-</span><span style="color: #007020">self</span><span style="color: #333333">.</span>functor(x[i]))<span style="color: #333333">**</span><span style="color: #0000DD; font-weight: bold">2</span> <span style="color: #008800; font-weight: bold">for</span> i <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(n)])<span style="color: #333333">**.</span><span style="color: #0000DD; font-weight: bold">5</span>
<span style="color: #008800; font-weight: bold">if</span> error<span style="color: #333333">&lt;</span><span style="color: #007020">self</span><span style="color: #333333">.</span>error:
<span style="color: #007020">self</span><span style="color: #333333">.</span>error<span style="color: #333333">=</span>error
<span style="color: #008800; font-weight: bold">else</span>:
<span style="color: #007020">self</span><span style="color: #333333">.</span>intercept<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>m
<span style="color: #007020">self</span><span style="color: #333333">.</span>weight<span style="color: #333333">=</span><span style="color: #007020">self</span><span style="color: #333333">.</span>b
<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">model</span>(<span style="color: #007020">self</span>,x,y,size<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">10</span><span style="color: #333333">**</span><span style="color: #0000DD; font-weight: bold">3</span>):
<span style="color: #008800; font-weight: bold">for</span> i <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(size):
<span style="color: #007020">self</span><span style="color: #333333">.</span>score(x,y)
</pre></div>
<p>If you were to run the above program, you would notice that it improves over time. However, it improves much more slowly. It also does not always converge on the same solution. The above contrasts learning and evolution. Nature still has another trick for solving tough problems. </p>
<p>Animals that have a concept of <i>self</i> and of <i>other</i> are able to make selections based upon the perceived social or environmental abilities of the other animal. This is a sexual selection process, and it is much more powerful than evolution alone. While it still takes quite a while for an optimal change to become embedded in the core of our genome, humans can markedly change the representation of certain traits within as few as a couple of generations. No longer prone to the whims of chaos, the process of improvement over time now has direction. This selection process causes the random variations in evolution to focus upon particular traits. However, it struggles to balance the rate at which each parameter converges. For this, nature requires a form of central direction. </p>
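<p>The results below include a "Selection" run. A minimal sketch of that focused-variation idea (not the exact code behind the numbers below) might draw candidates near the current best rather than anywhere in the range:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%">import random

# Hypothetical sketch of a "Selection" variant: candidates are drawn near
# the current best parameters instead of anywhere in range, so variation
# focuses on promising traits.
class Select(Evolve):
    def score(self, x, y):
        self.m, self.b = self.weight, self.intercept
        spread = min(max(self.error, 1.), 100.)  # narrow the search over time
        self.weight = random.gauss(self.m, spread)
        self.intercept = random.gauss(self.b, spread)
        n = len(x)
        error = sum([(y[i] - self.functor(x[i]))**2 for i in range(n)])**.5
        if error &lt; self.error:
            self.error = error
        else:
            self.weight, self.intercept = self.m, self.b
</pre></div>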
<p>Learning is far more potent than either evolution or selection. While selection converges, its direction is only loosely focused. Learning offers us a technique to determine direction instantaneously upon each iteration. To summarize, let me illustrate the success of each of the three approaches at solving a simple linear regression equation to determine the weights <b>m</b> and <b>b</b>:<br /><code><br />
Evolving:<br />
Predicted m: 57<br />
Actual m: 43.9652012955<br />
Predicted b: 4<br />
Actual b: 12.1165068006<br />
Selection:<br />
Predicted m: 40.3281413905<br />
Actual m: 43.9652012955<br />
Predicted b: 45.6543857325<br />
Actual b: 12.1165068006<br />
Learning:<br />
Predicted m: 43.9651654396<br />
Actual m: 43.9652012955<br />
Predicted b: 12.1164853632<br />
Actual b: 12.1165068006<br /></code><br />
While we can typically solve linear regression using ordinary least squares and matrix operations, I have shown that we can also solve it incrementally using gradient descent. I have also shown that we can solve it through random permutation and through collective permutation. Clearly a well-informed approach makes a huge difference in the amount of raw computational power we require to solve a problem. This is further evidence that we should spend at least as much time considering the science behind our approach as we do figuring out how to apply as much hardware as possible to an expensive solution.</p>
Tags: Machine Learning, ML, Algorithm, Learning Sciences, Data Science, Adaptive Systems
Sat, 23 May 2015 06:11:12 +0000, by rakirk

Supervised learning cannot solve your problems
http://clockworkchaos.com/project7/?q=supervised_v_unsupervised
<p>I often see aspiring data scientists jump to using scaled-up algorithms to infer the relationships between many different features simultaneously, and I notice the field as a whole is somewhat indifferent towards this approach. "If your only tool is a hammer then every problem better be a nail." What's wrong with just using a giant hammer? After all, we can just scale up our algorithm on large virtual instances and then wrap it in a meta-learning function that tries every combination of feature set and parameter configuration. Let me compare and contrast the two approaches:</p>
<table width="720px"><tr width="720px"><td width="144px"><b>Current use:</b></td>
<td width="288px">Supervised learning is in its glory days. </td>
<td width="288px">Unsupervised seems unpopular at the moment. </td>
</tr><tr width="720px"><td width="144px"><b>History:</b></td>
<td width="288px">It has seen serious improvements and use within the last couple of decades. </td>
<td width="288px">Unsupervised learning has been in use for the better part of 4 decades. </td>
</tr><tr width="720px"><td width="144px"><b>Popularity:</b></td>
<td width="288px">Popularity is based upon ability to solve many different types of problems without much thought about the correct domain model. </td>
<td width="288px">They are unpopular because it can take time to discover the domain model underlying a certain business problem and we assume that we already know the domain model.</td>
</tr><tr width="720px"><td width="144px"><b>Basic idea:</b></td>
<td width="288px">Every new example contributes both the the structure of the model and to the information within the model. </td>
<td width="288px">Every new example offers as much new information as is present within the example. </td>
</tr><tr width="720px"><td width="144px"><b>Role of domain expertise:</b></td>
<td width="288px">Domain models have minimal say in the final model. </td>
<td width="288px">The structure of the model is determined both by a domain model and by mutual co-occurrences within examples. </td>
</tr><tr width="720px"><td width="144px"><b>Robust:</b></td>
<td width="288px">Supervised models are fragile to unseen use cases.</td>
<td width="288px">Unsupervised models are robust to unseen use cases.</td>
</tr><tr width="720px"><td width="144px"><b>Training data:</b></td>
<td width="288px">They do not work well in contexts where there has been no training data. </td>
<td width="288px">They can operate upon any data.</td>
</tr><tr width="720px"><td width="144px"><b>Class learning:</b></td>
<td width="288px">Their output is discrete. We are forced to teach these models using labels. </td>
<td width="288px">Their output is probabilistic and they can learn from labelled data. </td>
</tr><tr width="720px"><td width="144px"><b>Final thoughts:</b></td>
<td width="288px">This is problematic since we often do not want to have to learn from a horrible event before being able to prevent it from happening. </td>
<td width="288px">Later, upon seeing a set of labels, they can immediately perform inferences on the new class.</td>
</tr></table><hr /><p>Supervised learning is like putting your model in a classroom where you have a teacher with a lesson plan. The teacher has their own perspective. The lesson plan has its own scope. The student's knowledge may not be very broad, but it will be quite deep in the areas where you have prepared the lessons. </p>
<p>Unsupervised learning is like putting your student (the model) on the streets. The student is forced to learn things as part of its daily exploration. At first it wanders around, unsure of its surroundings. Over time it is able to guide its own exploration. </p>
<p>At the end of the day, the two students meet. The student in the classroom knows a lot about esoteric concepts and taunts the other student. The student from the street has become a grade A bad ass. Upon hearing the taunt the student from the street hits the student from the classroom in the stomach and steals his lunch money. Apparently the student from the classroom had never had a training example that prepared him for that eventuality. </p>
<hr /><p>At first glance, we believe that it is easier to perform supervised learning. However, upon inspecting the trade-offs, I argue that unsupervised learning is not only easier, it is more philosophically grounded. At this point the reader may be asking: "So, are you opposed to supervised learning?" No! I'm amazed by the capabilities of some of the modern supervised approaches. I'm merely offering an exposé on what I see as a problem in the approach commonly used by data science practitioners. </p>
<p>I believe it is better to first start with an unsupervised approach and to take the learning from this approach as input for the supervised approach. We can even use our unsupervised model to generate faux data for our supervised learner. The powerful use case for supervised learning systems is to manifest those inferences which we have already successfully grounded through unsupervised exploration. I believe that using the two together in a staged approach represents a sort of synergy that will decrease the fragility of supervised models while offering inferencing capabilities beyond what an unsupervised model alone could provide. A small sketch of this staging follows. </p>
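<p>Here is a minimal sketch of that staging using scikit-learn (the dataset and parameters are arbitrary): cluster without labels first, then feed the cluster geometry to a supervised learner as features:</p>
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
<pre style="margin: 0; line-height: 125%">from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Arbitrary synthetic data standing in for a real business problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised stage: learn structure without any labels.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_train)
train_feats = km.transform(X_train)   # distances to clusters become features
test_feats = km.transform(X_test)

# Supervised stage: ground the classifier in the learned structure.
clf = LogisticRegression(max_iter=1000).fit(train_feats, y_train)
print(clf.score(test_feats, y_test))
</pre></div>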
Tags: Data Science, Machine Learning, Modeling, Algorithm
Wed, 20 May 2015 15:02:12 +0000, by rakirk

Hoover Dam May Shut Down by End of 2015
http://clockworkchaos.com/project7/?q=dam_water_wasters
<p>Of course the dam will still be there, but the water levels may fall below the critical elevation of 1,050 ft. necessary to make generation of power viable. I did a quick projection analysis based upon historical data and discovered that the water will fall to critically low levels by Dec. 2015:<br /></p><center>
<iframe height="500" width="720" scrolling="no" frameborder="0" src="http://clockworkchaos.com/projects/weather/mead/mead_levels.html">
<p>Your browser does not support frames; consider using Firefox or Chrome.</p>
</iframe><p></p></center><br />
I had the opportunity to visit the dam recently. Needless to say, I'm amazed by what my grandparents' generation was able to build for us as a part of their legacy. Equally amazing is how quickly we have managed to squander this precious set of resources. In disgust I decided to look at historic water levels by month for the last 80 years. I plotted this data in blue on the chart above. I also plotted the maximum water level that the dam can maintain, 1,241 ft., as a grey dashed line, and did the same for the minimum level at which it can generate power, 1,050 ft. Finally, in red I projected the remaining months of 2015 based upon the historic cyclical patterns present within the last two decades or so. (To see this, you can zoom in on the chart using the widgets in the bottom left.) Given that the average projected level is 1,058 ft., it follows that there will almost certainly be days in December where the level falls either below or very close to the minimum level necessary for the dam to function. Here is a close-up:<br /><img align="center" width="500" src="http://clockworkchaos.com/projects/weather/mead/ml_static.PNG" alt="Lake Mead Levels clearly expected to fall close to critical 1,050 ft. mark." /><br />
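<p>The projection code itself is not published here, but the approach described above (extending the recent cyclical, month-over-month behavior forward) can be sketched roughly as follows; the file name and column names are hypothetical stand-ins, and the actual methodology may differ:</p>
<pre>
# Rough sketch: project the remaining months of 2015 by applying the
# average month-over-month change observed over the last two decades.
import pandas as pd

levels = pd.read_csv("mead_monthly_levels.csv",        # hypothetical file
                     parse_dates=["date"], index_col="date")["elevation_ft"]

# Restrict to the last two decades, the cyclical window noted above.
recent = levels[levels.index >= levels.index[-1] - pd.DateOffset(years=20)]
seasonal_delta = recent.diff().groupby(recent.index.month).mean()

projection, current = [], levels.iloc[-1]
for month in range(levels.index[-1].month + 1, 13):
    current += seasonal_delta[month]                   # typical change for month
    projection.append((month, round(current, 1)))
print(projection)
</pre>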
I suspected the decline could be driven by the population increase in the desert. But, to my utter amazement, I discovered that the growth of Las Vegas and other desert cities is not the sole contributing factor to the decline in water levels. Instead, it's clear that we had plenty of water during periods of recent history when human consumption was similar. This is quite clear since the water levels were at an all-time high as recently as 2000, and most of these cities had nearly as many inhabitants then as they do now. So, instead, the decline probably has more to do with the critical droughts that seem to be increasing in frequency compared to what we had seen historically. Still, it's hard to imagine that the dam builders would have thought that giant cities in the desert could be a consequence of this technology. What's worse is that once the dam ceases operations the plan is for Vegas to receive the remaining water via an aqueduct running beneath the lake. Like a blood-sucking mosquito, Las Vegas is ready to suck the reservoir dry, leaving nothing for the rest of the Southwest.
<p><i>Posted Thu, 14 May 2015.</i></p><hr /><h2><a href="http://clockworkchaos.com/project7/?q=data_science_definition">A Data Scientist's Definition of Data Science</a></h2>
<p>Data Science can either be a discipline or it can be a profession. The choice is up to the practitioners. It is convenient to move in the direction of a discipline because it is easy. I intend to pursue it as a profession. </p>
<p><img style="margin:0px auto;display:block" width="450" src="http://clockworkchaos.com/projects/hci_manifesto/all_three.png" alt="Data science is a combination of three approaches: research, domain, and computation" /></p>
<p>We have to choose. We can perform the same repetitive tasks, the same cookie-cutter analysis using the same algorithms over and over again in an effort to provide the fastest, most generic results possible. Or we can do something different. We can treat it as a profession worthy of the responsibility and accountability that other professions maintain. As a profession, it is more about the practice, the methods, and the contribution to the field. As a discipline it is more about consistent, repetitive results. Disciplines have a history of becoming obsolete. We used to outsource mechanical workers' disciplines to machines. Then we outsourced engineering workers' disciplines to foreign countries. Now we are starting to outsource knowledge workers' disciplines to algorithms. While there is nothing wrong with powerful algorithms, they only have power when contextualized with meaning. For example, the mechanical workers' discipline can only be outsourced successfully if the resultant machine performs a specific task that has meaning based upon the activities of the worker whom it is augmenting. </p>
<p>I recently read an article written in collaboration by a physicist, a computational biologist, a virologist, and a computer scientist. At first I thought this should be surprising. But, realizing my own lack of surprise, I began to think about the direction in which the field of applied research has been moving. We are witnessing the merging of what used to be three distinct areas of expertise. Each area has its own foundation and each has several occupational specialties.</p>
<table border="0px"><tr><td>
<ul><li>
<b>Computational expertise</b>: These fields focus upon establishing a foundation based upon the axioms and certainties of formal logic systems. Reasoning is powerful and certain, yet difficult to generalize. Example disciplines include: computational theory, data integration, database administration, information systems, mathematics, electrical engineering, mechanical engineering, etc.
</li>
</ul></td>
<td width="250">
<img align="center" width="250" src="http://clockworkchaos.com/projects/hci_manifesto/computation.png" alt="Computational approach" /></td>
</tr><tr><td>
<ul><li>
<b>Research design skills</b>: These fields focus upon establishing a foundation based upon the notion of social consensus and causal inference. Reasoning is grounded and generalizes well to external, unseen events; however, it is limited in expressive power. Example disciplines include: psychology, industrial engineering, sociology, anthropology, archaeology, civil engineering, etc.
</li>
</ul></td>
<td width="250">
<img align="center" width="250" src="http://clockworkchaos.com/projects/hci_manifesto/research.png" alt="Research approach" /></td>
</tr><tr><td>
<ul><li>
<b>Domain knowledge</b>: These fields focus upon establishing a foundation based upon the notion of practical value and political will. Reasoning is clear and cohesive but risks becoming dogmatic. Example disciplines include: business, accounting, education, technology, manufacturing, politics, etc.
</li>
</ul></td>
<td width="250">
<img align="center" width="250" src="http://clockworkchaos.com/projects/hci_manifesto/domain.png" alt="Domain approach" /></td>
</tr></table><p>The elegance of merging these approaches comes, I argue, from combining the established foundations upon which each approach rests. This blending allows the strengths of each to be combined while tending to ameliorate the weaknesses of any one approach.<br /><img style="margin:0px auto;display:block" width="600" src="http://clockworkchaos.com/projects/hci_manifesto/sum_all_three.png" alt="Computation + Research + Domain" /><br />
When we combine these approaches, hybrid approaches form at the edges between any two of them. These interesting combinations are also becoming more common:</p>
<ul><li>
<b>Technical specialist</b>: At the edge between domain knowledge and computational expertise, this approach utilizes the advantages of reasoning as it relates to accomplishing a specific, known or defined goal. Example disciplines include: financial analysis, security analysis, database administration, computational biology, applied machine learning, etc.
</li>
<li>
<b>Statistician</b>: At the edge between research design skills and computational expertise, this approach utilizes the advantages of reasoning and computational inference to infer causality and to predict. Example disciplines include: actuarial science, machine learning, etc.
</li>
<li>
<b>Applied scientist</b>: At the edge between domain knowledge and research design skills, this approach utilizes the advantages of clearly defined problems and research methods to discover new phenomena. Example disciplines include: climatology, consumer behavior, marketing research, etc.
</li>
</ul><p>At this point the reader is of course wondering about the intersection of all three approaches. This intersection is one which we have seen before, but which has become increasingly common lately.</p>
<ul><li>
<b>Data Sciences</b>: At the intersection of domain knowledge, computational expertise, and research design skills, this approach utilizes the advantages of reasoning, problem definition, and research design to loosen the stranglehold that hard problems have on progress. Examples include: human factors engineering, human machine systems engineering (HMSE), human-computer interaction (HCI). Previously we may have called these individuals applied scientists or research engineers.
</li>
</ul><p>As a profession Data Science offers to unite disparate fields. This is not a unique phenomenon. Other professions merge these three approaches. For example, the medical profession merges research, science, and business to treat patients. These three approaches are synergistic. Merging all three approaches will yield unexpected results that are more than the simple summation of the three parts. This means that in the long run the Data Science discipline can become transcendent of any one predecessor approach. When this happens it becomes its own profession. As a profession, it should begin to receive special recognition as its own part of the cultural estate. It should become its own area of research along with honored degrees and esteemed forerunners. </p>
<p><img style="float:left;width:450px" width="450" src="http://clockworkchaos.com/projects/hci_manifesto/all_three.png" alt="Data science is a combination of three approaches: research, domain, and computation" /></p>
<p>This whole post may seem a little (or a lot) self-serving since I am a Data Scientist. However, that is not my intent. Instead, I am trying to illustrate how this field is different from how I have recently seen it characterized. It is not simply applying machine learning to a business’ data. It is different from creating a self-organizing algorithm. It is not something that can be taught through one or a few online courses. It is not a 10-credit certificate.</p>
<p><i><b>Data Science</b> is the science and practice of applying computational expertise, domain knowledge, and research design skills to define, model, and discover in the context of ill-defined, complex problems</i>. The challenge of course is the complexity and breadth of knowledge required to pursue this discipline. If it really requires someone to become an expert in three different fields, then how can we expect to fulfill the need for Data Scientists? As the profession advances, the required skillsets will solidify and it will become clearer how one can pursue this profession.</p>
<p><i>Tags: Data Science, HCI, Usability, UX, UI, Business Intelligence, Predictive Analytics, Analytics, Algorithm, Machine Learning, Artificial Intelligence, Adaptive Systems. Posted Fri, 20 Feb 2015.</i></p><hr /><h2><a href="http://clockworkchaos.com/project7/?q=geostratification">County-Level Geostratification of the Primary Demographic Factors for the U.S. Population</a></h2>
<p><b>MEDIAN HOUSEHOLD INCOME:</b> ILLUSTRATED AS A USCB COUNTY-LEVEL CHOROPLETH CHART<br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/median_income.png" alt="Geographical indices across the U.S. for income." /><br />
It's been a while since I last posted anything. I've been feeling the need to focus more upon driving an understanding of data science rather than upon any particular toolset. For example, I've been focusing too much on information visualization. Not that there's anything wrong with info. viz., but it is a means to an end for me, where that end is assisting the decision-making process. I also haven't felt much inclination to publish since I feel I often exhaust the reader with dreariness or with overly exaggerated detail....<br /><br />These things considered, today I was feeling epistemological, so I decided to wax about it for a second: consider the infinite number of possible attributes we could examine. Each of these attributes has a set of possible values, and not every value for a given attribute is equally likely to occur. The likelihood of each value often relates to various external factors such as time or space. This seems simple, almost mundane. Yet this simple observation summarizes more than 90% of the analysis that I've seen used in a typical decision-making process. I think the tendency to immediately pursue an algorithm or complex research design betrays the need to first focus upon basic fundamentals. These fundamentals consist of 1) defining the concepts related to an area of inquiry, 2) examining the types of values associated with each of these concepts, and 3) exploring the relationship between these values across time and space. Since I always see longitudinal plots in business analyses, I tend to focus more upon geospatial relationships within my blog because I feel they are underutilized. Simple geospatial charts convey complex relationships; color-coded charts are powerful because they tap into our innate ability to visually detect patterns.<br /><br />So, I gathered a whole bunch of data from the U.S. Census Bureau and decided to illustrate the geostratification of some of the more common demographic factors: age, gender, income, education, and ethnicity. Any variance in the values of these social constructs by region will illustrate insights useful for understanding the social stratification within our own culture.<br /><br /><b>AGE:</b> ILLUSTRATED AS A USCB COUNTY-LEVEL CHOROPLETH CHART<br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/age.png" alt="Geographical indices for U.S. age population." /><br />
The older populations tend to more densely populate the rural areas surrounding larger metropolitan communities.<br /><br /><b>GENDER:</b> THE PERCENTAGE OF POPULATION THAT IS FEMALE ILLUSTRATED AS A USCB COUNTY-LEVEL CHOROPLETH CHART<br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/index_female.png" alt="Geographical indices across the U.S. for gender." /><br />
It seems as though there are more women in the South and along the coasts; perhaps they like warmer weather?<br /><br /><b>MEDIAN HOUSEHOLD INCOME:</b> ILLUSTRATED AS A USCB COUNTY-LEVEL CHOROPLETH CHART<br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/median_income.png" alt="Geographical indices across the U.S. for income." /><br />
The higher household incomes tend to co-occur with larger metropolitan communities. Could this be an artifact of using the median to measure this value?<br /><br /><b>EDUCATION:</b> PERCENTAGE OF INDIVIDUALS OBTAINING BACHELOR'S OR HIGHER DEGREES SHOWN AS A USCB COUNTY-LEVEL CHOROPLETH CHART<br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/index_bachelordegree.png" alt="Geographical indices across the U.S. for education." /><br />
Education tends to also co-occur with metropolitan communities. Perhaps there is a covariant relationship between education and income?<br /><br /><b>ETHNICITY:</b> ILLUSTRATED AS A SERIES OF USCB COUNTY-LEVEL CHOROPLETH CHARTS, ONE FOR EACH OF THE FIVE LARGEST GROUPS<br /><br /><i>WHITE, NON-HISPANIC POPULATION PROPENSITY</i><br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/white_nonhispanic.png" alt="Geographical indices for U.S. White, Non-Hispanic population." /><br />
This group seems to densely populate most of the country; however, there are distinct areas where this group is less densely populated.<br /><br /><i>HISPANIC POPULATION PROPENSITY</i><br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/hispanic_latino.png" alt="Geographical indices for U.S. Hispanic population." /><br />
The Hispanic group seems to more densely populate the Southwest of the country. This could account for part of the gap seen in the White, Non-Hispanic population.<br /><br /><i>BLACK, OR AFRICAN-AMERICAN POPULATION PROPENSITY</i><br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/black_african.png" alt="Geographical indices for U.S. Black or African-American population." /><br />
The Black, or African-American, population tends to more densely populate the South. This seems to account for the other gap seen in the White, Non-Hispanic population.<br /><br /><i>ASIAN ALONE POPULATION PROPENSITY</i><br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/asian_alone.png" alt="Geographical indices for U.S. Asian population." /><br />
The Asian population tends to more densely populate the coasts.<br /><br /><i>ASIAN OR PACIFIC-ISLANDER POPULATION PROPENSITY</i><br /><img align="center" width="450" src="http://clockworkchaos.com/projects/census/pacific_islander.png" alt="Geographical indices for U.S. Pacific Islander population." /><br />
Notice how each group has distinct density patterns. This illustrates how powerful these charts are as an analytical tool, and why geostratification is an important analysis to use in conjunction with longitudinal analysis.<br /><br />The fine print: these charts are called choropleth charts. I created them using d3; however, I'm illustrating them here as static images in order to decrease page load times. The charts illustrate the value for each county indexed against the overall propensity for that attribute. This normalization process allows the charts to illustrate differences in relative degree rather than absolute differences. I use only nine different colors, chosen intentionally to bolster both usability and efficacy: all nine are different shades of the same color, and the colors diverge as the values within any given region move further away from the expected value. (A sketch of this indexing step appears after the data sources below.)<br />
Data Sources: <a href="http://www.census.gov/acs/www/">U.S. Census Data</a></p>
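<p>For the curious, the indexing-and-binning step from the fine print can be sketched as below; the column names and bin edges are hypothetical, and the actual charts were rendered with d3:</p>
<pre>
# Index each county's value against the overall national propensity,
# then bin the index into nine shades of a single hue for the choropleth.
import pandas as pd

df = pd.DataFrame({"county": ["A", "B", "C"],
                   "value": [0.62, 0.48, 0.55],        # e.g., share with a BA
                   "population": [100_000, 50_000, 250_000]})

# Population-weighted national rate; an index of 1.0 means "as expected".
national = (df["value"] * df["population"]).sum() / df["population"].sum()
df["index"] = df["value"] / national

# Nine bins diverging around the expected value; shades darken as a county
# moves further from the national rate in either direction.
bins = [0, 0.25, 0.5, 0.75, 0.95, 1.05, 1.33, 2.0, 4.0, float("inf")]
df["shade"] = pd.cut(df["index"], bins=bins, labels=range(9))
print(df[["county", "index", "shade"]])
</pre>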
<p><i>Tags: Geolocation, Information Visualization, Indexing, choropleth. Posted Tue, 05 Aug 2014.</i></p>