You Know

Sunday, March 11, 2018

MS SQL (and various other flavors) allows the user quite wide latitude in defining the result set. The risk with this is unintended results based on an incomplete understanding of the underlying data. For example, a straightforward LEFT JOIN with 1,000,000 records in the left table might return 1,000,001 records due to an unexpected partial duplicate. Worse, the LEFT JOIN might return 1,000,000 records today and 1,000,001 records a few days later after the user is satisfied (wrongly) that the query is correct.
It seems that queries could be written more explicitly, which could lead to easier built-in operations to facilitate the direct query writing. For example, one model of how an explicit query could be written is below using a fictitious repurposing of the DISTINCT clause and a new LEFT AGGREGATE JOIN join type. Of course these actions can be accomplished with the current functionality, but require much more convoluted language that can lead to unintended consequences if not fully understood.

Thursday, August 31, 2017

Wikipedia has pages for each day of the year (e.g., January 1, April 25). Each page contains a list of names of famous people with that birthday along with a short description of each person. I wrote an R script to scrape these lists from each wikipedia page and then analyze the data to find which days have unusually high occurrences of certain words (based on the tf-idf statistic).

Saturday, March 4, 2017

Can you imagine a future whereby nations agree to an alternative to war? Disputes may never go away, but the resolution mechanisms might change. The mechanism's effect would have to be quite similar to the forgone war's would-be effect for both nations to agree.

Presumably, some wars happen when each side believes it would win even though they both cannot
be correct. The inputs would be the quantities of soldiers, armaments, ammunition, technology, etc. Who has the higher total? Now, go back one step: who has an economy capable of generating more of these inputs? Now, go back another step: who has a society that is more conducive to having superior economic output (think productivity, raw materials, skills, incentives)? Effectively, the better equipped social system is likely to be the eventual winner. So, how to simulate this without bloodshed?

Allow countries to buy parts of territories adjacent to its borders; similarly, countries could pledge its own adjacent land as collateral when borrowing money. This needs much more thought, but could have some interesting outcomes.

Your first thought might envisage a mighty Alabama team running on each play, gobbling up chunks of 4, 5, 10 yards each time. Moving the chains, wearing down the opponent's defense. But is this helping to win?

One way to think about time of possession is like baseball's innings and their length. In baseball, each team gets up to nine half-innings to score some runs. With football, we might call a team's half-inning a drive. The team with the ball has a drive to score 3 or 7 points (usually). If a team eats up a lot of clock on a lengthy drive, both teams have a reduced amount of time possible with the ball. If a big-play team scores a touchdown on a mere three plays, then there is only a small amount of time reduced to both teams.

Effectively, the length of a drive's main effect is on the number of drives possible in a half. The better team should want as many drives as possible since this reduces the expected impact of bad luck for themselves and lucky breaks for the opponent; the variance in expected points should decrease leading to a more likely outcome of the better team scoring more points. Conversely, the worse team should want as few drives as possible. Anything can happen on one drive, but maintaining the luck is near impossible over several drives.

This concept has long been acknowledged in basketball. Teams such as North Carolina know they are usually superior and play a high-tempo style so that a few lucky three pointers against them won't sink their hopes. In football, some of the top teams stubbornly persist with the idea that longer time of possession causes more wins; this analysis suggests they would be even better if they adapted.

The bottom line is that coaches of stronger football teams should emphasize a high-tempo style, while weaker teams should focus on tactics that lead to fewer drives. They can do this by recruiting and game-planning accordingly.

Thursday, December 1, 2016

From Wikipedia: Benford's law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small...

Benford's Law is a technique to screen for fraud in a large number of numeric records. Let's apply this to the 2016 U.S. election:

This chart of overall county vote totals by leading digit looks like it passes Benford's Law, but what about state-by-state results?

The states are sorted by number of counties since Benford's law breaks down with fewer records. The last couple of rows of states have zig-zag lines, probably due to this. There are four interesting states that do not have low county counts:

Virginia (VA) looks a bit off [Clinton won this state by 5%]

Kentucky (KY), Iowa (IA), and Mississippi (MA) look a bit off [Trump won these states by 30%, 10%, and 19%, respectively]

The dataset does not have records for Alaska, by the way.

Basically, the states with the oddest results were typically so lopsided as to scarcely matter.

An interesting observation is that the states with more counties clearly tilt toward Trump, while the states with fewer counties clearly tilt toward Clinton. Why might that be? Maybe the number of counties in a state is a proxy for average population size per county. If true, then the Clinton vote should correlate with the county population size. Does it? Indeed, the correlation is 35%.

Perhaps a theme to the election was the divergent preferences between counties with high populations versus the rest of the country. That suggests a potential solution: density-based laws, which might effectively be the return of city-states cooperating under one national flag. Individuals could pick their preferred city based on its basket of laws, obviating the imposition of a heavy blanket of laws from a national government that seesaws left and right, alienating an alternative half of its citizens each cycle.

Maybe the Constitution even effectively implemented this flavor of federalism in 1789, since state populations were then the size of present day cities. Just food for thought, let's not get carried away.

Wednesday, September 14, 2016

The Estate Tax (a/k/a Death Tax) inspires conflicting impulses. On one hand, it treats a lifelong spender's income differently from that of a lifelong saver just because there is leftover income in the bank (excellent exposition here from Greg Mankiw). On the other hand, it seems unfair for the lucky recipient of the funds to start off life with so much of a leg up.

How can this apparent contradiction of unfairness be resolved?

Notice that an inheritance is really a transaction with two parts: a give and a receive. One potential resolution is to remove the tax on giving, but to enact a tax on receiving funds over a certain threshold (say, $50,000).

With this knowledge, the giver would know that the gift is more effective when less concentrated. Expect the giver to give lower amounts to more people. Be careful not to set the threshold too low or the giver would find it impractical to stay under it and so might just ignore its intended dispersal effect.

There may be some unintended consequences so consider this food for thought.