danvk.org blog2018-04-28T21:19:00+00:00https://danvk.github.io/Dan Vanderkamdanvdk@gmail.comAlphaGo vs. Lee Sedol2016-03-15T05:07:00+00:00https://danvk.github.io/2016/03/15/alphago-vs-lee-sedol<p>I was dimly aware of the <a href="https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol">ongoing competition between AlphaGo and Lee
Sedol</a>, but I hadn’t paid much attention until I saw this chart <a href="https://www.reddit.com/r/dataisbeautiful/comments/4a8336/lee_sedol_vs_alphago_4th_game_thinking_time_in/">on
reddit</a>:</p>
<p><img src="http://i.imgur.com/3HcJKbk.png" border="0" width="700" alt="time per move during game 4" /></p>
<p>It’s hard to read “Lee Sedol’s brilliant attack (78th)” and not get curious!
This led me into a deep dive on the competition. You can read more about <a href="https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/">the
move</a> or watch a <a href="https://www.youtube.com/watch?v=G5gJ-pVo1gs">15 minute summary</a> of the match. The
full <a href="https://www.youtube.com/watch?v=yCALyQRN3hw&amp;feature=youtu.be&amp;t=3h10m20s">6 hour match</a>, including a press conference afterwards, is also
online. You can learn a lot about Go from listening to the YouTube
commentators. One of them is <a href="https://en.wikipedia.org/wiki/Michael_Redmond_(Go_player)">Michael Redmond</a>, the all-time top-rated
American Go player. Even if you don’t understand how to play Go (I barely do),
it’s fun to watch <a href="https://www.youtube.com/watch?v=SMqjGNqfU6I&amp;feature=youtu.be&amp;t=1h25m32s">the experts react</a> to this move: “Oh, this [is]
very creative.”</p>
<p><img src="/images/alphago.png" width="231" height="190" style="float: right; margin-left: 1em;" />
<a href="https://en.wikipedia.org/wiki/Lee_Sedol">Lee Sedol</a>’s win in the fourth match is being celebrated as a victory for
humankind. But it’s surprising that we’re here at all. AlphaGo, Google
DeepMind’s computer Go program, had already won the first three games and hence
the best-of-five competition. This is a coming of age moment for the neural net
community. Over the past ten years, thanks to the emergence of <a href="https://www.cs.toronto.edu/~kriz/cifar.html">large data
sets</a> and <a href="http://www.nvidia.com/object/what-is-gpu-computing.html">GPUs</a>, the whole field has experienced a renaissance. But
most of the great results have been on toy problems like image classification
or traditional signal processing problems like speech recognition. Beating an
elite Go player for the first time is a marquee result that transcends the
field. As long as I’ve worked in software, Go has been the one game at which
computers couldn’t compete. Everyone thought that this result was years away.</p>
<p>Last fall, AlphaGo competed against <a href="https://en.wikipedia.org/wiki/Fan_Huio">Fan Hui</a>, a lower-ranked Go
professional. It beat him 5-0. This was the first time that a computer Go
program had defeated a professional. What happened next is suggestive.
<a href="http://www.wired.com/2016/03/sadness-beauty-watching-googles-ai-play-go/">According to Wired</a>, he began consulting with the DeepMind team:</p>
<blockquote>
<p>As he played match after match with AlphaGo over the past five months, he
watched the machine improve. But he also watched himself improve. The
experience has, quite literally, changed the way he views the game. When he
first played the Google machine, he was ranked 633rd in the world. Now, he is
up into the 300s. In the months since October, AlphaGo has taught him, a
human, to be a better player. He sees things he didn’t see before. And that
makes him happy. “So beautiful,” he says. “So beautiful.”</p>
</blockquote>
<p>We learn most quickly from our betters—people who can review your work, say
what you did well and point out what would have been better. But if you’re the
best Go player in the world, who do you learn from? I wouldn’t be surprised if
this result prompts humans to discover new styles of play. Computers may get
the best of us at Go in the long run, but we’ll get better at it in the
process.</p>
<p>The fifth and final game happens tonight. If Lee Sedol wins, you could make the
case that he just took a few games to figure out how to get the best of
AlphaGo. If not, it means that it took completely brilliant play from the best
in the world to beat a computer, and it’s likely to be the last time a human
ever pulls this off. The match is being <a href="https://www.youtube.com/watch?v=mzpW10DPHeQ">streamed live on YouTube</a>.</p>
Extending the Grid to Add 1,000 Photos to OldNYC2016-01-19T17:36:00+00:00https://danvk.github.io/2016/01/19/oldnyc-update<p>I recently added around 1,000 new photos to the map on OldNYC. Read on to find
out how!</p>
<div class="center">
<img src="/images/stuytown-update.gif" width="300" height="275" alt="OldNYC update around Stuytown" />
</div>
<p>At its core, OldNYC is based on <a href="https://en.wikipedia.org/wiki/Geocoding">geocoding</a>: the process of going from
textual addresses like “9th Street and Avenue A” to numeric latitudes and
longitudes. There’s a bit of a mismatch here. The NYPL photos have 1930s
addresses and cross-streets, but geocoders are built to work with contemporary
addresses. OldNYC makes an assumption that contemporary geocoders will produce
accurate results for these old addresses. For NYC, this is usually a good
assumption! The street grid hasn’t changed too much in the past 150 years. But
it is an assumption, and it doesn’t always pan out.</p>
<p>Two of the most noticeable problem spots are Stuytown and Park Avenue South:</p>
<div class="center">
<img src="/images/stuytown-pas-missing.png" width="600" height="550" alt="Stuytown and Park Avenue South have no images" />
</div>
<p>The lettered Avenues (A, B, C, D) used to continue above 14th street. This was
the <a href="https://en.wikipedia.org/wiki/Gashouse_District">Gas House district</a>. But in the 1940s, this area was destroyed to
make way for the super-blocks of <a href="https://en.wikipedia.org/wiki/Stuyvesant_Town%E2%80%93Peter_Cooper_Village">Stuyvesant Town</a>. Intersections
like “15th and A” do no exist in the contemporary Manhattan grid and geocoders
can’t make sense of them. But there <em>are</em> <a href="http://www.oldnyc.org/#711319f-a">photos there</a>!</p>
<p>The problem for Park Avenue South is different. <a href="https://en.wikipedia.org/wiki/Park_Avenue#History">Until 1959</a>, it was known
as 4th Avenue. So photographs from the 1930s are recorded as being at, for
example, “4th Avenue and 17th street”, an interesection which no longer exists.
Again, contemporary geocoders can’t make sense of this.</p>
<p>The frustrating thing here is that it’s perfectly obvious where all of these
interesctions <em>should</em> be. Manhattan has a <a href="https://en.wikipedia.org/wiki/Commissioners%27_Plan_of_1811">regular street grid</a>, after
all. So I set out to build my own Manhattan street grid geocoder.</p>
<p>To begin with, I gathered lat/lons for <a href="https://docs.google.com/spreadsheets/d/1Kv0xXYHXUrlFmyIF92NsWTMuuRWAl-heBEI_OejcBA0/edit#gid=0">every intersection</a> that I
could. With <a href="https://github.com/danvk/oldnyc/blob/9aa15e0a4ce46d746531c7f9531ff4a50e17a53f/grid/gold.py#L21-L41">some simple logic</a>, this handled the Avenue renaming issue.</p>
<p>My initial idea to geocode unknown interesections was to interpolate
on the avenues. For example, to find where the intersection of 18th Street and
Avenue A should be, you can assume that the intersections of numbered streets
and Avenue A are evenly spaced and then find where the 18th street intersection
would fall:</p>
<div class="center">
<img src="/images/extrapolation.png" width="552" height="428" alt="Extrapolation of intersections" />
</div>
<p>Mathematically, you fit linear regressions from cross-street to latitude and
longitude. This feels like it should work but, because the streets aren’t all
perfectly spaced, it winds up producing results that don’t quite look right.</p>
<p>While I was playing around with this approach, I realized that I was checking
the results using a different technique: continuing the straight lines of the
streets until they intersected:</p>
<div class="center">
<img src="/images/continued-lines.png" width="403" height="329" alt="Extrapolation of streets" />
</div>
<p>Mathematically, this means that you fit a linear regression to the
latitude→longitude mapping for each Street and Avenue. To find an intersection,
you find the point where these lines intersect. This works so long as the
Streets and Avenues are straight. Fortunately, with a few exceptions like
Avenue C and the West Village, they are (r<sup>2</sup>&gt;0.99).</p>
<p>This approach produced very good results. The oddities which remained were as
likely to be problems with the data as with the geocoder (<a href="http://digitalcollections.nypl.org/items/510d47dd-07a6-a3d9-e040-e00a18064a99">one image</a> was
non-sensically labeled as “25th &amp; D”, which extrapolates to somewhere in the
East River).</p>
<p>While Stuytown and Park Avenue South were clear winners, new photos appeared
all over the map:</p>
<div class="center">
<img src="/images/oldnyc-update.gif" width="400" height="400" alt="Update for lower Manhattan" />
</div>
<p>It even helped uptown:</p>
<div class="center">
<img src="/images/oldnyc-uptown-update.gif" width="640" height="400" alt="Update for Uptown" />
</div>
<p>All told, there are about 1,000 new images on the map. Go check them out and!
And please help transcribe the text on the back of them. My <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR system</a>
didn’t run on the new images, so they’re sorely lacking descriptions.</p>
<p>Here are a few favorites:</p>
<div class="center">
<a href="https://www.oldnyc.org/#716426f-a"><img src="/images/716426f-a.jpg" width="600" height="421" /></a>
<br />
<i>Looking down Avenue B from 15th to 17th Street. This Avenue no longer exists.</i>
</div>
<div class="center">
<a href="https://www.oldnyc.org/#708194f-a"><img src="/images/708194f-a.jpg" width="482" height="348" /></a>
<br />
<i>The Everett House hotel at Union Square in 1906. This building no longer
exists, but there's a new hotel (the W) at the same location.</i>
</div>
My takeaways from NIPS 20152015-12-12T00:00:00+00:00https://danvk.github.io/2015/12/12/nips-2015<p>I’ve just wrapped up my trip to <a href="https://nips.cc/Conferences/2015">NIPS 2015</a> in Montreal and thought I’d jot down a few things that struck me this year:</p>
<ul>
<li>
<p><strong>Saddle Points vs Local Minima</strong></p>
<p>I heard this point repeated in a talk almost every day. In low-dimensional spaces (i.e. the ones we can visualize) local minima are the major impediment to optimizers reaching the global minimum. But this doesn’t generalize. In high-dimensional spaces, local minima are almost non-existent. Instead, there are saddle points: points which are a minimum in some directions but a maximum in others. Intuitively, this makes sense: in N dimensions, the odds of the curvatures all going the same way at a point is (1/2)^N. As Yoshua Bengio said, “it’s hard to build an n-dimensional wall.” This gives an intuition for why procedures like <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> are effective at optimizing the thousands of weights in a neural net: they won’t get stuck in a local optimum. And it gives an intuition for why <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum">momentum</a> is helpful: it helps gradient descent escape from saddle points.</p>
</li>
<li>
<p><strong>Model Compression</strong></p>
<p>The tutorial on <a href="http://nips2015.sched.org/event/4FAy/high-performance-hardware-for-machine-learning"><em>Hardware for Deep Learning</em></a> was less about new hardware and more about how to make your software get the most out of existing hardware. Due to the high cost of uncached, off-chip memory reads, reducing the memory footprint of your models can be a huge performance win. Bill Dally presented a result on model pruning that I found interesting: by iteratively removing small weights from a model and retraining, they were able to remove 90+% of the weights with zero loss of precision. This parallels an observation from <a href="http://www.ttic.edu/dl/dark14.pdf">transfer learning</a>, that small networks are most effectively trained using the output of larger networks. It would be nice if we could train these smaller networks directly. See the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> paper.</p>
</li>
<li>
<p><strong>The importance of canonical data sets / problems</strong></p>
<p>Over and over, talks and posters referenced the same canonical data sets: the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> set of handwritten digits, the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR</a> and <a href="http://www.image-net.org/">ImageNet</a> images, the <a href="https://catalog.ldc.upenn.edu/LDC93S1">TIMIT</a> speech corpus and the <a href="http://www.arcadelearningenvironment.org/">Atari/Arcade Learning Environment</a> (ALE). These have given researchers in their fields a shared problem on which to experiment, compete, collaborate and measure their progress. If you want to push a field forward, built a good challenge problem.</p>
</li>
<li>
<p><strong>One-shot Learning</strong></p>
<p>There was much high-level talk about how the human brain is very good at learning to perform new tasks quickly. Contrast this with neural nets, which require thousands or millions of training examples to reach human performance. This comparison is somewhat unfair because adult humans have years of experience interacting with the real world from which to draw on. There seems to be a great deal of interest in getting machines to do a better job of transferring general knowledge to specific tasks.</p>
</li>
<li>
<p><strong>This conference has gotten huge!</strong></p>
<p>I’ve read that registrations have gone up significantly over the last few years. This was palpable at the conference. Many of the workshop rooms were packed to the gills and I watched most of the larger talks in overflow rooms. This is probably good thing for the ML community. I’m not sure if it’s a good thing for NIPS.</p>
</li>
</ul>
<p>A few smaller bits that struck me:</p>
<ul>
<li>
<p><a href="http://arxiv.org/abs/1505.00387">Highway Networks</a> are cool. They let you train very deep networks, where the depth has to be learned. They found that depth &gt; 20 was not helpful for MNIST, but was for CIFAR.</p>
</li>
<li>
<p><a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a> seems to have become a canonical neural net for experimentation. I saw it referenced repeatedly, e.g. on the <a href="http://arxiv.org/abs/1510.00149">Deep Compression</a> poster.</p>
</li>
<li>
<p>As an example of the above, I really enjoyed the <a href="http://arxiv.org/abs/1407.5104"><em>Pixels to Voxels</em></a> talk. Pulpit Agarwal &amp; co showed that the activations of individual layers of AlexNet correlate to activations in regions of the brain. They were able to use this correspondence to learn what some of the <a href="https://en.wikipedia.org/wiki/Visual_cortex#V4">mid-level regions</a> of the visual cortex are doing.</p>
</li>
<li>
<p>We heard several speakers say that “backprop is not biologically plausible.” I assume this is because we don’t consume nearly enough labels for it to be practical at the scale of a human brain?</p>
</li>
<li>
<p>I asked someone at the NVIDIA booth whether the ML industry is large enough to drive GPU design &amp; sales (as opposed to the game industry). It is. The GPUs designed for ML tend to be more robust than those used in games. They have error correcting codes built-in. In a game, if an arithmetic unit makes a mistake, it’ll be fixed on the next frame. When you’re training a neural net, that mistake can propagate.</p>
</li>
</ul>
<p>I do <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">relatively little</a> machine learning in my day-to-day. NIPS is always a bit over my head, but it’s a good way to rekindle my interest in the field.</p>
Dan writes on HammerLab2015-10-21T18:03:00+00:00https://danvk.github.io/2015/10/21/hammerlab-posts<p>I haven’t written a substantial blog post on danvk.org <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">since January</a>.
Instead, I’ve been writing over on my groups’s blog at <a href="http://www.hammerlab.org/">hammerlab.org</a>.</p>
<p>Here are the posts I’ve written or edited:</p>
<ul>
<li>
<p><a href="http://www.hammerlab.org/2015/10/13/svg-canvas-the-pileup-js-journey/">SVG→Canvas, the pileup.js Journey</a> (13 Oct 2015)</p>
<p>In which I explain why we changed from using SVG to using canvas for our
genome browser (spoiler: it’s performance). This also introduces
<a href="https://github.com/hammerlab/data-canvas">data-canvas</a>, which compensates for some of the drawbacks of canvas,
e.g. difficulty in tracking clicks and writing tests.</p>
</li>
<li>
<p><a href="http://www.hammerlab.org/2015/07/09/bundling-and-distributing-complex-es6-libraries-in-an-es5-world/">Bundling and Distributing Complex ES6 Libraries in an ES5 World</a> (09 Jul 2015)</p>
<p>I helped edit this post, which was written by <a href="http://arman.aksoy.org/">Arman Aksoy</a>. It
outlines our approach to writing ES6 JavaScript while distributing/bundling
it for ES5 clients.</p>
</li>
<li>
<p><a href="http://www.hammerlab.org/2015/06/19/introducing-pileup-js-a-browser-based-genome-viewer/">Introducing pileup.js, a Browser-based Genome Viewer</a> (19 Jun 2015)</p>
<p>This post introduces pileup.js, the in-browser genome visualizer I’ve been
working on for most of the past year.</p>
</li>
<li><a href="http://www.hammerlab.org/2015/02/14/testing-react-web-apps-with-mocha/">Testing React Web Apps with Mocha</a> (14 Feb 2015)</li>
<li>
<p><a href="http://www.hammerlab.org/2015/02/21/testing-react-web-apps-with-mocha-part-2/">Testing React Web Apps with Mocha (Part 2)</a> (21 Feb 2015)</p>
<p>A two-parter in which I explain how we set up testing for our React web app
using Mocha, rather than Jest. Since writing this post, I’ve completely
changed my mind about how web testing should be done. If your code is
intended to be run in the browser, then you should test it in the browser,
rather than using Node as these posts suggest.</p>
</li>
<li>
<p><a href="http://www.hammerlab.org/2015/01/23/faster-pileup-loading-with-bai-indices/">Faster Pileup Loading with BAI Indices</a> (23 Jan 2015)</p>
<p>BAM is a very widely used bioinformatics file format which stores aligned
reads from a genome. These files can be huge, so the BAM Index (BAI) format
was created to speed up retrieval. We found that the BAI was <em>also</em> too
large for convenient access on the web, so we created an index of the
index.</p>
</li>
<li>
<p><a href="http://www.hammerlab.org/2014/12/05/igv-httpfs/">Streaming from HDFS with igv-httpfs</a> (05 Dec 2014)</p>
<p>How we got data from HDFS into our genome viewer of choice by building a
small piece of infrastructure. We’ve since stopped using this. Nowadays we
have an NFS mount for our HDFS file system and serve from that using nginx.</p>
</li>
</ul>
Launched: OldNYC2015-06-04T02:24:00+00:00https://danvk.github.io/2015/06/04/launched-oldnyc<p><a href="http://www.oldnyc.org/"><img src="/images/oldnyc-big.jpg" width="300" height="236" style="float:right; margin-left: 20px;" /></a>
Two weeks ago I <a href="https://twitter.com/danvdk/status/601381598525796352">launched</a> my latest side project, <a href="http://www.oldnyc.org/">OldNYC</a>. It’s a
collaboration with NYPL Labs which places around 40,000 historical photos of
New York City on a map. Avid readers of this blog know that I’ve been <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">working
on</a> this <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">for years</a>.</p>
<p>The response to OldNYC has been completely overwhelming. Hundreds of thousands
of people have used the site. Millions of images have been viewed. Users have
left nearly a thousand comments and fixed thousands of typos in the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">OCR’d
descriptions</a>.</p>
<p>Rather than say more about the project myself, I’ll let you pick your write-up
of choice. There have been many!</p>
<ul>
<li>New York Times: <a href="http://cityroom.blogs.nytimes.com/2015/05/26/new-york-today-new-views-of-the-past/?_r=0">New York Today: New Views of the Past</a></li>
<li>Gothamist: <a href="http://gothamist.com/2015/05/26/old_nyc_photo_map.php">This Photo Map Will Bring You Back To Old NYC, Block By Block</a></li>
<li>The Guardian: <a href="http://www.theguardian.com/us-news/2015/may/21/new-york-historical-photo-archive-lets-residents-track-changes-to-a-city-in-sepia">New York vintage photo archive lets you track a century of change to your block</a></li>
<li>Citylab: <a href="http://www.citylab.com/design/2015/05/pictures-map-old-new-york-historical/393944/">Mapping the New York That Once Was</a></li>
<li>Pix11 News (TV): <a href="http://pix11.com/2015/05/26/website-gives-a-virtual-tour-of-old-nyc-through-photo-maps/">Website gives a virtual tour of old NYC through photo maps</a></li>
<li>Library Journal: <a href="http://lj.libraryjournal.com/2015/06/digital-content/developer-maps-library-photo-archives/">Developer Maps Library Photo Archives</a></li>
<li>Kottke: <a href="http://kottke.org/15/05/mapping-photos-of-old-nyc">Mapping photos of old NYC</a></li>
<li>Gizmodo: <a href="http://gizmodo.com/here-are-40-000-photos-of-old-new-york-plotted-on-a-cit-1706342527">Here Are 40,000 Photos Of Old New York Plotted on a City Map</a></li>
<li>DNAInfo: <a href="http://www.dnainfo.com/new-york/20150522/midtown/see-old-photos-of-your-neighborhood-with-this-interactive-map">See Old Photos of Your Neighborhood With This Interactive Map</a></li>
<li>West Side Rag: <a href="http://www.westsiderag.com/2015/05/21/throwback-thursday-new-tool-lets-you-see-photos-of-your-street-through-the-years">Throwback Thursday: New Tool Lets You See Photos of Your Street Through the Years</a></li>
<li>Daily Mail: <a href="http://www.dailymail.co.uk/sciencetech/article-3093512/Step-time-OldNY-Interactive-map-allows-users-explore-New-York-40-000-old-photos.html">The Big Apple back in time: Forty thousand historic photographs on an interactive map offer slice of life from old New York</a></li>
<li>Business Insider: <a href="http://www.businessinsider.com/nypl-old-photos-interactive-map-2015-5">This incredible map lets New Yorkers see vintage photos of their street corners</a></li>
<li>EV Grieve: <a href="http://evgrieve.com/2015/05/immerse-yourself-in-archival-photos-of.html">Immerse yourself in archival photos of NYC</a></li>
<li>The Awl: <a href="http://www.theawl.com/2015/05/old-new-york-mapped">Old New York, Mapped</a></li>
<li>The Week: <a href="http://theweek.com/speedreads/556593/check-amazing-archive-historic-new-york-pictures">Check out this amazing archive of historic New York pictures</a></li>
<li>Fast Company: <a href="http://www.fastcodesign.com/3046606/see-what-times-square-and-wall-street-looked-like-100-years-ago">See What Times Square And Wall Street Looked Like 100 Years Ago</a></li>
<li>Free Williamsburg: <a href="http://freewilliamsburg.com/see-photos-of-old-williamsburg-thanks-to-the-ny-public-library/">See photos of old Williamsburg thanks to the NY Public Library</a></li>
<li>PetaPixel: <a href="http://petapixel.com/2015/05/25/oldsf-and-oldnyc-historical-photos-plotted-on-maps/">OldSF and OldNYC: Historical Photos Plotted on Maps</a></li>
<li>Brooklyn Magazine: <a href="http://www.bkmag.com/2015/05/21/here-are-thousands-of-historical-photos-of-new-york-city-all-on-one-interactive-map/">Here are Thousands of Historical Photos of New York City, All on One Interactive Map</a></li>
<li>Mental Floss: <a href="http://mentalfloss.com/article/64253/tour-old-new-york-your-computer">Tour Old New York on Your Computer</a></li>
</ul>
PyCon 2015: Make web development awesome with visual diffing tools2015-04-12T15:05:00+00:00https://danvk.github.io/2015/04/12/pycon-2015-make-web-development-awesome-with-visual-diffing-tools<p>Here’s the <a href="https://youtu.be/jUUTqgzNR3M">video</a> of my talk from PyCon 2015, <a href="https://us.pycon.org/2015/schedule/presentation/395/">Make web development awesome with visual diffing tools</a>:</p>
<iframe width="640" height="360" src="https://www.youtube.com/embed/jUUTqgzNR3M" frameborder="0" allowfullscreen=""></iframe>
<p>Here are the <a href="http://bit.ly/pycon2015-visual-diffs">slides</a> for the talk:</p>
<iframe src="https://docs.google.com/presentation/embed?id=1C6TcdHSBQNcLEIH6SFmmAOLJ7cy1gvIbmCj3cyd7fxE&amp;start=false&amp;loop=false&amp;" frameborder="0" width="520" height="405"></iframe>
<p>The two tools referenced are:</p>
<ul>
<li><a href="https://github.com/bslatkin/dpxdt">dpxdt</a> for generating screenshots</li>
<li><a href="https://github.com/danvk/webdiff">webdiff</a> for viewing image diffs</li>
</ul>
<p>I used <a href="http://www.comparea.org/">comparea</a> and <a href="http://www.dygraphs.com/">dygraphs</a> as sample apps for the demos.</p>
<p>If you enjoyed my talk, you might also enjoy <a href="http://www.onebigfluke.com/">Brett’s</a> talk from a few
years ago, <a href="https://www.youtube.com/watch?v=1wHr-O6gEfc">The Secret of Safe Continuous Deployment: Perceptual Diffs</a>.
It goes into more depth on how screenshots can facilitate rapid deployment.</p>
Training an Ocropus OCR model2015-01-11T04:58:00+00:00https://danvk.github.io/2015/01/11/training-an-ocropus-ocr-model<style>
.bordered {
border: 1px solid rgb(204, 204, 204);
}
</style>
<p>In the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>, we walked through the steps in the
<a href="https://github.com/tmbdev/ocropy">Ocropus</a> OCR pipeline. We extracted text from images like this:</p>
<div class="center">
<img src="/images/ocropus/start.png" width="608" height="137" />
</div>
<p>The results using the default model were passable but not great:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>O1inton Street, aouth from LIYingston Street.
Auguat S, 1934.
P. L. Sperr.
NO REPODUCTIONS.
</code></pre></div></div>
<p>Over the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">larger corpus</a> of images, the error rate was around 10%.
The default model has never seen typewriter fonts, nor has it seen ALLCAPS
text, both of which figure prominently in this collection. So its poor
performance comes as no surprise.</p>
<p>In this post I’ll walk through the process of training an Ocropus model to
recognize the typewritten text in this collection. By the end of this post,
the performance will be extremely good.</p>
<h3 id="generating-truth-data">Generating truth data</h3>
<p>Ocropus trains its model using <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a>: it requires
images of lines along with correct transcriptions. If you’re trying to
recognize a known font, you can generate arbitrary amounts of labeled data
(using <code class="highlighter-rouge">ocropus-linegen</code>). But in our case, we have to label some images by
hand.</p>
<p>This is tedious and involves a lot of typing. Amazon’s <a href="https://www.mturk.com/mturk/welcome">Mechanical
Turk</a> is a popular way of farming out small tasks like this, but I
prefer to do the transcription myself using <a href="http://www.danvk.org/wp/2013-04-20/generating-training-data/index.html">localturk</a>. It doesn’t
take as long as you might think (I typed 800 lines in about an hour and 20
minutes). And it has the benefit of forcing you to look at a large sample of
your data, something that’s likely to lead to insights.</p>
<div class="center">
<img src="/images/ocropus/localturk.png" width="474" height="401" />
<br /><i>(localturk in action)</i>
</div>
<p>I used <a href="https://gist.github.com/danvk/abec6e0c7657e2c8e86f">this template</a> for the transcription. Ocropus expects truth
data to be in <code class="highlighter-rouge">.gt.txt</code> files with the same name as the PNG files for the
lines. For example:</p>
<ul>
<li><code class="highlighter-rouge">book/0001/010001.png</code></li>
<li><code class="highlighter-rouge">book/0001/010001.gt.txt</code></li>
</ul>
<p>It’s important that you transcribe <em>lines</em>, not entire pages. I initially
transcribed pages and tried to have Ocropus learn on them, but this doesn’t
work at all.</p>
<h3 id="training-a-model">Training a model</h3>
<p>Ocropus trains a model by learning from its mistakes. It transcribes the text
in a line, then adjusts the weights in the Neural Net to compensate for the
errors. Then it does this again for the next line, and the next, and so on.
When it gets to the last line of labeled data, it starts over again. As it
loops through the training data over and over again, the model gets better and
better.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rtrain -o modelname book*/????/*.bin.png
</code></pre></div></div>
<p>This produces lots of output like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2000 70.56 (1190, 48) 715641b-crop-010002.png
TRU: u'504-508 West 142nd Street, adjoining and west of Hamilton'
ALN: u'504-5088 West 422nd Street, adjoining and west of Hammilton'
OUT: u'3od-iS est 4nd Street, doning nd est of Sarilton'
2001 32.38 (341, 48) 726826b-crop-010003.png
TRU: u'NO REPRODUCTIONS'
ALN: u'NO REPRODUCTIONS'
OUT: u'sO EROCoOri'
...
</code></pre></div></div>
<p><code class="highlighter-rouge">TRU</code> is the truth data. <code class="highlighter-rouge">OUT</code> is the output of the model. <code class="highlighter-rouge">ALN</code> is a variant
of the model output which is aligned to the truth data. It’s used to adjust
the model weights more precisely. It typically looks better than the model
output, especially in early iterations. It lets you know that you’re making
progress.</p>
<p>Here’s a video that Thomas, the Ocropus developer, put together. It shows the
network’s output for a single image as it learns (see the <a href="https://www.youtube.com/watch?v=czG5Jk9iC7c">YouTube
page</a> for explanations of the different charts):</p>
<div class="center">
<iframe width="560" height="315" src="//www.youtube.com/embed/czG5Jk9iC7c" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>For my first model, I used 400 of the labeled lines as training data and held
out the other 400 as test data. Ocropus saves models to disk every 1000
iterations, so it’s simple to evaluate the model’s performance as it learns:</p>
<div class="center">
<img src="/images/ocropus/err-train-test.png" width="552" height="326" />
</div>
<p>The error rate starts high (over 50%) but quickly comes down to about 2% after
10,000 iterations, eventually hitting a minimum of 0.96% at 16,000 iterations.</p>
<p>The error rate on the test set is consistently about 3% higher than that on
the training set. The best error rate on the test set was 4.20%.</p>
<p>There’s a lot of variation in the error rate. You might expect it to slowly
decrease over time, but that’s not at all the case. I’m not quite sure how to
interpret this. Does the error rate spike at 17,000 iterations because the
model tries to jolt itself out of a local minimum? Is it just randomness?</p>
<p>In any case, it’s important to generate a chart like this. Choosing the wrong
model could lead to needlessly bad performance.</p>
<h3 id="training-with-more-data">Training with more data.</h3>
<p>You’d expect that training on more data would yield a better model. So for my
next model, I trained on all 800 labeled images (rather than just 400). I
didn’t have a test set. Here’s what the error rate looked like:</p>
<div class="center">
<img src="/images/ocropus/err-train800.png" width="472" height="294" />
</div>
<p>This doesn’t make much sense to me. The lowest error rate on the 800 training
images is 3.59%. But the model from the previous section achieved an error
rate of 2.58% on the same data set (average of 0.96% and 4.20%). And it only
saw half the data! How is that possible? Maybe this model just had bad luck.</p>
<p>There’s the same pattern as before of occasional spikes in error rate. More
disturbing, after around 40,000 iterations, I started seeing lots of
<a href="https://github.com/tmbdev/ocropy/issues/5">FloatingPointErrors</a>. It’s unclear to me exactly what this means.
Perhaps the model is diverging?</p>
<p>Here’s another model that I trained for even longer:</p>
<div class="center">
<img src="/images/ocropus/err-typewriter.png" width="522" height="275" />
</div>
<p>It achieves an error rate of 0.89% at iteration 33,000, then spikes to over
15% at 37,000. It eventually gets back down to 0.85% after 53,000 iterations,
then starts spiking again. By the time I stopped it, I was again seeing lots
of <code class="highlighter-rouge">FloatingPointErrors</code>.</p>
<p>The point of all this is that the error rates are quite erratic, so you need
to look at them before choosing which model you use!</p>
<h3 id="training-with-the-default-model">Training with the default model</h3>
<p>So far we’ve built our models from scratch. But you can also build on top of
an existing model.</p>
<p>Even though it’s never seen typewriter text or ALLCAPS, the <a href="http://www.tmbdev.net/">default Ocropus
model</a> presumably knows a lot about Latin characters and the
relationship between them in English words. And I trust the Ocropus
developers to build a good Ocropus model far more than I trust myself.</p>
<p>You train on top of an existing model using the <code class="highlighter-rouge">--load</code> option:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rtrain --load en-default.pyrnn.gz -o my-model *.png
</code></pre></div></div>
<p>Here’s what the error rate looks like:</p>
<div class="center">
<img src="/images/ocropus/err-default.png" width="468" height="286" />
</div>
<p>Now we’re getting somewhere: the error rate gets all the way down to 0.277%!</p>
<p>Something interesting happens when you get the error rate significantly below
1%. The “mistakes” that the model makes are quite likely to be errors that
<em>you</em> made while transcribing truth data! I noticed that I misspelled some
words and even hallucinated new words like “the” into some of the lines.</p>
<p>Even crazier, there were typos in the original images that I subconsciously
corrected:</p>
<div class="center">
<img src="/images/ocropus/milstein-typo.png" width="594" height="30" />
</div>
<p>(Look at the second to last word.)</p>
<p>A model with a 0.2% error rate is good enough to produce readable text. For
example, here’s what it produces for the image from the <a href="http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html">last post</a>:</p>
<p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br />
→ <code class="highlighter-rouge">Clinton Street, south from Livingston Street.</code>
<br />
<img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" />
<br />
→ <code class="highlighter-rouge">P. L. Sperr.</code>
<br />
<img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" />
<br />
→ <code class="highlighter-rouge">NO REPRODUCTIONS.</code>
<br />
<img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" />
<br />
→ <code class="highlighter-rouge">August 5, 1934.</code></p>
<p>i.e. it’s perfect. Here’s the output of the Neural Net for the last line:</p>
<div class="center">
<img src="/images/ocropus/new-network.png" width="699" height="216" />
</div>
<p>Compare that to what it was before:</p>
<div class="center">
<img src="/images/ocropus/network-labeled.png" width="699" height="216" />
</div>
<p>There’s still some ambiguity around <code class="highlighter-rouge">5</code>/<code class="highlighter-rouge">S</code>, but it makes the right call. The
<code class="highlighter-rouge">a</code> vs <code class="highlighter-rouge">s</code> error is completely gone.</p>
<h3 id="conclusions">Conclusions</h3>
<p>At this point the model is good enough. If I were to improve it further, I’d
either improve my <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">image cropper</a> or incorporate some kind of spell
checking as a post-processing step.</p>
<p>The behavior of the models as they’re trained is sometimes inscrutable.
Finding a good one involves a lot of trial and error. To avoid flailing,
measure your performance constantly and keep a list of ideas to explore.
“Train a model starting with the pre-built one” was item #6 on my list of
ideas and it took me a while to get around to trying it. But it was the
solution!</p>
<p>If you’re feeling lost or frustrated, go generate some more training
data. At least you’ll be doing something useful.</p>
<p>At the end of the day, I’m very happy with the OCR model I built. Ocropus has
some rough edges, but it’s simple enough that you can usually figure out
what’s going on and how to fix problems as they come up. And the results speak
for themselves!</p>
Extracting text from an image using Ocropus2015-01-09T22:45:00+00:00https://danvk.github.io/2015/01/09/extracting-text-from-an-image-using-ocropus<style>
.bordered {
border: 1px solid rgb(204, 204, 204);
}
</style>
<p>In the <a href="http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html">last post</a>, I described a way to crop an image down to just the
part containing text. The end product was something like this:</p>
<div class="center">
<img src="/images/ocropus/start.png" width="608" height="137" />
</div>
<p>In this post, I’ll explain how to extract text from images like these using the
<a href="https://github.com/tmbdev/ocropy">Ocropus</a> <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> library. Plain text has a number of advantages
over images of text: you can search it, it can be stored more compactly and
it can be reformatted to fit seamlessly into web UIs.</p>
<p>I don’t want to get too bogged down in the details of why I went with Ocropus
over its more famous cousin, <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, at least not in this post. The
gist is that I found it to be:</p>
<ol>
<li>more transparent about what it was doing.</li>
<li>more hackable</li>
<li>more robust to <a href="http://stackoverflow.com/questions/27592430/how-can-i-tell-tesseract-that-my-font-has-a-particular-size">character segmentation issues</a></li>
</ol>
<p>This post is a bit long, but there are lots of pictures to help you get through
it. Be strong!</p>
<h3 id="ocropus">Ocropus</h3>
<p>Ocropus (or Ocropy) is a collection of tools for extracting text from scanned
images. The basic pipeline looks like this:</p>
<div class="center">
<img src="/images/ocropus/ocropus-pipeline.png" width="304" height="341" />
</div>
<p>I’ll talk about each of these steps in this post. But first, we need to install
Ocropus!</p>
<h3 id="installation">Installation</h3>
<p>Ocropus uses the <a href="http://www.scipy.org/about.html">Scientific Python</a> stack. To run it, you’ll need
<code class="highlighter-rouge">scipy</code>, <code class="highlighter-rouge">PIL</code>, <code class="highlighter-rouge">numpy</code>, <code class="highlighter-rouge">OpenCV</code> and <code class="highlighter-rouge">matplotlib</code>. Setting this up is a bit of a pain, but you’ll only ever have to do it once (at least until you get a new computer).</p>
<p>On my Mac running Yosemite, I set up <a href="http://brew.sh">brew</a>, then ran:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install python
brew install opencv
brew install homebrew/python/scipy
</code></pre></div></div>
<p>To make this last step work, I had to follow the workaround described in <a href="https://github.com/Homebrew/homebrew/issues/16016#issuecomment-42912638">this
comment</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
rm cv.py cv2.so
ln <span class="nt">-s</span> /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv.py cv.py
ln <span class="nt">-s</span> /usr/local/Cellar/opencv/2.4.9/lib/python2.7/site-packages/cv2.so cv2.so
</code></pre></div></div>
<p>Then you can follow the instructions on the <a href="https://github.com/tmbdev/ocropy">ocropy site</a>. You’ll know
you have things working when you can run <code class="highlighter-rouge">ocropus-nlbin --help</code>.</p>
<h3 id="binarization">Binarization</h3>
<p>The first step in the Ocropus pipeline is <em>binarization</em>: the conversion of the
source image from grayscale to black and white.</p>
<p>There are many ways to do this, some of which you can read about
<a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i39">in this presentation</a>. Ocropus uses a form of <a href="http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html">adaptive
thresholding</a>, where the cutoff between light and dark can vary
throughout the image. This is important when working with scans from books,
where there can be variation in light level over the page.</p>
<p>Also lumped into this step is <em>skew estimation</em>, which tries to rotate the
image by small amounts so that the text is truly horizontal. This is done more
or less through brute force: Ocropy tries 32 different angles between +/-2°
and picks the one which maximizes the variance of the row sums. This works
because, when the image is perfectly aligned, there will be huge variance
between the rows with text and the blanks in between them. When the image is
rotated, these gaps are blended.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-nlbin <span class="nt">-n</span> 703662b.crop.png <span class="nt">-o</span> book
</code></pre></div></div>
<div class="center">
<img src="/images/ocropus/binarized.png" width="608" height="137" />
</div>
<p>The <code class="highlighter-rouge">-n</code> tells Ocropus to suppress page size checks. We’re giving it a small,
cropped image, rather than an image of a full page, so this is necessary.</p>
<p>This command produces two outputs:</p>
<ul>
<li><code class="highlighter-rouge">book/0001.bin.png</code>: binarized version of the first page (above)</li>
<li><code class="highlighter-rouge">book/0001.nrm.png</code>: a “flattened” version of the image, before binarization.
This isn’t very useful.</li>
</ul>
<p>(The Ocropus convention is to put all intermediate files in a <code class="highlighter-rouge">book</code> working directory.)</p>
<h3 id="segmentation">Segmentation</h3>
<p>The next step is to extract the individual lines of text from the image. Again,
there are many ways to do this, some of which you can read about in this
<a href="https://docs.google.com/presentation/d/1N1scoKZhmneH_qyLCjdVcAWKqqL65T3ahKrk2-1Tvcg/edit#slide=id.i151">presentation on segmentation</a>.</p>
<p>Ocropus first estimates the “scale” of your text. It does this by finding
connected components in the binarized image (these should mostly be individual
letters) and calculating the median of their dimensions. This corresponds to
something like the <a href="http://en.wikipedia.org/wiki/X-height">x-height</a> of your font.</p>
<p>Next it tries to find the individual lines of text. The sequence goes something
like this:</p>
<ol>
<li>It removes components which are too big or too small (according to <em>scale</em>).
These are unlikely to be letters.</li>
<li>It applies the <a href="http://www.cs.cornell.edu/courses/CS6670/2011sp/lectures/lec02_filter.pdf">y-derivative of a Gaussian kernel</a> (p. 42) to
detect top and bottom edges of the remaining features. It then blurs this
horizontally to blend the tops of letters on the same line together.</li>
<li>The bits between top and bottom edges are the lines.</li>
</ol>
<p>A picture helps explain this better. Here’s the result of step 2 (the edge
detector + horizontal blur):</p>
<div class="center">
<img src="/images/ocropus/edge-blur.png" width="608" height="137" />
</div>
<p>The white areas are the tops and the black areas are the bottoms.</p>
<p>Here’s the another view of the same thing:</p>
<div class="center">
<img src="/images/ocropus/edge-boxes.png" width="608" height="137" />
</div>
<p>Here the blue boxes are components in the binarized image (i.e. letters). The
wispy green areas are tops and the red areas are bottoms. I’d never seen a
Gaussian kernel used this way before: its derivative is an edge detector.</p>
<p>Here are the detected lines, formed by expanding the areas between tops and
bottoms:</p>
<div class="center">
<img src="/images/ocropus/line-components.png" width="608" height="137" />
</div>
<p>It’s interesting that the lines needn’t be simple rectangular regions. In fact,
the bottom two components have overlapping y-coordinates. Ocropus applies these
regions as masks before extracting rectangular lines:</p>
<p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br />
<img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" /><br />
<img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" /><br />
<img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /></p>
<p>Here’s the command I used (the <code class="highlighter-rouge">g</code> in <code class="highlighter-rouge">ocropus-gpageseg</code> stands for “gradient”):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-gpageseg <span class="nt">-n</span> <span class="nt">--maxcolseps</span> 0 book/0001.bin.png
</code></pre></div></div>
<p>The <code class="highlighter-rouge">--maxcolseps 0</code> tells Ocropus that there’s only one column in this image.
The <code class="highlighter-rouge">-n</code> suppresses size checks, as before.</p>
<p>This has five outputs:</p>
<ul>
<li><code class="highlighter-rouge">book/0001.pseg.png</code> encodes the segmentation. The color at each pixel
indicates which column and line that pixel in the original image belongs to.</li>
<li><code class="highlighter-rouge">book/0001/01000{1,2,3,4}.bin.png</code> are the extracted line images (above).</li>
</ul>
<h3 id="character-recognition">Character Recognition</h3>
<p>After all that prep work, we can finally get to the fun part: character
recognition using a <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Net</a>.</p>
<p>The problem is to perform this mapping:</p>
<div class="center">
<img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" />
→ <code>August 5, 1934.</code>
</div>
<p>This is challenging because each line will have its own quirks. Maybe
binarization produced a darker or lighter image for this line. Maybe skew
estimation didn’t work perfectly. Maybe the typewriter had a fresh ribbon and
produced thicker letters. Maybe the paper got water on it in storage.</p>
<p>Ocropus uses an <a href="http://en.wikipedia.org/wiki/Long_short_term_memory">LSTM Recurrent Neural Net</a> to learn this mapping. The
default model has 48 inputs, 200 nodes in a hidden layer and 249 outputs.</p>
<p>The inputs to the network are columns of pixels. The columns in the image are
fed into the network, one at a time, from left to right. The outputs are scores
for each possible letter. As the columns for the <code class="highlighter-rouge">A</code> in the image above are fed
into the net, we’d hope to see a spike from the <code class="highlighter-rouge">A</code> output.</p>
<p>Here’s what the output looks like:</p>
<div class="center">
<img src="/images/ocropus/network.png" width="700" height="330" />
</div>
<p>The image on the bottom is the output of the network. Columns in the text and
the output matrix correspond to one another. Each row in the output corresponds
to a different letter, reading alphabetically from top to bottom. Red means a
strong response, blue a weaker response. The red streak under the <code class="highlighter-rouge">A</code> is a
strong response in the <code class="highlighter-rouge">A</code> row.</p>
<p>The responses start somewhere around the middle to right half of each letter,
once the net has seen enough of it to be confident it’s a match. To extract a
transcription, you look for maxima going across the image.</p>
<p>In this case, the transcription is <code class="highlighter-rouge">Auguat S, 1934.</code>:</p>
<div class="center">
<img src="/images/ocropus/network-labeled.png" width="699" height="216" />
</div>
<p>It’s interesting to look at the letters that this model gets wrong. For
example, the <code class="highlighter-rouge">s</code> in <code class="highlighter-rouge">August</code> produces the strongest response on the <code class="highlighter-rouge">a</code> row.
But there’s also a (smaller) response on the correct <code class="highlighter-rouge">s</code> row. There’s also
considerable ambiguity around the <code class="highlighter-rouge">5</code>, which is transcribed as an <code class="highlighter-rouge">S</code>.</p>
<p>My #1 <a href="https://github.com/tmbdev/ocropy/issues/16">feature request</a> for Ocropus is for it to output more metadata
about the character calls. While there might not be enough information in the
image to make a clear call between <code class="highlighter-rouge">Auguat</code> and <code class="highlighter-rouge">August</code>, a post-processing
step with a dictionary would clearly prefer the latter.</p>
<p>The transcriptions with the default model are:</p>
<p><img class="bordered" src="/images/ocropus/line1.png" width="610" height="35" /><br />
→ <code class="highlighter-rouge">O1inton Street, aouth from LIYingston Street.</code>
<br />
<img class="bordered" src="/images/ocropus/line2.png" width="163" height="25" /> → <code class="highlighter-rouge">P. L. Sperr.</code>
<br />
<img class="bordered" src="/images/ocropus/line3.png" width="233" height="24" /> → <code class="highlighter-rouge">NO REPODUCTIONS.</code>
<br />
<img class="bordered" src="/images/ocropus/line4.png" width="208" height="27" /> → <code class="highlighter-rouge">Auguat S, 1934.</code></p>
<p>This is passable, but not great. The Ocropus site explains why:</p>
<blockquote>
<p>There are some things the currently trained models for ocropus-rpred will not
handle well, largely because they are nearly absent in the current training
data. That includes all-caps text, some special symbols (including “?”),
typewriter fonts, and subscripts/superscripts. This will be addressed in a
future release, and, of course, you are welcome to contribute new, trained
models.</p>
</blockquote>
<p>We’ll fix this in the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a> by training our own model.</p>
<p>The command to make predictions is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ocropus-rpred -m en-default.pyrnn.gz book/0001/*.png
</code></pre></div></div>
<p>I believe the <code class="highlighter-rouge">r</code> stands for “RNN” as in “Recurrent Neural Net”.</p>
<p>The outputs are <code class="highlighter-rouge">book/0001/01000{1,2,3,4}.txt</code>.</p>
<p>If you want to see charts like the one above, pass <code class="highlighter-rouge">--show</code> or <code class="highlighter-rouge">--save</code>.</p>
<h3 id="extracting-the-text">Extracting the text</h3>
<p>We’re on the home stretch!</p>
<p>One way to get a text file out of Ocropus is to concatenate all the transcribed
text files:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat book/????/??????.txt &gt; ocr.txt
</code></pre></div></div>
<p>The files are all in alphabetical order, so this should do the right thing.</p>
<p>In practice, I found that I often disagreed with the line order that Ocropus
chose. For example, I’d say that <code class="highlighter-rouge">August 5, 1934.</code> is the second line of the
image we’ve been working with, not the fourth.</p>
<p>Ocropus comes with an <code class="highlighter-rouge">ocropus-hocr</code> tool which converts its output to <a href="http://en.wikipedia.org/wiki/HOCR">hOCR
format</a>, an HTML-based format designed by Thomas Breuel, who also developed
Ocropus.</p>
<p>We can use it to get bounding boxes for each text box:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ocropus-hocr -o book/book.html book/0001.bin.png
$ cat book/book.html
...
&lt;div class='ocr_page' title='file book/0001.bin.png'&gt;
&lt;span class='ocr_line' title='bbox 3 104 607 133'&gt;O1inton Street, aouth from LIYingston Street.&lt;/span&gt;&lt;br /&gt;
&lt;span class='ocr_line' title='bbox 3 22 160 41'&gt;P. L. Sperr.&lt;/span&gt;&lt;br /&gt;
&lt;span class='ocr_line' title='bbox 1 1 228 19'&gt;NO REPODUCTIONS.&lt;/span&gt;&lt;br /&gt;
&lt;span class='ocr_line' title='bbox 377 67 579 88'&gt;Auguat S, 1934.&lt;/span&gt;&lt;br /&gt;
&lt;/div&gt;
...
</code></pre></div></div>
<p>Ocropus tends to read text more left to right than top to bottom. Since I know
my images only have one column of text, I’d prefer to emphasize the top-down
order. I wrote a <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/extract_ocropy_text.py">small tool</a> to reorder the text in the way I wanted.</p>
<h3 id="conclusions">Conclusions</h3>
<p>Congrats on making it this far! We’ve walked through the steps of running the
Ocropus pipeline.</p>
<p>The overall results aren’t good (~10% of characters are incorrect), at least
not yet. In the <a href="http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html">next post</a>, I’ll show how to train a new LSTM model that
completely destroys this problem.</p>
Finding blocks of text in an image using Python, OpenCV and numpy2015-01-07T00:00:00+00:00https://danvk.github.io/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy<p>As part of an <a href="http://www.danvk.org/wp/2013-02-09/finding-pictures-in-pictures/">ongoing project</a> with the New York Public Library, I’ve been
attempting to <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> the text on the back of the <a href="http://digitalgallery.nypl.org/nypldigital/dgdivisionbrowseresult.cfm?div_id=hh">Milstein Collection</a>
images. Here’s what they look like:</p>
<div class="center">
<img src="/images/milstein-backing.jpg" width="332" height="512" />
</div>
<p>A few things to note:</p>
<ul>
<li>There’s a black border around the whole image, gray backing paper and then white paper with text on it.</li>
<li>Only a small portion of the image contains text.</li>
<li>The text is written with a tyepwriter, so it’s monospace. But the typewriter font isn’t <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=397983&amp;imageID=708193b&amp;total=683&amp;num=160&amp;word=4002&amp;s=1&amp;notword=&amp;d=&amp;c=&amp;f=13&amp;k=3&amp;lWord=&amp;lField=&amp;sScope=Name&amp;sLevel=&amp;sLabel=Brown%20Brothers%20%28New%20York%29&amp;sort=&amp;imgs=20&amp;pos=163&amp;e=w&amp;cdonum=0">always consistent</a> across the collection. Sometimes a <a href="http://digitalgallery.nypl.org/nypldigital/dgkeysearchdetail.cfm?trg=1&amp;strucID=362388&amp;imageID=701471b&amp;total=534&amp;num=180&amp;word=Pumps&amp;s=3&amp;notword=&amp;d=&amp;c=&amp;f=2&amp;k=1&amp;lWord=&amp;lField=&amp;sScope=&amp;sLevel=&amp;sLabel=&amp;sort=&amp;imgs=20&amp;pos=184&amp;e=w&amp;cdonum=0">single image</a> has two fonts!</li>
<li>The image is slightly rotated from vertical.</li>
<li>The images are ~4x the resolution shown here (2048px tall)</li>
<li>There are ~34,000 images: too many to affordably <a href="https://www.mturk.com/mturk/welcome">turk</a>.</li>
</ul>
<p>OCR programs typically have to do some sort of <a href="http://en.wikipedia.org/wiki/Document_layout_analysis">page-layout analysis</a> to
find out where the text is and carve it up into individual lines and
characters. When you hear “OCR”, you might think about fancy Machine Learning
techniques like <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural Nets</a>. But it’s a dirty secret of the trade that
page layout analysis, a much less glamorous problem, is at least as important
in getting good results.</p>
<p>The most famous OCR program is <a href="https://code.google.com/p/tesseract-ocr/">Tesseract</a>, a remarkably long-lived open
source project developed over the past 20+ years at HP and Google. I quickly
noticed that it performed much better on the Milstein images when I manually
cropped them down to just the text regions first:</p>
<div class="center">
<img src="/images/cropping.png" width="620" height="512" />
</div>
<p>So I set out to write an image cropper: a program that could automatically find
the green rectangle in the image above. This turned out to be surprisingly
hard!</p>
<p><a href="http://en.wikipedia.org/wiki/Computer_vision">Computer Vision</a> problems like this one are difficult because they’re so
incredibly easy for humans. When you looked at the image above, you could
immediately isolate the text region. This happened instantaneously, and you’ll
never be able to break down exactly how you did it.</p>
<p>The best we can do is come up with ways of breaking down the problem in terms
of operations that are simple for computers. The rest of this post lays out a
way I found to do this.</p>
<p>First off, I applied the <a href="http://en.wikipedia.org/wiki/Canny_edge_detector">canny edge detector</a> to the image. This produces
white pixels wherever there’s an edge in the original image. It yields
something like this:</p>
<div class="center">
<img src="/images/edges.png" width="332" height="512" />
</div>
<p>This removes most of the background noise from the image and turns the text
regions into bright clumps of edges. It turns the borders into long, crisp
lines.</p>
<p>The sources of edges in the image are the borders and the text. To zero in on
the text, it’s going to be necessary to eliminate the borders.</p>
<p>One really effective way to do this is with a <a href="http://scikit-image.org/docs/dev/auto_examples/applications/plot_rank_filters.html">rank filter</a>. This
essentially replaces a pixel with something like the median of the pixels
to its left and right. The text areas have lots of white pixels, but the
borders consist of just a thin, 1 pixel line. The areas around the borders will
be mostly black, so the rank filter will eliminate them. Here’s what the image
looks like after applying a vertical and horizontal rank filter:</p>
<div class="center">
<img src="/images/edges-rankfilter.png" width="332" height="512" />
</div>
<p>The borders are gone but the text is still there! Success!</p>
<p>While this is effective, it still leaves bits of text <em>outside</em> the borders
(look at the top left and bottom right). That may be fine for some
applications, but I wanted to eliminate these because they’re typically
uninteresting and can confuse later operations. So instead of applying the
rank filter, I found the <a href="http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_contours/py_contours_begin/py_contours_begin.html">contours</a> in the edge image. These are sets of
white pixels which are connected to one another. The border contours are easy
to pick out: they’re the ones whose <a href="http://en.wikipedia.org/wiki/Minimum_bounding_box">bounding box</a> covers a large
fraction of the image:</p>
<div class="center">
<img src="/images/edges-bordercontour.png" width="332" height="512" />
</div>
<p>With polygons for the borders, it’s easy to black out everything outside them.</p>
<p>What we’re left with is an image with the text and possibly some other bits due
to smudges or marks on the original page.</p>
<p>At this point, we’re looking for a crop <code class="highlighter-rouge">(x1, y1, x2, y2)</code> which:</p>
<ol>
<li>maximizes the number of white pixels inside it and</li>
<li>is as small as possible.</li>
</ol>
<p>These two goals are in opposition to one another. If we took the entire image,
we’d cover all the white pixels. But we’d completely fail on goal #2: the crop
would be unnecessarily large. This should sound familiar: it’s a classic
<a href="http://en.wikipedia.org/wiki/Precision_and_recall">precision/recall tradeoff</a>:</p>
<ul>
<li>The recall is the fraction of white pixels inside the cropping rectangle.</li>
<li>The precision is the fraction of the image outside the cropping rectangle.</li>
</ul>
<p>A fairly standard way to solve precision/recall problems is to optimize the <a href="http://en.wikipedia.org/wiki/F1_score">F1
score</a>, the harmonic mean of precision and recall. This is what we’ll try
to do.</p>
<p>The set of all possible crops is quite large: <em>W</em><sup>2</sup><em>H</em><sup>2</sup>,
where <em>W</em> and <em>H</em> are the width and height of the image. For a 1300x2000 image,
that’s about 7 trillion possibilities!</p>
<p>The saving grace is that most crops don’t make much sense. We can simplify the
problem by finding individual chunks of text. To do this, we apply <a href="http://homepages.inf.ed.ac.uk/rbf/HIPR2/dilate.htm">binary
dilation</a> to the de-bordered edge image. This “bleeds” the white pixels
into one another. We do this repeatedly until there are only a few connected
components. Here’s what it looks like:</p>
<div class="center">
<img src="/images/edges-dilated.png" width="560" height="373" />
</div>
<p>As we hoped, the text areas have all bled into just a few components. There are
five connected components in this image. The white blip in the top right
corresponds to the “Q” in the original image.</p>
<p>By including some of these components and rejecting others, we can form good
candidate crops. Now we’ve got a <a href="http://en.wikipedia.org/wiki/Subset_sum_problem">subset sum problem</a>: which subset of
components produces a crop which maximizes the F1 score?</p>
<p>There are 2<sup><em>N</em></sup> possible combinations of subsets to examine. In
practice, though, I found that a greedy approach worked well: order the
components by the number of white pixels they contain (in the original image).
Keep adding components while it increases the F1 score. When nothing improves
the score, you’re done!</p>
<p>Here’s what that procedure produces for this image:</p>
<div class="center">
<img src="/images/edges-chosen.png" width="560" height="373" />
</div>
<p>The components are ordered as described above. Component #1 contains the most
white pixels in the original image. The first four components are accepted and
the fifth is rejected because it hurts the F1 score:</p>
<ol>
<li>Accept #1, F1 Score → 0.886</li>
<li>Accept #2, F1 Score → 0.931</li>
<li>Accept #3, F1 Score → 0.949</li>
<li>Accept #4, F1 Score → 0.959</li>
<li>Reject #5 (F1 Score → 0.888)</li>
</ol>
<p>Applying this crop to the original image, you get this:</p>
<div class="center">
<img src="/images/milstein-cropped.jpg" width="875" height="233" />
</div>
<p>That’s 875x233, whereas the original was 1328x2048. That’s a <strong>92.5% decrease in
the number of pixels, with no loss of text</strong>! This will help any OCR tool focus
on what’s important, rather than the noise. It will also make OCR run faster,
since it can work with smaller images.</p>
<p>This procedure worked well for my particular application. Depending on how you
count, I’d estimate that it gets a perfect crop on about 98% of the images, and
its errors are all relatively minor.</p>
<p>If you want to try using this procedure to crop your own images, you can find
the <a href="https://github.com/danvk/oldnyc/blob/master/ocr/tess/crop_morphology.py">source code here</a>. You’ll need to install <code class="highlighter-rouge">OpenCV</code>, <code class="highlighter-rouge">numpy</code> and <code class="highlighter-rouge">PIL</code>
to make it work.</p>
<p>I tried several other approaches which didn’t work as well. Here are some
highlights:</p>
<ul>
<li>
<p>I ran the image through Tesseract to find areas which contained letters.
These should be the areas that we crop to! But this is a bit of a chicken
and the egg problem. For some images, Tesseract misses the text completely.
Cropping fixes the problem. But we were trying to find a crop in the first
place!</p>
</li>
<li>
<p>I tried running the images through <a href="https://www.flameeyes.eu/projects/unpaper">unpaper</a> first, to remove noise and
borders. But this only worked some of the time and I found unpaper’s
interface to be quite opaque and hard to tweak.</p>
</li>
<li>
<p>I ran canny, then calculated row and column sums to optimize the x- &amp;
y-coordinates of the crop independently. The text regions did show up clearly
in charts of the row sums:
<img src="/images/milstein-rowsums.png" width="379" height="258" /><br />
The four spikes are the tops and bottoms of the two borders. The broad
elevated region in the middle is the text. Making this more precise turned
out to be hard. You lose a lot of structure when you collapse a
dimension—this problem turned out to be easier to solve as a single 2D
problem than as two 1D problems.</p>
</li>
</ul>
<p>In conclusion, I found this to be a surprisingly tricky problem, but I’m happy
with the solution I worked out.</p>
<p>In the next post, I’ll talk about my experience running OCR tools over these
cropped images.</p>
Choosing an iOS Podcasting App2014-12-29T00:00:00+00:00https://danvk.github.io/2014/12/29/choosing-an-ios-podcasting-app<p>I recently switched back to iOS after a few years using Android. One aspect of
Android that always bothered me was that I had trouble finding a great
Podcasting app. I’d happily used <a href="http://www.macstories.net/reviews/instacast-3-review/">Instacast</a> on iOS, but I couldn’t find
anything quite like it for Android.</p>
<p>I eventually settled on <a href="https://play.google.com/store/apps/details?id=com.bambuna.podcastaddict&amp;hl=en">Podcast Addict</a>. It epitomizes a stereotype of
Android apps: tons of features, gajillions of options and a UI that was clearly
designed by an engineer.</p>
<p>After switching back to iOS, I had a surprisingly hard time finding a
Podcasting app which could do everything I wanted. For me, the hard
requirements were:</p>
<ol>
<li>A view of podcasts ordered by most-recently updated</li>
<li>Stream by default—don’t download episodes unless I ask!</li>
<li>An option to disable play-through.</li>
</ol>
<p>In the last few years, <strong>Instacast</strong> received a <a href="http://vemedio.com/blog/posts/instacast-5-available-today">major update</a>. This introduced a new player interface which doesn’t work on the iOS lock screen. As a result, you can’t use the play/pause/volume features on your earbuds!</p>
<p>It also lacks the ability to sort podcasts by most-recently-updated. I subscribe to a mix of podcasts that update <a href="http://www.npr.org/rss/podcast/podcast_detail.php?siteId=5183215">frequently</a> and <a href="http://www.dancarlin.com/hardcore-history-series/">infrequently</a>, so I find this to be a far more useful view than a complete list of all episodes. In particular, I don’t want an hourly podcast to crowd out less frequently updated shows.</p>
<p>So I set out to find a new Podcasting app. I really wish the iOS App Store had a try-before-you-buy option. Order-by-date is a fairly obscure feature, and it was impossible to determine if an app supported it without buying it.</p>
<p>Since I spent about $10 trying out apps and deleting them, here’s my guide to spare you that chore!</p>
<h4 id="instacast">Instacast</h4>
<p>As I mentioned, it’s missing the order by most recently updated feature and the ability to control playback via the lock screen.</p>
<h4 id="overcast">Overcast</h4>
<p>Overcast is the best looking of the bunch. I appreciate its freemium model and functional shoutouts to alternative apps—they made moving my list of feeds between apps painless.</p>
<p>That being said, I had a few problems with Overcast:</p>
<ul>
<li>It doesn’t support streaming</li>
<li>Disabling play-through is a paid feature (which is fine, but it’s worth noting!)</li>
<li>It makes a distinction between all episodes and unplayed episodes for each Podcast which I don’t find to be very helpful. I’m not a completist—I tend to pick and choose episodes.</li>
<li>It’s very aggressive about downloading new episodes!</li>
</ul>
<h4 id="podcasts-built-in-app">Podcasts (built-in app)</h4>
<p>Podcasts is a barebones app for listening to Podcasts. I couldn’t tell if it supported streaming (I don’t think it does) and it definitely doesn’t support showing your podcasts ordered by when they were last updated. Next!</p>
<h4 id="podcruncher">PodCruncher</h4>
<p>This app came up when I searched for Downcast on the App store and looked promising. Unfortunately, it hasn’t been updated for the new “flat” look in iOS 7 and it doesn’t support the larger screen on the iPhone 6. It also lacks the most-recently-updated view.</p>
<h4 id="downcast">Downcast</h4>
<p>Finally I found <a href="http://www.downcastapp.com/">Downcast</a>, which meets all my Podcasting needs. It lets you order your podcasts by most-recently updated, though the option is tricky to find (Edit→Sort→Publication Date):</p>
<p><img src="/images/sort-by-mru.png" width="375" height="327" alt="Sort by Publication Date option" /></p>
<p>And there’s an option to disable play-through on the player screen, if you can recognize it:</p>
<p><img src="/images/disable-playthrough.png" width="375" height="172" alt="Option to disable playthrough" /></p>
<p>Next on my list to try were Pocket Casts and Castro, but I stopped when I found that I was happy with Downcast.</p>
<p>After all this, I’m coming to appreciate that everyone has different ways of using their podcasting app, which makes designing them hard. I don’t care at all about Smart Playlists (which many apps emphasize) and I care about slightly obscure features like ordering my podcasts by when they were updated. This is a case of <a href="http://prog21.dadgum.com/160.html">dangling by a trivial feature</a>.</p>
JavaScript String slice, substr, substring: which to use?2014-11-17T00:00:00+00:00https://danvk.github.io/2014/11/17/javascript-string-splice-substr-substring-which-to-use<p>I recently read Douglas Crockford’s <a href="http://www.amazon.com/JavaScript-Good-Parts-Douglas-Crockford/dp/0596517742"><em>JavaScript: The Good Parts</em></a>. It’s a
classic (published in 2008) which is credited with reviving respect for
JavaScript as a programming language. Given its title, it’s also famously short.</p>
<p>One very specific thing it cleared up for me was what to do with all of
JavaScript’s various substring methods:</p>
<ul>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr">String.prototype.substr</a>(start[, length])</li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring">String.prototype.substring</a>(start[, stop])</li>
</ul>
<p>One method takes an offset and a length. The other takes two offsets. The names
don’t reflect this distinction in any way, so they’re impossible to remember.
I resorted to writing an <a href="http://support.alfredapp.com/tutorials:clipboard-snippets">Alfred Snippet</a> with the syntaxes, to quickly look
it up (since I always had to).</p>
<p>They also have some other differences in behavior: <code class="highlighter-rouge">substring</code> doesn’t allow
its offset to be negative, but <code class="highlighter-rouge">substr</code> does. This is arbitrary. It’s
impossible to remember because there’s no rhyme or reason to it.</p>
<p>Crockford cleared this all up. The solution is to never use <em>either</em> method!</p>
<p>Instead, you should use the <code class="highlighter-rouge">slice</code> method:</p>
<ul>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/slice">String.prototype.slice</a>(start[, stop])</li>
</ul>
<p>This takes two offsets. Either can be negative. And it’s the exact same as the
corresponding <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice">Array <code class="highlighter-rouge">slice</code></a> method.</p>
<p>So: stop using <code class="highlighter-rouge">substr</code> and <code class="highlighter-rouge">substring</code>. Use <code class="highlighter-rouge">slice</code>!</p>
GitHub integration and Image Diff improvements headline webdiff 0.82014-11-07T00:00:00+00:00https://danvk.github.io/2014/11/07/github-integration-and-image-diff-improvements-headline-webdiff-0-8<p>I’ve released <a href="https://pypi.python.org/pypi/webdiff/0.8.0">webdiff 0.8.0</a>, which you can install via:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install --upgrade webdiff
</code></pre></div></div>
<p>The most interesting new features are GitHub pull request integration and expanded image diffing modes.</p>
<p>You can view a GitHub Pull Request in webdiff by running something like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>webdiff https://github.com/hammerlab/cycledash/pull/175
</code></pre></div></div>
<p>Any github Pull Request URL will do. This will pull down the files from GitHub
to local disk and then diff them in the standard webdiff UI. My main use case
for this is looking at screenshot diffs and thinking “I want to see bigger images in this PR diff”. Speaking of which…</p>
<p><img src="/images/webdiff-swipe.gif" width="700" height="451" /></p>
<p>This version includes a few improvements to the image diff mode:</p>
<ol>
<li>A “shrink to fit” option, which is enabled by default. This shrinks large images to fit in your browser window.</li>
<li>Consistent use of red/green borders for before/after images.</li>
<li>“Onion Skin” diff mode, which fades one image into the other.</li>
<li>“Swipe” diff mode, which lets you lets you drag a dividing line between the images.</li>
</ol>
<p>These are based on <a href="https://github.com/blog/817-behold-image-view-modes">GitHub’s image view modes</a>. I still find “blink” to be
the most helpful for spotting small changes, but now you’ve got choices!</p>
<p>There were a few smaller changes as well. Full release notes are <a href="https://pypi.python.org/pypi/webdiff/0.8.0">on PyPI</a>.</p>
Life after Google, Six Months In2014-10-31T00:00:00+00:00https://danvk.github.io/2014/10/31/life-after-google-six-months-in<p>It’s been almost exactly six months since I ended an eight year run at Google.
One of the biggest reasons to do this was to come back up to speed with the
open source ecosystem and to experience a different working environment (sample size 1→2!).</p>
<p>When I joined Google, it seemed like our tech stack was years ahead of anything
else. This included tools like <a href="http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html">BUILD files</a> for managing dependencies and
compilation, <a href="http://www.quora.com/What-is-Borg-at-Google">borg</a> for running jobs in a data center, and <a href="https://developers.google.com/closure/compiler/">closure
compiler</a> for dealing with JavaScript’s idiosyncracies.</p>
<p>The open source world has come a long way since 2006. It’s solved many of the
same problems that Google did, but in different ways. There’s the hadoop stack
for distributed work. And there are polyfills and CommonJS for JavaScript.
These aren’t necessarily better or worse than Google’s solutions, but they are
different. And they’re the tools that new developers are learning.</p>
<p>Long term, this is a big problem for Google. “Ahead of the field” can rapidly
turn into “Not Invented Here”. “Better” can become merely “different”.</p>
<p>A simple example of this is <a href="http://docs.closure-library.googlecode.com/git/namespace_goog.html"><code class="highlighter-rouge">goog.bind</code></a>. It works around some confusing
behavior involving <code class="highlighter-rouge">this</code> in JavaScript. It was a great tool in 2004. But
modern JavaScript has its own solution (<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/bind">Function.prototype.bind</a>). It’s
been around for years and is supported in 90+% of browsers. Google will still
be writing <code class="highlighter-rouge">goog.bind</code> long after IE8 ceases to be relevant.</p>
<p>Facebook has a better solution to this: they always use the latest version of
the JavaScript standard, and then <a href="https://github.com/facebook/jstransform">transpile</a> to something that older
browsers will understand. This way they stay on the mainstream of technological
development.</p>
<p>Here are a few other things I’ve found notable:</p>
<ul>
<li>
<p><strong>Almost any Google technology has an open-source equivalent.</strong><br />
Examples include Travis-CI (similar to TAP), the Hadoop stack (MapReduce,
CNS, Dremel, …), CommonJS (Closure modules).</p>
</li>
<li>
<p><strong>Package managers</strong><br />
For most Google engineers, something like 95+% of the code you work with is
first-party (i.e. written at Google). For a smaller group, this ratio is
going to be dramatically different. As a result, third party package managers
are much more important. And the good news is that they’ve gotten much better
over the last eight years! Leaving Google has finally forced me to learn how
to use tools like NPM and pip. I’ve found this incredibly empowering. I used
to avoid external dependencies for personal projects. I suspect many Google
engineers do the same.</p>
</li>
<li>
<p><strong>Markdown is more pervasive than I’d realized</strong><br />
I was vaguely aware of <a href="http://daringfireball.net/projects/markdown/">Markdown</a> while I was at Google, but didn’t really see
the point. Now that I’m out, I’ve been surprised to see how pervasive it is.
I suspect that much of this comes from GitHub, where you use Markdown for
READMEs, issues and code review comments. You also use it with GitHub pages,
so I’m typing in it right now! It’s worthwhile to learn Markdown a bit
better. It’s like HTML with infinitely less boilerplate.</p>
</li>
<li>
<p><strong>Knowledge of git</strong><br />
It’s completely reasonable for someone with ten years of experience at Google
to never have run git. I’m happy I was involved enough with open source
projects that I have years of basic experience with it. But others have
clearly gone much deeper. My git skills have gotten much better since leaving
Google. I know how to use <code class="highlighter-rouge">git rebase</code> now!</p>
</li>
<li>
<p><strong>sysadmin knowledge</strong><br />
borg meant that I never had to learn any systems administration. It’s a
particular blind spot for me. For example, <a href="http://upstart.ubuntu.com/">upstart</a> was released over eight
years ago, just after I joined Google. It’s incredibly widely used. I’d never
heard of it.</p>
</li>
<li>
<p><strong>Google uses a lot of email</strong><br />
We use almost no email in my new group. HipChat takes the place of a group
mailing list and your coworkers @mention you when they want you to chime in.
The chat format keeps things short. I still get emails for code reviews, but
I could imagine this going through GitHub notifications instead.</p>
</li>
<li>
<p><strong>Buying your own lunch isn’t that big a deal.</strong><br />
As it turns out, there are people in NYC who will make you food in exchange
for money, or even deliver it! The variety is much greater than what you get
at a Google Cafe, and it’s nice to have a good reason to go outside during
the day. I also like the more flexible eating schedule. Almost everyone in my
office eats lunch very late, perhaps at 3 or 4.</p>
</li>
</ul>
Fully Migrated to GitHub Pages2014-10-23T00:00:00+00:00https://danvk.github.io/2014/10/23/fully-migrated-to-github-pages<p><img src="/images/jekyll.png" alt="Jekyll Logo" width="249" height="115" style="float:right" /></p>
<p>My danvk.org site is now fully hosted on GitHub pages. I changed the DNS entry last night.</p>
<p>My hope was to do this without breaking anything. That didn’t prove to be
possible, but I came close. And overall, the process wasn’t too bad! It was
helpful to make a <a href="https://github.com/danvk/danvk.github.io/issues/1">census of material</a> that was on my old site using access
logs. This turned up a few redirects I wouldn’t have thought of, and also
reminded me of the many features my old site accumulated over the years. Some
of these are now accessible under the “Features” menu on the new site.</p>
<p>There were a few pain points:</p>
<ul>
<li>
<p><strong>Redirects</strong></p>
<p>It would have been enormously helpful if GitHub pages supported something
like <a href="https://en.wikipedia.org/wiki/Rewrite_engine">mod_rewrite</a>. As it was I had to kill a few old links because I
was <a href="http://stackoverflow.com/questions/26374707/can-jekyll-serve-content-based-on-a-url-parameter">completely unable to generate 301/302 redirects</a>. I wound up
<a href="https://github.com/danvk/danvk.github.io/commit/f99fa0d6ef808a2ba468587d3f7eab800d448f1e">hard-coding JavaScript redirects</a> instead. It’s not ideal, and I’ll
probably lose some pagerank, but it’s the best I could do.</p>
</li>
<li>
<p><strong>Migrating the domain</strong></p>
<p>I really hate DNS. It’s impossible to know whether your site isn’t working
because you’ve misconfigured your DNS, or because the new records haven’t
propagated out yet. I’m surprised more sites don’t go down because of DNS
problems. The <a href="https://www.whatsmydns.net/">Global DNS Propagation Tracker</a> was indispensible as a
sanity check.</p>
</li>
</ul>
<p>One thing that worked really well was migrating my old WordPress blog. I’d
expected this to be a complete pain, but it was nothing of the sort. I used
<a href="http://www.httrack.com/">httrack</a> to mirror the rendered version of my blog
site to a folder on local disk. Then I checked that folder into GitHub. Done.</p>
<p>Finally, I learned a lot about <a href="http://jekyllrb.com/">Jekyll</a> from this process. It really is a
static content generator. It’s not a serving system. This is why it can’t do
things like 301/302 redirects. GitHub pages would be a much more powerful
serving system if it included support for things like mod_rewrite. But then it
would be less Jekyll-y. The beauty of the system is that it’s pure static
content, and hence insanely fast and simple.</p>
Filtering JSON with pyjsonselect and jss2014-10-13T00:00:00+00:00https://danvk.github.io/2014/10/13/filtering-json-with-pyjsonselect-and-jss<p>The data store for <a href="http://www.comparea.org/">Comparea</a> is a
<a href="https://github.com/danvk/comparea/blob/master/comparea/static/data/comparea.geo.json">giant</a>
23MB <a href="http://geojson.org/">GeoJSON</a> file. Most of the space in that file is
taken up by the giant lists of coordinates which define the boundaries of each
shape. But there’s also some interesting metadata hidden amongst all those
latitudes and longitudes:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="p">,</span><span class="w">
</span><span class="s2">"coordinates"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mf">-69.89912109375001</span><span class="p">,</span><span class="w">
</span><span class="mf">12.452001953124963</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mf">-69.89570312500004</span><span class="p">,</span><span class="w">
</span><span class="mf">12.422998046875009</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w">
</span><span class="s2">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w">
</span><span class="s2">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w">
</span><span class="s2">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w">
</span><span class="s2">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w">
</span><span class="s2">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w">
</span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>I’d hoped that I could use <a href="http://stedolan.github.io/jq/">jq</a> to filter out
all the coordinates and just look at the metadata. But I got bogged down
reading through its extensive <a href="http://stedolan.github.io/jq/manual/">manual</a>.
At the end of the day, I didn’t want to learn an ad-hoc language just for
filtering JSON files.</p>
<p>The maddening thing was that there’s already a great language for selecting
elements in trees: <strong>CSS Selectors</strong>! I did some searching and learned that there’s
already a standard for applying CSS-like selectors to JSON called
<a href="http://jsonselect.org/#overview">JSONSelect</a>. It dates from 2011. It has a
spec and <a href="https://github.com/lloyd/JSONSelectTests">conformance tests</a>, and
it’s been implemented in a <a href="http://jsonselect.org/#code">number of languages</a>.</p>
<p>So I picked my language of choice (Python) and began implementing a new command
line tool for filtering JSON files.</p>
<p>The first issue I ran into: the standard <a href="https://github.com/mwhooker/jsonselect">Python
implementation</a> didn’t conform to the
standard! It only implemented 2/3 levels of CSS selectors from the spec, and
many of the interesting selectors are in level 3.</p>
<p>The reference <a href="https://github.com/lloyd/JSONSelect/blob/master/src/jsonselect.js">JavaScript implementation</a>
was only 572 lines of code and, with all those tests, I figured it wouldn’t be
too hard to port it directly to Python. This was a fun project—there’s
something very zen about coding against a spec, getting test after test to
pass. I learned about a few nuances of JavaScript and Python by doing this:</p>
<ul>
<li>Their regular expressions differ in how they specify <a href="http://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters">unicode
ranges</a></li>
<li>the reference implementation made use of the <code class="highlighter-rouge">null</code> vs. <code class="highlighter-rouge">undefined</code>
distinction</li>
<li>JavaScript’s <code class="highlighter-rouge">typeof</code> function is quite odd</li>
<li>JavaScript’s <code class="highlighter-rouge">Array.prototype.concat</code> method is quite subtle in its behavior</li>
</ul>
<p>I wound up re-implementing all of these quirks these in Python.</p>
<p>At the end of the day, I published
<a href="https://github.com/danvk/pyjsonselect/">pyjsonselect</a>, the first
fully-conformant JSONSelect implementation in Python. A small win for the open
source world!</p>
<h2 id="jss">jss</h2>
<p>So, how does the tool work? You can read about installation and basic usage <a href="https://github.com/danvk/jss">on
github</a>, but here are a few motivating examples.</p>
<p>jss is a JSON→JSON converter. It supports three modes:</p>
<ol>
<li>select: find all the values that match a selector (1→N)</li>
<li>filter out (<code class="highlighter-rouge">-v</code>): remove all values which match a selector (1→1)</li>
<li>filter in (<code class="highlighter-rouge">-k</code>): keep only values which match a selector (1→1)</li>
</ol>
<p>Here’s how the filter out mode works:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss -v '.coordinates' comparea.geo.json
</code></pre></div></div>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w">
</span><span class="s2">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba is an island."</span><span class="p">,</span><span class="w">
</span><span class="s2">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/Aruba"</span><span class="p">,</span><span class="w">
</span><span class="s2">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mf">154.67007756254557</span><span class="p">,</span><span class="w">
</span><span class="s2">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">103065</span><span class="p">,</span><span class="w">
</span><span class="s2">"population_year"</span><span class="p">:</span><span class="w"> </span><span class="s2">"???"</span><span class="p">,</span><span class="w">
</span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Aruba"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ABW"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>That knocked out all the <code class="highlighter-rouge">coordinates</code> keys from the GeoJSON file!</p>
<p>I eventually did figure out how to do this in jq. Here’s what it looks like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jq 'del(..|.coordinates?| select(. != []))' comparea.geo.json
(same output)
</code></pre></div></div>
<p>To come up with that incantation, I had to dig through jq’s <a href="https://github.com/stedolan/jq/issues/319">github
issues</a>. It’s certainly not
something I could re-type from memory! The <code class="highlighter-rouge">jss</code> version is clear as could be.</p>
<p>It’s also significantly faster. For the 23MB <code class="highlighter-rouge">comparea.geo.json</code> file, the
<code class="highlighter-rouge">jss</code> command runs in 1.7s on my laptop vs. 12.9s for <code class="highlighter-rouge">jq</code>. The trick to this
speed is <a href="http://stackoverflow.com/questions/26221309/preorder-traversal-using-python-generators-with-a-mechanism-to-ignore-subtrees">appropriate
pruning</a>
of the selector search.</p>
<p>Here’s how the “select” mode works:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss '.name' comparea.geo.json
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Aruba"
"Afghanistan"
"Angola"
"Anguilla"
"Albania"
"Andorra"
"United Arab Emirates"
"Argentina"
"Armenia"
"American Samoa"
...
</code></pre></div></div>
<p>Unlike “filter out”, which maps one JSON object to another JSON object,
“select” extracts multiple values from a single object. Each line of output is
its own JSON object. This is why it’s 1→N, vs 1→1 for the other modes. It’s
useful if you want to do more processing using grep, sed and other familiar
line-oriented tools.</p>
<h2 id="fancy-selectors">Fancy selectors</h2>
<p>You can specify as operations as you like. Here’s a more complex invocation:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jss -v .coordinates -k '.features&gt;*:has(:contains("ZAF"))' comparea.geo.json
</code></pre></div></div>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"features"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="s2">"geometry"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Polygon"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Feature"</span><span class="p">,</span><span class="w">
</span><span class="s2">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAX"</span><span class="p">,</span><span class="w">
</span><span class="s2">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa, officially the Republic of South Africa, is a country located at the southern tip of Africa. It has 2,798 kilometres of coastline that stretches along the South Atlantic and Indian oceans."</span><span class="p">,</span><span class="w">
</span><span class="s2">"population_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w">
</span><span class="s2">"sov_a3"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ZAF"</span><span class="p">,</span><span class="w">
</span><span class="s2">"freebase_mid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/m/0hzlz"</span><span class="p">,</span><span class="w">
</span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"South Africa"</span><span class="p">,</span><span class="w">
</span><span class="s2">"population_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2119.html"</span><span class="p">,</span><span class="w">
</span><span class="s2">"area_km2_source_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"https://www.cia.gov/library/publications/the-world-factbook/fields/2147.html"</span><span class="p">,</span><span class="w">
</span><span class="s2">"population_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"July 2014"</span><span class="p">,</span><span class="w">
</span><span class="s2">"wikipedia_url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"http://en.wikipedia.org/wiki/South_Africa"</span><span class="p">,</span><span class="w">
</span><span class="s2">"area_km2"</span><span class="p">:</span><span class="w"> </span><span class="mi">1214470</span><span class="p">,</span><span class="w">
</span><span class="s2">"area_km2_source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"World Factbook"</span><span class="p">,</span><span class="w">
</span><span class="s2">"population"</span><span class="p">:</span><span class="w"> </span><span class="mi">48375645</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>After filtering out the <code class="highlighter-rouge">coordinates</code> fields, it keeps only elements
directly under the <code class="highlighter-rouge">features</code> key (i.e. a top-level feature) which contains
“ZAF” somewhere (the “sov_a3” field, in this case).</p>
<p>Isn’t this just as complicated as the jq syntax? Sure! But at least you learned
something useful. If you get better at writing CSS selectors as a result of
filtering JSON files, then that’s great! You’ve become a better web developer
in the process.</p>
<p>You can install <code class="highlighter-rouge">jss</code> with <code class="highlighter-rouge">pip</code>. Read more <a href="https://github.com/danvk/jss">on github</a>!</p>
Facebook (non-)Insights2014-10-13T00:00:00+00:00https://danvk.github.io/2014/10/13/facebook-non-insights<p>There was a spike of traffic to <a href="https://www.comparea.org">Comparea</a> over the weekend:</p>
<p><img src="/images/comparea-spike.png" alt="Spike of traffic to comparea.org" title="Spike of traffic to comparea.org" /></p>
<p>Awesome! All of it came from Facebook and went to a comparison of <a href="https://www.comparea.org/FXX+AUS">France vs Australia</a>:</p>
<p><img src="/images/comparea-all-facebook.png" alt="All traffic coming from Facebook" title="All traffic coming from Facebook" /></p>
<p>I can easily get insight into who <a href="https://twitter.com/TeachingIdeas/status/520656010391093248">tweeted
it</a>. But Facebook
is a big black box. In theory, I can use <a href="https://www.facebook.com/FacebookInsights">Facebook
Insights</a> to track this. It claims
that 50 actions have led to ~2400 visits to my site, but declines to say anything more:</p>
<p><img src="/images/comparea-facebook-non-insight.png" alt="All information is suppressed" title="All information is suppressed" /></p>
<p>I understand that Facebook is more private than Twitter, but this is
frustrating. I’d hope to at least see which country the shares are happening
in. Are these French people coming to Comparea? Australians? Facebook won’t
even reveal that all the recent visits are to a single URL! I’d think that
thousands of clicks would be enough to provide anonymity there.</p>
<p>I’m not sure exactly what sorts of insights Facebook can share without
revealing identities, but I’d very much hoped for more than this.</p>
<p>Has anyone had positive experiences with Facebook Insights?</p>
Trying out GitHub Pages2014-10-01T00:00:00+00:00https://danvk.github.io/2014/10/01/post<p>I’m going to try hosting my site and blog on GitHub pages. My hope is that
blogging using GitHub and Markdown will lower the barrier to writing, and that
GitHub pages will eliminate any worries about performance and security while
hosting my own site.</p>
<p>This is all very much a work in progress, so feedback is welcome!</p>