Kimberly Fessel BlogKimberly Fessel is a Senior Data Scientist at Metis. Her enthusiasm for data storytelling often leads her toward better math, better visuals, and better science!https://kimfetti.github.io/
5 Significant Object Detection Challenges and Solutionshttps://kimfetti.github.io/algorithms/literature%20reviews/object-detection-challenges/
Tue, 10 Sep 2019 00:00:00 +0000<!--more-->
<p>The field of computer vision has experienced substantial progress recently owing largely to advances in deep learning, specifically convolutional neural nets (CNNs). Image classification, where a computer classifies or assigns labels to an image based on its content, can often see great results simply by leveraging pre-trained neural nets and fine-tuning the last few throughput layers.</p>
<p>Classifying <em>and</em> finding an unknown number of individual objects within an image, however, was considered an extremely difficult problem only a few years ago. This task, called object detection, is now feasible and has even been productized by companies like <a href="https://cloud.google.com/vision/docs/drag-and-drop">Google</a> and <a href="https://www.ibm.com/watson/services/visual-recognition/">IBM</a>. But all of this progress wasn’t easy! Object detection presents many substantial challenges beyond what is required for image classification. After a brief introduction to the topic, let’s take a deep dive into several of the interesting obstacles these problems face along with various emerging solutions.</p>
<h2 id="introduction">Introduction</h2>
<p>The ultimate purpose of object detection is to locate important items, draw rectangular bounding boxes around them, and determine the class of each item discovered. Applications of object detection arise in <a href="https://www.quora.com/What-are-some-interesting-applications-of-object-detection">many different fields</a> including detecting pedestrians for self-driving cars, monitoring agricultural crops, and even real-time ball tracking for sports. Researchers have dedicated a substantial amount of work towards this goal over the years: from <a href="https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework">Viola and Jones</a>’s facial detection algorithm published in 2001 to <a href="https://arxiv.org/abs/1708.02002">RetinaNet</a>, a fast, highly accurate one-state detection framework released in 2017. The introduction of CNNs marks a pivotal moment in object detection history, as nearly all modern systems use CNNs in some form. That said, the remainder of this post will focus on deep learning solutions for object detection, though similar challenges confront other approaches as well.</p>
<h2 id="challenges">Challenges</h2>
<h3 id="1-dual-priorities-object-classification-and-localization">1. Dual priorities: object classification and localization</h3>
<p>The first major complication of object detection is its added goal: not only do we want to classify image objects but also to determine the objects’ positions, generally referred to as the <em>object localization</em> task. To address this issue, researchers most often use a multi-task loss function to penalize both misclassifications and localization errors.</p>
<p>Regional-based CNNs represent one popular class of object detection frameworks. These methods consist of the generation of region proposals where objects are likely to be located followed by CNN processing to classify and further refine object locations. Ross Girshick et al. developed <a href="https://arxiv.org/pdf/1504.08083.pdf">Fast R-CNN</a> to improve upon their initial results with <a href="https://arxiv.org/pdf/1311.2524.pdf">R-CNN</a>. As its name implies, Fast R-CNN provides a dramatic speed-up, but accuracy also improves because the classification and localization tasks are optimized using one unified multi-task loss function. Each candidate region that may contain an object is compared to the image’s true objects. Candidate regions then incur penalties for both false classifications and misalignment of the bounding boxes. Hence, the loss function consists of two kinds of terms:</p>
<p>\[\mathcal{L}(p, u, t^u, v) = \overbrace{\mathcal{L}_c(p,u)}^{classification} + \lambda\overbrace{\left[u\geq 1\right] \mathcal{L}_l(t^u, v)}^{localization}, \]</p>
<p>where the classification term imposes log loss on the predicted probability of the true object class \(u\) and the localization term is a smooth \(L_1\) loss for the four positional components that define the rectangle. Note that the localization penalty does not apply to the background class when no object is present, \(u=0\). Also note the parameter \(\lambda\) may be adjusted to prioritize either classification or localization more strongly.</p>
<h3 id="2-speed-for-real-time-detection">2. Speed for real-time detection</h3>
<p>Object detection algorithms need to not only accurately classify and localize important objects, they also need to be incredibly fast at prediction time to meet the real-time demands of video processing. Several key enhancements over the years have boosted the speed of these algorithms, improving test time from the 0.02 frames per second (fps) of R-CNN to the impressive 155 fps of Fast YOLO.</p>
<p><a href="https://arxiv.org/pdf/1504.08083.pdf">Fast R-CNN</a> and <a href="https://arxiv.org/pdf/1506.01497.pdf">Faster R-CNN</a> aim to speed up the original R-CNN approach. R-CNN uses <a href="https://koen.me/research/pub/uijlings-ijcv2013-draft.pdf">selective search</a> to generate 2,000 candidate regions of interest (RoIs) and passes each RoI through a CNN base individually, which causes a massive bottleneck since the CNN processing is quite slow. Fast R-CNN instead sends the entire image through the CNN base just once and then matches the RoIs created with selective search to the CNN feature map, yielding a 20-fold reduction in processing time. While Fast R-CNN is much speedier than R-CNN, yet another speed barrier persists. It takes approximately 2.3 seconds for Fast R-CNN to perform object detection on a single image, and selective search accounts for a full 2 seconds of that time! Faster R-CNN replaces selective search with a separate sub-neural network to generate RoIs, creating another 10x speed up and thus testing at a rate of about 7-18 fps.</p>
<p>Despite these impressive improvements, videos are typically shot at at least 24 fps, meaning Faster R-CNN will likely not keep pace. Regional-based methods consist of two separate phases: proposing regions and processing them. This task separation proves to be somewhat inefficient. Another major type of object detection systems relies on a unified one-state approach instead. These so-called single-shot detectors fully locate and classify objects during a single pass over the image, which substantially decreases test time. One such single-shot detector <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf">YOLO</a> begins by laying out a grid over the image and allows each grid cell to detect a fixed number of objects of varying sizes. For each true object present in the image, the grid cell associated with the object’s center is responsible for predicting this object. A complex, multi-term loss function then ensures that all localization and classification occurs within one process. One version of this method, Fast YOLO, has even achieved rates of 155 fps; however, classification and localization accuracy drops off sharply at this elevated speed.</p>
<p>Ultimately, today’s object detection algorithms attempt to strike a balance between speed and accuracy. Several design choices beyond the detection framework influence these outcomes. For example, <a href="https://arxiv.org/pdf/1804.02767.pdf">YOLOv3</a> allows images of varying resolution: high-res images typically see better accuracy but slower processing times and vice versa for low-res images. The choice of the CNN base also affects the speed-accuracy tradeoff. Very deep networks like the 164 layers used in Inception-ResNet-V2 yield impressive accuracy, but pale in comparision to frameworks with VGG-16 in terms of speed. Object detection design choices must be made in context depending on whether speed or accuracy takes priority.</p>
<h3 id="3-multiple-spatial-scales-and-aspect-ratios">3. Multiple spatial scales and aspect ratios</h3>
<p>For many applications of object detection, items of interest may appear in a wide range of sizes and aspect ratios. Practitioners leverage several techniques to ensure detection algorithms are able to capture objects at multiple scales and views.</p>
<h4 id="anchor-boxes">Anchor boxes</h4>
<p>Instead of selective search, Faster R-CNN’s updated region proposal network uses a small sliding window across the image’s convolutional feature map to generate candidate RoIs. Multiple RoIs may be predicted at each position and are described relative to reference <em>anchor boxes</em>. The shapes and sizes of these anchor boxes are carefully chosen to span a range of different scales and aspect ratios. This allows various types of objects to be detected with the hopes that the bounding box coordinates need not be adjusted much during the localization task. Other frameworks, including single-shot detectors, also adopt anchor boxes to initialize regions of interest.</p>
<center>
<img src="https://kimfetti.github.io/images/anchors.png" alt="Anchor boxes" width="500" />
<p><em> Carefully chosen anchor boxes of varying sizes and aspect ratios help create diverse regions of interest.</em></p>
</center>
<h4 id="multiple-feature-maps">Multiple feature maps</h4>
<p>Single-shot detectors must place special emphasis on the issue of multiple scales because they detect objects with a single pass through the CNN framework. If objects are detected from the final CNN layers alone, only large items will be found as smaller items may lose too much signal during downsampling in the pooling layers. To address this problem, single-shot detectors typically look for objects within multiple CNN layers including earlier layers where higher resolution remains. Despite the precaution of using multiple feature maps, single-shot detectors notoriously struggle to detect small objects, especially those in tight groupings like a flock of birds.</p>
<center>
<img src="https://kimfetti.github.io/images/ssd.png" alt="SSD with multiple feature maps" width="800" />
<p><em> Feature maps from multiple CNN layers help predict objects at multiple scales.</em></p>
</center>
<h4 id="feature-pyramid-network">Feature pyramid network</h4>
<p>The <a href="https://arxiv.org/pdf/1612.03144.pdf">feature pyramid network (FPN)</a> takes the concept of multiple feature maps one step further. Images first pass through the typical CNN pathway, yielding semantically rich final layers. Then to regain better resolution, FPN creates a top-down pathway by upsampling this feature map. While the top-down pathway helps detect objects of varying sizes, spatial positions may be skewed. Lateral connections are added between the original feature maps and the corresponding reconstructed layers to improve object localization. FPN currently provides one of the leading ways to detect objects at multiple scales, and YOLO was augmented with this technique in <a href="https://arxiv.org/pdf/1804.02767.pdf">its 3rd version</a>.</p>
<center>
<img src="https://kimfetti.github.io/images/fpn.png" alt="Feature pyramid network" width="450" />
<p><em> The feature pyramid network detects objects of varying sizes by reconstructing high resolution layers from layers with greater semantic strength.</em></p>
</center>
<h3 id="4-limited-data">4. Limited data</h3>
<p>The limited amount of annotated data currently available for object detection proves to be another substantial hurdle. Object detection datasets typically contain ground truth examples for about a dozen to a hundred classes of objects, while image classification datasets can include upwards of 100,000 classes. Furthermore, crowd sourcing often produces image classification tags for free (for example, by parsing the text of user-provided photo captions). Gathering ground truth labels <em>along with</em> accurate bounding boxes for object detection, however, remains incredibly tedious work.</p>
<p>The COCO dataset, provided by Microsoft, currently leads as some of the best object detection data available. COCO contains 300,000 segmented images with <a href="https://github.com/pjreddie/darknet/blob/master/data/coco.names">80 different categories</a> of objects with very precise location labels. Each image contains about 7 objects on average, and items appear at very broad scales. As helpful as this dataset is, object types outside of these 80 select classes will not be recognized if training solely on COCO.</p>
<p>A very interesting approach to alleviate data scarcity comes from YOLO9000, the <a href="https://arxiv.org/pdf/1612.08242.pdf">second version of YOLO</a>. YOLO9000 incorporates many important updates into YOLO, but it also aims to narrow the dataset gap between object detection and image classification. YOLO9000 trains simultaneously on both COCO and <a href="http://www.image-net.org/">ImageNet</a>, an image classification dataset with tens of thousands of object classes. COCO information helps precisely locate objects, while ImageNet increases YOLO’s classification “vocabulary.” A hierarchical WordTree allows YOLO9000 to first detect an object’s concept (such as “animal/dog”) and to then drill down into specifics (such as “Siberian husky”). This approach appears to work well for concepts known to COCO like animals but performs poorly on less prevalent concepts since RoI suggestion comes solely from the training with COCO.</p>
<center>
<img src="https://kimfetti.github.io/images/yolo9000.png" alt="YOLO9000 WordTree and examples" width="700" />
<p><em> YOLO9000 trains on both COCO and ImageNet to increase classification "vocabulary."</em></p>
</center>
<h3 id="5-class-imbalance">5. Class imbalance</h3>
<p>Class imbalance proves to be an issue for most classification problems, and object detection is no exception. Consider a typical photograph. More likely than not, the photograph contains a few main objects and the remainder of the image is filled with background. Recall that selective search in R-CNN produces 2,000 candidate RoIs per image–just imagine how many of these regions do not contain objects and are considered negatives!</p>
<p>A recent approach called focal loss is implemented in <a href="https://arxiv.org/abs/1708.02002">RetinaNet</a> and helps diminish the impact of class imbalance. In the optimization loss function, focal loss replaces the traditional log loss when penalizing misclassifications:
\[ FL(p_u) = -\overbrace{(1-p_u)^\gamma\;}^{*} \log(p_u)\]
where \(p_u \) is the predicted class probability for the true class and \(\gamma &gt; 0\). The additional factor (*) reduces loss for well-classified examples with high probabilities, and the overall effect de-emphasizes classes with many examples that the model knows well, such as the background class. Objects of interest occupying minority classes, therefore, receive more significance and see improved accuracy.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Object detection is customarily considered to be much harder than image classification, particularly because of these five challenges: dual priorities, speed, multiple scales, limited data, and class imbalance. Researchers have dedicated much effort to overcome these difficulties, yielding oftentimes amazing results; however, significant challenges still persist.</p>
<p>Basically all object detection frameworks continue to struggle with small objects, especially those bunched together with partial occlusions. Real-time detection with top-level classification and localization accuracy remains challenging, and practitioners must often prioritize one or the other when making design decisions. Video tracking may see improvements in the future if some continuity between frames is assumed rather than processing each frame individually. Futhermore, an interesting enhancement that may see more exploration would extend the current two-dimensional bounding boxes into three-dimensional bounding cubes. Even though many object detection obstacles have seen creative solutions, these additional considerations–and plenty more–signal that object detection research is certainly not done!</p>
https://kimfetti.github.io/algorithms/literature%20reviews/object-detection-challenges/Simple Ways to Improve Your Matplotlibhttps://kimfetti.github.io/visualizations/matplotlib-improvements/
Mon, 12 Aug 2019 00:00:00 +0000<!--more-->
<p><a href="https://matplotlib.org/">Matplotlib</a> is typically the first data visualization package that Python programmers learn. While its users can create basic figures with just a few lines of code, these resulting default plots often prove insufficient in both design aesthetics and communicative power. Simple adjustments can lead to dramatic improvements, however, and in this post, I will share several tips on how to upgrade your Matplotlib figures.</p>
<p>In the examples that follow, I will be using information found in <a href="https://www.kaggle.com/crawford/80-cereals">this Kaggle dataset about cereals</a>. I have normalized three features (calories, fat, and sugar) by serving size to better compare cereal nutrition and ratings. Details about these data transformations and the code used to generate each example figure can be found on <a href="https://github.com/kimfetti/Blog/blob/master/matplotlib_improvements.ipynb">GitHub</a>.</p>
<h2 id="remove-spines">Remove Spines</h2>
<p>The first Matplotlib default to update is that black box surrounding each plot, comprised of four so-called “spines.” To adjust them we first <a href="https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.gca.html">get our figure’s axes</a> via pyplot and then change the visibility of each individual spine as desired.</p>
<p>Let’s say, for example, we want to remove the top and right spines. If we have imported Matplotlib’s pyplot submodule with:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</code></pre></div></div>
<p>we just need to add the following to our code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s">'top'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s">'right'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<p>and the top and right spines will no longer appear. Removing these distracting lines allows more focus to be directed toward your data.</p>
<center>
<img src="https://kimfetti.github.io/images/spines.png" alt="Remove matplotlib spines" width="800" />
<p><em>Removing distracting spines can help people focus on your data.</em></p>
</center>
<h2 id="explore-color-options">Explore Color Options</h2>
<p>Matplotlib’s <a href="https://matplotlib.org/3.1.1/users/dflt_style_changes.html#colors-color-cycles-and-color-maps">default colors just got an upgrade</a> but you can still easily change them to make your plots more attractive or even to reflect your company’s brand colors.</p>
<h3 id="hex-codes">Hex Codes</h3>
<p>One of my favorite methods for updating Matplotlib’s colors is directly passing <a href="https://htmlcolorcodes.com/">hex codes</a> into the color argument because it allows me to be extremely specific about my color choices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#0000CC'</span><span class="p">)</span>
</code></pre></div></div>
<p><a href="https://www.w3schools.com/colors/colors_picker.asp">This handy tool</a> can help you select an appropriate hex color by testing it against white and black text as well as comparing several lighter and darker shades. Alternatively, you can take a more scientific approach when choosing your palette by checking out <a href="http://vrl.cs.brown.edu/color">Colorgorical</a> by Connor Gramazio from the Brown Visualization Research Lab. The Colorgorical tool allows you to build a color palette by balancing various preferences like human perceptual difference and aesthetic pleasure.</p>
<h3 id="xkcd-colors">xkcd Colors</h3>
<p>The <a href="https://xkcd.com/color/rgb/">xkcd color library</a> provides another great way to update Matplotlib’s default colors. These 954 colors were specifically curated and named by several hundred thousand participants of the <a href="https://blog.xkcd.com/2010/05/03/color-survey-results/">xkcd color name survey</a>. You can use them in Matplotlib by prefixing their names with “xkcd:”.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'xkcd:lightish blue'</span><span class="p">)</span>
</code></pre></div></div>
<center>
<img src="https://kimfetti.github.io/images/color.png" alt="Explore matplotlib colors" width="900" />
<p><em>Matplotlib's default colors can easily be updated by passing hex codes or referencing the xkcd library.</em></p>
</center>
<h2 id="layer-graph-objects">Layer Graph Objects</h2>
<p>Matplotlib allows users to layer multiple graphics on top of each other, which proves convenient when comparing results or setting baselines. Two useful properties should be utilized while layering: 1) <code class="highlighter-rouge">alpha</code> for controlling each component’s opacity and 2) <code class="highlighter-rouge">zorder</code> for moving objects to the foreground or background.</p>
<h3 id="opacity">Opacity</h3>
<p>The alpha property in Matplotlib adjusts an object’s opacity. This value ranges from zero to one with zero being fully transparent (invisible 👀) and one being entirely opaque. Reducing alpha will make your plot objects see-through, allowing multiple layers to be seen at once as well as allowing overlapping points to be distinguished, say, in a scatter plot.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
</code></pre></div></div>
<center>
<img src="https://kimfetti.github.io/images/alpha.png" alt="Adjust matplotlib opacity" width="800" />
<p><em>Decreasing alpha reduces opacity and can help you visualize overlapping points.</em></p>
</center>
<h3 id="order">Order</h3>
<p>Matplotlib’s zorder property determines how close objects are to the foreground. Objects with smaller zorder values appear closer to the background, while those with larger values present closer to the front. If I’m making a scatter plot with an accompanying line plot, for example, I can bring the line forward by increasing its zorder.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c">#background</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">zorder</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="c">#foreground</span>
</code></pre></div></div>
<center>
<img src="https://kimfetti.github.io/images/zorder.png" alt="Control layer order with zorder" width="600" />
<p><em> Plot objects can be brought to the foreground or pushed to the background by adjusting zorder.</em></p>
</center>
<h2 id="annotate-main-points-or-examples">Annotate Main Points or Examples</h2>
<p>Many visuals can benefit from the annotation of main points or specific, illustrative examples because these directly convey ideas and boost the validity of results. To add text to a Matplotlib figure, just include annotation code specifying the desired text and its location.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">TEXT</span><span class="p">,</span> <span class="p">(</span><span class="n">X_POSITION</span><span class="p">,</span> <span class="n">Y_POSITION</span><span class="p">),</span> <span class="o">...</span><span class="p">)</span>
</code></pre></div></div>
<p>The cereal dataset used to produced this blog’s visuals contains nutritional information about several brand name cereals along with a feature labeled as “rating.” One might firstly assume that “rating” is a score indicating cereals that consumers prefer. In the zorder figure above, however, I built a quick linear regression model showing that the correlation between calories per cup and rating is practically non-existent. It seems unlikely that calories would not factor into consumer preference, so we may already be skeptical about our initial assumption about “rating.”</p>
<p>This misconception becomes even more obvious when examining the extremes: Cap’n Crunch is the lowest rated cereal while All-Bran with Extra Fiber rates the highest. Annotating the figure with these representative examples immediately dispels false assumptions about “rating.” This rating information more likely indicates a cereal’s nutritional value. (I have also annotated the cereal with the most calories per cup; Grape Nuts is likely not meant to be consumed in such large quantities! 😆)</p>
<center>
<img src="https://kimfetti.github.io/images/annotate.png" alt="Annotate examples" width="700" />
<p><em> Annotating your visuals with a few examples can improve communication and add legitimacy.</em></p>
</center>
<h2 id="baseline-and-highlight">Baseline and Highlight</h2>
<p>Adding a baseline to your visuals helps set expectations. A simple horizontal or vertical line provides others with appropriate context and often speeds along their understanding of your results. Highlighting a specific region of interest, meanwhile, can further emphasize your conclusions and also facilitates communication with your audience. Matplotlib offers several options for baselining and highlighting, including horizontal and vertical lines, shapes such as rectangles, horizontal and vertical span shading, and filling between two lines.</p>
<h3 id="horizontal-and-vertical-lines">Horizontal and Vertical Lines</h3>
<p>Let’s now consider the interplay between fat and sugar in our cereal dataset. A basic scatter plot of this relationship doesn’t appear interesting at first, but after exploring further, we find the median fat per cup of cereal is just one gram because so many cereals contain no fat at all. Adding this baseline helps people arrive at this finding much more quickly.</p>
<center>
<img src="https://kimfetti.github.io/images/baseline.png" alt="Add a baseline" width="800" />
<p><em> A horizontal or vertical baseline can help set the stage for your data.</em></p>
</center>
<p>In other cases you may want to completely remove the default x- and y-axes that Matplotlib provides and create your own axes based on some data aggregate. This process requires three key steps: 1) remove all default spines, 2) remove tick marks, and 3) add new axes as horizontal and vertical lines.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#1. Remove spines</span>
<span class="k">for</span> <span class="n">spine</span> <span class="ow">in</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">spines</span><span class="o">.</span><span class="n">values</span><span class="p">():</span>
<span class="n">spine</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="c">#2. Remove ticks</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">([])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">([])</span>
<span class="c">#3. Add horizontal and vertical lines</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">Y_POSITION</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="c">#horizontal line</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">X_POSITION</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="c">#vertical line</span>
</code></pre></div></div>
<center>
<img src="https://kimfetti.github.io/images/new_axes.png" alt="Create new axes" width="550" />
<p><em> You can also create new axes for your data by removing spines and ticks and adding custom lines.</em></p>
</center>
<h3 id="rectangle">Rectangle</h3>
<p>Now that we have plotted the cereals’ fat and sugar contents on new axes, it appears that very few cereals are low in sugar but high in fat. That is, the upper-left quadrant is nearly empty. This seems reasonable because cereals typically are not savory. To make this point abundantly clear, we could direct attention to this low-sugar, high-fat area by drawing a rectangle around it and annotating. Matplotlib provides access to several shapes through its <a href="https://matplotlib.org/3.1.1/api/patches_api.html#module-matplotlib.patches">patches module</a>, including a rectangle or even a <a href="https://matplotlib.org/3.1.1/gallery/shapes_and_collections/dolphin.html#sphx-glr-gallery-shapes-and-collections-dolphin-py">dolphin</a>. Begin by importing code for the rectangle:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib.patches</span> <span class="kn">import</span> <span class="n">Rectangle</span>
</code></pre></div></div>
<p>Then to create a rectangle on the figure, grab the current axes and add a rectangular patch with its location, width, and height:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">add_patch</span><span class="p">(</span><span class="n">Rectangle</span><span class="p">((</span><span class="n">X_POSITION</span><span class="p">,</span> <span class="n">Y_POSITION</span><span class="p">),</span> <span class="n">WIDTH</span><span class="p">,</span> <span class="n">HEIGHT</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span>
</code></pre></div></div>
<p>Here, the x- and y-positions refer to the placement of the lower-left corner of the rectangle.</p>
<center>
<img src="https://kimfetti.github.io/images/rectangle.png" alt="Add a rectangle" width="600" />
<p><em> To direct people toward a particular part of your visual, consider adding a rectangle.</em></p>
</center>
<h3 id="shading">Shading</h3>
<p>Shading provides an alternative option for drawing attention to a particular region of your figure, and there are a few ways to add shading with Matplotlib.</p>
<p>If you intend to highlight an entire horizontal or vertical area, just layer a span into your visual:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">axhspan</span><span class="p">(</span><span class="n">Y_START</span><span class="p">,</span> <span class="n">Y_END</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="c">#horizontal shading</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axvspan</span><span class="p">(</span><span class="n">X_START</span><span class="p">,</span> <span class="n">X_END</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span> <span class="c">#vertical shading</span>
</code></pre></div></div>
<p>Previously discussed properties like <code class="highlighter-rouge">alpha</code> and <code class="highlighter-rouge">zorder</code> are critical here because you will likely want to make your shading transparent and/or move it to the background.</p>
<center>
<img src="https://kimfetti.github.io/images/shading.png" alt="Shading for highlighting" width="800" />
<p><em> Shading also provides an effective way to highlight a particular region of your plot.</em></p>
</center>
<p>If the area you would like to shade follows more complicated logic, however, you may instead <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.fill_between.html">shade between two user-defined lines</a>. This approach takes a set of x-values, two sets of y-values for the first and second lines, and an optional <code class="highlighter-rouge">where</code> argument that allows you to use logic to filter down to your region of interest.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">X_VALUES</span><span class="p">,</span> <span class="n">Y_LINE1</span><span class="p">,</span> <span class="n">Y_LINE2</span><span class="p">,</span> <span class="n">WHERE</span><span class="o">=</span><span class="n">FILTER</span> <span class="n">LOGIC</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span>
</code></pre></div></div>
<p>To shade the same area that was previously highlighted with a rectangle, simply define an array of equally spaced sugar values for the x-axis, fill between the median and max fat values on the y-axis (high fat), and filter down to sugar values less than the median (low sugar).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sugars</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">sugars_per_cup</span><span class="o">.</span><span class="nb">min</span><span class="p">(),</span> <span class="n">df</span><span class="o">.</span><span class="n">sugars_per_cup</span><span class="o">.</span><span class="nb">max</span><span class="p">(),</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span><span class="o">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">sugars</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">fat_per_cup</span><span class="o">.</span><span class="n">median</span><span class="p">(),</span> <span class="n">df</span><span class="o">.</span><span class="n">fat_per_cup</span><span class="o">.</span><span class="nb">max</span><span class="p">(),</span>
<span class="n">WHERE</span><span class="o">=</span><span class="n">sugars</span> <span class="o">&lt;</span> <span class="n">df</span><span class="o">.</span><span class="n">sugars_per_cup</span><span class="o">.</span><span class="n">median</span><span class="p">(),</span> <span class="o">...</span><span class="p">)</span>
</code></pre></div></div>
<center>
<img src="https://kimfetti.github.io/images/fill_between.png" alt="Fill between lines" width="600" />
<p><em> More complex shading logic is accomplished by filling between two lines and applying a filter.</em></p>
</center>
<h2 id="conclusion">Conclusion</h2>
<p>Matplotlib often gets a bad reputation due to its poor defaults and the shear amount of code needed to produce decent looking visuals. Hopefully, the tips provided in this blog will help you address the first issue, though I’ll admit that the final few example figures required many updates and subsequently a sizable amount of code. If the required bulk of code bothers you, the <a href="https://seaborn.pydata.org/">Seaborn</a> visualization library is an excellent alternative to Matplotlib. It comes with better defaults overall, demands fewer lines of code, and supports customization via traditional Matplotlib syntax if needed.</p>
<p>The main thing to keep in mind when you visualize data–no matter which package you choose–is your audience. The suggestions I’ve offered here aim to smooth out the data communication process by 1) removing extraneous bits like unnecessary spines or tick marks, 2) telling the data story quicker by setting expectations with layering and baselines, and 3) highlighting main conclusions with shading and annotations. The resulting aesthetics also improve, but the primary goal is stronger and more seamless data communication.</p>
<p><em>I recently shared content similar to this in a data visualization talk at <a href="https://odsc.com/training/portfolio/custom-data-visualizations-with-python/">ODSC NYC</a>. You can access my original conference materials as well as the code that powers each example figure in the links below.</em></p>
<table>
<tbody>
<tr>
<td><a href="https://github.com/kimfetti/Blog/blob/master/matplotlib_improvements.ipynb">Check out this code on GitHub!</a></td>
<td> </td>
<td><a href="http://bit.ly/odscNyc19_dataviz">Check out my ODSC conference materials with Google Colab!</a></td>
</tr>
</tbody>
</table>
https://kimfetti.github.io/visualizations/matplotlib-improvements/Solving the Birthday Problem on Marshttps://kimfetti.github.io/mathematics/visualizations/planetary-birthday-problem/
Mon, 10 Dec 2018 00:00:00 +0000<!--more-->
<p>I was recently asked to develop a challenge problem for the <a href="https://www.thisismetis.com">Metis</a> data science bootcamp. Perhaps it was my background in math or maybe my penchant for mild torture, but I decided to have students answer a few exercises from <a href="https://www.amazon.com/Challenging-Problems-Probability-Solutions-Mathematics-ebook/dp/B00A3M0VV8">Fifty Challenging Problems with Solutions</a> by Mosteller. This book is full of classic problems in probability, and I highly recommend it to anyone prepping for data science interviews!</p>
<p>One of my favorite sections in this book is the birthday series, which includes a version of the birthday problem. This problem is about as famous as a probability question can get. It has been featured on <a href="https://www.npr.org/templates/story/story.php?storyId=4542341">NPR</a>, written about in an <a href="https://en.wikipedia.org/wiki/A_Fall_of_Moondust">Arthur C. Clarke novel</a>, and it even has its <a href="https://en.wikipedia.org/wiki/Birthday_problem">own Wikipedia page</a>.</p>
<p>The problem goes something along the lines of:</p>
<blockquote>
<p><span class="teaser">You are throwing a party and inviting random people you have never met. What’s the fewest number of <br /> people you need to invite to have at least 50% probability that two strangers will have the same birthday? <br />(Birth year need not match.)</span></p>
</blockquote>
<p>If you haven’t solved this one before, feel free to take a moment and give it a shot. Be warned – <strong>spoilers ahead</strong>!</p>
<h2 id="solving-with-probability">Solving with Probability</h2>
<p>Rather than the brute force approach, it turns out that the answer can be found much more easily by considering the complementary case; that is, “How many people can you invite to expect a 50% chance that all invited people have <em>unique</em> birthdays?” This “unsuccessful” probability along with the “successful” probability will sum to one.</p>
<p>Keeping the complementary case in mind, note that the first person at your party can have their birthday on any calendar day, but after that, each person must have a different day. Let \(p_u\) be the probability that \(r\) people each have unique birthdays. We find
\[p_u = 1 \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdots \frac{N-r+1}{N} = \frac{N!}{(N-r)!N^r}\]
where \(N\) is the number of days in one year.</p>
<p>Backtracking to the original birthday problem, we now just need to find the minimum value of \(r\) people that satisfy:
\[p_{s} = 1 - \frac{N!}{(N-r)!N^r} &gt; \frac{1}{2}.\]</p>
<p>This expression doesn’t look all that pleasant to be solved outright, so instead we can build a little solver in Python--or the language of your choice!--to be able to compute \(p_s\) for any given \(r\) and \(N\).</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def prob_birthday_success(r, N=365):
if r &gt; N:
return 1.
factorial = reduce(lambda x, y: x*y, range(N-r+1, N+1))
power = N**r
return (1 - factorial/power)
</code></pre></div></div>
<p>We steadily increase \(r\) and once we hit the \(p_s \geq \frac{1}{2}\) mark, we have our desired party size. The table below illustrates solutions for 50% probability as well as a few others values of \(p_s\).</p>
<center>
<table width="400">
<caption>Party size required for several success probabilities</caption>
<colgroup>
<col span="1" style="width: 50%;" />
<col span="1" style="width: 50%;" />
</colgroup>
<thead>
<tr>
<th>Probability</th>
<th>People Required</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.05</td>
<td>7</td>
</tr>
<tr>
<td>0.1</td>
<td>10</td>
</tr>
<tr>
<td>0.25</td>
<td>15</td>
</tr>
<tr>
<td>0.5</td>
<td>23</td>
</tr>
<tr>
<td>0.75</td>
<td>32</td>
</tr>
<tr>
<td>0.9</td>
<td>41</td>
</tr>
<tr>
<td>0.999</td>
<td>70</td>
</tr>
</tbody>
</table>
</center>
<p>Notice that if we invite just 23 people to our party, we will have a 50-50 chance in finding a shared birthday. Inviting 60 or 70 people pretty much guarantees it. Incredible!</p>
<p>So it’s plain to see that \(p_s\) increases rapidly as our party gets bigger here on Earth, but this led me to consider: “What would happen if the party took place on, say, Mars or Jupiter?” Or in less whimsical terms: “How many people would we need if we varied the year length, \(N\)?”</p>
<h2 id="planetary-results">Planetary Results</h2>
<p>The first step in solving the birthday problem for the rest of our solar system is gathering <a href="https://www.universetoday.com/37507/years-of-the-planets/">year lengths for each planet</a>, which vary wildly: from a meager 88 days on Mercury up to a whopping 60,182 days on Neptune. In fact, your entire life will be confined to a single orbital period of Neptune. There goes that Neptunian birthday party you’ve always wanted! (Admittedly, the definition of a “birthday” gets a little murky on these other planets… but more on this later.)</p>
<p>Once year lengths have been gathered, working out the problem for different values of \(N\) is as simple as returning to the Python function introduced earlier. The required number of party goers to achieve a 50% probability of birthday matching on each planet can be found below. As a child of the 80s, I must tell you it is <em>VERY</em> difficult for me to not include Pluto on this chart. But there’s always hope for a <a href="https://www.space.com/40550-pluto-planet-debate-flares-up-again.html">Plutonian comeback</a>! 🙏</p>
<center>
<iframe frameborder="no" border="0" marginwidth="0" marginheight="0" align="center" width="1050" height="500" src="https://public.tableau.com/views/PlanetaryBirthdayProblem/Planets-50?:showVizHome=no&amp;:embed=true"> </iframe>
</center>
<p><br /></p>
<p>Even on Neptune where there are over 60,000 Earth days per year, we only require 290 people to have a 50% chance of matching birthdays. That’s an amazingly small amount for such a massive number of days in each year!</p>
<p>So now back to our broader question: “What’s the overall trend as \(N\) increases?” Well, the chart above is great for being able to read information associated with every planet, but it’s a bit misleading trendwise because both axes are on a log scale. Let’s take a look at this same information without the axial scaling.</p>
<center>
<iframe frameborder="no" border="0" marginwidth="0" marginheight="0" align="center" width="700" height="400" src="https://public.tableau.com/views/PlanetaryBirthdayProblem/Planets-50-Trueaxes?:showVizHome=no&amp;:embed=true"> </iframe>
</center>
<p><br /></p>
<p>Ah ha! Viewing the data this way, a trend that looks roughly like a power relationship emerges. We will now take a closer look at the analytic expression for \(p_s\) to dive deeper into this relation.</p>
<h2 id="expanding-solution-with-approximations">Expanding Solution with Approximations</h2>
<p>We are about to embark upon the amazing world of expansions and approximations--AKA put your math pants on and fasten your seat belts! (If this sort of nerdout isn’t your thing, no worries. Just skip ahead to the end of this section where all will revealed… 🔮 Much of this work can also be found in <a href="https://www.amazon.com/Challenging-Problems-Probability-Solutions-Mathematics-ebook/dp/B00A3M0VV8">Mosteller</a> as his solution to the birthday problem.)</p>
<p>First recall that
\[e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!}+ \cdots,\]
so ultimately, \(e^{-x} \approx 1 - x\) for very small values of \(x\).</p>
<p>Now represent \(p_u\)--that’s the unsuccessful probability--as
\[p_u = \frac{N(N-1)(N-2)\cdots(N-r+1)}{N^r} = \frac{N^r - \hat{k}}{N^r} = 1 - \frac{k}{N}\]
where \(k\) contains multiple factors but all are smaller than \(N\).</p>
<p>Combining these two expressions, we then find that
\[p_u = 1 - \frac{k}{N} \approx e^{-k/N},\]
which is valid because \(k/N\) is typically much smaller than one.</p>
<p>Now let’s further consider the values contained within \(k\). Expanding out the numerator in \(p_u\), we have
\[p_u=\frac{N(N-1)\cdots(N-r+1)}{N^r} = \frac{N^r - N^{r-1}\left[0+1+2+\cdots(r-1)\right] + \cdots}{N^r} = 1 - \frac{0 + 1 + 2 + \cdots (r-1)}{N} + \cdots.\]
The quantity \(k\) represents several terms, but to leading order, it just looks a sum of the integers between 0 and \(r-1\). More specificially,
\[k \approx 0 + 1 + 2 + \cdots + r-1 = \sum_{i=0}^{r-1}j = \frac{r(r-1)}{2}.\]
So where does this lead us? Returning to our exponential expression above, we have
\[p_u \approx e^{-k/N} \approx e^{-r(r-1)/2N},\]
which looks <em>much</em> more tractable than the original expression we had for \(p_u\) containing those factorials. We can even come up with an expression to relate \(r\) to \(N\) more explicitly in the leading order.</p>
<p>Subbing in \(p_u = 1 - p_s\) and taking the natural log of each side, we eventually find
\[\frac{r(r-1)}{2N} \approx -\ln{(1 - p_s)},\]
which means
\[r(r-1) \approx -2N\ln{(1-p_s)}.\]
So there you have it! Selecting in any given value for \(p_s\) will fix the log factor and the other two quantities are related as
\[\mathcal{O}\left(r\right) \sim \mathcal{O}\left(\sqrt{N}\right).\]
The trend we saw in the planet chart was indeed a power relationship; specifically, \(r\) goes like \(\sqrt{N}\) as \(N\) increases for the birthday problem. This means that even on planets with many, many days in a year, we don’t really need to increase our party size by all that much to ensure our 50-50 chance of finding birthday twins.</p>
<h2 id="approximation-in-action">Approximation in Action</h2>
<p>How good is this approximation in practice? Well, the trendline we saw earlier in the true-scale axes chart was auto-fitted in Tableau with a power trend, and indeed, the equation for the resulting line was found to be</p>
<p>\[r = 1.28548 \cdot N^{0.491503}\]</p>
<p>So our square-root relationship appears to hold true.</p>
<p>We can also more explicitly consider what happens as \(N\) gets larger. Because we are estimating
\[1-\frac{k}{N} \approx e^{-k/N},\]
this approximation should actually become <em>more</em> valid as \(N\) becomes larger since \(k/N\) will resultingly grow smaller.</p>
<p>Now set \(p_s = \frac{1}{2}\) and let \(r_{1/2}\) be the 50-50 chance party size. Plotting the left-hand side of our approximation
\[\frac{r_{1/2}(r_{1/2}-1)}{2N} \approx \ln{2}\]
for various evenly sampled values of \(N\), we see in the figure below that this estimation indeed becomes more valid and encounters less variance about \(\ln{(2)}\) as \(N \to \infty\). (There is an added layer of complexity in this problem, however, because we require \(r_{1/2}\) to be an integer; this stipulation makes our approximation dance about the line a bit even at large values of \(N\).)</p>
<center>
<img src="https://kimfetti.github.io/images/planetary-birthday-approx.png" alt="Approximation plot" width="900" height="400" />
</center>
<h2 id="conclusion">Conclusion</h2>
<p>The birthday problem is a classic that has been examined from several different angles. I hope you’ve enjoyed this planetary rendition and the subsequent deep dive into analytic approximations to explore how \(r\) is related to year length.</p>
<p>A few final thoughts:</p>
<ol>
<li>
<p>It is well-known that birthdays are <a href="http://www.panix.com/~murphy/bday.html">not equally distributed throughout all 365 days</a>, especially if you focus on one region of the world. So how does non-uniformity affect our birthday solution?</p>
<p>It turns out that the uniform distribution of birthdays we used throughout this post is actually a <a href="https://www.jstor.org/stable/2318556?seq=1#page_scan_tab_contents">worst-case scenario</a> in terms of successfully finding birthmates. If birthdays are skewed toward one day or another, the odds that you will find birthday twins at your party actually increase… but not significantly. Attempts at calculating the birthday problem with real-world datasets have shown the 23-person group to be a pretty consistent solution, even when considering <a href="https://www.stat.wisc.edu/techreports/tr591.pdf">non-uniform distributions</a>.</p>
</li>
<li>
<p>I eluded to this earlier, but the idea of a “birthday” gets a bit complicated when thinking about other planets. I often mentioned my findings in terms of “Earth days” because I calculated each planet’s revolution about the Sun in the number of times it takes Earth to rotate about its own axis. What does that mean in the context of this problem?</p>
<p>Consider two cases: Jupiter and Mercury.</p>
<ul>
<li>Firstly, Jupiter rotates about its own axis in about <a href="https://www.universetoday.com/37507/years-of-the-planets/">9 hours and 55 minutes</a>, faster than any other planet in our solar system. So while Jupiter takes roughly 4,333 Earth days to complete its orbit about the Sun, this actually amounts to 10,476 <em>Jovian</em> days. That’s a lot more potential “birthdays!”</li>
<li>Mercury, on the other hand, completes a rotation about its axis <em>slower</em> than any other planet. It takes about 176 Earth days for Mercury to rotate, which is <em>longer</em> than Mercury’s revolution about the Sun. Ultimately, a “year” on Mercury is half as long as a “day.” The birthday problem is completely moot because everyone born in the same Mercurian year is automatically born on the same Mercurian day! (… Please ignore the fact that no one is ever actually born on Mercury. 😆)</li>
</ul>
<p><br />
I have chosen to disregard this planetary difference in the definition of a “day” for simplicity, but keeping with the same Python function introduced near the beginning of this post, we could relatively easily compute revised birthday solutions. Please let me know if you work this out!</p>
</li>
</ol>
<table>
<tbody>
<tr>
<td><a href="https://github.com/kimfetti/Blog/blob/master/planetary_birthday_problem.ipynb">Check out this code on GitHub!</a></td>
<td> </td>
<td><a href="https://public.tableau.com/profile/kimberly.fessel#!/vizhome/PlanetaryBirthdayProblem/Planets-50">Check out this viz on Tableau!</a></td>
</tr>
</tbody>
</table>
https://kimfetti.github.io/mathematics/visualizations/planetary-birthday-problem/