<p><a href="https://stripe.com/blog">Stripe Blog</a></p>
<h1><a href="https://stripe.com/blog/android-sdk-updates">Android SDK updates</a></h1>
<p>
We just launched <a href="https://github.com/stripe/stripe-android/blob/master/CHANGELOG#L1">version 1.1.0 of our Android SDK</a>, which fixes a few bugs and adds some new features:
</p>
<div class="feature-list">
<h3><strong class="new"><span>New</span></strong>Threading control</h3>
<p>
Based on your feedback, we’ve added the option to control threading granularly in your app without having to use <code>AsyncTask</code>s. Many modern Android apps take advantage of the abstractions afforded by <a href="http://reactivex.io/">ReactiveX</a>, so we’ve <a href="https://github.com/stripe/stripe-android#createtokensynchronous">included an example</a> of how to do it with RxJava.
</p>
<h3><strong class="update"><span>Update</span></strong><code>brand</code> instead of <code>type</code> for cards</h3>
<p>
Like the other Stripe API libraries, we now keep track of a card’s <code>brand</code> instead of its <code>type</code> in the Android SDK as well. There’s no change required—we’ve automatically mapped <code>Card#getType()</code> to <code>Card#getBrand()</code>.
</p>
<h3><strong class="new"><span>New</span></strong>Look up a card’s funding source</h3>
<p>
When you tokenize a card, the SDK will now return whether it’s a credit, debit, or prepaid card in case you want to handle these card types differently in your app.
</p>
<h3><strong class="new"><span>New</span></strong>Javadoc</h3>
<p>
We’ve added <a href="https://en.wikipedia.org/wiki/Javadoc">Javadoc</a> to all of the major classes in the SDK so that you can look up functions and parameters right from your favorite IDE:
</p>
<div class="image-center"><img width="499" height="148" data-hires="true" src="/img/blog/posts/android-sdk-updates/ide.png" alt="Javadoc shown inline in an IDE"></div>
</div>
<p>
To start using these features, just <a href="https://github.com/stripe/stripe-android">download the latest library</a> or update the inclusion line in your <a href="https://gradle.org/">Gradle</a> dependencies. (We’ve also updated our <a href="https://github.com/stripe/stripe-android#building-the-example-project">example app</a> to take advantage of these new features.)
</p>
<p>
We’ll be adding many more features to our Android SDK over the next few months—if you have questions or feedback, <a href="mailto:michael.mcduffee@stripe.com">please let me know</a>!
</p>
<p>Mon, 05 Dec 2016</p>
<h1><a href="https://stripe.com/blog/reproducible-research">Reproducible research: Stripe’s approach to data science</a></h1>
<p>When people talk about their data infrastructure, they tend to focus on the technologies: <a href="http://hadoop.apache.org/">Hadoop</a>, <a href="https://github.com/twitter/scalding">Scalding</a>, <a href="http://impala.apache.org/">Impala</a>, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.</p>
<p>
We’ll talk about our motivation for focusing on reproducibility, how we’re using <a href="http://jupyter.org/">Jupyter Notebooks</a> as our core tool, and the workflow we’ve developed around Jupyter to operationalize our approach.
</p>
<div class="image-center">
<video id="jupyter-screencast" width="720" height="440" poster="/img/blog/posts/reproducible-research/jupyter-demo.png" data-hires="true" loop controls style="border-radius: 5px;">
<source src="https://s3-us-west-1.amazonaws.com/stripe-images/videos/blog/reproducible-research/jupyter-demo.mp4" type="video/mp4">
<source src="https://s3-us-west-1.amazonaws.com/stripe-images/videos/blog/reproducible-research/jupyter-demo.webm" type="video/webm">
</video>
<p>Jupyter notebooks are a fantastic way to create reproducible data science research.</p>
</div>
<hr>
<h2>Motivation</h2>
<p>Data tools are most often used to generate some kind of exploratory analysis report. At Stripe, an example is an investigation of the probability that a card gets declined, given the time since its last charge. The investigator writes a query, which is executed by a query engine like <a href="https://aws.amazon.com/redshift/">Redshift</a>, and then runs some further code to interpret and visualize the results.</p>
<p>The most common way to share results from these sorts of studies is to compose an email and attach some graphs. But this means that viewers of the report don’t know how the query was constructed and analyzed. As a result, they are unable to review the work in depth, or to extend it themselves. It’s very easy to commit methodological errors when asking questions of data; an unintended bias here, or a missed corner case there, can lead to entirely incorrect conclusions.
<p>In academia, the peer review system helps catch these errors. Many in the scientific community have championed the practice of open science, where data and code are released along with experimental results, such that reviewers can independently recreate the original results. Taking inspiration from this movement, we sought to make data reports within Stripe transparent and reproducible, so that anyone at the company can look at a report and understand how it was generated. Just like an always-green test suite forces developers to write better code, we wanted to see if requiring all analyses be reproducible would force us to produce better reports.</p>
<hr>
<h2>Implementation</h2>
<p>Our implementation of reproducible analysis centers on <a href="http://jupyter.org/">Jupyter Notebook</a>, a web-based frontend to the Jupyter interactive computing environment, which provides an interface similar to Mathematica or MATLAB.</p>
<p class="image-center">
<img src="/img/blog/posts/reproducible-research/benfords-law.png" width="690">
</p>
<p>Jupyter Notebook also comes with built-in functionality to convert a notebook into a publishable HTML document. You can see a <a href="http://nbviewer.jupyter.org/gist/danielhfrank/dc98c757009d1f4c37d1">sample of one of our published notebooks</a>, studying the relationship between Benford’s Law and the amounts of each charge made on Stripe.</p>
<p>Now, let’s say that Alice wants to share a notebook with Bob. The state of the interactive environment can be persisted as a JSON file containing both the code input to the notebook and data output from it. To share the notebook, Alice would typically send this notebook file directly to Bob. When Bob opens it, he’ll see the same outputs as Alice, but may not be able to do much with them. These outputs include computational results and plots’ image data, but not the values of any of the variables that Alice was working with. To inspect these variables and extend Alice’s work, he’ll have to recompute them from the code inputs. However, there may have been certain cells that only run correctly on Alice’s computer, or some cells might have been rearranged in a way that unintentionally broke the flow of computation. It’s easy to miss mistakes like these when you’re able to share a notebook with the results embedded, so we decided to try something different.</p>
<p class="image-center">
<img src="/img/blog/posts/reproducible-research/git-diagram.png" data-hires="true" width="568">
</p>
<p>In our workflow, developers and data scientists work on a notebook locally and check this source file into Git. To publish their work, they use our common deployment framework, which executes the notebook code once it hits our servers. The results are translated into HTML, which are served statically. Importantly, we strip results from the notebook files in a pre-commit hook, meaning that only code is checked into our repositories. This ensures that the results are fully reproduced from scratch when the notebook is published. Thus, it’s a requirement that all notebooks be programmatically executable from top to bottom, without needing any manual steps to run. If you were on a Stripe computer, you could run the notebook above with one click and obtain the same results. This is a huge deal!</p>
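The pre-commit hook itself isn’t shown here, but notebook files are plain JSON, so stripping results amounts to clearing each code cell’s outputs. Here’s a minimal Python sketch of the idea (the function name is ours, not Stripe’s; a real hook would also handle nbformat versions and metadata):

```python
import json

def strip_outputs(notebook_json):
    """Clear computed results from a Jupyter notebook so only code is committed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop results and embedded plot data
            cell["execution_count"] = None  # drop the In[n] counters
    return json.dumps(nb, indent=1, sort_keys=True)

# A minimal nbformat-4 notebook with one executed cell:
nb = json.dumps({
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [{
        "cell_type": "code", "execution_count": 3, "metadata": {},
        "source": ["1 + 1"],
        "outputs": [{"output_type": "execute_result", "execution_count": 3,
                     "metadata": {}, "data": {"text/plain": ["2"]}}],
    }],
})
print(json.loads(strip_outputs(nb))["cells"][0]["outputs"])  # []
```

Running something like this in a pre-commit hook guarantees the committed file contains no stale results, so the published HTML can only come from a fresh top-to-bottom execution.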
<p>To make this workflow possible, we had to write some additional tooling to enable the same code to run on developers’ laptops and production servers. The bulk of this work involved access to our query engines, which is perhaps the most common obstacle to collaboration on data analysis projects. Even very well-organized workflows often require a data file to be present at a particular path, or an out-of-band authentication step with the machines running the queries. The key to overcoming these challenges was to create a common entry point in code to access these query engines from developers’ laptops, as well as our servers. This way, a notebook that runs on one developer’s computer will always run correctly on everyone else’s.</p>
<p>Adding this tooling also greatly improved the experience of doing exploratory data analysis within the notebook. Prior to our reproducibility tooling, setting up data access was tedious, time-consuming, and error-prone. Automating and standardizing this process allowed data scientists and developers to focus on their analysis instead.</p>
<hr>
<h2>Conclusion</h2>
<p>Reproducibility makes data science at Stripe feel like working on <a href="https://github.com/">GitHub</a>, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration. </p>
<p>We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.</p>
<p>We’ve switched over to reproducible reports, and we’re not looking back. Delivering them requires more up-front work, but we’ve found it to be a good long-term investment. If you give it a try, we think you’ll feel the same way!</p>
<p class="cta">
Like this post? Join the Stripe engineering team. <a href="/jobs?ref=blog#engineering" class="button">View Openings</a>
</p>
<script src="/assets/blog/jquery.viewport.mini-a681f77c3139a12430d32a92f3e7fd96.js"></script>
<script type="text/javascript">
var playing = false;
var $video = $('#jupyter-screencast');
// Play the video once it’s visible in the viewport
function shouldPlay() {
if ($video.is(':in-viewport') && !playing) {
playing = true;
$video[0].play();
}
}
// Check on page load, scroll, and resize events
$(window).scroll(shouldPlay);
$(window).resize(shouldPlay);
shouldPlay();
</script>
<p>Tue, 22 Nov 2016</p>
<h1><a href="https://stripe.com/blog/started-on-stripe-atlas">Started on Stripe Atlas</a></h1>
<div class="background"></div>
<a href="https://medium.com/started-on-stripe-atlas">
<span class="title">Started on Stripe Atlas.</span> Founders in 110+ countries have used Atlas to get started. <span class="arrow">We've collected some of their first-hand accounts</span>
</a>
<p>Mon, 14 Nov 2016</p>
<h1><a href="https://stripe.com/blog/works-with-stripe">Works with Stripe</a></h1>
<div class="background"></div>
<a href="/works-with">
<strong>Works with Stripe.</strong> Explore hundreds of pre-built tools and products to help your business do more with Stripe. <span class="arrow">Learn more</span>
</a>
<p>Mon, 07 Nov 2016</p>
<h1><a href="https://stripe.com/blog/service-discovery-at-stripe">Service discovery at Stripe</a></h1>
<p>
With so many new technologies coming out every year (like <a href="http://kubernetes.io/">Kubernetes</a> or <a href="https://www.habitat.sh/">Habitat</a>), it’s easy to become so entangled in our excitement about the future that we forget to pay homage to the tools that have been quietly supporting our production environments. One such tool we've been using at Stripe for several years now is <a href="https://www.consul.io/">Consul</a>. Consul helps discover services (that is, it helps us navigate the thousands of servers we run with various services running on them and tells us which ones are up and available for use). This effective and practical architectural choice wasn't flashy or entirely novel, but has served us dutifully in our continued mission to provide reliable service to our users around the world.
</p>
<p>
We’re going to talk about:
</p>
<ul>
<li>What service discovery and Consul are,</li>
<li>How we managed the risks of deploying a critical piece of software,</li>
<li>The challenges we ran into along the way and what we did about them.</li>
</ul>
<p>
You don’t just set up new software and expect it to magically work and solve all your problems—using new software is a process. This is an example of what the process of using new software in production has looked like for us.
</p>
<p class="image-center">
<a href="/img/blog/posts/service-discovery-at-stripe/consul-illustration.png" class="zoom img" maxSize="2200x1640">
<img src="/img/blog/posts/service-discovery-at-stripe/consul-illustration.png" width="800">
</a>
</p>
<h2>What’s service discovery?</h2>
<p>
Great question! Suppose you’re a load balancer for Stripe, and a request to create a charge has come in. You want to send it to an API server. Any API server!
</p>
<p>
We run thousands of servers with various services running on them. Which ones are the API servers? What port is the API running on? One amazing thing about using AWS is that our instances can go down at any time, so we need to be prepared to:
</p>
<ul>
<li>Lose API servers at any time,</li>
<li>Add extra servers to the rotation if we need additional capacity.</li>
</ul>
<p>
This problem of tracking changes around which boxes are available is called service discovery. We use a tool called Consul from <a href="https://www.hashicorp.com/">HashiCorp</a> to do service discovery.
</p>
<p>
The fact that our instances can go down at any time is actually very helpful—our infrastructure gets regular practice losing instances and dealing with it automatically, so when it happens it’s just business as usual. It’s easier to handle failure gracefully when failure happens often.
</p>
<hr>
<h2>Introduction to Consul</h2>
<p>
Consul is a service discovery tool: it lets services register themselves and discover other services. It stores which services are up in a database, has client software that puts information into that database, and other client software that reads from that database. There are a lot of pieces to wrap your head around!
</p>
<p>
The most important component of Consul is the database. This database contains entries like “<code>api-service</code> is running at IP 10.99.99.99 at port 12345. It is up.”
</p>
<p>
Individual boxes publish information to Consul saying “Hi! I am running <code>api-service</code> on port 12345! I am up!”.
</p>
<p>
Then if you want to talk to the API service, you can ask Consul “Which <code>api-services</code> are up?”. It will give you back a list of IP addresses and ports you can talk to.
</p>
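The post doesn’t show what that answer looks like, but Consul exposes it over an HTTP API (<code>GET /v1/health/service/api-service?passing=true</code>), and extracting the address list is a few lines of Python. The payload below is invented, though it follows the shape of the real health API:

```python
import json

# Example response body from `GET /v1/health/service/api-service?passing=true`
# (the shape matches Consul's health API; the addresses are made up).
response_body = """
[
  {"Node": {"Node": "apibox1", "Address": "10.99.99.99"},
   "Service": {"Service": "api-service", "Address": "", "Port": 12345}},
  {"Node": {"Node": "apibox2", "Address": "10.99.99.100"},
   "Service": {"Service": "api-service", "Address": "", "Port": 12345}}
]
"""

def healthy_endpoints(body):
    """Turn a Consul health response into a list of (ip, port) pairs."""
    endpoints = []
    for entry in json.loads(body):
        svc = entry["Service"]
        # A service may register its own address; fall back to the node's.
        ip = svc["Address"] or entry["Node"]["Address"]
        endpoints.append((ip, svc["Port"]))
    return endpoints

print(healthy_endpoints(response_body))
# [('10.99.99.99', 12345), ('10.99.99.100', 12345)]
```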
<p>
Consul is a distributed system itself (remember: we can lose any box at any time, which means we could
lose the Consul server itself!) so it uses a consensus algorithm called Raft to keep its database in sync.
</p>
<p>
If you’re interested in consensus in Consul, <a href="https://www.consul.io/docs/internals/consensus.html">read more here</a>.
</p>
<hr>
<h2>The beginning of Consul at Stripe</h2>
<p>
We started out by only writing to Consul—having machines report whether or not they were up to the Consul server, but not using that information to do service discovery. We wrote some <a href="https://puppet.com/">Puppet</a> configuration to set it up, which wasn’t that hard!
</p>
<p>
This way we could uncover potential issues with running the Consul client and get experience operating it on thousands of machines. At first, no services were being discovered with Consul.
</p>
<p>
What could go wrong?
</p>
<h3>Addressing memory leaks</h3>
<p>
If you add a new piece of software to every box in your infrastructure, that software could definitely go wrong! Early on we ran into memory leaks in Consul’s stats library: we noticed that one box was taking over 100MB of RAM and climbing. This was a bug in Consul, <a href="https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141">which got fixed</a>.
</p>
<p>
100MB of memory is not a big leak, but the leak was growing quickly. Memory leaks in general are worrisome because they're one of the easiest ways for one process on a box to Totally Ruin Things for other processes on the box.
</p>
<p>
Good thing we decided not to use Consul to discover services to start! Letting it sit on a bunch of production machines and monitoring memory usage let us find out about a potentially serious problem quickly with no negative impact.
</p>
<h3>Starting to discover services with Consul</h3>
<p>
Once we were more confident that running Consul in our infrastructure would work, we started adding a few clients to talk to Consul! We made this less risky in two ways:
</p>
<ul>
<li>Only use Consul in a few places to start,</li>
<li>Keep a fallback system in place so that we could function during outages.</li>
</ul>
<p>
Here are some of the issues we ran into. We’re not listing these to complain about Consul, but rather to emphasize that when using new technology, it’s important to roll it out slowly and be cautious.
</p>
<p>
<strong>A ton of Raft failovers.</strong> Remember that Consul uses a consensus protocol? Raft replicates all the data on one server in the Consul cluster to the other servers in that cluster. The primary server was having a ton of problems with disk I/O—the disks weren’t fast enough to do the reads that Consul wanted to do, and the whole primary server would hang. Then Raft would say “oh, the primary is down!” and elect a new primary, and the cycle would repeat. While Consul was busy electing a new primary, it wouldn’t let anybody read from or write to its database (because consistent reads are the default).
</p>
<p>
<strong>Version 0.3 broke SSL completely.</strong> We were using Consul’s SSL feature (technically, TLS) for our Consul nodes to communicate securely. One Consul release just broke it.
<a href="https://github.com/hashicorp/consul/pull/233">We patched it.</a>
This is an example of a kind of issue that isn’t that difficult to detect or scary (we tested in QA, realized SSL was broken, and just didn’t roll out the release), but is pretty common when using early-stage software.
</p>
<p>
<strong>Goroutine leaks.</strong> We started using <a href="https://www.consul.io/docs/guides/leader-election.html">Consul’s leader election</a>, and there was a goroutine leak that caused Consul to quickly eat all the memory on the box. The Consul team was really helpful in debugging this and we fixed a bunch of memory leaks (different memory leaks from before).
</p>
<p>
Once all of these were fixed, we were in a much better place. Getting from “our first Consul client” to “we’ve fixed all these issues in production” took a bit less than a year of background work cycles.
</p>
<hr>
<h2>Scaling Consul to discover which services are up</h2>
<p>
So, we’d learned about a bunch of bugs in Consul, and had them fixed, and everything was operating much better. Remember that step we talked about at the beginning, though? Where you ask Consul “hey, what boxes are up for <code>api-service</code>?” We were having intermittent problems where the Consul server would respond slowly or not at all.
</p>
<p>
This was mostly during Raft failovers or instability; because Consul uses a strongly consistent store, its availability will always be weaker than that of a system that doesn’t. It was especially rough in the early days.
</p>
<p>
We still had fallbacks, but Consul outages became pretty painful for us. We would fall back to a hardcoded set of DNS names (like “apibox1”) when Consul was down. This worked okay when we first rolled out Consul, but as we scaled and used Consul more widely, it became less and less viable.
</p>
<h3>Consul Template to the rescue</h3>
<p>
Asking Consul which services were up (via its HTTP API) was unreliable. But we were happy with it otherwise!
</p>
<p>
We wanted to get information out of Consul about which services were up without using its API. How?
</p>
<p>
Well, Consul would take a name (like <code>monkey-srv</code>) and translate it into one or several IP addresses (“this is where <code>monkey-srv</code> lives”). Know what else takes in names and outputs IP addresses? A DNS server! So we replaced Consul with a DNS server. Here’s how: <a href="https://github.com/hashicorp/consul-template">Consul Template</a> is a Go program that generates static configuration files from your Consul database.
</p>
<p>
We started using Consul Template to generate DNS records for Consul services. So if <code>monkey-srv</code> was running on IP 10.99.99.99, we’d generate a DNS record:
</p>
<pre>
monkey-srv.service.consul IN A 10.99.99.99
</pre>
<p>
Here’s what that looks like in code. You can also find our real <a href="https://gist.github.com/ebroder/51fef2a3fdf275ec43e5">Consul Template configuration</a> which is a little more complicated.
</p>
<pre>
{{range service $service.Name}}
{{$service.Name}}.service.consul. IN A {{.Address}}
{{end}}
</pre>
<p>
If you’re thinking “wait, DNS records only have an IP address—you also need to know which port the server is running on,” you’re right! DNS A records (the kind you normally see) only have an IP address in them. However, DNS SRV records can have ports in them, and we also use Consul Template to generate SRV records.
</p>
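For illustration, an SRV record for the example service above might look like this—the two extra fields before the port are priority and weight, and the exact names depend on your template (this line is our own example, not from Stripe’s configuration):

```
_monkey-srv._tcp.service.consul. IN SRV 0 0 12345 apibox1.node.consul.
```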
<p>
We run Consul Template in a cron job every 60 seconds. Consul Template also has a “watch” mode (the default) which continuously updates configuration files when its database is updated. When we tried the watch mode, it DOSed our Consul server, so we stopped using it.
</p>
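That cron entry might look something like this—<code>-once</code> tells Consul Template to render its templates a single time and exit instead of watching (the config path here is hypothetical):

```
* * * * * consul-template -once -config /etc/consul-template/config.hcl
```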
<p>
So if our Consul server goes down, our internal DNS server still has all the records! They might be a little old, but that’s fine. What’s awesome about our DNS server is that it’s not a fancy distributed system, which means it’s a much simpler piece of software, and much less likely to spontaneously break. This means that I can just look up <code>monkey-srv.service.consul</code>, get an IP, and use it to talk to my service!
</p>
<p>
Because DNS is a shared-nothing, eventually consistent system, we can replicate and cache it a bunch (we have 5 canonical DNS servers, and every server has a local DNS cache and knows how to talk to any of the 5 canonical servers), so it’s fundamentally more resilient than Consul.
</p>
<h3>Adding a load balancer for faster healthchecks</h3>
<p>
We just said that we update DNS records from Consul every 60 seconds. So, what happens if an API server explodes? Do we keep sending requests to that IP for 45 more seconds until the DNS server gets updated? We do not! There’s one more piece of the story: <a href="http://www.haproxy.org/">HAProxy</a>.
</p>
<p>
HAProxy is a load balancer. If you give it a healthcheck for the service it’s sending requests to, it can make sure that your backends are up! All of our API requests actually go through HAProxy. Here’s how it works:
</p>
<ul>
<li>Every 60 seconds, Consul Template writes an HAProxy configuration file.</li>
<li>This means that HAProxy always has an approximately correct set of backends.</li>
<li>If a machine goes down, HAProxy realizes quickly that something has gone wrong (since it runs healthchecks every 2 seconds).</li>
</ul>
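The generated file isn’t shown in the post, but a Consul-Template-produced backend would look roughly like this (server names and addresses invented; <code>inter 2000</code> is the 2-second healthcheck interval mentioned above):

```
backend api-service
    option httpchk GET /healthcheck
    server apibox1 10.99.99.99:12345 check inter 2000
    server apibox2 10.99.99.100:12345 check inter 2000
```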
<p>
This means we restart HAProxy every 60 seconds. But does that mean we drop connections when we restart HAProxy? No. To avoid dropping connections between restarts, we use <a href="http://cbonte.github.io/haproxy-dconv/1.7/management.html#4">HAProxy’s graceful restarts feature</a>. It’s still possible to drop some traffic with this restart policy, <a href="https://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html">as described here</a>, but we don’t process enough traffic that it’s an issue.
</p>
<p>
We have a standard healthcheck endpoint for our services—almost every service has a <code>/healthcheck</code> endpoint that returns 200 if it’s up and errors if not. Having a standard is important because it means we can easily configure HAProxy to check service health.
</p>
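Stripe’s services aren’t little Python servers, but the contract is simple enough to show with only the standard library—respond 200 on <code>/healthcheck</code> when healthy, anything else takes the box out of rotation:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthcheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 means "send me traffic"; any error takes us out of rotation.
        if self.path == "/healthcheck":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), HealthcheckHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{port}/healthcheck") as resp:
    status = resp.status
server.shutdown()
print(status)  # 200
```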
<p>
When Consul is down, HAProxy will just have a stale configuration file, which will keep working.
</p>
<hr>
<h2>Trading consistency for availability</h2>
<p>
If you’ve been paying close attention, you’ll notice that the system we started with (a strongly consistent database which was guaranteed to have the latest state) was very different from the system we finished with (a DNS server which could be up to a minute behind). Giving up our requirement for consistency let us have a much more available system—Consul outages have basically no effect on our ability to discover services.
</p>
<p>
An important lesson from this is that consistency doesn’t come for free! You have to be willing to pay a price in availability, so if you’re going to use a strongly consistent system, it’s important to make sure that’s something you actually need.
</p>
<h3>What happens when you make a request</h3>
<p>
We covered a lot in this post, so let’s go through the request flow now that we’ve learned how it all works.
</p>
<p>
When you make a request for <code>https://stripe.com/</code>, what happens? How does it end up at the right server? Here’s a simplified explanation:
</p>
<ol>
<li>It comes into one of our public load balancers, running HAProxy,</li>
<li>Consul Template has populated a list of servers serving stripe.com in the <code>/etc/haproxy.conf</code> configuration file,</li>
<li>HAProxy reloads this configuration file every 60 seconds,</li>
<li>HAProxy sends your request on to a stripe.com server! It makes sure that the server is up.</li>
</ol>
<p>
It’s actually a little more complicated than that (there’s an extra layer, and Stripe API requests are more complicated still, because we have systems to deal with PCI compliance), but all the core ideas are there.
</p>
<p>
This means that when we bring up or take down servers, Consul takes care of removing them from the HAProxy rotation automatically. There’s no manual work to do.
</p>
<hr>
<h2>More than a year of peace</h2>
<p>
There are a lot of areas we’re looking forward to improving in our approach to service discovery. It’s a space with loads of active development and we see some elegant opportunities for integrating our scheduling and request routing infrastructure in the near future.
</p>
<p>
However, one of the important lessons we’ve taken away is that simple approaches are often the right ones. This system has been working for us reliably for more than a year without any incidents. Stripe doesn’t process anywhere near as many requests as Twitter or Facebook, but we do care a great deal about reliability. Sometimes the best wins come from deploying a stable, excellent solution instead of a novel one.
</p>
<p class="cta">
Like this post? Join the Stripe engineering team. <a href="/jobs?ref=blog#engineering" class="button">View Openings</a>
</p>
<p>Mon, 31 Oct 2016</p>
<h1><a href="https://stripe.com/blog/a-primer-on-machine-learning-for-fraud-detection">A primer on machine learning for fraud detection</a></h1>
<p><a href="/radar/guide">Stripe Radar</a> is a collection of tools to help businesses detect and prevent fraud. At Radar’s core is a machine learning engine that scans every card payment across Stripe’s 100,000+ businesses, aggregates information from those payments into behavioral signals that are predictive of fraud, and blocks payments that have a high probability of being fraudulent.</p>
<p>Radar’s power comes from all the data we obtain from the Stripe “network.” Instead of requiring users to label charges manually, Radar obtains the “ground truth” of fraud directly from our banking partners. Just as importantly, the signals we use in our models include aggregates over the entire stream of payments processed by Stripe: when a card is used for the first time on a Stripe business, there’s an 80% chance we’ve seen that card elsewhere on the Stripe network, and those previous interactions provide valuable information about potential fraud.</p>
<p>If you’re curious to learn more, we’ve put together a detailed outline that describes how we use machine learning at Stripe to detect and prevent fraud.</p>
<p><strong><a href="/radar/guide" class="arrow">Read more</a></strong></p>
<p>Thu, 27 Oct 2016</p>
<h1><a href="https://stripe.com/blog/radar">Stripe Radar</a></h1>
<div class="background"></div>
<a href="/radar">
<strong>Introducing Stripe Radar.</strong> Modern tools to help you beat fraud, fully integrated with your payments. <span class="arrow">Learn more</span>
</a>
<p>Wed, 19 Oct 2016</p>
<h1><a href="https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog">Introducing Veneur: high performance and global aggregation for Datadog</a></h1>
<p>When a company writes about their <a href="https://en.wikipedia.org/wiki/Observability">observability</a> stack, they often focus on sweet visualizations, advanced anomaly detection or innovative data stores. Those are all well and good, but today we’d like to talk about the tip of the spear when it comes to observing your systems: metrics pipelines! Metrics pipelines are how we get metrics from where they happen—our hosts and services—to storage quickly and efficiently so they can be queried, all without interrupting the host service.</p>
<p>First, let’s establish some technical context. About a year ago, Stripe started the process of migrating to <a href="https://www.datadoghq.com/">Datadog</a>. Datadog is a hosted product that offers metric storage, visualization and alerting. With them we can get some marvelous dashboards to monitor our Observability systems:</p>
<div class="image-center">
<a href="/img/blog/posts/veneur/datadog-observability-dashboard.png" class="zoom img" maxSize="1600x1232">
<img src="/img/blog/posts/veneur/datadog-observability-dashboard.png" width="680">
</a>
<p>Observability Overview Dashboard</p>
</div>
<p>Previously, we’d been using some nice open-source software, but it was sadly unowned and unmaintained internally. Facing the high cost—in money and people—of maintaining it ourselves, we decided that outsourcing to Datadog was a great idea. Nearly a year later, we’re quite happy with the improved visibility and reliability we’ve gained through significant effort in this area. One of the most interesting aspects of this work was how to even metric!</p>
<hr>
<h2>Using StatsD for metrics</h2>
<p>There are many ways to instrument your systems. Our preferred method is the <a href="https://github.com/etsy/statsd">StatsD</a> style: a simple text-based protocol with minimal performance impact. Code is instrumented to emit UDP to a central server at runtime whenever measured stuff happens.</p>
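As a concrete sketch (our example, not Stripe’s actual client), the StatsD wire format is just <code>name:value|type</code>, optionally with a sample rate, sent as a UDP datagram—conventionally to port 8125:

```python
import socket

def statsd_packet(name, value, metric_type, sample_rate=1.0):
    """Format a StatsD-style metric line, e.g. 'api.latency:12|ms'."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        packet += f"|@{sample_rate}"
    return packet

def emit(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send: fast for the caller, but nothing confirms receipt."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet.encode("ascii"), (host, port))
    sock.close()

emit(statsd_packet("api.request", 1, "c"))         # a counter increment
emit(statsd_packet("api.latency", 12, "ms", 0.5))  # a sampled timer
```

Note that `emit` returns immediately whether or not anything is listening—which is exactly the reliability tradeoff described below.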
<p>Like all of life, this choice has tradeoffs. For the sake of brevity, we’ll quickly mention the two downsides of StatsD that are most relevant to us: its use of <a href="https://en.wikipedia.org/wiki/User_Datagram_Protocol#Reliability_and_congestion_control_solutions">inherently unreliable</a> UDP, and its role as a Single Point of Failure for timer aggregation.</p>
<p>As you may know, UDP is a “fire and forget” protocol that does not require any acknowledgement by the receiver. This makes UDP pretty fast for the client, but also means that client has no way to ensure that the metric was received by anyone! When combined with the network and the host’s natural protections that cause traffic to be dropped, you’ve got a problem.</p>
<p>Another problem is the Single Point of Failure. The poor StatsD server has to process a lot of UDP packets if you’ve got a non-trivial number of sources. Add to that the nightmare of the machine going down and the need to shard or use other tricks to scale out, and you’ve got your work cut out for you.</p>
<hr>
<h2>DogStatsD and the lack of “global”</h2>
<p>Aware that a central StatsD server can be a problem for some, Datadog takes a different approach: each host runs an instance of <a href="http://docs.datadoghq.com/guides/dogstatsd/">DogStatsD</a> as part of the <a href="https://github.com/datadog/dd-agent">Datadog agent</a>. This neatly sidesteps most performance problems, but it created a large feature regression for Stripe: no more global percentiles. Datadog only supports per-host aggregations for histograms, timers, and sets.</p>
<p>Remember that, with StatsD, you emit a metric to the downstream server each time the event occurs. If you’re measuring API requests and emitting that metric on each host, you are now sending your timer metric to the local Datadog agent, which aggregates the samples and flushes them to Datadog’s servers in batches. For counters, this is great because you can just add them together! But for percentiles we’ve got problems. Imagine you’ve got hundreds of servers, each handling an unequal number of API requests with unequal workloads. Each host’s percentiles describe only that host, so they are not representative of how our whole API is behaving. Even worse, once we’ve generated the percentiles for our histograms there is no meaningful way, mathematically, to combine them. (More precisely, the percentiles of arbitrary subsamples of a distribution are not <a href="https://en.wikipedia.org/wiki/Sufficient_statistic">sufficient</a> for the percentiles of the full distribution.)</p>
<p>Stripe needs to know the overall percentiles because each host’s histogram only has a small subset of random requests. We needed something better!</p>
<hr>
<h2>Enter Veneur</h2>
<p>To provide these features to Stripe we created <a href="https://github.com/stripe/veneur" class="github">Veneur</a>, a DogStatsD server with global aggregation capability. We’re happily running it in production and you can too! It’s open-source and we’d love for you to take a look.</p>
<p>Veneur runs in place of Datadog’s bundled DogStatsD server, listening on the same port. It flushes metrics to Datadog just like you’d expect. That’s where the similarities end, however, and <a href="https://github.com/stripe/veneur#how-veneur-is-different-than-official-dogstatsd">the magic begins</a>.</p>
<p>Instead of aggregating the histogram and emitting percentiles at flush time, <a href="https://github.com/stripe/veneur#forwarding">Veneur forwards</a> the histogram on to a global Veneur instance which merges all the histograms and flushes them to Datadog at the next window. It adds a bit of delay—one flush period—but the result is a best-of-both mix of local and global metrics!</p>
<div class="image-center">
<a href="/img/blog/posts/veneur/datadog-charge-creation-latency.png" class="zoom img" maxSize="1000x406">
<img src="/img/blog/posts/veneur/datadog-charge-creation-latency.png" width="680">
</a>
<p>We monitor the performance of many of our API calls, such as this chart of various percentiles for creating a charge. Red bars are deploys!</p>
</div>
<hr>
<h2>Approximate, mergeable histograms</h2>
<p>As mentioned earlier, the essential problem with percentiles is that, once reported, they can’t be combined together. If host A received 20 requests and host B received 15, the two numbers can be added to determine that, in total, we had 35 requests. But if host A has a 99th percentile response time of 8ms and host B has a 99th percentile response time of 10ms, what’s the 99th percentile across both hosts?</p>
<p>The answer is, “we don’t know”. Taking the mean of those two percentiles results in a number that is statistically meaningless. If we have more than two hosts, we can’t simply take the percentile of percentiles either. We can’t even use the percentiles of each host to infer a range for the global percentile—the global 99th percentile could, in rare cases, be larger than any of the individual hosts’ 99th percentiles. We need to take the original set of response times reported from host A, and the original set from host B, and combine those together. Then, from the combined set, we can report the real 99th percentile across both hosts. That’s what forwarding is for.</p>
<p>Of course, there are a few caveats. If each histogram stores all the samples it received, the final histogram on the global instance could potentially be huge. To sidestep this issue, Veneur uses an approximating histogram implementation called a <a href="https://github.com/tdunning/t-digest">t-digest</a>, which uses constant space regardless of the number of samples. (Specifically, we wrote <a href="https://github.com/stripe/veneur/blob/master/tdigest/merging_digest.go">our own Go port</a> of it.) As the name would suggest, <a href="https://github.com/stripe/veneur#approximate-histograms">approximating histograms</a> return approximate percentiles with some error, but this tradeoff ensures that Veneur’s memory consumption stays under control under any load.</p>
<hr>
<h2>Degradation</h2>
<p>The global Veneur instance is also a single point of failure for the metrics that pass through it. If it went down, we would lose percentiles (and sets, since those are forwarded too). But we wouldn’t lose everything. Besides the percentiles, StatsD histograms report a counter of how many samples they’ve received, plus the minimum and maximum samples. These aggregates can be combined without forwarding (if we know the maximum response time on each host, the maximum across all hosts is just the largest of those values, and so on), so they get reported immediately, even when the global instance is unavailable. Clients <a href="https://github.com/stripe/veneur#magic-tag">can opt out of forwarding</a> altogether, if they really do want their percentiles to be constrained to each host.</p>
<div class="image-center">
<a href="/img/blog/posts/veneur/veneur-dashboard.png" class="zoom img" maxSize="1580x696">
<img src="/img/blog/posts/veneur/veneur-dashboard.png" width="680">
</a>
<p>Veneur’s Overview Dashboard, well instrumented and healthy!</p>
</div>
<hr>
<h2>Other cool features and errors</h2>
<p>Veneur—named for the Grand Huntsman of France, master of dogs!—also has a few other tricks:</p>
<ul>
<li>Drop-in replacement for Datadog’s included DogStatsD. It even processes events and service checks!</li>
<li>Written in <a href="https://golang.org/">Go</a>, so deployment is just a single binary</li>
<li>Use of <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLogs</a> for counting the unique members of a set efficiently with fixed memory consumption</li>
<li>Extensive <a href="https://github.com/stripe/veneur#metrics">metrics</a> (natch) so you can watch the watchers</li>
<li>Efficient <a href="https://github.com/stripe/veneur#compressed-chunked-post">compressed, chunked POST requests</a> sent concurrently to Datadog’s API</li>
<li><a href="https://github.com/stripe/veneur#performance">Extremely fast</a></li>
</ul>
<p>Over the course of Veneur’s development we also iterated a lot. Our initial version was purely a global DogStatsD server, without the forwarding or merging. It was really fast, but we quickly decided that processing more packets faster wasn’t going to get us very far.</p>
<p>Next we took some twists and turns through “smart clients” that tried to route metrics to the appropriate places. This was initially promising, but we found that supporting it for each of our language runtimes and use cases was prohibitively expensive and undermined (Dog)StatsD’s simplicity. Some of our instrumentation is as simple as an <a href="https://en.wikipedia.org/wiki/Netcat">nc</a> command, and that simplicity helps us instrument things quickly.</p>
<p>While our work was transparent overall, we did cause some trouble when we initially turned the global features back on: some teams had come to rely on per-host aggregation for very specific metrics. Then, when we had to fail back to host-local aggregation for some refactoring, we caused problems for teams who had just adapted to the global features. Argh! Each of these wound up being a positive learning experience, and we found Stripe’s engineers to be very accommodating. Thanks!</p>
<hr>
<h2>Thanks and future work</h2>
<p>The Observability team would like to thank Datadog for their support and advice in the creation of Veneur. We’d also like to thank our friends and teammates at Stripe for their patience as we iterated to where we are today, specifically for the occasional broken charts, metrics outages, and other hilarious-in-hindsight problems we caused along the way.</p>
<p>We’ve been running <a href="https://github.com/stripe/veneur" class="github">Veneur</a> in production for months and have been enjoying the fruits of our labor. We’re now iterating at a stable, more mature pace, making efficiency improvements informed by monitoring its production behavior. We hope to leverage Veneur for continued improvements to the features and reliability of our metrics pipeline: we’ve discussed additional protocol features, unified formats, per-team accounting, and even incorporating other sensor data like tracing spans. Veneur’s speed, instrumentation, and flexibility give us lots of room to grow and improve. Someone’s gotta feed those wicked cool visualizations and anomaly detectors!</p>
<p class="cta">
Like this post? Join the Stripe engineering team. <a href="/jobs?ref=blog#engineering" class="button">View Openings</a>
</p>
<div class="ja hidden">
<div class="image-center">
<div class="toggle">
<span>日本語</span>
<a class="switcher" href="/locale?locale=en&redirect=/blog/stripe-in-japan">English</a>
</div>
</div>
<p>本日、日本へ向けて Stripe の標準機能をいよいよ正式リリースいたします！</p>
<p>これより日本のすべての企業は Stripe に登録し、Stripe の全機能を活用できるようになります。主な機能は — 即時アカウント登録、130 通貨以上の決済への対応、<a href="/connect">Connect</a> のマーケットプレイス機能、国内最速レベルの振込周期 — など、今までの日本市場には存在しなかったものです。</p>
<p>昨年日本でベータ版を開始した際、スタートアップ・エコシステムの急成長、新しいビジネスへの要望 ( サースやマーケットプレイスなど ) やグローバル展開を視野に見据えた日本企業からの関心の高さを受け、私たちのプロダクトが日本の起業家たちに変化をもたらす、新しい機会を与えると感じました。</p>
<p>昨年以来、これらの機能は何千もの革新的な日本企業と共に、徹底的なテストが行われてきました。<a href="https://peatix.com">Peatix</a> ( コミュニティの育成・管理を行うためのツール開発 ) 、<a href="https://gengo.com">Gengo</a> ( 世界中のビジネスに向けた翻訳サービスを提供 ) 、<a href="https://www.ana.co.jp/">ANA</a> ( 日本最大の航空会社 ) の各社を含め、これまでフィードバックをお寄せいただいた皆様のおかげで、日本で無事に安定した製品をリリースできますことを深く感謝いたします。</p>
<p>日本のユーザをサポートするために、信頼できる日本語サポートを提供できる現地チームを立ち上げ、原宿に拠点も設立いたしました。採用情報にご興味のある方は是非<a href="/jobs">ご連絡</a>ください！</p>
<p>日本に提供される新しいインフラストラクチャで現在・将来のユーザが作り出すビジネス、製品、サービスに期待しています。フィードバックやご質問があれば是非<a href="mailto:daniel.heffernan@stripe.com">ご一報</a>ください。皆様のお声をお待ちしております。</p>
<p class="cta">日本ですぐに支払いの受付を開始する。<a href="/signup" class="button">Stripe にユーザ登録</a></p>
</div>
<div class="en">
<div class="image-center">
<div class="toggle">
<a class="switcher" href="/locale?locale=ja&redirect=/blog/stripe-in-japan">日本語</a>
<span>English</span>
</div>
</div>
<p>Today, we’re excited to publicly launch Stripe in Japan!</p>
<p>Every business in Japan can now sign up for Stripe and take advantage of the complete Stripe stack. Many core Stripe features—instant setup, support for 130+ currencies, the ability to <a href="/connect">build marketplaces</a>, fast and frequent transfers, and more—have not been available in the Japanese market before.</p>
<p>When we started our beta in Japan last year, we saw an opportunity for our product to make a difference for local entrepreneurs: there is a growing local startup ecosystem, an appetite to build new types of businesses (like SaaS companies and marketplaces), and an increasing interest from Japanese companies to expand beyond the local market and go global.</p>
<p>Over the last year, thousands of the most innovative Japanese companies have battle-tested these features with us. Companies like <a href="https://peatix.com">Peatix</a> (which builds tools to manage and grow communities), <a href="https://gengo.com">Gengo</a> (which provides translations for businesses around the world), and <a href="https://www.ana.co.jp/">ANA</a> (the largest airline in Japan). We’d like to thank all our beta users for their feedback as we’ve rolled out and polished our product for Japan.</p>
<p>To support our Japanese users, we’ve also built a local team to provide reliable Japanese-language support out of our Harajuku office. If you’re interested in joining our growing team, <a href="/jobs">please reach out</a>!</p>
<p>We’re looking forward to seeing what these and future users will build on the new infrastructure we’re bringing to Japan. And if you have any questions or feedback, just <a href="mailto:daniel.heffernan@stripe.com">drop me a line</a>.</p>
<p class="cta">Start accepting payments instantly in Japan. <a href="/signup" class="button">Sign up for Stripe</a></p>
</div>
<script>
var dateEl = $('#blog-post--stripe-in-japan > article > header > p > span');
var dateEn = ' on October 4, 2016';
var dateJa = ' 2016年10月4日';
var headerEl = $('#blog-post--stripe-in-japan > article > header > h1 > a');
var headerEn = 'Stripe in Japan!';
var headerJa = '日本の皆様へ';
$('.hidden').hide();
$('.switcher').click(function(e) {
e.preventDefault();
$('.en, .ja').fadeToggle(100);
if (dateEl.html().indexOf('2016年10月4日') > 0) {
dateEl.html(dateEn);
headerEl.html(headerEn);
$('html').attr('lang', 'en');
} else {
dateEl.html(dateJa);
headerEl.html(headerJa);
$('html').attr('lang', 'ja');
}
});
</script>
<p>Today, we’re excited to fully launch Stripe to all Singaporean businesses—any entrepreneur in Singapore can now instantly start accepting payments.</p>
<p>Since our beta launch just last year, we’ve been thrilled to work with many innovative Singaporean companies as they start and scale their businesses. From subscription companies like <a href="https://www.tradegecko.com/">Tradegecko</a> and <a href="https://guavapass.com/">Guavapass</a>, to on-demand platforms like <a href="https://www.grab.com/sg/">Grab</a>, <span class="nowrap">e-commerce</span> startups like <a href="https://grain.com.sg/">Grain</a> and <a href="https://www.hipvan.com/">Hipvan</a>, and new kinds of marketplaces like <a href="http://oddle.me/">Oddle</a>, more than 50% of venture-backed companies in Singapore now use Stripe. In addition, we’ve worked with global platforms like <a href="https://www.shopify.com/">Shopify</a>, <a href="https://deliveroo.com.sg/">Deliveroo</a>, and <a href="https://www.kickstarter.com/">Kickstarter</a> to make their services available in Singapore.</p>
<p>With more than 90% of the population already using a smartphone, Singapore is one of the heaviest adopters of mobile technology in the world. And so, out of the box, we’ll enable Singaporean businesses to accept both <a href="/apple-pay">Apple Pay</a> and <a href="/android-pay">Android Pay</a>.</p>
<p>Pricing is simple and predictable: 3.4% + S$0.50 per successful charge. Volume pricing is available for businesses at scale—please <a href="https://stripe.com/contact/sales">get in touch</a> if you expect to process more than S$50,000 a month.</p>
<p>Founded as a <a href="http://www.yoursingapore.com/travel-guide-tips/about-singapore.html">nexus for trade</a> between India, China, and Europe, Singapore embodies international commerce. We’re excited to support the next generation of global businesses being built by Singaporean entrepreneurs.</p>
<p>On that note, we’d love to grow our team in Singapore to help more businesses manage their global payments easily and securely. If you’d like to work with us in Singapore, or elsewhere, <a href="mailto:piruze.sabuncu@stripe.com">please reach out</a>!</p>
<p class="cta">
Start accepting payments instantly. <a href="/signup" class="button">Get Started with Stripe</a>
</p>