"In case you're curious what Hawaii's EAS/WEA interface looks like, I believe it's similar to this. Hypothesis: they test their EAS authorization codes at the beginning of each shift and selected the wrong option."

This is absolutely classic enterprisey, government-standard web UX -- a dropdown template selection and an easily-misclicked pair of tickboxes to choose test or live mode.

At Facebook we have been working on automated reasoning about concurrency in our work with the Infer static analyzer. RacerD, our new open source race detector, searches for data races — unsynchronized memory accesses, where one is a write — in Java programs, and it does this without running the program it is analyzing. RacerD employs symbolic reasoning to cover many paths through an app, quickly.
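
To make the bug class concrete: RacerD analyses Java, but the same unsynchronized read-modify-write pattern bites in any threaded code. Here's a minimal Python sketch of the kind of race such a detector flags (an illustration of the bug class only, nothing to do with RacerD's own machinery):

    import threading

    counter = 0  # shared mutable state, accessed without a lock

    def worker():
        global counter
        for _ in range(100000):
            counter += 1  # read-modify-write: two threads can interleave here

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Frequently prints less than 200000: both threads read the same value,
    # then write back, and one increment is lost.
    print(counter)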

LambCI is a tool I began building over a year ago to run tests on our pull requests and branches at Uniqlo Mobile. Inspired by the inaugural ServerlessConf a few weeks ago, I recently put some work into hammering it into shape for public consumption.
It was born of dissatisfaction with the two current choices for automated testing on private projects. You can either pay for it as a service (Travis, CircleCI, etc) — where 3 developers needing their own build containers might set you back a few hundred dollars a month. Or you can set up a system like Jenkins, Strider, etc and configure and manage a database, a web server and a cluster of build servers.
In both cases you’ll be under- or overutilized, waiting for servers to free up or paying for server power you’re not using. And this, for me, is where the advantage of a serverless architecture really comes to light: 100% utilization, coupled with instant invocations.

An Irish veterinarian with degrees in history and politics has been unable to convince a machine she can speak English well enough to stay in Australia.

Louise Kennedy is a native English speaker with excellent grammar and a broad vocabulary. She holds two university degrees – both obtained in English – and has been working in Australia as an equine vet on a skilled worker visa for the past two years.

But she is now scrambling for other visa options after a computer-based English test – scored by a machine – essentially handed her a fail in terms of convincing immigration officers she can fluently speak her own language.

This is idiotic. Computer-based voice recognition is in no way reliable enough for this kind of job. It's automated Kafkaesque bureaucracy -- "computer says no". Shame on Oz.

Very interesting writeup on Netflix's experience operating a serverless scripting system; they offer scriptability in their backend, and it's used heavily by devs to provide features. Lots of having to reinvent the wheel on packaging, deployment, versioning, and test/staging infrastructure.

This is an extremely detailed post on the state of dynamic checkers in C/C++ (via the inimitable Marc Brooker):

Recently we’ve heard a few people imply that problems stemming from undefined behaviors (UB) in C and C++ are largely solved due to ubiquitous availability of dynamic checking tools such as ASan, UBSan, MSan, and TSan. We are here to state the obvious — that, despite the many excellent advances in tooling over the last few years, UB-related problems are far from solved — and to look at the current situation in detail.

Nasdaq said in a statement that "certain third parties improperly propagated test data that was distributed as part of the normal evening test procedures."

"For July 3, 2017, all production data was completed by 5:16 PM as expected per the early close of the markets," the statement continued. "Any data messages received post 5:16 PM should be deemed as test data and purged from direct data recipient's databases. UTP (Unlisted Trading Privileges) is asking all third parties to revert to Nasdaq Official Closing Prices effective at 5:16 PM."

Interesting! We discussed similar ideas in $prevjob, good to see one hitting production globally.

RIPE Atlas probes form the backbone of the RIPE Atlas infrastructure. Volunteers all over the world host these small hardware devices that actively measure Internet connectivity through ping, traceroute, DNS, SSL/TLS, NTP and HTTP measurements. This data is collected and aggregated by the RIPE NCC, which makes the data publicly available. Network operators, engineers, researchers and even home users have used this data for a wide range of purposes, from investigating network outages to DNS anycasting to testing IPv6 connectivity.

Anyone can apply to host a RIPE Atlas probe. If your application is successful (based on your location), we will ship you a probe free of charge. Hosts simply need to plug their probe into their home (or other) network.

Probes are USB-powered and are connected to an Ethernet port on the host’s router or switch. They then automatically and continuously perform active measurements about the Internet’s connectivity, and this data is sent to the RIPE NCC, where it is aggregated and made publicly available. We also use this data to create several Internet maps and data visualisations. [....]

The hardware of the first and second generation probes is a Lantronix XPort Pro module with custom powering and housing built around it. The third generation probe is a modified TP-Link wireless router (model TL-MR 3020) with a small USB thumb drive in it, but this probe does not support WiFi.

A sandboxed local environment that replicates the live AWS Lambda environment almost identically – including installed software and libraries, file structure and permissions, environment variables, context objects and behaviors – even the user and running process are the same.

We must recognise that even formal verification can leave gaps and hidden assumptions that need to be teased out and tested, using the full battery of testing techniques at our disposal. Building distributed systems is hard. But knowing that shouldn’t make us shy away from trying to do the right thing, instead it should make us redouble our efforts in our quest for correctness.

'Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues.'

Additionally, LocalStack provides a powerful set of tools to interact with the cloud services, including a fully featured KCL Kinesis client with Python binding, simple setup/teardown integration for nosetests, as well as an Environment abstraction that allows you to easily switch between local and remote Cloud execution.
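
To give a flavour of how little your code changes: with boto3 you just point the client at the local endpoint. A quick sketch -- the port and dummy credentials below are assumptions, check your LocalStack configuration:

    import boto3

    # Assumed LocalStack Kinesis endpoint -- adjust the port to your setup.
    kinesis = boto3.client(
        "kinesis",
        endpoint_url="http://localhost:4568",
        region_name="us-east-1",
        aws_access_key_id="test",        # dummy creds; LocalStack doesn't check them
        aws_secret_access_key="test",
    )

    kinesis.create_stream(StreamName="test-stream", ShardCount=1)
    kinesis.get_waiter("stream_exists").wait(StreamName="test-stream")
    kinesis.put_record(StreamName="test-stream", Data=b"hello", PartitionKey="k1")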

When testing RealTime software, a simulator is often employed which injects events into the program that do not occur in RealTime.
If you are writing software that controls or monitors some process that exists in the real world, it takes a long time to test it. But if you simulate it, there is no reason in the simulated software (if it is disconnected from the real world completely) not to make the apparent system time inside your software appear to move at a much faster rate. For example, I have written simulators that can verify the operational steps taken by industrial controllers over a 12-hour FakeTime period, which executes in 60 seconds. This allows me to run '12 hours' of fake time through my test cases and test scenarios, without waiting 12 hours for the testing to complete. Of course, after a successful FakeTime test, an industrial RealTime system still needs to be tested in non-simulated fashion.
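
The key trick is hiding the clock behind an abstraction that the test harness controls. A minimal sketch of the idea (class and method names invented for illustration):

    import heapq

    class SimClock:
        """Code under test asks this for the time instead of calling
        time.time(), so a test can push '12 hours' through in milliseconds."""
        def __init__(self, start=0.0):
            self.now = start
            self._timers = []   # heap of (deadline, seq, callback)
            self._seq = 0

        def call_at(self, deadline, callback):
            heapq.heappush(self._timers, (deadline, self._seq, callback))
            self._seq += 1

        def advance(self, seconds):
            """Jump forward, firing any timers that fall due along the way."""
            end = self.now + seconds
            while self._timers and self._timers[0][0] <= end:
                deadline, _, callback = heapq.heappop(self._timers)
                self.now = deadline
                callback()
            self.now = end

    # Verify 12 hours of hourly control steps without waiting 12 hours:
    clock = SimClock()
    fired = []
    for hour in range(12):
        clock.call_at(hour * 3600.0, lambda h=hour: fired.append(h))
    clock.advance(12 * 3600.0)
    assert fired == list(range(12))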

Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is both unreliable in that it is frequently not available for testing, and unreliable in the sense that the system’s behavior in QA is not a good predictor of its behavior in production.

'I would strongly encourage you to avoid repeating the mistakes of testing methodologies that focus entirely on max achievable throughput and then report some (usually bogus) latency stats at those max throughput modes. The techempower numbers are a classic example of this in play, and while they do provide some basis for comparing a small aspect of behavior (what I call the "how fast can this thing drive off a cliff" comparison, or "pedal to the metal" testing), those results are not very useful for comparing load carrying capacities for anything that actually needs to maintain some form of responsiveness SLA or latency spectrum requirements.'

Some excellent advice here on how to measure and represent stack performance.

Also: 'DON'T use or report standard deviation for latency. Ever. Except if you mean it as a joke.'

Any design that is hard to test is crap. Pure crap. Why? Because if it's hard to test, you aren't going to test it well enough. And if you don't test it well enough, it's not going to work when you need it to work. And if it doesn't work when you need it to work the design is crap.

we’ve recently gained momentum on standardizing our [cross-platform test] drivers. Human-readable, machine-testable specs, coded in YAML, prove which code conforms and which does not. These YAML tests are the Cat-Herd’s Crook: a tool to guide us all in the same direction.
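
The pattern is simple to sketch: the spec is pure data, and each driver ships a thin harness that replays it against its own implementation. A made-up example -- the YAML field names here are invented, not taken from any real driver spec:

    import yaml  # PyYAML

    SPEC = """
    description: parse a mongodb:// URI with a single host
    uri: "mongodb://db1.example.com:27017"
    expected:
      hosts:
        - host: db1.example.com
          port: 27017
    """

    def parse_uri(uri):
        # stand-in for the driver function under test
        host, port = uri[len("mongodb://"):].split(":")
        return {"hosts": [{"host": host, "port": int(port)}]}

    test = yaml.safe_load(SPEC)
    assert parse_uri(test["uri"]) == test["expected"]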

'any nontrivial development of Lambda functions will require a simple, automated build/deploy process that also fills a couple of Lambda’s gaps such as the use of node modules and environment variables.'

An implementation of Amazon's DynamoDB, focussed on correctness and performance, and built on LevelDB (well, @rvagg's awesome LevelUP to be precise). This project aims to match the live DynamoDB instances as closely as possible (and is tested against them in various regions), including all limits and error messages.

Why not Amazon's DynamoDB Local? Because it's too buggy! And it differs too much from the live instances in a number of key areas.

We use DynamoDBLocal in our tests -- the availability of that tool is one of the key reasons we have adopted Dynamo so heavily, since we can safely test our code properly with it. This looks even better.
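
Usage follows the same pattern as any of these local fakes: start dynalite, point the SDK's endpoint at it, and leave the rest of your code alone. A hedged boto3 sketch, assuming dynalite is listening on port 4567:

    import boto3

    # Assumes a local dynalite, e.g. started with `dynalite --port 4567`.
    ddb = boto3.client(
        "dynamodb",
        endpoint_url="http://localhost:4567",
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )

    ddb.create_table(
        TableName="events",
        AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )
    ddb.put_item(TableName="events", Item={"id": {"S": "evt-1"}})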

a proxy that mucks with your system and application context, operating at Layers 4 and 7, allowing you to simulate common failure scenarios from the perspective of an application under test, such as an API or a web application. If you are building a distributed system, Muxy can help you test your resilience and fault tolerance patterns.

'Let's talk about finding bugs in distributed systems for a bit.
These chaos monkey-style fault testing systems are all well and good, but by being application independent they're a very blunt instrument.
Particularly they make it hard to search the fault space for bugs in a directed manner, because they don't 'know' what the system is doing.
Application-aware scripting of faults in a dist. systems seems to be rarely used, but allows you to directly stress problem areas.
For example, if a bug manifests itself only when one RPC returns after some timeout, hard to narrow that down with iptables manipulation.
But allow a script to hook into RPC invocations (and other trace points, like DTrace's probes), and you can script very specific faults.
That way you can simulate cross-system integration failures, *and* write reproducible tests for the bugs they expose!
Anyhow, I've been doing this in Impala, and it's been very helpful. Haven't seen much evidence elsewhere.'
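
The hook idea is easy to reproduce in any RPC layer with an interception point. A generic Python illustration of the technique (not Impala's mechanism -- all names invented):

    import functools

    FAULT_HOOKS = {}  # rpc name -> fault callable, installed by a test script

    def rpc(name):
        """Marks a function as an RPC and gives tests a named hook point."""
        def wrap(fn):
            @functools.wraps(fn)
            def invoke(*args, **kwargs):
                hook = FAULT_HOOKS.get(name)
                if hook:
                    hook(args, kwargs)  # scripted fault: may sleep, raise, corrupt
                return fn(*args, **kwargs)
            return invoke
        return wrap

    @rpc("fetch_block")
    def fetch_block(block_id):
        return "data-for-" + block_id

    class InjectedTimeout(Exception):
        pass

    def timeout_fault(args, kwargs):
        raise InjectedTimeout("fetch_block timed out (injected)")

    # The 'test script': fail exactly this RPC, reproducibly -- no iptables.
    FAULT_HOOKS["fetch_block"] = timeout_fault
    try:
        fetch_block("b1")
    except InjectedTimeout as e:
        print("reproduced the failure:", e)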

a tool which simplifies tracing and testing of Java programs. Byteman allows you to insert extra Java code into your application, either as it is loaded during JVM startup or even after it has already started running. The injected code is allowed to access any of your data and call any application methods, including where they are private. You can inject code almost anywhere you want and there is no need to prepare the original source code in advance nor do you have to recompile, repackage or redeploy your application. In fact you can remove injected code and reinstall different code while the application continues to execute.

The simplest use of Byteman is to install code which traces what your application is doing. This can be used for monitoring or debugging live deployments as well as for instrumenting code under test so that you can be sure it has operated correctly. By injecting code at very specific locations you can avoid the overheads which often arise when you switch on debug or product trace. Also, you decide what to trace when you run your application rather than when you write it, so you don't need 100% hindsight to be able to obtain the information you need.
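
For a taste of what a rule looks like, here's a small Byteman script, loaded with something like java -javaagent:byteman.jar=script:trace.btm -- the class and method here are hypothetical:

    # trace.btm -- fire on entry to a (hypothetical) application method
    RULE trace placeOrder entry
    CLASS com.example.shop.CheckoutService
    METHOD placeOrder
    AT ENTRY
    IF true
    DO traceln("placeOrder called with: " + $1)
    ENDRULE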

In July 2015, CARB did some follow up testing and again the cars failed—the scrubber technology was present, but off most of the time. How this happened is pretty neat. Michigan’s Stefanopolou says computer sensors monitored the steering column. Under normal driving conditions, the column oscillates as the driver negotiates turns. But during emissions testing, the wheels of the car move, but the steering wheel doesn’t. That seems to have been the signal for the “defeat device” to turn the catalytic scrubber up to full power, allowing the car to pass the test. Stefanopolou believes the emissions testing trick that VW used probably isn’t widespread in the automotive industry. Carmakers just don’t have many diesels on the road. And now that number may go down even more.
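
In code terms, the heuristic as described would amount to something like this (an entirely speculative reconstruction in Python; the thresholds are invented):

    def looks_like_emissions_test(wheel_speed_kmh, steering_angles):
        """Wheels turning while the steering wheel sits perfectly still
        suggests a dyno test, not a road."""
        moving = wheel_speed_kmh > 5
        # on a real road the steering column oscillates constantly
        oscillation = max(steering_angles) - min(steering_angles)
        return moving and oscillation < 0.1  # invented threshold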

a specialized packet sniffer designed for displaying and logging HTTP traffic. It is not intended to perform analysis itself, but to capture, parse, and log the traffic for later analysis. It can be run in real-time displaying the traffic as it is parsed, or as a daemon process that logs to an output file. It is written to be as lightweight and flexible as possible, so that it can be easily adaptable to different applications.

Testing an HTTP Library can become difficult sometimes. RequestBin is fantastic for testing POST requests, but doesn't let you control the response. This exists to cover all kinds of HTTP scenarios. Additional endpoints are being considered.
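
For example, with the requests library you can exercise failure paths that RequestBin can't produce:

    import requests

    BASE = "https://httpbin.org"

    r = requests.get(BASE + "/status/503")          # forced server error
    assert r.status_code == 503

    r = requests.get(BASE + "/delay/2", timeout=5)  # deliberately slow response
    assert r.status_code == 200

    r = requests.post(BASE + "/post", json={"k": "v"})  # echoes the request back
    assert r.json()["json"] == {"k": "v"}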

You might expect this to create a user with the username ‘user’, but then we’d get conflicts between every test that wanted to call their user ‘user’ which would prevent tests from running safely against the same deployment of the exchange.

Instead, ‘user’ is just an alias that is only meaningful while this one test is running. The DSL creates a unique username that it uses when talking to the actual system. Typically this is done by adding a postfix so the real username is still reasonably understandable e.g. user-fhoai42lfkf.
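
A minimal sketch of that aliasing trick (names invented):

    import uuid

    class TestDsl:
        """Maps friendly aliases like 'user' to unique real usernames, so
        concurrent tests can share one deployment without colliding."""
        def __init__(self):
            self._aliases = {}

        def real_username(self, alias):
            if alias not in self._aliases:
                # keep the alias as a readable prefix, e.g. user-4b1d9e2f
                self._aliases[alias] = "%s-%s" % (alias, uuid.uuid4().hex[:8])
            return self._aliases[alias]

    dsl = TestDsl()
    print(dsl.real_username("user"))  # unique per test run
    assert dsl.real_username("user") == dsl.real_username("user")  # stable within a test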

'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive -- on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'

On 1st March 2015 I discovered that in 2012 HP had filed a patent (WO2014027990) with the USPTO for ‘Performance tests in a continuous deployment pipeline‘ (the patent was granted in 2014). [....] HP has filed several patents covering standard Continuous Delivery (CD) practices. You can help to have these patents revoked by providing ‘prior art’ examples on Stack Exchange.

In fairness, though, this kind of shit happens in most big tech companies. This is what happens when you have a broken software patenting system, with big rewards for companies who obtain shitty troll patents like these, and in turn have companies who reward the engineers who sell themselves out to write up concepts which they know have prior art. Software patents are broken by design!

Vaurien is basically a Chaos Monkey for your TCP connections. Vaurien acts as a proxy between your application and any backend. You can use it in your functional tests or even on a real deployment through the command-line.

Vaurien is a TCP proxy that simply reads data sent to it and passes it to a backend, and vice-versa. It has built-in protocols: TCP, HTTP, Redis & Memcache. The TCP protocol is the default one and just sucks data on both sides and passes it along.

Having higher-level protocols is mandatory in some cases, when Vaurien needs to read a specific amount of data in the sockets, or when you need to be aware of the kind of response you’re waiting for, and so on.

Vaurien also has behaviors. A behavior is a class that’s going to be invoked every time Vaurien proxies a request. That’s how you can impact the behavior of the proxy. For instance, adding a delay or degrading the response can be implemented in a behavior.

Both protocols and behaviors are plugins, allowing you to extend Vaurien by adding new ones.

Last (but not least), Vaurien provides a couple of APIs you can use to change the behavior of the proxy live. That’s handy when you are doing functional tests against your server: you can for instance start to add big delays and see how your web application reacts.
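
To illustrate the behavior concept -- this is a sketch of the idea, not Vaurien's actual plugin API:

    import random
    import time

    class DelayBehavior:
        """Invoked on every proxied request; degrades a fraction of them."""
        def __init__(self, delay_seconds=1.0, probability=0.2):
            self.delay_seconds = delay_seconds
            self.probability = probability

        def on_request(self, data):
            if random.random() < self.probability:
                time.sleep(self.delay_seconds)  # inject latency into ~20% of requests
            return data  # pass the bytes through unchanged

    def proxy_request(send_to_backend, data, behaviors):
        for behavior in behaviors:
            data = behavior.on_request(data)
        return send_to_backend(data)

    # e.g. a functional test wiring a fake backend through the behavior chain:
    echo_backend = lambda data: data
    print(proxy_request(echo_backend, b"GET / HTTP/1.0", [DelayBehavior()]))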

Today, 23andMe announced what Forbes reports is only the first of ten deals with big biotech companies: Genentech will pay up to $60 million for access to 23andMe's data to study Parkinson's. You think 23andMe was about selling fun DNA spit tests for $99 a pop? Nope, it's been about selling your data all along.

I've been bitten by poor key distribution in tests in the past, so this is spot on: 'I'd run it with Zipfian, Pareto, and Dirac delta distributions, and I'd choose read-modify-write transactions.'

And of course, a dataset bigger than all combined RAM.

Also: http://smalldatum.blogspot.ie/2014/04/biebermarks.html -- the "Biebermark", where just a single row out of the entire db is contended on in a read/modify/write transaction: "the inspiration for this is maintaining counts for [highly contended] popular entities like Justin Bieber and One Direction."
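
Generating those key distributions is only a few lines with numpy; a quick sketch:

    import numpy as np

    N_KEYS, N_OPS = 1000000, 100000

    # Zipfian: a handful of very hot keys plus a long tail (zipf needs a > 1)
    zipf_keys = np.random.zipf(a=1.2, size=N_OPS) % N_KEYS

    # Pareto: heavy-tailed skew with a tunable shape
    pareto_keys = (np.random.pareto(a=1.5, size=N_OPS) * 1000).astype(int) % N_KEYS

    # Dirac delta: every op hits a single contended key -- the 'Biebermark'
    # worst case for read/modify/write transactions
    dirac_keys = np.zeros(N_OPS, dtype=int)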

'In Linux, /dev/full or the always full device is a special file that always returns the error code ENOSPC (meaning "No space left on device") on writing, and provides an infinite number of null characters to any process that reads from it (similar to /dev/zero). This device is usually used when testing the behaviour of a program when it encounters a "disk full" error.'
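
A quick demonstration of exercising the ENOSPC path with it:

    import errno
    import os

    fd = os.open("/dev/full", os.O_WRONLY)
    try:
        os.write(fd, b"x")  # fails immediately: the device is always 'full'
    except OSError as e:
        assert e.errno == errno.ENOSPC
        print("got the expected 'disk full' error:", e)
    finally:
        os.close(fd)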

'A constant throughput, correct latency-recording variant of wrk. This is a must-have when measuring network service latency -- corrects for Coordinated Omission error:

wrk's model, which is similar to the model found in many current load generators, computes the latency for a given request as the time from the sending of the first byte of the request to the time the complete response was received. While this model correctly measures the actual completion time of individual requests, it exhibits a strong Coordinated Omission effect, through which most of the high latency artifacts exhibited by the measured server will be ignored. Since each connection will only begin to send a request after receiving a response, high latency responses result in the load generator coordinating with the server to avoid measurement during high latency periods.

Good example of why I always "calibrate" latency tools with ^Z tests. If ^Z results don't make sense, don't use [the] tool. ^Z test math examples: If you ^Z for half the time, Max is obvious. [90th percentile] should be 80% of the ^Z stall time.
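
That ^Z math is easy to check with a toy model: a closed-loop client that intends to send a request every 10ms against a server that stalls for the second half of a 10-second run. Measuring from the intended send time recovers Gil's numbers; the naive closed-loop measurement barely notices the stall:

    # Toy model of Coordinated Omission and its correction.
    interval = 0.01                  # intended: one request every 10ms
    stall_at, stall_len = 5.0, 5.0   # server frozen for the second half of 10s

    naive, corrected = [], []
    now, intended = 0.0, 0.0
    while intended < 10.0:
        start = max(now, intended)   # closed loop: can't send until last reply
        if stall_at <= start < stall_at + stall_len:
            finish = stall_at + stall_len    # this request sits out the stall
        else:
            finish = start                   # otherwise service is instant
        naive.append(finish - start)         # what a closed-loop tool records
        corrected.append(finish - intended)  # latency from the intended send time
        now = finish
        intended += interval

    for name, xs in (("naive", naive), ("corrected", corrected)):
        xs = sorted(xs)
        print("%s: p90=%.2fs max=%.2fs" % (name, xs[int(0.9 * len(xs))], xs[-1]))
    # naive:     p90 ~0.0s, max 5.0s -- the stall is nearly invisible
    # corrected: p90 ~4.0s (80% of the stall time), max 5.0s -- as promised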

We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

Excellent post. Game days are a great idea. Also: massive Redis clustering fail

This is a very solid benchmarking post, examining Kafka in good detail. Nicely done. Bottom line:

I basically spend 2/3 of my work time torture testing and operationalizing distributed systems in production. There's some that I'm not so pleased with (posts pending in draft forever) and some that have attributes that I really love. Kafka is one of those systems that I pretty much enjoy every bit of, and the fact that it performs predictably well is only a symptom of the reason and not the reason itself: the authors really know what they're doing. Nothing about this software is an accident. Performance, everything in this post, is only a fraction of what's important to me and what matters when you run these systems for real. Kafka represents everything I think good distributed systems are about: that thorough and explicit design decisions win.

Interesting -- I hadn't heard of this being an official practice anywhere before (although we actually did it ourselves this week)...

If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments.
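
The feature-flag wrapping can be as small as this sketch (a real system would consult a config service rather than an environment variable):

    import os

    def flag_enabled(name, default=False):
        """The deployed build carries both code paths; the flag picks one
        per environment."""
        raw = os.environ.get("FEATURE_" + name, str(default))
        return raw.lower() in ("1", "true")

    def old_checkout():
        return "old checkout flow"

    def new_checkout():
        return "new checkout flow"  # live in UAT, dark in Production

    def checkout():
        if flag_enabled("NEW_CHECKOUT"):
            return new_checkout()
        return old_checkout()

    # UAT sets FEATURE_NEW_CHECKOUT=true; Production leaves it unset, so the
    # same build ships everywhere with the new path switched off.
    print(checkout())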