'The main change is that now GTK+ takes about ⅓ of the time to build
compared to the Autotools build, with likely bigger wins on older/less
powerful hardware; the Visual Studio support on Windows should be at
least a couple of orders of magnitude easier (shout out to Fan
Chun-wei for having spent so, so many hours ensuring that we could
even build on Windows with Visual Studio and MSVC); and maintaining
the build system should be equally easier for everyone on any platform
we currently support.'

This paper introduces a new AMQ data structure, a Counting Quotient Filter, which addresses all of these shortcomings and performs extremely well in both time and space: CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter structure from which it takes its inspiration, several times faster than a Bloom filter, and similarly to a cuckoo filter. The CQF structure is comparable or more space efficient than all of them too. Moreover, CQF does all of this while supporting counting, outperforming all of the other forms in both dimensions even though they do not. In short, CQF is a big deal!

"Solidity/EVM is by far the worst programming environment I have ever encountered. It would be impossible to write even toy programs correctly in this language, yet it is literally called "Solidity" and used to program a financial system that manages hundreds of millions of dollars."

This is an extremely detailed post on the state of dynamic checkers in C/C++ (via the inimitable Marc Brooker):

Recently we’ve heard a few people imply that problems stemming from undefined behaviors (UB) in C and C++ are largely solved due to ubiquitous availability of dynamic checking tools such as ASan, UBSan, MSan, and TSan. We are here to state the obvious — that, despite the many excellent advances in tooling over the last few years, UB-related problems are far from solved — and to look at the current situation in detail.

If you’re one of the people who think [exactly-once support is impossible], I’d ask you to take an actual look at what we actually know to be possible and impossible, and what has been built in Kafka, and hopefully come to a more informed opinion. So let’s address this in two parts. First, is exactly-once a theoretical impossibility? Second, how does Kafka support it.

We must recognise that even formal verification can leave gaps and hidden assumptions that need to be teased out and tested, using the full battery of testing techniques at our disposal. Building distributed systems is hard. But knowing that shouldn’t make us shy away from trying to do the right thing, instead it should make us redouble our efforts in our quest for correctness.

Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.

Netflix discuss how they handle the eternal dependency-management problem which arises with lots of microservices:

Using the monorepo as our requirements specification, we began exploring alternative approaches to achieving the same benefits. What are the core problems that a monorepo approach strives to solve? Can we develop a solution that works within the confines of a traditional binary integration world, where code is shared? Our approach, while still experimental, can be distilled into three key features:

Publisher feedback — provide the owner of shared code fast feedback as to which of their consumers they just broke, both direct and transitive. Also, allow teams to block releases based on downstream breakages. Currently, our engineering culture puts sole responsibility on consumers to resolve these issues. By giving library owners feedback on the impact they have to the rest of Netflix, we expect them to take on additional responsibility.

Managed source — provide consumers with a means to safely increment library versions automatically as new versions are released. Since we are already testing each new library release against all downstreams, why not bump consumer versions and accelerate version adoption, safely.

Distributed refactoring — provide owners of shared code a means to quickly find and globally refactor consumers of their API. We have started by issuing pull requests en masse to all Git repositories containing a consumer of a particular Java API. We’ve run some early experiments and expect to invest more in this area going forward.

What I find interesting is that Amazon dealt effectively with the first two many years ago, in the form of their "Brazil" build system, and Google do the latter (with Refaster?). It would be amazing to see such a system released into an open source form, but maybe it's just too heavyweight for anyone other than a giant software company on the scale of a Google, Netflix or Amazon.

Fascinating Unicode details -- a lot of which were new to me. Love the heat map of usage in Wikipedia:

One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.

You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.

20 pages of Google's software dev practices, with emphasis on the build system (since it was written by the guy behind Blaze). Naturally, some don't make a whole lot of sense outside of Google, but still some good stuff here

'This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine­-learned model, then you have the necessary background to read this document.'

Much of my professional work for the last 10+ years has revolved around handing, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for a more in-depth coverage on the pains of CSVs encodings and how the default tooling (Excel) will ruin your day.

This is intriguing -- using Jupyter notebooks to embody data analysis work, and ensure it's reproducible, which brings better rigour similarly to how unit tests improve coding. I must try this.

Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.

Crockford chose not to version [the] JSON definition: 'Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.' Yet JSON is defined in at least six different documents.

Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:

Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.

(Interestingly, the latter was a particular favourite approach of some misplaced "fail fast"/"crash-only software design" dogma in Amazon. I wasn't a fan)

Peter Lawrey: Our view is that Checked Exception makes more sense for library writers as they can explicitly pass off errors to the caller. As a caller, especially if you are new to a product, you don't understand the exceptions or what you can do about them. They add confusion.

For this reason we use checked exceptions internally in the lower layers and try to avoid passing them in our higher level interfaces. Note: A high percentage of our fall backs are handling iOExceptons and recovering from them. [....]

My experience is that the more complex and layered your libraries the more essential checked exceptions become. I see them as essential for scalability of your software.

The syntax of the regular-expression language is awful. There are various incompatable forms of the language and the quotation conventions are baroquen [sic]. Nevertheless, there is a great deal of useful software, for example grep, that uses regular expressions to specify the desired behavior.

Although regular-expression systems are derived from a perfectly good mathematical formalism, the particular choices made by implementers to expand the formalism into useful software systems are often
disastrous: the quotation conventions adopted are highly irregular; the egregious misuse of parentheses, both for grouping and for backward reference, is a miracle to behold. In addition, attempts to
increase the expressive power and address shortcomings of earlier designs have led to a proliferation of incompatible derivative languages.

'The algorithm of Lamport timestamps is a simple algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. They are named after their creator, Leslie Lamport.'

See also vector clocks (which I think would be generally preferable nowadays).

Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1616 to each of the sides of the die, or a to simulate a fair coin by simulating a two-sided die where each side has probability 1212 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1)O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better- known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.

At the end of last week, the White House published a draft for a Source Code Policy. The policy requires every public agency to publish their custom-build software as Free Software for other public agencies as well as the general public to use, study, share and improve the software. At the Free Software Foundation Europe (FSFE) we believe that the European Union, and European member states should implement similar policies. Therefore we are interested in your feedback to the US draft.

MIT biological engineers have created a programming language that allows them to rapidly design complex, DNA-encoded circuits that give new functions to living cells. Using this language, anyone can write a program for the function they want, such as detecting and responding to certain environmental conditions. They can then generate a DNA sequence that will achieve it.
"It is literally a programming language for bacteria," says Christopher Voigt, an MIT professor of biological engineering. "You use a text-based language, just like you're programming a computer. Then you take that text and you compile it and it turns it into a DNA sequence that you put into the cell, and the circuit runs inside the cell."

Any design that is hard to test is crap. Pure crap. Why? Because if it's hard to test, you aren't going to test it well enough. And if you don't test it well enough, it's not going to work when you need it to work. And if it doesn't work when you need it to work the design is crap.

'There are three easy to make mistakes in go. I present them here in the way they are often found in the wild, not in the way that is easiest to understand. All three of these mistakes have been made in Kubernetes code, getting past code review at least once each that I know of.'

we’ve recently gained momentum on standardizing our [cross-platform test] drivers. Human-readable, machine-testable specs, coded in YAML, prove which code conforms and which does not. These YAML tests are the Cat-Herd’s Crook: a tool to guide us all in the same direction.

It seems git's default behavior in many situations is -- despite communicating objectID by content-addressable hashes which should be sufficient to assure some integrity -- it may not actually bother to *check* them. Yes, even when receiving objects from other repos. So, enabling these configuration parameters may "slow down" your git operations. The return is actually noticing if someone ships you a bogus object. Everyone should enable these.

Little known to the rest of the world, a team of open source software developers played a small but integral part in helping to stop the spread of Ebola in Sierra Leone, solving a payroll crisis that was hindering the fight against the disease.

Emerson Tan from NetHope, a consortium of NGOs working in IT and development, told the tale at the Chaos Communications Congress in Hamburg, Germany. “These guys basically saved their country from complete collapse. I can’t overestimate how many lives they saved,” he said about his co-presenters, Salton Arthur Massally, Harold Valentine Mac-Saidu and Francis Banguara, who appeared over video link.

netty-http library solves [Netty usability issues] by using JAX-RS annotations to build a HTTP path routing layer on top of netty. In addition, the library implements a guava service to manage the HTTP service. netty-http allows users of the library to just focus on writing the business logic in HTTP handlers without having to worry about the complexities of path routing or learning netty pipeline internals to build the HTTP service.

We've written something very similar, although I didn't even bother supporting JAX-RS annotations -- just a simple code-level DSL.

This is my bet: the age of dynamic languages is over. There will be no new successful ones. Indeed we have learned a lot from them. We’ve learned that library code should be extendable by the programmer (mixins and meta-programming), that we want to control the structure (macros), that we disdain verbosity. And above all, we’ve learned that we want our languages to be enjoyable.

But it’s time to move on. We will see a flourishing of languages that feel like you’re writing in a Clojure, but typed. Included will be a suite of powerful tools that we’ve never seen before, tools so convincing that only ascetics will ignore.

PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs. When you turn it on, the machine greets you with a shell for typing in Lua commands and provides simple built-in tools for creating your own cartridges.

So cute! See also Voxatron, something similar for voxel-oriented 3D gaming

Hologram exposes an imitation of the EC2 instance metadata service on developer workstations that supports the [IAM Roles] temporary credentials workflow. It is accessible via the same HTTP endpoint to calling SDKs, so your code can use the same process in both development and production. The keys that Hologram provisions are temporary, so EC2 access can be centrally controlled without direct administrative access to developer workstations.

As we will see below, there has long been ample evidence that errors in spreadsheets are pandemic. Spreadsheets, even after careful development, contain errors in one percent or more of all formula cells. In large spreadsheets with thousands of formulas, there will be dozens of undetected errors. Even significant errors may go undetected because formal testing in spreadsheet development is rare and because even serious errors may not be apparent.

a regex-based, Turing-complete programming language. It's main feature is taking some text via standard input and repeatedly applying regex operations to it (e.g. matching, splitting, and most of all replacing). Under the hood, it uses .NET's regex engine, which means that both the .NET flavour and the ECMAScript flavour are available.

Testing an HTTP Library can become difficult sometimes. RequestBin is fantastic for testing POST requests, but doesn't let you control the response. This exists to cover all kinds of HTTP scenarios. Additional endpoints are being considered.

Excellent cut-out-and-keep guide to why you should add a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.

If you wanted to do an iterative graph computation like PageRank, it would literally be faster to sort the edges from scratch each and every iteration, than to use unsorted edges. If you want to do graph computation, please sort your edges.

Actually, you know what: if you want to do any big data computation, please sort your records. Stop talking sass about how Hadoop sorts things it doesn't need to, read some papers, run some tests, and then sort your damned data. Or at least run faster than me when I sort your data for you.

Good advice on production-quality, decent-scale usage of Kinesis in Java with the official library: batching, retries, partial failures, backoff, and monitoring. (Also, jaysus, the AWS Cloudwatch API is awful, looking at this!)