The European Parliament has approved budget to improve the EU’s IT infrastructure by extending the free software security audit programme (FOSSA) and by including a bug bounty approach in the programme.

The Commission intends to conduct a small-scale "bug bounty" activity on open-source software with companies already operating in the market. The scope of this action is to:

Run a small-scale "bug bounty" activity for open source software project or library for a period of up to two months maximum;
The purpose of the procedure is to provide the European institutions with open source software projects or libraries that have been properly screened for potential vulnerabilities;
The process must be fully open to all potential bug hunters, while staying in-line with the existing Terms of Service of the bug bounty platform.

"Solidity/EVM is by far the worst programming environment I have ever encountered. It would be impossible to write even toy programs correctly in this language, yet it is literally called "Solidity" and used to program a financial system that manages hundreds of millions of dollars."

We must recognise that even formal verification can leave gaps and hidden assumptions that need to be teased out and tested, using the full battery of testing techniques at our disposal. Building distributed systems is hard. But knowing that shouldn’t make us shy away from trying to do the right thing, instead it should make us redouble our efforts in our quest for correctness.

oh, god -- I'm not keen on this take: how's about designing systems that recognise the risks?

"Everything was going fine, but then suddenly, there were an additional 4,000 votes cast. Because it was a local election, which are normally very small, people were surprised and asked, 'how did this happen?'"

The culprit was not voter fraud or hacked machines. It was a single event upset (SEU), a term describing the fallout of an ionizing particle bouncing off a vulnerable node in the machine's register, causing it to flip a bit, and log the additional votes. The Sun may not have been the direct source of the particle—cosmic rays from outside the solar system are also in the mix—but solar-influenced space weather certainly contributes to these SEUs.

In January last year the airport unearthed a scheme whereby campaigners were using automated software to generate complaints against the airport. Officials caught out the set-up when the two anti-Heathrow enthusiasts forgot to take into account the hour going back in October, and began complaining about flights that had not yet taken off or arrived.

Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:

Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.

(Interestingly, the latter was a particular favourite approach of some misplaced "fail fast"/"crash-only software design" dogma in Amazon. I wasn't a fan)

omgwtfbbq. 1: User reports that their gnome session leaks processes; 2: systemd modifies default session behaviour to kill all processes, including screen/tmux; 3: _everyone_ complains because they break 30 years of UNIX process semantics, then 4: they request that tmux/screen hack their shit to workaround their brokenness. Get fucked, systemd. This is the kind of shit that would finally drive me to BSDland

At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

The failure of the ParkbyText system, operated by National Controlled Parking Systems (NCPS), was described by one employee contacted by a midlands motorist unable to pay for his parking at a train station as a “Y2K moment”. The system failure caused early morning panic for thousands of drivers who tried unsuccessfully to use text messages or an app to pay for their parking ahead of returning to work after the bank holiday weekend.

Impact was that they had to stop enforcement until the day passed, I think.

'There are three easy to make mistakes in go. I present them here in the way they are often found in the wild, not in the way that is easiest to understand. All three of these mistakes have been made in Kubernetes code, getting past code review at least once each that I know of.'

I can confirm, there is a help forum from the "deutsche telekom", they say there is a feature called MEC (it's mainly for setting phone parameters to match their network), active on all their SIM cards, which is not correctly handled by any of the OnePlus Devices (one, two, x) so it writes constantly to flash memory, killing it arround 100.000 writes which is 3-6 weeks.

on Normalization of Deviance, with a few anecdotes from Silicon Valley. “The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization.”

Facebook Infer uses logic to do reasoning about a program's execution, but reasoning at this scale — for large applications built from millions of lines of source code — is hard. Theoretically, the number of possibilities that need to be checked is more than the number of estimated atoms in the observable universe. Furthermore, at Facebook our code is not a fixed artifact but an evolving system, updated frequently and concurrently by many developers. It is not unusual to see more than a thousand modifications to our mobile code submitted for review in a given day. The requirements on the program analyzer then become even more challenging because we expect a tool to report quickly on these code modifications — in the region of 10 minutes — to fit in with developers' workflow. Coping with this scale and velocity requires advanced mathematical techniques. Facebook Infer uses two such techniques: separation logic and bi-abduction.

Separation logic is a theory that allows Facebook Infer's analysis to reason about small, independent parts of the application storage, rather than having to consider the entirety of the memory potentially at every step. That would be a daunting task on modern processors with their large addressable virtual memories.

Bi-abduction is a logical inference technique that allows Facebook Infer to discover properties about the behavior of independent parts of the application code. By storing these properties between runs, Facebook Infer needs to analyze only the parts of the software that have changed, reusing the results of its previous analysis where it can.

By combining these approaches, our analyzer is able to find complex problems in modifications to an application built from millions of lines of code, in minutes.

I was in the middle of writing a breakdown of what went wrong, but you've beat me to it.
Basically, they have a LinuxSecureRandom class that's supposed to override the standard SecureRandom. This class reads from /dev/urandom and should provide cryptographically secure random values.
They also seed the generator using SecureRandom#setSeed with data pulled from random.org. With their custom SecureRandom, this is safe because it mixes the entropy using XOR, so even if the random.org data is dodgy it won't reduce security. It's just an added bonus.
BUT! On some devices under some circumstances, the LinuxSecureRandom class doesn't get registered. This is likely because /dev/urandom doesn't exist or can't be accessed for some reason. Instead of screaming bloody murder like any sensible implementation would, they just ignore that and fall back to using the standard SecureRandom.
If the above happens, there's a problem because the default implementation of SecureRandom#setSeed doesn't mix. If you set the seed, it replaces the entropy entirely. So now the entropy is coming solely from random.org.
And the final mistake: They were using HTTP instead of HTTPS to make the webservice call to random.org. On Jan 4, random.org started enforcing HTTPS and returning a 301 Permanently Moved error for HTTP - see https://www.random.org/news/. So since that date, the entropy has actually been the error message (turned into bytes) instead of the expected 256-bit number. Using that seed, SecureRandom will generate the private key for address 1Bn9ReEocMG1WEW1qYjuDrdFzEFFDCq43F 100% of the time. Ouch. This is around the time that address first appears, so the timeline matches.
I haven't had a thorough look at what they've replaced it with in the latest version, but initial impressions are that it's not ideal. Not disastrous, but not good.

'Due to how the banner notifications process the Unicode text. The banner briefly attempts to present the incoming text and then "gives up" thus the crash'. Apparently the entire Springboard launcher crashes.

Yes, this is a fucking 32-bit integer overflow. Whatever software was used, it calculated the sum of all inputs using 32-bit variables, which overflow at about 20 BTC if signed or 40 BTC if not. The fee was supposed to be 0xC350 = 50,000 satoshis, but it turned out to be 0x2,0000,C350 = 8,589,984,592 satoshis.
Captains of the industry. If they were captains of any other industry, like say for example automotive, we'd have people dying in car crashes between two stationary vehicles.

I call out the Honeybadger gem specifically because was the most recent time I'd been bit by a seemingly good thing promoted in the community: monkey patching third party code. Now I don't fault Honeybadger for making their product this way. It provides their customers with direct business value: "just require 'honeybadger' and you're done!" I don't agree with this sort of practice. [....]

I distrust everything [in Ruby] but a small set of libraries I've personally vetted or are authored by people I respect. Why is this important? Without a certain level of scrutiny you will introduce odd and hard to reproduce bugs. This is especially important because Ruby offers you absolutely zero guarantee whatever the state your program is when a given method is dispatched. Constants are not constants. Methods can be redefined at run time. Someone could have written a time sensitive monkey patch to randomly undefined methods from anything in ObjectSpace because they can. This example is so horribly bad that no one should every do, but the programming language allows this. Much worse, this code be arbitrarily inject by some transitive dependency (do you even know what yours are?).

More Avro trouble with "bytes" fields! Avoid using "bytes" fields in Avro if you plan to interoperate with either of the Python implementations; they both fail to marshal them into JSON format correctly. This is the official "avro" library, which produces UTF-8 errors when a non-UTF-8 byte is encountered

The JVM by default exports statistics by mmap-ing a file in /tmp (hsperfdata). On Linux, modifying a mmap-ed file can block until disk I/O completes, which can be hundreds of milliseconds. Since the JVM modifies these statistics during garbage collection and safepoints, this causes pauses that are hundreds of milliseconds long. To reduce worst-case pause latencies, add the -XX:+PerfDisableSharedMem JVM flag to disable this feature. This will break tools that read this file, like jstat.

In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they've been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.

When a player adopted democracy in Civilization, their aggression would be automatically reduced by 2. Code being code, if Gandhi went democratic his aggression wouldn't go to -1, it looped back around to the ludicrously high figure of 255, making him as aggressive as a civilization could possibly be.

Turns out there are a few bugs in EMR's S3 support, believe it or not.

1. 'Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.'

Redis uses forking to perform persistence flushes, which means that once every 30 minutes it performs like crap (and kills the 99th percentile latency). Given this, various Redis people have been benchmarking fork() times on various Xen platforms, since Xen has a crappy fork() implementation

I have repeatedly been confounded to discover just how many mistakes in both test and application code stem from misunderstandings or misconceptions about time. By this I mean both the interesting way in which computers handle time, and the fundamental gotchas inherent in how we humans have constructed our calendar — daylight savings being just the tip of the iceberg.

In fact I have seen so many of these misconceptions crop up in other people’s (and my own) programs that I thought it would be worthwhile to collect a list of the more common problems here.

'As a bonus, ThreadSanitizer finds some other types of bugs: thread leaks, deadlocks, incorrect uses of mutexes, malloc calls in signal handlers, and more. It also natively understands atomic operations and thus can find bugs in lock-free algorithms. [...] The tool is supported by both Clang and GCC compilers (only on Linux/Intel64). Using it is very simple: you just need to add a -fsanitize=thread flag during compilation and linking. For Go programs, you simply need to add a -race flag to the go tool (supported on Linux, Mac and Windows).'

This leads us to the bug. The size of the array is determined by Math.max(buf.length << 1, newcount). Ordinarily, buf.length << 1 returns double buf.length, which would always be much larger than newcount for a 2 byte write. Why was it not? The problem is that for all integers larger than Integer.MAX_INTEGER / 2, shifting left by one place causes overflow, setting the sign bit. The result is a negative integer, which is always less than newcount. So for all byte arrays larger than 1073741824 bytes (i.e. one GB), any write will cause the array to resize, and only to exactly the size required.

Oh Apple, you asshats. This is some seriously shitty programming. iMessage on iOS devices caches the "iMessage-capable" flag for all numbers, indefinitely, so if you switch from iPhone to Android, messages from your friends' iPhones won't get delivered to you henceforth -- and to add insult to injury, it claims they do with a "Delivered." status appearing under the message. This is happening to me right now...

It's common for even the best programmers to make simple mistakes. And commonly, a refactoring which seems safe can leave behind code which will never do what's intended. We're used to getting help from the compiler, but it doesn't do much beyond static type checking. Using error-prone to augment the compiler's static analysis, you can catch more mistakes before they cost you time, or end up as bugs in production. We use error-prone in Google's Java build system to eliminate classes of serious bugs from entering our code, and we've open-sourced it, so you can too!

Yes you read that right. It’s copying all the email from the Junk Folder back into the Junk Folder again!. This is legal IMAP, so our server proceeds to create a new copy of each message in the folder. It then expunges the old copies of the messages, but it’s happening so often that the current UID on that folder is up to over 3 million. It was just over 2 million a few days ago when I first emailed the user to alert them to the situation, so it’s grown by another million since. The only way I can think this escaped QA was that they used a server which (like gmail) automatically suppresses duplicates for all their testing, because this is a massively bad problem.

The "-x" switch will expand the step/slew boundary from 128ms to 600 seconds, ensuring the time is slewed (drifted slowly towards the correct time at a max of 5ms per second) rather than "stepped" (a sudden jump, potentially backwards). Since slewing has a max of 5ms per second, time can never "jump backwards", which is important to avoid some major application bugs (particularly in Java timers).

This would appear to be the paper which sparked off the drama around BitCoin thefts from wallets generated on Android devices:

The SecureRandom PRNG is the primary source of randomness for Java and is used e.g., by cryptographic operations. This underlines its importance regarding security. Some of fallback solutions of the investigated implementations [are] revealed to be weak and predictable or capable of being inﬂuenced. Very alarming are the defects found in Apache Harmony, since it is partly used by Android.

I've wound up mentioning this twice in the past week, so it's worth digging up and bookmarking!

Under certain circumstances a TCP connection can end up in a "deadlock", where neither the client nor the server is able to write data out or read data in. This is caused by two factors. First, a client or server cannot perform two transactions at once; a read cannot be performed if a write transaction is in progress, and vice versa. Second, the buffers that exist at either end of the TCP connection are of limited size. The deadlock occurs when both the client and server are trying to send an amount of data that is larger than the combined input and output buffer size.

Daniel Lemire comments on the recent cases of bugs in spreadsheets causing major impact:

There are several critical problems with a tool like Excel that need to be widely known:

* Spreadsheets do not support testing. For anything that matters, you should validate and test your code automatically and systematically;

* Spreadsheets make code reviews impractical. To visually inspect the code, you need to click and each and every cell. In practice, this means that you cannot reasonably ask someone to read over your formulas to make sure that there is no mistake;

* Spreadsheets encourage redundancies. Spreadsheets encourage copy-and-paste. Though copying and pasting is sometimes the right tool, it also creates redundancies. These redundancies make it very difficult to update a spreadsheet: are you absolutely sure that you have changed the formula throughout?

Agreed on all three, particularly on the impossibility of testing. IMO, everyone who may be in a job where automation via spreadsheet is likely, needs training in SDE fundamentals: unit testing, the important of open source and open data for reproducibility, version control, and code review. We are all computer scientists now.

What the Reinhart-Rogoff affair shows is the extent to which austerity has been sold on false pretenses. For three years, the turn to austerity has been presented not as a choice but as a necessity. Economic research, austerity advocates insisted, showed that terrible things happen once debt exceeds 90 percent of G.D.P. But “economic research” showed no such thing; a couple of economists made that assertion, while many others disagreed. Policy makers abandoned the unemployed and turned to austerity because they wanted to, not because they had to. So will toppling Reinhart-Rogoff from its pedestal change anything? I’d like to think so. But I predict that the usual suspects will just find another dubious piece of economic analysis to canonize, and the depression will go on and on.

You've probably heard that countries with a high debt:GDP ratio suffer from slow economic growth. The specific number 90 percent has been invoked frequently. That's all thanks to a study conducted by Carmen Reinhardt and Kenneth Rogoff for their book This Time It's Different. But the results have been difficult for other researchers to replicate. Now three scholars at the University of Massachusetts have done so in "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff" and they find that the Reinhart/Rogoff result is based on opportunistic exclusion of Commonwealth data in the late-1940s, a debatable premise about how to weight the data, and most of all a sloppy Excel coding error.

Read Mike Konczal for the whole rundown, but I'll just focus on the spreadsheet part. At one point they set cell L51 equal to AVERAGE(L30:L44) when the correct procuedure was AVERAGE(L30:L49). By typing wrong, they accidentally left Denmark, Canada, Belgium, Austria, and Australia out of the average. When you run the math correctly "the average real GDP growth rate for countries carrying a public debt-to-GDP ratio of over 90 percent is actually 2.2 percent, not -0.1 percent."

From JPL's Laboratory for Reliable Software (LaRS). Great reference; there's some really useful recommendations here, and good explanations of familiar ones like "prefer composition over inheritance". Many are supported by FindBugs, too.

Here's the full list:

compile with checks turned on;
apply static analysis;
document public elements;
write unit tests;
use the standard naming conventions;
do not override field or class names;
make imports explicit;
do not have cyclic package and class dependencies;
obey the contract for equals();
define both equals() and hashCode();
define equals when adding fields;
define equals with parameter type Object;
do not use finalizers;
do not implement the Cloneable interface;
do not call nonfinal methods in constructors;
select composition over inheritance;
make fields private;
do not use static mutable fields;
declare immutable fields final;
initialize fields before use;
use assertions;
use annotations;
restrict method overloading;
do not assign to parameters;
do not return null arrays or collections;
do not call System.exit;
have one concept per line;
use braces in control structures;
do not have empty blocks;
use breaks in switch statements;
end switch statements with default;
terminate if-else-if with else;
restrict side effects in expressions;
use named constants for non-trivial literals;
make operator precedence explicit;
do not use reference equality;
use only short-circuit logic operators;
do not use octal values;
do not use floating point equality;
use one result type in conditional expressions;
do not use string concatenation operator in loops;
do not drop exceptions;
do not abruptly exit a finally block;
use generics;
use interfaces as types when available;
use primitive types;
do not remove literals from collections;
restrict numeric conversions;
program against data races;
program against deadlocks;
do not rely on the scheduler for synchronization;
wait and notify safely;
reduce code complexity

while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored... or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.

One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen -- backups protect against this, but RAID, replication, and mirrors do not.

Another risk: they didn't have a rate limit on project-deletion, which resulted in the "anongit" mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: 'If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).'

Massive Java concurrency fail in recent 1.6 and 1.7 JDK releases -- the java.util.HashMap type now spin-locks on an AtomicLong in its constructor.

Here's the response from the author: 'I'll acknowledge right up front that the initialization of hashSeed is a bottleneck but it is not one we expected to be a problem since it only happens once per Hash Map instance. For this code to be a bottleneck you would have to be creating hundreds or thousands of hash maps per second. This is certainly not typical. Is there really a valid reason for your application to be doing this? How long do these hash maps live?'

Oh dear. Assumptions of "typical" like this are not how you design a fundamental data structure. fail. For now there is a hacky reflection-based workaround, but this is lame and needs to be fixed as soon as possible. (Via cscotta)

"The NYMEX panel found that Infinium had finished writing the algorithm only the day before it introduced it to the market, and had tested it for only a couple of hours in a simulated trading environment to see how it would perform. The firm's normal testing processes take six to eight weeks. When the algorithm started its frenetic buying spree, the measures designed to shut it down automatically did not work. One was supposed to turn the system off if a maximum order size was breached, but because the machine was placing lots of small orders rather than a single big one the shutdown was not triggered. The other measure was meant to prevent Infinium from selling or buying more than a certain number of contracts, but because of an error in the way the rogue algorithm had been written, this, too, failed to spot a problem."

Amazing response to a security bug report. 'what's happening in this bug report right now is a perfect example of how *not* to do security response. When faced with two people who clearly know a few things about secure coding, rather than taking their advice and actually fixing the root cause of the problem (or abandon it as a hopeless situation, which is probably the more appropriate response), you've chosen to waste our time by demanding that we write weaponized exploits to exploit what most people already know to be exploitable. To top it off, when shown repeatedly how your half-baked "fixes" don't actually fix anything, rather than taking our advice you just add another small hurdle that can be trivially bypassed. It would be sad if it weren't so funny. I've decided that it's time to stop beating a dead horse. Usually I get paid good money to own software this hard, and I don't think you're worth making an exception. Best of luck, I'm sure you'll figure it out eventually.'

LWN follows up on the FH_DATE_PAST_20XX fiasco. 'It would appear that what SpamAssassin needs is some dedicated maintenance talent which is not dependent on evening hours put in by developers committed to other projects.' I wish