As far as I can tell, by default, these languages apply crude code-point sorting. Human beings understand that the characters e, é, E, É, è, ê, and so forth, should be considered as the same letter (e) with accents. There are exceptions to this rule, but the default of sorting accented characters after the letter ‘z’ is just not reasonable. The way case is handled is patently stupid. You might prefer A to come before a, or vice versa, but no human being would ever sort the letters as A,B,a,b or a,b,A,B.

To get a sensible default, programming languages force you to use complicated code. In JavaScript, it is burdensome but easy enough….

> v.sort(Intl.Collator().compare)
['a','e','é','f']

However, I am not sure what the equivalent is in Python or Swift. It does not jump out at me in the documentation of the respective standard libraries. I did not even look for it in other popular programming languages like Go, C++, and so forth.
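In Python, one workable approximation (my own sketch, not an official recipe: it strips accents and folds case rather than implementing the full Unicode Collation Algorithm, for which you would reach for a library such as PyICU or the standard locale module) is to sort on a derived key:

```python
import unicodedata

def collation_key(s):
    # Approximate collation: compare base letters (accents stripped,
    # case folded) first, then fall back to the original string so
    # that ties break deterministically.
    base = ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))
    return (base.casefold(), s)

words = ['f', 'é', 'a', 'e']
print(sorted(words, key=collation_key))  # ['a', 'e', 'é', 'f']
```

This at least keeps ‘é’ next to ‘e’ and interleaves uppercase and lowercase, which is closer to what a human expects than raw code-point order.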

It is unacceptably difficult to do the “right thing”. The net result is that many programmers do not sort strings properly. If you use a natural language with non-English characters, you see the effect in many applications. It looks bad.

Thankfully, most major software products get it right. Microsoft Office, Google Docs, Apple apps… all do the right thing. The problem crops up in small-budget applications. I have to use one in my daily life as an employee of a public university, and it annoys me.

Academic excellence is not a strong predictor of career excellence. There is only a weak correlation between grades and job performance. Grant reviews the evidence in detail in his New York Times piece.

When recruiting research assistants, I look at grades as the last indicator. I find that imagination, ambition, initiative, curiosity, and drive are far better predictors of someone who will do useful work with me. Of course, these characteristics are themselves correlated with high grades, but there is something to be said for a student who decides that a given course is a waste of time and works on a side project instead. Breakthroughs don’t happen in regularly scheduled classes; they happen in side projects. We want people who complete the work they were assigned, but we also need people who can reflect critically on what is genuinely important. I don’t have any need for a smart automaton: I already have many computers.

I have applied the same principle with my two sons: I do not overly stress the importance of good grades, encouraging them instead to pursue their own interests and to go beyond their classes.

Our hearts do not regenerate. Thus a viable strategy might be to transplant brand-new hearts from pigs. This is much harder than it appears, however. But progress is being made. Researchers are now able to keep baboons alive for months with transplanted pig hearts. To achieve this result, the scientists had to use an immunosuppressant drug to prevent unwanted growth in the pig’s heart. With some luck, some of us could benefit from transplanted pig hearts in the near future.

Cataracts are the most common cause of blindness. They can be “cured” by removing your natural lenses and replacing them with artificial lenses called IOLs (intraocular lenses). This therapy was invented in the 1940s, but it took 40 years before it became widespread in wealthy countries. It is still out of reach in many countries. Yet an intraocular lens costs less than $10, and the procedure is inexpensive (less than $25 in total in some countries). Even today, in many rich countries, access to this therapy is restricted. Finally, in 2017, a government agency in the UK recommended that we stop rationing access to cataract surgery.

an accumulating body of evidence suggests the inverse: methodological quality and, consequently, reliability of published research works in several fields may be decreasing with increasing journal rank

My own recommendation to colleagues and students has been that if peer-reviewed publications are warranted, then it is fine to target serious well-managed venues, irrespective of their “prestige”.

It is hard enough to do solid research; if you also have to tune it so that it outcompetes other proposals in a competition for prestige, I fear that you may discourage good research practices. Scientists care too little about modesty; it is their downfall.

Using the best individual and collectively peer-reviewed economic models, the total cost of Paris – through slower GDP growth from higher energy costs – will reach $1-2 trillion every year from 2030. (…) It’s so expensive because green energy isn’t ready to replace fossil fuels at scale. Nations are using expensive subsidies and other policies to force immature green technologies on consumers and businesses. We need to change course. The smart option, backed by economic science, is to adopt a technology-led policy. This means investing far more into green energy research and development. Rather than forcing the rollout of immature energy sources, we need to ensure that green energy can out-compete fossil fuels.

I really like the term “technology-led policy”. If you want to change the world for the better, then making the good things cheap using technology and science is the golden path.

Over time, our bodies accumulate a small fraction of “senescent cells”. It is believed that these dysfunctional cells contribute to the diseases of old age. For the last few years, researchers have been looking for senolytics, drugs that can kill senescent cells. It turns out that two antibiotics approved for medical use are potent senolytics.

Chinese leaders have been increasingly focused on scientific misconduct, following ongoing reports of researchers there using fraudulent data, falsifying CVs and faking peer reviews. In May, the government announced sweeping reforms to improve research integrity. One of those was the creation of a national database of misconduct cases. Inclusion on the list could disqualify researchers from future funding or research positions, and might affect their ability to get jobs outside academia. (Source: Nature)

We need to recognize that the scientific enterprise is fundamentally an honor system. It is trivial to cheat in science. You can work hard to collect data, or you can make it up as you go. Except in the most extreme cases, the penalty for cheating is small because there is almost always plausible deniability.

The energy density of lithium-ion batteries doubled between 1995 and 2005, but only increased by about 15% between 2005 and 2015. It is estimated that relatively little further gain in energy density is possible with lithium-ion batteries. However, our mobile devices typically consume far less power than they did only a few years ago while offering faster processing.

On average, women show stronger preferences for mates with good earning capacity than men do, while men show stronger preferences for physically attractive mates than women do (…) we found little evidence that these sex differences were smaller in countries with greater gender equality.

Schools train us to provide the right answers to predefined questions. Yet anyone with experience from the real world knows that, more often than not, the difficult part is to find the right question.

To make a remarkable contribution, you need to start by asking the right question. I will go further than this: the questions you are asking might define who you are.

What is a good question?

The great questions are tractable and fruitful. They lead you on a path of discovery. It is easy to ask how to cure cancer, but that’s not a good question because it does not help anyone do medical research.

Secret questions are the best: if you are the only one with this question in mind, then you may be holding a gold mine. Questions that everyone is asking are worth proportionally less. (E.g., see Zero to One by Peter Thiel)

You may think that by studying hard, by learning all the answers, you will get better at asking great questions. I am not sure it works.

In fact, knowing too much can harm you. I would take a B student who has fresh questions as a Ph.D. student over a typical overeager A+ student who frets about getting everything right. It is a poorly held secret that some of the very best researchers and innovators were average students.

Do the following experiment. Pick a scholarly field, any field, then spend two weeks reading everything about it that you can. Next, write down 5 questions. I can almost guarantee you that these 5 questions will already be covered by the sources you read. They will be “known” questions.

So to find good questions, you have to maintain some distance from the material. This should be uncontroversial if you consider that I define “good questions” to be “secret” or “highly original”.

Our minds tend to frame everything in terms of the patterns we have learned. Spend two years studying Marxism and every single problem will feel like a Marxist problem to you. It becomes difficult for you to come up with new questions outside of the frame.

Don’t get me wrong: smart people who know more tend to be more creative, everything else being equal… but there is a difference between being knowledgeable and having been locked into a frame of mind.

Yet here is how many researchers work. They survey the best papers from the last major conference or journal issue in their field. Importantly, they read what everyone else is reading and adopt the frame of mind of the best people. They make sure that they can repeat the most popular questions and answers. They look at the papers, look for holes or possibilities for improvement, and work from there. This ensures that there are a few leaders (people writing about genuinely novel ideas) followed by a long and nearly endless stream of “me too” papers offering minor and inconsequential variations.

It is easier to judge these things in retrospect. In computer science, we had the XML craze at the turn of the century. Dozens of XML papers appeared each year at each of the top database conferences. I wrote about the untold story of the death of this idea. How could so many people get so excited at the same time by what was a dead-end?

I believe that people are happy to be handed questions and will often rush to provide highly sophisticated, thorough answers… whether or not the question is the right one.

My claim is that the people leading are not unnaturally smart, knowledgeable or creative. The people who answer other people’s questions are not dumb or unimaginative. The main difference is one of focus. You either focus on asking good questions or you focus on providing good answers.

The world would be better if we had more people asking better questions.

How might we ask better questions?

Pay attention to what is around you and violates your worldview. How did Fleming discover penicillin? He noticed that some mold that had invaded his dirty lab appeared to kill bacteria. He asked the right question at that time.

Be patient. Reportedly, Einstein once stated, “It’s not that I’m so smart, it’s just that I stay with problems longer.” The longer you work on a problem, the more likely you are to find interesting questions. (See Forthmann et al., 2018.) The easiest way to miss the great questions is to dismiss the problems as uninteresting and move on too quickly.

Be physically active, go for a walk. Chaining yourself to a desk is likely counterproductive. I used to think that being an all-out intellectual was the best route, but I now believe that I was grossly mistaken. I personally take a walk outside almost every morning on weekdays. (See Oppezzo and Schwartz, 2014).

Don’t be too social. Social pressure toward conformity triggers intense instinctive reactions. It is simply hard to go against the herd. Thus you are better off not knowing too much about where the herd is. In concrete terms, spend entire days by yourself.

Ask a lot of questions. If you want to become good at providing the right answers, train yourself to answer lots of questions. If you want to become good at asking questions, ask a lot of them.

Always question your own thoughts and work.

The scientific mind does not so much provide the right answers as ask the right questions. (attributed to Claude Lévi-Strauss)

Amazon, the e-commerce giant, is entering the chip-making business with its AWS Graviton processors, designed for cloud servers and based on an ARM architecture (like the processor in your phone). The initial reports are somewhat negative.

Suppose that you are given two arrays. Maybe you have a list of cities from the USA and a list of cities from Europe. You want to generate a new list which mixes the two lists, taking a sample from one array (say 50%), and a sample from the other array (say 50%). So if you have 50 cities from the USA and 50 cities from Europe, you want a new array that contains, in random order, 25 cities from the USA and 25 cities from Europe.

We need this kind of mixed sampling all the time in machine learning or data science. This summer, I was running simulations and the bulk of the time was spent mixing arrays. I need to pick, say, 25% of all elements from one array and combine them with, say, 75% of all elements from another array.

There are many bad ways to solve this problem. But here is a reasonable one. First you pick a sample from the first array using reservoir sampling; then you pick a sample from the other array (again using reservoir sampling), and you finally apply a random shuffle to the result.

Reservoir sampling is an efficient way to sample N values from an array:
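Here is a minimal Python sketch of standard reservoir sampling, together with the mixing step described above (my own illustration, not necessarily the author’s exact code):

```python
import random

def reservoir_sample(values, k, rng=random):
    # Keep the first k elements, then, for each later element i,
    # replace a random slot with probability k/(i+1). Every element
    # ends up in the sample with equal probability.
    sample = list(values[:k])
    for i in range(k, len(values)):
        j = rng.randrange(i + 1)
        if j < k:
            sample[j] = values[i]
    return sample

def mix(a, b, ka, kb, rng=random):
    # Sample ka elements from a and kb from b, then shuffle the union.
    result = reservoir_sample(a, ka, rng) + reservoir_sample(b, kb, rng)
    rng.shuffle(result)
    return result

usa = ["New York", "Chicago", "Boston", "Houston"]
europe = ["Paris", "Berlin", "Rome", "Madrid"]
print(mix(usa, europe, 2, 2))  # e.g., ['Rome', 'Boston', 'Paris', 'Chicago']
```

Each pass over an input array is a single scan, so the whole procedure is linear in the total input size plus the cost of the final shuffle.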

So how efficient is it? Suppose that I have two arrays made of a million elements each and I want to sample half a million elements from each. On my iMac, I use a bit over 12 CPU cycles per input element (so about 24 million cycles in total). You probably can go even faster, but this approach has the benefit of being both simple and efficient.

The results of both within- and between-family research designs revealed no consistent evidence of a link between birth order and the personality traits of extraversion, neuroticism, agreeableness, conscientiousness, and openness.

Most modern cultures use numbers based on a decimal system (base 10). However, in many European languages (e.g., French and Danish), the number 20 is used as a base (eighty in French is quatre-vingts, literally “four twenties”). We call such systems vigesimal. They are common in Africa. The Maya counted in base 20. I am told that the Gauls used a vigesimal system, but I could not find a credible supporting source (the Gauls also used Greek and Latin).

Many of our everyday plastic items (like plastic bottles) contain a chemical called BPA. Our bodies can ingest it, but it is evacuated within hours. There has been intense lobbying to ban it in the spirit of the precautionary principle; it does affect mice (causing genetic mutations in their offspring), but there is no proof that it harms human beings. Should you buy goods that are said to be BPA-free? They are made with alternative chemicals, so the question is whether these alternative chemicals are safer. Horan et al. provide evidence that the alternatives can be harmful.

Coal’s popularity can be explained in large part by how nuclear power is failing us.

We subsidize electric cars because we assume that they are more environmentally friendly. They certainly lack an exhaust pipe, which is great for people around the car. Electric cars make it easy to “export” pollution: you can keep dense cities or even entire countries cleaner… But the batteries and their toxic chemicals must still end up somewhere.

What is the larger picture? If you care only about climate change, then electric cars are slightly beneficial, as long as you are not producing your electricity from coal…

When powered by average European electricity, electric vehicles are found to reduce global warming potential by 20% to 24% compared to gasoline and by 10% to 14% relative to diesel under the base case assumption of a 150,000 km vehicle lifetime. Electric vehicles powered by coal electricity are expected to cause an increase in global warming potential of 17% to 27% compared with diesel and gasoline ICEVs. (Hawkins et al., 2012)

Yet if you care about other types of environmental impacts, electric cars may be less beneficial…

Thanks in part to the success of direct-to-consumer marketing campaigns by drug companies, the notion that major depression and allied disorders are caused by a “chemical imbalance” of neurotransmitters, such as serotonin and norepinephrine, has become a virtual truism in the eyes of the public (…) the evidence for the chemical imbalance model is at best slim (…) There is no known “optimal” level of neurotransmitters in the brain, so it is unclear what would constitute an “imbalance.” Nor is there evidence for an optimal ratio among different neurotransmitter levels. Moreover, although serotonin reuptake inhibitors, such as fluoxetine (Prozac) and sertraline (Zoloft), appear to alleviate the symptoms of severe depression, there is evidence that at least one serotonin reuptake enhancer, namely tianeptine (Stablon), is also efficacious for depression (Akiki, 2014). The fact that two efficacious classes of medications exert opposing effects on serotonin levels raises questions concerning a simplistic chemical imbalance model.

lowering dietary carbohydrate increased energy expenditure during weight loss maintenance. This metabolic effect may improve the success of obesity treatment, especially among those with high insulin secretion.

I should warn that this study refers to “lowering sugar” not getting rid of it entirely.

science has slowed enormously per dollar or hour spent. That evidence demands a large-scale institutional response. It should be a major subject in public policy, and at grant agencies and universities

While I accept their demonstration, it is not clear what (if anything in particular) is causing this lack of productivity.

Collison and Nielsen fall short of offering a solution. Maybe we ought to reinvent discovery?

Suppose that you are given a table. You know the number of rows, as well as how many distinct values each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set, which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas’ formula. It uses a simplistic model where there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of distinct rows. However, despite its simplicity, it often works really well.

Let p be the product of all column cardinalities and let n be the number of rows; then the Cardenas estimate is p × (1 − (1 − 1/p)^n). Simple, right?

So let us put in the numbers… my column cardinalities are 16,16,15,5,2,94,21648,92,42,7,9,2,6,73,119; and I have 48842 rows. So what is Cardenas’ prediction?

Zero.

At least, that’s what the Java function returns.

Why is that? The first problem is that 1 – 1/p is 1 when p is that large. And even if you could compute 1 – 1/p accurately enough, taking it to the power of 48842 is a problem.
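To see the first failure concretely, here is a quick check in Python (which uses the same IEEE-754 double precision as Java), with the cardinalities listed above:

```python
from math import prod

cards = [16, 16, 15, 5, 2, 94, 21648, 92, 42, 7, 9, 2, 6, 73, 119]
p = prod(cards)              # about 2e21
print(1.0 - 1.0 / p == 1.0)  # True: 1/p is far below the double-precision epsilon
```

Since 1/p is around 5 × 10⁻²², far smaller than the roughly 10⁻¹⁶ granularity of a double near 1, the subtraction rounds to exactly 1, and the whole estimate collapses to zero.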

So what do you do?

You can switch to something more accurate than double precision, namely quadruple precision (also called binary128). There are no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier. Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (built from multiplications).

It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java’s native type. Though the computation involves fancy data types, it is probably more than fast enough for most applications.
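An alternative that stays in double precision is to rewrite the expression: since (1 − 1/p)^n = exp(n · log(1 − 1/p)), the standard functions log1p and expm1 compute the critical near-zero and near-one quantities accurately. Here is a sketch of that variation in Python (my own rewriting, not the author’s BigDecimal code):

```python
import math

def cardenas(cardinalities, n):
    # Cardenas' estimate p * (1 - (1 - 1/p)^n), rewritten as
    # -p * expm1(n * log1p(-1/p)) to avoid the catastrophic rounding
    # of 1 - 1/p when p is huge.
    p = 1
    for c in cardinalities:
        p *= c  # Python integers never overflow
    return -p * math.expm1(n * math.log1p(-1.0 / p))

cards = [16, 16, 15, 5, 2, 94, 21648, 92, 42, 7, 9, 2, 6, 73, 119]
print(cardenas(cards, 48842))  # close to 48842: duplicates are nearly impossible
```

With p near 2 × 10²¹ and only 48842 rows, almost every row is expected to be distinct, so the estimate is essentially n rather than zero.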

Modern processors execute instructions in parallel in many different ways: multi-core parallelism is just one of them. In particular, processor cores can have several outstanding memory access requests “in flight”. This is often described as “memory-level parallelism”. You can measure the level of memory-level parallelism your processor has by traversing an array randomly, either by following one path or by following several different “lanes”. We find that recent Intel processors have about “10 lanes” of memory-level parallelism.
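To illustrate what the “lanes” are, here is a sketch of the access pattern in Python (illustration only: interpreter overhead swamps the memory effect, so a real timing experiment needs C/C++ like the actual benchmark; the helper names are my own):

```python
import random

def random_cycle(n, rng=random):
    # Sattolo's algorithm: a uniformly random permutation with a single
    # cycle, so a pointer chase visits every element exactly once.
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i forces one big cycle
        p[i], p[j] = p[j], p[i]
    return p

def chase(perm, lanes, steps):
    # Advance `lanes` independent cursors through the permutation.
    # The chains have no data dependency between them, which is what
    # lets the hardware overlap the resulting cache misses.
    n = len(perm)
    cursors = [(k * n) // lanes for k in range(lanes)]
    acc = 0
    for _ in range(steps):
        for k in range(lanes):
            cursors[k] = perm[cursors[k]]
            acc += cursors[k]
    return acc
```

With one lane, each load must wait for the previous one; with several lanes, the loads can proceed concurrently, and the per-element time drops until the hardware’s limit on outstanding requests is reached.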

The kind of memory-level parallelism I am interested in has to do with out-of-cache memory accesses. Thus I use a 256MB block of memory. This is large enough not to fit into a processor cache. However, because it is so large, we are likely to suffer from a virtual-memory-related fault. This can significantly limit memory-level parallelism if the page sizes are too small. By default on the Linux distributions I use, the pages span 4kB (whether on 64-bit ARM or x64). Empirically, that is too small. Thankfully, it is easy to reconfigure the pages so that they span 2MB or more (“huge pages”). On Apple’s devices, whether it be an iPhone or an iPad Pro, I believe that the pages always span 16kB and that this cannot be easily reconfigured.

Before I continue, let me present the absolute timings (in seconds) using a single lane (thus no memory-level parallelism). Apple makes two versions of its most recent processor: the A12 (in the iPhone) and the A12X (in the iPad Pro).

Intel Skylake (4kB pages): 0.73 s
Intel Skylake (2MB pages): 0.61 s
Apple A12 (16kB pages): 0.96 s
Apple A12X (16kB pages): 0.97 s
Apple A10X (16kB pages): 1.15 s

According to these numbers, the Intel server has the edge over the Apple mobile devices. But that’s only part of the story. What happens as you increase the number of lanes (while keeping the code single-threaded) is interesting: Apple processors start to beat the Intel Skylake in absolute, raw speed.

Another way to look at the problem is to measure the “speedup” due to memory-level parallelism: we divide the time it takes to traverse the array using 1 lane by the time it takes using X lanes. We see that the Intel Skylake processor is limited to about a 10x or 11x speedup, whereas the Apple processors go much higher.

Thoughts:

- I’d be very interested in knowing how Qualcomm and Samsung processors compare.
- It goes without saying that my server-class Skylake machine uses a lot more power than the iPhone.
- If I could increase the page size on iOS, we would get even better numbers for the Apple devices.
- The fact that the A12 has higher timings when using a single lane suggests that its memory subsystem has higher latency than a Skylake-based PC. Why is that? Could Apple just crank up the frequency of the DRAM memory and beat Intel throughout?
- Why is Intel limited to 10x memory-level parallelism? Why can’t they do what Apple does?

Credit: I owe much of the design of the experiment and C++ code to Travis Downs, with help from Nathan Kurz. The initial mobile app for Apple devices was provided by Benoît Maison, you can find it on GitHub along with the raw results and a “console” version that runs under macOS and Linux. I owe the A12X numbers to Stuart Carnie and the A12 numbers to Victor Stewart.