Tuesday, 19 January 2016

Unlike other Silicon Valley companies that are "making the world a better place" (according to Silicon Valley, the TV series), [x] is "making the world a radically better place" (according to today's presenter).

If you followed Google I/O this year or have gone to some other Google talk about [x] and its moonshots, or if you've just glued yourself to the aforementioned TV series, in all three cases you would have seen the following slide:

At MIT today, we heard about 2 projects from [x] (out of the 10 or so that are currently public): self-driving cars and internet balloons.

-----

A huge problem? The 1.2 million annual traffic accidents, 93% of which are due to human error. The radical solution: self-driving cars and the required road infrastructure. Breakthrough tech: software with realtime sensor processing.

The principle on which this whole autonomous car thing hinges is an initial full laser mapping of the urban area in which the car is to drive (roads, buildings and all). The car's real-time sensor data is then aligned with this map for accurate positioning and localization within the lane.
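To make the "align live sensor data with a pre-built map" idea concrete, here is a toy 1-D sketch. It is only an illustration under my own assumptions (the map and scan are made-up lists of range readings, and `best_offset` is a hypothetical helper), not how the actual cars do it:

```python
def best_offset(prior_map, scan):
    """Slide the scan along the map and return the offset with the
    smallest sum of squared differences (a brute-force alignment)."""
    best, best_err = 0, float("inf")
    for offset in range(len(prior_map) - len(scan) + 1):
        err = sum((prior_map[offset + i] - s) ** 2 for i, s in enumerate(scan))
        if err < best_err:
            best, best_err = offset, err
    return best

# Made-up pre-recorded map: distance to the nearest surface at each position.
prior_map = [5.0, 5.1, 4.9, 2.0, 2.1, 5.0, 7.5, 7.4, 5.0, 5.2, 3.0, 5.1]
# Made-up live scan: a slightly noisy reading of map positions 6 to 8.
scan = [7.4, 7.5, 5.1]

offset = best_offset(prior_map, scan)
print(offset)  # → 6
```

The scan never matches the map exactly, but the best-fitting offset still tells the car where along the map it is.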

Is it feasible to have to pre-map every urban environment, and then to update it when it changes? Ah, well, on the one hand Google Streetview cars have already shown us something about feasibility, but in the long term, the self-driving cars will continue to collect data as they drive, driving and mapping simultaneously.

In fact, already these self-driving cars can send alerts to the other self-driving cars on the road when things look different than expected (lane closures on this street, sending you an updated map...). Such alerts also get sent to the home base for potential updates to the whole system. The cars are connected to each other and their home base via the internet. (Is this information transfer secure? Not to worry, Google knows a thing or two about internet security.)

So, some basic rules are hard-coded, but there's also a lot of machine learning from the many hours spent by these cars on the roads in all sorts of situations. These cars have to learn about the distribution of pedestrian movements (how fast they walk, how quickly they can switch direction, etc.), the typical behaviors of pedestrians, bicyclists, and pedestrians in response to bicyclists. They plot out the trajectories of all vehicles currently on the road and anticipate their next move.

The big challenge? Achieving ridiculous recall and precision. A recall of 99% when pedestrians are involved is not going to do it (you just can't afford to lose a few pedestrians here and there while you tweak your algorithm). Recall is very much about the safety of the pedestrians, but precision is also about the safety of the vehicles (and their passengers). If the car freaks out at every road sign blowing in the wind, not only will the ride be very uncomfortably jerky, but the car might swerve into other cars to avoid hitting the mistakenly classified "pedestrian".
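For anyone who wants the two terms pinned down: recall is the fraction of real pedestrians the detector catches, and precision is the fraction of its "pedestrian" detections that are real. A quick sketch with made-up counts (the numbers below are my own illustration, not figures from the talk):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a detector, given raw counts."""
    detected = true_positives + false_positives   # everything flagged
    actual = true_positives + false_negatives     # everything really there
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts: 10,000 real pedestrians, of which 9,900 are detected
# and 100 are missed, plus 300 false alarms (say, road signs in the wind).
precision, recall = precision_recall(9900, 300, 100)
print(f"precision={precision:.3f} recall={recall:.3f}")
# → precision=0.971 recall=0.990
```

Even at 99% recall, those 100 misses are 100 pedestrians the car never saw.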

There are other behaviors built in for the comfort and safety of the passenger: for instance, shifting in the lane (all while avoiding other cars) when passing large trucks. Even if you have everything under control, you don't want your passenger getting antsy about a truck that's too close for comfort.

These cars also slow down at railroads, near parked cars, and while passing bicyclists. Their long-range and short-range sensors ensure the car is very much aware of its surroundings. So much so that the 15 cm resolution of its sensors allows the cars to recognize the fine hand gestures of the traffic controller waving in the middle of the intersection or the bicyclist signaling to change lanes. In making decisions, the cars also make use of all sorts of contextual information: are other cars moving? Why have they stopped? Are there traffic cones around?

And all of this computation and communication is happening on a single CPU. How's that for efficient resource sharing? (but watch out for GPUs coming to a car near you...)

These cars have been designed for zero driver assistance. Are you going to see any sort of control device built into them like a wheel or a brake pedal? No chance. This is Google's approach. No need so far: of the 13 driverless car incidents to date, all were the fault of the other drivers. (Side thought: what if the sheer sight of this car on the road is distracting?)
But these cars sure go through a lot of situational testing. And yes, they're good in harsh weather conditions too (confirmed by hardware reliability tests and buckets of water). The QA testing position for the self-driving car project must be damn awesome.

-----

Another huge problem? 2/3 of the world does not have internet. The radical solution? Balloons! Breakthrough tech: large-scale dynamic optimization of balloon network.

We're talking global optimization (literally, global). Consider a network of balloons that are distributed around the world that need to follow pre-planned flight paths, adapt to changing wind conditions, and deal with intermittent (sometimes flaky) instructions - all while providing continuity of internet service. This is Project Loon.

Communication with these balloons as they pass over the oceans is through satellite phones. In these conditions, instructions can be dropped, intermittent, or conflicting, and the balloons must nevertheless make good decisions based on limited information and changing wind gusts.

So how does it all work? These balloons fly at an altitude of 20 km - twice as high as airplanes and the weather, so at least a few fewer problems to deal with. They follow air currents at different altitudes, and steer with vertical motion to end up in an air current moving in the desired direction. An internal balloon pumps air in and out, and with essentially the power of a fan, can move the exterior balloon up and down. Additional power comes from solar cells, but in most cases the wind currents are sufficient to propel the balloons.
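The steering trick (go up or down until you find wind blowing the way you want) can be sketched in a few lines. This is a toy version under my own assumptions: the wind table is invented, and `pick_altitude` and `heading_error` are hypothetical helpers, not anything from Project Loon itself:

```python
def heading_error(a_deg, b_deg):
    """Smallest absolute angle between two compass headings, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def pick_altitude(wind_by_altitude, desired_heading_deg):
    """Choose the altitude whose wind heading is closest to the desired one.

    wind_by_altitude maps altitude (km) to the heading (degrees) that the
    wind at that altitude is blowing toward.
    """
    return min(
        wind_by_altitude,
        key=lambda alt: heading_error(wind_by_altitude[alt], desired_heading_deg),
    )

# Invented wind layers around the 20 km cruising altitude.
winds = {18.0: 350.0, 19.0: 90.0, 20.0: 200.0, 21.0: 265.0}
target = pick_altitude(winds, desired_heading_deg=10.0)
print(target)  # → 18.0
```

The real problem is much harder (winds change, forecasts are uncertain, and the whole fleet has to be coordinated), but the basic decision each balloon faces looks like this.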

A network of balloons thus moves through air currents, one displacing another, to provide continuous, seamless internet service to the cities below. It's kind of like how when you're moving, your service has to remain continuous despite shifting cell towers; but in this case, the city below is stationary, and it is the internet source that is moving above. This is the local optimization part.

Sometimes, balloons also need to be dispatched to the location of a natural disaster, and this has to happen fast. Balloons also need to function in all kinds of harsh conditions, and with local repair most often unavailable, redundancy is key. Redundancy, redundancy, redundancy. Remember how the self-driving cars had 1 CPU? Well, these babies have upwards of 40. And if something does go down, you have to go fetch it... wherever it ends up (can you climb trees and mountains?). Another damn awesome job.

These projects, and all the rest in the [x] repository, are driven in part by the slogan: "we need to fail faster". Innovation comes from trying radically new things, and radically new things can often lead to failure. Failing faster means trying again sooner.

I gotta hand it to you, Google sells it well. Another take-away? It seems Google likes the word radical.

In this post, I will focus instead on the common strands passing through the works of Bill Freeman, Joshua Tenenbaum, Josh McDermott, and Nancy Kanwisher - both to highlight the great interdisciplinary collaborations happening at MIT, and to give a broader sense of how neuroscience and computer science are informing each other, and leading to cool new insights and innovations.

Bill Freeman presented the work spearheaded by his graduate student Andrew Owens: "Visually indicated sounds". Teaming up with Josh McDermott, who studies computational audition at the MIT Department of Brain and Cognitive Sciences, they linked sound to material properties and vice versa. Given a silent video as input (of a wooden stick hitting or scratching some surface), the team developed an algorithm that synthesizes realistic sound to go along with it. To do so they needed to convert videos of different scenes (with a mixture of materials) into some perceptually-meaningful space, and link them to sounds that were also represented in some perceptually-meaningful way. What does "perceptually-meaningful" refer to? The goal is to transform the complex mess that is colored pixels and audio waveforms into some stable representations that allow similar materials to be matched together and associated with the same material properties. For instance, pictures (and videos) of different foliage will look very different from each other (the shape and the color may have almost no pixel-overlap) and yet, somehow, the algorithm needs to discover the underlying material similarity.

Here is one place where CNNs (convolutional neural nets) have been successful at transforming a set of pixels into some semantic representation (enough to perform scene recognition, object detection, and the other high-level computer vision tasks that the academic and industry communities have recently flooded the media outlets with). CNNs can learn almost human-like associations between images and semantics (like labels) or between images and other images. Owens and colleagues used CNNs to represent their silent video frames.

On the sound side of things, waveforms were converted into "cochleagrams" - stable representations of sound that allow waveforms coming from similar sources (e.g. materials, objects) to be associated with each other even if individual timestamps of the waveforms have almost no overlap. Now to go from silent video frames to synthesized sounds, RNNs (recurrent neural nets) were used (RNNs are great for representing and learning sequences, by keeping around information from previous timesteps to make predictions for successive timesteps). The cochleagrams predicted by the RNNs could then be transformed back into sound, the final output of the algorithm. More details in their paper.
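To give a feel for what "keeping around information from previous timesteps" means, here is a bare-bones recurrent forward pass in plain Python. It is a sketch under my own assumptions, not the model from the paper: the weights are random and untrained, the sizes are tiny, and the input and output vectors are just stand-ins for per-frame CNN features and cochleagram slices.

```python
import math
import random

def rnn_forward(frames, w_in, w_rec, w_out):
    """Run a plain (Elman-style) RNN over a sequence of feature vectors.

    The hidden state is carried from one frame to the next, so each output
    depends on the current frame and on everything seen before it.
    """
    hidden = [0.0] * len(w_rec)
    outputs = []
    for x in frames:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
        outputs.append([sum(w * hi for w, hi in zip(row, hidden)) for row in w_out])
    return outputs

def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 2          # toy sizes, chosen arbitrarily
w_in = random_matrix(n_hidden, n_in, rng)
w_rec = random_matrix(n_hidden, n_hidden, rng)
w_out = random_matrix(n_out, n_hidden, rng)

frames = [[rng.random() for _ in range(n_in)] for _ in range(5)]
outputs = rnn_forward(frames, w_in, w_rec, w_out)
print(len(outputs), len(outputs[0]))  # → 5 2
```

Feed the same frame in twice and the second output differs from the first, because the hidden state remembers that the frame has already been seen once.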

This work is a great example of the creative new problems that computer vision researchers are tackling. With the powerful representational architectures that deep neural networks provide, higher and higher-level tasks can be achieved - tasks that we would typically associate with human imagination and creativity (e.g. inferring what sound is emitted by some object, what lies beyond the video frame, what is likely to happen next, etc.). In turn, these powerful architectures are interesting from a cognitive science perspective as well: how are the artificial neural networks representing different features? images? inputs? What kinds of associations, correlations, and relationships do they learn from unstructured visual data? Do they learn to meaningfully associate semantically-related concepts? Cognitive scientists can give computer scientists some ideas about which representations may be reasonable for different tasks, given what is known from decades of experiments on the human brain. But the other side of the story is that computer scientists can prod these artificial networks to learn about the representational choices that the networks have converged on, and then cognitive scientists can design experiments to check if the networks in the human brain do the same. This allows the exploration of a wide space of hypotheses at low cost (no poking human brains required), to narrow down the focus of cognitive scientists in asking whether the human brain has converged on similar representations (or if not, how can it be more optimal?).

Nancy Kanwisher mentioned how advances in deep neural networks are helping to understand functional representation in the brain. Kanwisher has done pioneering work on functional specialization in the brain (which brain areas are responsible for which of our capabilities) - including discovering the fusiform face area (FFA). In her talk, she discussed how the "Principle of Modular Design" (Marr, 1982) just makes sense - it is more efficient. She mentioned some examples of work from MIT showing there are specialized areas for faces, language, visual words, even theory of mind. By giving human participants different tasks to do and scanning their brain (using fMRI), neuroscientists can test hypotheses about the function of different brain regions (they check whether the brain signal in those regions changes significantly as they give participants different tasks to do). Some experiments, for instance, have demonstrated that certain language-specific areas of the brain are not involved during logic tasks, arithmetic, or music (tasks that are sometimes hypothesized to depend on language). Experiments have shown that the brain's specialization is not all natural selection, and that specialized brain areas can develop as a child learns. Other experiments (with Josh McDermott) have shown that uniquely-human brain regions exist, like ones selective to music and human speech (but not other sounds). Other experiments probe causality: what happens if specific brain regions are stimulated or dysfunctional? How are the respective functions affected or impaired? Interestingly, stimulating the FFA using electrodes can cause people's representations of faces to change. Correspondingly, stimulating other areas of the brain using TMS can cause moral judgements to shift.

Kanwisher is now working with Josh Tenenbaum to look for areas of the brain that might be responsible for intuitive physical inference. Initial findings are showing that the regions activated during intuitive physics reasoning are the same ones responsible for action planning and motor control. Knowing how various functional areas are laid out in the brain, how they communicate with each other, and which resources they pool together, can help provide insights for new artificial neural architectures. Conversely, artificial neural architectures can help us support or cast doubt on neuroscience hypotheses by replicating human performance on tasks using different architectures (not just the ones hypothesized).

Josh Tenenbaum is working on artificial architectures that can make the same inferences humans make, but also make the same mistakes (for instance, Facebook's AI that reasons about the stability of towers of blocks makes different incorrect predictions than humans do). The best CNNs today are great at the tasks for which they are trained, sometimes even outperforming humans, but often also making very different mistakes. Why is it not enough to just get right what humans get right, without also having to get wrong what they get wrong? The mistakes humans make are often indicative of the types of broad inferences they are capable of, and uncover the generalizing power of the human mind. This is why one-shot learning is possible: humans can learn whole new concepts from a single example (and Tenenbaum has many demos to prove it). This is why we can explain, imagine, problem solve, and plan. Tenenbaum says: "intelligence is not just pattern recognition. It is about modeling the world", and by this he means "analysis by synthesis".

Tenenbaum wants to re-engineer "the game engine in your head". His group is working on probabilistic programs that can permit causal inference. For example, their algorithm can successfully recognize the parameters of a face (shape and texture; the layout and type of facial features) as well as the lighting and viewing angle used for the picture. Their algorithm does this by sampling from a generative model that iteratively creates and refines new faces, and then matching the result to the target face. Once a match is found, the parameters chosen for the synthesized face can be considered a good approximation for the parameters of the target face. Interestingly, this model, given similar tasks as humans (e.g. to determine if two faces are the same or different) takes similar amounts of time (corresponding to task difficulty) and makes similar mistakes. This is a good hint that the human brain might be engaged in a similar simulation/synthesis process during recognition tasks.
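The "synthesize, compare, refine" loop is easy to sketch on a toy problem. Below, the "face" is just a 1-D image with a single bump, and the parameters to recover are the bump's center and width. Everything here is my own assumption for illustration (the toy `render` function, the simple random-search refinement, the parameter ranges); the group's actual probabilistic programs are far richer:

```python
import math
import random

def render(center, width, size=32):
    """Toy generative model: a 1-D 'image' containing one Gaussian bump."""
    return [math.exp(-((i - center) ** 2) / (2 * width ** 2)) for i in range(size)]

def mismatch(image_a, image_b):
    """Sum of squared pixel differences between two images."""
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b))

def infer_parameters(target, prior_samples=500, refine_steps=3000, seed=0):
    """Recover (center, width) by synthesizing candidates and matching them."""
    rng = random.Random(seed)
    # Stage 1: synthesize many candidates from a broad prior, keep the best.
    best, best_err = None, float("inf")
    for _ in range(prior_samples):
        params = (rng.uniform(0.0, 31.0), rng.uniform(1.0, 8.0))
        err = mismatch(render(*params), target)
        if err < best_err:
            best, best_err = params, err
    # Stage 2: iteratively refine by proposing small changes to the best guess.
    for _ in range(refine_steps):
        center = best[0] + rng.gauss(0.0, 0.5)
        width = max(0.5, best[1] + rng.gauss(0.0, 0.2))
        err = mismatch(render(center, width), target)
        if err < best_err:
            best, best_err = (center, width), err
    return best, best_err

# The "observed" image, generated with parameters the inference never sees.
target = render(center=20.0, width=3.0)
(est_center, est_width), err = infer_parameters(target)
print(f"center≈{est_center:.2f} width≈{est_width:.2f} mismatch={err:.4f}")
```

Once the synthesized image matches the target, the parameters that produced it are taken as the explanation of the target, which is the whole point of analysis by synthesis.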

Tenenbaum and colleagues have made great strides in showing how "analysis by synthesis" can be used to solve and achieve state-of-the-art performance on difficult tasks like face recognition, pose estimation, and character identification (even passing the visual Turing test). As is the case for much of current neural network research, the original inspiration comes from the 80s and 90s. In particular, Hinton's Helmholtz Machine had a wake-sleep cycle where recognition tasks were interspersed with a type of self-reinforcement (during "sleep") that helped the model learn on its own, even when not given new input. This approach helps the model gain representational power, and might give some clues as well about human intelligence (what do we do when we sleep?).

How does the human mind make the inferences it does? How does it jump to its conclusions? How does it transfer knowledge gained on one task and apply it to a novel one? How does it learn abstract concepts? How does it learn from a single example? How does the human mind represent the world around it, and what physical structures are in place in order to accomplish this? How is the brain wired? These questions are driving all of the research described here and will continue to pull together the efforts of neuroscientists and computer scientists in the coming years more than ever before. Our new and ever-developing tools for constructing artificial systems and probing into natural ones are establishing more and more points of contact between fields. Symposia such as these can give one a small hint of what the tip of the iceberg might look like.