Vision is arguably one of the most challenging, and potentially most useful, problems in modern science and engineering, because of its enormous complexity in knowledge representation, learning, and the computing mechanisms of biological systems. For such a complex problem, we must look for a long-term solution and be cautious that many apparently promising paths may lead to dead ends. By analogy, suppose some monkeys want to reach the moon. They may choose to (1) climb a tree (indeed, a tree could be so tall that a monkey climbs diligently for a lifetime), (2) grab the moon from a well at night, or (3) ride a hot air balloon! All these methods appear smart and are actually very cute, and people can enjoy measurable progress over time, while the real solution (building a spacecraft) looks hopeless for a long time and appears totally ridiculous to ordinary eyes! In reality, most people simply do not have the patience to learn astrophysics and rocket science, which are too complex and boring for them.

------------------------------------ Comic illustrated by my daughter Stephanie Zhu (age 11, drawn in 2010): How to reach the moon.

Some students asked me whether vision is just an application area of machine learning (which currently means training Boosting or SVM classifiers with a large number of examples), as it appears to them. If so, what is left for vision researchers is just to design good features. Such a question is a real insult to vision and reflects the misleading research trend that poses vision as a simple classification problem. This is no longer surprising to me, as the young generation not only has never heard of Ulf Grenander (the father of pattern theory), but now does not even know who David Marr (the father of computational vision) was. By analogy, machine learning, in its popular meaning, is very much like the method practiced by Chinese herbal clinics over the past three thousand years. Ancient people, who had little knowledge of modern medicine, tried hundreds of materials (roots, seeds, shells, worms, insects, etc.), just as machine learning people test various features. These ingredients are mixed with weights and boiled into a black and bitter soup as a drug: a regression process. It is believed that such soup can cure all illnesses, including cancer, SARS and HmNn flu, without having to understand either the biological functions and causes of the illnesses or the mechanisms of the drugs. All you need is to find the right ingredients and mix them in the right proportions (weights). In theory, you can actually prove that this is true (essentially, modern medicine mixes some sort of ingredients as well), just as machine learning methods are guaranteed to solve all problems if they have enough features and examples, according to statistical theory! But the question is: with the space of ingredients so large, how do we find the right ingredients effectively (with a realistic number of examples for training as well as a realistic number of patients for testing)?
For vision, we have to study the complex structures of images, the rich spaces and their compositions, and the variety of models and representations.

Research methodologies in vision (and in other sciences and engineering) can be summarized in three approaches or stages: Hack, Math, and Stat. Hacks are heuristics, things that somehow work somewhere, but you cannot tell exactly how and where they work. Math is on the opposite side: it tells us that under certain conditions things can be said analytically or with a guarantee of performance, but often the conditions are quite limited and do not apply to general situations in the real world. Stat is essentially regression. With lots of parameters, you can eventually fit any data, but you lack a physical explanation. Hack, math, and stat are therefore different levels of interpretation or models. It is interesting to see examples in disciplines that have a longer history, say physics. The Chinese expeditions [1405-1433] in the Ming dynasty were the most advanced of their time: folks sailed 2/3 of the world, reaching as far as Africa, without even knowing the Earth is round! The technique they used is called celestial navigation (see the picture below), which I call "hacks" here. People used the constellations to find north and the latitude. The constellations are very much like the shape features we use today for object recognition. It was not precise, but it worked to some extent in practice. A beautiful math theory appeared in the 1680s* when Newton invented the theory of gravitation, which is simple and explains the movements of stars and planets. But the math was not sophisticated enough to fully explain the motion of the moon. Newton reportedly said that the lunar theory "made his head ache and kept him awake so often that he would think of it no more"**. In the 1750s, it was European talents like Euler and a few others who came to the rescue. They fit equations to the observational data by regression, an approach that later matured into the least-squares method (see the equation below). Such regression equations look very familiar in machine learning today.
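To make the "Stat" point concrete, here is a minimal sketch of least-squares regression. Everything in it is an assumption for illustration: the data are synthetic (noisy samples of a sine curve), the polynomial basis is arbitrary, and the helper name `fit_least_squares` is hypothetical. It only shows that, with enough parameters, the fit to the observations becomes essentially exact, while the fitted weights carry no physical explanation.

```python
import numpy as np

def fit_least_squares(x, y, degree):
    """Fit a polynomial of the given degree by ordinary least squares."""
    # Design matrix with columns 1, x, x^2, ..., x^degree
    A = np.vander(x, degree + 1, increasing=True)
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Sum of squared errors on the observations used for fitting
    residual = float(np.sum((A @ weights - y) ** 2))
    return weights, residual

# Hypothetical observations: 8 noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# More parameters give a smaller error; 8 parameters interpolate 8 points
residuals = {d: fit_least_squares(x, y, d)[1] for d in (1, 3, 7)}
for degree, r in residuals.items():
    print(f"degree {degree}: squared error {r:.2e}")
```

The degree-7 model passes through all eight points, noise included, which is exactly the sense in which regression can fit any data without explaining it.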
Hack, math, and stat are all useful tools and methods, and a complex solution often integrates all three of them. For example, in image compression and coding, we have the math of information theory, wavelets and computational harmonic analysis at the core. Then we also use statistics for the frequencies of various elements in the code book. Finally, the coding scheme contains numerous engineering hacks to make it work on real images and video, as in JPEG and MPEG. It is likely that the solution to vision will rely on all three aspects, and you need to be good at all three if you are serious about solving the vision problem.

------------------------------------------------------ From a talk I gave at the Frontiers of Vision in Boston in 2011. Download the whole ppt.

Reductionism is a beloved research strategy in many areas of modern science. It says that if you cannot solve a problem, you should divide it into smaller components, as any complex system is nothing but the sum of its parts. This methodology was practiced by early vision researchers in the 1980s, for example in numerous methods for edge detection, segmentation, shape-from-X, etc. But people found that even the simplest problem, like edge detection, could not be solved, because the definition of an edge depends on tasks at higher levels, and even human labelers cannot agree on whether there is an edge without specifying the task level. Unlike physicists, who can choose to study a system or phenomenon at a given scale or state, computer vision researchers found themselves very unfortunate: each single image contains so many patterns and tasks across many levels! The figure below shows how much we humans can infer, parse, and reason about in space, time and causality from a single image.

------ This is a figure that I drew in our MURI 2015 project: Understanding Scenes and Events by Joint Parsing, Cognitive Reasoning and Lifelong Learning.

The table below lists a set of questions that we must solve, all together, in order to understand a single image. So, we go in the opposite direction: if you cannot solve a simple problem, you may have to solve a complex one! This motivated our work on developing a unified representation, a spatial, temporal and causal and-or graph, and on jointly parsing all the tasks in the table (see our demo page). Now it reminds me of a loud slogan in machine learning: "You should never solve a problem more than is necessary" (Vapnik). This was used to argue for discriminative models against generative models. There is nothing wrong with the slogan itself, but unfortunately we just don't have such well-defined problems to solve in computer vision! Face detection is perhaps a rare exception, when you don't consider the image context. Edge detection was thought to be a classification problem, but it is not.
I am also reminded that physicists have lately been taking our approach. For example, the concept of dark matter and dark energy is to construct a more complex system than what we can see, and in superstring theory, people go to a 10-dimensional space in order to reconcile relativity theory and quantum mechanics.

--------------------------------- This table lists the aspects for scene understanding that we promised in 2010 to study in the ONR MURI project.

In images and videos, many entities and relations are infeasible to detect by their appearances using existing approaches, and most of them do not even show up in any pixels. Yet they are pervasive and govern the placement and motion of the visible entities, which are relatively easier to detect. By analogy, they are like the dark matter and dark energy that physicists study in the standard cosmological model. Studying such "dark entities" and "dark relations" in vision is crucial for filling the performance gaps in the recognition of objects, scenes, actions and events. More specifically, “dark matter” corresponds to entities which are infeasible to recognize by visual appearance. This includes i) the status of an agent (human or animal), such as its goals and intents or being hungry or thirsty, which triggers actions; ii) the status of an object, such as a door being “locked”; and iii) stuff like water, which has no specific geometric shape or appearance. “Dark energy” refers to hidden relations which drive the motion of objects in a video. To name a few: i) physical forces like gravity and supporting relations between objects; ii) causal effects and causal relations between actions and the changing statuses of objects; and iii) attraction relations between an object (like food) and an agent (who is hungry); and so on.

In computer vision, the two prevailing representational paradigms are: (I) the view-centered, appearance-based models; and (II) the object/scene-centered, geometry-based representations. I argue that a deeper level of representation is task-centered and based on the dark matter and dark energy. Therefore we must integrate the "visible" (geometry and appearance) and the "dark" (FPIC), and develop algorithms for joint inference and reasoning. This also explains why we cannot solve those "simple" problems and must tackle the "complex" one jointly.

---------------------------------Written based on an ICCV 2013 paper and an NSF grant in 2014.

In recent years, as many areas in AI have become promising, researchers from funding agencies, academia and industry have restarted the conversation on commonsense: how do we humans, and then machines, acquire commonsense knowledge? And how do we, and machines, use commonsense for reasoning and task-solving? The commonsense used in our daily life is massive; it is not just those statements or relations that people collect from the Internet through mining, such as "Los Angeles is the second largest city in the US." What is the most important commonsense knowledge for AI? It is non-trivial to define. Here I propose two criteria: (i) It must be broadly applicable in scope. Each piece of knowledge has its scope in a certain space and time interval and is only applicable to a certain set of entities. (ii) It must be frequently used in human tasks.

For example, an average human has the basic knowledge to assemble new IKEA furniture through trial and error, although the time spent is much longer than that of a hired person who has assembled the same furniture many times. The latter has a special skill trained on the same examples, not commonsense. As for the most important commonsense, I agree with cognitive psychologists: the human brain has obtained two simulation engines through evolution. One is a physical simulation engine to understand the intuitive physics of our environment; the other is a social simulation engine to deal with other agents (humans, animals). I believe these structures are almost innate, and almost unconscious and instantaneous, but massive. They are what our current vision, cognition and AI systems are particularly bad at.

-----------This is excerpted from a talk at a meeting on Commonsense in 2015.

In the 1980s, AI research plummeted into an era known as the "AI winter". In my opinion, this description is inaccurate. While logic-based AI went into a cold winter, a probability-based and data-driven AI gradually prevailed. AI divided into at least six areas: Vision, Language, Cognition, Learning, Robotics, and Social aspects (economics, multi-agent systems, game theory, ethics, morality, etc.). In the 1960s-1980s, popular AI research was built on the foundation of logic, such as predicate calculus (knowledge representation and reasoning), event calculus (spatial and temporal reasoning), and situation calculus (causal reasoning and robot planning). Logic is crisp and interpretable, but cannot handle the uncertainty in real-world signals. As real data became available in the 1990s, thanks to the so-called sensor revolution, each of these fields grew into its own community to study its own tasks and data. At the turn of the new millennium, all these areas found their new foundations in statistical modeling, learning and computing. In the past few years, people have begun to collaborate across the boundaries: for example, vision is tightly integrated with learning, vision meets cognition, vision meets language, vision serves robotics, etc. As this trend continues, we are heading into a new "Era of Big Integration" on the foundation of probabilistic representation and computing (the calculus of probability). No wonder James Clerk Maxwell said that the calculus of probabilities is the true logic!