“The goal of computer vision is to observe the world and understand what you’re seeing,” says Larry Zitnick of Microsoft Research. “So how can we teach computers to verbally describe a scene in the same way as a human? Semantic scene understanding of images is a challenging and fundamental problem in computer vision.”

It’s a problem complicated by dependence on another problem in computer vision: object recognition. Locating and identifying the objects in an image requires extracting complex visual information from it, which remains, in general, a difficult, unsolved problem in computing.

Since the technology for object recognition in photographs is slowing work in semantic scene understanding, is it possible to bypass the issue of photorealistic scene recognition?

The research team chose to use an object data set of clip-art images: a girl and a boy, an outdoors background, toys, animals, and picnic objects. In addition, the object data set contained different body poses and facial expressions for the children.

Zitnick and his co-authors used scenes built from the clip art to study semantic understanding of visual features such as the arrangement of people in relationship to surrounding objects, facial expressions, body poses, and various combinations of such visual features.

The Advantages of Simplicity

In one experiment, the scenes were created using Amazon’s Mechanical Turk. Participants were given sentences describing a scene and were asked to create a visual representation of the sentence using clip art from the world of two children—known as “Jenny” and “Mike.”

The team found that use of abstract scenes instead of real images provided two main advantages:

The visual arrangement and attributes of the objects in the scene were known, which enabled the research to focus on the core problem of semantic scene understanding. The researchers were able to avoid problems arising from the use of noisy automatic object and attribute detectors in real images.

Abstract scenes enabled dense sampling to learn about subtle nuances in semantic meaning. While real image data sets may be quite large, they tend to contain a diverse set of scenes, resulting in a sparse sampling of many semantic concepts.

Using clip art to detect subtle nuances in semantic meaning.

Dense sampling made it possible to learn that, while the sentences “Jenny is next to Mike,” “Jenny ran after Mike,” and “Jenny ran to Mike” are similar, each had distinctly different visual interpretations.

“With densely sampled training data,” Zitnick explains, “we could learn that ‘ran after’ implies Mike also has a running pose, while ‘ran to’ does not. Similarly, we could learn that ‘next to’ does not imply that Jenny is facing Mike, while ‘ran after’ and ‘ran to’ do.”
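As a rough illustration of what dense sampling makes possible, the sketch below estimates how often a visual feature co-occurs with a phrase across many annotated scenes. It is not the team's method: the records, the field names (phrase, mike_pose, jenny_faces_mike), and the helper feature_rate are all hypothetical, and the data is a made-up toy sample chosen only to mirror the pattern Zitnick describes.

```python
# Toy, invented annotations (NOT the real "Mike and Jenny" data set).
# Each record pairs a phrase from a sentence with visual features that are
# known exactly because the scene was assembled from clip art.
SCENES = [
    {"phrase": "ran after", "mike_pose": "running",  "jenny_faces_mike": True},
    {"phrase": "ran after", "mike_pose": "running",  "jenny_faces_mike": True},
    {"phrase": "ran after", "mike_pose": "running",  "jenny_faces_mike": True},
    {"phrase": "ran after", "mike_pose": "running",  "jenny_faces_mike": True},
    {"phrase": "ran to",    "mike_pose": "standing", "jenny_faces_mike": True},
    {"phrase": "ran to",    "mike_pose": "sitting",  "jenny_faces_mike": True},
    {"phrase": "ran to",    "mike_pose": "running",  "jenny_faces_mike": True},
    {"phrase": "ran to",    "mike_pose": "standing", "jenny_faces_mike": True},
    {"phrase": "next to",   "mike_pose": "standing", "jenny_faces_mike": True},
    {"phrase": "next to",   "mike_pose": "sitting",  "jenny_faces_mike": False},
    {"phrase": "next to",   "mike_pose": "standing", "jenny_faces_mike": False},
    {"phrase": "next to",   "mike_pose": "sitting",  "jenny_faces_mike": True},
]


def feature_rate(scenes, phrase, predicate):
    """Fraction of scenes described by `phrase` for which `predicate` holds.

    Returns None when no scene uses the phrase, so that "no evidence"
    is not confused with a rate of 0.0.
    """
    matching = [s for s in scenes if s["phrase"] == phrase]
    if not matching:
        return None
    return sum(1 for s in matching if predicate(s)) / len(matching)


def mike_running(scene):
    return scene["mike_pose"] == "running"


def jenny_facing(scene):
    return scene["jenny_faces_mike"]


for phrase in ("next to", "ran after", "ran to"):
    print(
        phrase,
        "| Mike running:", feature_rate(SCENES, phrase, mike_running),
        "| Jenny facing Mike:", feature_rate(SCENES, phrase, jenny_facing),
    )
```

With enough scenes per phrase, rates like these separate near-synonyms: in the toy sample, Mike is running in every "ran after" scene but in only a quarter of the "ran to" scenes. A sparse data set with one or two examples per phrase could not support that kind of comparison.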

Learning Common Sense

The team is also studying whether common-sense knowledge can be learned directly by observing abstract scenes and analyzing the relationships between objects, both spatially and temporally. The notion of common sense in semantic scene understanding is one that Zitnick wants to highlight. He observes that people tend to forget that common sense is an important component of semantic scene understanding.

“We actually need common-sense knowledge about the world outside the scene in order to describe it,” he says. “Common-sense knowledge about the world helps you filter out what is uninteresting versus what is novel. For example, when we showed pictures of Mike and Jenny being chased by a bear to Turkers, the sentences they used to describe the scene didn’t mention the ketchup bottle on the picnic table or the toys. That wasn’t critical to the ‘story’ of the scene. I think that’s the other exciting part about this clip-art work: It’s a way to gather common-sense knowledge.”

The “Mike and Jenny” data set has been made available to the research community. Zitnick says it will take a year or two before it’s possible to assess the impact of the work.

“It’s been very interesting, though,” he says. “People from different fields, from computer vision to natural language processing [NLP], are using the data set. We’ve been studying how to go from sentences to scenes, but not from scenes to sentences, and that’s largely because I don’t have a background in NLP. So I’m working with NLP researchers.

“That’s pretty exciting for me, because I think as technology advances, disciplines will evolve. Right now, we have silos of learning, but 10 years from now, you won’t be able to achieve anything in computer vision unless you’re also an expert in the other subfields of artificial intelligence.”

Multiple Fields of Research

A multidisciplinary approach should suit Zitnick well. His background has encompassed a number of computing fields since he first enrolled at Carnegie Mellon University (CMU) as an electrical-engineering major.

“I liked programming. In elementary school, I entered statewide programming competitions on a TRS-80,” he laughs. “I majored in EE because I didn’t realize there was such a thing as computer science, where you’d get to program all the time. When I did, I switched.”

In the 1990s, artificial intelligence was a hot topic in computer science.

“It was going to be the cure for everything,” Zitnick recalls. “I applied for summer jobs in AI, and the closest I could get was a stereo-vision computing project. We used Fortran for programming on a beast of a machine that you had to wiggle every so often to make it work. I was on that project for two years, and we ended up with a patent for one of the algorithms. Then, during grad school, I worked for a startup that was building portable 3-D cameras.”

Artificial intelligence remained close to Zitnick’s heart, and partway through grad school at CMU, he switched to machine learning. When he graduated and joined Microsoft Research, he began working again on computer-vision projects, for which he has been able to use his background in machine learning and AI.

Stylus-based research is another field that interests Zitnick because one of his hobbies is woodworking; he designs and builds mid-century Danish-modern furniture.

“I spend a lot of time drawing and designing,” he says, “and I’ve always wanted a computing device that would be a good substitute for pen and paper, so I’ve done some work with stylus-based computing. The most recent one is about cleaning up handwriting and line drawings.”

Pacific Northwest Lifestyle

Woodworking is only one of Zitnick’s hobbies. He appreciates the advantages of being based in Redmond, Wash., because his family’s favorite pastime is enjoying the outdoors. They often go camping and hiking around the Seattle area.

“We spend a lot of time in the woods when I’m not working,” he says. “My wife and I like to take our two kids backpacking. We love living in Seattle. It’s a great city for a young family. It’s so easy getting to some really beautiful places.”

Zitnick also appreciates his work environment.

“At Microsoft Research, you get to do research, but you also have time to code and collaborate. And when it comes to collaborating, you get to work with the smartest people in their fields. Having all this expertise to draw upon is so important, because research is becoming more of an interdisciplinary effort. Three years ago, I never would’ve predicted I would be working with NLP researchers.

“The cross-discipline aspects of the research we’re now doing is taking computer vision and computer science to a whole new level. I can’t wait to see what we’ll achieve.”