The Exploratory Advanced Research Program

Automated Video Feature Extraction Workshop Summary Report

October 10-11, 2012

PART TWO: DISCUSSION

For the second part of the workshop, attendees were given an opportunity to review items discussed during part one and then focus on data needs and requirements. Attendees then participated in general group discussion to identify challenges and put forward possible solutions.

Developing Tools

Lincoln Cobb, Technology Facilitator at FHWA's Office of Safety Research and Development, began by highlighting that in addition to the valuable NDS data, there is also the equally useful roadway information database running in parallel and providing helpful context. Cobb stated that this other rich resource is useful both on its own and for integrating with the NDS data.

Cobb explained that it is exciting to learn more about the breadth of applications for these types of data and that there are many research questions. As the data improve, more will be collected and the need for automated extraction will only grow. Cobb stated that roadway safety is the most relevant use for the data; however, the automated feature extraction applications under discussion cross all modes: wherever human beings are controlling the system, these approaches will be valid (e.g., in marine and freight train areas).

SHRP2 is a year away from completing NDS data collection, and there are ongoing discussions between TRB, FHWA, NHTSA, and AASHTO looking at ways to make the data more useful and easily accessible. Specific items being looked at include reduced datasets that would be easy to access and could answer limited questions; richer trip headers; and basic linkage between trip files and road segments.

Cobb informed workshop participants that the workshop is part of the effort to support the development and deployment of tools to make the extraction of information economical. There are two groups of users to cater to: (1) researchers who want to take the data and apply them to a specific project; and (2) tool developers. Researchers need easy access to clear data, but several privacy and security details must be overcome to reach this point, given the strong protections in place for the subjects. For example, Cobb highlighted that the current situation is a long way from being able to send 2 petabytes of data to a researcher via the internet. He confirmed that USDOT is looking for guidance in this regard and would like to identify what makes the data useful. Questions to be addressed include:

What reduced datasets would be of particular use to researchers?

What user tools would be helpful?

How can access be improved?

Cobb confirmed these items are being discussed, highlighting that the more economical, timely, and automated the process of data extraction, the better for all. Any progress that researchers can make with tools that look at the data being collected now will be of particular value in the medium term. Cobb confirmed that tools for the short term are just as important as those for the long term. He noted that partial solutions that are deployable can be extremely helpful, reiterating that a comprehensive solution for automated data extraction is not essential at this time.

Cobb concluded with a reminder of some of the key observations to emerge from researchers at the workshop. These included the view that context is crucial: it is important not to focus solely on any one thing or on the center of one's interest. Cobb also noted that a great deal of research is underway and there are many opportunities for collaboration. He highlighted that progress in automation seems to be moving ahead in many fields and research focus areas but the basic research that will support the development of deployable tools is also interesting.

Advanced Research Agenda

Next, David Kuehn provided some concluding remarks from the perspective of the FHWA's EAR Program. Kuehn noted the value of the multidisciplinary and multidomain solutions discussed during the workshop and confirmed that this is an approach that fits well within the context of EAR Program-sponsored projects. He noted that automated video analytics research is important for saving lives and reducing injuries and hopefully will also be something of interest to the larger research community.

Kuehn highlighted that it would be useful to capture why the recent advances in machine learning mean it is possible to solve issues now that were not possible 5 to 10 years ago. He also noted that it would be useful to know some of the approaches that would be most effective moving forward and what else needs to be known to advance.

Kuehn stated that there appears to be some uncertainty about which features are ideal to extract. He noted that it would be useful to identify the salient features for driver safety that could easily be extracted for short-term solutions. Finally, the issue of context was highlighted, specifically the balance of how much supporting context is needed to accompany extracted features so that data can be easily sent out to investigators as soon as possible.

State of the Practice

Wider group discussion initially focused on the state of the practice in feature extraction and how it can be applied to the extraction problems discussed at the workshop. It was noted that although some features are easy to obtain, it is more challenging if a precise orientation is required. The level of detail depends on the quality of image, desired features, and the purpose of those features. It was also noted that it would be useful to start out with a hierarchy of needs, even though currently not all of those needs can be met. For example, it would be useful to know exactly where a person is looking, but eye trackers do not currently exist to obtain those data (although obtaining head pose is possible).

Another question addressed how much information can be obtained from feature extraction. For example, knowing if hands are on or off the steering wheel would be particularly useful; however, it was also highlighted that the information must already be present in the existing naturalistic driving study data; if another camera is needed to collect those data, it is impossible to proceed in that direction.

One workshop participant explained that many hours are spent on crude video coding—for example, using head pose to calculate what the driver is doing. Despite this, coders cannot tell exactly where the driver's eyes are directed because of issues such as resolution, wearing glasses, and glare. It was suggested that dividing the cockpit into regions and identifying where the driver is looking at that moment in time would be good enough to perform some analysis. It was then explained that it currently takes approximately 1 minute to code 10 minutes of video, so to code hours of video takes considerable time.
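The idea of dividing the cockpit into regions can be illustrated with a minimal sketch that maps a head-pose estimate to a coarse gaze region. Everything here is an assumption for illustration: the region names, the angular boundaries, the sign convention (yaw positive toward the vehicle center for a left-hand-drive car, pitch positive upward), and the function names are hypothetical and were not specified at the workshop. Real boundaries would depend on camera placement and vehicle geometry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A coarse cockpit region bounded by head-pose angles in degrees."""
    name: str
    yaw: tuple    # (min, max), half-open interval [min, max)
    pitch: tuple  # (min, max), half-open interval [min, max)


# Hypothetical, non-overlapping boundaries chosen only for illustration.
REGIONS = (
    Region("forward_roadway", (-15, 15), (-10, 20)),
    Region("instrument_cluster", (-15, 15), (-40, -10)),
    Region("center_stack", (15, 50), (-40, -10)),
    Region("rearview_mirror", (15, 50), (5, 25)),
    Region("left_side", (-90, -15), (-40, 25)),
    Region("right_side", (50, 90), (-40, 25)),
)


def classify(yaw_deg, pitch_deg):
    """Return the name of the region containing the pose, or 'other'."""
    for region in REGIONS:
        in_yaw = region.yaw[0] <= yaw_deg < region.yaw[1]
        in_pitch = region.pitch[0] <= pitch_deg < region.pitch[1]
        if in_yaw and in_pitch:
            return region.name
    return "other"


def region_counts(samples):
    """Count how many (yaw, pitch) samples fall in each region."""
    return Counter(classify(yaw, pitch) for yaw, pitch in samples)
```

Applied to per-frame head-pose estimates, `region_counts` would give a rough tally of where the driver's head was directed, which is the kind of "good enough" coding described above rather than true eye tracking.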

Group discussion suggested that having a tool that could do that operation as well as, or better than, coders would be a big advantage. It was agreed that obtaining head pose from videos is currently possible and hand tracking can be performed very well with algorithms, so the technique would potentially be possible and reliably repeatable with such a computer-based approach. One final comment highlighted that a lot of open-source software already exists that can perform this automated processing but there is still a lot of research to be done depending on application.

Data Sources

Discussion also focused on the idea that the NDS dataset is rich with not just videos but other variables. It was suggested that when performing the video coding, if visual information is ambiguous about where a driver is looking it could be possible to go into other types of data to attempt to confirm where they are looking and disambiguate the video using the other data sources. Other data sources to help identify where a person is looking could include current speed, braking status, and weather conditions. An automated system capable of performing that task would be more in line with what a human does.

One approach could be to use an integrated dataset, composed of video and extracted features, that reduces the video to a simplified dataset that can be combined with other data. It was noted that computer vision researchers prefer data that are not reduced, preferring to go through the original data and perform the reduction that would provide the necessary analysis to pass on for behavior analysis.

Discussion then moved to the requirement for multidisciplinary teams to connect with the needs of the analysis. A two-step process was put forward: (1) produce features and analysis out of videos; (2) datamine these for behavioral analysis. It was noted that a team effort would be required to pull these two stages together.

Integrating Data

One workshop participant stated that a lot of the current focus is about what happens inside a vehicle; however, researchers are also interested in events outside the vehicle. In particular, the roadway features are crucial so, using the front view camera, it would be useful to link roadway data and onboard vehicle data acquisition system data.

One possible application could be to examine electronic billboards that may be distracting people with motion. It was noted that it would be useful if there were a feature that could be pulled out of the video to show the difference between a static and animated billboard and objectively assess how long a driver stares at those. A question on this subject focused on how hard it would be to automatically decode something that is moving versus a static sign. For example, would it be possible to automatically go through the video and find every sign, or to find someone doing certain maneuvers? It was explained that these are called gesture events, and there has been a lot of work in this field recently; however, event recognition is currently not as good as static object recognition.

Another suggestion was to develop the ability to pick out where a bicyclist is on the road to study how a driver interacts with a bicyclist, or bicycle lane, without trawling through trips manually. It was noted that these tasks do not have to be performed in real time and can all be processed offline after the data have been collected.

Data Reduction

One workshop participant stated that partial solutions are important and researchers should not attempt to process too much data in the first step. Features that are too high level should not be extracted at this time but should be saved for later. It was suggested that the first step should be to reduce the huge size of the video and remove the personally identifiable data and then let a researcher deal with low-level features. For example, just to focus on and study the eyes would be a huge reduction in the size of the data. Extracting some simple features would make the data more easily available.

Another workshop participant said that computer vision researchers want access to the original data. For example, with 1.5 million trip files, computer vision could automatically distinguish a bike from a jogger using existing examples; however, examples of where bikes show up in the data would be needed to train the computers to recognize that.

A potential application could be to study the difference between having a conversation with a passenger and using a hands-free device. It was suggested that it would be useful to objectively investigate if this changes where people look; however, to achieve this it will be necessary to identify when drivers are conversing within millions of trip files and then analyze the effect on driving. It was also noted that when using the term "reduced dataset," it does not mean picking and choosing which elements are important but instead means picking entire segments in which there appear to be instances of something that is useful.
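The notion of a reduced dataset as a selection of entire segments can be sketched as follows. This is an illustrative assumption, not a method described at the workshop: it supposes that some upstream detector has already produced one true/false flag per frame (for example, "driver appears to be conversing"), and the function name and `min_length` parameter are hypothetical.

```python
def find_segments(flags, min_length=1):
    """Return (start, end) index pairs for runs of truthy flags.

    `flags` is a sequence with one entry per frame. `end` is exclusive.
    Runs shorter than `min_length` frames are discarded.
    """
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a new run begins
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None  # the run has ended
    # Close a run that extends to the final frame.
    if start is not None and len(flags) - start >= min_length:
        segments.append((start, len(flags)))
    return segments
```

Each returned pair identifies a whole stretch of video to keep, so the reduction preserves every variable within the segment instead of choosing among them.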

Advances in Machine Learning

Workshop participants proceeded to discuss some of the recent advances in machine learning over the past 5 years. For example, if researchers wanted to find vehicles that had specific hand-held devices in them, is there something that is common among them that machine learning could recognize? It was suggested that instead of attempting to recognize what type of device a driver is holding, it would be better to look for the action of picking up a device. It would then be possible to go to the database and retrieve all similar events and look at when this happens and at what time. An example was given that if people do not turn their phones off, there is a tendency to visually check to see who is calling even if the phone does not get answered. If glancing at the phone supports the supposition that Bluetooth makes driving safer, it would be invaluable information for safety researchers and automakers.

Participants agreed that information on whether the driver is communicating, sitting, or stretching is all of use to researchers. It was noted that head pose and hand position can provide a good indication of what a driver is doing and a face camera can also say a lot about a driver. For example, an algorithm for facial expression could potentially identify and index each video sequence to state if the driver is happy or angry.

An important issue raised by workshop participants was the computational challenges involved in processing these data. The computational cost of running algorithms on such a large dataset requires an understanding of the machines and hardware required to process them. It was noted that although it takes time to perform offline learning to learn the features, once a feature is learned the classification does not take much time and can even be performed in real time by some systems. Even if classification algorithms ran in real time, there are over a million hours of video to process, so simply attempting to index a selection of five features would take a long time. Extracting features will require large computing infrastructure (such as multiple parallel processing) to process these big data, which relates to the wider issue of data reduction and segmenting the data into interesting pieces.
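The scale of the problem can be made concrete with a back-of-the-envelope estimate. The sketch below rests on simplifying assumptions that were not stated at the workshop: each feature is extracted in its own pass over the video, work divides evenly across workers, and `speed_factor` (hours of video processed per wall-clock hour, with 1.0 meaning real time) is the same for every feature. The function name and parameters are hypothetical.

```python
def wall_clock_hours(video_hours, n_features, speed_factor=1.0, n_workers=1):
    """Estimate wall-clock hours to extract all features from the video.

    Assumes one pass per feature and perfect scaling across workers.
    """
    if speed_factor <= 0 or n_workers < 1:
        raise ValueError("speed_factor must be positive and n_workers at least 1")
    return video_hours * n_features / (speed_factor * n_workers)


# One machine, real-time processing, one million hours, five features.
single = wall_clock_hours(1_000_000, 5)

# The same workload spread over a hypothetical 1,000 parallel workers.
parallel = wall_clock_hours(1_000_000, 5, n_workers=1000)
```

Under these assumptions a single real-time machine would need 5,000,000 hours (roughly 570 years), while 1,000 workers would need 5,000 hours (roughly 208 days), which is why parallel infrastructure and prior data reduction both matter.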

Next Steps

FHWA's Office of Safety R&D and SHRP2 will continue to pursue incremental and potentially breakthrough approaches to making the NDS data more accessible and useful. On the incremental end of the spectrum, possibilities include the concept of "remote secure enclaves"; reduced datasets for which privacy management issues could be mitigated; richer trip headers to help researchers identify data of interest; and the development of a database of linkages between trip data files and the roadway segments on which the trips occurred.

The EAR Program is considering plans for continuing to engage the community assembled at the workshop. This is anticipated to include some follow-up conversations and outreach in the upcoming year. In addition, the EAR Program is considering the release of a Broad Agency Announcement soliciting proposals for automated video feature extraction sometime in 2013.