In conjunction with the TensorFlow 2.0 alpha release and our TensorFlow Dev Summit series, we invite you to enter our TensorFlow Edge Kit Giveaway. Winners will receive a gift box from Google containing some fun toys, including the new Coral Edge TPU device and the SparkFun Edge development board powered by TensorFlow.

The giveaway will remain open until April 8th, so get your entries in ASAP!

While entering is simple, we’re being picky, so no two-word responses and be SPECIFIC! Give it some thought, and get that entry in!


## Intro
There is a 4-way intersection near my apartment in Midtown Atlanta, GA, which only has stop signs in 2 of the 4 directions. However, the residential surroundings and typical vehicle speed at this intersection make it extremely ambiguous to drivers whether it is a 4-way stop or not. In addition to numerous close calls and actual vehicle accidents, it has also been a very dangerous intersection for pedestrians needing to cross any of the 4 ways. There are crosswalks throughout, but no pedestrian traffic signal, so pedestrians usually have to play chicken at this residential intersection! The objective of this effort is to ask the local government to make appropriate changes to the intersection in order to reduce its ambiguity and/or danger.

This creates a perfect opportunity to combine embedded systems and machine learning at the edge to make a meaningful impact in a local, DIY manner. Instead of reaching out to the local government and describing the issue without proof, it would be much more significant to provide audio and visual evidence of the dangerous situations that this intersection is creating. One such system to create this evidence would involve detecting a dangerous situation via audio sensing and recording evidence of the situation by saving audio and visual (image and/or video) clips of the detected event.

The main components of the system are detailed below: (1) detect the event via audio sensing, and (2) record audio and visual evidence of the event.

## (1) Detect the event via audio sensing
– Stay efficient by dynamically deciding when to “wake up” and perform a more fine-grained extraction and prediction from the audio. This logic would classify normal background noise vs an approaching car, which is a straightforward learning task due to the consistent high-frequency noise that car tires on pavement create. Since this is not an overly complex task, it should be realizable by a more classical (and thus more computationally efficient) model such as an energy-based GMM or a kernel SVM.
– When the system is “awake”, a model (likely a CNN variant) would then perform frame-level or chunk-level classification on the audio stream, looking for events like car horn, tire screech, human yelling, vehicle collision, etc. I haven’t verified this, but I recall seeing most (if not all) of these classes in Google’s AudioSet dataset.
– Depending on the final set of classes of interest and the available set of LEDs, on-board or peripheral LEDs can be used to indicate either (1) that any of the classes was predicted (i.e. a binary decision irrespective of _which_ class) or (2) which specific class was predicted.
– An optional step here would be to send the range of audio that led to a positive prediction back to a more powerful “server”, where a more complex model then makes a further decision on whether the claimed class of interest is indeed present or not. In this scenario, the edge model would favor recall, in order to ensure the more complex server model is able to predict on the true occurrences of the classes of interest. However, the edge model would also act as a filter for the server model, only sending it audio that has some reasonable probability of containing a class of interest. Naturally, the server model would favor precision, in order to avoid a lot of manual effort for a human to cull the positive predictions before creating the final set of verified events.
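
The detection flow above can be sketched roughly as follows. This is a minimal illustration, not a tested design: it swaps the proposed GMM/SVM wake-up model for a simple adaptive energy threshold, and the names `WakeGate`, `cascade`, `edge_score`, `server_score`, and all threshold values are hypothetical. The two scoring callables stand in for the edge CNN and the larger server model.

```python
import math
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]  # one chunk of audio samples scaled to [-1.0, 1.0]


def rms_energy_db(frame: Frame) -> float:
    """Mean-square energy of one frame in dB relative to full scale."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)  # epsilon avoids log10(0)


class WakeGate:
    """Hypothetical wake-up stage: a simple adaptive energy threshold,
    used here in place of the GMM/SVM mentioned in the text."""

    def __init__(self, margin_db: float = 10.0, smoothing: float = 0.95):
        self.margin_db = margin_db      # assumed margin above background
        self.smoothing = smoothing      # assumed noise-floor smoothing factor
        self.noise_floor_db: Optional[float] = None

    def is_awake(self, frame: Frame) -> bool:
        level = rms_energy_db(frame)
        if self.noise_floor_db is None:
            self.noise_floor_db = level  # first frame seeds the noise floor
            return False
        awake = level > self.noise_floor_db + self.margin_db
        if not awake:
            # Track the background level using quiet frames only.
            self.noise_floor_db = (self.smoothing * self.noise_floor_db
                                   + (1.0 - self.smoothing) * level)
        return awake


def cascade(frames: Sequence[Frame],
            gate: WakeGate,
            edge_score: Callable[[Frame], float],
            server_score: Callable[[Frame], float],
            edge_threshold: float = 0.3,
            server_threshold: float = 0.8) -> List[int]:
    """Return indices of frames that pass the gate, the edge model,
    and the server model, in that order."""
    confirmed = []
    for i, frame in enumerate(frames):
        if not gate.is_awake(frame):
            continue                      # stay asleep: skip the costly models
        if edge_score(frame) < edge_threshold:
            continue                      # low threshold: edge favors recall
        if server_score(frame) >= server_threshold:
            confirmed.append(i)           # high threshold: server favors precision
    return confirmed
```

The low edge threshold and high server threshold mirror the recall/precision split described above: the edge stage forwards anything plausible, and the server stage only confirms what it is confident about. Suitable values would have to be chosen from real recordings.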

## (2) Record audio and visual evidence of the event
– With a positive event detected, the edge device would use a peripheral OV7670 camera to take a series of photos or a video for a configured amount of time, in order to further confirm and convey the event. While more expensive, a video stream would enable further downstream processing, if desired.
– With the on-board Bluetooth, the edge device can then send audio and photo/video of the claimed event back to a server, for a human (or more complex model) to further verify. Alternatively, the data could simply be written to external memory, if attached.
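
One way to capture evidence "for a configured amount of time" is a small ring buffer that keeps the most recent frames, so the saved clip includes what happened just before the detection as well as a short period after it. The sketch below is only an assumption about how this might be structured: `EvidenceRecorder`, `pre_frames`, and `post_frames` are hypothetical names, a "frame" can be an audio chunk or a camera image, and the actual OV7670 capture and Bluetooth transfer are left out.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class EvidenceRecorder:
    """Hypothetical clip recorder with pre-roll and post-roll."""

    def __init__(self, pre_frames: int = 3, post_frames: int = 2):
        # Ring buffer: old frames fall off automatically once it is full.
        self._pre: Deque[Any] = deque(maxlen=pre_frames)
        self.post_frames = post_frames
        self._clip: Optional[List[Any]] = None  # clip currently being recorded
        self._remaining = 0                     # post-roll frames still to record

    def push(self, frame: Any, detected: bool) -> Optional[List[Any]]:
        """Feed one frame; return the finished clip once post-roll ends."""
        if self._clip is None:
            if not detected:
                self._pre.append(frame)         # idle: just keep recent history
                return None
            # Detection: start a clip from the buffered history plus this frame.
            self._clip = list(self._pre) + [frame]
            self._pre.clear()
            self._remaining = self.post_frames
        else:
            self._clip.append(frame)
            if detected:
                self._remaining = self.post_frames  # event continues: extend
            else:
                self._remaining -= 1
        if self._remaining <= 0:
            clip, self._clip = self._clip, None
            return clip                         # ready to send or write to storage
        return None


# Usage: integers stand in for frames; the event is detected at frame 3.
recorder = EvidenceRecorder(pre_frames=2, post_frames=2)
clips = []
for i in range(7):
    clip = recorder.push(i, detected=(i == 3))
    if clip is not None:
        clips.append(clip)
```

In this usage example the single returned clip holds frames 1 through 5: two frames of history, the detection frame, and two frames of post-roll.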

## Note of General Application
While tailored to a specific use case right outside my apartment window, this proposed setup could really be used for anything that involves audio-based event detection with visual data as a supplement.