Cross-modality generation is an emerging topic that aims to synthesize data in one modality based on information in a different modality. In this work we consider one such task: given an arbitrary speech audio clip and a single face image of an arbitrary target identity, generate synthesized facial movements of the target identity saying the speech. To perform well on this task, a model must not only address the retention of the target identity, the photo-realism of the synthesized images, and the consistency and smoothness of the face images in a sequence, but, more importantly, learn the correlations between the audio speech and the lip movements. To solve these problems collectively, we explore how best to model the audio-visual correlations in building and training a lip-movement generator network. Specifically, we devise a method to fuse audio and image embeddings to generate multiple lip images at once, and propose a novel method to synchronize lip changes with speech changes. Our model is trained end-to-end and is robust to view angles and different facial characteristics. Thorough experiments on diverse images, ranging from male and female faces to cartoon characters and even animals, show that our model is robust and useful. Our model is trained on English speech but can be tested on other languages such as German and Chinese. For the demo video, please refer to https://youtu.be/mmI31GdGL5g.
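The fusion step described above can be sketched as follows. This is a minimal stand-in, not the actual architecture: the embedding sizes, the linear decoder, and the number of generated frames are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not the paper's actual architecture.
AUDIO_DIM, ID_DIM = 64, 64
N_FRAMES, H, W = 5, 32, 32                       # generate 5 lip frames at once
W_dec = rng.standard_normal((AUDIO_DIM + ID_DIM, N_FRAMES * H * W)) * 0.01

def fuse_and_decode(audio_emb, id_emb):
    """Concatenate the audio and identity embeddings and decode the fused
    vector into a short sequence of lip frames (toy stand-in decoder)."""
    fused = np.concatenate([audio_emb, id_emb])  # (128,) fused embedding
    frames = np.tanh(fused @ W_dec)              # pixel values in [-1, 1]
    return frames.reshape(N_FRAMES, H, W)

frames = fuse_and_decode(rng.standard_normal(AUDIO_DIM),
                         rng.standard_normal(ID_DIM))
```

In the real model the decoder is a trained network rather than a random matrix, but the key idea is the same: one fused audio/identity vector conditions the generation of several lip frames at once.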

CARLA is an open-source simulator for autonomous driving research. It follows a client-server architecture, where the server runs the physics and sensor simulations and the clients run the AI drivers. Server and clients exchange sensor data (images, point clouds), ground truth (depth, semantic segmentation, GPS, 3D bounding boxes), privileged information (e.g. traffic infractions, collisions), and vehicle commands/state to support the training and testing of AI drivers. CARLA allows users to specify the sensor suite and environmental conditions such as weather, illumination, and number of traffic participants, and it includes benchmarks that make it possible to compare different AIs under the same conditions. In line with popular real-world datasets such as KITTI and Cityscapes, CARLA also supports the development of vision-based algorithms for 2D/3D object detection, depth estimation, semantic segmentation, visual SLAM, tracking, etc.

CARLA was born to democratize research on autonomous driving. Not only is the source code open and free to use, modify, and redistribute, but so are the 3D assets used to build the cities, i.e. buildings, roads, sidewalks, pedestrians, cars, bikes, motorbikes, etc. CARLA recently became a member of the Open Source Vision Foundation (OSVF), making it a sister project of OpenCV and Open3D.

In the seminal CARLA paper (CoRL 2017), we studied the performance of three vision-based approaches to autonomous driving. Since its public release in November 2017, many users have joined the CARLA GitHub community and have contributed additional functionality that was not part of the original release, e.g. a LIDAR sensor and a ROS bridge. Recent interesting works use CARLA to propose new vision-based approaches to autonomous driving, for instance:

The CARLA premiere video and more videos can be found on our YouTube channel: https://www.youtube.com/channel/UC1llP9ekCwt8nEJzMJBQekg

We will showcase CARLA in real time, also showing videos of the most interesting papers to date that use CARLA to perform vision-based autonomous driving. Moreover, we will explain CARLA's development road-map. Attached to the email is a snapshot of the CARLA environment running in real time. In the background is the view of CARLA's server, showing a vision-based AI driver controlling a car. In the foreground is the view of a CARLA client, showing an on-board image with its depth and semantic segmentation, as well as a map placing the different active cars in the simulation.

A vehicle on a road or a robot in the field does not need a full-blown 3D depth sensor to detect potential collisions or monitor its blind spot. Instead, it needs to only monitor if any object comes within its near proximity, which is an easier task than full depth scanning. We introduce a novel device that monitors the presence of objects on a virtual shell near the device, which we refer to as a light curtain. Light curtains offer a light-weight, resource-efficient and programmable approach for proximity awareness for obstacle avoidance and navigation. They also have additional benefits in terms of improving visibility in fog as well as flexibility in handling light fall-off. Our prototype for generating light curtains works by rapidly rotating a line sensor and a line laser, in synchrony. The device can generate light curtains of various shapes with a range of 20-30m in sunlight (40m under cloudy skies and 50m indoors) and adapts dynamically to the demands of the task. This interactive demo will showcase the potential of light curtains for applications such as safe-zone monitoring, depth imaging, and self-driving cars. This research was accepted for oral presentation at ECCV 2018.
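Conceptually, each point on the curtain is the triangulated intersection of a camera pixel's viewing ray with the plane currently swept by the line laser; rotating the two in synchrony sweeps that intersection along the desired shell. A minimal sketch of the triangulation, with entirely hypothetical geometry:

```python
import numpy as np

def ray_plane_intersection(origin, direction, plane_point, plane_normal):
    """Point where a camera pixel's ray meets the plane swept by the line laser."""
    direction = direction / np.linalg.norm(direction)
    t = np.dot(plane_normal, plane_point - origin) / np.dot(plane_normal, direction)
    return origin + t * direction

# Hypothetical geometry: camera at the origin, laser light plane at x = 1 m.
p = ray_plane_intersection(origin=np.zeros(3),
                           direction=np.array([1.0, 0.0, 1.0]),
                           plane_point=np.array([1.0, 0.0, 0.0]),
                           plane_normal=np.array([1.0, 0.0, 0.0]))
# p lies on the curtain at (1, 0, 1): 1 m to the side, 1 m ahead.
```

Light is only imaged where the ray and the plane intersect, which is what restricts sensing to the programmed curtain surface rather than the full scene volume.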

Eyes of Things (EoT) (www.eyesofthings.eu) is an Innovation Project funded by the European Commission within the Horizon 2020 Framework Programme for Research and Innovation. The objective of EoT has been to build an optimized core vision platform that can work independently and can also be embedded into all types of artefacts. The platform has been optimized for high performance, low power consumption, size, cost, and programmability. EoT aims to be a flexible platform for OEMs to develop computer vision-based products and services in a short time.

The functionality of the EoT device is demonstrated with some cool applications:

• The Next Generation Museum Guide demonstrator. In this demonstrator, the EoT device is inside a headset which automatically recognizes the painting the visitor is looking at and then provides information about the painting via audio. Video: https://www.youtube.com/watch?v=QR5LoKMdQ8c
• The Smart Doll with Emotion Recognition demonstrator embeds an EoT device inside a doll’s head. Facial emotion recognition has been implemented so that the doll can assess a child’s emotional display and react accordingly through audio feedback. This demonstrator uses deep learning inference for facial emotion recognition. All processing is done on the EoT board, powered by a LiPo battery. Video: https://www.youtube.com/watch?v=v3YtUWWxiN0
• Flexible Mobile Camera: This demonstrator is actually a set of functionalities useful for surveillance, including additional functionality provided in the cloud using images captured by the EoT device. Video: https://www.youtube.com/watch?v=JXKmmEsww5Q. An incarnation of this demonstrator is the ‘Litterbug’ application, which aims to detect illegal littering. Video: https://www.youtube.com/watch?v=dR-v17YuOcg

3D object tracking is an essential part of the cinema visual effects pipeline. It is used for tasks including color correction, stereo conversion, object replacement, and texturing. Typical tracking conditions in this field are characterized by varied motion patterns and complicated scenes. We present a demo of 3D object tracking software dedicated to visual effects in cinema. The software exploits ideas presented in the ECCV 2018 paper ‘Combining 3D Model Contour Energy and Keypoints for Object Tracking’, which describes an approach to monocular model-based 3D object pose estimation. A preliminary object pose is found using a keypoint-based technique; the initial pose is then refined by optimizing a contour energy function. The energy measures the degree of correspondence between the contour of the model projection and edges in the image. Contour energy optimization does not require preliminary training, which allows it to be integrated easily into a visual effects production pipeline. We use this method to improve tracking and to simplify user-defined object positioning via automatic pose estimation. The approach was tested on numerous real-world projects and on the public OPT benchmark dataset.
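One simple way to realize such a contour energy (not necessarily the paper's exact formulation) is the mean distance from each point of the projected model contour to the nearest detected image edge, which is minimal when projection and edges align:

```python
import numpy as np

def contour_energy(edge_pixels, contour_points):
    """Mean distance from each projected model-contour point to its nearest
    image edge pixel; lower energy means better contour/edge alignment."""
    d = np.linalg.norm(contour_points[:, None, :].astype(float)
                       - edge_pixels[None, :, :].astype(float), axis=2)
    return d.min(axis=1).mean()

# Toy example: a vertical image edge at x = 5 (10 edge pixels).
edge_pixels = np.stack([np.full(10, 5), np.arange(10)], axis=1)
aligned   = contour_energy(edge_pixels, np.array([[5, 2], [5, 7]]))  # on the edge
misplaced = contour_energy(edge_pixels, np.array([[3, 2]]))          # 2 px away
```

A pose optimizer would perturb the 6D pose, re-project the model contour, and descend on this energy; in practice a distance transform of the edge map replaces the brute-force nearest-edge search.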

The ability to infer depth from a single image in an unsupervised manner is highly desirable in applications such as augmented reality, robotics, and autonomous driving. This is a very challenging task, and the advent of deep learning has enabled tackling it with excellent results. Recently, we showed in [1] how, by designing thin architectures, accurate monocular depth estimation can be carried out in real time on devices with standard CPUs and even on low-powered devices (as reported in this video: https://www.youtube.com/watch?v=Q6ao4Jrulns). For instance, our model infers depth on a Raspberry Pi 3 at about 2 fps [1] with a negligible loss of accuracy compared to the state-of-the-art monocular method of Godard et al. (CVPR 2017).
More recently, we proposed in [2] a novel methodology for unsupervised monocular depth estimation that achieves better accuracy than the state of the art. The network also synthesizes two novel views, never seen at training time, that can be used for interesting additional purposes. To prove the effectiveness of our network for this latter task, given the input image and the two synthesized views, we feed different combinations of stereo pairs (in the video, synthesized left plus input image, and synthesized left plus synthesized right) to a popular stereo algorithm (SGM in the attached video), achieving consistent results.
The live demo will show that fast and accurate monocular depth estimation is feasible even on standard CPU-based architectures, including embedded devices. Moreover, our network [2] can infer novel synthesized views from a single input video stream. For these reasons, we think several application domains would benefit from these achievements, and that the demo will attract the interest of ECCV 2018 attendees. We plan to organize a live demo using standard computing devices (notebooks, embedded devices, etc.).

In our demo, we show real-time object detection and 6D pose estimation of multiple objects from single RGB images. The algorithm is built upon the main results of our paper, which is presented during the oral session ("Implicit 3D Orientation Learning for 6D Object Detection from RGB Images", ID: 2662). The considered objects contain little texture and include (view-dependent) symmetries. These circumstances cause problems for many existing approaches. Our so-called 'Augmented Autoencoders' are trained solely on synthetic RGB data rendered from a 3D model using Domain Randomization. Thus, we do not require real pose-annotated image data and generalize to various test sensors and environments. Furthermore, we also run the demo on an embedded Nvidia Jetson TX2 to demonstrate the efficiency of our approach.

The proposed demo is a prototype of 3D scanner that uses photometric imaging in the near field for highly accurate shape reconstructions. It consists of a set of white-light LEDs synchronised with an RGB camera through a microcontroller. The 3D shape is retrieved by inferring 3D geometry from shading cues of an object lit by calibrated light sources.

The novelty of this prototype with respect to the state of the art is its capability of working in the near field. The advantage of having the inspected object close to the device is twofold. Firstly, very high spatial frequencies can be retrieved, with the limit of precision being around 50 microns (using an 8mm lens and a 3.2MP camera at around 4cm from the object). Secondly, the proximity of the light sources to the object allows a higher signal-to-noise ratio with respect to the ambient lighting. This means that, differently from other photometric-imaging-based 3D scanners, our prototype can be used in an open environment.
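A back-of-the-envelope check makes the stated precision plausible. The pixel pitch below is an assumption (a typical value for a 3.2MP machine-vision sensor), not a figure from the prototype:

```python
# Object-space sampling for the stated near-field setup (thin-lens approximation).
focal_mm    = 8.0     # 8 mm lens
distance_mm = 40.0    # ~4 cm working distance
pixel_um    = 2.5     # assumed pixel pitch of the 3.2 MP sensor (hypothetical)

magnification   = focal_mm / (distance_mm - focal_mm)  # = 0.25
object_pixel_um = pixel_um / magnification             # one pixel's footprint on the object
# object_pixel_um = 10.0, i.e. ~10 µm per pixel, so ~50 µm precision spans a few pixels
```

Under these assumptions each pixel covers roughly 10 µm of the object, so a 50-micron precision limit corresponds to a few pixels of reconstruction uncertainty.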

The acquisition process consists of a number of LEDs (typically 8) flashing in sequence while the camera captures a RAW image per LED. The acquisition speed is governed by the camera framerate. For the proposed demo, acquisition runs at ~40fps (i.e. ~25ms per LED).

The implementation of the 3D shape reconstruction runs on a laptop using an embedded GPU. The output consists of a .stl mesh having a number of vertices proportional to the number of pixels of the camera.

There is increasing concern about computer vision devices invading the privacy of their users by recording unwanted videos. On the one hand, we want camera systems/robots to recognize important events and assist human daily life by understanding their videos; on the other hand, we also want to ensure that they do not intrude on people's privacy. In this demo, we present a new principled approach for learning a video face anonymizer. We use an adversarial training setting in which two competing systems fight: (1) a video anonymizer that modifies the original video to remove privacy-sensitive information (i.e., human faces) while still trying to maximize spatial action detection performance, and (2) a discriminator that tries to extract privacy-sensitive information from such anonymized videos. The end result is a video anonymizer that performs a pixel-level modification to anonymize each person's face, with minimal effect on action detection performance. We experimentally confirm the benefit of our approach compared to conventional hand-crafted video face anonymization methods, including masking, blurring, and noise adding. See the project page https://jason718.github.io/project/privacy/main.html for a demo video and more results.


Inner Space Preserving Generative Pose Machine

Shuangjun Liu, Sarah Ostadabbas (Northeastern University)

Photographs are important because they seem to capture so much: in the right photograph we can almost feel the sunlight, smell the ocean breeze, and see the fluttering of the birds. And yet, none of this information is actually present in a two-dimensional image. Our human knowledge and prior experience allow us to recreate "much" of the world state (i.e. its inner space) and even fill in missing portions of occluded objects in an image, since the manifold of probable world states has a lower dimension than the world state space. Like humans, deep networks can use context and learned "knowledge" to fill in missing elements. But more than that, if trained properly, they can modify (repose) a portion of the inner space while preserving the rest, allowing us to significantly change portions of the image. In this work, we present a novel deep learning based generative model that takes an image and a pose specification and creates a similar image in which a target element is reposed.

In reposing a figure there are three goals: (a) the output image should look like a realistic image in the style of the source image, (b) the figure should be in the specified pose, and (c) the rest of the image should be as similar to the original as possible. Generative adversarial networks (GANs) are the "classic" approach to solving the first goal by generating novel images that match a certain style. The second goal, putting the figure in the correct pose, requires a more controlled generation approach, such as conditional GANs (cGANs). At a superficial level, this seems to solve the reposing problem. However, these existing approaches generally either focus on preserving the image (goal c) or generating an entirely novel image based on the contextual image (goal b), but not both.

We address the problem of articulated figure reposing while preserving the image's inner space (goals b and c) via our inner space preserving generative pose machine (ISP-GPM), which generates realistic reposed images (goal a). In ISP-GPM, an interpretable low-dimensional pose descriptor (LDPD) is assigned to the specified figure in the 2D image domain; altering the LDPD reposes the figure. For image regeneration, we use a stack of augmented hourglass networks in the cGAN framework, conditioned on both the LDPD and the original image. Furthermore, we extend the "pose" concept to a more general form that is no longer a simple rotation of a single rigid body, but rather the relative relationship between all the physical entities captured in an image, including its background. A direct outcome of ISP-GPM is that, by altering the pose state in an image, we can achieve unlimited generative reinterpretations of the original world, which ultimately leads to one-shot data augmentation with the original image's inner space preserved.

Dynamic Multimodal Segmentation in the Wild using a Scalable and Distributed Deep Learning Architecture

Edgar Andres Margffoy Tuay (Universidad de los Andes, Colombia)

A scalable, highly available, and generic deep learning service architecture for mobile/web applications is presented. This architecture integrates with modern cloud services, such as cloud storage and push notifications via cloud messaging (Firebase), as well as with established distributed technologies based on the Erlang programming language.

To demonstrate this approach, we present a novel Android application that performs object instance segmentation based on natural language expressions. The application allows users to take a photo using the device camera or to pick an existing image from the gallery. Finally, we also present a novel web client that performs similar functions, is supported by the same architecture backbone, and integrates with state-of-the-art technologies and specifications such as ES6 JavaScript and WebAssembly.


Single Image Water Hazard Detection using FCN for Autonomous Car

Xiaofeng Han, Chuong Nguyen, Shaodi You, Jianfeng Lu (CSIRO data61)

Water bodies, such as puddles and flooded areas, on and off road pose significant risks to autonomous cars. Detecting water from a moving camera is a challenging task, as the water surface is highly reflective and its appearance varies with viewing angle, surrounding scene, and weather conditions.

We will present a live demo (running on a GPU laptop) of our water puddle detection method based on a Fully Convolutional Network (FCN) with our newly proposed Reflection Attention Units (RAUs). An RAU is a deep network unit designed to embody the physics of reflection on a water surface from the sky and nearby scene. We show that FCN-8s with RAUs significantly improves precision and recall compared to FCN-8s, DeepLab V2, and a Gaussian Mixture Model (GMM).

The demo will show a system implemented on a mobile phone that automatically localises and displays the position of 3D objects using RGB images only. In more detail, the algorithm reconstructs the position and occupancy of rigid objects from a set of object detections in a video sequence and the respective camera poses captured by the smartphone. In practice, the user scans the environment by moving the device over the space and then automatically receives object proposals from the system, together with their localization in 3D. Technically, the algorithm first fits an ellipse onto the image plane at each bounding box given by the object detector. We then infer the general 3D spatial occupancy and orientation of each object by estimating a quadric (ellipsoid) in 3D given the conics in the different views. The use of a closed-form solution offers a fast method which can be used in situ to construct a 3D representation of the recorded scene. The system also allows labelling the objects with additional information like personalised notes, videos, images, and HTML links. At the moment, the implementation uses a Tango phone, but ongoing work is extending the system to ARCore- and ARKit-enabled smartphones.

The system can be used as a content-generation tool for augmented reality, as we can provide additional information anchored to physical objects. Alternatively, the system can be employed as an annotation tool for creating datasets, as we record different types of data (images, depth, object positions, and the camera trajectory in 3D). Ongoing research activities are embedding our system into various robotic platforms at the Italian Institute of Technology (iCub, Centaur, Coman, Walkman) in order to provide robots with object perception and reasoning capabilities.

For a smart factory, we need to monitor the status of analog instrument sensors (idle/normal/danger) automatically and in real time. To achieve this goal, we perform several tasks: (1) detect the instrument, (2) detect numerals, (3) segment and recognize digits, (4) detect the needle, and (5) compute the needle value and decide the instrument status. Previous approaches commonly used feature-based methods, so they could detect and recognize instruments only for predefined shapes, with low performance. To overcome these limitations, we propose deep learning based approaches. First, the instrument detector network is similar to the existing YOLOv2 architecture, but we use only one anchor box rather than the original five, because instruments commonly have one of two shapes: circle or rectangle. To train the detection model, we prepared a large instrument image dataset that includes a total of 18,000 training samples generated by data augmentation and 10,000 artificial samples generated by BEGAN. Second, the numeral detector network adopts the TextBoxes++ network, a state-of-the-art text detector based on SSD; it detects numerals accurately and quickly, with quadrilateral bounding boxes, even under poor conditions. Third, the numeral recognition network consists of two functional blocks: (1) it separates the detected numeral region into single digits using the maximally stable extremal regions (MSER) algorithm, and (2) it feeds each segmented digit to a CNN with five convolutional layers and two fully connected layers, which classifies each patch into a digit value (0-9). We then obtain the number corresponding to each numeral bounding box by regrouping the recognized digits. Fourth, we find the needle using the EDLines edge detection algorithm and non-maximum suppression, obtaining the two major end-points of the needle.
Finally, we calculate the two angles between the needle line and the directional lines pointing to the smallest and largest numbers. From these angles and the two numbers, we estimate the value pointed to by the needle. Based on the estimated needle value, we decide the instrument status. The proposed instrument monitoring system can detect instruments in an image captured by a camera and decide the current instrument status automatically in real time. We achieve 93.8% accuracy over the test dataset at 15 fps on an Nvidia Titan Xp.
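The final step amounts to linear interpolation of the needle angle between the scale's end angles, followed by thresholding into a status. A minimal sketch; the angles, scale range, and status thresholds below are hypothetical:

```python
def needle_value(angle, angle_min, angle_max, value_min, value_max):
    """Linearly interpolate the reading from the needle angle between the
    angles of the smallest and largest scale numbers."""
    frac = (angle - angle_min) / (angle_max - angle_min)
    return value_min + frac * (value_max - value_min)

def instrument_status(value, normal_lo, danger_lo):
    """Map a reading to idle/normal/danger using hypothetical thresholds."""
    if value < normal_lo:
        return "idle"
    return "normal" if value < danger_lo else "danger"

# Needle at 135 deg on a scale running from 45 deg (value 0) to 315 deg (value 100):
v = needle_value(135.0, 45.0, 315.0, 0.0, 100.0)  # one third of full scale
```

In the real system the two end angles come from the detected numeral positions and the needle angle from the detected end-points, but the interpolation itself is this simple.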

In this demo, we introduce various computational photography tools based on deep learning. They have been developed in pursuit of our ultimate goal of building a computational photography software library, which we call "COUPE." Currently COUPE offers several deep learning based image processing utilities, such as image aesthetic assessment, image composition assessment and enhancement, color transfer and enhancement, and non-blind deconvolution.

In the demo session, we will mainly show our recent developments on perceptual image super-resolution and depth image refinement, which will also be presented as posters at ECCV 2018. In addition, we will introduce a demo website where visitors can interactively run several computational photography software to obtain the results for their own images. A demo poster will provide more information on the software as well.

Recent work has shown that optical flow estimation can be formulated as an end-to-end supervised learning problem, which yields estimates with a superior accuracy-runtime trade-off compared to alternative methods. However, for practical applications (e.g. autonomous driving), a major concern is how much the estimates can be trusted, and in the past very little work has been done in this direction. In this work we show how state-of-the-art uncertainty estimates for optical flow can be obtained with CNNs. Our uncertainties generalize well to real-world videos, including challenges for optical flow such as self-occlusion, homogeneous areas, and ambiguities. At the demo we will show this live and in real time.
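One common way to obtain such uncertainties (a sketch of the general technique, not necessarily this work's exact formulation) is to have the CNN predict, per pixel, both the flow and the scale of a Laplacian error distribution, and train with the corresponding negative log-likelihood:

```python
import numpy as np

def laplacian_nll(err_abs, log_b):
    """Negative log-likelihood of a Laplacian with predicted scale b = exp(log_b).
    Minimizing this rewards the network for predicting a large b (high
    uncertainty) exactly where its flow error tends to be large."""
    return err_abs / np.exp(log_b) + log_b + np.log(2.0)

# Same 1 px flow error: a confident prediction (b = 0.1) is penalized far
# more than an uncertain one (b = 1.0).
confident = laplacian_nll(1.0, np.log(0.1))
uncertain = laplacian_nll(1.0, np.log(1.0))
```

At test time the predicted scale map serves directly as a per-pixel confidence measure for the flow field.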


Multi-Frame Quality Enhancement for Compressed Video

Ren Yang, Mai Xu, Zulin Wang, Tianyi Li (Beihang University)

The past few years have witnessed great success in applying deep learning to enhance the quality of compressed images/video. Existing approaches mainly focus on enhancing the quality of a single frame, ignoring the similarity between consecutive frames. This demo illustrates a novel Multi-Frame Quality Enhancement (MFQE) approach for compressed video, proposed in our CVPR'18 paper. In this work, we observe that heavy quality fluctuation exists across compressed video frames, and thus low-quality frames can be enhanced using neighboring high-quality frames; we call this Multi-Frame Quality Enhancement (MFQE). Accordingly, this work proposes an MFQE approach for compressed video, as a first attempt in this direction. In our approach, we first develop a Support Vector Machine (SVM) based detector to locate Peak Quality Frames (PQFs) in the compressed video. Then, a novel Multi-Frame Convolutional Neural Network (MF-CNN) is designed to enhance the quality of the compressed video, in which each non-PQF and its nearest two PQFs serve as the input. The MF-CNN compensates motion between the non-PQF and the PQFs through the Motion Compensation subnet (MC-subnet). Subsequently, the Quality Enhancement subnet (QE-subnet) reduces the compression artifacts of the non-PQF with the help of its nearest PQFs. The experimental results validate the effectiveness and generality of our MFQE approach in advancing the state of the art in quality enhancement of compressed video.
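Selecting the MF-CNN input triplet reduces to finding, for each non-PQF, the nearest PQF on either side. A minimal sketch of that lookup (the frame indices are illustrative):

```python
import bisect

def nearest_pqfs(pqf_indices, frame):
    """For a non-PQF frame, return the nearest PQF before and after it;
    the triplet (previous PQF, non-PQF, next PQF) forms the MF-CNN input.
    pqf_indices must be sorted ascending."""
    i = bisect.bisect_left(pqf_indices, frame)
    prev_pqf = pqf_indices[i - 1] if i > 0 else None
    next_pqf = pqf_indices[i] if i < len(pqf_indices) else None
    return prev_pqf, next_pqf

# PQFs detected (e.g. by the SVM) at frames 0, 4 and 9; frame 6 is a non-PQF.
pair = nearest_pqfs([0, 4, 9], 6)
```

The `None` cases cover non-PQFs before the first or after the last detected PQF, where only one high-quality neighbor is available.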

This demo consists of a novel framework for unsupervised learning of optical flow for event cameras that learns from only the event stream. Event cameras are a novel sensing modality that asynchronously tracks changes in log light intensity. When a change is detected, the camera immediately sends an event, consisting of the x,y pixel position of the change, a timestamp accurate to microseconds, and a polarity indicating the direction of the change. The cameras provide a number of benefits over traditional cameras, such as low latency, the ability to track very fast motions, very high dynamic range, and low power consumption.

Our work in this demo allows us to show some of the capabilities of algorithms on this camera, by using a trained neural network to predict optical flow in very challenging environments from an event camera. Similar to EV-FlowNet, this work consists of a fully convolutional network that takes in a discretized representation of the event stream, and learns to predict optical flow for each event in a fully unsupervised manner. However, in this work, we propose a novel input representation consisting of a discretized 3D voxel grid, and a loss function that allows the network to learn optical flow from only the event stream by minimizing the motion blur in the scene (no grayscale frames needed).
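A voxel-grid input of this kind can be sketched as follows; this is a generic version of the idea (bin counts and resolution are illustrative), with each event's polarity split between its two nearest temporal bins:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Discretize an event stream of (x, y, t, polarity) rows into a
    (num_bins, H, W) voxel grid, splitting each event's polarity between
    its two nearest temporal bins (bilinear interpolation in time)."""
    grid = np.zeros((num_bins, height, width))
    x, y, t, p = events.T
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    lo = np.floor(t_norm).astype(int)          # earlier of the two bins
    hi = np.clip(lo + 1, 0, num_bins - 1)      # later bin, clamped at the end
    w_hi = t_norm - lo                         # weight of the later bin
    np.add.at(grid, (lo, y.astype(int), x.astype(int)), p * (1 - w_hi))
    np.add.at(grid, (hi, y.astype(int), x.astype(int)), p * w_hi)
    return grid

events = np.array([[2.0, 3.0, 0.00,  1.0],   # (x, y, t, polarity)
                   [2.0, 3.0, 0.10, -1.0]])
grid = events_to_voxel_grid(events, num_bins=3, height=5, width=5)
```

Unlike collapsing all events into a single image, this representation preserves the temporal distribution of events, which is what lets the network reason about motion blur when learning flow.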

Our network runs on a laptop-grade NVIDIA GTX 960M at 20Hz, and is able to track very fast motions, such as objects spinning at 40rad/s, as well as operate in very challenging and varying lighting conditions, all in real time. The network, trained on only 10 minutes of driving sequences, generalizes to a variety of different scenes without any noticeable outliers. We encourage audience participation.

We present a demo of our automatic annotation tool used for the creation of the "JTA" Dataset (Fabbri et al. "Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World". ECCV 2018). This tool allows you to quickly create tracking and pose detection datasets through a convenient graphical interface exploiting the highly photorealistic video game "GTA V" (see video: https://youtu.be/9Q1UYzUysUk).
During the demo we will demonstrate the ease with which our mod allows you to create new scenarios and control the behavior/number/type/appearance/interactions of people on screen, showing at the same time the quality of the obtained annotations (in terms of tracking and 2D and 3D pose detection). Currently, multi-person tracking and pose detection video datasets are small, as manual annotation for these tasks is extremely complex and time-consuming; moreover, the manual approach often does not guarantee optimal results, due to unavoidable human errors and the difficulty of reconstructing the correct poses of strongly occluded people. The code of our tool will be released soon.

We present a real-time multi-person 3D human body pose estimation system which makes use of a single RGB camera for human motion capture in general scenes. Our learning based approach gives full body 3D articulation estimates even under strong partial occlusion, as well as estimates of camera relative localization in space. Our approach makes use of the detected 2D body joint locations as well as the joint detection confidence values, and is trained using our recently proposed Multi-person Composited 3D Human Pose (MuCo-3DHP) dataset, and also leverages MS-COCO person keypoints dataset for improved performance in general scenes. Our system can handle an arbitrary number of people in the scene, and processes complete frames without requiring prior person detection.

We present a real-time demo of our 3D human pose detector, LCR-Net++, recently accepted to appear in IEEE TPAMI. It is an extended version of the LCR-Net paper published at CVPR'17, and the first detector that estimates the full-body 2D and 3D poses of multiple people from a single RGB image. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, the standard benchmark dataset captured in a controlled environment, and demonstrates satisfying 3D pose results on real-world images, hallucinating plausible body parts when people are partially occluded or truncated by the image boundary.

More details on the project website: https://thoth.inrialpes.fr/src/LCR-Net/


HybridFusion: Real-Time Performance Capture Using a Single Depth Sensor and Sparse IMUs

The success of deep learning depends on large-scale annotated vision datasets, yet there has been surprisingly little work on the design of tools for efficient data labeling at scale. To foster the systematic study of large-scale data annotation, we introduce Scalabel, a versatile and scalable annotation system that accommodates a wide range of annotation tasks needed by the computer vision community and enables fast labeling with minimal expertise. We support generic 2D/3D bounding box annotation, semantic/instance segmentation, and video object tracking, as well as domain-specific annotation tasks for autonomous driving, such as lane detection and drivable area segmentation. By providing a common API and visual language for data annotation across a wide range of tasks, Scalabel is both easy to use and easy to extend. Our labeling system was used to annotate the BDD100K dataset and received positive feedback from the workers during this large-scale production. To our knowledge, existing open-source annotation tools are built for specific tasks (e.g., single object detection or vehicle/pedestrian detection) and cannot be readily extended to new tasks.

A depth estimation solution based on a single shot taken with a single phase-coded aperture camera is presented. One of the most challenging tasks in computer vision is depth estimation from a single image, the main difficulty being that depth information is lost in conventional 2D imaging. Various computational imaging approaches have been proposed to address this challenge, such as incorporating an amplitude mask in the imaging system's pupil. Such a mask encodes subtle depth-dependent cues in the resultant image, which are then used for depth estimation in a post-processing step. Yet a proper design of such an element is challenging. Moreover, amplitude masks reduce the light throughput of the imaging system (in some cases by up to 50%). The recent and ongoing deep learning (DL) revolution has not passed over this challenge, and many monocular depth estimation convolutional neural networks (CNNs) have been proposed. In such solutions, a CNN model is trained to estimate the scene depth map using labeled data. These methods rely on monocular depth cues (perspective, vanishing lines, proportion, etc.), which are not always clear and helpful. In this work, the DL end-to-end design approach is employed to jointly design a phase mask and a DL model, working in tandem to achieve scene depth estimation from a single image. Utilizing phase aperture coding has the important advantage of nearly 100% light throughput. The phase-coded aperture imaging is modeled as a layer in the DL structure, and its parameters are learned jointly with the CNN to achieve true synergy between the optics and the post-processing step. The designed phase mask encodes color- and depth-dependent cues in the image, which enable depth estimation using a relatively shallow FCN model trained for depth reconstruction.
After the training stage is completed, the phase element is fabricated (according to the trained optical parameters) and incorporated in the aperture stop of a conventional lens, mounted on a conventional camera. The raw images taken with this phase-coded camera are fed to the 'conventional' FCN model, and depth estimation is achieved. In addition, utilizing the coded PSFs, an all-in-focus image can be restored. Combining the all-in-focus image with the acquired depth map, synthetic refocusing with the proper bokeh effect can be created.
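The core idea of treating the optics as a trainable layer can be illustrated with a toy forward model. The sketch below is a hypothetical, heavily simplified 1-D version (the real system uses a wave-optics model of the phase mask): the PSF width depends on scene depth and on a scalar phase parameter, and the resulting "imaging layer" would sit in front of the CNN so the parameter can be optimized jointly with the network weights.

```python
import numpy as np

def depth_dependent_psf(depth, phase_param):
    # Hypothetical PSF model: blur width grows with distance from an
    # assumed focus plane at depth 1.0, scaled by the learnable
    # phase-mask parameter. The actual paper derives the PSF from optics.
    sigma = 1.0 + phase_param * abs(depth - 1.0)
    radius = 3
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs**2 / (2 * sigma**2))
    return kernel / kernel.sum()  # normalize so brightness is preserved

def render_coded_image(sharp_row, depth, phase_param):
    # Forward imaging model: convolve the (1-D, for brevity) sharp signal
    # with the depth-dependent PSF. During training this layer's
    # phase_param would receive gradients like any other weight.
    psf = depth_dependent_psf(depth, phase_param)
    return np.convolve(sharp_row, psf, mode="same")
```

Because the blur pattern varies systematically with depth, a downstream network can invert the mapping and recover a depth map from a single coded image.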

We present a novel approach for searching and ranking videos for activities using a deep generative model. Ranking is a well-established problem in computer vision. It is usually addressed using discriminative models; however, the decisions made by these models tend to be unexplainable. We believe that generative models are more explainable, since they can generate instances of what they have learned.

Our model is based on Generative Adversarial Networks (GANs). We formulate a Dense Validation GAN (DV-GAN) that learns human motion, generates realistic visual instances given textual inputs, and then uses the generated instances to search and rank videos in a database under a perceptually sound distance metric in video space. The distance metric can be chosen from a spectrum of handcrafted to learned distance functions, controlling trade-offs between explainability and performance. Our model is capable of human motion generation and completion.

We formulate the GAN discriminator using a Convolutional Neural Network (CNN) with dense validation at each time-scale and perturb the discriminator input to make it translation invariant. Our DV-GAN generator is capable of motion generation and completion using a Recurrent Neural Network (RNN). For encoding the textual query, pretrained language models such as skip-thought vectors are used to improve robustness to unseen query words.
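The generate-then-rank step can be sketched as follows. This is a minimal illustration under assumed inputs, not the paper's implementation: `generated` stands for the motion sequence the generator produced for the text query, and the pluggable `metric` reflects the handcrafted-to-learned spectrum mentioned above.

```python
import numpy as np

def rank_videos(generated, database, metric=None):
    # `database` maps video ids to sequences with the same shape as
    # `generated`. The distance metric is pluggable: a handcrafted
    # function (like the L2 default below) is explainable, while a
    # learned function may rank better at the cost of transparency.
    if metric is None:
        metric = lambda a, b: float(np.mean((a - b) ** 2))
    scores = {vid: metric(generated, seq) for vid, seq in database.items()}
    # Smallest distance to the generated instance ranks first.
    return sorted(scores, key=scores.get)
```

Because the ranking is anchored on a concrete generated instance, a user can inspect that instance to understand why a given video ranked highly.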

We evaluate our approach on the Human 3.6M and CMU motion capture datasets using inception scores. Our evaluations show that our approach is resilient to noise, generalizes across actions, and generates long, diverse sequences.

Our demo is available at http://visxai.ckprototype.com/demo/ and a simplified blog post about the demo is available at http://genrank.ckprototype.com/

Large-scale content-based image retrieval (CBIR) systems are widespread [1,2,3], yet most CBIR systems for reverse image search suffer from two drawbacks: inability to incorporate user feedback, and uncertainty about why any particular image was retrieved. Without feedback, the user cannot communicate to the system which results are relevant and which are not; without explanations, the system cannot communicate to the user why it believes its answers are correct. Our demonstration proposes solutions to both problems, implemented in an open-source image query and retrieval framework. We demonstrate incorporation of user feedback to increase query precision, combined with a method to generate saliency maps which explain why the system believes that the retrievals match the query. We incorporate feedback via iterative query refinement (IQR), in which the user, via a web-based GUI, provides binary relevance feedback (positive or negative) to refine previously retrieved results until the desired precision is achieved. In each iteration, the feedback is used to train a two-class SVM classifier; this reranks the initial result set, attempting to increase the likelihood that higher-ranked results would also garner positive feedback. Secondly, we generate saliency maps for each result, reflecting how regions in a retrieved image match the query image. Building on [4,5,6], we repeatedly obscure regions in the matched image and recompute the similarity metric to identify which regions most affect the score. Unlike previous methods, ours does not require matches to belong to predefined classes. These saliency maps provide the explanation, visualizing the underlying matching criteria to show why a retrieved image was matched. Generating saliency maps requires no additional models or modifications to the underlying algorithm. This is joint research of Kitware (IQR and framework) and Boston University (saliency maps), developed as part of the DARPA Explainable AI project.
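The occlusion-based saliency step described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: `similarity_fn` stands in for the retrieval system's scoring of the matched image against the query (a hypothetical callable here), and the occlusion value and patch size are arbitrary choices.

```python
import numpy as np

def occlusion_saliency(match_img, similarity_fn, patch=4):
    # Repeatedly obscure square regions of the matched image and measure
    # how much the similarity score drops. Regions whose removal causes
    # a large drop contributed most to the match, so they are salient.
    base = similarity_fn(match_img)
    h, w = match_img.shape[:2]
    saliency = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            occluded = match_img.copy()
            occluded[y:y+patch, x:x+patch] = 0  # zero out the region
            drop = base - similarity_fn(occluded)
            saliency[y:y+patch, x:x+patch] = drop
    return saliency
```

Note that this treats the similarity function as a black box, which is why the method needs no additional models or changes to the underlying retrieval algorithm.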

Current VSLAM algorithms cannot work without assuming rigidity. We propose the first real-time tracking thread for VSLAM systems that handles deformable scenes. It builds on Shape-from-Template (SfT) methods to encode the scene deformation model. Our proposal is a sequential two-step method that efficiently handles large templates while simultaneously locating the camera. We show the system with a demo in which we move the camera while imaging only a small part of a fabric that we deform, recovering both deformation and camera pose in real time (20 Hz).

We would like to present a demo of our paper [1], http://rpg.ifi.uzh.ch/ultimateslam.html
Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. These cameras provide reliable visual information during high-speed motions or in scenes with high dynamic range; however, they output little information during slow motions. Conversely, standard cameras provide rich visual information most of the time (in low-speed and good lighting scenarios). We present the first SLAM pipeline that leverages the complementary advantages of these two sensors by fusing events, intensity frames, and inertial measurements in a tightly-coupled manner. We show that our pipeline leads to an accuracy improvement of 130% over event-only pipelines, and 85% over standard-frames-only visual-inertial systems, while still being computationally tractable.
We believe that event cameras are of great interest to the ECCV audience, bringing exciting new ideas about asynchronous and sparse acquisition and processing of visual information. Event cameras are an emerging technology, supported by companies with multi-million dollar investments, such as Samsung and Prophesee [2].
[1] A. Rosinol Vidal et al., Ultimate SLAM?, Combining Events, Images and IMU for Robust Visual SLAM in HDR and High Speed Scenarios, IEEE RA-L, 2018. http://rpg.ifi.uzh.ch/ultimateslam.html
[2] http://www.prophesee.ai/2018/02/21/prophesee-19-million-funding-round/


FlashFusion: Real-time Globally Consistent Dense 3D Reconstruction

Lei Han, Lu Fang (Tsinghua University)

Aiming at the practical usage of dense 3D reconstruction on portable devices, we propose FlashFusion, a Fast LArge-Scale High-resolution (sub-centimeter level) 3D reconstruction system without the use of GPU computing. It enables globally-consistent localization through a robust yet fast global bundle adjustment scheme, and realizes spatial-hashing-based volumetric fusion running at 300 Hz and rendering at 25 Hz via highly efficient valid chunk selection and mesh extraction schemes. Extensive experiments on both real-world and synthetic datasets demonstrate that FlashFusion succeeds in enabling real-time, globally consistent, high-resolution (5 mm), and large-scale dense 3D reconstruction using highly constrained computation, i.e., CPU computing on a portable device. The associated paper was previously published at Robotics: Science and Systems 2018 as an oral presentation.
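The spatial-hashing idea behind such systems can be sketched briefly. This is a generic illustration (the hash constants are the classic ones from the spatial-hashing literature, and the chunk size is an assumed value, not taken from the paper): only the sparse set of voxel chunks actually touched by observed surface points is allocated, fused, and meshed, instead of a dense grid.

```python
def spatial_hash(x, y, z, table_size=2**20):
    # Classic integer spatial hash over chunk coordinates; hashing lets
    # the system index a sparse set of allocated chunks in O(1) rather
    # than storing a dense volumetric grid.
    return ((x * 73856093) ^ (y * 19349669) ^ (z * 83492791)) % table_size

def touched_chunks(points, chunk_size=0.064):
    # Hypothetical chunk edge length in meters (e.g. 8 voxels of 8 mm).
    # Maps each 3-D point to the integer coordinate of the chunk that
    # contains it, deduplicated via a set: these are the "valid" chunks
    # that fusion and mesh extraction need to visit.
    chunks = set()
    for px, py, pz in points:
        chunks.add((int(px // chunk_size),
                    int(py // chunk_size),
                    int(pz // chunk_size)))
    return chunks
```

Restricting fusion and meshing to these valid chunks is what makes high update rates feasible on CPU-only hardware.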


EVO: A Geometric Approach to Event-based 6-DOF Parallel Tracking and Mapping in Real-time

We wish to present a live demo of our paper EVO [1], https://youtu.be/bYqD2qZJlxE. Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras (a very high dynamic range, no motion blur, and microsecond latency). However, traditional vision algorithms cannot be applied to the output of these sensors, so a paradigm shift is needed. Our structure-from-motion algorithm successfully leverages the outstanding properties of event cameras to track fast camera motions while recovering a semi-dense 3D reconstruction of the environment. Our work makes significant progress in SLAM by unlocking the potential of event cameras, allowing us to tackle challenging scenarios (e.g., high speed) that are currently inaccessible to standard cameras. To the best of our knowledge, this is the first work showing real-time structure from motion on a CPU for an event camera moving in six degrees of freedom. We believe the paradigm shift posed by event cameras is of great interest to the ECCV audience, bringing exciting new ideas about asynchronous and sparse acquisition (and processing) of visual information. Event cameras are an emerging technology that is attracting the attention of investment funding, such as the $40 million raised by Prophesee [2] or the multi-million dollar investment by Samsung [3].
[1] Rebecq, Horstschaefer, Gallego and Scaramuzza, EVO: A Geometric Approach to Event-based 6-DOF Parallel Tracking and Mapping in Real-time, IEEE RA-L, 2017.
[2] http://www.prophesee.ai/2018/02/21/prophesee-19-million-funding-round/
[3] Samsung turns IBM's brain-like chip into a digital eye, https://www.cnet.com/news/samsung-turns-ibms-brain-like-chip-into-a-digital-eye/