At CVPR 2018, the top computer vision conference held in Salt Lake City in the United States in June, two of Tencent Youtu Lab's accepted papers attracted the attention of academia and industry because of their high application value.

Tencent Youtu Lab papers again accepted at a top academic conference

CVPR is one of the highest-level conferences in the field of computer vision and usually represents the field's latest directions and state of the art. In 2017, Tencent Youtu Lab had 12 papers accepted at ICCV, another top computer vision conference, including 3 oral presentations (a category accounting for only 2.1% of submissions). In 2018 the lab's research results were recognized again, with papers accepted at CVPR 2018. The accepted papers put forward many innovations; beyond demonstrating research strength, they describe extensible application technology, and this visual AI is expected to bring more valuable contributions to both academia and industry.

"Scale-recurrent Network for Deep Image Deblurring" introduces the application of AI to deblurring images of non-specific scenes, while "Facelet-Bank for Fast Portrait Manipulation" introduces the use of AI to manipulate portraits quickly. These two technologies solve problems that have long plagued image processing and have attracted industry attention because of their enormous application value. Below, we focus on these two technologies and the application scenarios that have drawn the most attention from foreign media.

Demystifying motion blur: toward practical deblurring of non-specific scenes

When shooting with slow exposure or fast-moving subjects, blurred images often frustrate photographers. Researchers at Tencent Youtu Lab have developed an effective new algorithm that can restore blurred images.

Before this, image deblurring had long been a difficult problem in image processing. The causes of blurring can be very complicated: camera shake, loss of focus, or a subject moving at high speed. The tools in existing photo-editing software are usually unsatisfactory. For example, the "Camera Shake Reduction" tool in Photoshop CC can only handle simple camera shake, a type of blur known in computer vision as "uniform blur". Most blurred pictures are not uniformly blurred, so existing picture-editing software is of very limited use.

The new Tencent Youtu Lab algorithm can handle blur in non-specific scenes. It is based on a blur-model assumption called "dynamic blur": the motion of each pixel is modeled individually, so the algorithm can handle almost any type of motion blur. For example, in the figure above, each person's motion trajectory is different because of the translation and rotation caused by camera shake. After processing by the new algorithm, the picture is restored to near-complete clarity; even the words on the books in the background are legible.

According to a researcher at Tencent Youtu Lab, the underlying technique is a deep neural network. After training on thousands of pairs of blurred/sharp images, the network automatically learns how to sharpen blurred image structure.

Although using neural networks for image deblurring is not a new idea, Tencent Youtu Lab incorporates physical intuition to aid model training. In the paper, the network imitates a well-established image restoration strategy called "coarse to fine". The strategy first downscales the blurred image to several sizes, and then, starting from the smaller, sharper image that is easier to restore, progressively processes the larger images. The clear image produced at each step guides the restoration of the next larger image, reducing the difficulty of network training.
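The coarse-to-fine loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's model: `restore_step` is a hypothetical stand-in for one pass of the scale-recurrent network (here just a blend of the blurry input with the upsampled coarser estimate), and the pooling/upsampling functions replace the learned resampling layers.

```python
import numpy as np

def downsample(img, factor=2):
    """Average-pool downsample by an integer factor."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor].reshape(
        h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor=2):
    """Nearest-neighbour upsample by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def coarse_to_fine_deblur(blurred, restore_step, n_scales=3):
    """Restore from the coarsest scale to the finest, letting each
    scale's result guide the next larger one."""
    # Build the image pyramid: index 0 is full size.
    pyramid = [blurred]
    for _ in range(n_scales - 1):
        pyramid.append(downsample(pyramid[-1]))

    estimate = pyramid[-1]                 # start at the coarsest scale
    for scale in reversed(range(n_scales - 1)):
        guide = upsample(estimate)         # carry the result upward
        estimate = restore_step(pyramid[scale], guide)
    return estimate

# Trivial stand-in for the learned restoration step.
blend = lambda blurry, guide: 0.5 * (blurry + guide)
```

Running `coarse_to_fine_deblur(img, blend)` on an 8x8 image processes 2x2, 4x4, and 8x8 versions in turn and returns a full-size result.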

Modifying face attributes (beyond simple beautification) in portrait photos is very difficult. Artists usually need to do a great deal of processing to make the modified image look beautiful and natural. Can AI take over these complex operations?

Researchers in Prof. Jia Jiaya's group at Tencent Youtu Lab have proposed a new model for "automatic portrait manipulation". With this model, the user simply provides a high-level description of the desired effect, and the model automatically renders the photo according to the command, for example, making the subject look younger or older.

The main challenge in this task is that "input-output" training pairs cannot be collected. Generative adversarial networks (GANs), popular in unsupervised learning, are therefore usually used for it. However, the method proposed by the Youtu team does not depend on a GAN: it trains the neural network on generated noisy targets. Thanks to the denoising effect of deep convolutional networks, the network's output can be even better than its learning targets.
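The intuition that a network fitted to noisy targets can beat those targets comes from regression averaging out the noise. The toy sketch below illustrates only that statistical effect, not the paper's network: a hypothetical "clean" editing effect is never observed, yet a running-mean fit to noisy versions of it converges close to the clean value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "clean" editing effect that cannot be observed directly.
clean_effect = np.array([1.0, -2.0, 0.5])

# Only noisy versions of it are available (e.g. crude synthetic edits).
noisy_targets = clean_effect + rng.normal(0.0, 0.3, size=(500, 3))

# Fit a single learnable output to the noisy targets with running-mean
# updates (SGD on an MSE loss with a 1/t step size). The optimum of the
# MSE loss is the mean of the targets, which approaches the clean effect.
estimate = np.zeros(3)
for i, t in enumerate(noisy_targets):
    estimate += (t - estimate) / (i + 1)
```

With 500 samples of noise standard deviation 0.3, the estimate lands within a few hundredths of the clean effect, i.e. far closer than any individual noisy target.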

"Generative adversarial networks are a powerful tool, but they are difficult to optimize. We wanted to find a simpler way to solve this problem. We hope this work not only reduces the burden on artists, but also reduces the burden on the engineers who train the models," a Tencent researcher said.

According to reports, another attractive feature of the model is that it supports local model updates: when switching between different manipulation tasks, only a small part of the model needs to be replaced. This is very friendly to system developers, and at the application level it also enables incremental updates.

Even if the face in the photo is not well cropped and aligned, the model can implicitly attend to the correct face region. In many cases, simply feeding the original photo to the model is sufficient to produce high-quality results. The model can even process a video frame by frame to edit the face attributes throughout the entire video.

Appendix: In addition to the two papers above, Tencent Youtu Lab's other papers accepted at CVPR 2018

1. Referring Image Segmentation via Recurrent Refinement Networks

Segmenting a specific region of an image according to a natural-language description is a challenging problem. Previous neural-network methods segment by merging image and language features, but they ignore multi-scale information, which leads to poor-quality segmentation results. To address this, we propose a model based on a recurrent refinement network that adds features from lower convolutional layers at each iteration, enabling the network to gradually capture information at different scales of the image. We visualize the model's intermediate results and achieve state-of-the-art performance on all relevant public datasets.

Human body part parsing, or human semantic part segmentation, is fundamental to many computer vision tasks. Traditional semantic segmentation methods require manually labeled annotations for end-to-end training with Fully Convolutional Networks (FCNs). Although past methods achieve good results, their performance depends heavily on the quantity and quality of training data. In this paper, we propose a new way to obtain training data: easily available human keypoint data can be used to generate body part parsing data. Our main idea is to exploit the morphological similarity between humans to transfer one person's part parsing result to another person with a similar pose. Using the generated results as additional training data, our semi-supervised model outperforms the strongly supervised method by 6 mIOU on the PASCAL-Person-Part dataset and achieves state-of-the-art human part parsing results. Our method is very general: it can easily be extended to part parsing of other objects or animals, as long as their morphological similarity can be represented by keypoints. Our model and source code will be released later.

This paper proposes a two-branch convolutional neural network for low-level vision problems such as image super-resolution, edge-preserving image filtering, image de-raining, and image dehazing. These problems usually involve estimating both the structure and the details of the target result. Inspired by this, the proposed network contains two branches that estimate the structure and the details of the target result end to end. Based on the estimated structure and detail information, the target result can then be obtained from the imaging model of the specific problem. The proposed two-branch network is a general framework that can use existing convolutional neural networks to handle related low-level vision problems. Extensive experimental results show that it applies to most low-level vision problems and achieves good results.
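The structure/detail decomposition at the heart of this framework can be illustrated with classical filtering in place of the learned branches. This is a simplified sketch under that assumption: a box filter stands in for the structure branch, the residual stands in for the detail branch, and the two layers sum back to the original image (a trivial "imaging model").

```python
import numpy as np

def box_blur(img, k=3):
    """k x k box filter: a crude stand-in for the learned structure branch."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def split_structure_detail(img):
    """Decompose an image into a smooth structure layer and a residual
    detail layer; in the paper, each CNN branch estimates one component
    and the task-specific imaging model recombines them."""
    structure = box_blur(img)
    detail = img - structure
    return structure, detail
```

By construction, adding the two layers reconstructs the input exactly, which is what makes the decomposition usable across different low-level tasks.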

In this paper, we propose a geometric neural network that simultaneously predicts the depth map and surface normals of a scene. Our model is based on two different convolutional neural networks and iteratively updates the depth and surface-normal predictions by modeling the geometric relation between them, which makes the final predictions highly consistent and accurate. We validate the proposed network on the NYU dataset. Experimental results show that our model accurately predicts depth and surface normals with consistent geometric relations.
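The geometric relation exploited here is that surface normals are determined by local depth gradients. The sketch below shows that relation only, under two simplifying assumptions not made in the paper: unit pixel spacing and an orthographic (rather than perspective) camera.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel surface normals of a depth map z(x, y), assuming unit
    pixel spacing and an orthographic camera (a simplification of the
    perspective geometry used in practice)."""
    dz_dy, dz_dx = np.gradient(depth)          # finite-difference slopes
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

For a planar depth map z = 2x, every normal comes out as (-2, 0, 1)/sqrt(5), which is exactly the consistency a joint depth/normal predictor can enforce.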

5. Path Aggregation Network for Instance Segmentation


In a neural network, the quality of information propagation is very important. In this paper, we propose a path aggregation network designed to improve information flow in a proposal-based instance segmentation framework. Specifically, we build a bottom-up path to propagate the accurate localization information stored in lower layers, shortening the distance between the lower and higher layers and enhancing the entire feature hierarchy. We present adaptive feature pooling, which connects region features with all feature levels so that useful information at every level flows directly to the following region subnetworks. We also add a complementary branch to capture different views of each region proposal, which further improves mask prediction quality.
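The two ideas in this paragraph, a bottom-up augmentation path and pooling each region from all levels, can be sketched with arrays standing in for feature maps. This is an illustration under stated assumptions, not the paper's implementation: plain subsampling replaces the stride-2 convolutions, and element-wise max replaces the learned fusion.

```python
import numpy as np

def bottom_up_augment(fpn_levels, downsample):
    """Augment feature-pyramid outputs P_i with a bottom-up path:
    N_i = P_i + downsample(N_{i-1}), so precise localization signals
    from low layers reach high levels through a short path."""
    augmented = [fpn_levels[0]]
    for p in fpn_levels[1:]:
        augmented.append(p + downsample(augmented[-1]))
    return augmented

def adaptive_feature_pool(roi_feats):
    """Fuse one RoI's features pooled from every pyramid level by
    element-wise max, so each proposal sees all levels."""
    return np.maximum.reduce(roi_feats)

stride2 = lambda x: x[::2, ::2]   # toy stand-in for a stride-2 conv
```

With three toy levels of shapes 8x8, 4x4, and 2x2, the augmented pyramid keeps the same shapes while the lowest level's signal accumulates upward, and the pooled RoI features keep whichever level responds most strongly.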

These improvements are easy to implement and add little extra computation. They helped us win first place in the COCO 2017 instance segmentation challenge and second place in the object detection challenge. Our method also achieves state-of-the-art results on the MVD and Cityscapes datasets.

This paper, led by Tencent Youtu Lab together with Nanjing University of Science and Technology, was selected as a Spotlight. Face super-resolution is a particular area of super-resolution in which face-specific prior information can be exploited to better super-resolve face images. This paper proposes a new end-to-end trainable face super-resolution network that improves the quality of very low-resolution face images without face alignment, by better exploiting geometric information such as facial landmark heatmaps and parsing maps. Specifically, the paper first constructs a coarse super-resolution network to recover a coarse high-resolution image. This image is then fed to a fine super-resolution encoder and a prior-estimation network: the encoder extracts image features, while the prior network estimates facial landmarks and parsing information. The outputs of the two branches are merged in a fine super-resolution decoder to reconstruct the final high-resolution image. To generate even more realistic faces, the paper further proposes a face super-resolution generative adversarial network, integrating adversarial training into the super-resolution network. In addition, we introduce two related tasks, face alignment and face parsing, as new evaluation criteria for face super-resolution; these two criteria overcome the inconsistency between numerical and visual quality in traditional metrics (such as PSNR/SSIM). Extensive experiments show that the proposed method significantly outperforms previous super-resolution methods in both numerical and visual quality on very low-resolution face images.
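The data flow described above (coarse network, then a fine encoder and a prior-estimation network in parallel, then a decoder over their merged outputs) can be made concrete as a small pipeline. This is only my reading of the description, with toy stand-in functions in place of the trained sub-networks.

```python
import numpy as np

def face_sr_pipeline(lr_img, coarse_net, encoder, prior_net, decoder):
    """Data flow of the described face super-resolution network.
    Each stage is passed in as a function; real versions would be CNNs."""
    coarse_hr = coarse_net(lr_img)      # coarse high-resolution estimate
    feats = encoder(coarse_hr)          # fine-encoder image features
    priors = prior_net(coarse_hr)       # landmark heatmaps / parsing maps
    fused = np.concatenate([feats, priors], axis=-1)
    return decoder(fused)               # reconstruct the final image

# Toy stand-ins: 2x nearest-neighbour upscaling, identity-style features,
# all-zero priors, and a decoder that sums the fused channels.
coarse_net = lambda x: np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
encoder = lambda x: x[..., None]
prior_net = lambda x: np.zeros_like(x)[..., None]
decoder = lambda f: f.sum(axis=-1)

hr = face_sr_pipeline(np.ones((4, 4)), coarse_net, encoder, prior_net, decoder)
```

Even with these trivial stages, the shapes trace the real pipeline: a 4x4 input yields an 8x8 output built from both feature and prior channels.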

The paper proposes a generative adversarial learning algorithm for fast, weakly supervised object detection. In recent years there has been much work on weakly supervised detection, which requires no manually labeled bounding boxes, but most existing methods are multi-stage pipelines that include a candidate-region extraction stage. This makes inference an order of magnitude slower than fast, fully supervised detectors (such as SSD and YOLO). The paper accelerates detection with a novel generative adversarial learning algorithm in which the generator is a single-stage detector, an agent is introduced to mine high-quality bounding boxes, and a discriminator judges the source of the bounding boxes. The final algorithm combines a structural similarity loss with an adversarial loss to train the model. Experimental results show that the algorithm achieves a significant performance improvement.

Group-based image captioning with structured relevance and diversity constraints

This paper proposes a group-based image captioning method (GroupCap) that analyzes the semantic associations among a group of images, modeling both the semantic relevance and the differences between images. Specifically, the paper first uses a deep convolutional neural network to extract semantic image features and uses the proposed visual parsing model to build a semantic association structure tree. It then applies a triplet loss and a classification loss on the structure tree to model the semantic relations (relevance and difference) between images, and uses the relevance as a constraint to guide a deep recurrent neural network in generating captions. The method is novel and effective: it addresses the inaccuracy and poor discriminability of current automatic image captioning methods and achieves higher performance on multiple image captioning metrics.
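The triplet loss mentioned above is a standard hinge-style objective; a minimal version on embedding vectors looks like this. How GroupCap wires it into the structure tree is specific to the paper, so this sketch shows only the generic loss.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the anchor should be closer (in squared
    distance) to the related positive embedding than to the unrelated
    negative embedding, by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

When the negative is already far enough away the loss is zero; when it sits inside the margin the loss grows, pushing related images together and unrelated ones apart.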