Deep learning added a huge boost to the already rapidly developing field of computer vision. With deep learning, a lot of new applications of computer vision techniques have been introduced and are now becoming parts of our everyday lives. These include face recognition and indexing, photo stylization, and machine vision in self-driving cars.
The goal of this course is to introduce students to computer vision, starting from the basics and then turning to more modern deep learning models. We will cover both image and video recognition, including image classification and annotation, object recognition and image search, various object detection techniques, motion estimation, object tracking in video, human action recognition, and finally image stylization, editing and new image generation. In the course project, students will learn how to build a face recognition and manipulation system to understand the internal mechanics of this technology, probably the most renowned example of computer vision and AI and one often demonstrated in movies and TV shows.
Do you have technical problems? Write to us: coursera@hse.ru

From the lesson

Convolutional features for visual recognition

Module two revolves around general principles underlying modern computer vision architectures based on deep convolutional neural networks. We’ll build and analyse convolutional architectures tailored for a number of conventional problems in vision: image categorisation, fine-grained recognition, content-based retrieval, and various aspects of face recognition. On the practical side, you’ll learn how to build your own key-points detector using a deep regression CNN.

Taught By

Anton Konushin

Alexey Artemov

Transcript

Hi. Welcome back to week two of Deep Learning for Computer Vision. This week, we'll look at convolutional architectures for image classification. We'll start with a recap on image classification, look into convolutional architectures for image classification, touch upon ResNet, fine-grained classification, and the key point regression problem for recognizing face images.

Let's start with a recap on image classification. Understanding the contents of an image or an image region is one of the core problems in computer vision, fundamental to image and scene understanding. Later in our course, we'll show how many other vision problems, such as object detection and semantic segmentation, can be reduced to image classification. In image classification problems, the goal is to assign the input image one or more labels from some predefined set of categories, for example, the set of all animals. Note that targeting a specific recognition domain such as animals means that we give up recognizing objects from all other domains, such as models of cars, kinds of flowers, or people's emotions.

Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. An example of a binary classification problem might be determining whether an image contains a pedestrian, which calls for a yes or no answer, while an example of a multiclass classification problem might be determining which species of plankton is present in the image, out of several hundred possible answers. The latter problem is sometimes referred to as the image categorization task. As one might imagine, picking the right set of categories makes a lot of difference for practical purposes.
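To make the distinction between binary and multiclass classification concrete, here is a minimal sketch in plain Python. It is an illustration only: the scores are made-up numbers standing in for the output of some trained model, the category list is hypothetical, and the use of a threshold and a softmax is an assumption of this sketch rather than something the lecture prescribes. Note how the multiclass predictor can only ever answer with a label from the category set it was given.

```python
import math

def predict_binary(score, threshold=0.0):
    """Binary classification: one score, one yes/no decision."""
    return score > threshold

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_multiclass(scores, categories):
    """Multiclass classification: pick exactly one label from a fixed set."""
    probs = softmax(scores)
    best = max(range(len(categories)), key=lambda i: probs[i])
    return categories[best], probs[best]

# Hypothetical scores, as if produced by some trained model (assumption).
pedestrian_score = 1.7
animal_categories = ["cat", "dog", "horse"]
animal_scores = [0.2, 2.5, -1.0]

is_pedestrian = predict_binary(pedestrian_score)
label, confidence = predict_multiclass(animal_scores, animal_categories)
print(is_pedestrian, label, round(confidence, 3))
```

Whatever the input image shows, this multiclass predictor will answer "cat", "dog", or "horse", which is why the choice of the category set matters so much in practice.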
If we're building a self-driving car that must recognize moving objects in the street, should we distinguish between non-objects, inanimate objects such as other cars, and live objects such as humans and animals, or should we recognize the 900,000 animal species known to science?

Before we turn to automatic image classification algorithms, let us briefly describe how well humans perform on this task. Research has revealed two fundamental factors influencing the image recognition process: image resolution and the duration of image exposure to the viewer. For color 256 by 256 images, human-level performance on a scene understanding task corresponds to a five percent error rate. For color 32 by 32 images, the performance only drops by seven percent relative to full resolution, despite having 1/64 of the number of pixels. Multiple studies show that humans need as little as 50 milliseconds to recognize most of a scene, with the average fixation time being on the order of 150 milliseconds.

Historically, when shallow methods were employed for classification, the Caltech databases featuring 101 or 256 object categories of handpicked images were the standard benchmarks. Later, the PASCAL Visual Object Classes database of more realistic high-resolution images became the de facto standard. The latest revision of this benchmark dates back to 2012. The Tiny Images dataset was among the first attempts to collect a database of images of a size on the order of 100 million. As storage is expensive, an investigation was made into which image resolution is needed to perform image classification, which turned out to be only 32 by 32 pixels. Later, subsets of Tiny Images were used to form the CIFAR-10 and CIFAR-100 datasets. Nowadays, the state-of-the-art database for classification problems is the ImageNet database. The goal when collecting this database was to create an image collection that would include at least 1,000 images for every known category.
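As a side note on the two resolutions mentioned in the human performance comparison, 256 by 256 and 32 by 32, the following sketch checks the pixel-count arithmetic and shows one simple way to shrink an image. Block averaging is an assumption made for this illustration; the lecture does not say how the low-resolution images were produced, and the input here is a synthetic gradient rather than a real photograph.

```python
def downsample(image, factor):
    """Shrink a square grayscale image by averaging factor x factor blocks."""
    size = len(image) // factor
    small = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small

# Synthetic 256 x 256 grayscale image (a diagonal gradient), not real data.
full = [[(r + c) % 256 for c in range(256)] for r in range(256)]
tiny = downsample(full, 8)  # 256 / 8 = 32 pixels per side

full_pixels = len(full) * len(full[0])
tiny_pixels = len(tiny) * len(tiny[0])
ratio = full_pixels // tiny_pixels
print(len(tiny), len(tiny[0]), ratio)
```

Each side shrinks by a factor of 8, so the pixel count shrinks by a factor of 8 times 8, which is the 1/64 figure quoted in the lecture.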
Currently, the collection features around 14 million annotated images from around 21,000 categories, with one million bounding box annotations. Creating databases of such scale is possible thanks to web-based crowdsourcing platforms such as Amazon Mechanical Turk or Yandex Toloka. Using these platforms, employers are able to post jobs called Human Intelligence Tasks, or HITs, that only humans are good at. These may include annotating an image, ranking a search result, or matching photographs, and workers can complete them in exchange for money.

We think of a machine learning approach to image classification as a two-stage pipeline. The first stage of the pipeline corresponds to extraction and encoding of meaningful features from the image pixels. The second stage performs image classification in the space of features extracted from the image. As always in machine learning, the key question is whether effective features exist and, if so, which ones we need to extract from the image pixels. Evidence for the existence of such features is found in neurophysiological experiments. For instance, the human visual cortex has neurons that fire when observing a bar oriented in a specific direction, regardless of the position of the bar within the visual field. The human brain has a region called the fusiform face area, or FFA, that activates when visually observing faces of other animals. All this evidence suggests that certain visual features are really important for recognition of visual objects in the real world.

In deep learning, the idea of having features tailored to the specific task has been used extensively. These features have to be discovered or learned during optimization in an end-to-end fashion, rather than fixed or programmed by hand. This allows us both to reduce the effort required to build the vision system and to improve its quality by utilizing highly specialized image descriptions.
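The two-stage pipeline described above can be sketched with a toy example in plain Python. Everything here is an assumption chosen for illustration: the features are hand-written responses to vertical and horizontal bars (loosely echoing the oriented-bar neurons mentioned above), the classifier is a nearest-centroid rule, and the helper names `make_bar`, `extract_features`, `train_centroids`, and `classify` are hypothetical. The lecture does not prescribe any of these choices.

```python
def make_bar(size, index, orientation):
    """Synthetic size x size binary image with one full-length bar."""
    image = [[0] * size for _ in range(size)]
    for k in range(size):
        if orientation == "vertical":
            image[k][index] = 1
        else:
            image[index][k] = 1
    return image

def extract_features(image):
    """Stage 1: hand-crafted features, fixed in advance and not learned.

    Taking the maximum over all positions makes the response
    independent of where the bar sits in the image.
    """
    rows, cols = len(image), len(image[0])
    # Left-to-right intensity changes respond to vertical bars.
    vertical = max(abs(image[r][c + 1] - image[r][c])
                   for r in range(rows) for c in range(cols - 1))
    # Top-to-bottom intensity changes respond to horizontal bars.
    horizontal = max(abs(image[r + 1][c] - image[r][c])
                     for r in range(rows - 1) for c in range(cols))
    return (vertical, horizontal)

def train_centroids(examples):
    """Stage 2, training: average the feature vectors of each class."""
    sums, counts = {}, {}
    for image, label in examples:
        feats = extract_features(image)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(image, centroids):
    """Stage 2, prediction: pick the class with the nearest centroid."""
    feats = extract_features(image)
    def squared_distance(label):
        return sum((f - c) ** 2 for f, c in zip(feats, centroids[label]))
    return min(centroids, key=squared_distance)

# One training image per class; the test bars appear at unseen positions.
train = [(make_bar(8, 1, "vertical"), "vertical"),
         (make_bar(8, 2, "horizontal"), "horizontal")]
centroids = train_centroids(train)
predictions = [classify(make_bar(8, i, "vertical"), centroids)
               for i in range(8)]
print(predictions)
```

The classifier never sees raw pixels, only the two numbers produced by the first stage. In a deep learning approach, the contents of `extract_features` would be learned from data together with the classifier instead of being written by hand.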
While the performance of conventional hand-crafted features and feature extraction methods, such as SIFT, LBP, ICF, or HOG, has plateaued in recent years, new developments in deep convolutional architectures for vision have kept performance levels rising. Deep models have outperformed hand-engineered feature representations in many domains and made learning possible in domains where engineered features are lacking entirely.

To summarize this short video, image classification problems are those that aim at telling which category an image or a scene belongs to. Humans are able to recognize objects and scenes, even in low-resolution images, with great accuracy and really quickly. A number of standard benchmark datasets are available to develop and evaluate automatic computer vision methods for image classification. Conventional machine learning methods for image classification build the decision function over features that are extracted from the image, while deep learning methods learn both the features and the decision function in an end-to-end fashion.
