Computer Vision in iOS – Object Recognition

Problem Statement: Given an image, can a machine accurately predict what is in that image?

Why is this so hard? If I show an image to a human and ask what is in it, (s)he can tell exactly which objects are present, where the picture was taken, what is special about it, and (if people are present) what they are doing and what they are about to do. For a computer, a picture is nothing but a bunch of numbers, so it can't grasp the semantics of an image the way a human does. If, even after all this, the question – why is it so hard? – is still ringing in your head, then let me ask you to write an algorithm to detect (just) a cat!

Let us start with some basic assumptions – every cat has two ears, an oval face with whiskers on it, a cylindrical body, four legs and a curvy tail! Perfect 🙂 We have our initial assumptions to start writing code! Assume we have written the code (say, 50 lines of if-else statements) to find primitives in an image which, when combined, form a cat that looks roughly as shown in the figure below (PS: Don’t laugh 😛 )

Ok let us test the performance on some real world images. Can our algorithm accurately predict the cat in this picture?

If you think the answer is yes, I would suggest you think again. If you carefully observe the cat image with primitive shapes, we have actually coded a detector for a cat that is turning towards its left. Ok! No worries! Write the exact same if-else conditions for a cat turning towards its right 😎 . Just an extra 50 lines of conditions. Good! Now we have the cat detector! Can we detect the cat in this image? 😛

Well, the answer is no 😦 . So, for tackling these types of problems, we move from basic conditionals to Machine Learning/Deep Learning. Machine Learning is a field where machines learn to do specific tasks that previously only humans were capable of. Deep Learning is a subset of Machine Learning in which we train very deep neural network architectures. A lot of researchers have already worked on this problem, and there are some popular neural network architectures that do this specific task.

The real problem lies in importing such a network into a mobile architecture and making it run in real time. This is not an easy task. First of all, the convolutions in a CNN are a costly step, and then there is the sheer size of the neural network (forget about it 😛 ). Companies like Google and Apple, along with a few research labs, have put heavy focus on optimizing the size and performance of neural networks, and at last we have some decent results that let neural networks run with decent speed on mobile phones. Still, there is a lot of amazing research to be done in this field. After Apple’s WWDC-’17 keynote, building an app that solves this particular problem has turned from a one-year effort into a single-night effort. Enough of theory and facts, let us dive into the code!

To follow this blog from here, you need to have the following things ready:

Once you have satisfied all the above requirements, let us move to adding Machine Learning model into our app.

First of all, create a new Xcode ‘Single View App’ project, select ‘Swift’ as the language, set your project name and wait for Xcode to create the project. Go to the Build Settings of the app and change ‘Swift Compiler – Language – Swift Language Version’ from Swift 4 to Swift 3.2.

In this particular project, I am moving from my traditional CameraBuffer pipeline to a newer one to make the object detection run asynchronously at a constant 30 FPS. We are using this approach to make sure that the user won’t feel any lag in the system (hence, a better user experience!). First, add a new Swift file named ‘PreviewView.swift’ and add the following code to it.
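The PreviewView code itself is not reproduced in this copy of the post, but since the later steps describe it as a UIView subclass backed by the camera, a minimal sketch would look like the standard AVFoundation preview pattern below (the `session` property name is my own choice here):

```swift
import UIKit
import AVFoundation

/// A UIView whose backing layer is an AVCaptureVideoPreviewLayer,
/// so camera frames are rendered directly without any manual drawing.
class PreviewView: UIView {

    var videoPreviewLayer: AVCaptureVideoPreviewLayer {
        return layer as! AVCaptureVideoPreviewLayer
    }

    /// Attach an AVCaptureSession here and the layer shows its frames.
    var session: AVCaptureSession? {
        get { return videoPreviewLayer.session }
        set { videoPreviewLayer.session = newValue }
    }

    // Swap the view's default CALayer for a preview layer.
    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}
```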

Now let us add camera functionality to our app. If you followed my previous blog under the optional pre-requisites, most of the content here will look pretty obvious and easy. First, go to Main.storyboard and add a ‘View’ as a child object of the existing View.

After dragging and dropping it into the existing View, go to ‘Show the Identity Inspector’ in the right-side inspector of Xcode and, under ‘Custom Class’, change the class from UIView to ‘PreviewView’. If you recall, PreviewView is nothing but the new Swift file we added in one of the previous steps, in which we inherit a few properties from UIView.

Make the View full screen with its content mode set to ‘Aspect Fill’, and add a Label View under it as a child to show the prediction classes. Add IBOutlets for both the View and the Label View in the ViewController.swift file.

Let us initialise some parameters for the session. The session should use frames from the camera, it should start running when the view appears and stop running when the view disappears. We also need to make sure that we have permission to use the camera; if permission has not been given, we should ask for it before the session starts. Hence, we should make the following changes to our code!
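The changes described above can be sketched roughly as follows. This is not the post's exact code: the outlet names are my own, the API spellings follow the Swift 3 style the project targets (they differ slightly under Swift 4), and the session-queue pattern mirrors Apple's AVCam sample:

```swift
import UIKit
import AVFoundation

class ViewController: UIViewController {

    // Outlet names are illustrative – match them to the ones you created.
    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var predictionLabel: UILabel!

    private let session = AVCaptureSession()
    private let sessionQueue = DispatchQueue(label: "session queue")

    override func viewDidLoad() {
        super.viewDidLoad()
        previewView.session = session
        requestCameraPermissionIfNeeded()
        sessionQueue.async {
            self.configureSession()
        }
    }

    // Start the session when the view appears...
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        sessionQueue.async { self.session.startRunning() }
    }

    // ...and stop it when the view disappears.
    override func viewWillDisappear(_ animated: Bool) {
        sessionQueue.async { self.session.stopRunning() }
        super.viewWillDisappear(animated)
    }

    private func requestCameraPermissionIfNeeded() {
        if AVCaptureDevice.authorizationStatus(forMediaType: AVMediaTypeVideo) == .notDetermined {
            // Suspend session setup until the user answers the permission prompt.
            sessionQueue.suspend()
            AVCaptureDevice.requestAccess(forMediaType: AVMediaTypeVideo) { _ in
                self.sessionQueue.resume()
            }
        }
    }

    private func configureSession() {
        session.beginConfiguration()
        if let camera = AVCaptureDevice.defaultDevice(withMediaType: AVMediaTypeVideo),
           let input = try? AVCaptureDeviceInput(device: camera),
           session.canAddInput(input) {
            session.addInput(input)
        }
        session.commitConfiguration()
    }
}
```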

Don’t forget to add the ‘Privacy – Camera Usage Description’ key to Info.plist, and run the app on your device. The app should show camera frames on screen with just 5% CPU usage 😉 Not bad! Now, let us add the Inception v3 model to our app.
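For reference, the raw key behind ‘Privacy – Camera Usage Description’ is `NSCameraUsageDescription`; if you edit Info.plist as source code, the entry looks like this (the usage string itself is just an example):

```xml
<key>NSCameraUsageDescription</key>
<string>Camera access is needed to recognise objects in the live video feed.</string>
```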

If you haven’t downloaded the Inception v3 model yet, download it from the link provided above. After this step, you will have a file named ‘Inceptionv3.mlmodel’.

Drag and drop the ‘Inceptionv3.mlmodel’ file into your Xcode project. After importing the model into your project, click on the model; this is how your ‘*.mlmodel’ file looks in Xcode.

What information does the ‘*.mlmodel’ file convey? At the start of the file, you can see some information about it, such as the file name, its size, the author and license information, and a description of the network. Then come the ‘Model Evaluation Parameters’, which explain what the input of the model should be and what the output looks like. Now let us set up our ViewController.swift file to send images into the model for predictions.

Apple has made Machine Learning very easy through its CoreML framework. All we have to do is ‘import CoreML’ and initialise a model variable with the ‘*.mlmodel’ file name.
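Concretely, Xcode auto-generates a Swift class (here `Inceptionv3`) from the .mlmodel file, so the initialisation and a prediction call can look roughly like the sketch below. The input name `image` and the output properties `classLabel`/`classLabelProbs` are those of Apple's published Inception v3 model, but double-check them against the interface Xcode generates for you:

```swift
import CoreML

class ClassificationService {

    // Xcode generates the Inceptionv3 class from Inceptionv3.mlmodel.
    let model = Inceptionv3()

    /// Runs one prediction on a 299x299 CVPixelBuffer and returns the top label.
    func topLabel(for pixelBuffer: CVPixelBuffer) -> String? {
        guard let output = try? model.prediction(image: pixelBuffer) else {
            return nil
        }
        // classLabel is the most probable class; classLabelProbs maps
        // every class name to its predicted probability.
        return output.classLabel
    }
}
```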

The fun part begins now 🙂 . If we consider every Machine Learning/Deep Learning model as a black box (i.e., we don’t know what is happening inside), then all we should care about is: given certain inputs to the black box, are we getting the desired outputs? But we can’t feed just any type of input to the model and expect the desired output. If the model is trained on 1D signals, then the input should be reshaped to 1D before sending it into the model. If the model is trained on 2D signals (e.g., CNNs), then the input should be a 2D signal. The dimensions and size of the input should match the model’s input parameters.

The Inception v3 model takes as input a 3-channel RGB image of size 299×299×3, so we should resize our image before passing it into the model. Add the following code at the end of the ViewController.swift file; it will resize the image to our desired dimensions 😉 .
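The resize code block is missing from this copy of the post; a common way to write such a helper is a small UIImage extension drawn with UIGraphics, sketched below (the `resized(to:)` name is my own, not necessarily the post's):

```swift
import UIKit

extension UIImage {
    /// Returns the image redrawn at the given size (299x299 for Inception v3).
    func resized(to newSize: CGSize) -> UIImage? {
        UIGraphicsBeginImageContextWithOptions(newSize, true, 1.0)
        defer { UIGraphicsEndImageContext() }
        draw(in: CGRect(origin: .zero, size: newSize))
        return UIGraphicsGetImageFromCurrentImageContext()
    }
}

// Usage: let input = frameImage.resized(to: CGSize(width: 299, height: 299))
```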

In order to pass the image into the CoreML model, we need to convert it from UIImage format to CVPixelBuffer. To do this, I am adding some Objective-C code and linking it to the Swift code using a Bridging Header. If you have no clue about Bridging Headers or combining Objective-C with Swift code, I would suggest checking out this blog – Computer Vision in iOS – Swift+OpenCV
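The post itself does this conversion in Objective-C behind a bridging header; for readers who prefer to stay in one language, the same UIImage → CVPixelBuffer conversion can also be sketched in pure Swift, along these lines:

```swift
import UIKit
import CoreVideo

/// Creates a CVPixelBuffer and redraws the UIImage's CGImage into it.
/// A Swift sketch of the conversion the post does in Objective-C.
func pixelBuffer(from image: UIImage) -> CVPixelBuffer? {
    let width = Int(image.size.width)
    let height = Int(image.size.height)
    let attrs: [CFString: Any] = [
        kCVPixelBufferCGImageCompatibilityKey: true,
        kCVPixelBufferCGBitmapContextCompatibilityKey: true
    ]

    var buffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                     kCVPixelFormatType_32ARGB,
                                     attrs as CFDictionary, &buffer)
    guard status == kCVReturnSuccess, let pixelBuffer = buffer else { return nil }

    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

    // Draw the image into the pixel buffer's backing memory.
    guard let cgImage = image.cgImage,
          let context = CGContext(data: CVPixelBufferGetBaseAddress(pixelBuffer),
                                  width: width, height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue)
    else { return nil }

    context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
    return pixelBuffer
}
```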

The results look convincing, but I should not judge them, as the network was not trained by me. What I care about is the performance of the app on the phone! With the current implementation of the pipeline, while profiling the application, the CPU usage of the app stays under 30%. Thanks to CoreML, the whole Deep Learning computation has been moved to the GPU; the only tasks of the CPU are some basic image processing, passing the image to the GPU, and fetching the predictions back. There is still a lot of scope to improve the coding style of the app, and I welcome any suggestions/advice from you. 🙂

Source code:

If you like this blog and want to play with the app, the code for this app is available here – iOS-CoreML-Inceptionv3

CoreML doesn’t yet provide enough documentation for writing custom models; it is currently in beta, and the team at Apple is actively releasing new updates. I would like to wait till they release a stable version (Fall ’17). Besides that, I am playing with integrating Python into Unity for training/running custom ML models. I am interested in discussing more about this through mail rather than on the blog.