Be an Optimist Prime in the world of Computer Vision

Why is image processing so hard?

This is another post that has been inspired by a question posed in a forum: “What are the open research areas in image processing?”.

My answer? Everything is still an open research area in image processing/computer vision!

But why is this the case? You’d think that after decades of research we’d feel comfortable in saying “this problem here is solved, let’s focus on something else”. In a way we can, but only for narrow and simple use cases (e.g. locating a red spoon on an empty white plate), not for computer vision in general (e.g. locating a red spoon in all possible scenarios, like a big box full of colourful toys). I’m going to spend the rest of this post explaining the main reasons behind this.

So, why is computer vision so hard?

Before we dig into what I consider to be the dominant reasons why computer vision is so damn hard, I first need to explain how machines “see” images. When we humans view an image, we perceive objects, people or a landscape. When machines “view” images, all they see are numbers that represent individual pixels.

An example will explain this best. Let’s say that you have a greyscale image. Each pixel, then, is represented by a number usually between 0 and 255 (I’m abstracting here over things like compression, colour spaces, etc.), where 0 is for black (no colour) and 255 is for white (full intensity). Anything between 0 and 255 is a shade of grey, like in the picture below.

So, for a machine to glean anything from an image, it has to process these numbers in one way or another. This is exactly what image/video processing and computer vision are all about: dealing with numbers!
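To make this concrete, here is a minimal sketch of what a greyscale image looks like from the machine’s side. The 4x4 grid and the two helper functions (mean_intensity and brightest_pixel) are made up for illustration; a real image would simply have far more numbers in it.

```python
# A tiny, made-up 4x4 greyscale "image": each number is one pixel's
# intensity, where 0 is black and 255 is white.
image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
]


def mean_intensity(img):
    """Average brightness, computed purely from the pixel numbers."""
    total = sum(sum(row) for row in img)
    count = sum(len(row) for row in img)
    return total / count


def brightest_pixel(img):
    """Return (row, column, value) of the first brightest pixel found."""
    best = (0, 0, img[0][0])
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


print("mean intensity:", mean_intensity(image))
print("brightest pixel (row, col, value):", brightest_pixel(image))
```

Notice that nothing in there knows what the picture shows. Everything the machine can say about the image has to be worked out from arithmetic on the grid.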

Now that we have the necessary background information about computer vision we can move on to the meat of the post: the main reasons behind why computer vision is an immensely hard problem to solve. I’m going to list four such reasons:

Swathes of data

Inherent loss of information

Dealing with noise

Requirements for interpretation

We’ll look at these one at a time.

1. We’re dealing with a heck of a lot of data

As I said above, when it comes to images, all computers see are numbers… lots of numbers! And lots of numbers means a lot of data that needs to be processed to be made sense of.

How much data are we talking about here? Let’s take a look at another example (that once again abstracts over many things such as compression and colour spaces). If you have a greyscale image with 1920 x 1080 resolution, your image is described by about 2 million numbers (1920 * 1080 = 2,073,600 pixels). Now, if you switch to a colour image, you need three times as many numbers because, typically, when you represent a coloured pixel you specify how much red, green, and blue it is composed of. And further, if you’re trying to analyse images coming in from a video/camera stream with, say, a 30 frames/sec frame rate (which is a standard frame rate nowadays), you’re suddenly dealing with roughly 187 million numbers per second (3 * 2,073,600 * 30 = 186,624,000 numbers/sec). That is a lot of data that needs processing! Even with today’s powerful processors and relatively large memory sizes, machines struggle to do anything meaningful with that many numbers coming in every second.
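If you want to check the arithmetic yourself, a few lines of Python will do it. The function name numbers_per_second is just something I made up for this sketch, and it makes the same simplification as the example above: one number per colour channel per pixel, no compression.

```python
def numbers_per_second(width, height, channels, fps):
    """Raw values a machine must handle per second for an uncompressed stream.

    Simplification: one number per channel per pixel, no compression.
    """
    return width * height * channels * fps


# One greyscale Full HD frame (a single channel, a single frame).
grey_frame = numbers_per_second(1920, 1080, channels=1, fps=1)

# One colour frame: three channels (red, green, blue) per pixel.
colour_frame = numbers_per_second(1920, 1080, channels=3, fps=1)

# A colour video stream at 30 frames per second.
colour_stream = numbers_per_second(1920, 1080, channels=3, fps=30)

print(f"greyscale frame: {grey_frame:,} numbers")
print(f"colour frame:    {colour_frame:,} numbers")
print(f"colour stream:   {colour_stream:,} numbers per second")
```

The exact product for the stream comes to 186,624,000 numbers every second, and that is before you have done any actual analysis on them.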

2. Loss of Information

Loss of information in the digitising process (going from real life to an image on a machine) is another major player contributing to the difficulty involved in computer vision. The nature of image processing is such that you’re taking information from a 3D world (or 4D if we’re dealing with time in a video stream) and projecting it onto a 2D plane (i.e. a flat image). This means that you’re also losing a lot of information in this process – even though we still have a lot of data to deal with as is, as discussed above.

Now, our brains are fantastic at inferring what that lost data is. Machines are not. Take a look at the image below showing a messy room (not mine, promise!)

We can easily tell that the large green gym ball is bigger and further away than the black pan on the table. But how is a machine supposed to infer this if the black pan takes up more pixels than the green ball!? Not an easy task.
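Here is a rough sketch of why this happens, using an idealised pinhole camera, where an object’s size in the image is its real size multiplied by the focal length and divided by its distance from the camera. All the numbers (a 0.25 m pan at 1 m, a 0.75 m ball at 4 m, a focal length of 800 pixels) are assumptions I picked for illustration, not measurements from the photo.

```python
def projected_size_px(real_size_m, distance_m, focal_length_px=800.0):
    """Apparent size in pixels under an idealised pinhole camera model.

    The focal length of 800 pixels is an arbitrary illustrative value.
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return focal_length_px * real_size_m / distance_m


# Assumed sizes and distances, chosen only to illustrate the effect.
pan = projected_size_px(real_size_m=0.25, distance_m=1.0)   # small but close
ball = projected_size_px(real_size_m=0.75, distance_m=4.0)  # big but far away

# A completely different object that lands on the same number of pixels.
giant_ball = projected_size_px(real_size_m=1.5, distance_m=8.0)

print(f"pan:        {pan:.0f} px across")
print(f"ball:       {ball:.0f} px across")
print(f"giant ball: {giant_ball:.0f} px across")
```

With these numbers the small pan covers more pixels than the much larger ball, and a ball twice as big at twice the distance looks identical to the original one. The depth has been divided away, and a single image gives the machine no direct way to get it back.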

Of course, you can attempt to simulate the way we see with two eyes by taking two pictures simultaneously from slightly different positions and extracting 3D information from these. This is called stereoscopic vision. However, working out which points in one image correspond to which points in the other is also not a trivial task and is, hence, likewise an open area of research. Further, it too suffers from the other three major reasons I discuss in this post.
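For the curious, the textbook relation behind stereoscopic depth is simple once the hard part is done. Assuming two identical, perfectly aligned cameras, depth equals focal length times the distance between the cameras (the baseline), divided by the disparity, i.e. how many pixels a point shifts between the left and right images. The sketch below uses made-up values for the focal length (800 pixels) and baseline (0.125 m), and it assumes the disparity has already been found, which is precisely the difficult bit.

```python
def depth_from_disparity(disparity_px, focal_length_px=800.0, baseline_m=0.125):
    """Depth in metres for an idealised, perfectly aligned stereo pair.

    The focal length and baseline are arbitrary illustrative values.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


near = depth_from_disparity(100)  # large shift between the images: close
far = depth_from_disparity(25)    # small shift between the images: far away

# The same far point if the matching step is wrong by a single pixel.
far_off_by_one = depth_from_disparity(24)

print(f"near point: {near:.2f} m")
print(f"far point:  {far:.2f} m")
print(f"far point with a 1 px matching error: {far_off_by_one:.2f} m")
```

Even in this idealised setup, a one pixel mistake in matching moves the far point by a noticeable distance, which hints at why getting the matching right matters so much.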

3. Noise

The digitising process is frequently accompanied by noise. For example, no camera is going to give you a perfect picture of reality, especially when it comes to the cameras located on our phones (even though phone cameras are getting phenomenally good with each new release). Intensity levels, colour saturation, etc. – these will always be just an attempt at capturing our beautiful world.
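A toy simulation shows what this does to the numbers. The sketch below takes a row of pixels from a perfectly uniform grey surface and adds random Gaussian noise to each one. Real sensor noise is more complicated than this, so treat the Gaussian model, the sigma value, and the function name add_sensor_noise as assumptions made purely for illustration.

```python
import random


def add_sensor_noise(pixels, sigma, seed=0):
    """Add Gaussian noise to each pixel and clamp the result to 0..255.

    A deliberately simple stand-in for real sensor noise.
    """
    rng = random.Random(seed)  # seeded so that the sketch is repeatable
    noisy = []
    for value in pixels:
        noisy_value = round(value + rng.gauss(0, sigma))
        noisy.append(min(255, max(0, noisy_value)))
    return noisy


# A perfectly uniform grey surface: every pixel "should" be 100.
clean_row = [100] * 1000
noisy_row = add_sensor_noise(clean_row, sigma=5.0)

changed = sum(1 for a, b in zip(clean_row, noisy_row) if a != b)
print("first ten noisy pixels:", noisy_row[:10])
print(f"{changed} of {len(clean_row)} pixels no longer equal 100")
```

To us it would still look like the same grey wall. To the machine, almost every number has changed, and it has to decide which differences are real and which are just noise.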

Other examples of noise are phenomena known as artefacts. These are distortions of images that can be caused by a number of things. One example is lens flare, which is shown in the image below. How is a computer supposed to interpret this and work out what is situated behind it? Algorithms have been developed to attempt to remove lens flare from images but, once again, it’s an open area of research.

The biggest source of artefacts undoubtedly comes from compression. Now, compression is necessary as I discussed in this post. Images would otherwise be too large to store, process, and transfer over networks. But if compression levels are too high, image quality decreases. And then you have compression artefacts appearing, as depicted in the image below.

The right image has clear compression artefacts visible

Humans can deal with artefacts, even if they dominate a scene, as seen above. But this is not the case for computers. Artefacts don’t exist in reality and are frequently arbitrary. They truly add another level of difficulty that machines have to cope with.
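To get a feel for where such artefacts come from, here is a deliberately crude stand-in for lossy compression: rounding pixel intensities down to a handful of allowed levels. This is not how JPEG or video codecs actually work (they quantise frequency coefficients of small blocks rather than raw intensities), but the underlying idea of throwing away precision to save space is similar. The function name quantise and the step size are my own choices for this sketch.

```python
def quantise(pixels, step):
    """Round each intensity down to the nearest multiple of `step`.

    A crude illustration of lossy compression, not a real codec.
    """
    return [(value // step) * step for value in pixels]


# A smooth gradient from dark to light, 32 distinct shades.
gradient = list(range(0, 256, 8))

# Keep only multiples of 64: far fewer distinct values to store.
coarse = quantise(gradient, step=64)

print("distinct shades before:", len(set(gradient)))
print("distinct shades after: ", len(set(coarse)))
print("coarse row:", coarse)
```

The smooth gradient turns into four flat bands with hard edges between them. Those edges were never in the scene, yet an algorithm looking for object boundaries may happily latch onto them.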

4. Interpretation is needed

Last, and most important, is interpretation. This is definitely the hardest thing for a machine to deal with in the context of computer vision (and not only there!). When we view an image we analyse it with years and years of accumulated learning and memory (called a priori knowledge). We know, for example, that we can sit on gym balls and that pans are generally used in the kitchen: we have learnt about these things in the past. So, if there’s something that looks like a pan in the sky, chances are it isn’t one, and we can scrutinise further to work out what the object may be (e.g. a frisbee!). Or if there are people kicking around a green ball, chances are it’s not a gym ball but a small children’s ball.

But machines don’t have this kind of knowledge. They don’t understand our world, the intricacies inherent in it, and the numerous tools, commodities, devices, etc. that we have created over the thousands of years of our existence. Maybe one day machines will be able to ingest Wikipedia and extract contextual information about objects from there but at the moment we are very far from such a scenario. And some will argue that we will never reach a phase where machines will be able to completely understand our reality – because consciousness is something that will always be out of reach for them. But more on that in a future post.

Discussion

I hope I have shown you, at least in a nutshell, why computer vision is such a difficult problem. It is an open area of research and will be for a very, very long time. Ever heard of the Turing test? It’s a test for intelligence devised by the famous computer scientist Alan Turing in the 1950s. He basically said that if you’re not able to distinguish between a machine and a human within a specified amount of time by having a natural conversation with both parties, then the machine can be dubbed intelligent.

Well, there is an annual competition called the Loebner Prize that gives away prize money to computer programs deemed most intelligent. The format of the competition is exactly the scenario proposed by Alan Turing: in each round, human judges simultaneously hold textual conversations with a computer program and a human being via a computer. Points are then awarded according to how well the machine manages to fool the judges. The top prize awarded each year is about US$3,000. If a machine is able to entirely fool a judge, the prize is $25,000. Nobody has won this award yet.

However, there is a prize worth $100,000 that nobody has picked up either. It will be awarded to the first program that judges cannot distinguish from a real human in a Turing test that includes deciphering and understanding text, visual, and auditory input. Once this is achieved, the organisers say that the annual competition will end. See how far away we are from strong intelligence? Nobody has won the $25,000 prize yet, let alone the big one.

I also mentioned above that some simple use cases can be considered solved. I must also mention here that even when use cases appear to be solved, chances are that the speed of the algorithms leaves much to be desired. Neural networks are now supposedly performing better than humans in image classification tasks (I also hope to write about this in a future post). But the state-of-the-art algorithms are barely able to squeeze out ~1 frame/sec on a standard machine. No chance of getting that to work in real-time (remember how I said above that standard frame rates are now at about 30 frames/sec?). These algorithms need to be optimised. So, although the results obtained are excellent, speed is a major issue.
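To put the speed gap into numbers, here is a back-of-the-envelope sketch. It takes the figures quoted above at face value (roughly 1 second per frame for the algorithm, 30 frames/sec for the stream); the function names are mine.

```python
def frame_budget_ms(target_fps):
    """Time available to process one frame if you want to keep up."""
    return 1000.0 / target_fps


def speedup_needed(seconds_per_frame, target_fps):
    """How many times faster an algorithm must get to run in real time."""
    return seconds_per_frame * target_fps


budget = frame_budget_ms(30)
speedup = speedup_needed(seconds_per_frame=1.0, target_fps=30)

print(f"time budget per frame at 30 frames/sec: {budget:.1f} ms")
print(f"speed-up needed for a 1 frame/sec algorithm: {speedup:.0f}x")
```

In other words, an algorithm that needs a full second per frame has to become about thirty times faster before it can keep up with an ordinary video stream.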

Summary

In this post I discussed why computer vision is so hard and why it is still very much an open area of research. I gave four major reasons for this:

Images are represented by a heck of a lot of data that machines need to process before extracting information from them;

When dealing with images we are dealing with a 2D reality that has been shrunk from 3D, meaning that A LOT of information has been lost;

Devices that present the world to us frequently also deliver noise such as compression artefacts and lens flare;

And the most important hurdle for machines is interpretation: the inability to fully comprehend the world around us and its intricacies that we learn to deal with from the very beginnings of our lives.

I then mentioned the Loebner Prize, which is an AI competition inspired by the Turing test. Nobody has yet won the $25,000, let alone the big one that involves analysing images. I also discussed the need to optimise the current state-of-the-art algorithms in computer vision. A lot of them do a good job but the amount of processing that takes place behind the scenes makes them unusable in real-time scenarios.

Computer vision is definitely still an open area of research.
