
From Supervised To Unsupervised Learning: A Paradigm Shift In Computer Vision


Slowly removing the injection of human knowledge from the training process

Source: Leon Sick

Since the inception of modern computer vision methods, success in applying these techniques could only be seen in the supervised domain. To make a model useful for tasks such as image recognition, object detection or semantic segmentation, human supervision used to be necessary. In a major shift, the last few years of computer vision research have changed the focus of the field: Away from the guaranteed success of human supervision and onto new frontiers, self-supervised and unsupervised learning.

An animation of the clustering of different classes in an unsupervised way. Source: [1]

Let’s go on a journey towards a new era that has already begun.

The success of supervised learning

An illustration of the original AlexNet architecture. Source: [2]

AlexNet marked the first breakthrough in the application of neural networks to image tasks, more specifically the ImageNet challenge. From there, it was game on, and the computer vision research community stormed towards perfecting supervised techniques for many kinds of computer vision tasks.

For image classification, many model variations have emerged since the original AlexNet paper. ResNet has unarguably become the classic among convolutional neural networks. Efficient architectures such as EfficientNet have emerged, and even networks optimized for mobile devices, such as the MobileNet architecture. More recently, Vision Transformers have gained increasing attention (unintended joke) and have been shown to outperform convolutional neural networks under the right settings (lots of data and compute). Originally invented for language tasks, their application to computer vision has been a huge success. Another interesting approach, called RegNet, designs network design spaces in which a quantized linear function defines the network architecture.
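If you want to poke at these architectures yourself, several of them ship with torchvision. A minimal sketch, assuming torchvision 0.13 or newer (where the weights API landed):

```python
# Load two of the architectures mentioned above from torchvision's model zoo.
import torch
from torchvision import models

# Classic CNN baseline: ResNet-50, pretrained on ImageNet.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

# Vision Transformer (ViT-B/16), also pretrained on ImageNet.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1).eval()

# Both map a batch of images to 1000 ImageNet class logits.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(resnet(x).shape, vit(x).shape)  # torch.Size([1, 1000]) twice
```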

The next tasks to be tackled successfully with supervised learning were object detection and semantic segmentation. R-CNNs made the first big splash in the former domain, followed by many advances in computational efficiency and accuracy. Notable approaches are the Fast, Faster and Mask R-CNN, but also the YOLO algorithms and single-shot detectors such as SSD MobileNet. A milestone in the domain of semantic segmentation was the U-Net architecture.
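These detectors are also only a torchvision call away. A quick sketch, again assuming a recent torchvision; the input resolution here is arbitrary:

```python
# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on MS COCO.
import torch
from torchvision import models

detector = models.detection.fasterrcnn_resnet50_fpn(
    weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.COCO_V1
).eval()

# Detection models take a list of image tensors and return, per image,
# a dict with 'boxes', 'labels' and 'scores'.
with torch.no_grad():
    predictions = detector([torch.rand(3, 480, 640)])
print(predictions[0]["boxes"].shape)
```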

Not to be forgotten are the benchmark datasets that made supervised techniques comparable. ImageNet set the standard for image classification, and MS COCO is still important for object detection and segmentation tasks.

All of these techniques have one thing in common: They rely on distilled human knowledge and skill in the form of labeled data to perform well. In fact, they are built around this resource and depend on it.
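To make that dependence concrete, here is a toy supervised training step in PyTorch. The model and data are placeholders; the point is that the loss cannot even be computed without human-provided labels:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)   # a batch of images...
labels = torch.randint(0, 10, (8,))  # ...and their human annotations

logits = model(images)
loss = loss_fn(logits, labels)       # supervision enters right here
optimizer.zero_grad()
loss.backward()
optimizer.step()
```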

In some way, all these techniques employ artificial neural networks that model the biological neural network in humans. And yet these models learn to perceive very differently from the way humans do. Why mimic only the human brain in its biological form and not the cognitive process behind learning to recognize and classify?

This is where the next evolution comes in: self-supervised learning.

Introducing self-supervision into the process

Think about how you learned to see. How you learned to recognize an apple. When you were younger, you saw many apples, but not all of them had a sign on them that said “This is an apple”, and no one told you it was an apple every time you saw one. The way you learned was by similarity: You saw this object time and time again, multiple times per week, maybe even per day. You recognized: Hey… this is the same thing!

Then, one day, someone taught you that this is an apple. All of a sudden, this abstract object, this visual representation, became known to you as “apple”. A similar process is used in self-supervised learning.

An illustration of the SimCLR training process. Source: [3]

State-of-the-art techniques such as SimCLR or SwAV copy this process. For pre-training, all labels are discarded and the models train without the use of human knowledge. A model is shown two versions of the same image, be it cropped, color-distorted or rotated, and it starts to learn that, despite their differing visual representations, these objects are the same “thing”. In fact, this is visible in their similar latent vector representations (remember this for later). So the model learns to produce a consistent vector for each class of object.
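As a rough sketch of this idea in PyTorch (not the exact SimCLR recipe: the encoder, augmentations and loss below are simplified stand-ins, and the random augmentations are applied batch-wise here, where a real implementation would augment every image independently):

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Encoder starts from scratch: no labels, so no pretrained weights.
encoder = models.resnet18(weights=None)
encoder.fc = torch.nn.Linear(encoder.fc.in_features, 128)  # project to latent space

# Two random "views" of the same image via augmentation.
augment = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomHorizontalFlip(),
])

def nt_xent(z1, z2, temperature=0.5):
    """Pull the two views of each image together, push all other pairs apart."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)          # the positive is the other view

images = torch.rand(8, 3, 96, 96)                 # an unlabeled batch
z1, z2 = encoder(augment(images)), encoder(augment(images))
loss = nt_xent(z1, z2)                            # similar latents for the same image
loss.backward()
```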

Next comes the “teaching” step: The pre-trained model is shown some images, this time with labels. And it learns much more quickly and effectively to classify different kinds of objects.
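A sketch of this teaching step, under the same assumptions as above: the encoder below stands in for the pre-trained one, it is frozen, and only a small linear classifier (a “linear probe”) ever sees the labels:

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the pre-trained encoder; in practice you would load its
# self-supervised weights here instead of starting fresh.
encoder = models.resnet18(weights=None)
encoder.fc = nn.Linear(encoder.fc.in_features, 128)
for p in encoder.parameters():
    p.requires_grad = False          # freeze: labels only touch the probe
encoder.eval()

probe = nn.Linear(128, 10)           # 10 classes is an assumption for this demo
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

images = torch.rand(16, 3, 96, 96)
labels = torch.randint(0, 10, (16,)) # the few labels that re-enter the process

with torch.no_grad():
    latents = encoder(images)        # representations learned without labels
loss = nn.functional.cross_entropy(probe(latents), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```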

So much of the human knowledge has been removed from the training process, but not all of it. The next step, however, is just around the corner.

Towards unsupervised learning

To make a model fully unsupervised, it has to be trained without human supervision (labels) and still be able to achieve the tasks it is expected to do, such as classifying images.

Remember that the self-supervised models already take a step in this direction: Before they are shown any labels, they are already able to compute consistent vector representations for different objects. This is key to removing all human supervision.

An illustration of the clustering of different classes in an unsupervised way. Source: [1]

What this vector generally represents is an image reduced in its dimensionality. In fact, autoencoders can be trained to recreate the image pixels from it. Because of its reduced dimension, we can use a technique long ignored (for good reasons) in computer vision: A k-nearest-neighbors classifier. If our vector representations are good enough that only the same objects form a cluster and different objects are clustered far away, we can feed the model a new, unknown image and it will assign it to the cluster of the correct class. The model will not be able to tell you what the class name is, only which group of images it belongs to. If you assign a class name to this group, all objects in the group can be classified. After all, class names are artificial creations by humans (someone decided that this thing is called an apple) and are only assigned meaning by humans.
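As a toy sketch of that pipeline, with synthetic latent vectors standing in for real encoder outputs, and with the group ids that an unsupervised clustering step (e.g. k-means) would produce synthesized as well:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Three tight clusters in a 128-dim latent space, one per object class.
latents = np.concatenate(
    [rng.normal(c, 0.1, size=(50, 128)) for c in (0.0, 1.0, 2.0)]
)
cluster_ids = np.repeat([0, 1, 2], 50)            # group membership, not class names

knn = KNeighborsClassifier(n_neighbors=5).fit(latents, cluster_ids)

new_latent = rng.normal(1.0, 0.1, size=(1, 128))  # an unseen image's vector
group = int(knn.predict(new_latent)[0])           # -> 1: the matching cluster

# Only now does a human attach meaning to the group (hypothetical assignment).
names = {0: "car", 1: "apple", 2: "dog"}
print(group, names[group])
```

Note how no class name ever enters the fit: the human-readable label is attached after the fact, exactly as described above.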

Since all labels are removed from training and the results in papers like DINO are quite promising, this is the closest we have come to removing all supervision from the training process of computer vision models.

But there is still more to come, more room for improvement.

Wrapping it up

If you have been reading up to this point, I highly appreciate you taking the time. I have purposely not included many images in this story, since they divert your attention away from the meaning of the text. I mean, we all want to be a good transformer, right? (This time it was intended)

I sincerely thank you for reading this article. If you are interested in self-supervised learning, have a look at my other stories, where I try to explain state-of-the-art papers in the space to anyone interested. And if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine. I try to post a story once a week and keep you and anyone else interested up to date on what’s new in computer vision research!

Source: https://towardsdatascience.com/from-supervised-to-unsupervised-learning-a-paradigm-shift-in-computer-vision-ae19ada1064d