Self-Supervised Learning and Multimodal Learning

Created by Yuwei Sun

Background



Self-Supervised Learning (SSL)

Recent self-supervised methods that use instance discrimination rely on a combination of two elements: (1) a contrastive loss and (2) a set of image transformations.

The contrastive loss explicitly compares pairs of image representations to push away representations from different images while pulling together those from transformations, or views, of the same image.

The goal is to learn models that extract effective representations from the input data and transfer them to different supervised tasks, usually evaluated in both linear evaluation (fixed feature extractor) and fine-tuning settings.

Contrastive Learning

Contrastive Learning [2] is one type of self-supervised learning (SSL) that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart.

$$\mathcal{L}(x_i,x_j,\theta)=\mathbb{1}[y_i=y_j]\,||f_\theta(x_i)-f_\theta(x_j)||^2_2+\mathbb{1}[y_i\neq y_j]\max(0,\epsilon-||f_\theta(x_i)-f_\theta(x_j)||_2^2),$$

where $\epsilon$ is a margin hyperparameter defining the lower bound on the (squared) distance between samples of different classes.
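
For concreteness, here is a minimal PyTorch sketch of this margin-based loss; the function name, batch layout, and default margin are illustrative choices of mine rather than details from [2].

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(f_xi, f_xj, same_class, eps=1.0):
    """Margin-based contrastive loss from the equation above.

    f_xi, f_xj: (B, D) embeddings of the paired samples, produced by the same encoder f_theta.
    same_class: (B,) boolean tensor, True where y_i == y_j.
    eps: margin, i.e., the lower bound on the squared distance for negative pairs.
    """
    # Squared Euclidean distance between the paired embeddings.
    sq_dist = (f_xi - f_xj).pow(2).sum(dim=1)
    # Positive pairs are pulled together; negative pairs are pushed apart
    # until the margin eps is satisfied (hinge on eps - distance).
    pos_term = same_class.float() * sq_dist
    neg_term = (~same_class).float() * F.relu(eps - sq_dist)
    return (pos_term + neg_term).mean()
```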

Fig.1 - SSL in a single modality.


SimCLR

SimCLR [3] employs Contrastive Learning to learn embeddings that can be used for various downstream tasks (supervised tasks that reuse the pre-trained model or its components), through the following steps:

1. Randomly sample a minibatch of $N$ samples and apply two different data augmentation operations $t$ and $t'$ to each sample $x$, resulting in $2N$ augmented samples in total: $\tilde{\mathbf{x}}_i = t(\mathbf{x}),\quad\tilde{\mathbf{x}}_j = t'(\mathbf{x}),\quad t, t' \sim \mathcal{T}$, where $t$ and $t'$ are sampled from the same family of augmentations $\mathcal{T}$, which includes random crop, resize with random flip, color distortions, and Gaussian blur.

2. Given one positive pair $(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$, the other $2(N-1)$ data points are treated as negative samples. The representations of these augmented samples are produced by a base encoder $f(\cdot)$, i.e., $\mathbf{h}_i = f(\tilde{\mathbf{x}}_i),\quad \mathbf{h}_j = f(\tilde{\mathbf{x}}_j)$. The representation $\mathbf{h}$ is the one used for downstream tasks.

3. The contrastive learning loss is defined using the cosine similarity $\text{sim}(\cdot,\cdot)$ and operates on an extra projection $g(\cdot)$ of the representation rather than on the representation space $\mathbf{h}$ directly. The representation before the nonlinear projection is used for downstream tasks because the contrastive loss induces a loss of information in the projected space, and that information can still be useful downstream. For each positive pair $(\mathbf{h}_i, \mathbf{h}_j)$, the loss is defined by the following

$$\mathbf{z}_i = g(\mathbf{h}_i), \quad \mathbf{z}_j = g(\mathbf{h}_j)$$

$$\mathcal{L}_\text{SimCLR}^{(i,j)} = - \log\frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$

where $\tau$ is the temperature parameter, $\text{sim}(\cdot,\cdot)$ is the cosine similarity, and $\mathbb{1}_{[k \neq i]}$ is an indicator function that equals 1 if $k \neq i$ and 0 otherwise.
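
For reference, a minimal PyTorch sketch of this loss (often called NT-Xent) is given below; it assumes the two augmented views of each sample occupy adjacent rows of the batch, and the helper name and layout are mine rather than taken from the SimCLR code.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z, tau=0.5):
    """NT-Xent loss over a batch of 2N projected embeddings.

    z: (2N, D) tensor where rows 2k and 2k+1 hold the two views of sample k.
    tau: temperature parameter.
    """
    z = F.normalize(z, dim=1)              # so that dot products are cosine similarities
    sim = z @ z.t() / tau                  # (2N, 2N) similarity matrix, scaled by temperature
    n = z.size(0)
    # Exclude self-similarity from the denominator (the indicator 1[k != i]).
    mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))
    # Index of the positive view for each row: 0<->1, 2<->3, ...
    pos = torch.arange(n, device=z.device) ^ 1
    # Cross-entropy with the positive as the target realizes
    # -log softmax(sim)[i, pos[i]], averaged over all 2N augmented samples.
    return F.cross_entropy(sim, pos)
```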



Barlow Twins

Barlow Twins [4] feeds two distorted versions of each sample to two identical networks and learns to make the cross-correlation matrix between the two groups of output features close to the identity matrix. The goal is to keep the representation vectors of different distorted versions of one sample similar, while minimizing the redundancy between the components of these vectors.

Let $C$ be the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension. Each entry $C_{ij}$ is the cosine similarity, taken over the batch index $b$, between feature dimension $i$ of $z^A$ and feature dimension $j$ of $z^B$:

$$C_{ij}=\frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \big(z^A_{b,i}\big)^2}\,\sqrt{\sum_b \big(z^B_{b,j}\big)^2}}.$$

Then the loss of Barlow Twins is defined by the following

$$\mathcal{L}_{BT}=\sum_i(1-C_{ii})^2+\lambda \sum_i\sum_{j\neq i}C_{ij}^2,$$

where the first (invariance) term pushes the diagonal entries toward 1, the second (redundancy-reduction) term pushes the off-diagonal entries toward 0, and $\lambda$ is a positive constant trading off the two terms.
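
A rough PyTorch sketch of this loss is shown below; the batch layout, the standardization step, and the default value of $\lambda$ are assumptions on my part rather than a faithful copy of the official implementation.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins loss from the equation above.

    z_a, z_b: (B, D) projector outputs for two distorted views of the same batch.
    lam: weight of the redundancy-reduction (off-diagonal) term.
    """
    B = z_a.size(0)
    # Standardize each feature dimension along the batch so that the
    # batch-wise dot product below is a correlation.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.t() @ z_b / B                                        # (D, D) cross-correlation matrix
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()               # invariance term
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()   # redundancy-reduction term
    return on_diag + lam * off_diag
```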

MoCo

Momentum Contrast (MoCo) [5] trains a visual representation encoder by matching an encoded query $q$ to a dictionary of encoded keys $\{k_1,k_2,...\}$ using the InfoNCE contrastive loss. The query representation is $q = f_q(x_q)$ where $f_q$ is an encoder network and $x_q$ is a query sample (likewise, $k = f_k(x_k)$).

Dictionary as a queue: The samples in the dictionary are progressively replaced: the current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed. The advantage of MoCo over SimCLR is that MoCo decouples the number of negatives from the batch size, whereas SimCLR requires a large batch size in order to have enough negative samples and suffers performance drops when the batch size is reduced.

Momentum update: Let $\theta_k$ be the parameters of $f_k$ and $\theta_q$ be those of $f_q$, then $\theta_k$ is updated by the following

$$\theta_k \leftarrow m\theta_k+(1-m)\theta_q,$$

where a relatively large momentum (e.g., $m = 0.999$, the default) works much better than a smaller value (e.g., $m = 0.9$).
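
Both mechanisms can be sketched in a few lines of PyTorch; the helper names are mine, and the queue update below is a simplified, non-in-place variant that assumes the queue length is a multiple of the batch size.

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """Update the key encoder f_k as an exponential moving average of the query encoder f_q."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys):
    """FIFO dictionary: drop the oldest mini-batch of keys and append the newest one.

    queue: (K, D) tensor of stored key embeddings.
    keys:  (B, D) encoded keys from the current mini-batch.
    """
    return torch.cat([queue[keys.size(0):], keys], dim=0)
```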

Fig.2 - Conceptual comparison of three contrastive loss mechanisms.


SwAV

Similarly to contrastive approaches, SwAV [6] learns representations by comparing transformations of an image, but unlike contrastive methods, it does not require computing pairwise feature comparisons. SwAV also does not require a large memory bank or a momentum network.
Fig.3 - SwAV solves a “swapped” prediction problem wherein the codes obtained from one data augmented view are predicted using the other view.
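
A minimal sketch of this swapped prediction loss follows, assuming the codes of both views have already been computed (SwAV obtains them with a Sinkhorn-Knopp step) and are treated as fixed targets; the function name and the default temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(scores_t, scores_s, codes_t, codes_s, tau=0.1):
    """Predict the code of one augmented view from the other view, and vice versa.

    scores_t, scores_s: (B, K) dot products between normalized features and the K prototypes.
    codes_t, codes_s:   (B, K) soft cluster assignments (codes), treated as fixed targets.
    tau: temperature for the softmax over prototypes.
    """
    log_p_t = F.log_softmax(scores_t / tau, dim=1)
    log_p_s = F.log_softmax(scores_s / tau, dim=1)
    # Cross-entropy between the code of one view and the prediction from the other view.
    loss = -(codes_s * log_p_t).sum(dim=1).mean() - (codes_t * log_p_s).sum(dim=1).mean()
    return 0.5 * loss
```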


SSL in multimodal learning

CLIP

Given a batch of $N$ (image, text) pairs, CLIP [7] computes the dense cosine similarity matrix between all possible (image, text) candidates within this batch. The text and image encoders are jointly trained to maximize the similarity of the $N$ correct (image, text) pairs while minimizing the similarity of the $N(N-1)$ incorrect pairs, via a symmetric cross-entropy loss over the similarity matrix. CLIP is trained on 400 million (text, image) pairs collected from the Internet.
Fig.4 - CLIP contrastive pre-training over text-image pairs.
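
A minimal PyTorch sketch of this symmetric loss is given below; note that in CLIP the temperature is a learned parameter, whereas it is fixed here for simplicity, and the function name is my own.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix.

    image_emb, text_emb: (N, D) embeddings of the N (image, text) pairs;
    row i of both tensors comes from the same pair, so the diagonal holds the correct matches.
    """
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)              # match each image to its text
    loss_t = F.cross_entropy(logits.t(), targets)          # match each text to its image
    return 0.5 * (loss_i + loss_t)
```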

MMV

Fig.5 - Multimodal learning topology.
Fig.6 - MMV: MultiModal Versatile Networks. V=Vision, A=Audio, T=Text.


AVLnet [10]

Several differences compared with MMV:

Datasets

HowTo100M [11]

HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen.

YouCook2 [12]

YouCook2 contains 2,000 long untrimmed videos covering 89 cooking recipes; on average, each distinct recipe has 22 videos. The procedure steps for each video are annotated with temporal boundaries and described by imperative English sentences. The videos were downloaded from YouTube and are all in the third-person viewpoint.

COIN [13]

The COIN dataset consists of 11,827 videos related to 180 different tasks, all collected from YouTube. The average length of a video is 2.36 minutes. On average, each video is labelled with 3.91 step segments, and each segment lasts 14.91 seconds.

AudioSet [14]

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.








[1] Linda B. Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. 2005.

[2] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. 2005.

[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. 2020.

[4] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.

[5] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. 2020.

[6] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020.

[7] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. 2021.

[8] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. 2020.

[9] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. 2010.

[10] Andrew Rouditchenko, Angie W. Boggust, David Harwath, et al. Avlnet: Learning audio-visual language representations from instructional videos. 2021.

[11] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. 2019.

[12] Luowei Zhou, Nathan Louis, and Jason J. Corso. Weakly-supervised video object grounding from text by loss weighting and object interaction. 2018.

[13] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. 2019.

[14] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. 2017.