Domain Shift and Transfer Learning

Created by Yuwei Sun

Transfer Learning

A domain $D$ is defined as a two-element tuple consisting of the feature space of the input data $X$ and the marginal probability distribution $P(X)$, i.e., $D=\{X,P(X)\}$. Given a source domain $D_S$ with a corresponding source task $T_S$, as well as a target domain $D_T$ and a target task $T_T$, the objective of transfer learning is to learn the target conditional probability distribution $P(Y_T|X_T)$ in $D_T$ using the information gained from $D_S$ and $T_S$, where $D_S\neq D_T$. Notably, domain adaptation is one type of transfer learning, which aims to solve the target task $T_T$ when the marginal probability distributions of the source and target domains are different, i.e., $P(X_S)\neq P(X_T)$. For instance, book reviews labeled as positive or negative differ in distribution from product-review sentiments in another category such as electronics.

To perform transfer learning, we usually utilize a pre-trained network without its last several layers as a feature extractor for a different task, because deep neural networks are layered architectures that learn different features at different layers. Generally, the lower layers learn highly transferable features, while transferability sharply decreases in the higher layers of a deep neural network [1]. It is better to use a pre-trained model that was trained on a huge dataset as a starting point; then, we can further train the model on a relatively small dataset for another task. This is known as model fine-tuning. There are mainly three fine-tuning techniques: 1) train the entire architecture; 2) train some layers while freezing others; 3) freeze the entire architecture. When a layer is frozen, its parameters become untrainable during model training.
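Below is a minimal PyTorch sketch of the three strategies, using a torchvision VGG-19 pre-trained on ImageNet; the 10-class replacement head and the specific layer indices are assumptions for illustration only.

```python
# A minimal sketch of the three fine-tuning strategies with a pre-trained VGG-19.
import torch.nn as nn
from torchvision import models

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Replace the last classification layer for the new (assumed 10-class) target task.
model.classifier[6] = nn.Linear(4096, 10)

# 1) Train the entire architecture: leave every parameter trainable (the default).

# 2) Train some layers while freezing others, e.g., freeze the convolutional
#    feature extractor and only fine-tune the classifier head.
for param in model.features.parameters():
    param.requires_grad = False

# 3) Freeze the entire architecture: only the newly added layer stays trainable.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier[6].parameters():
    param.requires_grad = True
```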

For natural language processing (NLP), large pre-trained models include BERT, GPT-3, and so on. On the other hand, for computer vision, models trained on ImageNet, such as VGG-19, can be leveraged.

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It provides pre-trained deep bidirectional representations for natural language. Its architecture is based on the Transformer, and it can be used to solve different language tasks by leveraging transfer learning. Moreover, it was trained on a large corpus of unlabelled text, including the entire Wikipedia and the Book Corpus.
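As a quick illustration of transfer learning with BERT, the sketch below loads `bert-base-uncased` through the Hugging Face `transformers` library and attaches a two-class sentiment head; the label count and the example sentence are assumptions for illustration.

```python
# A minimal sketch of reusing pre-trained BERT for a downstream sentiment task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed binary sentiment labels
)

inputs = tokenizer("This book was a pleasant surprise.", return_tensors="pt")
outputs = model(**inputs)       # fine-tune these logits on labeled reviews
print(outputs.logits.shape)     # torch.Size([1, 2])
```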

Domain Adaptation in Federated Learning

Clients in Federated Learning (FL) are likely to own data from different domains. The quality of the data collected at an edge is constrained by its surrounding environment, such as writing style and topics (natural language), or lighting, angle, and distance (images). It is critical to properly transfer knowledge in FL while alleviating the influence of negative transfer (obtained domain knowledge that degrades the target domain's performance). Such learning algorithms could provide sufficient personalization ability for different use cases at the edge.

Disentanglement

The ability to understand high-dimensional data and to distill that knowledge into useful representations in an unsupervised manner remains a key challenge in deep learning. One approach to this challenge is disentangled representations: models that capture the independent factors of a given scene in such a way that if one factor changes, the others remain unaffected.


Dataset

Digit-Five

Digit-Five is a collection of five of the most popular digit datasets: MNIST (mt) [2], MNIST-M (mm) [3], Synthetic Digits (syn) [4], SVHN (sv), and USPS (up).

DomainNet

DomainNet [5] consists of six domains of object images: clipart (clp), infograph (inf), painting (pnt), quickdraw (qdr), real (rel), and sketch (skt). This dataset includes 345 categories of objects in total.

Amazon Review Dataset

The task is to identify whether the sentiment of a review is positive or negative. The Amazon Review dataset [6] includes reviews from four popular merchandise categories: Books (B), DVDs (D), Electronics (E), and Kitchen & housewares (K).


Related Work


Domain Adversarial Neural Network (DANN) [7]

Target Task

Approach

Fig.1 - DANN includes a deep feature extractor (green) and a deep label predictor (blue), which together form a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor.

At training time, in order to obtain domain-invariant features, we seek the parameters $\theta_f$ of the feature mapping that maximize the loss of the domain classifier (making the two feature distributions as similar as possible), while simultaneously seeking the parameters $\theta_d$ of the domain classifier that minimize the loss of the domain classifier. In addition, the parameters $\theta_y$ of the label predictor are sought to minimize the label prediction loss on the source samples.

$$E(\theta_f, \theta_y, \theta_d) = \sum_{i=1,\dots,N \,|\, d_i = 0} L_y^i(\theta_f, \theta_y) - \lambda \sum_{i=1,\dots,N} L_d^i(\theta_f,\theta_d)$$ $$(\hat{\theta}_f,\hat{\theta}_y)=\underset{\theta_f,\theta_y}{\mbox{argmin}}\,E(\theta_f, \theta_y,\hat{\theta}_d)$$ $$\hat{\theta}_d=\underset{\theta_d}{\mbox{argmax}}\,E(\hat{\theta}_f,\hat{\theta}_y,\theta_d)$$ where $L_y(\cdot,\cdot)$ is the loss for label prediction and $L_d(\cdot,\cdot)$ is the loss for domain classification.

In order to alleviate the noisy signal from the domain classifier at the early stages of the training procedure, instead of fixing the adaptation factor $\lambda$, we gradually change it from 0 to 1 using the following schedule: $$\lambda_p = \frac{2}{1+\exp(-\gamma \cdot p)}-1,$$ where $p$ is the training progress linearly increasing from 0 to 1 and $\gamma$ was set to 10.
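The following is a minimal PyTorch sketch of the gradient reversal trick and the $\lambda_p$ schedule described above; the network sizes, layer names, and loss wiring are illustrative assumptions rather than the exact DANN architecture.

```python
# A minimal sketch of DANN-style adversarial domain adaptation in PyTorch.
import numpy as np
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def adaptation_factor(p, gamma=10.0):
    """lambda_p schedule from the formula above; p is training progress in [0, 1]."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0

feature_extractor = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # green (assumed sizes)
label_predictor   = nn.Linear(256, 10)                               # blue
domain_classifier = nn.Linear(256, 2)                                # red

def dann_loss(x_src, y_src, x_tgt, p):
    lambd = adaptation_factor(p)
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)
    # Label prediction loss L_y on source samples only (d_i = 0).
    loss_y = nn.functional.cross_entropy(label_predictor(f_src), y_src)
    # Domain classification loss L_d on both domains; the reversed gradient makes the
    # feature extractor maximize it while the domain classifier minimizes it.
    feats = torch.cat([f_src, f_tgt])
    domains = torch.cat([torch.zeros(len(f_src)), torch.ones(len(f_tgt))]).long()
    loss_d = nn.functional.cross_entropy(
        domain_classifier(GradReverse.apply(feats, lambd)), domains
    )
    return loss_y + loss_d
```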

Maximum Mean Discrepancy (MMD)

How to measure the distribution difference or the similarity between domains effectively is an important issue.

The measurement termed Maximum Mean Discrepancy (MMD) [11] is widely used in the field of transfer learning. MMD quantifies the distribution difference by calculating the distance of the mean values of the instances in a Reproducing Kernel Hilbert Space (RKHS). Roughly speaking, if two functions $f$ and $g$ in the RKHS are close in norm, i.e., $\|f-g\|$ is small, then $f$ and $g$ are also pointwise close, i.e., $|f(x)-g(x)|$ is small for all $x$. MMD is formulated as follows:

$$ \mbox{MMD}(X^S,X^T)=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|_\mathcal{H}, $$

where $\Phi$ is a feature map $X\rightarrow \mathcal{H}$ and $\mathcal{H}$ is called a reproducing kernel Hilbert space.

Let $k(x,y)=\langle\Phi(x),\Phi(y)\rangle_\mathcal{H}$ be the Gaussian kernel. MMD using the Gaussian kernel can be formulated as follows:

$$\mbox{MMD}^2(X^S,X^T)=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|^2_\mathcal{H}$$ $$=\frac{1}{(n^S)^2}\sum_{i=1}^{n^S}\sum_{i'=1}^{n^S}k(x_i^S,x_{i'}^S)+\frac{1}{(n^T)^2}\sum_{j=1}^{n^T}\sum_{j'=1}^{n^T}k(x_j^T,x_{j'}^T)-\frac{2}{n^S n^T}\sum_{i=1}^{n^S}\sum_{j=1}^{n^T}k(x_i^S,x_j^T)$$
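As a sanity check of this expansion, the sketch below computes the squared MMD with a Gaussian kernel in PyTorch; the bandwidth $\sigma$ and the toy feature matrices are assumptions for illustration.

```python
# A minimal sketch of the squared MMD with a Gaussian kernel, following the expansion above.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows in x and y."""
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Squared MMD between source samples xs (n_S x d) and target samples xt (n_T x d)."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()   # (1/n_S^2) * sum_i sum_i' k(x_i^S, x_i'^S)
    k_tt = gaussian_kernel(xt, xt, sigma).mean()   # (1/n_T^2) * sum_j sum_j' k(x_j^T, x_j'^T)
    k_st = gaussian_kernel(xs, xt, sigma).mean()   # (1/(n_S n_T)) * sum_i sum_j k(x_i^S, x_j^T)
    return k_ss + k_tt - 2 * k_st

xs = torch.randn(100, 16)          # toy source features
xt = torch.randn(100, 16) + 0.5    # toy shifted target features
print(mmd2(xs, xt).item())
```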






Further reading: Feature Distribution Matching for Federated Domain Generalization [Sun et al. arXiv:2203.11635 2022]



[1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proceedings of the 31st International Conference on Machine Learning, ICML 2014.

[2] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online], 2010.

[3] Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML 2015.

[4] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in ICCV 2019.

[5] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification,” in ACL 2007.

[6] Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML 2015.

[7] X. Peng, Z. Huang, Y. Zhu, and K. Saenko, “Federated adversarial domain adaptation,” in ICLR 2020.

[8] C. Yao, B. Gong, Y. Cui, H. Qi, Y. Zhu, and M. Yang, “Federated multi-target domain adaptation,” CoRR, vol. abs/2108.07792, 2021.

[9] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large cnns at the edge,” in NeurIPS 2020.

[10] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” in NIPS 2006.

[11] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” in Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology, 2006.