Domain Shift and Transfer Learning

Created by Yuwei Sun

Transfer Learning

A domain $D$ is defined as a two-element tuple consisting of the feature space of the input data $X$ and the marginal probability distribution $P(X)$, i.e., $D=\{X,P(X)\}$. Given a source domain $D_S$ with a corresponding source task $T_S$, as well as a target domain $D_T$ and a target task $T_T$, the objective of transfer learning is to learn the target conditional probability distribution $P(Y_T|X_T)$ in $D_T$ using the information gained from $D_S$ and $T_S$, where $D_S\neq D_T$. Notably, domain adaptation is one type of transfer learning, which aims to solve the target task $T_T$ when the marginal probability distributions of the source and target domains are different, i.e., $P(X_S)\neq P(X_T)$. For instance, book reviews labeled as positive or negative differ in distribution from product-review sentiments in another category such as electronics.

To perform transfer learning, we usually utilize a pre-trained network without its last several layers as a feature extractor for a different task, because deep neural networks are layered architectures that learn different features at different layers. Generally, the lower layers learn highly transferable features, while transferability sharply decreases in the higher layers of a deep neural network [1]. It is better to use a pre-trained model that was trained on a huge dataset as a starting point; then, we can further train the model on a relatively small dataset for another task. This is known as model fine-tuning. There are mainly three fine-tuning techniques: 1) train the entire architecture; 2) train some layers while freezing others; 3) freeze the entire architecture. When a layer is frozen, its parameters become untrainable during model training.
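Below is a minimal PyTorch sketch of the three strategies, using a torchvision VGG-19 pre-trained on ImageNet; the 10-class replacement head and the specific layer indices are assumptions for illustration only.

```python
# A minimal sketch of the three fine-tuning strategies with a pre-trained VGG-19.
import torch.nn as nn
from torchvision import models

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# Replace the last classification layer for the new (assumed 10-class) target task.
model.classifier[6] = nn.Linear(4096, 10)

# 1) Train the entire architecture: leave every parameter trainable (the default).

# 2) Train some layers while freezing others, e.g., freeze the convolutional
#    feature extractor and only fine-tune the classifier head.
for param in model.features.parameters():
    param.requires_grad = False

# 3) Freeze the entire architecture: only the newly added layer stays trainable.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier[6].parameters():
    param.requires_grad = True
```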

For natural language processing (NLP), large pre-trained models include BERT, GPT-3, and so on. On the other hand, for computer vision, models trained on ImageNet, such as VGG-19, can be leveraged.

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It provides pre-trained deep bidirectional representations for natural language. Its architecture is based on the Transformer, and it can be used to solve different language tasks by leveraging transfer learning. Moreover, it was trained on a large corpus of unlabelled text, including the entire Wikipedia and the Book Corpus.
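As a quick illustration of transfer learning with BERT, the sketch below loads `bert-base-uncased` through the Hugging Face `transformers` library and attaches a two-class sentiment head; the label count and the example sentence are assumptions for illustration.

```python
# A minimal sketch of reusing pre-trained BERT for a downstream sentiment task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumed binary sentiment labels
)

inputs = tokenizer("This book was a pleasant surprise.", return_tensors="pt")
outputs = model(**inputs)       # fine-tune these logits on labeled reviews
print(outputs.logits.shape)     # torch.Size([1, 2])
```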

Domain Adaptation in Federated Learning

Clients in Federated Learning (FL) are likely to own data from different domains. The quality of the data collected at an edge is constrained by its surrounding environment, such as writing style and topics (natural language), or lighting, angle, and distance (images). It is critical to properly transfer knowledge in FL while alleviating the influence of negative transfer (obtained domain knowledge that degrades the target domain's performance). Such learning algorithms could provide sufficient personalization ability for different use cases at the edge.

Disentanglement

The ability to understand high-dimensional data and to distill that knowledge into useful representations in an unsupervised manner remains a key challenge in deep learning. One approach to this challenge is disentangled representations: models that capture the independent factors of a given scene in such a way that if one factor changes, the others remain unaffected.


Dataset

Digit-Five

Digit-Five is a collection of five of the most popular digit datasets: MNIST (mt) [2], MNIST-M (mm) [3], Synthetic Digits (syn) [4], SVHN (sv), and USPS (up).

DomainNet

DomainNet [5] consists of six domains of object images: clipart (clp), infograph (inf), painting (pnt), quickdraw (qdr), real (rel), and sketch (skt). This dataset includes 345 categories of objects in total.

Amazon Review Dataset

The task is to identify whether the sentiment of a review is positive or negative. The Amazon Review dataset [6] includes reviews from four popular merchandise categories: Books (B), DVDs (D), Electronics (E), and Kitchen & housewares (K).


Related Work


Domain Adversarial Neural Network (DANN) [7]

Target Task

Approach

Fig.1 - DANN includes a deep feature extractor (green) and a deep label predictor (blue), which together form a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor.

At training time, in order to obtain domain-invariant features, we seek the parameters $\theta_f$ of the feature mapping that maximize the loss of the domain classifier (making the two feature distributions as similar as possible), while simultaneously seeking the parameters $\theta_d$ of the domain classifier that minimize the loss of the domain classifier. In addition, the parameters $\theta_y$ of the label predictor are sought to minimize the label prediction loss on the source samples.

$$E(\theta_f, \theta_y, \theta_d) = \sum_{i=1,\dots,N \,|\, d_i = 0} L_y^i(\theta_f, \theta_y) - \lambda \sum_{i=1,\dots,N} L_d^i(\theta_f,\theta_d)$$ $$(\hat{\theta}_f,\hat{\theta}_y)=\underset{\theta_f,\theta_y}{\mbox{argmin}}\,E(\theta_f, \theta_y,\hat{\theta}_d)$$ $$\hat{\theta}_d=\underset{\theta_d}{\mbox{argmax}}\,E(\hat{\theta}_f,\hat{\theta}_y,\theta_d)$$ where $L_y(\cdot,\cdot)$ is the loss for label prediction and $L_d(\cdot,\cdot)$ is the loss for domain classification.

In order to alleviate the noisy signal from the domain classifier at the early stages of the training procedure, instead of fixing the adaptation factor $\lambda$, we gradually change it from 0 to 1 using the following schedule: $$\lambda_p = \frac{2}{1+\exp(-\gamma \cdot p)}-1,$$ where $p$ is the training progress linearly increasing from 0 to 1 and $\gamma$ was set to 10.
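The following is a minimal PyTorch sketch of the gradient reversal trick and the $\lambda_p$ schedule described above; the network sizes, layer names, and loss wiring are illustrative assumptions rather than the exact DANN architecture.

```python
# A minimal sketch of DANN-style adversarial domain adaptation in PyTorch.
import numpy as np
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def adaptation_factor(p, gamma=10.0):
    """lambda_p schedule from the formula above; p is training progress in [0, 1]."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0

feature_extractor = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # green (assumed sizes)
label_predictor   = nn.Linear(256, 10)                               # blue
domain_classifier = nn.Linear(256, 2)                                # red

def dann_loss(x_src, y_src, x_tgt, p):
    lambd = adaptation_factor(p)
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)
    # Label prediction loss L_y on source samples only (d_i = 0).
    loss_y = nn.functional.cross_entropy(label_predictor(f_src), y_src)
    # Domain classification loss L_d on both domains; the reversed gradient makes the
    # feature extractor maximize it while the domain classifier minimizes it.
    feats = torch.cat([f_src, f_tgt])
    domains = torch.cat([torch.zeros(len(f_src)), torch.ones(len(f_tgt))]).long()
    loss_d = nn.functional.cross_entropy(
        domain_classifier(GradReverse.apply(feats, lambd)), domains
    )
    return loss_y + loss_d
```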

Maximum Mean Discrepancy (MMD)

How to measure the distribution difference or the similarity between domains effectively is an important issue.

The measurement termed Maximum Mean Discrepancy (MMD) [11] is widely used in the field of transfer learning. MMD quantifies the distribution difference by calculating the distance of the mean values of the instances in a Reproducing Kernel Hilbert Space (RKHS). Roughly speaking, if two functions $f$ and $g$ in the RKHS are close in norm, i.e., $\|f-g\|$ is small, then $f$ and $g$ are also pointwise close, i.e., $|f(x)-g(x)|$ is small for all $x$. MMD is formulated as follows:

$$ \mbox{MMD}(X^S,X^T)=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|_\mathcal{H}, $$

where $\Phi$ is a feature map $X\rightarrow \mathcal{H}$ and $\mathcal{H}$ is called a reproducing kernel Hilbert space.

Let $k(x,y)=\langle\Phi(x),\Phi(y)\rangle_\mathcal{H}$ be the Gaussian kernel. MMD using the Gaussian kernel can be formulated as follows:

$$\mbox{MMD}^2(X^S,X^T)=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|^2_\mathcal{H}$$ $$=\frac{1}{(n^S)^2}\sum_{i=1}^{n^S}\sum_{i'=1}^{n^S}k(x_i^S,x_{i'}^S)+\frac{1}{(n^T)^2}\sum_{j=1}^{n^T}\sum_{j'=1}^{n^T}k(x_j^T,x_{j'}^T)-\frac{2}{n^S n^T}\sum_{i=1}^{n^S}\sum_{j=1}^{n^T}k(x_i^S,x_j^T)$$
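As a sanity check of this expansion, the sketch below computes the squared MMD with a Gaussian kernel in PyTorch; the bandwidth $\sigma$ and the toy feature matrices are assumptions for illustration.

```python
# A minimal sketch of the squared MMD with a Gaussian kernel, following the expansion above.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows in x and y."""
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Squared MMD between source samples xs (n_S x d) and target samples xt (n_T x d)."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()   # (1/n_S^2) * sum_i sum_i' k(x_i^S, x_i'^S)
    k_tt = gaussian_kernel(xt, xt, sigma).mean()   # (1/n_T^2) * sum_j sum_j' k(x_j^T, x_j'^T)
    k_st = gaussian_kernel(xs, xt, sigma).mean()   # (1/(n_S n_T)) * sum_i sum_j k(x_i^S, x_j^T)
    return k_ss + k_tt - 2 * k_st

xs = torch.randn(100, 16)          # toy source features
xt = torch.randn(100, 16) + 0.5    # toy shifted target features
print(mmd2(xs, xt).item())
```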






Further reading: Feature Distribution Matching for Federated Domain Generalization [Sun et al. arXiv:2203.11635 2022]



[1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proceedings of the 31st International Conference on Machine Learning, ICML 2014.

[2] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online], 2010.

[3] Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML 2015.

[4] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in ICCV 2019.

[5] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification,” in ACL 2007.

[6] Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML 2015.

[7] X. Peng, Z. Huang, Y. Zhu, and K. Saenko, “Federated adversarial domain adaptation,” in ICLR 2020.

[8] C. Yao, B. Gong, Y. Cui, H. Qi, Y. Zhu, and M. Yang, “Federated multi-target domain adaptation,” CoRR, vol. abs/2108.07792, 2021.

[9] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large cnns at the edge,” in NeurIPS 2020.

[10] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” in NIPS 2006.

[11] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” in Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology, 2006.