A domain $D$ is defined as a two-element tuple consisting of the feature space of the input data $X$ and a marginal probability distribution $P(X)$, i.e., $D=\{X,P(X)\}$. Given a source domain $D_S$ with a corresponding source task $T_S$, and a target domain $D_T$ with a target task $T_T$, the objective of transfer learning is to learn the target conditional probability distribution $P(Y_T|X_T)$ in $D_T$ using the information gained from $D_S$ and $T_S$, where $D_S\neq D_T$. Notably, domain adaptation is one type of transfer learning, which aims to solve the target task $T_T$ when the marginal probability distributions of the source and target domains differ, i.e., $P(X_S)\neq P(X_T)$. For instance, a corpus of book reviews labeled as positive or negative follows a different distribution from a corpus of reviews of other products, such as electronics.
To perform transfer learning, we usually use a pre-trained network without its last several layers as a feature extractor for a different task, because deep neural networks are layered architectures that learn different features at different layers. They generally learn highly transferable features in the lower layers, while transferability decreases sharply in the higher layers [1]. It is better to use a model that was pre-trained on a huge dataset as a starting point and then further train it on a relatively small dataset for another task. This is known as model fine-tuning. There are three main fine-tuning techniques: 1) train the entire architecture; 2) train some layers while freezing others; 3) freeze the entire architecture. When a layer is frozen, its parameters are not updated during model training.
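The three fine-tuning techniques differ only in which parameters receive updates. The following is a framework-free toy sketch, not a real training loop: the `Layer` class, the layer names, the gradients, and the learning rate are all made up for illustration. In a framework such as PyTorch, freezing corresponds to setting `requires_grad` to `False` on the relevant parameters.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """A hypothetical layer reduced to a single scalar weight."""
    name: str
    weight: float
    trainable: bool = True


def set_trainable(layers, trainable_names):
    """Mark only the named layers as trainable; freeze all others."""
    for layer in layers:
        layer.trainable = layer.name in trainable_names


def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers."""
    for layer in layers:
        if layer.trainable:
            layer.weight -= lr * grads[layer.name]


def make_model():
    """Stand-in for loading a pre-trained model (all weights start at 1.0)."""
    return [Layer("lower", 1.0), Layer("middle", 1.0), Layer("head", 1.0)]


# Made-up gradients, identical for every layer.
grads = {"lower": 1.0, "middle": 1.0, "head": 1.0}

strategies = {
    "train_all": {"lower", "middle", "head"},  # 1) train the entire architecture
    "train_head_only": {"head"},               # 2) train some layers, freeze others
    "freeze_all": set(),                       # 3) freeze the entire architecture
}

results = {}
for label, trainable_names in strategies.items():
    model = make_model()
    set_trainable(model, trainable_names)
    sgd_step(model, grads)
    results[label] = [layer.weight for layer in model]
    print(label, results[label])
```

Only the layers left trainable move away from their pre-trained value of 1.0; frozen layers keep it.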
For natural language processing (NLP), large pre-trained models include BERT and GPT-3. For computer vision, models trained on ImageNet, such as VGG-19, can be leveraged.
BERT stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained deep bidirectional representation model for natural language. Its architecture is based on the Transformer, and it can be used to solve different language tasks by leveraging transfer learning. It was trained on a large corpus of unlabelled text, including the entire English Wikipedia and BookCorpus.
Clients in Federated Learning (FL) are likely to own data from different domains. The quality of the data collected at an edge is constrained by its surrounding environment, such as writing style and topic (natural language) or lighting, angle, and distance (images). It is critical to transfer knowledge properly in FL, alleviating the influence of negative transfer (transferred domain knowledge that degrades performance on the target domain). Such learning algorithms could provide sufficient personalization for different use cases at the edge.
The ability to understand high-dimensional data and to distill that knowledge into useful representations in an unsupervised manner remains a key challenge in deep learning. One approach to this challenge is disentangled representations: models that capture the independent features of a given scene in such a way that if one feature changes, the others remain unaffected.
Digit-Five is a collection of five popular digit datasets: MNIST (mt) [2], MNIST-M (mm) [3], Synthetic Digits (syn) [4], SVHN (sv), and USPS (up).
DomainNet [5] consists of six domains of object images: clipart (clp), infograph (inf), painting (pnt), quickdraw (qdr), real (rel), and sketch (skt). The dataset includes 345 object categories in total.
The Amazon Review [6] dataset includes reviews from four popular merchandise categories: Books (B), DVDs (D), Electronics (E), and Kitchen & housewares (K). The task is to identify whether the sentiment of a review is positive or negative.
At training time, in order to obtain domain-invariant features, we seek the parameters $\theta_f$ of the feature mapping that maximize the loss of the domain classifier (making the two feature distributions as similar as possible), while simultaneously seeking the parameters $\theta_d$ of the domain classifier that minimize the loss of the domain classifier.
$$E(\theta_f, \theta_y, \theta_d) = \sum_{i=1,\dots,N \,|\, d_i = 0} L_y^i(\theta_f, \theta_y) - \lambda \sum_{i=1,\dots,N} L_d^i(\theta_f,\theta_d)$$

$$(\hat{\theta}_f,\hat{\theta}_y)=\underset{\theta_f,\theta_y}{\mbox{argmin}}\,E(\theta_f, \theta_y,\hat{\theta}_d)$$

$$\hat{\theta}_d=\underset{\theta_d}{\mbox{argmax}}\,E(\hat{\theta}_f,\hat{\theta}_y,\theta_d)$$

where $L_y(\cdot,\cdot)$ is the loss for label prediction and $L_d(\cdot,\cdot)$ is the loss for domain classification. The first sum runs only over examples with domain label $d_i = 0$ (the labeled source examples), while the second runs over all $N$ examples. To reduce the noisy signal from the domain classifier in the early stages of training, instead of fixing the adaptation factor $\lambda$, we gradually change it from 0 to 1 using the following schedule:

$$\lambda_p = \frac{2}{1+\exp(-\gamma \cdot p)}-1,$$

where $p$ is the training progress, increasing linearly from 0 to 1, and $\gamma$ was set to 10.
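The schedule for the adaptation factor and the way it weights the two loss terms can be sketched in a few lines of Python. This is a minimal illustration that assumes $p$ is the fraction of training completed. The function names, `total_steps`, and the loss values passed to `objective` in the usage note are hypothetical.

```python
import math


def adaptation_factor(p, gamma=10.0):
    """lambda_p = 2 / (1 + exp(-gamma * p)) - 1, for progress p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


def objective(label_losses, domain_losses, lam):
    """E = sum of label losses (source examples only)
    minus lam times the sum of domain losses (all examples)."""
    return sum(label_losses) - lam * sum(domain_losses)


total_steps = 5  # hypothetical number of training steps
schedule = [adaptation_factor(step / total_steps) for step in range(total_steps + 1)]
for step, lam in enumerate(schedule):
    print(f"step {step}: lambda = {lam:.4f}")
```

The factor starts at exactly 0, so the domain classifier has no influence on the feature extractor at the beginning of training, and it saturates close to 1 well before training ends. As an example with made-up losses, `objective([1.0, 2.0], [0.5, 0.5, 0.5, 0.5], 0.5)` evaluates to `2.0`.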
An important issue is how to effectively measure the distribution difference, or the similarity, between domains.
The measurement termed Maximum Mean Discrepancy (MMD) [11] is widely used in the field of transfer learning. MMD quantifies the distribution difference by calculating the distance between the mean values of the instances after mapping them into a Reproducing Kernel Hilbert Space (RKHS). Roughly speaking, if two functions $f$ and $g$ in the RKHS are close in norm, i.e., $\|f-g\|$ is small, then $f$ and $g$ are also pointwise close, i.e., $|f(x)-g(x)|$ is small for all $x$. MMD is formulated as follows:
$$ \mbox{MMD}(X^S,X^T)=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|_\mathcal{H}, $$

where $\Phi$ is a feature map $X\rightarrow \mathcal{H}$ and $\mathcal{H}$ is a reproducing kernel Hilbert space.
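When the feature map $\Phi$ is explicit and finite-dimensional, the definition can be evaluated directly as the Euclidean distance between the two mean embeddings. The sketch below assumes the identity map as $\Phi$, purely for illustration. The function names and the sample points are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def mmd_explicit(source, target, feature_map=lambda x: x):
    """Distance between the mean embeddings of two samples.

    feature_map plays the role of Phi; the default identity map is an
    assumption for illustration, not a recommended choice.
    """
    mu_s = mean_vector([feature_map(x) for x in source])
    mu_t = mean_vector([feature_map(x) for x in target])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))


source = [[0.0, 0.0], [2.0, 0.0]]  # mean is (1, 0)
target = [[1.0, 3.0], [1.0, 5.0]]  # mean is (1, 4)
print(mmd_explicit(source, target))  # 4.0
```

With the identity map, MMD only compares the sample means, so two distributions with equal means but different shapes would be indistinguishable. Richer feature maps, usually specified implicitly through a kernel, avoid this limitation.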
Let $k(x,y)=\langle\Phi(x),\Phi(y)\rangle_\mathcal{H}$ be the Gaussian kernel. MMD using the Gaussian kernel can be formulated as follows:
\begin{equation}
\begin{aligned}
\mbox{MMD}^2(X^S,X^T)
&=\Bigg\| \frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S)-\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Bigg\|^2_\mathcal{H} \\
&=\Big\langle\frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S), \frac{1}{n^S}\sum_{i'=1}^{n^S}\Phi(x_{i'}^{S})\Big\rangle_\mathcal{H}
+ \Big\langle\frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T), \frac{1}{n^T}\sum_{j'=1}^{n^T}\Phi(x_{j'}^{T})\Big\rangle_\mathcal{H}
- 2\Big\langle\frac{1}{n^S}\sum_{i=1}^{n^S}\Phi(x_i^S), \frac{1}{n^T}\sum_{j=1}^{n^T}\Phi(x_j^T)\Big\rangle_\mathcal{H} \\
&=\frac{1}{(n^S)^2}\sum_{i=1}^{n^S}\sum_{i'=1}^{n^S}k(x_i^S,x_{i'}^{S})
+\frac{1}{(n^T)^2}\sum_{j=1}^{n^T}\sum_{j'=1}^{n^T}k(x_j^T,x_{j'}^{T})
-\frac{2}{n^S n^T}\sum_{i=1}^{n^S}\sum_{j=1}^{n^T}k(x_i^S,x_j^T)
\end{aligned}
\end{equation}
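With a kernel, the squared MMD can be computed without ever forming $\Phi$: average the kernel over all pairs within the source sample, over all pairs within the target sample, and over all cross pairs, then combine the three averages. The sketch below implements this biased empirical estimate in plain Python. The Gaussian kernel form $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$ with bandwidth `sigma=1.0` is an assumption, since the text does not fix a bandwidth, and the sample points are made up.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average of k(x, y) over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, sigma=1.0):
    """Biased empirical estimate of MMD^2 (self-pairs are included)."""
    return (mean_kernel(source, source, sigma)
            + mean_kernel(target, target, sigma)
            - 2.0 * mean_kernel(source, target, sigma))


source = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
near = [[0.1, 0.0], [0.5, 0.1], [0.0, 0.4]]  # slightly perturbed copy
far = [[3.0, 3.0], [3.5, 3.0], [3.0, 3.5]]   # shifted copy

print(mmd_squared(source, source))
print(mmd_squared(source, near))
print(mmd_squared(source, far))
```

The estimate is zero for identical samples and grows as the samples move apart. The pairwise sums cost $O((n^S+n^T)^2)$ kernel evaluations, so practical implementations typically vectorize this computation over mini-batches.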