Parametric Methods for Meta-Learning
Black-Box Adaptation
These approaches treat step 1 as an inference problem: we train a neural network to represent \(p(\phi_i \mid \mathcal{D}^{tr}, \theta)\), i.e. a way to estimate \(\phi_i\), which is then used as the parameters to optimize for a new task. The deterministic way to go about it is to take point estimates
\[\phi_i = f_\theta (\mathcal{D}^{tr}_i)\]Here \(f_\theta(\cdot)\) is a neural network parameterized by \(\theta\) that takes the training data as input, sequentially or in batches, and outputs the task-specific parameters \(\phi_i\), which are then used by another neural network \(g_{\phi_i}(\cdot)\) to predict the outputs on a new dataset. Thus, we can essentially treat this as a supervised learning problem with our optimization being
\[\begin{aligned} & \max_\theta \sum_{\mathcal{T_i}} \sum_{(x,y) \sim \mathcal{D_i}^{test}} \log g_{\phi_i} (y\mid x) \\ = & \max_\theta \sum_{\mathcal{T_i}} \mathcal{L}(f_\theta(\mathcal{D^{tr}_i}), \mathcal{D_i^{test}}) \end{aligned}\]To make this more tractable, \(\phi\) can be replaced by a sufficient statistic \(h_i\) instead of the full parameter vector. Architectures that work well with this approach include LSTMs, as shown in the work of Santoro et al., feedforward networks with averaging, as shown by Ramalho et al., and architectures with inner task learners and outer meta-learners, i.e. Meta Networks by Munkhdalai et al. I am personally fascinated by the use of transformer architectures in this domain. The advantage of this approach is that it is expressive and easy to combine with other techniques like supervised learning and reinforcement learning. However, the optimization is challenging, and it is not the best solution from the outset for every kind of problem. Thus, our step-by-step approach would be:
 Sample Task \(\mathcal{T}_i\) (a sequential stream or mini-batches)
 Sample Disjoint Datasets \(\mathcal{D^{tr}_i}\),\(\mathcal{D^{test}_i}\) from \(\mathcal{D}_i\)
 Compute \(\phi_i \leftarrow f_\theta(\mathcal{D^{tr}_i})\)
 Update \(\theta\) using \(\nabla_\theta\mathcal{L}(\phi_i, \mathcal{D^{test}_i})\)
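The four steps above can be sketched in a few lines of NumPy. This is a minimal, hypothetical setup: tasks are 1-D regressions \(y = a_i x\), \(f_\theta\) is a linear map applied to an averaged sufficient statistic \(h_i\), and \(g_{\phi_i}(x) = \phi_i x\). The names `sample_task` and `f_theta`, the task family, and the learning rate are all illustrative assumptions, not from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each task is 1-D regression y = a * x with a task-specific slope a."""
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_ts = rng.normal(size=10), rng.normal(size=10)
    return (x_tr, a * x_tr), (x_ts, a * x_ts)

def f_theta(theta, d_tr):
    """Black-box 'learner': phi_i = theta . h_i, where h_i is a sufficient
    statistic of the support set (here the least-squares slope and a bias)."""
    x, y = d_tr
    h = np.array([np.mean(x * y) / np.mean(x * x), 1.0])
    return theta @ h, h

theta, lr = np.zeros(2), 0.05
for step in range(500):
    d_tr, (x_ts, y_ts) = sample_task()           # steps 1-2: task + splits
    phi, h = f_theta(theta, d_tr)                # step 3: phi_i = f_theta(D_tr)
    resid = phi * x_ts - y_ts                    # g_phi(x) = phi * x
    grad_phi = np.mean(2 * resid * x_ts)         # dL/dphi on the test split
    theta -= lr * grad_phi * h                   # step 4: chain rule, dphi/dtheta = h

# after meta-training, phi should track the true slope of a new task
d_tr, (x_ts, y_ts) = sample_task()
phi, _ = f_theta(theta, d_tr)
print(float(np.mean((phi * x_ts - y_ts) ** 2)))  # should be near zero
```

The optimum here is \(\theta \approx [1, 0]\), which makes \(\phi_i\) exactly the per-task slope; a real black-box learner replaces the hand-built statistic with an LSTM or set encoder.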
Optimization-Based Approaches
This family treats the prediction of \(\phi_i\) as an optimization procedure and then differentiates through that optimization process to get a \(\phi_i\) that leads to good performance. The method can be summarized as maximizing the sum of the likelihood of observing the training data given \(\phi_i\) and the probability of \(\phi_i\) given our meta-parameters \(\theta\):
\[\max_{\phi_i} \log p(\mathcal{D^{tr}_i} \mid \phi_i ) + \log p(\phi_i \mid \theta)\]The second term of the above sum is our prior and the first is the likelihood. Thus, our next question is what form of prior might be useful. In deep learning, one good way to incorporate a prior is through initialization or fine-tuning. Thus, we can take \(\theta\) as pretrained parameters and run gradient descent from them:
\[\phi \leftarrow \theta - \alpha \nabla_\theta \mathcal{L} (\theta, \mathcal{D^{tr}})\]One popular way to do this for image classification is to take a feature extractor pretrained on a dataset like ImageNet and then fine-tune its output for our problem. The aim in optimization-based approaches is to reach a sweet spot in the multidimensional parameter space \(\mathbf{\Phi} = \{\phi_1, \phi_2, \dots, \phi_n\}\) from which a few gradient steps adapt well to every task; because this recipe makes no assumptions about the model beyond it being trainable with gradient descent, it is called Model-Agnostic Meta-Learning (MAML). Thus, our procedure now becomes
 Sample Task \(\mathcal{T}_i\) (a sequential stream or mini-batches)
 Sample Disjoint Datasets \(\mathcal{D^{tr}_i}\),\(\mathcal{D^{test}_i}\) from \(\mathcal{D}_i\)
 Optimize \(\phi_i \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D^{tr}_i})\)
 Update \(\theta\) using \(\nabla_\theta\mathcal{L}(\phi_i, \mathcal{D^{test}_i})\)
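On a toy 1-D regression family, this loop can be written out exactly, including the second-order term in the outer update, because the inner loss is quadratic and its gradient and Hessian are available in closed form. Everything here (the task family, learning rates, and the deliberately bad initialization) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.1, 0.05      # inner and outer learning rates

def sample_task():
    """1-D regression y = a * x; the slope a varies across tasks."""
    a = rng.uniform(0.5, 1.5)
    x_tr, x_ts = rng.normal(size=20), rng.normal(size=20)
    return (x_tr, a * x_tr), (x_ts, a * x_ts)

def grad(w, x, y):
    """dL/dw for the inner loss L(w) = mean((w*x - y)^2)."""
    return 2 * np.mean(x * (w * x - y))

theta = -2.0                 # deliberately bad initialization
for step in range(2000):
    (x_tr, y_tr), (x_ts, y_ts) = sample_task()
    phi = theta - alpha * grad(theta, x_tr, y_tr)    # inner adaptation step
    # outer gradient via the chain rule: dL_ts/dtheta = dL_ts/dphi * (1 - alpha * H_tr),
    # where H_tr = 2 * mean(x_tr^2) is the Hessian of the quadratic inner loss
    hess_tr = 2 * np.mean(x_tr ** 2)
    theta -= beta * grad(phi, x_ts, y_ts) * (1 - alpha * hess_tr)

print(theta)   # ends up near the centre of the task distribution
```

The meta-learned \(\theta\) settles near the middle of the slope range, so a single inner gradient step adapts it well to any sampled task.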
For our optimization process, let's define our final task-specific parameters as
\[\phi = u(\theta, \mathcal{D^{tr}})\]And now, our optimization target becomes
\[\begin{aligned} & \min_\theta \mathcal{L}(\phi, \mathcal{D^{test}}) \\ = & \min_\theta \mathcal{L} \big (u(\theta, \mathcal{D^{tr}}), \mathcal{D^{test}} \big) \end{aligned}\]This optimization can be achieved by differentiating our loss w.r.t. the meta-parameters \(\theta\), applying the chain rule through \(\phi\):
\[\frac{d\mathcal{L} (\phi, \mathcal{D^{test}} ) }{d \theta} = \nabla _{\bar{\phi}} \mathcal{L} (\bar{\phi}, \mathcal{D^{test}} ) \big|_{\bar{\phi} = u(\theta, \mathcal{D^{tr}}) } \, d_\theta \big ( u(\theta, \mathcal{D^{tr}} ) \big )\]Now, if we use our gradient-descent update for \(u(\cdot)\), we get:
\[\begin{aligned} & u(\theta, \mathcal{D^{tr}} ) = \theta - \alpha \,\, d_\theta \big( L(\theta, \mathcal{D^{tr}}) \big ) \\ \implies & d_\theta \big ( u(\theta, \mathcal{D^{tr}} ) \big ) = \mathbf{1} - \alpha \, d^2_\theta \big (L(\theta, \mathcal{D^{tr}}) \big ) \end{aligned}\]Thus, when we substitute the Hessian into the derivative equation, we get:
\[\begin{aligned} \frac{d\mathcal{L} (\phi, \mathcal{D^{test}} ) }{d \theta} & = \bigg (\nabla _{\bar{\phi}} \mathcal{L} (\bar{\phi}, \mathcal{D^{test}} ) \big|_{\bar{\phi} = u(\theta, \mathcal{D^{tr}}) } \bigg ) \cdot \bigg ( \mathbf{1} - \alpha \, d^2_\theta \big (L(\theta, \mathcal{D^{tr}}) \big ) \bigg ) \\ & = \nabla _{\bar{\phi}} \mathcal{L} (\bar{\phi}, \mathcal{D^{test}} ) \big|_{\bar{\phi} = u(\theta, \mathcal{D^{tr}}) } - \alpha\,\, \bigg( \nabla _{\bar{\phi}} \mathcal{L} (\bar{\phi}, \mathcal{D^{test}} ) \cdot d^2_\theta \big (L(\theta, \mathcal{D^{tr}}) \big ) \bigg ) \big|_{\bar{\phi} = u(\theta, \mathcal{D^{tr}}) } \end{aligned}\]We now have a vector-Hessian product on the right, which is far cheaper to compute than the full Hessian of the network, so this process is tractable. One really interesting thing that comes out of this is that we can also view this model-agnostic approach and the optimization update as a computation graph! Thus, we can say
\[\phi_i = \theta - f(\theta, \mathcal{D_i^{tr}}, \nabla_\theta \mathcal{L} )\]Now, we can train an ANN to output the update \(f(\cdot)\), which allows us to mix the optimization procedure with the black-box adaptation process. Moreover, MAML approaches show better performance on the Omniglot dataset, since they optimize for model-agnostic points. It has been shown by Finn and Levine that MAML can approximate any function of \(\mathcal{D_i^{tr}}\) and \(x^{ts}\) given:
 Nonzero \(\alpha\)
 Loss function gradient does not lose information about the label
 Datapoints in \(\mathcal{D_i^{tr}}\) are unique
Thus, MAML is able to inject inductive bias without losing expressivity.
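As a sanity check, the meta-gradient formula above, including the Hessian term, can be verified against finite differences. The setup below is an illustrative toy (a single 1-D regression task with a quadratic inner loss, so gradient and Hessian are closed-form and the check should agree to floating-point precision):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1
x_tr, x_ts = rng.normal(size=15), rng.normal(size=15)
a = 1.3                                  # true task slope
y_tr, y_ts = a * x_tr, a * x_ts

def loss(w, x, y):  return np.mean((w * x - y) ** 2)
def grad(w, x, y):  return 2 * np.mean(x * (w * x - y))

def meta_loss(theta):
    """L(u(theta, D_tr), D_ts) with u a single inner gradient step."""
    phi = theta - alpha * grad(theta, x_tr, y_tr)
    return loss(phi, x_ts, y_ts)

theta = -0.4
# analytic meta-gradient: grad_phi L_ts * (1 - alpha * Hessian_tr)
phi = theta - alpha * grad(theta, x_tr, y_tr)
hess_tr = 2 * np.mean(x_tr ** 2)
analytic = grad(phi, x_ts, y_ts) * (1 - alpha * hess_tr)

# finite-difference check of d(meta_loss)/d(theta)
eps = 1e-5
numeric = (meta_loss(theta + eps) - meta_loss(theta - eps)) / (2 * eps)
print(analytic, numeric)   # the two should agree closely
```

Dropping the `(1 - alpha * hess_tr)` factor gives the first-order approximation of the meta-gradient, which is how the Hessian term's contribution can be measured in practice.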
Inference
To better understand why MAML works well, we need to look through a probabilistic lens again: the meta-parameters \(\theta\) induce prior knowledge into our system, so our learning objective is to maximize the probability of observing the data \(\mathcal{D}_i\) given our meta-parameters \(\theta\)
\[\max_\theta \log \prod_i p(\mathcal{D}_i \mid \theta )\]This can be expanded by marginalizing over the task-specific parameters \(\phi_i\): the likelihood of \(\mathcal{D_i}\) given \(\phi_i\), weighted by the probability of each \(\phi_i\) given our prior knowledge \(\theta\):
\[\max _\theta \prod_i \int p(\mathcal{D_i} \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i\]And now, we can approximate this integral using a Maximum A Posteriori (MAP) estimate \(\hat{\phi}_i\), so that
\[\max_\theta \log \prod_i p(\mathcal{D}_i \mid \theta ) \approx \max_\theta \log \prod_i p(\mathcal{D}_i \mid \hat{\phi}_i)\, p(\hat{\phi}_i \mid \theta)\]It has been shown that, for likelihoods that are Gaussian in \(\phi_i\), gradient descent with early stopping corresponds exactly to maximum a posteriori inference under a Gaussian prior centered at the initialization. The correspondence is exact in the linear case; in nonlinear cases, the quality of the approximation depends on the order of derivatives used. Thus, by limiting the computation to second derivatives, MAML is able to maintain a fairly good MAP inference estimate, and so MAML approximates hierarchical Bayesian inference. We can also use other kinds of priors, like:
 Explicit Gaussian prior: $$\phi \leftarrow \min_{\phi'} \mathcal{L} (\phi', \mathcal{D^{tr}}) + \frac{\lambda}{2} \|\theta - \phi'\|^2$$
 Bayesian linear regression on learned features
 Convex optimization on learned features
 Ridge or logistic regression
 Support Vector Machines
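The "ridge regression on learned features" idea can be sketched in a few lines: the inner-loop adaptation becomes a closed-form solve, which is cheap and easy to differentiate through. In this illustrative sketch, the meta-learned feature extractor is stood in for by fixed polynomial features, and all names (`features`, `ridge_head`, the task) are assumptions for the example only:

```python
import numpy as np

rng = np.random.default_rng(3)

def features(x):
    """Stand-in for a meta-learned feature extractor; in practice this
    would be a network whose weights are the meta-parameters theta."""
    return np.stack([x ** k for k in range(4)], axis=1)

def ridge_head(x_tr, y_tr, lam=1e-2):
    """Closed-form ridge regression on the support set:
    phi = (F'F + lam*I)^-1 F'y, i.e. the inner 'optimization' is analytic."""
    F = features(x_tr)
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ y_tr)

# one task: fit y = 2x + 0.5 from a small support set
x_tr = rng.uniform(-1, 1, size=30)
y_tr = 2.0 * x_tr + 0.5
phi = ridge_head(x_tr, y_tr)

x_ts = rng.uniform(-1, 1, size=30)
pred = features(x_ts) @ phi
print(float(np.mean((pred - (2.0 * x_ts + 0.5)) ** 2)))  # near zero
```

Because the head has a closed-form solution, the meta-gradient w.r.t. the feature extractor can flow through `np.linalg.solve` without any unrolled inner loop, which is the appeal of these convex-head approaches.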
Challenge 1: Choosing Architecture
The major bottleneck in this process is the inner gradient step, so we want to choose an architecture that is effective for this inner step. One idea, called AutoMeta, is to adapt progressive neural architecture search to find optimal architectures for meta-learners, i.e. to combine AutoML with gradient-based meta-learning. The interesting results of this were:
 They found highly non-standard architectures, both deep and narrow
 They found architectures very different from the ones used for supervised learning
Challenge 2: Handling Instabilities
Another challenge comes from the instability that can arise from the complicated bi-level optimization procedure. One way of mitigating this is to learn the inner-loop learning-rate vector and then tune the outer learning rate:
 Meta-SGD (Meta-Stochastic Gradient Descent) is a meta-learner that can learn the initialization, the learner's update direction, and the learning rate, all in a single closed-loop process
 Alpha MAML incorporates an online hyperparameter adaptation scheme that eliminates the need to manually tune the two learning rates
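A scalar caricature of the learned-inner-learning-rate idea: meta-learn \(\alpha\) jointly with the initialization \(\theta\), using \(d\mathcal{L}^{ts}/d\alpha = \nabla_\phi \mathcal{L}^{ts} \cdot (-\nabla_\theta \mathcal{L}^{tr})\). This is an illustrative sketch on a toy regression family, not the actual Meta-SGD algorithm (which learns a per-parameter learning-rate vector):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_task():
    a = rng.uniform(0.5, 1.5)
    x_tr, x_ts = rng.normal(size=20), rng.normal(size=20)
    return (x_tr, a * x_tr), (x_ts, a * x_ts)

def grad(w, x, y):
    return 2 * np.mean(x * (w * x - y))

theta, alpha, beta = 0.0, 0.01, 0.01   # alpha is now meta-learned too
for step in range(3000):
    (x_tr, y_tr), (x_ts, y_ts) = sample_task()
    g_tr = grad(theta, x_tr, y_tr)
    phi = theta - alpha * g_tr                    # inner step with learned alpha
    g_ts = grad(phi, x_ts, y_ts)
    hess_tr = 2 * np.mean(x_tr ** 2)
    # meta-gradients w.r.t. both the initialization and the inner learning rate
    theta -= beta * g_ts * (1 - alpha * hess_tr)  # d(meta-loss)/d(theta)
    alpha -= beta * g_ts * (-g_tr)                # d(meta-loss)/d(alpha)

print(theta, alpha)   # alpha grows until one inner step nearly solves a task
```

On this quadratic toy problem, the learned \(\alpha\) drifts toward roughly the inverse of the inner-loss Hessian, i.e. the step size at which a single gradient step lands near the task optimum.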
Another idea is to optimize only a subset of parameters in the inner loop:
 DEML jointly learns a concept generator, a meta-learner, and a concept discriminator. The concept generator abstracts a representation of each training instance, the meta-learner performs few-shot learning on this representation, and the concept discriminator recognizes the concept pertaining to each instance
 CAVIA partitions the model parameters into context parameters, which serve as additional input to the model and are adapted on individual tasks, and shared parameters, which are meta-trained and shared across tasks. Thus, at test time only the context parameters need to be updated, which is a lower-dimensional search problem than adapting all the model parameters
In MAML++ the authors ablate the various ideas and issues of MAML and then propose a new framework that addresses these issues. Some significant points were decoupling the inner-loop learning rate from the outer updates, adding batch normalization at every step, and using greedy per-step updates.
Challenge 3: Managing Compute and Memory
The backpropagation through many inner gradient steps adds computational and memory overhead that is hard to deal with. One idea to mitigate this is to approximate the derivative of \(\phi_i\) w.r.t. \(\theta\) as the identity, as in first-order MAML. This is a crude approximation that works well for few-shot learning problems but fails in more complex problems like imitation learning. Another direction is to avoid differentiating through the inner loop at all by using the implicit function theorem. Let's take our function \(\phi\) to be the explicit Gaussian-prior formulation:
\[\phi = u(\theta, \mathcal{D^{tr}}) = \underset{\phi'}{\text{argmin}}\, \mathcal{L}(\phi', \mathcal{D^{tr}}) + \frac{\lambda}{2} \|\phi' - \theta \|^2\]Let our optimization function be
\[G(\phi', \theta ) = \mathcal{L}(\phi', \mathcal{D^{tr}}) + \frac{\lambda}{2} \|\phi' - \theta \|^2\]Finding the \(\text{argmin}\) of this function implies that its gradient w.r.t. \(\phi'\) is \(0\) at the minimizer, i.e.
\[\begin{aligned} & \nabla_{\phi'} G(\phi', \theta) \big|_{\phi' = \phi} = 0 \\ \implies & \nabla_\phi L(\phi) + \lambda(\phi - \theta ) = 0 \\ \implies & \phi = \theta - \frac{1}{\lambda} \nabla_\phi L(\phi) \end{aligned}\]Thus, differentiating both sides w.r.t. \(\theta\), our derivative becomes
\[\begin{aligned} & \frac{d \phi}{d \theta } = \mathbf{1} - \frac{1}{\lambda} \nabla_\phi^2 L(\phi) \frac{d \phi}{d \theta } \\ \therefore\,\,\,& \frac{d \phi}{d \theta } = \bigg [\mathbf{1} + \frac{1}{\lambda} \nabla_\phi^2 L(\phi) \bigg ] ^{-1} \end{aligned}\]Thus, we can compute this derivative without backpropagating through the inner optimization process; the only assumption is that our function \(G(\phi', \theta)\) has an \(\text{argmin}\) to begin with.
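We can check this implicit-gradient formula numerically. In this illustrative sketch the inner loss is quadratic, \(L(\phi) = \frac{1}{2}\phi^\top A \phi - b^\top \phi\), so the proximal problem has a closed-form argmin and the Jacobian can be compared against finite differences (the matrices `A`, `b` and \(\lambda\) are arbitrary assumptions for the test):

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0

# quadratic inner loss L(phi) = 0.5 phi'A phi - b'phi, so that the proximal
# problem argmin L(phi') + lam/2 ||phi' - theta||^2 is solvable in closed form
M = rng.normal(size=(3, 3))
A = M @ M.T + np.eye(3)          # symmetric positive-definite Hessian
b = rng.normal(size=3)

def inner_solution(theta):
    """argmin of G(phi', theta): set (A phi - b) + lam (phi - theta) = 0."""
    return np.linalg.solve(A + lam * np.eye(3), b + lam * theta)

theta = rng.normal(size=3)

# implicit-function-theorem Jacobian: [I + (1/lam) * Hessian]^{-1}
implicit = np.linalg.inv(np.eye(3) + A / lam)

# finite-difference Jacobian of the inner solution w.r.t. theta
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = eps
    numeric[:, j] = (inner_solution(theta + e) - inner_solution(theta - e)) / (2 * eps)

print(np.max(np.abs(implicit - numeric)))   # agrees up to floating point
```

Note that the formula only needs the Hessian of the inner loss at the solution, not the optimization path; for a neural network this inverse is applied via conjugate-gradient-style Hessian-vector products rather than formed explicitly.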