Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

How uncertainty is used to weigh the different losses in tasks where multiple losses are combined.

Summary

They learn the task uncertainty (homoscedastic uncertainty) and use it to weight the losses of the different tasks. So when training with multiple losses, such as $L = \sum_i w_i L_i$, the weight $w_i$ is not a hand-tuned hyperparameter as in traditional methods but a learnable uncertainty. They demonstrate this on a network that jointly learns semantic segmentation and depth regression, i.e. a classification task and a regression task.

Methodology

When a network learns multiple outputs, as in this case, the outputs have different units and dimensions. This makes it difficult to assign appropriate weights to the losses of the different outputs.

Homoscedastic Uncertainty: It is an aleatoric uncertainty that does not depend on the input data. It is not a model output, but rather a quantity that stays constant across all inputs and varies between different tasks.

Deriving losses using likelihood

Let $f^W(x)$ be the output of a neural network with weights $W$ on input $x$. For regression tasks, we define the likelihood as a Gaussian with the model output as its mean:

$$p(y \mid f^W(x)) = \mathcal{N}(f^W(x), \sigma^2)$$

where $\sigma^2$ is the observation noise (the task uncertainty in this case). Maximizing the log likelihood for this, we get:

$$\log p(y \mid f^W(x)) \propto -\frac{1}{2\sigma^2}\|y - f^W(x)\|^2 - \log \sigma$$
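A minimal PyTorch sketch of this regression term (the function and variable names are my own; learning $\log\sigma$ instead of $\sigma$ is a common trick to keep $\sigma$ positive):

```python
import torch

def gaussian_nll(y, y_pred, log_sigma):
    """Regression term from above: (1 / (2*sigma^2)) * ||y - f(x)||^2 + log(sigma).
    log_sigma is a learnable scalar; exp(-2 * log_sigma) = 1 / sigma^2."""
    precision = torch.exp(-2.0 * log_sigma)
    return 0.5 * precision * (y - y_pred).pow(2).sum() + log_sigma
```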

For classification we often squash the model output through a softmax function, and sample from the resulting probability vector:

$$p(y \mid f^W(x)) = \mathrm{Softmax}(f^W(x))$$

For the classification likelihood, they instead squash a scaled version of the model output through the softmax function:

$$p(y \mid f^W(x), \sigma) = \mathrm{Softmax}\!\left(\frac{1}{\sigma^2} f^W(x)\right)$$

This can be interpreted as a Boltzmann distribution (also called Gibbs distribution) where the input is scaled by $\sigma^2$ (often referred to as temperature). The log likelihood of this is given as

$$\log p(y = c \mid f^W(x), \sigma) = \frac{1}{\sigma^2} f^W_c(x) - \log \sum_{c'} \exp\!\left(\frac{1}{\sigma^2} f^W_{c'}(x)\right)$$

with $f^W_c(x)$ the $c$'th element of the vector $f^W(x)$.
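A small PyTorch sketch of this scaled softmax log likelihood (a hypothetical helper; `log_softmax` keeps the log-sum-exp numerically stable):

```python
import torch
import torch.nn.functional as F

def scaled_log_likelihood(logits, target, log_sigma):
    """log p(y = c | f(x), sigma) with the logits scaled by 1 / sigma^2,
    i.e. a Boltzmann/Gibbs distribution with temperature sigma^2."""
    precision = torch.exp(-2.0 * log_sigma)                # 1 / sigma^2
    log_probs = F.log_softmax(precision * logits, dim=-1)  # log Softmax(f(x) / sigma^2)
    return log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
```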

Now let $y_1, y_2$ be the model outputs corresponding to the regression and the classification task; then, assuming the tasks are independent:

$$p(y_1, y_2 \mid f^W(x)) = p(y_1 \mid f^W(x)) \cdot p(y_2 \mid f^W(x))$$

The loss is the negative log likelihood of the above, which is given by:

$$\begin{aligned} L(W, \sigma_1, \sigma_2) &= -\log p(y_1, y_2 = c \mid f^W(x)) \\ &= -\log \mathcal{N}(y_1; f^W(x), \sigma_1^2) \cdot \mathrm{Softmax}(y_2 = c; f^W(x), \sigma_2) \\ &\approx \frac{1}{2\sigma_1^2} L_1(W) + \frac{1}{2\sigma_2^2} L_2(W) + \log \sigma_1 + \log \sigma_2 \end{aligned}$$

where $L_1(W) = \|y_1 - f^W(x)\|^2$ is the Euclidean loss of $y_1$ and $L_2(W) = -\log \mathrm{Softmax}(y_2, f^W(x))$ is the cross-entropy loss of $y_2$.
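A minimal PyTorch sketch of this combined loss as a module, assuming one regression head and one classification head. The class and parameter names are my own, and following common practice the learnable parameter is $s_i = \log \sigma_i^2$, so that $1/(2\sigma_i^2) = \tfrac{1}{2} e^{-s_i}$ and $\log \sigma_i = \tfrac{1}{2} s_i$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyWeightedLoss(nn.Module):
    """Combined loss L(W, sigma_1, sigma_2) with learnable log-variances."""

    def __init__(self):
        super().__init__()
        # s_i = log(sigma_i^2), initialised to 0, i.e. sigma_i = 1
        self.log_var_reg = nn.Parameter(torch.zeros(1))
        self.log_var_cls = nn.Parameter(torch.zeros(1))

    def forward(self, reg_pred, reg_target, cls_logits, cls_target):
        l1 = F.mse_loss(reg_pred, reg_target)         # Euclidean loss L1(W), mean-reduced
        l2 = F.cross_entropy(cls_logits, cls_target)  # -log Softmax, L2(W)
        loss = (0.5 * torch.exp(-self.log_var_reg) * l1 + 0.5 * self.log_var_reg
                + 0.5 * torch.exp(-self.log_var_cls) * l2 + 0.5 * self.log_var_cls)
        return loss
```

The two scalars are optimized together with the network weights, e.g. `torch.optim.Adam(list(model.parameters()) + list(criterion.parameters()))`.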

Hence, we have mathematically derived the multi-task loss using a task uncertainty for each task. With this loss formulation for classification and regression, the scheme can be extended to any combination of task losses, as sketched below.
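As an illustration of that extension, a hypothetical generalisation to $N$ tasks (one learnable $\log \sigma_i^2$ per task; the names are assumptions):

```python
import torch

def multi_task_loss(task_losses, log_vars):
    """task_losses: list of per-task losses L_i(W); log_vars: list of learnable log(sigma_i^2)."""
    total = 0.0
    for loss_i, log_var_i in zip(task_losses, log_vars):
        total = total + 0.5 * torch.exp(-log_var_i) * loss_i + 0.5 * log_var_i
    return total
```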

Insights/Discussions

  • Using uncertainty to weigh the losses lets us handle the different dimensions and units of the different tasks in a multi-task learning problem.

  • The uncertainties are learnable parameters and depend not on the input but on the task itself.

  • Intuitively, if a task's output is more spread out, i.e. has larger magnitudes/units, then $\sigma$ will be high, down-weighting that loss. Using $\sigma$ essentially normalizes each loss and removes the units from the equation.
