Introducing LCA: Loss Change Allocation for Neural Network Training

10 September 2019 / Global
Figure 1. A toy example of a loss surface (a) depicts the LCA of two parameters. One parameter (b) moves but does not affect the loss, and another (c) has negative LCA since its motion caused the loss to decrease.
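
To make the quantity in Figure 1 concrete: LCA credits each parameter with (approximately) the gradient of the loss with respect to that parameter times how far the parameter actually moved during the update, so negative values mean the motion helped. Below is a minimal PyTorch sketch of this first-order version; the paper uses a more careful path-based approximation, and the function name `lca_step` and its arguments are illustrative assumptions, not taken from the authors' implementation.

```python
import torch

def lca_step(model, loss_fn, data, target, optimizer):
    """Run one optimizer step and return a first-order LCA estimate per
    parameter tensor: grad_i(theta_t) * (theta_{t+1,i} - theta_{t,i}).
    Negative entries correspond to parameters whose motion decreased the loss."""
    # Snapshot parameters before the update (theta_t).
    before = [p.detach().clone() for p in model.parameters()]

    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    # Gradients evaluated at theta_t (zeros for parameters without gradients).
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in model.parameters()]
    optimizer.step()

    # Allocate the (approximate) loss change to each individual parameter.
    lca = [g * (p.detach() - b)
           for g, b, p in zip(grads, before, model.parameters())]
    return loss.item(), lca
```
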
Figure 2. A histogram showing the distribution of all LCA elements (all parameters at all iterations) reveals that barely more than half are negative (helping). The same histogram is shown in log scale (left) to see the tails of the distribution and regular scale (right) for a clearer visualization of the negative/positive ratio.
Figure 3. In our research, we found that (a) the percent of parameters helping is near 50 percent at all iterations and (b) all parameters help around half the time.
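
The helping fractions in Figure 3 can be read off once per-parameter, per-iteration LCA values have been collected. A small sketch, assuming those values are stacked into a single tensor (the name `lca_matrix` and its layout are our assumptions):

```python
import torch

def helping_fractions(lca_matrix: torch.Tensor):
    """lca_matrix: shape [num_iterations, num_parameters], one LCA value per
    parameter per iteration. Negative entries are 'helping' movements."""
    helping = (lca_matrix < 0).float()
    per_iteration = helping.mean(dim=1)  # cf. Figure 3a: hovers near 0.5
    per_parameter = helping.mean(dim=0)  # cf. Figure 3b: each parameter helps about half the time
    return per_iteration, per_parameter
```
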
Figure 4. Two parameters from the last layer of ResNet are shown: the one that hurt the most (top) and the one that helped the most (bottom) over the iterations shown (net LCA of +3.41e-3 and -3.03e-3, respectively). The weight (orange) and gradient (blue) trajectories both oscillate, causing LCA (green and red bars) to alternate between helping and hurting.
Figure 5. We sum LCA over all parameters within each layer for (left) FC and (right) LeNet. Different layers learn different amounts, and the differences in LCA per layer can mostly be explained by the number of parameters in the layer.
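
The per-layer totals shown in Figures 5 and 6 are simply sums of per-parameter LCA grouped by layer. A rough sketch, assuming a list of per-parameter LCA tensors aligned with `model.parameters()` (as in the `lca_step` sketch above):

```python
from collections import defaultdict

def lca_per_layer(model, lca_tensors):
    """Sum LCA over all parameter tensors that belong to the same layer.
    `lca_tensors` must be aligned with model.parameters()."""
    totals = defaultdict(float)
    for (name, _), values in zip(model.named_parameters(), lca_tensors):
        layer = name.rsplit(".", 1)[0]  # e.g. "fc1.weight" -> "fc1"
        totals[layer] += values.sum().item()
    return dict(totals)
```
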
Figure 6. ResNet reveals a different pattern: the first and last layers have positive LCA, meaning that their movements actually increased the loss over the course of training. This is surprising because the network as a whole learns, so we would not expect the LCA summed over such a large group of parameters to be consistently positive.
Figure 7. We show LCA per layer for a ResNet for a regular training scenario (solid bars) and a scenario in which we freeze the last layer at initialization (hatched bars), averaged over 10 runs each. By freezing the last layer, we prevent it from hurting. Though the other layers do not help as much (LCA is less negative), the change in the last layer’s LCA more than compensates, resulting in a lower overall loss (right).
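
The freezing intervention amounts to excluding the last layer's parameters from optimization at initialization. One way to set this up, sketched with torchvision's resnet18 as a stand-in for the paper's ResNet (the `fc` attribute name and the hyperparameters are illustrative assumptions):

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)

# Freeze the final layer at initialization so it never moves (its LCA stays zero).
for p in model.fc.parameters():
    p.requires_grad_(False)

# Train only the remaining parameters.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.1, momentum=0.9,
)
```
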
Figure 8. We train a ResNet with varied momentum for the last layer and plot total LCA per layer (first ten layers omitted for better visibility). As we decrease the last layer’s momentum, the gradient information driving learning becomes less delayed relative to that of other layers, and the last layer’s LCA pulls ahead at the expense of other layers.
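
Giving the last layer its own momentum can be expressed with optimizer parameter groups. Again a sketch with torchvision's resnet18 as a stand-in and illustrative hyperparameters:

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)

last_layer_ids = {id(p) for p in model.fc.parameters()}
other_params = [p for p in model.parameters() if id(p) not in last_layer_ids]

# Give the last layer its own (lower) momentum via a separate parameter group.
optimizer = torch.optim.SGD(
    [
        {"params": other_params, "momentum": 0.9},
        {"params": list(model.fc.parameters()), "momentum": 0.0},
    ],
    lr=0.1,
)
```
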
Figure 9. We visualize “peak moments of learning” by layer and class for MNIST-FC, with each dot representing a peak in LCA for a particular class and layer; the three layers in this network appear as three stacked dots. When the learning of all three layers lines up on the same iteration, we highlight the dots in red.
Janice Lan

Janice Lan is a research scientist with Uber AI.

Rosanne Liu

Rosanne is a senior research scientist and a founding member of Uber AI. She obtained her PhD in Computer Science at Northwestern University, where she used neural networks to help discover novel materials. She is currently working on the multiple fronts where machine learning and neural networks remain mysterious. She attempts to write in her spare time.

Hattie Zhou

Hattie Zhou is a data scientist with Uber's Marketing Analytics team.

Jason Yosinski

Jason Yosinski is a former founding member of Uber AI Labs and formerly led the Deep Collective research group.

Posted by Janice Lan, Rosanne Liu, Hattie Zhou, Jason Yosinski