Quantized Training with Deep Networks
How to cut neural net training time in half with minimal effort…
Quantization at a Glance
Many approaches exist for reducing the overhead of neural network training, but one of the most promising is low-precision (quantized) training. The approach is simple: reduce the number of bits used to represent activations and gradients within the network throughout training. These low-precision representations, as depicted in the figure above, make common operations within the network's forward and backward passes faster, yielding improvements in training time and energy efficiency.
Although the idea behind low-precision training is simple, implementing this idea effectively is more difficult due to several issues one may encounter:
Sensitivity of gradient representations to lower precision
Numerical instability in batch normalization modules
Setting the level of precision too low to achieve good performance
Despite these issues, recent research has found that neural network training is surprisingly robust to low-precision activation and gradient representations, enabling significant cost savings at the level of the bit representations themselves. Beyond simply leveraging low-precision representations, however, existing approaches also explore dynamic adaptation of network precision during training (e.g., cyclic or input-dependent precision), yielding even greater improvements in training efficiency.
Throughout this overview, I will cover existing approaches to quantized training, the context around them, and how to correctly handle problems, like those mentioned above, that one may encounter. As I go through these topics, I will emphasize the aspects of quantized training that are most useful to practitioners, revealing how one may leverage such techniques to significantly reduce neural network training time with minimal effort.
Background Information
Here, I’ll provide all of the background information needed to fully understand quantized training for neural networks. First, I’ll briefly outline different floating point representations that may be used within quantized training. Then, I will explain the generic approach to quantization within neural networks, which can be modified to form different quantized training variants with various properties.
Floating Point Representations
Although a complete overview of floating point representations is beyond the scope of this overview, having a basic grasp of how numbers are represented within a computer is important for understanding low-precision training techniques. Numerous, in-depth discussions of this topic can be found online.
To understand quantized training, we must first understand how floating point numbers are represented in deep learning packages like PyTorch, as this representation will be used for neural network training. Such packages use 32-bit floating point representations, as depicted within the figure below.
Because computer representations are binary, they are an approximation of the actual, full-precision number with some (hopefully small) error. Such error is controlled by the representation's precision: increasing (decreasing) the number of bits within the representation provides a more (less) exact approximation. For example, floating point numbers can also be represented with 64 bits or 16 bits (i.e., "Double" or "Half" representations) to provide different levels of precision.
Put simply, modifying the precision of a floating point representation is similar to allowing more/fewer decimal places to be used in representing a number. For example, in trying to represent the number pi, one can claim that pi is equal to 3.14, though 3.14159265 is a more accurate/precise representation.
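To make this concrete, the snippet below (a minimal sketch using PyTorch tensor dtypes) stores pi at three common precisions and prints the resulting approximations and their errors:

```python
import math
import torch

# Store pi at different floating point precisions and inspect the error.
for dtype in (torch.float64, torch.float32, torch.float16):
    approx = torch.tensor(math.pi, dtype=dtype).item()
    print(f"{str(dtype):15s} {approx!r}  error={abs(approx - math.pi):.2e}")

# Fewer bits yield a coarser approximation, e.g. float16 stores pi as 3.140625.
```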
The Benefit of Low Precision
Given that floating point numbers can be represented with varying levels of precision, one may begin to wonder why lower-precision representations are useful. Wouldn’t we want to always use the most accurate representation? Well, it depends on what our goal is…
Consider, for example, a scientific computing application. Oftentimes, precision is pivotal in these scenarios—small errors in numerical precision can drastically impact results. As such, numbers should be represented with the highest level of precision that is possible in these cases (e.g., 64-bit double representations).
In neural network applications, lower-precision representations (e.g., 16-bit [9]) can be used without noticeable performance deterioration. Additionally, common operations within neural network training (e.g., vector/matrix multiplication and addition) are much faster with lower-precision inputs, providing significant efficiency benefits (see the timing sketch after this list). This basic idea is the crux of quantized training. Ideally, we want to find ways of reducing precision such that:
No deterioration in network performance is observed
Training overhead (e.g., time, energy, computation, latency) is reduced
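As a rough illustration of the speed benefit, one can time a large matrix multiplication at both precisions. This is only a sketch: it assumes a CUDA GPU, and the exact speedup depends heavily on hardware support for half-precision math (e.g., tensor cores).

```python
import time
import torch

def avg_matmul_time(dtype, n=4096, iters=50):
    # Time n x n matrix multiplications at the given precision on the GPU.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    print("float32:", avg_matmul_time(torch.float32))
    print("float16:", avg_matmul_time(torch.float16))  # typically much faster on tensor cores
```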
Techniques for Improved Neural Network Efficiency
Although low-precision training is a popular method for improving neural network training and energy efficiency, numerous alternative methodologies exist. For example, many approaches have been proposed for reducing the size of a neural network via pruning (interested readers can see my previous overview of this topic), weight sharing, low-rank approximation, or even weight quantization [10, 11, 12].
Though such approaches can greatly reduce the number of parameters within a neural network, many early works on neural network efficiency focused on evaluation/inference. This has limited impact, given that a training iteration requires roughly three times more computation than inference (due to the backward pass computing gradients in addition to the forward pass). Later work therefore began to quantize network parameters and intermediate activations during the training phase as well, resulting in more significant cost and energy savings [9, 13].
A Global View of Quantization
At this point, we should understand that quantized training leverages low-precision floating point representations to reduce training overhead. However, it is not immediately clear where/how such quantization is applied within the neural network. To make this more clear, I have schematically illustrated the components of a backward and forward pass for a single neural network layer within the figure above. For those who are unfamiliar with back propagation, I encourage you to supplement this illustration with one of the many high-quality descriptions online.
As can be seen, there are two main components of the forward/backward pass to which quantization can be applied: the activations and the gradients. Namely, activations can be stored using a low-precision representation to make the forward pass less expensive, while the layer gradient can be quantized to make both the weight gradient computation and the propagation of the gradient to the previous layer less expensive. Notice that the backward pass requires two separate operations at each layer (twice the computation of the forward pass), revealing why quantization during training, as opposed to inference, which only performs forward propagation, is so beneficial.
It should be noted that different levels of quantization may be adopted in the forward and backward pass (e.g., 4-bit forward pass and 6-bit backward pass). The level of quantization in the backward pass is usually less aggressive relative to the forward pass, as gradients are typically more sensitive to quantization.
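To make this concrete, here is a minimal sketch of uniform "fake" quantization in PyTorch, which rounds a tensor to a configurable number of levels and maps it back to floating point. Real low-precision training relies on hardware kernels, but this style of simulation is commonly used to study quantization effects; the bit widths below are illustrative.

```python
import torch

def fake_quantize(x, num_bits):
    # Uniformly quantize x to 2**num_bits levels over its observed range,
    # then map back to floating point (simulates low-precision storage).
    qmax = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.round((x - x_min) / scale).clamp(0, qmax)
    return q * scale + x_min

activation = torch.randn(8, 16)
act_4bit = fake_quantize(activation, num_bits=4)   # forward pass representation
grad = torch.randn(8, 16)
grad_6bit = fake_quantize(grad, num_bits=6)        # less aggressive for gradients
```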
Static vs. Dynamic Quantization Schemes
Approaches for quantized training in neural networks can be roughly divided into two categories: static and dynamic schemes. Early work on quantization adopted a static approach, where a fixed number of bits is chosen for network gradients and activations and kept constant throughout the entire training process. Such approaches are quite popular because they are supported by modern mixed-precision deep learning tools like Apex, which can speed up network training with minimal code changes.
Despite the popularity of static approaches, more recent work has explored dynamic low-precision training, which varies network precision along the training trajectory (i.e., the precision used for activations/gradients changes as training progresses). These methods are typically applied on top of a static quantization scheme, such that the level of precision is dynamically varied between a lower and upper bound on precision. This upper bound matches the precision used by the static method [2, 3]. Thus, cost savings are achieved by lowering precision beyond static levels during certain phases of network training.
Publications
Now that we have a basic understanding of quantized training and relevant background, I will overview some useful papers on the topic. I will begin with a high-performing static quantization technique that significantly reduced the precision with which neural networks could be effectively trained. Then, I will overview two recently-proposed dynamic quantization methods. The first proposes a unique approach that both (i) adapts quantization dynamically along the training trajectory and (ii) trains a supplemental neural network to adapt precision in an input-dependent manner. The other empirically analyzes different cyclical, dynamic schemes for adapting neural network precision during training and provides numerous practical insights into quantized training.
Scalable Methods for 8-bit Training of Neural Networks [1]
Main Idea. Previous work that attempted quantization during training beyond 16-bit precision encountered significant degradations in network performance. Within this work, the authors explore modifications to the quantization procedure and general network architecture that enable 8-bit quantization of network gradients, activations, and weights. Surprisingly, they arrive at a static, low-precision training procedure, called SBM, that achieves this goal, showing that neural network training is more robust to quantization than previously thought.
The derived training procedure performs all operations, excluding parameter updates and computation of weight gradients, using 8-bit precision. To enable training at such low precision, the most sensitive components of neural network training, batch normalization and back propagation, must be addressed. In particular, the authors reduce the sensitivity of these components to quantization by:
Developing a quantization-friendly variant of batch normalization—called Range Batch Normalization—that is less prone to numerical instability.
Using Gradient Bifurcation to maintain two copies of network gradients at each layer so that weight gradients, which are sensitive to the use of lower precision, can be computed with higher, 16-bit precision.
Together, such modifications are shown to enable significantly lower precisions to be explored within quantized training, providing massive savings in neural network training cost, latency, and energy usage.
Methodology. As mentioned above, the authors identify that, in performing low-precision training, batch normalization and back propagation operations are the most problematic components of the neural network training process and architecture. To understand the reasoning behind this, we must examine each of these components separately.
Batch normalization (formulated above; see here for a more comprehensive description) transforms each component of its input by:
Subtracting the mean of the components
Dividing by the standard deviation of the components
Computing this standard deviation requires a sum of squares, which can lead to numerical instability and arithmetic overflows when dealing with large values at lower precision. To avoid such instability, the authors propose range batch normalization, outlined in the figure above, which instead normalizes by the range of the input distribution (scaled by a dimension-dependent constant). This quantity is more tolerant to quantization—due to not involving sums of large numbers that elicit numerical instability—and provably approximates the standard deviation under a Gaussian data assumption (see Section 3 of [1] for more details).
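The sketch below is a simplified, unoptimized version of this idea for (batch, features) inputs. It is not the authors' implementation, and the exact scaling constant used in [1] may differ from the Gaussian-motivated choice shown here.

```python
import math
import torch
import torch.nn as nn

class RangeBatchNorm1d(nn.Module):
    # Simplified range batch normalization for (batch, features) inputs:
    # normalize by a scaled per-feature range instead of the standard deviation.
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        n = x.size(0)
        centered = x - x.mean(dim=0)
        # For Gaussian data, the range scaled by a batch-size-dependent constant
        # approximates the standard deviation (see Section 3 of [1] for details).
        c_n = 1.0 / math.sqrt(2.0 * math.log(n))
        value_range = centered.max(dim=0).values - centered.min(dim=0).values
        return self.gamma * centered / (c_n * value_range + self.eps) + self.beta
```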
The process of back propagation performed on a single network layer is depicted within the background information section. As can be seen, two operations are performed at each layer:
Computing the previous layer gradient
Computing the weight gradient
The first step of this procedure is repeated sequentially for every layer within the network, making it a bottleneck to the completion of back propagation. Thus, this step must be sped up (e.g., using lower-precision operations) to achieve improved efficiency. In contrast, weight gradients are computed separately for each layer and have no sequential dependence, meaning that this step can be performed at higher precision (and in parallel) without deteriorating network performance.
With this in mind, the authors introduce a gradient bifurcation procedure that maintains separate copies of the layer gradient at both 8 and 16-bits. Then, layer gradients can be computed using the lower-precision representation to speed up sequential computation, while weight gradients can be computed separately with higher precision. This better-informed approach for quantized back propagation, depicted in the figure below, is pivotal to enabling training at lower precisions.
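To make the idea concrete, here is a rough simulation of gradient bifurcation for a single linear layer using a custom autograd function. It only mimics the numerical effect of the approach (a real implementation would use native 8- and 16-bit kernels), and the _quantize helper is a crude stand-in for the paper's quantization scheme.

```python
import torch

def _quantize(t, num_bits):
    # Crude uniform fake quantization used to simulate a low-precision tensor.
    scale = t.abs().max().clamp(min=1e-8) / (2 ** (num_bits - 1) - 1)
    return torch.round(t / scale) * scale

class BifurcatedLinearFn(torch.autograd.Function):
    # Sketch of gradient bifurcation for a linear layer: the gradient propagated
    # to the previous layer uses a simulated 8-bit copy, while the weight
    # gradient is computed from a higher-precision (16-bit) copy.
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        grad_low = _quantize(grad_output, num_bits=8)   # propagated downstream
        grad_high = grad_output.half().float()          # kept for the weight update
        grad_input = grad_low @ weight                  # gradient for previous layer
        grad_weight = grad_high.t() @ x                 # weight gradient
        return grad_input, grad_weight
```

Replacing a layer's matrix multiplication with BifurcatedLinearFn.apply(x, weight) simulates the effect of maintaining both gradient copies during back propagation.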
Findings.
Vanilla batch normalization can be replaced with range batch normalization with no noticeable performance differences, even in large-scale experiments on ImageNet.
Range batch normalization is shown to maintain higher performance levels relative to vanilla batch normalization when lower precision is adopted.
Combining range batch normalization with the proposed approach for quantized back propagation enables training at surprisingly low precision without deterioration in network performance.
FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [2]
Main Idea. Previous work in quantized training typically adopts a static approach that utilizes fixed precision throughout the entire training process [1, 4, 5]. Moving beyond static methods, FracTrain explores dynamic adaptation of precision during training. In particular, the proposed approach adapts precision on two fronts:
Temporally by adopting different levels of precision during different stages of the training process
Spatially by learning to adapt the level of precision used in each layer/block of the network based on properties of the input
Such temporal and spatial strategies for dynamically adapting precision are referred to as Progressive Fractional Quantization (PFQ) and Dynamic Fractional Quantization (DFQ), respectively. To minimize training costs, these strategies are applied in tandem, forming the FracTrain policy.
In practice, FracTrain is applied on top of static quantization schemes, meaning that the highest precision level used during training matches that of the static precision baseline. Used in this manner, FracTrain drastically reduces the computation, energy, and latency costs accrued during training. Additionally, the resulting models perform comparably to those obtained via static-precision baselines; no noticeable impact on network performance is observed.
Methodology. As outlined above, the FracTrain methodology comprises two components, PFQ and DFQ, that are applied in tandem throughout the training process. These methods are applied on top of a static, low-precision training methodology. Thus, the highest precision used during training matches that of this static baseline, and FracTrain further reduces precision beyond this point.
The idea behind PFQ is simple—use lower precision during the early stages of training and slowly increase precision as training progresses. Such an approach is inspired by the fact that features learned during the early phase of network training are robust to noise (e.g., from quantization) and higher learning rates [6, 7, 8]. To apply PFQ in practice, one must pick a low, initial precision (e.g., 4-bit forward and 6-bit backward precision). Then, this precision is increased when the network’s loss does not improve for several epochs, eventually reaching the precision of the static baseline during the later part of training.
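A minimal sketch of a PFQ-style controller is shown below; the precision ladder, patience, and improvement threshold are illustrative choices rather than the exact values used in [2].

```python
class ProgressiveFractionalQuantization:
    # Sketch of a PFQ-style controller: start at low precision and step the
    # (forward, backward) bit widths up whenever the loss stops improving.
    def __init__(self, schedule=((4, 6), (6, 8), (8, 12), (8, 16)), patience=5):
        self.schedule, self.patience = schedule, patience
        self.stage, self.best_loss, self.bad_epochs = 0, float("inf"), 0

    def step(self, epoch_loss):
        if epoch_loss < self.best_loss - 1e-4:
            self.best_loss, self.bad_epochs = epoch_loss, 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs >= self.patience and self.stage < len(self.schedule) - 1:
            self.stage, self.bad_epochs = self.stage + 1, 0
        return self.schedule[self.stage]   # (forward_bits, backward_bits)
```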
Unfortunately, DFQ is not quite as straightforward as PFQ. To modify precision in an input-dependent manner throughout the network, the authors leverage a separate, learned recurrent neural network per layer/block that:
Takes the same input as the network itself
Predicts the proper precision level to be used
This supplemental network, depicted within the figure above, can be made lightweight and is trained by incorporating a regularization term that captures training cost into the network's objective function, allowing a tradeoff between network performance and training efficiency to be achieved.
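The sketch below illustrates the general idea with a deliberately simplified controller (a linear scorer over flattened block inputs and a differentiable cost proxy). The actual DFQ controller in [2] is a recurrent network trained jointly with the main model, so treat this purely as a conceptual illustration.

```python
import torch
import torch.nn as nn

class PrecisionController(nn.Module):
    # Simplified per-block controller that scores candidate bit widths based on
    # the block's (flattened) input; a linear scorer stands in for the
    # lightweight recurrent controller described in [2].
    def __init__(self, in_features, candidate_bits=(3, 4, 6, 8)):
        super().__init__()
        self.register_buffer("candidate_bits",
                             torch.tensor(candidate_bits, dtype=torch.float32))
        self.scorer = nn.Linear(in_features, len(candidate_bits))

    def forward(self, x):
        # One precision decision per batch: score candidates and also return a
        # differentiable compute-cost proxy for the training objective.
        logits = self.scorer(x.flatten(1)).mean(dim=0)
        probs = torch.softmax(logits, dim=0)
        bits = int(self.candidate_bits[torch.argmax(probs)])
        expected_cost = (probs * self.candidate_bits).sum()
        return bits, expected_cost

# During training, a penalty such as `loss = task_loss + lam * sum(costs)`
# encourages the controllers to pick cheaper precisions when the input allows it.
```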
Findings.
When applied together, PFQ and DFQ (or FracTrain) yield significant improvements in network training cost, energy usage, and latency relative to static, low-precision training methodologies, while maintaining similar levels of performance. Such results are obtained in large-scale experiments with modern convolutional neural network (e.g., various ResNets and MobileNetV2) and transformer architectures.
Applied in isolation, both PFQ and DFQ are found to improve training efficiency relative to static, low-precision training. Interestingly, DFQ is found to be particularly effective at very low levels of precision, where it maintains impressive accuracy levels despite significant degradations in the performance of static baseline methodologies.
CPT: Efficient Deep Neural Network Training via Cyclic Precision [3]
Main Idea. Somewhat similarly to FracTrain, the cyclic precision training (CPT) strategy proposed in [3] further explores possibilities for temporal quantization in the training process. In particular, a cyclical schedule is adopted to vary network precision between a minimum and maximum value throughout the training process. Interestingly, the authors find that cyclically varying precision throughout training can yield significant computational savings relative to static quantization procedures and even improve network generalization in many cases.
To motivate the use of cyclic precision schedules, the authors draw a comparison between network precision and the learning rate. Namely, utilizing lower precision during training provides noise that aids the network in “exploring” the loss landscape, similarly to using a large value for the learning rate. In contrast, higher precision levels allow the model to converge to a final solution, similarly to using a low learning rate.
The authors validate this connection between the learning rate and precision empirically by demonstrating that varying these two hyperparameters has a similar impact on neural network training. The CPT methodology is inspired by this analogy, as it cyclically varies the precision according to a cosine schedule, the same type of hyperparameter schedule that is often used for the learning rate in practice. In fact, cyclical schedules for the learning rate are so common that they are specifically implemented within PyTorch.
Methodology. The methodology behind CPT is quite simple. First, one must choose a minimum and maximum precision to use during training. Then, the precision level is varied between this minimum and maximum level following a cyclical, cosine schedule throughout the training process, as depicted within the figure above. Similarly to FracTrain, such a procedure is applied on top of a static, low-precision training methodology, meaning that the highest precision level used within the cyclical schedule matches that of the static precision baseline.
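A minimal sketch of one way to implement such a schedule is shown below; the exact cycle shape and hyperparameters used in [3] may differ.

```python
import math

def cyclic_precision(step, cycle_length, min_bits, max_bits):
    # Vary the bit width between min_bits and max_bits with a cosine shape,
    # mirroring cyclical/cosine learning rate schedules.
    phase = (step % cycle_length) / cycle_length
    bits = min_bits + 0.5 * (max_bits - min_bits) * (1 - math.cos(math.pi * phase))
    return int(round(bits))

# e.g., 160 epochs split into 20-epoch cycles, precision cycling between 3 and 8 bits
schedule = [cyclic_precision(e, cycle_length=20, min_bits=3, max_bits=8) for e in range(160)]
```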
Although the maximum precision used during training can be adopted from any static, low-precision baseline methodology, one must also determine the lower bound of precision to be used by CPT. To do this empirically, the authors propose a simple Precision Range Test, which operates by:
Starting from the lowest possible precision (e.g., 2 bits)
Gradually increasing the precision and monitoring the training accuracy
Setting the precision lower bound as the first precision that enables an increase in training accuracy beyond a pre-set threshold
This precision range test can be performed in the first cycle of CPT—it does not need to be executed separately from the network training process itself. Then, future cycles simply adopt the lower bound in precision determined from the range test performed in the first cycle. Only a few iterations of training are lost in performing the precision range test, making it a low cost and effective methodology for determining the ideal range of precisions to be used in CPT.
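A rough sketch of this procedure is shown below, where run_few_iterations is a hypothetical callback that trains briefly at a given precision and returns the resulting training accuracy; the threshold and maximum bit width are illustrative.

```python
def precision_range_test(run_few_iterations, max_bits=8, threshold=0.02):
    # Sketch of the precision range test: starting from the lowest precision,
    # briefly train at each bit width and return the first one whose training
    # accuracy improves by more than `threshold` over the previous precision.
    prev_acc = 0.0
    for bits in range(2, max_bits + 1):
        acc = run_few_iterations(bits)
        if acc - prev_acc > threshold:
            return bits          # lower bound for the cyclic schedule
        prev_acc = acc
    return max_bits
```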
Findings.
Precision levels are found to impact the neural network training process similarly to the learning rate. Using low precision is a pseudo-substitute for a high learning rate and vice versa, thus motivating the use of cyclical schedules for precision that mimic those used for the learning rate.
Utilizing dynamic schedules for precision during training is found to aid in network generalization. Networks trained using CPT tend to generalize better than those trained with static, low-precision baselines.
CPT significantly reduces computational cost, energy usage, and latency of neural network training relative to static precision baselines, while matching or improving upon their performance, even in large-scale experiments.
More compact networks (e.g., MobileNetV2 vs. ResNet) are found to be less robust to lower quantization levels, meaning that the lower bound of precision used in CPT must be raised slightly.
What can we use in practice?
Despite the impressive results achieved with dynamic quantization schemes, such approaches cannot be used in practice on current hardware. In particular, current NVIDIA GPUs only support mixed-precision training with half precision (i.e., the 16-bit floating point format). However, this reduction in precision can still produce up to a 3X speedup in model training time for many common deep learning workflows. Plus, current languages and software packages (e.g., PyTorch, TensorFlow, Julia, etc.) are beginning to build support for arbitrary low-precision arithmetic due to the benefits demonstrated by research in quantized training.
Although more sophisticated quantization techniques are not supported by current hardware, using half precision is highly useful for deep learning practitioners, as it can significantly reduce model training time without noticeable drops in performance. Additionally, training in half precision is abstracted by tools like Apex, allowing practitioners to achieve significant reductions in training time without extensive implementation changes. See here for an example.
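For example, PyTorch's built-in torch.cuda.amp utilities (the native successor to Apex's mixed-precision support) require only a few extra lines; the sketch below uses a dummy model and data loader for illustration.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

# Dummy data loader standing in for a real one.
loader = [(torch.randn(64, 512), torch.randint(0, 10, (64,))) for _ in range(10)]

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with autocast():                     # forward pass runs in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()        # loss scaling avoids fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```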
Given that many of the quantization approaches described within this overview are not yet supported by modern hardware, one may wonder how we know that such methods actually benefit training efficiency. Many metrics and approaches exist for quantifying this benefit, such as:
Computing the effective number of FLOPS used during training, as described in [4] (also referred to as MACs [2] or GBitOPs [3])
Implementing low-precision training on an FPGA and measuring actual performance metrics (e.g., energy usage, latency, etc.)
Using a simulator to measure performance metrics (e.g., BitFusion [14])
By leveraging these techniques, existing work in quantized training can demonstrate that the proposed approaches provide a significant benefit to training efficiency despite limited support in current hardware.
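As a toy illustration of the first metric in the list above, one can weight each multiply-accumulate operation by the bit widths of its operands; the exact definitions of MACs and GBitOPs vary across [2, 3, 4], so this is only a rough proxy, and the MAC count below is hypothetical.

```python
def gbitops(macs, act_bits, weight_bits):
    # Weight every multiply-accumulate by the product of its operand bit widths,
    # a common cost proxy for quantized networks (exact definitions vary by paper).
    return macs * act_bits * weight_bits / 1e9

macs_per_step = 4e9                       # hypothetical model, ~4 GMACs per forward pass
print(gbitops(macs_per_step, 32, 32))     # full-precision baseline
print(gbitops(macs_per_step, 8, 8))       # 8-bit training: 16x fewer bit operations
```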
Takeaways
The main takeaways from this overview can be simply stated as follows:
Neural network training is robust to the use of low precision representations for network activations and gradients
Leveraging low precision during training can yield significant improvements in training time, energy efficiency and computational cost
Such findings are quite fundamental in nature. Even the more complex approaches for manipulating neural network precision simply extend these fundamental findings into smarter, more intricate methodologies for lowering precision further. Thus, quantized training is a simple-to-understand approach with incredible potential and broad practical impact.
In learning about low-precision training, several fundamental deep learning concepts have arisen:
Forward/Back Propagation (and where quantization can occur)
Batch Normalization (and its sensitivity to numerical instability)
Hyperparameter Schedules (e.g., cyclical learning rates/precision)
Sensitivity of Gradients to Noise (e.g., from lower precision)
Each of these concepts arises frequently in the deep learning literature and is important to understand for any deep learning practitioner who wants a more developed perspective. Most of these ideas are easy to understand. However, for those interested in learning more about the difficulty of quantizing gradients, I recommend reading Sections 4 through 6.1 of [1].
Conclusion
Thanks so much for reading this article—I really appreciate your support and interest in my content. If you liked it, please subscribe to my Deep (Learning) Focus newsletter, where I contextualize, explain, and examine a single, relevant topic in deep learning research every two weeks. Feel free to follow me on medium or explore the rest of my website, which has links to my social media and other content. If you have any recommendations/feedback, contact me directly or leave a comment on this post!
Bibliography
[1] Banner, Ron, et al. "Scalable methods for 8-bit training of neural networks." Advances in neural information processing systems 31 (2018).
[2] Fu, Yonggan, et al. "Fractrain: Fractionally squeezing bit savings both temporally and spatially for efficient dnn training." Advances in Neural Information Processing Systems 33 (2020): 12127-12139.
[3] Fu, Yonggan, et al. "CPT: Efficient deep neural network training via cyclic precision." arXiv preprint arXiv:2101.09868 (2021).
[4] Zhou, Shuchang, et al. "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
[5] Yang, Yukuan, et al. "Training high-performance and large-scale deep neural networks with full 8-bit integers." Neural Networks 125 (2020): 70-82.
[6] Rahaman, Nasim, et al. "On the spectral bias of neural networks." International Conference on Machine Learning. PMLR, 2019.
[7] Achille, Alessandro, Matteo Rovere, and Stefano Soatto. "Critical learning periods in deep networks." International Conference on Learning Representations. 2018.
[8] Li, Yuanzhi, Colin Wei, and Tengyu Ma. "Towards explaining the regularization effect of initial large learning rate in training neural networks." Advances in Neural Information Processing Systems 32 (2019).
[9] Das, Dipankar, et al. "Mixed precision training of convolutional neural networks using integer operations." arXiv preprint arXiv:1802.00930 (2018).
[10] Chen, Wenlin, et al. "Compressing neural networks with the hashing trick." International conference on machine learning. PMLR, 2015.
[11] Jaderberg, Max, Andrea Vedaldi, and Andrew Zisserman. "Speeding up convolutional neural networks with low rank expansions." arXiv preprint arXiv:1405.3866 (2014).
[12] Ullrich, Karen, Edward Meeds, and Max Welling. "Soft weight-sharing for neural network compression." arXiv preprint arXiv:1702.04008 (2017).
[13] Gupta, Suyog, et al. "Deep learning with limited numerical precision." International conference on machine learning. PMLR, 2015.
[14] Sharma, Hardik, et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.