How good is it to train a large model with FP8? Microsoft: 64% faster and 42% less memory than BF16
Original source: Heart of the Machine
Large language models (LLMs) offer unprecedented language understanding and generation capabilities, but unlocking these capabilities requires enormous model sizes and computationally intensive training. In this context, and especially as models scale toward the superintelligence systems OpenAI has proposed, low-precision training is one of the most effective and important techniques, offering a smaller memory footprint, faster training, and lower communication overhead. Yet most current training frameworks, such as Megatron-LM, MetaSeq, and Colossal-AI, default to FP32 full precision or FP16/BF16 mixed precision when training LLMs.
But that still leaves performance on the table: with the release of the NVIDIA H100 GPU, FP8 is becoming the next-generation data type for low-precision representation. In theory, FP8 can deliver a 2x speedup over current FP16/BF16 mixed-precision training while cutting memory costs and communication costs by 50% to 75%.
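The memory figure follows directly from the storage arithmetic. The back-of-the-envelope sketch below (illustrative numbers, not results from the paper) compares bytes per value across formats; moving from FP16/BF16 to FP8 halves storage, and moving from FP32 to FP8 quarters it, which is where the 50%-75% range comes from.

```python
# Bytes per value for each data type and the savings FP8 offers over the
# higher-precision formats. Illustrative arithmetic only.
BYTES = {"FP32": 4, "BF16": 2, "FP8": 1}

n_params = 175e9  # parameter count of a GPT-175B-scale model (for illustration)

for dtype, nbytes in BYTES.items():
    gib = n_params * nbytes / 2**30
    print(f"{dtype}: {nbytes} byte(s)/value -> {gib:,.0f} GiB just for the weights")

print("FP8 vs BF16 saving:", 1 - BYTES["FP8"] / BYTES["BF16"])  # 0.5  -> 50% less
print("FP8 vs FP32 saving:", 1 - BYTES["FP8"] / BYTES["FP32"])  # 0.75 -> 75% less
```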
Despite this, support for FP8 training remains limited today. NVIDIA's Transformer Engine (TE) uses FP8 only for GEMM computations, so the end-to-end speedup and the memory and communication savings it delivers are limited.
But Microsoft's newly open-sourced FP8-LM mixed-precision framework goes much further: FP8-LM is heavily optimized to use the FP8 format throughout the forward and backward passes of training, greatly reducing the system's compute, memory, and communication overhead.
Experimental results show that when training the GPT-175B model on an H100 GPU platform, the FP8-LM mixed-precision training framework not only reduces real memory usage by 42% but also runs 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM) and 17% faster than NVIDIA's Transformer Engine. Moreover, on pre-training and multiple downstream tasks, models trained with the FP8-LM framework achieve results comparable to those trained with the current standard BF16 mixed-precision framework.
Given the same compute resources, the FP8-LM framework can painlessly increase the size of a trainable model by up to 2.5x. Some developers have been speculating on Twitter that if GPT-5 were trained with FP8, even with the same number of H100s, the model could be 2.5 times the size of GPT-4.
FP8-LM Implementation
Specifically, the team designed three optimization levels aimed at using FP8 to streamline mixed-precision and distributed training. These levels progressively integrate 8-bit collective communication, the optimizer, and distributed parallel training; the higher the optimization level, the more FP8 is used in LLM training.
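A minimal sketch of how such progressive levels might be expressed in a training configuration (the level names and fields below are illustrative assumptions, not FP8-LM's actual API):

```python
from dataclasses import dataclass
from enum import IntEnum

class FP8Level(IntEnum):
    """Illustrative optimization levels mirroring the progressive design
    described above (the names are hypothetical, not FP8-LM's real flags)."""
    O1 = 1  # FP8 gradients with 8-bit all-reduce collective communication
    O2 = 2  # O1 plus an FP8 mixed-precision optimizer
    O3 = 3  # O2 plus FP8-aware distributed parallel training

@dataclass
class FP8TrainingConfig:
    level: FP8Level = FP8Level.O1

    @property
    def use_fp8_optimizer(self) -> bool:
        return self.level >= FP8Level.O2

    @property
    def use_fp8_parallelism(self) -> bool:
        return self.level >= FP8Level.O3

cfg = FP8TrainingConfig(level=FP8Level.O3)
print(cfg.use_fp8_optimizer, cfg.use_fp8_parallelism)  # True True
```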
In addition, for large-scale training (e.g., GPT-175B on thousands of GPUs), the framework provides low-bit parallelism at FP8 precision, covering tensor, pipeline, and sequence parallelism, paving the way for the next generation of low-precision parallel training.
Tensor parallelism spreads the layers of a model across multiple devices, placing shards of the weight, gradient, and activation tensors on different GPUs.
To support FP8 in tensor parallelism, the Microsoft team converts the sharded weight and activation tensors to the FP8 format for the linear-layer computation, so that FP8 is used both for the forward computation and for the backward gradient collective communication.
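A rough sketch of the idea, with the FP8 cast simulated via per-tensor scaling (the `cast_to_fp8` helper, the E4M3 maximum of 448, and the sharding layout are assumptions for illustration; the real framework relies on hardware FP8 kernels):

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum representable magnitude of the E4M3 format

def cast_to_fp8(t: torch.Tensor):
    """Simulated FP8 cast with per-tensor scaling: returns the scaled tensor
    and its scaling factor. Real FP8 kernels would store the values in 8 bits."""
    scale = FP8_E4M3_MAX / t.abs().max().clamp(min=1e-12)
    return t * scale, scale

def column_parallel_linear_fp8(x: torch.Tensor, w_shard: torch.Tensor):
    """Each rank holds one column shard of the weight; the forward GEMM runs
    on FP8-cast inputs and weights (simulated here in higher precision)."""
    x_fp8, x_scale = cast_to_fp8(x)
    w_fp8, w_scale = cast_to_fp8(w_shard)
    y = (x_fp8 @ w_fp8.t()) / (x_scale * w_scale)  # un-scale the product
    return y  # each rank produces its own slice of the output features

x = torch.randn(8, 1024)          # (batch, in_features)
w_shard = torch.randn(256, 1024)  # this rank's shard of the (out, in) weight
print(column_parallel_linear_fp8(x, w_shard).shape)  # torch.Size([8, 256])
```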
Sequence parallelism, in contrast, slices the input sequence into multiple chunks and feeds the subsequences to different devices in order to save activation memory.
As shown in Figure 2, sequence parallelism and tensor parallelism are applied to different parts of a Transformer model to make full use of the available memory and improve training efficiency.
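The slicing itself is straightforward; a minimal sketch (the world size, rank handling, and tensor shapes are illustrative):

```python
import torch

def split_sequence(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Slice a (batch, seq_len, hidden) activation along the sequence
    dimension and keep only the chunk that belongs to this rank."""
    return torch.chunk(x, world_size, dim=1)[rank]

x = torch.randn(2, 4096, 1024)                        # (batch, seq_len, hidden)
print(split_sequence(x, world_size=8, rank=3).shape)  # torch.Size([2, 512, 1024])
```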
A complication arises when these states are sharded ZeRO-style: each FP8 tensor carries its own scaling factor, and splitting a tensor into sub-tensors scattered across devices makes that scaling factor awkward to track and apply. To solve this problem, the researchers implemented a new FP8 allocation scheme that distributes whole tensors across devices rather than splitting each tensor into multiple sub-tensors as the ZeRO approach does. The method allocates FP8 tensors greedily, as described in Algorithm 1.
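A minimal sketch of such a greedy whole-tensor allocation (my reading of the idea: sort tensors by size and always place the next one on the least-loaded device; the exact criterion in Algorithm 1 may differ):

```python
def greedy_allocate(tensor_sizes: dict[str, int], num_devices: int) -> dict[int, list[str]]:
    """Assign each tensor, as a whole, to one device while balancing total bytes.
    The per-tensor scaling factor travels with its tensor, so it is never split."""
    load = [0] * num_devices
    placement: dict[int, list[str]] = {d: [] for d in range(num_devices)}
    # Largest tensors first; always pick the currently least-loaded device.
    for name, size in sorted(tensor_sizes.items(), key=lambda kv: -kv[1]):
        device = min(range(num_devices), key=lambda d: load[d])
        placement[device].append(name)
        load[device] += size
    return placement

print(greedy_allocate({"w1": 400, "w2": 300, "w3": 300, "w4": 100}, num_devices=2))
# {0: ['w1', 'w4'], 1: ['w2', 'w3']}
```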
Precision Decoupling
Precision decoupling means decoupling the influence of data precision on parameters such as weights, gradients, and optimizer states, and assigning reduced precision to the components that are not precision-sensitive.
For precision decoupling, the team said they found a guiding principle: gradient statistics can tolerate lower precision, while the master weights require high precision.
More specifically, the first-order gradient moment can tolerate a higher quantization error and can be stored in low-precision FP8, while the second-order moment requires higher precision. This is because, with Adam, the direction of the gradient matters more than its magnitude during model updates: FP8 with tensor scaling preserves the distribution of the first-order moment nearly as well as a high-precision tensor, even though it introduces some loss of precision. The second-order moment is different: gradient values are usually small, so squaring them can cause data underflow, and preserving numerical accuracy therefore requires the higher 16-bit precision.
On the other hand, the team also found that keeping the master weights in high precision is crucial. The fundamental reason is that weight updates can become very large or very small during training; for the master weights, higher precision prevents information from being lost when the weights are updated, allowing more stable and accurate training.
In this implementation, there are two viable options for the master weights: use FP32 full precision, or use FP16 with tensor scaling. The advantage of FP16 with tensor scaling is that it saves memory without sacrificing accuracy, so the new framework defaults to storing the master weights in the optimizer as FP16 with tensor scaling. With this design, the FP8 mixed-precision optimizer needs 6 bytes of memory per parameter during training.
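A minimal sketch of an Adam-style update with decoupled precisions is shown below. The byte breakdown and the simulated casts are my assumptions, chosen only to be consistent with the 6-byte figure above (2 bytes of FP16 master weight, 1 byte of FP8 gradient, 1 byte of FP8 first moment, 2 bytes of FP16 second moment); the actual FP8-LM optimizer uses real 8-bit storage and hardware kernels.

```python
import torch

FP8_E4M3_MAX = 448.0  # assumed maximum magnitude of the E4M3 format

def fake_fp8(t: torch.Tensor):
    """Simulate an FP8-with-tensor-scaling cast: scale into the FP8 range.
    Real code would store the result in 8 bits; here we only keep the scale."""
    scale = FP8_E4M3_MAX / t.abs().max().clamp(min=1e-12)
    return t * scale, scale

class DecoupledAdam:
    """Adam-like update with decoupled precisions (illustrative):
       master weights  -> FP16 with tensor scaling (plain fp16 here)
       first moment m  -> FP8 with tensor scaling (simulated)
       second moment v -> FP16
    """
    def __init__(self, param: torch.Tensor, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        self.master = param.detach().to(torch.float16)          # FP16 master weights
        self.m = torch.zeros_like(param, dtype=torch.float16)   # stand-in for FP8 m
        self.m_scale = torch.tensor(1.0)
        self.v = torch.zeros_like(param, dtype=torch.float16)   # FP16 second moment
        self.lr, self.betas, self.eps = lr, betas, eps

    def step(self, grad: torch.Tensor) -> torch.Tensor:
        b1, b2 = self.betas
        m = self.m.float() / self.m_scale                 # de-scale the first moment
        m = b1 * m + (1 - b1) * grad.float()
        v = b2 * self.v.float() + (1 - b2) * grad.float() ** 2
        update = self.lr * m / (v.sqrt() + self.eps)
        self.master = (self.master.float() - update).to(torch.float16)
        m_scaled, self.m_scale = fake_fp8(m)              # re-quantize m (simulated)
        self.m, self.v = m_scaled.to(torch.float16), v.to(torch.float16)
        return self.master

opt = DecoupledAdam(torch.randn(1024))
print(opt.step(torch.randn(1024) * 1e-3).dtype)  # torch.float16
```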
Autoscaling
Autoscaling aims to keep gradient values within the representable range of the FP8 data format; it requires dynamically adjusting the tensor scaling factors, which reduces data underflow and overflow during all-reduce communication.
Specifically, the researchers introduced an auto-scaling factor μ that can change adaptively during training.
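A minimal sketch of the idea (the thresholds and update rule below are illustrative assumptions, not the exact rule from the paper): before an all-reduce, check how close the scaled gradient comes to the FP8 range and grow or shrink μ accordingly.

```python
import torch

FP8_E4M3_MAX = 448.0  # assumed maximum magnitude of the E4M3 format

def update_mu(grad: torch.Tensor, mu: float,
              headroom: float = 0.5, growth: float = 2.0) -> float:
    """Adjust the scaling factor mu so that grad * mu stays inside the FP8 range.
    If the scaled maximum would overflow, shrink mu; if it uses too little of the
    range (risking underflow of small values), grow it. Illustrative rule only."""
    scaled_max = (grad.abs().max() * mu).item()
    if scaled_max > FP8_E4M3_MAX:
        return mu / growth      # back off to avoid overflow (Inf/NaN) in all-reduce
    if scaled_max < headroom * FP8_E4M3_MAX:
        return mu * growth      # use more of the range to limit underflow
    return mu

mu = 1.0
grad = torch.randn(1024) * 1e-3   # typical small gradient values
for _ in range(20):
    mu = update_mu(grad, mu)
print(mu)  # mu has grown until grad * mu fills a healthy fraction of the FP8 range
```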
Experimental Results
To validate the newly proposed FP8 low-precision framework, the researchers used it to train GPT-style models, covering both pre-training and supervised fine-tuning (SFT). The experiments were conducted on Azure's latest NDv5 H100 supercomputing platform.
The experimental results show that the new FP8 method is effective: compared with the previously widespread BF16 mixed-precision training method, it reduces real memory usage by 27%-42% (for example, 27% for the GPT-7B model and 42% for the GPT-175B model) and cuts weight-gradient communication overhead by 63%-65%.
They also performed ablation experiments to verify the effectiveness of each component.
It is foreseeable that FP8 low-precision training will become part of the new infrastructure for developing large models.