How good is it to train a large model with FP8? Microsoft: 64% faster and 42% less memory than BF16
Original source: Heart of the Machine
Large language models (LLMs) offer unprecedented language understanding and generation capabilities, but unlocking these capabilities requires enormous model sizes and computationally intensive training. In this context, and especially as models scale toward the superintelligence systems OpenAI has proposed, low-precision training is one of the most effective and important techniques, offering a smaller memory footprint, faster training, and lower communication overhead. Yet most current training frameworks, such as Megatron-LM, MetaSeq, and Colossal-AI, default to FP32 full precision or FP16/BF16 mixed precision when training LLMs.
But that still leaves performance on the table: with the release of the NVIDIA H100 GPU, FP8 is becoming the next-generation data type for low-precision representation. In theory, FP8 can deliver a 2x speedup over current FP16/BF16 mixed-precision training while cutting memory costs and communication costs by 50% to 75%.
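The memory figure follows directly from the storage arithmetic. The back-of-the-envelope sketch below (illustrative numbers, not results from the paper) compares bytes per value across formats; moving from FP16/BF16 to FP8 halves storage, and moving from FP32 to FP8 quarters it, which is where the 50%-75% range comes from.

```python
# Bytes per value for each data type and the savings FP8 offers over the
# higher-precision formats. Illustrative arithmetic only.
BYTES = {"FP32": 4, "BF16": 2, "FP8": 1}

n_params = 175e9  # parameter count of a GPT-175B-scale model (for illustration)

for dtype, nbytes in BYTES.items():
    gib = n_params * nbytes / 2**30
    print(f"{dtype}: {nbytes} byte(s)/value -> {gib:,.0f} GiB just for the weights")

print("FP8 vs BF16 saving:", 1 - BYTES["FP8"] / BYTES["BF16"])  # 0.5  -> 50% less
print("FP8 vs FP32 saving:", 1 - BYTES["FP8"] / BYTES["FP32"])  # 0.75 -> 75% less
```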
Despite this, support for FP8 training remains limited today. NVIDIA's Transformer Engine (TE) uses FP8 only for GEMM computations, so the end-to-end speedup and the memory and communication savings it delivers are limited.
But Microsoft's newly open-sourced FP8-LM mixed-precision framework goes much further: FP8-LM is heavily optimized to use the FP8 format throughout the forward and backward passes of training, greatly reducing the system's compute, memory, and communication overhead.
Experimental results show that when training the GPT-175B model on an H100 GPU platform, the FP8-LM mixed-precision training framework not only reduces real memory usage by 42% but also runs 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM) and 17% faster than NVIDIA's Transformer Engine. Moreover, on pre-training and multiple downstream tasks, models trained with the FP8-LM framework achieve results comparable to those trained with the current standard BF16 mixed-precision framework.
Given the same compute resources, the FP8-LM framework can painlessly increase the size of a trainable model by up to 2.5x. Some developers have been speculating on Twitter that if GPT-5 were trained with FP8, even with the same number of H100s, the model could be 2.5 times the size of GPT-4.
FP8-LM Implementation
Specifically, the team designed three optimization levels aimed at using FP8 to streamline mixed-precision and distributed training. These levels progressively integrate 8-bit collective communication, the optimizer, and distributed parallel training; the higher the optimization level, the more FP8 is used in LLM training.
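A minimal sketch of how such progressive levels might be expressed in a training configuration (the level names and fields below are illustrative assumptions, not FP8-LM's actual API):

```python
from dataclasses import dataclass
from enum import IntEnum

class FP8Level(IntEnum):
    """Illustrative optimization levels mirroring the progressive design
    described above (the names are hypothetical, not FP8-LM's real flags)."""
    O1 = 1  # FP8 gradients with 8-bit all-reduce collective communication
    O2 = 2  # O1 plus an FP8 mixed-precision optimizer
    O3 = 3  # O2 plus FP8-aware distributed parallel training

@dataclass
class FP8TrainingConfig:
    level: FP8Level = FP8Level.O1

    @property
    def use_fp8_optimizer(self) -> bool:
        return self.level >= FP8Level.O2

    @property
    def use_fp8_parallelism(self) -> bool:
        return self.level >= FP8Level.O3

cfg = FP8TrainingConfig(level=FP8Level.O3)
print(cfg.use_fp8_optimizer, cfg.use_fp8_parallelism)  # True True
```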
In addition, for large-scale training (e.g., GPT-175B on thousands of GPUs), the framework provides low-bit parallelism at FP8 precision, covering tensor, pipeline, and sequence parallelism, paving the way for the next generation of low-precision parallel training.
Tensor parallelism spreads the layers of a model across multiple devices, placing shards of the weight, gradient, and activation tensors on different GPUs.
To support FP8 in tensor parallelism, the Microsoft team converts the sharded weight and activation tensors to the FP8 format for the linear-layer computation, so that FP8 is used both for the forward computation and for the backward gradient collective communication.
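A rough sketch of the idea, with the FP8 cast simulated via per-tensor scaling (the `cast_to_fp8` helper, the E4M3 maximum of 448, and the sharding layout are assumptions for illustration; the real framework relies on hardware FP8 kernels):

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum representable magnitude of the E4M3 format

def cast_to_fp8(t: torch.Tensor):
    """Simulated FP8 cast with per-tensor scaling: returns the scaled tensor
    and its scaling factor. Real FP8 kernels would store the values in 8 bits."""
    scale = FP8_E4M3_MAX / t.abs().max().clamp(min=1e-12)
    return t * scale, scale

def column_parallel_linear_fp8(x: torch.Tensor, w_shard: torch.Tensor):
    """Each rank holds one column shard of the weight; the forward GEMM runs
    on FP8-cast inputs and weights (simulated here in higher precision)."""
    x_fp8, x_scale = cast_to_fp8(x)
    w_fp8, w_scale = cast_to_fp8(w_shard)
    y = (x_fp8 @ w_fp8.t()) / (x_scale * w_scale)  # un-scale the product
    return y  # each rank produces its own slice of the output features

x = torch.randn(8, 1024)          # (batch, in_features)
w_shard = torch.randn(256, 1024)  # this rank's shard of the (out, in) weight
print(column_parallel_linear_fp8(x, w_shard).shape)  # torch.Size([8, 256])
```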
Sequence parallelism, in contrast, slices the input sequence into multiple chunks and feeds the subsequences to different devices in order to save activation memory.
As shown in Figure 2, sequence parallelism and tensor parallelism are applied to different parts of a Transformer model to make full use of the available memory and improve training efficiency.
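The slicing itself is straightforward; a minimal sketch (the world size, rank handling, and tensor shapes are illustrative):

```python
import torch

def split_sequence(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Slice a (batch, seq_len, hidden) activation along the sequence
    dimension and keep only the chunk that belongs to this rank."""
    return torch.chunk(x, world_size, dim=1)[rank]

x = torch.randn(2, 4096, 1024)                        # (batch, seq_len, hidden)
print(split_sequence(x, world_size=8, rank=3).shape)  # torch.Size([2, 512, 1024])
```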
A complication arises when these states are sharded ZeRO-style: each FP8 tensor carries its own scaling factor, and splitting a tensor into sub-tensors scattered across devices makes that scaling factor awkward to track and apply. To solve this problem, the researchers implemented a new FP8 allocation scheme that distributes whole tensors across devices rather than splitting each tensor into multiple sub-tensors as the ZeRO approach does. The method allocates FP8 tensors greedily, as described in Algorithm 1.
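A minimal sketch of such a greedy whole-tensor allocation (my reading of the idea: sort tensors by size and always place the next one on the least-loaded device; the exact criterion in Algorithm 1 may differ):

```python
def greedy_allocate(tensor_sizes: dict[str, int], num_devices: int) -> dict[int, list[str]]:
    """Assign each tensor, as a whole, to one device while balancing total bytes.
    The per-tensor scaling factor travels with its tensor, so it is never split."""
    load = [0] * num_devices
    placement: dict[int, list[str]] = {d: [] for d in range(num_devices)}
    # Largest tensors first; always pick the currently least-loaded device.
    for name, size in sorted(tensor_sizes.items(), key=lambda kv: -kv[1]):
        device = min(range(num_devices), key=lambda d: load[d])
        placement[device].append(name)
        load[device] += size
    return placement

print(greedy_allocate({"w1": 400, "w2": 300, "w3": 300, "w4": 100}, num_devices=2))
# {0: ['w1', 'w4'], 1: ['w2', 'w3']}
```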
Precision Decoupling
Precision decoupling means decoupling the influence of data precision on parameters such as weights, gradients, and optimizer states, and assigning reduced precision to the components that are not precision-sensitive.
For precision decoupling, the team said they found a guiding principle: gradient statistics can tolerate lower precision, while the master weights require high precision.
More specifically, the first-order gradient moment can tolerate a higher quantization error and can be stored in low-precision FP8, while the second-order moment requires higher precision. This is because, with Adam, the direction of the gradient matters more than its magnitude during model updates: FP8 with tensor scaling preserves the distribution of the first-order moment nearly as well as a high-precision tensor, even though it introduces some loss of precision. The second-order moment is different: gradient values are usually small, so squaring them can cause data underflow, and preserving numerical accuracy therefore requires the higher 16-bit precision.
On the other hand, the team also found that keeping the master weights in high precision is crucial. The fundamental reason is that weight updates can become very large or very small during training; for the master weights, higher precision prevents information from being lost when the weights are updated, allowing more stable and accurate training.
In this implementation, there are two viable options for the master weights: use FP32 full precision, or use FP16 with tensor scaling. The advantage of FP16 with tensor scaling is that it saves memory without sacrificing accuracy, so the new framework defaults to storing the master weights in the optimizer as FP16 with tensor scaling. With this design, the FP8 mixed-precision optimizer needs 6 bytes of memory per parameter during training.
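A minimal sketch of an Adam-style update with decoupled precisions is shown below. The byte breakdown and the simulated casts are my assumptions, chosen only to be consistent with the 6-byte figure above (2 bytes of FP16 master weight, 1 byte of FP8 gradient, 1 byte of FP8 first moment, 2 bytes of FP16 second moment); the actual FP8-LM optimizer uses real 8-bit storage and hardware kernels.

```python
import torch

FP8_E4M3_MAX = 448.0  # assumed maximum magnitude of the E4M3 format

def fake_fp8(t: torch.Tensor):
    """Simulate an FP8-with-tensor-scaling cast: scale into the FP8 range.
    Real code would store the result in 8 bits; here we only keep the scale."""
    scale = FP8_E4M3_MAX / t.abs().max().clamp(min=1e-12)
    return t * scale, scale

class DecoupledAdam:
    """Adam-like update with decoupled precisions (illustrative):
       master weights  -> FP16 with tensor scaling (plain fp16 here)
       first moment m  -> FP8 with tensor scaling (simulated)
       second moment v -> FP16
    """
    def __init__(self, param: torch.Tensor, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        self.master = param.detach().to(torch.float16)          # FP16 master weights
        self.m = torch.zeros_like(param, dtype=torch.float16)   # stand-in for FP8 m
        self.m_scale = torch.tensor(1.0)
        self.v = torch.zeros_like(param, dtype=torch.float16)   # FP16 second moment
        self.lr, self.betas, self.eps = lr, betas, eps

    def step(self, grad: torch.Tensor) -> torch.Tensor:
        b1, b2 = self.betas
        m = self.m.float() / self.m_scale                 # de-scale the first moment
        m = b1 * m + (1 - b1) * grad.float()
        v = b2 * self.v.float() + (1 - b2) * grad.float() ** 2
        update = self.lr * m / (v.sqrt() + self.eps)
        self.master = (self.master.float() - update).to(torch.float16)
        m_scaled, self.m_scale = fake_fp8(m)              # re-quantize m (simulated)
        self.m, self.v = m_scaled.to(torch.float16), v.to(torch.float16)
        return self.master

opt = DecoupledAdam(torch.randn(1024))
print(opt.step(torch.randn(1024) * 1e-3).dtype)  # torch.float16
```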
Autoscaling
Autoscaling aims to keep gradient values within the representable range of the FP8 data format; it requires dynamically adjusting the tensor scaling factors, which reduces data underflow and overflow during all-reduce communication.
Specifically, the researchers introduced an auto-scaling factor μ that can change adaptively during training.
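A minimal sketch of the idea (the thresholds and update rule below are illustrative assumptions, not the exact rule from the paper): before an all-reduce, check how close the scaled gradient comes to the FP8 range and grow or shrink μ accordingly.

```python
import torch

FP8_E4M3_MAX = 448.0  # assumed maximum magnitude of the E4M3 format

def update_mu(grad: torch.Tensor, mu: float,
              headroom: float = 0.5, growth: float = 2.0) -> float:
    """Adjust the scaling factor mu so that grad * mu stays inside the FP8 range.
    If the scaled maximum would overflow, shrink mu; if it uses too little of the
    range (risking underflow of small values), grow it. Illustrative rule only."""
    scaled_max = (grad.abs().max() * mu).item()
    if scaled_max > FP8_E4M3_MAX:
        return mu / growth      # back off to avoid overflow (Inf/NaN) in all-reduce
    if scaled_max < headroom * FP8_E4M3_MAX:
        return mu * growth      # use more of the range to limit underflow
    return mu

mu = 1.0
grad = torch.randn(1024) * 1e-3   # typical small gradient values
for _ in range(20):
    mu = update_mu(grad, mu)
print(mu)  # mu has grown until grad * mu fills a healthy fraction of the FP8 range
```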
Experimental Results
To validate the newly proposed FP8 low-precision framework, the researchers used it to train GPT-style models, covering both pre-training and supervised fine-tuning (SFT). The experiments were conducted on Azure's latest NDv5 H100 supercomputing platform.
The experimental results show that the new FP8 method is effective: compared with the previously widespread BF16 mixed-precision training method, it reduces real memory usage by 27%-42% (for example, 27% for the GPT-7B model and 42% for the GPT-175B model) and cuts weight-gradient communication overhead by 63%-65%.
They also performed ablation experiments to verify the effectiveness of each component.
It is foreseeable that FP8 low-precision training will become part of the new infrastructure for developing large models.