Differences Among SGD, RMSProp, and Adam Optimizers and an Analysis of Training Memory Consumption: LLaMA-2 7B as an Example (Chinese and English Versions)

钵仔糕的波波仔 · 2024-12-03

Chinese Version

In Depth: Differences Among SGD, RMSProp, and Adam Optimizers and an Analysis of Their Memory Consumption

In deep learning, the choice of optimizer has a major impact on training efficiency and on the final model's performance. SGD (Stochastic Gradient Descent), RMSProp, and Adam are three common optimization algorithms that differ significantly in how they update model parameters and how they adjust the learning rate. This post examines the main differences between these three optimizers and analyzes their memory consumption, to help readers better understand their strengths, weaknesses, and suitable use cases.

1. SGD (Stochastic Gradient Descent) Optimizer

1.1 Basic Principle

SGD is the most basic optimizer. Its core idea is to adjust the model parameters at every step according to the current gradient. In each iteration, SGD computes the gradient on a single sample or a small mini-batch of data and uses it to update the parameters, rather than computing the gradient over the entire dataset.

Mathematical formula:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

where:

  • \theta_t is the vector of model parameters at iteration t,
  • \alpha is the learning rate,
  • \nabla_\theta L(\theta_t) is the gradient of the loss computed at the current parameters.

The advantages of SGD are its simple implementation and low computational overhead, which make it particularly effective on large datasets. However, the learning rate must be tuned by hand, and on complex and varied training tasks SGD often suffers from slow and unstable convergence.
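As a minimal sketch of this update rule (assuming PyTorch tensors whose .grad fields were already filled in by a backward pass; the function name sgd_step is illustrative):

import torch

def sgd_step(params, lr=1e-2):
    # Vanilla SGD: theta_{t+1} = theta_t - lr * grad(theta_t).
    # No state is kept beyond the parameters and their gradients.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad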

1.2 Memory Consumption

Compared with RMSProp and Adam, SGD has a lower memory footprint, because it only needs to store the model parameters and their gradients; no additional optimizer state is involved in the computation.

  • Model parameters: assuming float32, the LLaMA-2 7B model has 7 billion parameters at 4 bytes each, so the parameters require
    7 \times 10^9 \times 4 \text{ bytes} = 28 \text{ GB}
  • Gradients: each parameter has one gradient, so the gradients occupy the same amount of memory, i.e. 28 GB.

Therefore, the memory consumption with the SGD optimizer is roughly 56 GB.

The Python code below computes the memory taken by the 7B model's parameters, assuming float32. With a 1024-based conversion the result is 26.08 GB; with a 1000-based conversion it is 28.00 GB. The 1000-based conversion was tried because the 28 GB figure appears in other blog posts, and re-running the calculation shows that it is obtained when 1000 is used as the conversion factor. Reference: https://cuiyuhao.com/posts/c87c0f5d/

# Define the parameters
num_parameters = 7 * 10**9  # 7 billion parameters
bytes_per_param = 4  # 4 bytes per parameter (32-bit float)

# Memory requirement in bytes
memory_in_bytes = num_parameters * bytes_per_param

# Convert bytes to GB
memory_in_GB = memory_in_bytes / (1024 ** 3)  # 1024-based; switch to 1000 ** 3 for 1000-based GB

print(f"Model parameter memory: {memory_in_GB:.2f} GB")

Taking bf16 as an example: since it uses half as many bytes as float32, the parameters take 26.08 GB / 2 = 13.04 GB with a 1024-based conversion, or 28 GB / 2 = 14 GB with a 1000-based conversion.
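The same calculation generalizes to other precisions by changing the bytes per parameter; a small sketch (the helper name and the dtype table are illustrative, using the standard sizes for these formats):

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def param_memory_gb(num_parameters, dtype="float32", base=1024):
    # base=1024 gives the 26.08 GB figure for 7B parameters in float32;
    # base=1000 gives the 28 GB figure quoted in other posts.
    return num_parameters * BYTES_PER_PARAM[dtype] / base ** 3

print(param_memory_gb(7 * 10**9, "float32"))          # ~26.08
print(param_memory_gb(7 * 10**9, "bfloat16", 1000))   # 14.0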

2. RMSProp Optimizer

2.1 Basic Principle

RMSProp is an adaptive-learning-rate optimizer that adjusts the learning rate of each parameter according to a running average of that parameter's squared gradients. Unlike SGD, RMSProp assigns a different effective learning rate to each parameter, which balances learning across dimensions and speeds up convergence.

RMSProp maintains an exponentially decaying average of the squared gradients and uses it to scale the update, which helps avoid exploding or vanishing gradients.

Mathematical formulas:

  1. Gradient computation:
    g_t = \nabla_\theta L(\theta_t)

  2. Exponentially weighted average of squared gradients:
    v_t = \beta v_{t-1} + (1 - \beta) g_t^2
    where \beta is the decay rate, typically 0.9.

  3. Parameter update:
    \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t
    where \alpha is the learning rate and \epsilon is a small constant that prevents division by zero.
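A minimal sketch of these three steps (again assuming PyTorch tensors with populated gradients; the caller keeps a state dict mapping each parameter to its running average v_t, and the names are illustrative):

import torch

def rmsprop_step(params, state, lr=1e-3, beta=0.9, eps=1e-8):
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            v = state.setdefault(p, torch.zeros_like(p))            # one extra tensor per parameter
            v.mul_(beta).addcmul_(p.grad, p.grad, value=1 - beta)    # v_t = beta*v_{t-1} + (1-beta)*g_t^2
            p -= lr * p.grad / (v + eps).sqrt()                      # theta <- theta - lr * g / sqrt(v + eps)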

2.2 Memory Consumption

Compared with SGD, RMSProp must additionally store the moving average of the squared gradients, which increases memory consumption.

  • Model parameters: 28 GB.
  • Gradients: 28 GB.
  • RMSProp state: one running average of squared gradients per parameter, adding another 28 GB.

Therefore, the memory consumption with the RMSProp optimizer is roughly 84 GB.

3. Adam Optimizer

3.1 Basic Principle

Adam (Adaptive Moment Estimation) combines the ideas of Momentum and RMSProp. Besides using the squared gradients to scale the learning rate, Adam also maintains the first moment of the gradients (momentum) to smooth the updates. Adam uses a separate, adaptively adjusted learning rate for each parameter.

Mathematical formulas:

  1. Gradient computation:
    g_t = \nabla_\theta L(\theta_t)

  2. First-moment (momentum) update:
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

  3. Second-moment (squared-gradient) update:
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

  4. Bias correction:
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

  5. Parameter update:
    \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

By keeping exponentially weighted averages of both the gradients (momentum) and the squared gradients, Adam adjusts each parameter's learning rate dynamically and in most cases converges faster.
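A minimal sketch of one Adam step under the same assumptions as above (the caller keeps the per-parameter (m, v) state and the step counter t, starting at 1; names are illustrative):

import torch

def adam_step(params, state, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            m, v = state.setdefault(p, (torch.zeros_like(p), torch.zeros_like(p)))  # two extra tensors per parameter
            m.mul_(beta1).add_(p.grad, alpha=1 - beta1)               # first moment
            v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)   # second moment
            m_hat = m / (1 - beta1 ** t)                              # bias correction
            v_hat = v / (1 - beta2 ** t)
            p -= lr * m_hat / (v_hat.sqrt() + eps)                    # parameter update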

3.2 Memory Consumption

Besides the model parameters and gradients, Adam also stores estimates of the first and second moments. Compared with RMSProp, Adam keeps two extra variables per parameter (momentum and squared gradient), so its memory consumption is larger.

  • Model parameters: 28 GB.
  • Gradients: 28 GB.
  • Adam state: one momentum value and one squared-gradient average per parameter, adding another 56 GB.

Therefore, the memory consumption with the Adam optimizer is roughly 112 GB.

4. Summary and Comparison

Optimizer   Model Parameters   Gradients   Optimizer State   Total Memory
SGD         28 GB              28 GB       —                 56 GB
RMSProp     28 GB              28 GB       28 GB             84 GB
Adam        28 GB              28 GB       56 GB             112 GB

  • SGD: lowest memory consumption; only the model parameters and gradients are stored.
  • RMSProp: additionally stores the moving average of squared gradients, increasing memory consumption.
  • Adam: the most complex of the three; it stores both the momentum and the squared-gradient moving averages, so its memory consumption is the largest.
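The totals in this table can be reproduced with a short script; a sketch under the same assumptions as above (float32 values, 1000-based GB; the names STATE_SLOTS and training_memory_gb are illustrative):

# Extra per-parameter state tensors: SGD keeps none, RMSProp keeps v_t, Adam keeps m_t and v_t.
STATE_SLOTS = {"SGD": 0, "RMSProp": 1, "Adam": 2}

def training_memory_gb(num_params=7 * 10**9, bytes_per_value=4, base=1000):
    for name, slots in STATE_SLOTS.items():
        params_gb = num_params * bytes_per_value / base ** 3
        grads_gb = params_gb
        state_gb = slots * params_gb
        print(f"{name:8s} params={params_gb:.0f} GB  grads={grads_gb:.0f} GB  "
              f"state={state_gb:.0f} GB  total={params_gb + grads_gb + state_gb:.0f} GB")

training_memory_gb()   # reproduces the 56 / 84 / 112 GB totals above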

5. Choosing an Optimizer

  • SGD is suitable for problems with very large amounts of data, but it converges slowly and may require manual learning-rate tuning on complex problems.
  • RMSProp performs well with sparse data and noisy gradients, and is well suited to RNNs and similar tasks.
  • Adam is currently one of the most widely used optimizers and is especially good at handling sparse gradients and complex tasks. It adjusts the learning rate of each parameter automatically and usually provides the best convergence.

Conclusion

Choose the optimizer that fits your hardware resources and your task. If memory is limited and the task is relatively simple, SGD is a good choice. For more complex tasks where you want adaptive learning rates, RMSProp or Adam are better options, at the cost of higher memory consumption.

English Version

A Detailed Analysis of SGD, RMSProp, and Adam Optimizers, and Their Memory Consumption

In deep learning, the choice of optimizer significantly impacts both the training efficiency and the performance of the model. SGD (Stochastic Gradient Descent), RMSProp, and Adam are three commonly used optimization algorithms, each with its own way of updating model parameters and adjusting the learning rates. In this blog, we will dive deep into the key differences between these three optimizers, discuss their memory consumption, and analyze how they differ in terms of computational resources. This will help you better understand the strengths, weaknesses, and use cases of each optimizer.

1. SGD (Stochastic Gradient Descent) Optimizer

1.1 Basic Principle

SGD is one of the simplest and most fundamental optimizers. The core idea behind SGD is to adjust the model parameters based on the gradient calculated from a single data sample or a small batch of data. During each iteration, the optimizer updates the model parameters in the direction that reduces the loss, which is computed using the gradient of the loss function.

Mathematical Formula:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

Where:

  • \theta_t is the vector of model parameters at time step t,
  • \alpha is the learning rate,
  • \nabla_\theta L(\theta_t) is the gradient of the loss function with respect to the parameters.

SGD has the advantage of being simple and computationally efficient, especially for large datasets. However, it requires careful tuning of the learning rate and often struggles with slow convergence and unstable behavior in complex tasks.
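In practice this update is usually delegated to a framework. A minimal PyTorch sketch of a single SGD step (the Linear layer is a stand-in model, not LLaMA-2, and the loss is a dummy):

import torch

model = torch.nn.Linear(4096, 4096)                  # stand-in model, not LLaMA-2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(8, 4096)).pow(2).mean()     # dummy loss
loss.backward()                                      # fills .grad for every parameter
optimizer.step()                                     # theta <- theta - lr * grad
optimizer.zero_grad()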

1.2 Memory Consumption

Compared to RMSProp and Adam, SGD consumes relatively less memory because it only needs to store the model parameters and the gradients.

  • Model Parameters: Assuming float32, the LLaMA-2 7B model has 7 billion parameters, each taking up 4 bytes. The memory required is
    7 \times 10^9 \times 4 \text{ bytes} = 28 \text{ GB}
  • Gradients: Since each parameter has a corresponding gradient, the memory required for gradients is also 28 GB.

Thus, the total memory consumption using SGD is approximately 56 GB.

2. RMSProp Optimizer

2.1 Basic Principle

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimizer that adjusts the learning rate for each parameter based on the average of the squared gradients. Unlike SGD, which uses a single learning rate for all parameters, RMSProp assigns different learning rates to each parameter, making it particularly effective for training models with sparse gradients and noisy data.

RMSProp computes an exponentially decaying average of the squared gradients and uses this to adjust the step size during each update. This helps prevent issues like exploding or vanishing gradients.

Mathematical Formula:

  1. Gradient Calculation:
    g_t = \nabla_\theta L(\theta_t)

  2. Exponential Moving Average of Squared Gradients:
    v_t = \beta v_{t-1} + (1 - \beta) g_t^2
    where \beta is the decay factor, typically set to 0.9.

  3. Parameter Update:
    \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t
    where \alpha is the learning rate, and \epsilon is a small constant added to prevent division by zero.
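The corresponding framework call is a one-liner; a minimal PyTorch sketch (the Linear layer is a stand-in model; in torch.optim.RMSprop the decay factor \beta is called alpha):

import torch

model = torch.nn.Linear(4096, 4096)   # stand-in model
# PyTorch keeps one "square_avg" buffer per parameter, i.e. the v_t above.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)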

2.2 Memory Consumption

Compared to SGD, RMSProp requires additional memory to store the exponentially decaying average of the squared gradients. This extra memory overhead increases the total memory consumption.

  • Model Parameters: 28 GB.
  • Gradients: 28 GB.
  • RMSProp State: Each parameter needs to store an average of its squared gradients, which requires another 28 GB of memory.

Thus, the total memory consumption when using RMSProp is approximately 84 GB.

3. Adam Optimizer

3.1 Basic Principle

Adam (Adaptive Moment Estimation) is an optimizer that combines the advantages of both Momentum and RMSProp. Adam uses both the first moment (mean of gradients) and the second moment (mean of squared gradients) to adaptively adjust the learning rate for each parameter. It has become one of the most widely used optimizers due to its fast convergence and ability to adapt to different datasets.

Mathematical Formula:

  1. Gradient Calculation:
    g_t = \nabla_\theta L(\theta_t)

  2. First Moment (Momentum) Update:
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

  3. Second Moment (Squared Gradient) Update:
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

  4. Bias Correction:
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

  5. Parameter Update:
    \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Adam adapts the learning rates for each parameter using both the first and second moments, which helps it achieve faster convergence than SGD and RMSProp.
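The corresponding PyTorch call, again with a stand-in model (torch.optim.Adam keeps the two moment buffers internally):

import torch

model = torch.nn.Linear(4096, 4096)   # stand-in model
# PyTorch keeps two buffers per parameter: "exp_avg" (m_t) and "exp_avg_sq" (v_t).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)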

3.2 Memory Consumption

Adam requires storing the model parameters, gradients, first moments (momentum), and second moments (squared gradients). This significantly increases its memory consumption compared to RMSProp.

  • Model Parameters: 28 GB.
  • Gradients: 28 GB.
  • First Moment (Momentum): 28 GB.
  • Second Moment (Squared Gradients): 28 GB.

Thus, the total memory consumption when using Adam is approximately 112 GB.

4. Summary and Comparison

Optimizer   Model Parameters   Gradients   Optimizer State   Total Memory Consumption
SGD         28 GB              28 GB       None              56 GB
RMSProp     28 GB              28 GB       28 GB             84 GB
Adam        28 GB              28 GB       56 GB             112 GB

  • SGD: Consumes the least memory as it only requires storing the model parameters and gradients.
  • RMSProp: Requires extra memory to store the exponentially moving average of squared gradients, increasing memory consumption.
  • Adam: Requires the most memory, as it stores both the momentum and squared gradient averages, which significantly increase memory consumption.
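The optimizer-state column can also be checked empirically by summing the bytes of the buffers an optimizer actually allocates after its first step; a small sketch assuming PyTorch and a stand-in model (the helper name is illustrative):

import torch

def optimizer_state_bytes(optimizer):
    # Sum the storage of every tensor the optimizer keeps as per-parameter state (m, v, ...).
    total = 0
    for per_param_state in optimizer.state.values():
        for value in per_param_state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total

model = torch.nn.Linear(4096, 4096)                  # stand-in model
opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 4096)).pow(2).mean().backward()
opt.step()                                           # state buffers are created on the first step
print(optimizer_state_bytes(opt) / 1e6, "MB")        # roughly twice the parameter bytes for Adam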

5. Optimizer Choice

  • SGD: Suitable for large datasets, but requires careful tuning of the learning rate and may converge slowly on complex problems.
  • RMSProp: Works well with noisy gradients or sparse data, and is often used in recurrent neural networks (RNNs) and similar tasks.
  • Adam: A robust optimizer that adapts learning rates for each parameter and is typically the best choice for most tasks due to its fast convergence.

Conclusion

Choosing the right optimizer depends on your hardware resources and the specific task at hand. If memory is a concern and the task is relatively simple, SGD is a good choice. However, for more complex tasks with noisy or sparse gradients, RMSProp or Adam may offer better performance, albeit at the cost of higher memory consumption.

Postscript

Completed in Shanghai at 19:19 on November 29, 2024, with the assistance of GPT-4o.
