Original paper: https://arxiv.org/pdf/2410.19456
Computational Bottlenecks of Training Small-scale Large Language Models
Saleh Ashkboos* Iman Mirzadeh Keivan Alizadeh
Moin Nabi
Apple
saleh.ashkboos@inf.ethz.ch, fartash@apple.com
Abstract
While large language models (LLMs) dominate the AI landscape, Small-scale large Language Models (SLMs) are gaining attention due to cost and efficiency demands from consumers. However, there is limited research on the training behavior and computational requirements of SLMs. In this study, we explore the computational bottlenecks of training SLMs (up to 2B parameters) by examining the effects of various hyperparameters and configurations, including GPU type, batch size, model size, communication protocol, attention type, and the number of GPUs. We assess these factors on popular cloud services using metrics such as loss per dollar and tokens per second². Our findings aim to support the broader adoption and optimization of language model training for low-resource AI research institutes.
1 Introduction
Large Language Models (LLMs) are becoming increasingly popular in various fields due to their performance on a variety of tasks [6, 18, 8, 20, 5]. However, deploying large models widely, such as on mobile hardware and edge devices, is challenging due to their large memory and compute requirements [11, 12, 10]. These constraints have driven a growing interest in smaller language models (such as ≤2B parameters) as a viable alternative [24, 16, 23]. Recent work refers to these models as Small-scale large Language Models (SLMs), which can work well in environments where cost-efficiency and resource limitations are of significant concern, as well as on servers where the reduced cost of inference is a dominant factor in attracting and retaining customers.
SLMs have demonstrated substantial potential in achieving competitive results despite their smaller size. Techniques such as pruning, distillation, and quantization have been employed to enhance their performance [2, 3, 17], allowing SLMs to perform on par with, and in some cases surpass, much larger models [4]. For example, Gemma-2B outperforms the largest OPT-175B [25], challenging the notion that sheer model size is the primary determinant of effectiveness. In addition to on-par accuracy, SLMs can meet consumer demands for fast, efficient, and cost-effective AI without sacrificing task performance, making them increasingly attractive for organizations with limited computational budgets, such as small businesses and academic institutions.
While prior work has mostly focused on optimizing SLMs for inference [15], relatively little attention has been paid to their training dynamics. This gap is significant, as the computational and infrastructure demands of training LLMs may not translate to SLMs. Given the diverse range of hardware configurations available on cloud platforms, such as GPU type, batch size, and communication protocols, there is a need for a systematic analysis of how these factors impact the training efficiency of SLMs, particularly when measured in terms of practical metrics such as loss per dollar and tokens per second. Our findings indicate that for smaller models, more affordable options like A100-40GB GPUs and Distributed Data Parallel (DDP) can be used without sacrificing performance. For larger models, advanced configurations, such as A100-80GB and H100-80GB GPUs paired with Flash Attention (FA) and Fully Sharded Data Parallel (FSDP), are necessary to handle larger batch sizes and prevent memory-related issues.
*Work done during an internship at Apple.
² We use average dollar cost ratios of cloud instance types based on publicly available pricing (Appx. A).
Recent advancements in the field underscore the importance of scaling AI systems not only for state-of-the-art performance but also for practical applications in real-world environments. The emerging trend toward SLMs suggests that a re-evaluation of hardware and computation strategies is essential. The contribution of this paper is to address the need for such an evaluation, providing a systematic study of the computational bottlenecks and cost-efficiency of training SLMs up to 2B parameters on various cloud infrastructures and setups. We find that 1) FlashAttention is significantly more important for SLMs than for LLMs, 2) expensive hardware, e.g., H100-80GB and A100-80GB, is not necessarily cost-effective for SLM training, 3) DDP is the best distributed training scheme for SLMs, and 4) maximizing GPU memory utilization is not cost-optimal for SLM training.
2 Metrics
Our goal is to find architectures with maximal performance and minimal training cost. It is common to measure the cost of training in terms of wall-clock time, iterations, or tokens. However, these metrics are incomplete for choosing a sufficient infrastructure within a budget. We recommend metrics that directly incorporate the dollar cost of hardware. Specifically, we aim to maximize the accuracy of the model while minimizing the cost, in other words, to optimize accuracy per dollar. Prior works have discovered neural scaling laws controlling the relation between accuracy, loss, and the number of samples or tokens seen during training [14]. In this paper, we focus on the number of tokens processed during training (Token/Second) for various architectures and measure the cost of processing tokens (Token/Dollar). Given Token/Second measurements and Loss/Token derived from scaling laws, we get

Loss/Dollar = Loss/Token × Token/Second × Second/Dollar,
where Second/Dollar reflects the cost of hardware and infrastructure and is extracted by averaging publicly available prices from various cloud providers for each hardware configuration (see Appx. A). The result of our analysis provides Token/Dollar laws that, combined with loss scaling laws, can be used to find the minimal cost to reach a target loss. Other metrics of interest, such as CPU/GPU utilization and memory bandwidth usage, are left for future work. We report Token/Dollar in various setups below, where a value of 1k means training for 1k tokens costs $1.
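To make the unit bookkeeping concrete, the following is a minimal sketch of the conversion described above. The hourly price and the scaling-law coefficients are illustrative placeholders, not values from this paper or its Appx. A.

```python
# Sketch: converting a measured throughput into Token/Dollar, and combining it
# with a loss scaling law to estimate the cost of reaching a token budget.
# The $/hour price and the scaling-law coefficients are illustrative placeholders.

def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    # Token/Dollar = Token/Second * Second/Dollar
    seconds_per_dollar = 3600.0 / dollars_per_hour
    return tokens_per_second * seconds_per_dollar

def loss_after_tokens(num_tokens: float, a: float = 1.7, b: float = 2.5, alpha: float = 0.3) -> float:
    # Hypothetical Chinchilla-style fit for Loss/Token: L(T) = a + b * T^(-alpha)
    return a + b * num_tokens ** (-alpha)

# Example: 50k tokens/s measured on an instance assumed to cost $2/hour.
tpd = tokens_per_dollar(50_000, 2.0)     # 90,000,000 tokens per dollar
budget = 10e9                            # target token budget
print(f"Token/Dollar: {tpd:,.0f}")
print(f"Cost to train on {budget:.0e} tokens: ${budget / tpd:,.2f}")
print(f"Predicted loss at that budget: {loss_after_tokens(budget):.3f}")
```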
3 Model and Parameters
We focus on LLaMa architectures [21, 9] as they are among the most popular architectures in recent public LLMs and SLMs [1, 13]. The smallest LLaMa-2/3 models have 7B and 8B parameters respectively, which is still too large for most mobile hardware. We derive the number of decoder blocks and parameters of our models by fitting a curve over the LLaMa models and use it to define our models (see Fig. 5 in Appx. B). We evaluate four model sizes with 100M, 500M, 1B, and 2B parameters. Notably, we maximize over all configuration parameters not shown in the x-axis or legend of a figure. That is, we run a large grid search over all combinations of the configuration parameters listed below, and each point in each plot is the best configuration given the parameters specified in the plot. This way, we find the optimal Token/Dollar and assume one can tune optimization hyperparameters such as the learning rate to achieve optimal convergence with the hardware-optimal configurations. We present details of these derivative models in Appx. B. Next, we define the configuration parameters:
GPU Types: We evaluate three NVIDIA GPU types: A100-40GB, A100-80GB, and H100-80GB. We use the BFloat16 data type on all GPUs.
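As a reference point, here is a minimal sketch of instantiating a LLaMa-style model in BFloat16 with HuggingFace Transformers and PyTorch; the config values are illustrative and not one of the four model sizes studied in this paper.

```python
# Sketch: a small LLaMa-style model instantiated in BFloat16 on the GPU.
# The config values are illustrative, not the paper's 100M-2B variants.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
                     intermediate_size=2048, vocab_size=32000,
                     max_position_embeddings=1024)
model = LlamaForCausalLM(config).to(device="cuda", dtype=torch.bfloat16)
```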
GPU Numbers and Communication: We study three main training configurations for each GPU type: single-node-single-GPU (1 GPU), single-node-multi-GPU (2, 4, and 8 GPUs), and multi-node-multi-GPU (16, 32, and 64 GPUs). When we use more than a single GPU, we evaluate Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) [26] for communication. For FSDP, we study two sharding policies: 1) full sharding, where we shard all gradients, optimizer states, and weights, and 2) grad_op sharding, where we shard only gradients and optimizer states (but keep the weights unsharded). We use RDMA/EFA for inter-node communication.
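A hedged sketch of how these three schemes map onto PyTorch's built-in wrappers follows; it is a sketch rather than the paper's actual training harness, and it assumes the process group has already been initialized (e.g. via torchrun) and that the model already lives on the local GPU.

```python
# Sketch: wrapping a model with the three parallelization schemes compared here.
# Assumes torch.distributed is initialized (e.g. launched with torchrun) and the
# model is already on the local GPU.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_model(model: torch.nn.Module, scheme: str) -> torch.nn.Module:
    if scheme == "DDP":
        return DDP(model)
    if scheme == "FSDP-Full":
        # Shard weights, gradients, and optimizer states across ranks.
        return FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
    if scheme == "FSDP-Grad+Optimizer":
        # Shard only gradients and optimizer states; keep weights replicated.
        return FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
    raise ValueError(f"Unknown scheme: {scheme}")
```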

Figure 1: FlashAttention is more cost-efficient for smaller models. Maximum Token/Dollar across GPU types, GPU numbers, and communication types when using FlashAttention. FlashAttention shows a significant Token/Dollar improvement over vanilla attention for smaller models and batch sizes. OOM runs are shown as 0. Token/Dollar = 1k means training for 1k tokens costs $1.
Number of Samples: We evaluate various numbers of samples that fit into a single GPU during training. We fix the sequence length to 1024 and iterate over the batch size that fits on a single device. As we cannot fit 128 samples into a single GPU's memory even for our smallest (100M) model, we study per-device batch sizes of 4, 8, 16, 32, and 64. We do not use gradient accumulation.
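For concreteness, the global batch sizes reported in the figures follow directly from the per-device batch size and the number of GPUs, since gradient accumulation is not used. A small worked example:

```python
# Worked example: with a per-device batch of 64 on 8 GPUs and 1024-token sequences,
# one optimizer step processes 512 sequences, i.e. 524,288 tokens.
seq_len = 1024
per_device_batch = 64
num_gpus = 8
global_batch = per_device_batch * num_gpus    # 512 sequences per step
tokens_per_step = global_batch * seq_len      # 524,288 tokens per step
print(global_batch, tokens_per_step)
```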
Flash Attention: We study the effect of using FlashAttention [7] for the attention block.
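A minimal sketch of toggling between FlashAttention-2 and vanilla (eager) attention for a LLaMa-style model in HuggingFace Transformers is shown below; it assumes a recent transformers version that accepts the attn_implementation argument, an installed flash-attn package, and a BF16-capable GPU. The config values are illustrative placeholders, not the paper's exact variants.

```python
# Sketch: building the same LLaMa-style model with FlashAttention-2 or with
# vanilla (eager) attention. Assumes a recent transformers version supporting
# the attn_implementation argument and the flash-attn package being installed.
import torch
from transformers import AutoModelForCausalLM, LlamaConfig

def build_model(use_flash_attention: bool):
    config = LlamaConfig(hidden_size=1024, num_hidden_layers=16, num_attention_heads=16,
                         intermediate_size=4096, vocab_size=32000,
                         max_position_embeddings=1024)
    attn = "flash_attention_2" if use_flash_attention else "eager"
    model = AutoModelForCausalLM.from_config(config, attn_implementation=attn,
                                             torch_dtype=torch.bfloat16)
    return model.to("cuda")
```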
4 Experimental Results
In this section, we present results on A100-40GB, A100-80GB, and H100-80GB GPUs. We implement our models in HuggingFace [22] and run our experiments in PyTorch [19] without any additional frameworks, using a sequence length of 1024. We provide details of the runtime configuration in Appx. C. For each setup, we run our experiments at least 3 times, each with 10 training steps, and report the average and error bars. We organize our findings into research questions and answers.
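To make the measurement procedure concrete, here is a hedged sketch of such a short run in plain PyTorch: a handful of timed training steps on random token batches, converted into a Token/Second estimate that can then be fed into the Token/Dollar computation from Section 2. The AdamW optimizer and the single warm-up step are assumptions, not necessarily the paper's exact harness.

```python
# Sketch: estimating Token/Second from a short run of training steps on random
# token batches. The optimizer choice and warm-up handling are assumptions.
import time
import torch

def measure_tokens_per_second(model, per_device_batch: int, num_gpus: int = 1,
                              seq_len: int = 1024, num_steps: int = 10,
                              vocab_size: int = 32000) -> float:
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    device = next(model.parameters()).device

    def train_step():
        input_ids = torch.randint(0, vocab_size, (per_device_batch, seq_len), device=device)
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    train_step()                      # warm-up step, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Under DDP/FSDP every rank runs this loop, so global throughput scales
    # with the number of GPUs.
    return num_steps * per_device_batch * num_gpus * seq_len / elapsed
```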
Q1: How important is it to use FlashAttention during SLM training?
Figure 1 compares the use of FlashAttention2 [7] against vanilla attention for different global batch sizes. First, we see that FlashAttention (FA) significantly increases Token/Dollar for SLMs. Notably, FA improves Token/Dollar more for smaller models, as the cost of attention is quadratic in context length and dominates as the hidden dimension shrinks. Such SLMs enter a data-bound regime where data movement (CPU/GPU as well as GPU/GPU) becomes the main bottleneck. Finally, we see that for larger models (1B and 2B), FA enables training with larger batch sizes (1024), while vanilla attention results in an out-of-memory (OOM) error.
Q2: Given a fixed number of GPUs, what is the best GPU type for training an SLM?
Figure 2 shows the results of training our models using different GPU types: A100-40GB and A100-80GB. Although we do not see a consistent pattern across all models, the A100-80GB GPU is a better choice when we use a large number of GPUs (32) to train larger models (1B and 2B). In such cases, the A100-80GB can accommodate larger batch sizes, while for smaller models we can use the cheaper 40GB GPU (see Appx. A for hardware prices).
Q3: What is the best communication scheme for training SLMs with different numbers of nodes?
Next, we study the role of different parallelization schemes in SLM training. To this end, we consider Distributed Data Parallel (DDP), Fully Sharded Data Parallel with the full sharding policy (FSDP-Full), and Fully Sharded Data Parallel that shards only gradients and optimizer states (FSDP-Grad+Optimizer). Figure 3 shows the results of training our models with different parallelization schemes on A100-80GB GPUs. Our results show that for smaller models, DDP is a better choice due to its lower communication volume. However, for the largest model (2B), FSDP outperforms DDP because it can train larger batch sizes (see Q4). Finally, we observe that FSDP-Grad+Optimizer outperforms FSDP-Full due to its lower communication overhead.

Figure 3: DDP is the best scheme for training SLMs. Maximum Token/Dollar for different numbers of GPU nodes on A100-80GB across different batch sizes. We use FlashAttention in our models. For a single node, we use 2, 4, and 8 GPUs, while for 2 and 4 nodes we use 16 and 32 GPUs respectively.
Q4: What is the best communication scheme for training SLMs with different global batch sizes? Fig. 4 shows the results of training SLMs with various global batch sizes using DDP, FSDP-Full, and FSDP-Grad+Optimizer across various per-device batch sizes. We do not see a substantial difference at small batch sizes in our experiments. However, similar to Q3, FSDP always outperforms DDP for the largest model (2B) and the largest batch sizes. In addition, FSDP enables training larger global batch sizes (512) for larger models (2B) compared to DDP, which runs out of memory (OOM).

Figure 4: For SLMs, increasing the global batch size saturates cost-efficiency before GPU memory is fully utilized. Maximum Token/Dollar for different global batch sizes across different GPU types, GPU numbers, and per-device batch sizes. We use FlashAttention in our models.
5 Summary and Conclusion
In this study, we examined the computational bottlenecks of training Small-scale Language Models (SLMs) up to 2B parameters, focusing on the impact of hyperparameters and hardware configurations. Our findings highlight the importance of Flash Attention (FA) for smaller models and batch sizes, where data movement is the primary bottleneck. FA also enables training larger models (1B and 2B) with batch sizes up to 512, avoiding out-of-memory (OOM) errors common with vanilla attention. Additionally, we found that A100-80GB GPUs are optimal for training larger models with many GPUs, while the more cost-effective A100-40GB works well for smaller models. In terms of distributed training, DDP is more efficient for smaller models, but FSDP outperforms it for larger models, particularly when training large batch sizes. These insights provide practical guidance for optimizing SLM training by offering clear strategies for selecting the most efficient hardware and parallelization methods based on model size.