LLM Inference Performance Engineering: Best Practices

by Megha Agarwal, Asfandyar Qureshi, Nikhil Sardana, Linden Li, Julian Quevedo and Daya Khudia


In this blog post, the MosaicML engineering team shares best practices for how to capitalize on popular open source large language models (LLMs) for production usage. We also provide guidelines for deploying inference services built around these models to help users in their selection of models and deployment hardware. We have worked with multiple PyTorch-based backends in production; these guidelines are drawn from our experience with FasterTransformers, vLLM, NVIDIA's soon-to-be-released TensorRT-LLM, and others.

Understanding LLM Text Generation

Large Language Models (LLMs) generate text in a two-step process: "prefill", where the tokens in the input prompt are processed in parallel, and "decoding", where text is generated one 'token' at a time in an autoregressive manner. Each generated token is appended to the input and fed back into the model to generate the next token. Generation stops when the LLM outputs a special stop token or when a user-defined condition is met (e.g., some maximum number of tokens has been generated). If you'd like more background on how LLMs use decoder blocks, check out this blog post.

Tokens can be words or sub-words; the exact rules for splitting text into tokens vary from model to model. For instance, you can compare how Llama models tokenize text to how OpenAI models tokenize text. Although LLM inference providers often talk about performance in token-based metrics (e.g., tokens/second), these numbers are not always comparable across model types given these variations. For a concrete example, the team at Anyscale found that Llama 2 tokenization is 19% longer than ChatGPT tokenization (but still has a much lower overall cost). And researchers at HuggingFace also found that Llama 2 required ~20% more tokens to train over the same amount of text as GPT-4.

Important Metrics for LLM Serving

So, how exactly should we think about inference speed?

Our team uses four key metrics for LLM serving:

  1. Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
  2. Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
  3. Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated). A short sketch of this calculation follows this list.
  4. Throughput: The number of output tokens per second an inference server can generate across all users and requests.
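
To make these definitions concrete, here is a minimal sketch (in Python, with purely illustrative numbers rather than measurements) of how overall latency follows from TTFT and TPOT:

```python
def overall_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Overall response latency = TTFT + TPOT * number of generated tokens."""
    return ttft_s + tpot_s * output_tokens

# Illustrative numbers only: a 200 ms TTFT and a 100 ms/token TPOT
# (i.e., 10 tokens per second per user) for a 250-token response.
ttft, tpot, n_out = 0.200, 0.100, 250

print(f"latency: {overall_latency_s(ttft, tpot, n_out):.1f} s")  # 25.2 s
print(f"per-user speed: {1.0 / tpot:.0f} tokens/sec")            # 10 tokens/sec
```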

Our goal? The fastest time to first token, the highest throughput, and the quickest time per output token. In other words, we want our models to generate text as fast as possible for as many users as we can support.

Notably, there is a tradeoff between throughput and time per output token: if we process 16 user queries concurrently, we'll have higher throughput compared to running the queries sequentially, but we'll take longer to generate output tokens for each user.

If you have overall inference latency targets, here are some useful heuristics for evaluating models:

  • Output length dominates overall response latency: For average latency, you can usually just take your expected/max output token length and multiply it by an overall average time per output token for the model.
  • Input length is not significant for performance but important for hardware requirements: The addition of 512 input tokens increases latency less than the production of 8 additional output tokens in the MPT models. However, the need to support long inputs can make models harder to serve. For example, we recommend using the A100-80GB (or newer) to serve MPT-7B with its maximum context length of 2048 tokens.
  • Overall latency scales sub-linearly with model size: On the same hardware, larger models are slower, but the speed ratio won't necessarily match the parameter count ratio. MPT-30B latency is ~2.5x that of MPT-7B latency. Llama2-70B latency is ~2x that of Llama2-13B latency.

We are often asked by prospective customers to provide an average inference latency. We recommend that before you anchor yourself to specific latency targets ("we need less than 20 ms per token"), you should spend some time characterizing your expected input and desired output lengths.

Challenges in LLM Inference

Optimizing LLM inference benefits from general techniques such as:

  • Operator Fusion: Combining different adjacent operators together often results in better latency.
  • Quantization: Activations and weights are compressed to use a smaller number of bits.
  • Compression: Sparsity or Distillation.
  • Parallelization: Tensor parallelism across multiple devices or pipeline parallelism for larger models.

Beyond these methods, there are many important Transformer-specific optimizations. A prime example of this is KV (key-value) caching. The Attention mechanism in decoder-only Transformer-based models is computationally inefficient. Each token attends to all previously seen tokens, and thus recomputes many of the same values as each new token is generated. For example, while generating the Nth token, the (N-1)th token attends to (N-2)th, (N-3)th … 1st tokens. Similarly, while generating (N+1)th token, attention for the Nth token again needs to look at the (N-1)th, (N-2)th, (N-3)th, … 1st tokens. KV caching, i.e., saving of intermediate keys/values for the attention layers, is used to preserve those results for later reuse, avoiding repeated computation.
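
To illustrate the idea (a toy sketch, not the implementation of any particular serving framework), the example below keeps an append-only cache of keys and values so that each decode step only attends the newest token's query against the stored K/V plus one new pair, instead of recomputing keys and values for every previous token. The single-head shapes and the random projections standing in for the model are simplifying assumptions:

```python
import numpy as np

class KVCache:
    """Toy single-layer, single-head KV cache: append-only key/value storage."""
    def __init__(self, d_head: int):
        self.keys = np.zeros((0, d_head))    # (tokens_seen, d_head)
        self.values = np.zeros((0, d_head))

    def append(self, k: np.ndarray, v: np.ndarray):
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(q, k, v, cache: KVCache):
    """Attend the newest token's query against all cached keys/values.

    Without the cache, every step would recompute K/V for all previous
    tokens; with it, each step only adds one new K/V pair.
    """
    cache.append(k, v)                               # reuse old K/V, add the new pair
    scores = cache.keys @ q / np.sqrt(q.shape[-1])   # (tokens_seen,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over all cached positions
    return weights @ cache.values                    # attention output for the new token

# Illustrative usage: 5 decode steps with random projections standing in for the model.
d_head = 8
cache = KVCache(d_head)
for _ in range(5):
    q, k, v = (np.random.randn(d_head) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(cache.keys.shape)  # (5, 8): one cached key (and value) per generated token
```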

Memory Bandwidth is Key

Computations in LLMs are mainly dominated by matrix-matrix multiplication operations; these operations with small dimensions are typically memory-bandwidth-bound on most hardware. When generating tokens in an autoregressive manner, one of the activation matrix dimensions (defined by batch size and number of tokens in the sequence) is small at small batch sizes. Therefore, the speed is dependent on how quickly we can load model parameters from GPU memory to local caches/registers, rather than how quickly we can compute on loaded data. Available and achieved memory bandwidth in inference hardware is a better predictor of token generation speed than peak compute performance.

Inference hardware utilization is very important in terms of serving costs. GPUs are expensive and we need them to do as much work as possible. Shared inference services promise to keep costs low by combining workloads from many users, filling in individual gaps and batching together overlapping requests. For large models like Llama2-70B, we only achieve good cost/performance at large batch sizes. Having an inference serving system that can operate at large batch sizes is critical for cost efficiency. However, a large batch means larger KV cache size, and that in turn increases the number of GPUs required to serve the model. There's a tug-of-war here and shared service operators need to make some cost trade-offs and implement systems optimizations.

Model Bandwidth Utilization (MBU)

How optimized is an LLM inference server?

As briefly explained earlier, inference for LLMs at smaller batch sizes—especially at decode time—is bottlenecked on how quickly we can load model parameters from the device memory to the compute units. Memory bandwidth dictates how quickly the data movement happens. To measure the underlying hardware's utilization, we introduce a new metric called Model Bandwidth Utilization (MBU). MBU is defined as (achieved memory bandwidth) / (peak memory bandwidth) where achieved memory bandwidth is ((total model parameter size + KV cache size) / TPOT).

For example, if a 7B parameter model running with 16-bit precision has a TPOT of 14ms, then it's moving 14GB of parameters in 14ms, which translates to 1TB/sec of bandwidth usage. If the peak bandwidth of the machine is 2TB/sec, we are running at an MBU of 50%. For simplicity, this example ignores KV cache size, which is small for smaller batch sizes and shorter sequence lengths. MBU values close to 100% imply that the inference system is effectively utilizing the available memory bandwidth. MBU is also useful to compare different inference systems (hardware + software) in a normalized manner. MBU is complementary to the Model Flops Utilization (MFU; introduced in the PaLM paper) metric, which is important in compute-bound settings.
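
As a back-of-the-envelope check, here is a small sketch of that MBU calculation, reproducing the illustrative 7B / 14ms example above (KV cache traffic ignored, as in the text):

```python
def model_bandwidth_utilization(param_bytes: float,
                                kv_cache_bytes: float,
                                tpot_s: float,
                                peak_bw_bytes_per_s: float) -> float:
    """MBU = achieved bandwidth / peak bandwidth, where achieved bandwidth
    is the bytes moved per output token (params + KV cache) divided by TPOT."""
    achieved = (param_bytes + kv_cache_bytes) / tpot_s
    return achieved / peak_bw_bytes_per_s

# 7B parameters at 16-bit precision (2 bytes/param) = 14 GB moved per decode step.
mbu = model_bandwidth_utilization(param_bytes=7e9 * 2,
                                  kv_cache_bytes=0,          # ignored for simplicity
                                  tpot_s=0.014,              # 14 ms per output token
                                  peak_bw_bytes_per_s=2e12)  # 2 TB/sec peak
print(f"MBU: {mbu:.0%}")  # 50%
```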

Figure 1 shows a pictorial representation of MBU in a plot similar to a roofline plot. The solid sloped line of the orange-shaded region shows the maximum possible throughput if memory bandwidth is fully saturated at 100%. However, in reality for low batch sizes (white dot), the observed performance is lower than maximum – how much lower is a measure of the MBU. For large batch sizes (yellow region), the system is compute bound, and the achieved throughput as a fraction of the peak possible throughput is measured as the Model Flops Utilization (MFU).

Model Bandwidth Utilization
Figure 1: Illustration of MBU (Model Bandwidth Utilization) and MFU (Model Flops Utilization). MBU and MFU are fractions of the peaks achieved in the memory-bound and compute-bound regions, respectively.

MBU and MFU determine how much more room is available to push the inference speed further on a given hardware setup. Figure 2 shows measured MBU for different degrees of tensor parallelism with our TensorRT-LLM-based inference server. Peak memory bandwidth utilization is attained when transferring large contiguous memory chunks. When smaller models like MPT-7B are distributed across multiple GPUs, we observe lower MBU as we are moving smaller memory chunks in each GPU.

Empirically observed MBU
Figure 2: Empirically observed MBU for different degrees of tensor parallelism with TensorRT-LLM on A100-40G GPUs. Requests: sequences of 512 input tokens with a batch size of 1.

Figure 3 shows empirically observed MBU for different degrees of tensor parallelism and batch sizes on NVIDIA H100 GPUs. MBU decreases as batch size increases. However, as we scale GPUs, the relative decrease in MBU is less significant. It is also worth noting that picking hardware with greater memory bandwidth can boost performance with fewer GPUs. At batch size 1, we can achieve a higher MBU of 60% on 2xH100-80GBs as compared to 55% on 4xA100-40GB GPUs (Figure 2).

Tensor Parallelism Modes
Figure 3: Empirically observed MBU for different batch sizes and tensor parallelism modes on H100-80G GPUs. Requests: sequences of 512 input tokens

Benchmarking Results

Latency

We have measured time to first token (TTFT) and time per output token (TPOT) across different degrees of tensor parallelism for MPT-7B and Llama2-70B models. As input prompts lengthen, time to generate the first token starts to consume a substantial portion of total latency. Tensor parallelizing across multiple GPUs helps reduce this latency.

Unlike model training, scaling to more GPUs offers significant diminishing returns for inference latency. For example, for Llama2-70B, going from 4x to 8x GPUs only reduces latency to ~0.7x of its 4x value at small batch sizes. One reason for this is that higher parallelism has lower MBU (as discussed earlier). Another reason is that tensor parallelism introduces communication overhead across a GPU node.

Time to first token (ms)

| Model | 1xA100-40GB | 2xA100-40GB | 4xA100-40GB | 8xA100-40GB |
|---|---|---|---|---|
| MPT-7B | 46 (1x) | 34 (0.73x) | 26 (0.56x) | - |
| Llama2-70B | Doesn't fit | Doesn't fit | 154 (1x) | 114 (0.74x) |

Table 1: Time to first token given input requests of 512 tokens at a batch size of 1. Larger models like Llama2-70B need at least 4xA100-40GB GPUs to fit in memory.

At larger batch sizes, higher tensor parallelism leads to a more significant relative decrease in token latency. Figure 4 shows how time per output token varies for MPT-7B. At batch size 1, going from 2x to 4x only reduces token latency by ~12%. At batch size 16, latency with 4x is 33% lower than with 2x. This is in line with our earlier observation that the relative decrease in MBU is smaller at higher degrees of tensor parallelism for batch size 16 as compared to batch size 1.

Increasing Number of GPUs
Figure 4: Time per output token per user as we scale MPT-7B across A100-40GB GPUs. The latency doesn't scale linearly with the increasing number of GPUs. Requests: sequences of 128 input and 64 output tokens

Figure 5 shows similar results for Llama2-70B, except the relative improvement between 4x and 8x is less pronounced. We also compare GPU scaling across two different hardware platforms. Because the H100-80GB has 2.15x the GPU memory bandwidth of the A100-40GB, latency is 36% lower at batch size 1 and 52% lower at batch size 16 for 4x systems.

Multiple GPUs
Figure 5: Time per output token per user as we scale Llama-v2-70B across multiple GPUs (input requests: 512 token length). Please note that 1x40GB GPU, 2x40GB and 1x80GB GPU numbers are missing here because Llama-v2-70B (in float16) doesn't fit on those systems.

Throughput

We can trade off throughput and time per token by batching requests together. Grouping queries during GPU evaluation increases throughput compared to processing queries sequentially, but each query will take longer to complete (ignoring queueing effects).

There are a few common techniques for batching inference requests:

  • Static batching: Client packs multiple prompts into requests and a response is returned after all sequences in the batch have been completed. Our inference servers support this but do not require it.
  • Dynamic batching: Prompts are batched together on the fly inside the server. Typically, this method performs worse than static batching but can get close to optimal if responses are short or of uniform length. Does not work well when requests have different parameters.
  • Continuous batching: The idea of batching requests together as they arrive was introduced in this excellent paper and is currently the SOTA method. Instead of waiting for all sequences in a batch to finish, it groups sequences together at the iteration level. It can achieve 10x-20x better throughput than dynamic batching.

LLM Serving
Figure 6: Different types of batching with LLM serving. Batching is an effective way of improving the efficiency of inference.

Continuous batching is usually the best approach for shared services, but there are situations where the other two might be better. In low-QPS environments, dynamic batching can outperform continuous batching. It is sometimes easier to implement low-level GPU optimizations in a simpler batching framework. For offline batch inference workloads, static batching can avoid significant overhead and achieve better throughput.
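
The sketch below is a deliberately simplified toy scheduler (not vLLM, TensorRT-LLM, or any other production implementation) that captures the core idea of continuous batching: at every decode iteration, finished sequences leave the running batch and queued requests immediately take their slots, rather than waiting for the whole batch to drain:

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def finished(req: Request) -> bool:
    # Real servers also stop on an end-of-sequence token; here we only cap length.
    return len(req.generated) >= req.max_new_tokens

def continuous_batching(queue: deque, max_batch_size: int = 4):
    """Toy iteration-level scheduler: refill the running batch at every decode step."""
    running = []
    while queue or running:
        # Admit waiting requests into any free batch slots (iteration-level join).
        while queue and len(running) < max_batch_size:
            running.append(queue.popleft())

        # One decode step for the whole batch; a random id stands in for the model's token.
        for req in running:
            req.generated.append(random.randint(0, 31999))

        # Completed sequences leave immediately, freeing slots for new arrivals.
        for req in [r for r in running if finished(r)]:
            print(f"request {req.rid} finished with {len(req.generated)} tokens")
        running = [r for r in running if not finished(r)]

# Requests with different output lengths: short ones finish and are replaced
# without waiting for the longest request in the batch.
queue = deque(Request(rid=i, max_new_tokens=random.randint(4, 16)) for i in range(8))
continuous_batching(queue)
```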

Batch Size

How well batching works is highly dependent on the request stream. But we can get an upper bound on its performance by benchmarking static batching with uniform requests.

| Hardware | Batch 1 | Batch 4 | Batch 8 | Batch 16 | Batch 32 | Batch 64 | Batch 128 |
|---|---|---|---|---|---|---|---|
| 1 x A10 | 0.4 (1x) | 1.4 (3.5x) | 2.3 (6x) | 3.5 (9x) | OOM (out of memory) | OOM | OOM |
| 2 x A10 | 0.8 | 2.5 | 4.0 | 7.0 | 8.0 | - | - |
| 1 x A100 | 0.9 (1x) | 3.2 (3.5x) | 5.3 (6x) | 8.0 (9x) | 10.5 (12x) | 12.5 (14x) | - |
| 2 x A100 | 1.3 | 3.0 | 5.5 | 9.5 | 14.5 | 17.0 | 22.0 |
| 4 x A100 | 1.7 | 6.2 | 11.5 | 18.0 | 25.0 | 33.0 | 36.5 |

Table 2: Peak MPT-7B throughput (req/sec) with static batching and a FasterTransformers-based backend. Requests: 512 input and 64 output tokens. For larger inputs, the OOM boundary will be at smaller batch sizes.

Latency Trade-Off

Request latency increases with batch size. With one NVIDIA A100 GPU, for example, if we maximize throughput with a batch size of 64, latency increases by 4x while throughput increases by 14x. Shared inference services typically pick a balanced batch size. Users hosting their own models should decide the appropriate latency/throughput trade-off for their applications. In some applications, like chatbots, low latency for fast responses is the top priority. In others, like batched processing of unstructured PDFs, we might be willing to sacrifice the latency of processing any individual document in order to process all of them quickly in parallel.

Figure 7 shows the throughput vs latency curve for the 7B model. Each line on this curve is obtained by increasing the batch size from 1 to 256. This is useful in determining how large we can make the batch size, subject to different latency constraints. Recalling our roofline plot above, we find that these measurements are consistent with what we would expect. After a certain batch size, i.e., when we cross to the compute bound regime, every doubling of batch size just increases the latency without increasing throughput.
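
To build intuition for why the curve flattens, here is a toy roofline-style model (all constants are illustrative assumptions, not measurements of any specific GPU): a decode step takes roughly the longer of the time to stream the weights and the time to perform the batched compute, so past the crossover point throughput plateaus while latency keeps growing:

```python
def decode_step_time_s(batch_size: int, param_bytes: float, flops_per_token: float,
                       peak_bw: float, peak_flops: float) -> float:
    """Roofline-style toy: a decode step is bounded by the slower of weight
    streaming (independent of batch size) and compute (scales with batch size)."""
    memory_time = param_bytes / peak_bw
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)

# Assumed 7B model in FP16: ~14 GB of weights, ~2 FLOPs per parameter per token,
# on hypothetical hardware with 2 TB/sec of bandwidth and 300 TFLOP/sec of compute.
P_BYTES, FLOPS_TOK, BW, FLOPS = 14e9, 2 * 7e9, 2e12, 300e12
for b in [1, 16, 64, 256, 1024]:
    t = decode_step_time_s(b, P_BYTES, FLOPS_TOK, BW, FLOPS)
    print(f"batch {b:>4}: {t * 1e3:6.1f} ms/step, {b / t:8.0f} output tokens/sec total")
```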

Latency Curve
Figure 7: Throughput latency curve for the MPT-7B model. This allows users to pick a hardware configuration that meets their throughput requirements under a latency constraint.

When using parallelism, it's important to understand low-level hardware details. For instance, not all 8xA100 instances are the same across different clouds. Some servers have high bandwidth connections between all GPUs, others pair GPUs and have lower bandwidth connections between pairs. This could introduce bottlenecks, causing real-world performance to deviate significantly from the curves above.

Optimization Case Study: Quantization

Quantization is a common technique used to reduce the hardware requirements for LLM inference. Reducing the precision of model weights and activations during inference can dramatically reduce hardware requirements. For instance, switching from 16-bit weights to 8-bit weights can halve the number of required GPUs in memory-constrained environments (e.g., Llama2-70B on A100s). Dropping down to 4-bit weights makes it possible to run inference on consumer hardware (e.g., Llama2-70B on MacBooks).

In our experience, quantization should be implemented with caution. Naive quantization techniques can lead to a substantial degradation in model quality. The impact of quantization also varies across model architectures (e.g., MPT vs. Llama) and sizes. We will explore this in more detail in a future blog post.

When experimenting with techniques like quantization, we recommend using an LLM quality benchmark like the Mosaic Eval Gauntlet to evaluate the quality of the inference system, not just the quality of the model in isolation. Additionally, it's important to explore deeper systems optimizations. In particular, quantization can make KV caches much more efficient.

As mentioned previously, in autoregressive token generation, past Keys/Values (KV) from the attention layers are cached instead of being recomputed at every step. The size of the KV cache varies based on the number of sequences processed at a time and the length of these sequences. Moreover, during each iteration of next-token generation, new KV items are added to the existing cache, so it grows as generation proceeds. Therefore, effective KV cache memory management when adding these new values is critical for good inference performance. Llama2 models use a variant of attention called Grouped Query Attention (GQA). Please note that when the number of KV heads is 1, GQA is the same as Multi-Query Attention (MQA). GQA helps keep the KV cache size down by sharing Keys/Values across groups of query heads. The formula to calculate KV cache size is:
batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads
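
As a sanity check on this formula, the sketch below plugs in the commonly published Llama-2-70B configuration (d_model = 8192, n_layers = 80, n_heads = 64, n_kv_heads = 8; treat these as assumptions) and reproduces the numbers in Table 3:

```python
def kv_cache_bytes(batch_size: int, seqlen: int, d_model: int, n_heads: int,
                   n_layers: int, n_kv_heads: int, bytes_per_elem: int = 2) -> int:
    """KV cache size = batch_size * seqlen * (d_model / n_heads) * n_layers
    * 2 (K and V) * bytes per element * n_kv_heads."""
    return (batch_size * seqlen * (d_model // n_heads) * n_layers
            * 2 * bytes_per_elem * n_kv_heads)

# Assumed Llama-2-70B-style configuration, sequence length 1024.
cfg = dict(seqlen=1024, d_model=8192, n_heads=64, n_layers=80, n_kv_heads=8)
for batch in [1, 16, 32, 64]:
    gib = kv_cache_bytes(batch, **cfg) / 2**30
    print(f"batch {batch:>2}: {gib:6.3f} GiB (FP16), {gib / 2:6.3f} GiB (Int8)")
```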

Table 3 shows GQA KV cache size calculated at different batch sizes at a sequence length of 1024 tokens. The parameter size for Llama2 models, in comparison, is 140 GB (Float16) for the 70B model. Quantization of KV cache is another technique (in addition to GQA/MQA) to reduce the size of KV cache, and we are actively evaluating its impact on the generation quality.

| Batch size | GQA KV cache memory (FP16) | GQA KV cache memory (Int8) |
|---|---|---|
| 1 | 0.312 GiB | 0.156 GiB |
| 16 | 5 GiB | 2.5 GiB |
| 32 | 10 GiB | 5 GiB |
| 64 | 20 GiB | 10 GiB |

Table 3: KV cache size for Llama-2-70B at a sequence length of 1024

As mentioned previously, token generation with LLMs at low batch sizes is a GPU memory bandwidth-bound problem, i.e. the speed of generation depends on how quickly model parameters can be moved from the GPU memory to on-chip caches. Converting model weights from FP16 (2 bytes) to INT8 (1 byte) or INT4 (0.5 byte) requires moving less data and thus speeds up token generation. However, quantization may negatively impact the model generation quality. We are currently evaluating the impact on model quality using Model Gauntlet and plan to publish a followup blog post on it soon.
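
To put rough numbers on the data-movement argument (a best-case lower bound that ignores KV cache traffic, activations, and kernel overheads), the sketch below estimates the minimum time to stream a 70B model's weights once per decode step at different weight precisions, assuming a hypothetical 2 TB/sec of achievable memory bandwidth:

```python
def min_tpot_ms(n_params: float, bytes_per_weight: float, bw_bytes_per_s: float) -> float:
    """Lower bound on time per output token: all weights streamed once per decode step."""
    return n_params * bytes_per_weight / bw_bytes_per_s * 1e3

N_PARAMS, BW = 70e9, 2e12  # 70B parameters, assumed 2 TB/sec achievable bandwidth
for name, nbytes in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: >= {min_tpot_ms(N_PARAMS, nbytes, BW):.1f} ms/token "
          f"({N_PARAMS * nbytes / 1e9:.0f} GB of weights to move)")
```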

Conclusions and Key Results

Each of the factors we've outlined above influences the way we build and deploy models. We use these results to make data-driven decisions that take into consideration the hardware type, the software stack, the model architecture, and typical usage patterns. Here are some recommendations drawn from our experience.

Identify your optimization target: Do you care about interactive performance? Maximizing throughput? Minimizing cost? There are predictable trade-offs here.

Pay attention to the components of latency: For interactive applications, time-to-first-token drives how responsive your service will feel, and time-per-output-token determines how fast it will feel.

Memory bandwidth is key: Generating the first token is typically compute-bound, while subsequent decoding is a memory-bound operation. Because LLM inference often operates in memory-bound settings, MBU is a useful metric to optimize for and can be used to compare the efficiency of inference systems.

Batching is critical: Processing multiple requests concurrently is critical for achieving high throughput and for effectively utilizing expensive GPUs. For shared online services, continuous batching is indispensable, whereas offline batch inference workloads can achieve high throughput with simpler batching techniques.

In-depth optimizations: Standard inference optimization techniques (e.g., operator fusion, weight quantization) are important for LLMs, but it's also worth exploring deeper systems optimizations, especially those which improve memory utilization. One example is KV cache quantization.

Hardware configurations: The model type and expected workload should be used to decide deployment hardware. For instance, when scaling to multiple GPUs MBU falls much more rapidly for smaller models, such as MPT-7B, than it does for larger models, such as Llama2-70B. Performance also tends to scale sub-linearly with higher degrees of tensor parallelism. That said, a high degree of tensor parallelism might still make sense for smaller models if traffic is high or if users are willing to pay a premium for extra low latency.

Data Driven Decisions: Understanding the theory is important, but we recommend always measuring end-to-end server performance. There are many reasons an inference deployment can perform worse than expected. MBU could be unexpectedly low because of software inefficiencies. Or differences in hardware between cloud providers could lead to surprises (we have observed a 2x latency difference between 8xA100 servers from two cloud providers).

To get started with LLM inference, try out Databricks Model Serving. Check out the documentation to learn more.

See all previous MosaicML blogs
