
Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton | NVIDIA Technical Blog
Quick Start Guide — tensorrt_llm documentation (nvidia.github.io)

Use the TensorRT-LLM source tree to pull the Docker image, then build TensorRT-LLM inside the container.

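The model is first converted from Hugging Face format to FasterTransformer format; TensorRT-LLM then compiles it into a TensorRT engine. Inference can then run on the TensorRT-LLM C++ runtime (or the model can be placed in a Triton model repository with TensorRT-LLM specified as the backend).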

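Input tokenization and output de-tokenization are treated as pre-processing and post-processing, each implemented as a "Python Model". The whole flow is expressed as a single "Ensemble Model" that contains these two models plus the actual GPT model.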
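The three-stage flow described above can be sketched as plain function composition. This is only a toy illustration: `VOCAB`, `preprocess`, `toy_llm`, `postprocess` and `ensemble` are hypothetical names, and `toy_llm` merely stands in for the compiled engine; it is not Triton's ensemble API.

```python
# Toy vocabulary; a real deployment would load the model's tokenizer instead.
VOCAB = ["<unk>", "hello", "world", "!"]


def preprocess(text):
    """Pre-processing "model": text -> token ids (unknown words map to id 0)."""
    return [VOCAB.index(w) if w in VOCAB else 0 for w in text.split()]


def toy_llm(token_ids):
    """Stand-in for the TensorRT engine: appends one fixed token id."""
    return token_ids + [3]


def postprocess(token_ids):
    """Post-processing "model": token ids -> text."""
    return " ".join(VOCAB[i] for i in token_ids)


def ensemble(text):
    """The ensemble only wires the three steps together."""
    return postprocess(toy_llm(preprocess(text)))


if __name__ == "__main__":
    print(ensemble("hello world"))
```

Keeping tokenization outside the engine means the client sends and receives plain text, while the engine itself only ever sees token ids.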

LLaMA:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md

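TensorRT-LLM supports many common models, for example: baichuan, internlm, chatglm, qwen, bloom, gpt, gptneox, llama.

convert_checkpoint.py is specific to each model; run.py is shared by all models.

The maturity of the supported features varies from model to model.

The following features are supported for LLaMA: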

FP16
FP8
INT8 & INT4 Weight-Only
SmoothQuant
Groupwise quantization (AWQ/GPTQ)
FP8 KV CACHE
INT8 KV CACHE (+ AWQ/per-channel weight-only)
Tensor Parallel
STRONGLY TYPED

python convert_checkpoint.py

--tp_size 4 // Tensor-parallel

--pp_size 4 // Pipeline-parallel

With pipeline parallelism, while one GPU is busy, are the other GPUs busy processing other batches?
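On the question above: in a generic micro-batched pipeline schedule, different stages do work on different micro-batches during the same time step. The sketch below is a toy illustration of that general scheme, not a description of TensorRT-LLM's actual scheduler; `pipeline_schedule` and its parameters are hypothetical names.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, which micro-batch each stage is processing.

    Toy forward-only schedule: stage s handles micro-batch m at step s + m.
    None means the stage is idle at that step (a pipeline "bubble").
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for step in range(total_steps):
        row = []
        for stage in range(num_stages):
            mb = step - stage
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    # 4 stages as in "--pp_size 4"; 6 micro-batches is an arbitrary choice.
    for step, row in enumerate(pipeline_schedule(4, 6)):
        print(step, row)
```

In this toy schedule every stage is busy once the pipeline has filled; the idle slots appear only while it fills and drains.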

Quantization:

Numerical Precision — tensorrt_llm documentation (nvidia.github.io)

There are 9 precision and quantization formats; each model supports only a subset of them:

Model	FP32	FP16	BF16	FP8	W8A8 SQ	W8A16	W4A16	W4A16 AWQ	W4A16 GPTQ
Baichuan	Y	Y	Y	Y	Y	Y	Y	Y	Y
BERT	Y	Y	Y	.	.	.	.	.	.
ChatGLM	Y	Y	Y	.	.	.	.	.	.
ChatGLM-v2	Y	Y	Y	.	.	.	.	.	.
ChatGLM-v3	Y	Y	Y	.	.	.	.	.	.
GPT	Y	Y	Y	Y	Y	Y	Y	.	.
GPT-NeMo	Y	Y	Y	.	.	.	.	.	.
GPT-NeoX	Y	Y	Y	.	.	.	.	.	Y
InternLM	Y	Y	Y	.	Y	Y	Y	.	.
LLaMA	Y	Y	Y	Y	Y	Y	Y	Y	Y
LLaMA-v2	Y	Y	Y	Y	Y	Y	Y	Y	Y
LLaMA-v3	Y	Y	Y	Y	Y	Y	Y	Y	Y
Qwen	Y	Y	Y	.	Y	Y	Y	Y	Y

--use_fused_mlp

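Applies to Gated MLP layers (an MLP combined with a gating mechanism, in which one branch of the layer acts as a multiplicative gate on the other). Originally the gate is computed with one matrix multiplication and the MLP branch with another; this optimization fuses the two matrix multiplications into one.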
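Why the two multiplications can be fused: when both branches read the same input, concatenating their weight matrices column-wise gives one larger multiplication whose output is split afterwards. The sketch below assumes a LLaMA-style gated MLP with a SiLU gate and omits the final down projection; the function names are hypothetical and the actual TensorRT-LLM kernel may be organized differently.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    return v / (1.0 + math.exp(-v))


def gated_mlp_unfused(x, w_gate, w_up):
    """Two separate matmuls: one for the gate branch, one for the up branch."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]


def gated_mlp_fused(x, w_gate, w_up):
    """One matmul against [w_gate | w_up], then split the result in half."""
    w_fused = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]
    fused = matmul(x, w_fused)
    half = len(w_gate[0])
    return [[silu(g) * u for g, u in zip(row[:half], row[half:])] for row in fused]


if __name__ == "__main__":
    x = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
    w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
    w_up = [[-0.7, 0.8], [0.9, 1.0], [1.1, -1.2]]
    print(gated_mlp_unfused(x, w_gate, w_up))
    print(gated_mlp_fused(x, w_gate, w_up))
```

Both functions compute the same values; the fused form simply reads the input once and launches one larger multiplication instead of two.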

--multi_block_mode

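When batch_size * heads_count is less than half the number of GPU Streaming Multiprocessors (SMs) and the context input is long (>1000 tokens), enabling this option can increase SM utilization. (It seems that each SM handles the multiplication for one sample and one head; when that cannot occupy all SMs, the computation over other tokens is parallelized as well?)

Related material: Flash-Decoding for Long-Context Inference | Princeton NLP Group (princeton-nlp.github.io)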
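The idea behind splitting one attention computation across several blocks can be sketched in plain Python: the KV sequence is cut into chunks, each chunk yields a partial softmax result, and the partial results are merged after rescaling to a common maximum. This follows the general Flash-Decoding scheme rather than TensorRT-LLM's actual kernel; all function names are hypothetical, the 1/sqrt(d) scaling is omitted for brevity, and `should_use_multi_block` only restates the rule of thumb from these notes.

```python
import math


def should_use_multi_block(batch_size, num_heads, num_sms, context_len):
    """Rule of thumb from the notes above (thresholds are the notes' own)."""
    return batch_size * num_heads < num_sms / 2 and context_len > 1000


def attention_single_query(q, keys, values):
    """Reference: softmax(q . K^T) V for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def attention_split_kv(q, keys, values, num_blocks):
    """Same result, computed per KV chunk and merged afterwards."""
    n = len(keys)
    block = (n + num_blocks - 1) // num_blocks
    dim = len(values[0])
    partials = []  # (local max, local sum of exp, unnormalized output)
    for start in range(0, n, block):
        ks, vs = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(dim)]
        partials.append((m, sum(weights), out))
    # Merge step: rescale every chunk to the global maximum before summing.
    global_max = max(m for m, _, _ in partials)
    denom = sum(s * math.exp(m - global_max) for m, s, _ in partials)
    merged = [sum(o[d] * math.exp(m - global_max) for m, _, o in partials)
              for d in range(dim)]
    return [x / denom for x in merged]
```

Each chunk in the loop is independent of the others, which is what lets a GPU assign the chunks to otherwise idle SMs when batch_size * heads_count alone is too small to occupy them.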

--use_paged_context_fmha

Builds on "--context_fmha" and allows the context KV cache to be offloaded between GPU and CPU memory; suited to inference with long input contexts.

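LLaMA2-70B uses Grouped Query Attention:

It reduces GPU memory usage. Keys and Values are computed by multiplying the activations by projection matrices, and only N groups of them are computed, which also reduces the amount of computation.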
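A rough calculation shows the size of the memory saving. The sketch below assumes the published LLaMA2-70B shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and FP16 storage; `kv_cache_bytes_per_token` is a hypothetical helper, not a TensorRT-LLM function.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one token across all layers.

    The factor 2 accounts for storing both a Key and a Value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical multi-head baseline: every one of the 64 heads keeps its own K/V.
mha_bytes = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
# Grouped Query Attention: the 64 query heads share 8 K/V heads.
gqa_bytes = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    print(mha_bytes, gqa_bytes, mha_bytes // gqa_bytes)
```

Under these assumptions the cache shrinks from 2,621,440 bytes to 327,680 bytes per token, a factor of 8, which matches the ratio of query heads to KV heads.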

--int8_kv_cache

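The KV cache is stored with INT8 quantization, which saves GPU memory.

A portion of the input data is used for a trial run (to calibrate the model), which yields the value ranges of Key and Value and, from them, the scaling factors.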
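How a scaling factor follows from the observed value range can be sketched with symmetric per-tensor quantization. This is a generic illustration under that assumption, not TensorRT-LLM's calibration code; the function names are hypothetical.

```python
def calibrate_scale(samples):
    """Per-tensor symmetric scale: the largest |value| seen maps to 127."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


if __name__ == "__main__":
    calibration_values = [-2.54, 0.5, 1.3, 2.0]  # e.g. Key activations from a trial run
    scale = calibrate_scale(calibration_values)
    q = quantize([1.3, -0.7, 10.0], scale)
    print(scale, q, dequantize(q, scale))
```

Values inside the calibrated range are recovered to within half a quantization step, while values outside it (10.0 here) are clipped, which is why the calibration data should be representative of real inputs.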

Running multiple LoRA checkpoints (each has an ID: -1 is the original model, 0 is the luotuo adapter, 1 is the Japanese adapter):

git-lfs clone https://huggingface.co/qychen/luotuo-lora-7b-0.1
git-lfs clone https://huggingface.co/kunishou/Japanese-Alpaca-LoRA-7b-v0
BASE_LLAMA_MODEL=llama-7b-hf/

python convert_checkpoint.py --model_dir ${BASE_LLAMA_MODEL} \
                            --output_dir ./tllm_checkpoint_1gpu \
                            --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
            --output_dir /tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/ \
            --gemm_plugin auto \
            --lora_plugin auto \
            --max_batch_size 8 \
            --max_input_len 512 \
            --max_output_len 50 \
            --lora_dir  "luotuo-lora-7b-0.1/" "Japanese-Alpaca-LoRA-7b-v0/" \
            --max_lora_rank 8 \
            --lora_target_modules attn_q attn_k attn_v

python ../run.py --engine_dir "/tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/" \
              --max_output_len 10 \
              --tokenizer_dir ${BASE_LLAMA_MODEL} \
              --input_text "美国的首都在哪里? \n答案:" "美国的首都在哪里? \n答案:" "美国的首都在哪里? \n答案:" "アメリカ合衆国の首都はどこですか? \n答え:" "アメリカ合衆国の首都はどこですか? \n答え:" "アメリカ合衆国の首都はどこですか? \n答え:" \
              --lora_task_uids -1 0 1 -1 0 1 \
              --use_py_session --top_p 0.5 --top_k 0
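The pairing of prompts and adapters in the command above can be read as follows. This sketch assumes the task uids index the `--lora_dir` entries in order, as the numbering in these notes suggests; `adapter_for` is a hypothetical helper, and the prompt strings are placeholders for the six `--input_text` entries.

```python
# Adapter order follows --lora_dir; uid -1 selects the base model without LoRA.
lora_dirs = ["luotuo-lora-7b-0.1/", "Japanese-Alpaca-LoRA-7b-v0/"]
prompts = ["zh", "zh", "zh", "ja", "ja", "ja"]  # three Chinese, three Japanese prompts
lora_task_uids = [-1, 0, 1, -1, 0, 1]


def adapter_for(uid):
    """Map a task uid to the adapter it selects."""
    return "base model" if uid == -1 else lora_dirs[uid]


# The i-th prompt is paired with the i-th uid.
assignments = [(p, adapter_for(u)) for p, u in zip(prompts, lora_task_uids)]

if __name__ == "__main__":
    for prompt, adapter in assignments:
        print(prompt, "->", adapter)
```

Each question is therefore answered three times in one batch: once by the base model and once by each adapter.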

Streaming LLM (allows unbounded sequence length):

--streamingllm enable

--max_attention_window_size=2048

The maximum number of previous tokens that attention looks back over.
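The effect of a bounded attention window can be sketched as the set of cache positions a token may attend to. Following the StreamingLLM idea, the sketch keeps a few initial "sink" tokens plus the most recent tokens. The default of 4 sink tokens and the choice to count them inside the window are assumptions made for illustration, not confirmed TensorRT-LLM behavior; `visible_positions` is a hypothetical name.

```python
def visible_positions(current_pos, window_size, num_sink_tokens=4):
    """Cache positions the token at current_pos can attend to (itself included).

    Assumes window_size > num_sink_tokens.
    """
    if current_pos + 1 <= window_size:
        # Everything still fits in the window: attend to the full history.
        return list(range(current_pos + 1))
    sinks = list(range(num_sink_tokens))
    recent_count = window_size - num_sink_tokens
    recent = list(range(current_pos + 1 - recent_count, current_pos + 1))
    return sinks + recent


if __name__ == "__main__":
    print(visible_positions(5, window_size=8))
    print(visible_positions(20, window_size=8))
```

Because the number of visible positions never exceeds `window_size`, the KV cache stays bounded no matter how long generation runs.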