
小_北_爸 · 04-05 09:00

Data update:
prompt length 9 with bucket size 8
Python script (note which branch you are on); a sketch of the idea follows the figure:
[Figure: screenshot of the Python script]

HLO graph analysis of the KV-Cache update:
The KV-Cache appears as both input and output of the HLO graph: bf16[1,2048,32,128]{3,2,1,0}, 128 occurrences in total (2 × 32 × 2, presumably K and V across 32 layers, each appearing once as an input and once as an output).
[Figure: HLO graph of the KV-Cache update]

References

Notes on a Transformer introduction (by an Italian teacher working in China): Attention is all you need (Transformer) - Model explanation (including math), Inference and Training
Notes on LLaMA: LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
GitHub: use XLA_GPU and select the llama2-google-next-inference branch
pytorch.org: path-achieve-low-inference-latency
