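Discriminant
Minimax
Computer science
Perceptron
Inference
Artificial intelligence
Generative grammar
Machine learning
Sample (material)
Generative model
Error
Artificial neural network
Mathematical optimization
Mathematics
Chemistry
Political science
Law
Chromatography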
Authors
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Source
Journal: Cornell University - arXiv
Date: 2014-06-10
Citations: 4557
Identifier
DOI: 10.48550/arxiv.1406.2661
Abstract
Large Language Models (LLMs) rely on Key-Value (KV) caches to store attention context during autoregressive decoding. In long-sequence settings, the KV cache can consume large amounts of VRAM and become a practical bottleneck for throughput. We introduce KVHALO, an auxiliary reconstruction model that restores higher-fidelity KV tensors from a compressed cache state on demand, reducing the persistent memory footprint during inference. In our evaluation, KVHALO achieves up to 91.85% directional cosine alignment at convergence and reduces long-context degradation relative to a low-bit baseline under our stress-test workloads. We used HRM rather than other architectures for the reconstruction model, which yielded higher-quality results in only 18,600 steps.
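
The two quantities the abstract leans on, KV cache size and cosine alignment between original and restored tensors, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the model dimensions are hypothetical, the function names are invented, and the symmetric uniform quantizer only stands in for a generic low-bit baseline. It is not KVHALO's reconstruction method, which the abstract does not specify.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token.

    The leading factor 2 counts one key tensor and one value tensor per layer.
    bytes_per_elem=2 corresponds to 16-bit storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quantize_dequantize(vec, bits):
    """Round-trip a vector through symmetric uniform quantization.

    Illustrative stand-in for a low-bit cache; requires bits >= 2.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2")
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in vec) / levels
    if scale == 0:
        return list(vec)
    return [round(x / scale) * scale for x in vec]


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 4096 tokens.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache: {size / 2**30:.1f} GiB")

# Alignment between an original vector and its low-bit round trip.
original = [0.5, -1.0, 0.25, 0.75]
for bits in (2, 8):
    restored = quantize_dequantize(original, bits)
    print(bits, "bits:", round(cosine_similarity(original, restored), 4))
```

Under these assumed dimensions the cache grows linearly with sequence length, which is why long contexts strain VRAM, and the cosine score drops as the bit width shrinks, which is the kind of fidelity loss a reconstruction model would aim to recover.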