LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Related Articles
arXiv – cs.LG • Efficient Long-Context Inference: Write-Gated KV Reduces Memory Footprint by up to 57%
PyTorch – Blog • Hybrid Models as First-Class Citizens in vLLM
arXiv – cs.LG • TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
arXiv – cs.AI • Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference
arXiv – cs.LG • Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Hugging Face – Blog • KV Cache from scratch in nanoVLM