Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Related Articles
arXiv – cs.LG • Efficient Long-Context Inference: Write-Gated KV Reduces Memory Requirements by up to 57%
PyTorch – Blog • Hybrid Models as First-Class Citizens in vLLM
arXiv – cs.AI • Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
arXiv – cs.LG • TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
arXiv – cs.LG • Inpainting-Guided Policy Optimization for Diffusion Large Language Models
arXiv – cs.LG • LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation