You Only Cache Once (25 minute read)
The YOCO architecture is a decoder-decoder model that reduces GPU memory demands while retaining global attention capabilities. It pairs a self-decoder with a cross-decoder: the self-decoder produces a single global key-value cache, which every cross-decoder layer then reuses instead of maintaining its own. YOCO performs comparably to traditional Transformers while substantially cutting inference memory, latency, and throughput costs, making it well suited to large language models and long context lengths.
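A minimal sketch of the decoder-decoder idea, not the paper's implementation: the module names, layer counts, and use of plain causal attention in the self-decoder (the paper uses efficient attention variants there) are all illustrative assumptions. The point is the KV flow: the cache is built once, then shared by all cross-decoder layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Toy single-head attention; self-attention if no cache is passed,
    cross-attention against a precomputed (k, v) cache otherwise."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x, kv=None):
        q = self.q(x)
        if kv is None:
            k, v = self.k(x), self.v(x)
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
        k, v = kv  # reuse the shared cache; full visibility, as at decode time
        return F.scaled_dot_product_attention(q, k, v)

class YOCO(nn.Module):
    def __init__(self, dim, n_self=2, n_cross=2):
        super().__init__()
        self.self_layers = nn.ModuleList(Attention(dim) for _ in range(n_self))
        self.kv_proj = nn.ModuleDict({"k": nn.Linear(dim, dim),
                                      "v": nn.Linear(dim, dim)})
        self.cross_layers = nn.ModuleList(Attention(dim) for _ in range(n_cross))

    def forward(self, x):
        # Self-decoder: stand-in for the paper's efficient-attention layers.
        for layer in self.self_layers:
            x = x + layer(x)
        # Cache key-value pairs exactly once ("you only cache once").
        kv = (self.kv_proj["k"](x), self.kv_proj["v"](x))
        # Cross-decoder: every layer attends to the same shared cache,
        # so cache memory stays constant in depth.
        for layer in self.cross_layers:
            x = x + layer(x, kv=kv)
        return x

out = YOCO(dim=64)(torch.randn(1, 16, 64))  # -> (1, 16, 64)
```

Because only one layer's worth of KV state is stored, cache memory scales with sequence length but not with the number of cross-decoder layers, which is where the memory savings come from.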
Consistency Language Models (21 minute read)
Predicting more than one token at a time is an active area of research. If successful, it would dramatically improve generation speed for many large language models. The approach in this post, which mirrors consistency models from image synthesis, fine-tunes LLMs so that a parallel decoding strategy converges quickly, speeding up generation. Early results show roughly 3x speedups, on par with speculative decoding.
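A minimal sketch of the Jacobi-style parallel decoding loop that consistency LMs are trained to accelerate, assuming a generic causal LM callable that returns logits of shape (batch, seq_len, vocab); the function name, the zero-token initial guess, and greedy argmax updates are illustrative choices, not the post's exact procedure.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, n_new=16, max_iters=32):
    # Start from an arbitrary guess for all n_new future tokens at once.
    guess = torch.zeros(1, n_new, dtype=torch.long)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, guess], dim=1)
        logits = model(seq)  # one parallel forward pass over the whole guess
        # Logits at position j predict token j+1, so this slice re-predicts
        # every guessed token from the tokens before it, all in parallel.
        new_guess = logits[:, prompt_ids.size(1) - 1 : -1].argmax(-1)
        if torch.equal(new_guess, guess):
            break  # fixed point reached: every token is self-consistent
        guess = new_guess
    return guess
```

Vanilla Jacobi decoding can need nearly as many iterations as tokens; the consistency-model insight is to fine-tune the LLM to map intermediate guesses straight to the fixed point, so the loop converges in a handful of passes and yields the reported ~3x speedup.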