Apple's machine learning research team recently shared on their blog that, in addition to ongoing efforts to accelerate inference on Apple silicon, they have made significant progress in speeding up LLM inference on NVIDIA GPUs, which are widely used for production applications across the AI industry.
Accelerating LLM inference, the Apple researchers point out, is an important problem in ML research: auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users.
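To see why auto-regressive generation is slow, consider a minimal sketch of a greedy decoding loop. This is illustrative only, not Apple's or NVIDIA's code: `model` is a hypothetical callable that returns next-token logits, standing in for a full LLM forward pass.

```python
# Minimal sketch of auto-regressive greedy decoding (illustrative only).
# `model` is a hypothetical callable returning next-token logits; each
# iteration requires one full forward pass, which is why generating N
# tokens costs N expensive model evaluations.

def generate(model, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # one full forward pass per generated token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

Because each token depends on all previously generated tokens, the loop cannot be parallelized across output positions, which is the bottleneck that speculative decoding techniques target.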
Apple researchers have published and open-sourced Recurrent Drafter (ReDrafter), a new approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN-based draft model and combines beam search with dynamic tree attention to accelerate LLM token generation, producing up to 3.5 tokens per generation step on open source models and outperforming previous speculative decoding techniques.
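At a high level, speculative decoding follows a draft-and-verify pattern: a small, cheap model proposes several candidate tokens, and the large target model checks them in a single forward pass. The sketch below shows that generic pattern only, not ReDrafter's RNN draft model or its beam search and tree attention; `propose`, `verify`, and `sample_one` are illustrative placeholder methods.

```python
# Minimal sketch of the generic draft-and-verify speculative decoding loop
# (illustrative placeholders, not Apple's ReDrafter implementation).
# The draft model cheaply proposes k tokens; the target model verifies
# them in one forward pass, so several tokens can be accepted for the
# cost of roughly one target-model step.

def speculative_generate(target_model, draft_model, tokens, max_new_tokens, k=4):
    goal = len(tokens) + max_new_tokens
    while len(tokens) < goal:
        draft = draft_model.propose(tokens, k)         # k cheap candidate tokens
        accepted = target_model.verify(tokens, draft)  # one verification pass
        tokens.extend(accepted)                        # keep the agreed-upon prefix
        if not accepted:                               # whole draft rejected:
            tokens.append(target_model.sample_one(tokens))  # fall back to one token
    return tokens
```

ReDrafter's reported figure of up to 3.5 tokens per generation step corresponds to the average number of draft tokens the target model accepts per verification pass.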
According to the Apple team, this research has demonstrated solid results, but its greatest impact comes from applying it in production to accelerate LLM inference. To make this advancement production-ready on NVIDIA GPUs, Apple partnered with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
While TensorRT-LLM already supported several open source LLMs and the Medusa speculative decoding method, ReDrafter's beam search and dynamic tree attention algorithms rely on operators that had never been used in previous applications. To enable the ReDrafter integration, NVIDIA added new operators or exposed existing ones, significantly improving TensorRT-LLM's ability to support advanced models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter's accelerated token generation in their production LLM applications with TensorRT-LLM.
When benchmarking a production model with tens of billions of parameters on an NVIDIA GPU, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, the team observed a 2.7x speedup in tokens generated per second for greedy decoding. According to the researchers, these results indicate that this technology could significantly reduce the latency perceived by users, while using fewer GPUs and consuming less power.
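For context on what such a figure measures, here is a minimal sketch of a tokens-per-second comparison; it is a generic timing harness, not the team's benchmark, and `generate` stands for any decoding function.

```python
# Minimal sketch of a tokens-per-second measurement (illustrative only;
# `generate` is any decoding function taking prompt ids and a token budget).

import time

def tokens_per_second(generate, prompt_ids, max_new_tokens=256):
    start = time.perf_counter()
    output = generate(prompt_ids, max_new_tokens)
    elapsed = time.perf_counter() - start
    return (len(output) - len(prompt_ids)) / elapsed

# A 2.7x speedup means tokens_per_second for the ReDrafter-enabled path
# is roughly 2.7 times that of the baseline greedy-decoding path.
```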
LLMs are increasingly used to power production applications, and improving inference efficiency can both reduce computing costs and reduce latency for users. Thanks to ReDrafter's novel approach to speculative decoding, now integrated into the NVIDIA TensorRT-LLM framework, developers can benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
More information is available on Apple's Machine Learning Research blog and NVIDIA's blog.