NVIDIA is announcing a brand-new AI software stack today known as TensorRT-LLM, which boosts large language model performance across its GPUs.
NVIDIA's TensorRT-LLM is announced as a highly optimized, open-source library that enables the fastest inference performance for large language models running on NVIDIA AI GPUs such as Hopper. NVIDIA has worked with the open-source community to optimize these models for its GPUs, utilizing the latest AI kernels and cutting-edge techniques such as SmoothQuant, FlashAttention & fMHA. The open-source release includes ready-to-run, inference-optimized versions of state-of-the-art LLMs such as GPT-3 (175B), Llama, Falcon (180B) & Bloom, just to name a few.
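As a rough illustration of what "ready-to-run" looks like in practice, here is a minimal sketch of serving a model through TensorRT-LLM's high-level Python API. The class names (LLM, SamplingParams), argument names and the model checkpoint are assumptions drawn from later TensorRT-LLM releases, not details given in this announcement.

```python
# Minimal sketch of serving a pre-optimized model with TensorRT-LLM's
# high-level Python API. Class names, arguments and the checkpoint name
# are assumptions based on later TensorRT-LLM releases.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Build or load a TensorRT engine for the model (hypothetical checkpoint).
    llm = LLM(model="meta-llama/Llama-2-7b-hf")

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Generate completions for a batch of prompts.
    for output in llm.generate(["What is TensorRT-LLM?"], sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```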
TensorRT-LLM is also optimized to perform automatic parallelization across multiple servers connected via NVLink and InfiniBand. Previously, a large language model had to be manually partitioned and assigned across multiple servers/GPUs; with TensorRT-LLM that should no longer be necessary.
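In the same assumed high-level API, multi-GPU execution is requested with a single parameter rather than by hand-sharding the model; the tensor_parallel_size argument below is an assumption from later TensorRT-LLM releases.

```python
# Sketch: requesting tensor parallelism across GPUs with one argument rather
# than manually splitting the model. tensor_parallel_size is an assumed
# parameter from TensorRT-LLM's later Python API.
from tensorrt_llm import LLM

# Shard the model's weights across 8 GPUs on one node; multi-node setups
# would additionally rely on the NVLink/InfiniBand fabric described above.
llm = LLM(model="tiiuae/falcon-180B", tensor_parallel_size=8)
```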
One of the biggest updates that TensorRT-LLM brings comes in the form of a new scheduling technique known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. It enables dynamic processing of several smaller queries while large, compute-intensive requests are still running on the same GPU. This makes the GPU more efficient and leads to huge gains in throughput on GPUs such as the H100, up to 2x to be exact.
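The following is a purely conceptual sketch of the in-flight (continuous) batching idea, not TensorRT-LLM's actual scheduler: after every decode step, finished sequences leave the batch and queued requests slot in immediately, so short queries never wait for a long generation to drain the whole batch.

```python
# Conceptual sketch of in-flight batching (toy scheduler, not TensorRT-LLM code).
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int        # decode steps remaining for this request
    done: bool = False

def decode_step(batch):
    # Stand-in for one batched forward pass: each active request emits one token.
    for r in batch:
        r.tokens_left -= 1
        r.done = r.tokens_left <= 0

def serve(requests, max_batch=8):
    waiting, active, finished = deque(requests), [], []
    while waiting or active:
        # Fill free batch slots immediately instead of waiting for the batch to drain.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        decode_step(active)                        # one step for all active requests
        finished += [r for r in active if r.done]  # retire finished sequences right away
        active = [r for r in active if not r.done]
    return finished

# Ten short queries (5 tokens each) enter, finish and exit while one
# long 500-token generation keeps running on the same GPU.
serve([Request(500)] + [Request(5) for _ in range(10)])
```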
The TensorRT-LLM stack is also optimized around Hopper's Transformer Engine and its FP8 compute capabilities. The library offers automatic FP8 conversion, a DL compiler for kernel fusion, & a mixed-precision optimizer, along with support for the SmoothQuant algorithm, enabling 8-bit quantization without sacrificing accuracy.
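For intuition, here is a small sketch of the core SmoothQuant idea as described in the published paper, not NVIDIA's implementation: a per-channel scale migrates activation outliers into the weights so that both tensors quantize cleanly to 8 bits while the layer's output stays mathematically unchanged.

```python
# Conceptual SmoothQuant sketch (toy NumPy version, not NVIDIA's implementation).
import numpy as np

def smooth(activations, weights, alpha=0.5):
    # activations: [tokens, in_features], weights: [in_features, out_features]
    act_max = np.abs(activations).max(axis=0)        # per-channel activation range
    w_max = np.abs(weights).max(axis=1)              # per-channel weight range
    scale = act_max ** alpha / w_max ** (1 - alpha)  # alpha controls migration strength
    # Y = (X / s) @ (diag(s) W) equals X @ W exactly, but X / s has far smaller
    # outliers, so 8-bit quantization of both tensors loses little accuracy.
    return activations / scale, weights * scale[:, None]

def quantize_int8(x):
    # Simple symmetric per-tensor 8-bit quantization.
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -127, 127).astype(np.int8), s

X, W = np.random.randn(16, 64), np.random.randn(64, 32)
X_s, W_s = smooth(X, W)
(Xq, sx), (Wq, sw) = quantize_int8(X_s), quantize_int8(W_s)
Y = (Xq.astype(np.float32) * sx) @ (Wq.astype(np.float32) * sw)  # dequantized matmul
```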