Inside Nvidia’s Tools for Faster GPU AI Personalisation

Fine-tuning large language models (LLMs) has become essential for enterprises seeking to customise AI systems for specific workflows – from customer support chatbots to coding assistants.
The challenge lies in getting smaller language models to respond consistently with high accuracy for specialised tasks, particularly when those tasks require domain expertise or adherence to particular formats.
To address this, Nvidia has released guidance on fine-tuning these models using Unsloth, an open-source framework optimised for the company’s graphics processing units (GPUs).
The announcement comes alongside the launch of Nvidia’s Nemotron 3 family of open models, designed for agentic AI applications that can orchestrate actions on behalf of users.
Teaching AI new tricks with Unsloth
“Fine-tuning is like giving an AI model a focused training session,” says Annamalai Chockalingam, a product professional at Nvidia who focuses on LLM products.
“With examples tied to a specific topic or workflow, the model improves its accuracy by learning new patterns and adapting to the task at hand.”
Developers can choose between three main approaches.
Parameter-efficient fine-tuning, using techniques such as LoRA or QLoRA, updates only a small portion of the model for faster training.
This method requires between 100 and 1,000 prompt-sample pairs and suits scenarios including adding domain knowledge, improving coding accuracy or adapting models for legal tasks.
Full fine-tuning updates all of the model’s parameters and proves useful for teaching models to follow specific formats or styles.
Nvidia recommends this approach for building AI agents and chatbots that must provide assistance about particular topics, stay within certain guardrails and respond in a particular manner.
This method requires more than 1,000 prompt-sample pairs.
Reinforcement learning is a more advanced technique that adjusts model behaviour using feedback or preference signals.
The model learns by interacting with its environment and uses the feedback to improve itself over time, making it suitable for building autonomous agents or improving accuracy in specialised domains such as law or medicine.
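To see why parameter-efficient methods such as LoRA update only a small portion of a model, it helps to count parameters. The sketch below is illustrative only: it assumes a single 4,096 by 4,096 weight matrix and an adapter rank of 16, neither of which comes from Nvidia's guidance, and it does not use Unsloth's API. LoRA freezes the original matrix and trains two small low-rank matrices in its place.

```python
# Illustrative parameter count for one LoRA-adapted weight matrix.
# The layer size and rank used below are hypothetical, not from the article.

def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_in x d_out matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the layer's parameters that the adapter actually trains."""
    return lora_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 16  # assumed values for illustration
    print(f"full fine-tuning: {full_params(d_in, d_out):,} parameters")
    print(f"LoRA adapter:     {lora_params(d_in, d_out, rank):,} parameters")
    print(f"trainable share:  {trainable_fraction(d_in, d_out, rank):.2%}")
```

Under these assumptions the adapter trains well under 1% of the layer's weights, which is the main reason parameter-efficient fine-tuning needs less memory and time than full fine-tuning.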
Unsloth translates complex mathematical operations into custom GPU kernels to accelerate training.
The framework boosts the performance of the Hugging Face transformers library by 2.5 times on Nvidia GPUs.
Hugging Face, which maintains the library, also operates a platform for sharing AI models and datasets.
These optimisations work across Nvidia’s hardware range, from GeForce RTX laptops to RTX PRO workstations and DGX Spark.
Nemotron 3: Bringing hybrid architecture to market
The Nemotron 3 family uses a hybrid latent Mixture-of-Experts architecture.
Nemotron 3 Nano 30B-A3B, available now, represents what Nvidia describes as the most compute-efficient model in the lineup.
It contains 30 billion parameters but activates only three billion per token during inference. Nvidia says the model also produces up to 60% fewer reasoning tokens, significantly reducing inference costs.
The model includes a one million-token context window, allowing it to retain far more information for long, multistep tasks.
Nvidia has optimised Nemotron 3 Nano for software debugging, content summarisation, AI assistant workflows and information retrieval at low inference costs.
Nemotron 3 Super targets high-accuracy reasoning for multi-agent applications, while Nemotron 3 Ultra handles complex AI applications.
Both models arrive in the first half of 2026.
Nvidia has also released an open collection of training datasets and reinforcement learning libraries.
Developers can access Nemotron 3 Nano through Hugging Face, llama.cpp and LM Studio.
How DGX Spark enables local fine-tuning workflows
DGX Spark brings what Nvidia positions as desktop supercomputing to developers.
Built on the company’s Grace Blackwell architecture, the system delivers up to one petaflop of FP4 AI performance and includes 128GB of unified CPU-GPU memory.
This unified memory architecture enables larger model sizes, with models containing more than 30 billion parameters fitting comfortably within DGX Spark’s capacity despite exceeding the VRAM limits of consumer GPUs.
The system supports full fine-tuning and reinforcement learning workflows, which demand more memory and higher throughput than parameter-efficient methods.
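A rough calculation shows why the 128GB figure matters. The sketch below estimates the memory needed just to hold a model's weights at several numeric precisions. The 24GB consumer GPU figure is an assumption chosen for comparison, not a number from Nvidia, and the estimate covers weights only: fine-tuning also needs room for gradients, optimiser state and activations, so real requirements are higher.

```python
# Back-of-the-envelope weight memory estimate (weights only, decimal GB).
# CONSUMER_VRAM_GB is an assumed comparison point, not a figure from the article.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory, as stated in the article
CONSUMER_VRAM_GB = 24    # hypothetical consumer GPU, for illustration


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(num_params: float, precision: str, capacity_gb: float) -> bool:
    """True if the weights alone fit within the given memory capacity."""
    return weight_memory_gb(num_params, precision) <= capacity_gb


if __name__ == "__main__":
    params = 30e9  # a 30-billion-parameter model
    for precision in BYTES_PER_PARAM:
        need = weight_memory_gb(params, precision)
        print(
            f"{precision}: {need:6.1f} GB | "
            f"fits in {UNIFIED_MEMORY_GB} GB: {fits(params, precision, UNIFIED_MEMORY_GB)} | "
            f"fits in {CONSUMER_VRAM_GB} GB: {fits(params, precision, CONSUMER_VRAM_GB)}"
        )
```

At 16-bit precision, a 30-billion-parameter model needs about 60GB for its weights, which exceeds the assumed 24GB card but leaves headroom within 128GB.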
Developers can run compute-intensive tasks locally instead of waiting for cloud instances or managing multiple environments.
The system fine-tunes a Llama model with eight billion parameters at 6,832 tokens per second using parameter-efficient methods, or 4,294 tokens per second for full fine-tuning.
For models with 70 billion parameters, DGX Spark achieves 1,461 tokens per second with parameter-efficient fine-tuning.
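These throughput figures can be turned into rough wall-clock estimates. The sketch below assumes a dataset of 1,000 examples averaging 512 tokens each; both numbers are hypothetical, and the calculation ignores data loading, evaluation and checkpointing overhead.

```python
# Rough time-per-epoch estimate from the throughput figures quoted above.
# The dataset size and tokens-per-example values are hypothetical.

THROUGHPUT_TOKENS_PER_SEC = {
    ("8B", "parameter-efficient"): 6832,
    ("8B", "full"): 4294,
    ("70B", "parameter-efficient"): 1461,
}


def seconds_per_epoch(num_examples: int, tokens_per_example: int,
                      tokens_per_sec: float) -> float:
    """Time to process every token in the dataset once, in seconds."""
    return num_examples * tokens_per_example / tokens_per_sec


if __name__ == "__main__":
    num_examples, tokens_per_example = 1000, 512  # assumed dataset shape
    for (size, method), rate in THROUGHPUT_TOKENS_PER_SEC.items():
        secs = seconds_per_epoch(num_examples, tokens_per_example, rate)
        print(f"{size} model, {method} fine-tuning: about {secs:.0f} s per epoch")
```

On these assumptions, one pass over the data takes a minute or two for the eight-billion-parameter model and several minutes for the 70-billion-parameter model.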
DGX Spark’s capabilities extend beyond language models: high-resolution diffusion models, which often require more memory than typical desktops provide, can generate 1,000 images in seconds on the system.


