Inside Nvidia’s Granary: One Million Hours of Speech AI Data

Nvidia creates Granary to overcome speech data scarcity, improving recognition and translation for underrepresented languages | Credit: Nvidia
Nvidia launches Granary, an open-source speech dataset with one million hours of multilingual audio across 25 languages alongside two AI models

Nvidia releases an open-source speech dataset containing approximately one million hours of audio across 25 European languages. 

The dataset, called Granary, addresses data scarcity issues that have limited speech recognition and translation capabilities for languages including Croatian, Estonian and Maltese.

The dataset accompanies two neural network models designed for speech transcription and translation tasks. 

Nvidia Canary-1b-v2, containing one billion parameters, targets accuracy-focused applications, while Nvidia Parakeet-tdt-0.6b-v3, with 600 million parameters, prioritises throughput for real-time processing requirements.

The speech corpus includes nearly 650,000 hours designated for automatic speech recognition tasks and over 350,000 hours for speech translation applications. 

Automatic speech recognition converts spoken language into text in the same language, whilst speech translation converts audio in one language directly into text in another, without an intermediate transcription step.

How does Granary work?

To create Granary, Nvidia’s speech AI team collaborated with researchers from Carnegie Mellon University, a Pennsylvania-based research institution, and Fondazione Bruno Kessler, an Italian research centre.

The partnership developed a processing pipeline using Nvidia’s NeMo Speech Data Processor toolkit to convert unlabelled audio into structured training data.

Key facts:
  • Granary: An open-source speech dataset with 1M hours of multilingual audio, including 650k for recognition and 350k for translation
  • Nvidia Canary-1b-v2: A 1B-parameter model for accurate transcription and translation across 25 European languages, ranked top for multilingual ASR on Hugging Face
  • Nvidia Parakeet-tdt-0.6b-v3: A 600M-parameter model optimised for real-time, large-scale transcription with the highest throughput among multilingual ASR models on Hugging Face

The methodology eliminates requirements for human annotation, a labour-intensive process where people manually transcribe and label audio content. 

Instead, the processing pipeline transforms public speech data into formats suitable for machine learning model training without manual intervention.
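
As a rough illustration of what "structured training data" can look like, the sketch below writes a JSON Lines manifest, one record per audio clip. The `audio_filepath`, `duration` and `text` keys follow the manifest convention commonly used with NeMo speech tooling; the `lang` key, the function names and the file names are assumptions made for this example, and the code is not taken from the Granary pipeline itself.

```python
import json
import wave


def manifest_entry(wav_path, text, lang):
    """Build one manifest record for a WAV file.

    `lang` is a hypothetical extra field for this sketch.
    """
    with wave.open(str(wav_path), "rb") as wav:
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return {
        "audio_filepath": str(wav_path),
        "duration": round(duration, 3),
        "text": text,
        "lang": lang,
    }


def write_manifest(entries, out_path):
    """Write records as JSON Lines: one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for entry in entries:
            # ensure_ascii=False keeps characters such as "š" or "õ" readable.
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

In an automated pipeline, the `text` value would come from a model-generated transcript rather than a human annotator, which is the step the article describes as removing manual labelling.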

Granary covers 25 European languages: nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian.

The dataset provides training resources for languages with limited existing datasets, addressing linguistic representation gaps in current AI systems.

How Nvidia’s NeMo accelerates Granary’s model development workflow

Research presented at Interspeech, a language processing conference in the Netherlands, demonstrates that models trained on Granary need approximately half as much training data as models trained on existing datasets to reach equivalent accuracy in automatic speech recognition and automatic speech translation tasks.

Canary-1b-v2
The Canary-1b-v2 model expands language support from four languages in previous versions to 25. It is released under a permissive software licence that allows commercial usage and modification.

Nvidia NeMo, a software suite for managing AI agent development lifecycles, facilitated Granary’s model creation.

NeMo Curator, a component within the suite, filtered synthetic examples out of the source data to ensure training used only authentic audio samples.

Parakeet-tdt-0.6b-v3
The Parakeet-tdt-0.6b-v3 model can transcribe a 24-minute audio segment in a single inference pass.

The model automatically detects input audio language without requiring manual language specification.
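
To make the 24-minute figure concrete, the sketch below works out how many inference passes a longer recording would need. It assumes, purely for illustration, that 24 minutes is treated as a per-pass ceiling and that the audio is cut at fixed offsets; `plan_segments` and `MAX_SEGMENT_SECONDS` are hypothetical names, not part of any Nvidia API.

```python
import math

# Assumption for this sketch: treat the 24-minute figure as a per-pass ceiling.
MAX_SEGMENT_SECONDS = 24 * 60


def plan_segments(total_seconds, max_seconds=MAX_SEGMENT_SECONDS):
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if total_seconds <= 0:
        return []
    count = math.ceil(total_seconds / max_seconds)
    return [
        (index * max_seconds, min((index + 1) * max_seconds, total_seconds))
        for index in range(count)
    ]


# A one-hour recording needs three passes under this assumption.
print(plan_segments(3600))
# [(0, 1440), (1440, 2880), (2880, 3600)]
```

Cutting at fixed offsets can split a word in two, so a real workflow would more likely cut at pauses or overlap neighbouring segments.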

Hugging Face leaderboards validating Granary’s performance claims

Hugging Face operates as a repository and collaboration platform for machine learning (ML) models and datasets. Canary-1b-v2 ranks highest among open-source models for multilingual speech recognition accuracy on Hugging Face’s evaluation platform.

Meanwhile, Parakeet-tdt-0.6b-v3 achieves the highest throughput among multilingual models on the same platform. 

Both models generate outputs with punctuation, capitalisation and word-level timestamps. 

Timestamps indicate precise timing when specific words occur within audio recordings, enabling applications requiring synchronisation between text and audio.
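
One common use of word-level timestamps is subtitle generation. The sketch below groups timestamped words into SubRip (SRT) cues. The input structure, a list of dictionaries with `word`, `start` and `end` keys measured in seconds, is an assumption for this example and may not match the models’ actual output format.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timestamped words into numbered SRT cues of up to max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start, end = group[0]["start"], group[-1]["end"]
        text = " ".join(item["word"] for item in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word timings for a short Estonian greeting.
words = [
    {"word": "Tere", "start": 0.0, "end": 0.4},
    {"word": "maailm", "start": 0.5, "end": 1.1},
]
print(words_to_srt(words))
```

Grouping by a fixed word count keeps the example short; a subtitle tool would usually also break cues at punctuation and cap the on-screen duration.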

The dataset and models target production-scale applications, including multilingual chatbots, customer service voice agents and near-real-time translation services. 

Nvidia has made the processing methodology available as open-source software on GitHub, enabling developers to adapt the workflow for additional languages or different model architectures. 

Jensen Huang, CEO of Nvidia, says: “General-purpose, open-source research and foundation models are the backbone of AI innovation.”

The approach allows the global speech AI developer community to extend beyond the initial 25 supported languages.

Granary’s release addresses the challenge that fewer than 100 of approximately 7,000 global languages receive support from existing AI language models.
