January 14, 2025 – NVIDIA has announced Nemotron-CC, a massive English-language AI training dataset containing 6.3 trillion tokens in total, 1.9 trillion of which are synthetic data. According to NVIDIA, the dataset is expected to significantly advance the training of large language models in both academia and industry.
The performance of today’s AI models depends largely on their training data, yet existing public datasets are often limited in both scale and quality. NVIDIA says Nemotron-CC addresses this bottleneck by providing a large volume of vetted, high-quality data, calling it “ideal material for training large language models.”
The Nemotron-CC dataset is built on web data from Common Crawl and passes through a rigorous processing pipeline to extract a high-quality subset called Nemotron-CC-HQ.
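As a rough illustration of what such an extraction step can look like, here is a minimal Python sketch of classifier-based filtering over Common Crawl-style records. The `Record` structure, the `score_quality` stand-in, and the `HQ_THRESHOLD` cutoff are all illustrative assumptions for this sketch, not NVIDIA’s actual pipeline.

```python
# Minimal sketch of a quality-filtering pass over Common Crawl-style records.
# All names here (Record, score_quality, HQ_THRESHOLD) are illustrative
# assumptions, not NVIDIA's published pipeline.
from dataclasses import dataclass


@dataclass
class Record:
    url: str
    text: str


def score_quality(text: str) -> float:
    """Stand-in for a learned quality classifier; returns a score in [0, 1].

    A real pipeline would use a trained model here; this toy heuristic
    simply favors longer documents with plausible average word length.
    """
    words = text.split()
    if not words:
        return 0.0
    avg_len = sum(len(w) for w in words) / len(words)
    length_score = min(len(words) / 1000, 1.0)
    return length_score * (1.0 if 3 <= avg_len <= 10 else 0.5)


HQ_THRESHOLD = 0.8  # assumed cutoff for the high-quality subset


def extract_hq_subset(records: list[Record]) -> list[Record]:
    """Keep only the records the classifier rates above the cutoff."""
    return [r for r in records if score_quality(r.text) >= HQ_THRESHOLD]
```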

In terms of performance, NVIDIA reports that models trained on Nemotron-CC-HQ score 5.6 points higher on the MMLU (Massive Multitask Language Understanding) benchmark than models trained on DCLM (DataComp for Language Models), currently the leading public English training dataset.
Further testing showed that an 8-billion-parameter model trained on Nemotron-CC outperformed Meta’s Llama 3.1 8B model by 5 points on the MMLU benchmark, 3.1 points on the ARC-Challenge benchmark, and 0.5 points on average across 10 different tasks.
NVIDIA applied several techniques during the development of Nemotron-CC, including model-based quality classifiers and synthetic data rephrasing, to maximize the quality and diversity of the data. It also reduced the weight of traditional heuristic filters on certain high-quality data, increasing the number of high-quality tokens in the dataset without compromising model accuracy.
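The interplay between learned classifiers and heuristic filters described above can be sketched as follows. The ensemble weighting, the cutoff value, and the rule that a high classifier score overrides the heuristics are assumptions chosen for illustration, not NVIDIA’s published method; synthetic rephrasing with an LLM is a separate step not shown here.

```python
# Hedged sketch of two ideas described above: (1) combine several model
# classifiers into one quality score, and (2) down-weight traditional
# heuristic filters for documents the classifiers rate highly, so fewer
# good tokens are discarded. Weights and cutoffs are illustrative.
from typing import Callable, List, Tuple

Classifier = Tuple[Callable[[str], float], float]  # (scoring fn, weight)


def ensemble_score(text: str, classifiers: List[Classifier]) -> float:
    """Weighted average of several quality classifiers' scores."""
    total_weight = sum(w for _, w in classifiers)
    return sum(clf(text) * w for clf, w in classifiers) / total_weight


def passes_heuristics(text: str) -> bool:
    """Stand-in for traditional heuristic filters (length, symbol ratio)."""
    symbols = sum(not c.isalnum() and not c.isspace() for c in text)
    return len(text) > 200 and symbols / max(len(text), 1) < 0.3


def keep_document(text: str,
                  classifiers: List[Classifier],
                  hq_cutoff: float = 0.9) -> bool:
    """Accept a document if it passes the heuristics, OR if the classifier
    ensemble rates it highly enough that the heuristics are overridden."""
    if ensemble_score(text, classifiers) >= hq_cutoff:
        return True  # classifier confidence outweighs the heuristic filters
    return passes_heuristics(text)
```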
It’s worth noting that NVIDIA has made the Nemotron-CC dataset publicly available through Common Crawl, with accompanying documentation to be released on the company’s GitHub page in the near future. The release is expected to broaden access to high-quality training data and strengthen large language models across a range of domains.