The largest synthetic pre-training datasets for Large Language Models (LLMs) to date.

QVAC Genesis is a family of open datasets focused on real understanding - not imitation. Our aim is to provide the global AI community with the high-quality data needed to accelerate the development of open-source models.

Discover QVAC Genesis

Genesis II

The second release of QVAC Genesis expands coverage to 10 new domains, including chemistry, computer science, statistics, machine learning, astronomy, and econometrics, and introduces an improved methodology that produces higher-quality synthetic datasets.

More than a scale increase, our research aims to empower the community to develop models that reason and explain, grounding intelligence in understanding rather than imitation: a deliberate shift in how educational AI data should be built.

Find it on Hugging Face

Committed to the Community

The QVAC Genesis family is made available under a Creative Commons license, reinforcing our commitment to open, community-driven AI research.

Join the conversation

Genesis I

We start with QVAC Genesis I, a synthetic dataset purpose-built for education-specific content, offering deep and comprehensive coverage across key STEM domains.

The dataset has been rigorously validated across multiple educational benchmarks, demonstrating superior performance on school- and college-level subjects such as logical deduction, mathematics, biology, and medicine.

Find it on Hugging Face

Test using our evaluation model

Test Genesis I yourself using our open-source pre-trained base model. Perform continual pre-training, then test and compare against a proven baseline to see how Genesis I provides a practical foundation for developing next-generation STEM learning assistants that genuinely understand complex concepts.

See the evaluation model

FAQ

1. What is QVAC Genesis, and how does it differ from other pre-training datasets?

2. Why focus on synthetic data rather than naturally occurring data?

3. How large are the QVAC Genesis datasets?

4. What are the system requirements for downloading and using Genesis?

5. Can I test the dataset without having to train a model myself?

6. Is QVAC Genesis compatible with popular LLM training frameworks like PyTorch?

7. What specific subjects and topics are covered by the QVAC Genesis datasets?

8. What educational levels does the dataset cover (undergraduate, graduate)?

9. Does QVAC Genesis include multilingual content or is it English-only?

10. What benchmarks were used to evaluate QVAC Genesis, and what were the results?

11. How do you ensure the educational content is factually accurate?

12. What quality control processes were implemented during dataset creation?

13. How was the synthetic data in Genesis generated?

14. What safeguards are in place to prevent biases or inaccuracies in synthetic content?

15. Can you explain the methodology behind creating education-specific synthetic data?

16. How can I access and download the QVAC Genesis dataset?

17. What is the licensing model for QVAC Genesis and are there usage restrictions?

18. How does QVAC Genesis address concerns about data privacy and copyright?

19. What types of models is QVAC Genesis best suited for training?

20. Can QVAC Genesis be used for fine-tuning existing models?

21. How does QVAC Genesis complement existing open-source datasets?

22. How can I contribute to or provide feedback on QVAC Genesis?

23. Where can I find documentation and tutorials for using QVAC Genesis?

24. What steps were taken to ensure QVAC Genesis promotes responsible AI development?

25. Does QVAC Genesis contain any filtering for harmful or inappropriate content?