qvac

The largest synthetic LLM educational pre-training datasets.

QVAC Genesis is a family of open datasets focused on real understanding - not imitation. Our aim is to provide the global AI community with the high-quality data needed to accelerate the development of open-source models.

Discover QVAC Genesis

Genesis II

The second release of QVAC Genesis expands coverage to 10 new domains, including chemistry, computer science, statistics, machine learning, astronomy, and econometrics, and introduces an improved methodology that produces higher-quality synthetic datasets.

More than a scale increase, our research aims to empower the community to develop models that reason and explain, grounding intelligence in understanding, not imitation - a deliberate shift in how educational AI data should be built.

Committed to the Community

The QVAC Genesis family is made available under a Creative Commons license, reinforcing our commitment to open, community-driven AI research.

Genesis I

We start with QVAC Genesis I, a synthetic dataset purpose-built for education-specific content, offering deep and comprehensive coverage across key STEM domains.

The dataset has been rigorously validated across multiple educational benchmarks, demonstrating superior performance across school- and college-level subjects such as logical deduction, mathematics, biology, and medicine.

Test using our evaluation model

Test Genesis I yourself using our open-source pre-trained base model.

Perform continual pre-training, then test and compare against a proven baseline. Discover how Genesis I provides a practical foundation for developing next-generation STEM learning assistants that genuinely understand complex concepts.

FAQ