yifanzhang-pro/AutoMathText

[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (featured in Hugging Face Daily Papers: https://huggingface.co/papers/2402.07625)

Emerging

AutoMathText is a collection of roughly 200 GB of mathematical texts and code excerpts gathered from various online sources. Each item carries an 'LM score' between 0 and 1 that indicates its relevance, quality, and educational value for mathematical intelligence. The dataset suits AI researchers, educators, and mathematics enthusiasts who need pre-assessed mathematical content for learning, teaching, or training AI models.

Use this if you need a pre-scored, extensive dataset of mathematical texts and code to develop AI models for math, create educational materials, or conduct research at the intersection of mathematics and AI.

Not ideal if you require text outside of mathematics or if you prefer to manually curate and score your data.
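Because each item carries an 'LM score' between 0 and 1, a typical first step is to keep only the items above a chosen threshold. The sketch below illustrates that idea on in-memory records. The record layout, the `lm_score` field name, and the `filter_by_lm_score` helper are assumptions made for illustration; the dataset's actual schema and loading method may differ, so check the repository before adapting it.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def filter_by_lm_score(
    records: Iterable[Record],
    min_score: float = 0.5,
    score_key: str = "lm_score",  # hypothetical field name, not confirmed by the source
) -> Iterator[Record]:
    """Yield records whose score lies in [min_score, 1.0].

    Records without a score are skipped rather than treated as zero.
    """
    for record in records:
        score = record.get(score_key)
        if score is not None and min_score <= score <= 1.0:
            yield record


# Made-up sample records standing in for dataset rows.
sample = [
    {"text": "Proof that sqrt(2) is irrational", "lm_score": 0.93},
    {"text": "Celebrity gossip roundup", "lm_score": 0.04},
    {"text": "Intro to eigenvalues with NumPy", "lm_score": 0.71},
    {"text": "Row with no score attached"},
]

kept = list(filter_by_lm_score(sample, min_score=0.6))
print([r["text"] for r in kept])
```

Writing the filter as a generator means it can also be applied to a streamed source without holding 200 GB of text in memory at once.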

mathematical-research AI-model-training educational-content data-curation mathematical-intelligence

No Package No Dependents

Maintenance 6 / 25

Adoption 9 / 25

Maturity 16 / 25

Community 8 / 25

Language: Python

License: CC-BY-4.0

Higher-rated alternatives

ExtensityAI/symbolicai

A neurosymbolic perspective on LLMs

TIGER-AI-Lab/MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...

deep-symbolic-mathematics/LLM-SR

[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...

microsoft/interwhen

A framework for verifiable reasoning with language models.

zhudotexe/fanoutqa

Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...
