zjysteven/mink-plus-plus
[ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs
This tool evaluates whether specific text examples were part of a large language model's (LLM's) pre-training data. It takes an LLM and a set of text examples as input, then outputs a score for each example indicating the likelihood that it was used in training. Data scientists, machine learning researchers, and AI model auditors who work with LLMs would use it for privacy and data-governance assessments.
No commits in the last 6 months.
Use this if you need to determine whether a particular piece of text was included in the pre-training data of a large language model.
Not ideal if you are looking for a general-purpose tool to filter or clean text data, rather than specifically detecting membership in LLM training sets.
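To make the scoring concrete, below is a minimal sketch of a Min-K%++-style membership score, assuming you already have per-position vocabulary logits for a text (e.g. from a Hugging Face model's forward pass). Each observed token's log-probability is normalized by the mean and standard deviation of log-probabilities under the model's own next-token distribution at that position, and the lowest k fraction of these normalized scores is averaged. The function name and the default `k=0.2` are illustrative, not the repo's API.

```python
import numpy as np

def mink_plus_plus_score(logits, token_ids, k=0.2):
    """Min-K%++-style score: higher suggests the text is more likely
    to have been seen during pre-training.

    logits:    array of shape (seq_len, vocab_size), next-token logits
    token_ids: array of shape (seq_len,), the observed tokens
    k:         fraction of lowest-scoring tokens to average (illustrative default)
    """
    logits = np.asarray(logits, dtype=np.float64)
    # log-softmax over the vocabulary at each position
    log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    probs = np.exp(log_probs)

    positions = np.arange(len(token_ids))
    token_lp = log_probs[positions, token_ids]

    # mean and std of log-probs under the model's own distribution per position
    mu = (probs * log_probs).sum(axis=-1)
    sigma = np.sqrt((probs * (log_probs - mu[:, None]) ** 2).sum(axis=-1))

    # normalize each token's log-prob, then average the lowest k fraction
    normalized = (token_lp - mu) / sigma
    n = max(1, int(len(normalized) * k))
    return float(np.sort(normalized)[:n].mean())
```

In practice you would compare this score against a threshold calibrated on texts known to be members and non-members; note that `sigma` can be near zero for nearly deterministic positions, which a production implementation would need to guard against.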
Stars
54
Forks
9
Language
Python
License
MIT
Category
Last pushed
May 26, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zjysteven/mink-plus-plus"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
ModelCloud/GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD...
intel/auto-round
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality...
pytorch/ao
PyTorch native quantization and sparsity for training and inference
bodaay/HuggingFaceModelDownloader
Simple go utility to download HuggingFace Models and Datasets
NVIDIA/kvpress
LLM KV cache compression made easy