ashvardanian/SpaceV
Billion-scale Semantic Search dataset derived from Microsoft SpaceV for Vector Search benchmarks with smaller subsets
This dataset helps evaluate the performance of very large-scale information retrieval and recommendation systems. It provides a billion semantic vectors, representing textual or visual information, along with benchmark queries and their expected results. Data scientists and machine learning engineers can use this to stress-test and compare different vector search engines.
No commits in the last 6 months.
Use this if you are developing or evaluating a vector search engine and need a massive, realistic dataset to benchmark its performance, especially concerning speed and accuracy.
Not ideal if you are looking for a dataset to train a new machine learning model or if you only need a small dataset for quick prototyping.
Stars
9
Forks
—
Language
Python
License
—
Category
Last pushed
Aug 27, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/ashvardanian/SpaceV"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
deepset-ai/haystack-tutorials
Here you can find all the Tutorials for Haystack 📓
aryn-ai/sycamore
🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
MaartenGr/PolyFuzz
Fuzzy string matching, grouping, and evaluation.
unum-cloud/USearch
Fast Open-Source Search & Clustering engine × for Vectors & Arbitrary Objects × in C++, C,...
towhee-io/towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.