ashvardanian/SpaceV

Billion-scale Semantic Search dataset derived from Microsoft SpaceV for Vector Search benchmarks with smaller subsets

/ 100

Experimental

This dataset helps evaluate the performance of very large-scale information retrieval and recommendation systems. It provides a billion semantic vectors, representing textual or visual information, along with benchmark queries and their expected results. Data scientists and machine learning engineers can use this to stress-test and compare different vector search engines.

No commits in the last 6 months.

Use this if you are developing or evaluating a vector search engine and need a massive, realistic dataset to benchmark its performance, especially concerning speed and accuracy.

Not ideal if you are looking for a dataset to train a new machine learning model or if you only need a small dataset for quick prototyping.

information-retrieval recommender-systems machine-learning-engineering vector-search performance-benchmarking

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 0 / 25

How are scores calculated?

Stars

Forks

—

Language

Python

License

—

Higher-rated alternatives

deepset-ai/haystack-tutorials

Here you can find all the Tutorials for Haystack 📓

aryn-ai/sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.

MaartenGr/PolyFuzz

Fuzzy string matching, grouping, and evaluation.

unum-cloud/USearch

Fast Open-Source Search & Clustering engine × for Vectors & Arbitrary Objects × in C++, C,...

towhee-io/towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

Explore Embedding Tools

All categories Trending Embeddings directory Insights