skywalker023/sodaverse
🥤🧑🏻🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization"
This project helps AI researchers create large datasets of natural-sounding conversations infused with social commonsense. It takes existing social commonsense knowledge bases and distills them into dialogue, producing new, realistic conversational datasets. AI researchers and dialogue system developers who need to train or evaluate conversational AI models would use this.
Use this if you are an AI researcher or developer looking to generate large-scale, high-quality dialogue datasets that reflect real-world social interactions and commonsense understanding.
Not ideal if you need a conversational AI model for knowledge-intensive domains such as science, medical advice, or legal consultation, because this project focuses on social chitchat.
Stars: 239
Forks: 14
Language: Python
License: MIT
Category:
Last pushed: Jan 23, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/skywalker023/sodaverse"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
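The curl call above can also be made from a script. The following is a minimal Python sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns a JSON body, which this page does not confirm. It uses only the keyless access described above, since the page does not say how a key is supplied.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot break the URL.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the response body is JSON. The page does not document
    the response format, so adjust the parsing if it differs.
    """
    url = build_quality_url(owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to perform the request.
    print(build_quality_url("skywalker023", "sodaverse"))
```

Each call to `fetch_quality` counts against the daily request limit, so caching the result locally is worth considering when polling many repositories.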
Higher-rated alternatives
monarch-initiative/ontogpt
LLM-based ontological extraction tools, including SPIRES
weAIDB/awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
AXYZdong/AMchat
AM (Advanced Mathematics) Chat is a large language model that integrates advanced mathematical...
Y-Research-SBU/TimeSeriesScientist
Official Repository for TimeSeriesScientist
open-chinese/poetry-collection
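Chinese "Complete Poetry Collection": the most comprehensive and systematic dataset of Chinese poetry to date, with unified data modeling.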