JonathanReeve/corpus-db
A textual corpus database for the digital humanities.
This project helps digital humanities researchers and literary scholars easily find and download specific collections of public domain texts. You input criteria like literary genre, author, publication decade, or setting, and it provides a curated subcorpus of books for your analysis. This is ideal for academics, students, and anyone doing literary research.
No commits in the last 6 months.
Use this if you need to quickly assemble a dataset of texts with particular characteristics for literary analysis or computational humanities projects.
Not ideal if you need to analyze a random sample of texts without specific metadata filters, or if you're looking for copyrighted materials.
Stars: 63
Forks: 9
Language: Jupyter Notebook
License: GPL-3.0
Category:
Last pushed: Jul 26, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JonathanReeve/corpus-db"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
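The curl request above can also be made from a script. What follows is a minimal sketch, not a documented client: the helper names `quality_url` and `fetch_quality` are hypothetical, the `category/owner/repo` path layout is inferred from the single URL shown above, and the assumption that the response body is JSON is unverified.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example; everything after it is an inferred layout.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper; path layout is an assumption)."""
    # Percent-encode each segment so unusual characters cannot alter the path.
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Request the endpoint and decode the body, assuming (unverified) it is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints only the URL; call fetch_quality(...) to perform the network request.
    print(quality_url("nlp", "JonathanReeve", "corpus-db"))
```

Calling `fetch_quality("nlp", "JonathanReeve", "corpus-db")` would issue the same request as the curl command. With the keyless limit of 100 requests/day, it is worth caching results locally rather than re-fetching.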
Higher-rated alternatives
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
darija <-> english dataset
omicsNLP/Auto-CORPus
Auto-CORPus pipeline developed by a University of Nottingham and Imperial College London...
SergeyShk/ruTS
A library for extracting statistics from Russian-language texts.