jacklanda/CCAE

[NLPCC 2023] CCAE: A Corpus of Chinese-based Asian Englishes

Score: 33 / 100 (Emerging)

This project provides the CCAE corpus, a collection of web documents totaling 340 million tokens from six Chinese-based Asian English varieties. It offers a clean, deduplicated dataset with traceable origins, suitable for researchers studying the nuances of English usage across different Asian regions. Linguists, sociolinguists, and computational linguists focused on World Englishes will find this corpus valuable.

No commits in the last 6 months.

Use this if you need a large, open-access, and meticulously cleaned dataset of Chinese-based Asian Englishes for linguistic research, language modeling, or tasks like identifying language varieties.

Not ideal if you are looking for a corpus primarily focused on Inner-circle English or require extensive linguistic annotations like Part-of-Speech tagging, which this corpus does not provide.

Topics: World Englishes, sociolinguistics, corpus linguistics, language variation, Asian Englishes
Status: Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 8 / 25
Maturity 16 / 25
Community 9 / 25
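
The four category scores above add up to the headline figure. The page does not state how the overall score is derived, so the short sketch below only checks the arithmetic under the assumption that the total is a plain sum of the four categories, each capped at 25. The names `SUBSCORES` and `overall_score` are illustrative, not part of any published API.

```python
# Category scores as listed on this page (each out of 25).
SUBSCORES = {
    "Maintenance": 0,
    "Adoption": 8,
    "Maturity": 16,
    "Community": 9,
}
MAX_PER_CATEGORY = 25


def overall_score(subscores: dict) -> int:
    """Assumed aggregation: a plain sum of the category scores."""
    return sum(subscores.values())


total = overall_score(SUBSCORES)
maximum = MAX_PER_CATEGORY * len(SUBSCORES)
print(f"{total} / {maximum}")  # prints "33 / 100"
```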


Stars: 57
Forks: 4
Language: Python
License: MIT
Last pushed: Dec 06, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/jacklanda/CCAE"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
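
For scripted access, the same request as the curl command above can be made from Python with the standard library. This is a minimal sketch, not documented client code: it assumes the endpoint returns JSON, and it assumes the path segments after `/quality/` are a category, an owner, and a repository name, which is inferred only from the single URL shown here. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build a URL shaped like the curl example (segment meaning is assumed)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository; assumes a JSON response body."""
    url = build_quality_url(category, owner, repo)
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "jacklanda", "CCAE")` issues the request shown in the curl example. With the keyless limit of 100 requests/day, it is worth caching the result rather than re-fetching on every run.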