bahaeddinmselmi/derja-smart-scraper

A lightweight CLI tool for collecting Tunisian Derja text snippets from the open web. It queries Google via [SerpAPI](https://serpapi.com), downloads each result, extracts readable text, and keeps only the sentences that look like Tunisian Derja using a heuristic detector.
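The filtering step described above can be illustrated with a minimal sketch. This is not the project's actual detector: the marker list, the function names, and the `min_hits` threshold are all assumptions chosen for illustration. The idea is simply to split text into sentences and keep those that contain at least one word that is common in Tunisian Derja.

```python
import re

# Hypothetical marker list: a handful of common Tunisian Derja words,
# picked for illustration only. The real tool's heuristic may differ.
DERJA_MARKERS = {"برشا", "باهي", "توا", "شنوة", "علاش", "فما", "موش", "يزي"}

# Runs of characters from the basic Arabic Unicode block.
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")
# Split on Latin and Arabic sentence punctuation and on newlines.
SENTENCE_SPLIT = re.compile(r"[.!?؟\n]+")


def split_sentences(text):
    """Return the non-empty, stripped sentences found in text."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def looks_like_derja(sentence, min_hits=1):
    """Return True if the sentence contains at least min_hits marker words."""
    words = ARABIC_WORD.findall(sentence)
    hits = sum(1 for w in words if w in DERJA_MARKERS)
    return hits >= min_hits


def filter_derja(text, min_hits=1):
    """Keep only the sentences of text that pass the marker-word check."""
    return [s for s in split_sentences(text) if looks_like_derja(s, min_hits)]
```

A whole-word lookup like this misses words with attached prefixes and cannot tell Derja from other Maghrebi dialects that share the same vocabulary, so it should be read as a starting point rather than a reliable classifier.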

28 / 100 (Experimental)

This tool helps researchers and linguists gather Tunisian Arabic (Derja) text from the internet. You provide search queries; it finds relevant web pages, extracts sentences, and keeps only those that read as Tunisian Derja. The output is a JSONL file suitable for training language models, which makes it useful for anyone building AI tools for Tunisian speakers.
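As a rough illustration of the JSONL output mentioned above, the sketch below writes one JSON object per line. The field names (`text`, `source_url`, `query`) are hypothetical and may not match the tool's real schema.

```python
import json


def write_jsonl(records, fp):
    """Write each record as one JSON object per line; return the line count.

    ensure_ascii=False keeps Arabic script readable in the file instead of
    emitting \\uXXXX escapes.
    """
    count = 0
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Usage, assuming a list of already filtered sentences:

```python
records = [
    {"text": "الجو باهي برشا اليوم", "source_url": "https://example.com/a", "query": "derja"},
]
with open("derja.jsonl", "w", encoding="utf-8") as fp:
    write_jsonl(records, fp)
```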

Use this if you need to build a specialized dataset of authentic Tunisian Derja text for natural language processing or large language model training.

Not ideal if you need to scrape data in languages other than Tunisian Arabic or require a general-purpose web scraper for various content types.

Tunisian-Arabic NLP-dataset-creation LLM-training-data dialectal-language-research web-content-mining
No Package · No Dependents
Maintenance 10 / 25
Adoption 5 / 25
Maturity 13 / 25
Community 0 / 25


Stars: 11
Forks: (not shown)
Language: Python
License: MIT
Category: arabic-nlp-tools
Last pushed: Jan 28, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bahaeddinmselmi/derja-smart-scraper"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
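The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names are made up for illustration, and it assumes the endpoint returns a JSON body, whose fields are not documented on this page.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(domain, repo):
    """Build the endpoint URL, e.g. domain='nlp', repo='owner/name'."""
    return f"{API_BASE}/{domain}/{repo}"


def fetch_quality(domain, repo, timeout=10):
    """Fetch and decode the quality record (assumes a JSON response)."""
    url = build_quality_url(domain, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "bahaeddinmselmi/derja-smart-scraper")` issues the same GET request as the curl command; unauthenticated calls count against the 100 requests/day limit.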