bahaeddinmselmi/derja-smart-scraper
A lightweight CLI tool for collecting Tunisian Derja text snippets from the open web. It queries Google via [SerpAPI](https://serpapi.com), downloads each result, extracts readable text, and keeps only the sentences that look like Tunisian Derja using a heuristic detector.
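As a rough illustration of the search step, the sketch below builds a SerpAPI request URL for a Google query and pulls the result links out of the JSON response. It is not taken from the tool's source: the helper names `build_search_url` and `extract_links` are hypothetical, and it assumes SerpAPI's documented `search.json` endpoint and the `organic_results` / `link` fields of its Google response.

```python
from urllib.parse import urlencode

# Assumption: SerpAPI's JSON search endpoint; the tool itself may call it differently.
SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_search_url(query, api_key, num=10):
    """Return a SerpAPI URL for a Google search (hypothetical helper)."""
    params = {"engine": "google", "q": query, "num": num, "api_key": api_key}
    return f"{SERPAPI_ENDPOINT}?{urlencode(params)}"


def extract_links(payload):
    """Collect unique result URLs from a parsed SerpAPI response, in order."""
    seen = set()
    links = []
    for result in payload.get("organic_results", []):
        link = result.get("link")
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

Each returned link would then be downloaded and passed to the text-extraction and filtering stages.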
The tool is aimed at researchers and linguists who need Tunisian Arabic (Derja) text from the web. Given a set of search queries, it finds relevant pages, extracts sentences from them, and keeps only those the detector judges to be Tunisian Derja. The output is a JSONL file that can be used to train language models, which makes it a fit for anyone building AI tools for Tunisian speakers.
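A minimal sketch of the filtering and output step might look like the following. The marker-word list, the threshold, and every function name here are assumptions made for illustration. The listing does not document the tool's actual heuristic, which is likely more elaborate.

```python
import json
import re

# Hypothetical marker words common in Tunisian Derja; NOT the tool's real list.
DERJA_MARKERS = {"برشا", "باهي", "شنوة", "توة", "فما"}

ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")   # runs of Arabic-script characters
SENTENCE_SPLIT = re.compile(r"[.!?؟\n]+")       # naive sentence boundaries


def split_sentences(text):
    """Split text on basic punctuation and drop empty fragments."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def looks_like_derja(sentence, min_hits=1):
    """Return True if the sentence contains at least min_hits marker words."""
    words = ARABIC_WORD.findall(sentence)
    hits = sum(1 for w in words if w in DERJA_MARKERS)
    return hits >= min_hits


def filter_derja(text):
    """Keep only the sentences that pass the marker-word heuristic."""
    return [s for s in split_sentences(text) if looks_like_derja(s)]


def to_jsonl(sentences, source_url):
    """Serialize sentences as JSONL, one object per line, keeping Arabic readable."""
    return "\n".join(
        json.dumps({"text": s, "source": source_url}, ensure_ascii=False)
        for s in sentences
    )
```

A word-list heuristic like this is cheap and transparent, but it will miss Derja sentences that contain no marker word and can misfire on other Maghrebi dialects that share vocabulary.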
Use this if you need to build a specialized dataset of authentic Tunisian Derja text for natural language processing or large language model training.
Not ideal if you need to scrape data in languages other than Tunisian Arabic or require a general-purpose web scraper for various content types.
Stars: 11
Forks: n/a
Language: Python
License: MIT
Category:
Last pushed: Jan 28, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bahaeddinmselmi/derja-smart-scraper"
The endpoint is open to everyone at 100 requests/day with no key needed; a free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
CAMeL-Lab/camel_tools
A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York...
PetrKorab/Arabica
Python package for text mining of time-series data
markuskiller/textblob-de
German language support for TextBlob.
MagedSaeed/farasapy
A Python implementation of Farasa toolkit
adhaamehab/textblob-ar
Arabic support for textblob