Text Preprocessing Pipelines NLP Tools

End-to-end tools and libraries for cleaning, normalizing, and preparing raw text data for NLP tasks. Includes tokenization, stemming, stopword removal, and data cleaning utilities. Does NOT include downstream NLP applications (sentiment analysis, classification, etc.), feature extraction, or domain-specific cleaning (tweets, names, etc.).

There are 45 text preprocessing pipeline tools tracked. Four score above 50 (the Established tier). The highest-rated is chartbeat-labs/textacy at 60/100 with 2,236 stars.
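The tier labels in the listing appear to follow score bands. The sketch below is only an inference from the scores shown on this page: the source states that scores above 50 are Established, while the cutoff of 30 between Emerging and Experimental is an assumption read off the listed rows (30 is Emerging, 29 is Experimental). The function name `tier_for` is hypothetical, and a score of exactly 50 is not covered by the data, so its placement here is a guess.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label.

    Thresholds are inferred from this page, not from any published
    scoring rule: "above 50" comes from the summary text, and the
    30-point cutoff is an assumption based on the listed rows.
    """
    if score > 50:
        return "Established"
    if score >= 30:  # assumed boundary: 30 is listed as Emerging, 29 as Experimental
        return "Emerging"
    return "Experimental"


# A few scores taken from the listing
for score in (60, 49, 30, 29, 11):
    print(score, tier_for(score))
```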

Get all 45 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=text-preprocessing-pipelines&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
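The same request can be made from Python with the standard library alone. This is a minimal sketch, assuming the endpoint returns a JSON body; the response schema is not documented on this page, so the code only decodes the payload and does not assume any field names. The helper names `build_url` and `fetch_dataset` are hypothetical, and the query parameters are copied from the curl command as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="text-preprocessing-pipelines", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=10):
    """Fetch the URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network request, so it is left commented out):
# data = fetch_dataset(build_url())
# print(type(data))
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the daily request allowance.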

# | Tool | Description | Score | Tier
1 | chartbeat-labs/textacy | NLP, before and after spaCy | 60 | Established
2 | nltk/nltk_data | NLTK Data | 57 | Established
3 | brightertiger/pygarble | Python Package to detect garbled, gibberish text for EN | 54 | Established
4 | jfilter/clean-text | 🧹 Python package for text cleaning | 53 | Established
5 | prasanthg3/cleantext | An open-source package for python to clean raw text data | 49 | Emerging
6 | alinapetukhova/textcl | Text preprocessing package for use in NLP tasks https://pypi.org/project/textcl/ | 45 | Emerging
7 | takuti/prelims | Front matter post-processor for static site generators | 41 | Emerging
8 | ksnugroho/basic-text-preprocessing | Basic text preprocessing for Bahasa with Python. | 40 | Emerging
9 | textpipe/textpipe | Textpipe: clean and extract metadata from text | 40 | Emerging
10 | citiususc/pyplexity | Cleaning tool for web scraped text | 40 | Emerging
11 | MusfiqDehan/data-preprocessors | 🛠️An easy to use tool for Data Preprocessing specially for Text Preprocessing | 38 | Emerging
12 | LoLei/redditcleaner | Cleans Reddit Text Data 📜 🧹 | 38 | Emerging
13 | huu4ontocord/rio | Text pre-processing for NLP datasets | 37 | Emerging
14 | Shubha23/Text-processing-NLP | This notebook contains entire text preprocessing pipeline for NLP problems.... | 37 | Emerging
15 | YugantM/textcleaner | text-data pre-processing utility | 35 | Emerging
16 | Abhayparashar31/crazytext | A Simple Easy To Use Text Cleaning Package For NLP Built In Python. It Can... | 31 | Emerging
17 | Arfius/light-text-prepro | Python module that collects regex rules | 31 | Emerging
18 | mantzaris/KeemenaPreprocessing.jl | Preprocessing for text data: cleaning, normalization, vectorization,... | 31 | Emerging
19 | iaramer/dobbi | An open-source NLP library: fast text cleaning and preprocessing | 31 | Emerging
20 | Ankur3107/nlp_preprocessing | Text Preprocessing Package includes cleaning, tokenization, dataset... | 30 | Emerging
21 | ninadpatil09/NLP-Notebooks | Explore NLP tasks with Python using NLTK, SpaCy & scikit-learn:... | 30 | Emerging
22 | aflah02/cleansetext | This is a simple library to help you clean your textual data | 29 | Experimental
23 | lgomezt/tidyX | Python package to clean raw tweets for ML applications. | 27 | Experimental
24 | umapornp/textprepro | 👀 Everything Everyway All At Once Text Preprocessing for Natural Language Processing. | 27 | Experimental
25 | Al-Hasib/eng_text_cleaner | A python package for cleaning text | 26 | Experimental
26 | krisograbek/text-preprocessing | Text preprocessing in Python. Libs include string, re, nltk, spacy, gensim,... | 24 | Experimental
27 | abeaderstadt/nlp-02-text-preprocessing | Text Preprocessing NLP Project | 23 | Experimental
28 | NITHISHM2410/text-preprocessing-techniques | This Repo includes modules that helps NLP related tasks. | 19 | Experimental
29 | basit-afridi62/nlp-nltk-python | This repository is a hands-on guide to Natural Language Processing (NLP)... | 19 | Experimental
30 | angelsomo/nlp-text-cleaning | Lightweight Python CLI tool for robust text cleaning, Unicode normalization,... | 19 | Experimental
31 | iam-salma/NLP-Bootcamp-with-python | A hands-on NLP Bootcamp using Python covering text preprocessing,... | 18 | Experimental
32 | Abdelrahman-Atef-Elsayed/NLP_Preprocessing_pipeline | This repo includes a generalized preprocessing pipeline for text data in NLP tasks. | 18 | Experimental
33 | MariyamSiddiqui/Text-Preprocessing-NLP-pipeline | End-to-end NLP text preprocessing pipeline using Python — includes... | 18 | Experimental
34 | shrutimary15/Text-data-preparation | The repository consists of a python code that inputs a text file consisting... | 17 | Experimental
35 | tripathiadityap/cleantxty | Python package to clean strings and making them reasonable for NLP. | 17 | Experimental
36 | nadinejackson1/text-preprocessing-pipeline | Basic text preprocessing pipeline, which includes tokenization, stemming,... | 17 | Experimental
37 | udityamerit/Text-Processing-Package-For-Natural-Language-Processing | This project is a comprehensive collection of NLP techniques, practical... | 16 | Experimental
38 | mahirmsb25/Text-Preprocessing-Pipeline | A Python-based NLP preprocessing pipeline using NLTK and Pandas to clean and... | 15 | Experimental
39 | nluninja/nlp_crash_course_with_spacy | A Natural Language Processing crash course with SpaCy 2.6 and NLTK 3.6.2,... | 14 | Experimental
40 | Varsh008/text_preprocessor_toolkit | Configurable Text Preprocessing Toolkit in Python using spaCy | 11 | Experimental
41 | alanindra/baca-juga-cleaner | Program to clean news text by filtering out irrelevant syntactic... | 11 | Experimental
42 | dodevca/tweet-preprocessor | Lightweight, modular, and extensible Python library for preprocessing... | 11 | Experimental
43 | tnathu-ai/NLP-Job-Ad | Pre-process natural language text data to generate effective feature... | 11 | Experimental
44 | michellepellon/tidyname | Intelligent company name cleaning and normalization for Python. Entity... | 11 | Experimental
45 | PawarMukesh/NLP-Text-PreProcessing | This file is contain techniques used in pre-process the text data | 11 | Experimental

Comparisons in this category