TakeLab/podium

Podium: a framework-agnostic Python NLP library for data loading and preprocessing

/ 100

Experimental

This tool helps machine learning engineers and data scientists efficiently prepare text data for training natural language processing (NLP) models. It takes raw text from various sources like CSV files or popular NLP datasets, processes it according to custom rules, and outputs structured, cleaned, and tokenized data ready for model ingestion. It's designed for anyone building NLP applications who needs robust, flexible control over their text data pipeline.
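
The load, clean, tokenize, and numericalize flow described above can be sketched in plain Python. This is a minimal illustration using only the standard library; it does not use Podium's API, and every name in it (`preprocess`, `clean`, `build_vocab`, the `text` and `label` column names, the sample data) is hypothetical.

```python
import csv
import io
import re


def load_rows(csv_text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(text):
    """Example custom rule (an assumption): lowercase and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def tokenize(text):
    """Whitespace tokenization; a real pipeline could swap in another tokenizer."""
    return text.split()


def build_vocab(token_lists):
    """Map each distinct token to an integer id in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def preprocess(csv_text, text_column="text", label_column="label"):
    """Turn raw CSV text into tokenized, numericalized examples plus a vocabulary."""
    rows = load_rows(csv_text)
    token_lists = [tokenize(clean(row[text_column])) for row in rows]
    vocab = build_vocab(token_lists)
    examples = [
        {"tokens": toks, "ids": [vocab[t] for t in toks], "label": row[label_column]}
        for toks, row in zip(token_lists, rows)
    ]
    return examples, vocab


# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE = 'text,label\n"A great movie!",positive\n"Not great, not good.",negative\n'
examples, vocab = preprocess(SAMPLE)
```

Each entry in `examples` carries the cleaned tokens, their integer ids, and the label, which is roughly the shape of data a model's input layer expects. A dedicated library would typically add batching, padding, and dataset-specific loaders on top of a flow like this.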

No commits in the last 6 months.

Use this if you need a lightweight and flexible way to load and preprocess diverse text datasets for training custom NLP models, especially if you want to integrate with existing Hugging Face models or define specific text cleaning steps.

Not ideal if you primarily work with pre-built, end-to-end NLP solutions and don't require fine-grained control over data preparation or custom model development.

natural-language-processing text-analysis machine-learning-engineering data-preparation model-training

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 8 / 25

Maturity 16 / 25

Community 5 / 25

Language

Python

License

BSD-3-Clause

Higher-rated alternatives

chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called...

sloria/TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase...

cltk/cltk

The Classical Language Toolkit

allenai/scispacy

A full spaCy pipeline and models for scientific/biomedical documents.

wi2trier/cbrkit

Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI.
