TakeLab/podium
Podium: a framework-agnostic Python NLP library for data loading and preprocessing
This tool helps machine learning engineers and data scientists efficiently prepare text data for training natural language processing (NLP) models. It takes raw text from various sources like CSV files or popular NLP datasets, processes it according to custom rules, and outputs structured, cleaned, and tokenized data ready for model ingestion. It's designed for anyone building NLP applications who needs robust, flexible control over their text data pipeline.
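The load, clean, tokenize, and numericalize flow described above can be sketched with the Python standard library alone. This is a minimal illustration of the general technique and does not use or reproduce Podium's actual API. The sample data, the column names `text` and `label`, and every function name below are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = "text,label\nA great movie!,positive\nNot great at all.,negative\n"


def lowercase(raw):
    """Example of a custom cleaning rule applied before tokenization."""
    return raw.lower()


def tokenize(text):
    """Split into runs of word characters; punctuation is dropped."""
    return re.findall(r"\w+", text)


def load_examples(csv_text, pre_hooks=(lowercase,)):
    """Read CSV rows, apply each cleaning hook in order, then tokenize."""
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row["text"]
        for hook in pre_hooks:
            text = hook(text)
        examples.append({"tokens": tokenize(text), "label": row["label"]})
    return examples


def build_vocab(examples, specials=("<pad>", "<unk>")):
    """Map tokens to integer ids, most frequent first, after the specials."""
    counts = Counter(tok for ex in examples for tok in ex["tokens"])
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, vocab):
    """Convert tokens to ids; unknown tokens fall back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


examples = load_examples(SAMPLE_CSV)
vocab = build_vocab(examples)
ids = [numericalize(ex["tokens"], vocab) for ex in examples]
```

Passing the cleaning rules as a tuple of callables keeps the pipeline independent of any particular model framework, since the output is plain lists of integers.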
No commits in the last 6 months.
Use this if you need a lightweight and flexible way to load and preprocess diverse text datasets for training custom NLP models, especially if you want to integrate with existing Hugging Face models or define specific text cleaning steps.
Not ideal if you primarily work with pre-built, end-to-end NLP solutions and don't require fine-grained control over data preparation or custom model development.
Stars: 60
Forks: 2
Language: Python
License: BSD-3-Clause
Category:
Last pushed: Dec 12, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/TakeLab/podium"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
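The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions that the listing does not confirm: the endpoint returns a JSON body, and the URL path follows a category/owner/repo pattern inferred from the single example above. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path layout inferred from the curl example)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the record for one repository.

    Assumption: the response body is JSON. If it is not, json.loads
    raises a ValueError, which callers may want to handle.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (performs a network request and counts toward the daily limit):
# data = fetch_quality("nlp", "TakeLab", "podium")
```

The call is left commented out so that importing or running the snippet does not consume any of the 100 unauthenticated requests per day.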
Higher-rated alternatives
chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called...
sloria/TextBlob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase...
cltk/cltk
The Classical Language Toolkit
allenai/scispacy
A full spaCy pipeline and models for scientific/biomedical documents.
wi2trier/cbrkit
Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI.