proycon/colibri-core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
This tool helps linguists and language researchers efficiently analyze large collections of text (corpora) to find common word patterns. You provide a text corpus, and it generates models of n-grams, skipgrams, and flexgrams, along with their frequencies and relationships. This is ideal for computational linguists, sociolinguists, or anyone performing detailed corpus analysis.
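To make the terminology concrete, here is a minimal sketch in plain Python of what counting n-grams and fixed-size skipgrams means. It does not use Colibri Core's API or its memory-efficient encoding, and the `{*}` gap marker and the function names are assumptions chosen only for this illustration. Flexgrams (gaps of dynamic size) are not covered.

```python
from collections import Counter
from itertools import combinations

GAP = "{*}"  # placeholder for one skipped token (notation chosen for this sketch)


def ngrams(tokens, n):
    """Yield every contiguous window of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def skipgrams(tokens, n):
    """Yield n-grams with one or more inner tokens replaced by GAP.

    The first and last token are always kept, so a pattern never
    starts or ends with a gap.
    """
    inner = range(1, n - 1)
    for gram in ngrams(tokens, n):
        for k in range(1, n - 1):  # how many inner positions to gap
            for positions in combinations(inner, k):
                yield tuple(
                    GAP if i in positions else tok
                    for i, tok in enumerate(gram)
                )


def count_patterns(tokens, n, min_count=1):
    """Count n-grams and skipgrams of length n, keeping those seen at least min_count times."""
    counts = Counter(ngrams(tokens, n))
    counts.update(skipgrams(tokens, n))
    return {p: c for p, c in counts.items() if c >= min_count}


tokens = "the cat sat on the mat and the dog sat on the rug".split()
frequent = count_patterns(tokens, 3, min_count=2)
for pattern, count in sorted(frequent.items()):
    print(" ".join(pattern), count)
```

With a threshold of 2, only the trigram `sat on the` and the skipgrams `the {*} sat` and `sat {*} the` survive. This naive approach keeps every pattern in memory as a tuple of strings, which is exactly the cost a dedicated tool is meant to avoid on large corpora.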
129 stars.
Use this if you need to quickly identify and count recurring word sequences or patterns with gaps in very large text datasets without running out of memory.
Not ideal if you're only working with small text files or need advanced semantic understanding beyond pattern extraction and frequency counting.
Stars: 129
Forks: 20
Language: C++
License: GPL-3.0
Category:
Last pushed: Feb 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/proycon/colibri-core"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
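The same request can be made from Python with only the standard library. This is a sketch under assumptions: splitting the path into a category (`nlp`) and a repository name is inferred from the single URL above, the helper names are hypothetical, and the response is returned as raw text because its schema is not described here.

```python
import urllib.request

# Base path taken from the curl example; the category/repo split is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category, repo):
    """Build the endpoint URL for a repository such as 'proycon/colibri-core'."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category, repo, timeout=10):
    """Fetch the endpoint and return the response body as text, unparsed."""
    with urllib.request.urlopen(build_url(category, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_quality("nlp", "proycon/colibri-core")` issues the same GET request as the curl command and counts against the same daily request limit.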
Related tools
apache/opennlp
Apache OpenNLP
stanfordnlp/CoreNLP
CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing,...
dkpro/dkpro-core
Collection of software components for natural language processing (NLP) based on the Apache UIMA...
stanfordnlp/python-stanford-corenlp
Python interface to CoreNLP using a bidirectional server-client interface.
apache/opennlp-sandbox
Apache OpenNLP Sandbox