rahulpunia29/extractous-go
Fast, multi-format document extraction library for Go. Includes a streaming API for large files and OCR for scanned documents via Tesseract.
This Go library extracts text and metadata from a wide range of document types, including PDF, Word, Excel, and scanned images. It takes document files as input and returns their textual content and associated metadata, and it handles very large files and scanned documents. It is aimed at software engineers building applications that need to process and understand document content.
Use this if you are a Go developer building an application that needs fast, reliable, and memory-efficient extraction of text and metadata from a diverse set of document formats, including those requiring OCR.
Not ideal if you need a standalone application for document extraction rather than a library to integrate into your Go codebase, or if you are not a Go developer.
Stars: 55
Forks: 2
Language: Go
License: Apache-2.0
Category
Last pushed: Oct 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/rahulpunia29/extractous-go"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
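The same request can be made from Go. The following is a minimal sketch, not an official client: the helper names `qualityURL` and `fetchQuality` are hypothetical, the path layout (category, owner, repository) is inferred only from the curl URL above, and the response format is not documented here, so the body is printed as raw text rather than decoded.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// baseURL is taken from the curl example; everything below it is an assumption.
const baseURL = "https://pt-edge.onrender.com/api/v1/quality"

// qualityURL (hypothetical helper) builds the endpoint for a category and an
// owner/repo pair, escaping each path segment.
func qualityURL(category, owner, repo string) string {
	return fmt.Sprintf("%s/%s/%s/%s",
		baseURL,
		url.PathEscape(category),
		url.PathEscape(owner),
		url.PathEscape(repo),
	)
}

// fetchQuality (hypothetical helper) performs the GET request and returns the
// raw response body. Non-200 responses are reported as errors.
func fetchQuality(client *http.Client, endpoint string) ([]byte, error) {
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// A timeout keeps the call from hanging if the service is slow to respond.
	client := &http.Client{Timeout: 10 * time.Second}

	endpoint := qualityURL("nlp", "rahulpunia29", "extractous-go")
	body, err := fetchQuality(client, endpoint)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Because the unauthenticated tier is limited to 100 requests/day, a caller that polls many repositories would likely want to cache responses rather than refetch them.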
Higher-rated alternatives
ikawaha/kagome-dict
Dictionary Library for Kagome v2
aaaton/golem
A lemmatizer implemented in Go
habeanf/yap
Yet Another (natural language) Parser
clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
abadojack/whatlanggo
Natural language detection library for Go