eellak/glossAPI

Greek Dataset Production from PDF+


Established

This tool helps researchers and institutions convert academic PDFs, especially those in Greek, into clean, structured Markdown and JSON. It takes a collection of PDF documents and outputs well-organized text that is easier to analyze, index, or reuse in further research. The primary users are researchers, librarians, and data scientists who work with academic literature and need high-quality text extraction.

128 stars. Available on PyPI.

Use this if you need to reliably extract content from academic PDFs, including those with complex layouts or in Greek, and transform it into a clean, machine-readable format.

Not ideal if you only need basic text extraction from simple documents, or if you are not working with a large corpus where automated cleaning and structuring are crucial.

academic-research document-processing digital-humanities scientific-publishing corpus-linguistics

Maintenance 10 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 20 / 25


Stars: 128

Language: Python

License: —

Related tools

pymupdf/langchain-pymupdf4llm

An integration package connecting PyMuPDF4LLM to LangChain

KalyanM45/DocGenius-Revolutionizing-PDFs-with-AI

This is a Python application that allows you to load a PDF and ask questions about it using...

mozilla-ai/structured-qa

Blueprint by Mozilla.ai for answering questions about structured documents

alejandro-ao/langchain-ask-pdf

An AI-app that allows you to upload a PDF and ask questions about it. It uses OpenAI's LLMs to...

leehanchung/llm-pdf-qa-workshop

Introduction to LLM App Development Workshop: PDF Q&A App using OpenAI, Langchain, and Chainlit
