lperezmo/embeddings-extraction
Scripts for reading, extracting, and organizing data from HTML or PDF documents and preparing it for conversion into embeddings for use in context-augmented LLM queries.
This tool helps you quickly find answers within a large collection of HTML or PDF documents. You provide a folder of these documents, and it processes them to create a searchable database. You can then ask questions, and it will retrieve the most relevant sections from your documents, helping knowledge workers or researchers efficiently tap into their document archives.
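The workflow described above (split documents into sections, index them, then return the sections most relevant to a question) can be illustrated with a minimal sketch. This is not the repository's code: the function names `chunk_text`, `embed`, `build_index`, and `search` are hypothetical, and a toy word-count vector stands in for a real embedding model so the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, max_words=50):
    """Map {name: extracted text} to a list of (name, chunk, vector) entries."""
    index = []
    for name, text in documents.items():
        for chunk in chunk_text(text, max_words):
            index.append((name, chunk, embed(chunk)))
    return index


def search(index, question, top_k=3):
    """Return the top_k (score, name, chunk) entries most similar to the question."""
    query = embed(question)
    scored = [(cosine(query, vector), name, chunk) for name, chunk, vector in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

In a real pipeline the text would first be extracted from the HTML or PDF files, and `embed` would call an embedding model; the retrieved chunks would then be passed to the LLM as context.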
No commits in the last 6 months.
Use this if you need to extract specific information or answer questions by searching through a large set of unstructured HTML or PDF documents.
Not ideal if your primary goal is to extract structured data into tables, or if you only have a few documents to review manually.
Stars: 13
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Aug 26, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/lperezmo/embeddings-extraction"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
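The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions not confirmed by this page: that the path segments after `quality/` are category, owner, and repository name, and that the endpoint returns JSON. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL; the category/owner/repo split is an assumption."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the record and decode it, assuming (unverified) a JSON response."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("embeddings", "lperezmo", "embeddings-extraction")
```

Keeping URL construction separate from the network call makes the former easy to check without spending any of the daily request quota.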
Higher-rated alternatives
ContextualAI/gritlm
Generative Representational Instruction Tuning
xlang-ai/instructor-embedding
[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
liuqidong07/LLMEmb
[AAAI'25 Oral] The official implementation code of LLMEmb
hpcaitech/CachedEmbedding
A memory efficient DLRM training solution using ColossalAI
ritesh-modi/embedding-hallucinations
This repo shows how foundational model hallucinates and how we can fix such hallucinations using...