xn0tsa/Web2LLM
An advanced Python tool for extracting data from websites, cleaning the content, and converting it to high-quality Markdown for optimal use by LLM systems.
When you need to feed up-to-date website content to your AI tools or large language models, this project extracts the core content of a page and removes irrelevant clutter such as navigation and ads. It takes a web page URL and outputs a clean, structured Markdown file. This is ideal for AI trainers, data scientists, or content managers who build and maintain custom knowledge bases for AI.
No commits in the last 6 months.
Use this if you need to convert web pages into clean, token-efficient Markdown for LLM consumption, without irrelevant website elements.
Not ideal if you need to preserve the exact visual layout or every single element of a webpage, as it's designed to strip away non-essential components.
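The general idea described above (keep the core content, drop navigation and similar clutter, emit Markdown) can be illustrated with a minimal sketch. This is not Web2LLM's actual implementation or API: the names `MarkdownSketch`, `html_to_markdown`, and `SKIP_TAGS` are hypothetical, the set of tags treated as clutter is an assumption, and only headings, paragraphs, and list items are handled. It uses only the Python standard library and works on an HTML string rather than fetching a URL.

```python
from html.parser import HTMLParser

# Assumption for this sketch: everything inside these tags is clutter.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}
# Heading tags mapped to their Markdown prefixes.
HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
# Tags that start and end a Markdown block.
BLOCK_TAGS = {"p", "li", *HEADINGS}


class MarkdownSketch(HTMLParser):
    """Hypothetical helper: collects text blocks and skips clutter tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.buf = []        # text fragments of the current block
        self.prefix = ""     # Markdown prefix of the current block
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buf = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()
            if tag in HEADINGS:
                self.prefix = HEADINGS[tag] + " "
            elif tag == "li":
                self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to a rough Markdown string."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)
```

A real extractor would also need to fetch the page, handle links, tables, and code blocks, and use stronger heuristics for finding the main content. The sketch only shows the filter-then-convert shape of the task.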
Stars: 20
Forks: 4
Language: Python
License: not specified
Category: not specified
Last pushed: Mar 04, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/xn0tsa/Web2LLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
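The same request can be made from Python. The following is a minimal sketch using only the standard library: the helper names `perception_url` and `fetch_perception` are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed. How an API key is passed is also not stated, so the sketch covers only the keyless request.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/perception"


def perception_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_perception(owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text.

    The body is returned unparsed because its format is an unknown here.
    """
    with urlopen(perception_url(owner, repo), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch_perception("xn0tsa", "Web2LLM")` issues the same GET request as the curl command. With the keyless tier limited to 100 requests/day, callers would likely want to cache the result.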
Higher-rated alternatives
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers,...
lexiforest/curl_cffi
Python binding for curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
modular service framework to move and transform network packets
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.