srinidhinandakumar/big-data-ocr-ner
Applying Optical Character Recognition, Named Entity Detection, Object Detection and Caption Generation on big datasets
This project helps researchers and analysts automatically extract specific information from vast collections of scanned PDF documents and online image data. It takes in large volumes of image-based documents and web images, processing them to output structured data that highlights recognized text, identified objects, and categorized entities like names, locations, and organizations. The ideal user is a data analyst or researcher dealing with large, unstructured image-heavy datasets who needs to make sense of them for further study.
No commits in the last 6 months.
Use this if you need to automate the extraction of text, identify objects, and pull out named entities from extensive sets of scanned documents and web-scraped images.
Not ideal if your data is already structured text or if you only have a few documents to process manually.
Stars: 10
Forks: 4
Language: Python
License: none listed
Category: (not shown)
Last pushed: Jul 01, 2018
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/perception/srinidhinandakumar/big-data-ocr-ner"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
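The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl command above. The helper names `build_url` and `fetch_quality` are hypothetical, and the code assumes the endpoint returns a JSON body, which this page does not confirm.

```python
import json
import urllib.request

# Base path taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/perception"


def build_url(owner: str, repo: str) -> str:
    """Return the endpoint URL for a repository given as owner and repo name."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the data for one repository without an API key.

    Assumption: the response body is JSON. If it is not,
    json.loads raises a ValueError.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("srinidhinandakumar", "big-data-ocr-ner")` performs the request shown in the curl command. Each call counts against the daily request limit, so caching the result locally is worth considering when polling many repositories.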
Higher-rated alternatives
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Altimis/Scweet
A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers,...
lexiforest/curl_cffi
Python binding for curl-impersonate fork via cffi. An HTTP client that can impersonate browser...
plabayo/rama
Modular service framework to move and transform network packets.
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.