zjunlp/IEPile

[ACL 2024] IEPile: A Large-Scale Information Extraction Corpus

/ 100

Emerging

This project offers a large-scale collection of Chinese and English text data designed for training AI models to extract information. It covers text from domains such as medicine and finance, and supports building models that identify and extract specific facts or entities according to predefined categories (a schema). It is primarily useful for AI researchers and developers who are building or improving large language models for information extraction tasks.
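
To make the idea of extraction "based on predefined categories" concrete, here is a minimal sketch of what one schema-based training record might look like. The field names (`instruction`, `schema`, `input`, `output`), the prompt wording, and the helper `build_record` are assumptions for illustration only and are not confirmed by this listing; check the repository for the actual data format.

```python
import json

def build_record(text, schema, answer):
    """Build one schema-based training record (hypothetical layout)."""
    prompt = {
        "instruction": (
            "Extract entities that match the schema from the input. "
            "Return an empty list for types that do not occur."
        ),
        "schema": schema,  # the predefined categories to extract
        "input": text,     # the raw text to extract from
    }
    # Every schema type appears in the output, even when nothing was found.
    output = {label: answer.get(label, []) for label in schema}
    return {
        "instruction": json.dumps(prompt, ensure_ascii=False),
        "output": json.dumps(output, ensure_ascii=False),
    }

record = build_record(
    text="Marie Curie worked in Paris.",
    schema=["person", "location", "organization"],
    answer={"person": ["Marie Curie"], "location": ["Paris"]},
)
print(record["output"])
```

In this sketch the schema travels with each example, so a model fine-tuned on such records could in principle be asked for a different set of categories at inference time. `ensure_ascii=False` keeps Chinese text readable in the serialized record.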

212 stars. No commits in the last 6 months.

Use this if you are developing AI models for automatically extracting specific data points from text and need a comprehensive, high-quality, and schema-based dataset for training and fine-tuning, especially for bilingual applications.

Not ideal if you are an end-user simply looking to apply an existing information extraction tool or if your task doesn't involve training new large language models.

natural-language-processing data-extraction machine-learning-engineering text-analytics AI-model-training

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 13 / 25

Stars

212

Forks

Language

Python

License

—

Higher-rated alternatives

williamliujl/CMExam

A Chinese National Medical Licensing Examination dataset and large language model benchmarks

StefanHeng/ProgGen

Code for paper "ProgGen: Generating Named Entity Recognition Datasets Step-by-step with...

Yinghao-Li/GnO-IE

Code for "A Simple but Effective Approach to Improve Structured Language Model Output for...

MaheshJakkala/naamapadam-multilingual-ner

Benchmarking NER on Naamapadam across 7 Indic languages. EDA + model training for...

yaoyiran/BLI-Reading-List

A 2024 Reading List for Bilingual Lexicon Induction (BLI) / Word Translation. Frequently Updated.
