microsoft/SafeNLP
Safety Score for Pre-Trained Language Models
This project evaluates the safety of pre-trained language models. Given a model and a dataset of sentences that may contain harmful language or implicit hate, it produces a 'safety score' for the model, broken down by demographic group, indicating how likely the model is to generate toxic content about each group. It is aimed at AI ethics researchers and product managers who want to verify that their models are fair and unbiased.
No commits in the last 6 months.
Use this if you are developing or deploying a language model and need to quantify its potential for generating harmful or biased text towards different demographic groups.
Not ideal if you are looking to fine-tune a model for safety directly or need to detect harmful content in real-time user-generated text.
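The snippet below is a rough illustration of the kind of evaluation described above, not SafeNLP's actual code: it compares the average negative log-likelihood a causal language model assigns to harmful versus neutral sentences, grouped by demographic. The model name, the toy records, and the scoring rule are all placeholder assumptions.

```python
# Illustrative sketch only -- not SafeNLP's implementation.
# Assumes a Hugging Face causal LM and toy (group, label, sentence) records.
from collections import defaultdict
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_nll(text: str) -> float:
    """Average negative log-likelihood the model assigns to a sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

# Toy records; a real run would use a labeled dataset of harmful/neutral sentences.
records = [
    ("group_a", "harmful", "example of an implicitly harmful sentence"),
    ("group_a", "neutral", "example of a neutral sentence about the group"),
]

nlls = defaultdict(lambda: defaultdict(list))
for group, label, text in records:
    nlls[group][label].append(sentence_nll(text))

# One possible per-group proxy: how much less likely the model finds
# harmful sentences than neutral ones (a larger gap suggests safer behavior).
for group, by_label in nlls.items():
    harmful = sum(by_label["harmful"]) / len(by_label["harmful"])
    neutral = sum(by_label["neutral"]) / len(by_label["neutral"])
    print(f"{group}: harmful NLL {harmful:.2f} vs neutral NLL {neutral:.2f}")
```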
Stars: 96
Forks: 5
Language: Python
License: —
Category:
Last pushed: Oct 18, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/SafeNLP"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
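For a scripted lookup, a minimal Python sketch of calling the same endpoint is shown below; the response fields are not documented here, so it simply pretty-prints whatever JSON comes back.

```python
# Fetch the quality data for microsoft/SafeNLP from the public endpoint above.
import json
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/SafeNLP"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```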
Higher-rated alternatives
dccuchile/wefe
WEFE: The Word Embeddings Fairness Evaluation Framework. WEFE is a framework that standardizes...
dreji18/Fairness-in-AI
Detecting Bias and ensuring Fairness in AI solutions
amazon-science/bold
Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language...
dhfbk/variationist
Variationist: Exploring Multifaceted Variation and Bias in Written Language Data (ACL 2024 demo track)
soarsmu/BiasFinder
BiasFinder | IEEE TSE | Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems