
Our classifier is based on Detoxify's unbiased-toxic-roberta model, trained on the Jigsaw Unintended Bias in Toxicity Classification dataset.
We chose this model for its high accuracy and its reduced false-positive rate on content from marginalized groups. We calibrate our thresholds using guidance from the Google Publisher Policies.
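The scoring step can be sketched as follows. This is a minimal illustration, not our production code: the category names match the outputs of Detoxify's unbiased model, but the threshold values here are placeholders, not our calibrated settings.

```python
# Illustrative per-category threshold check. Category names follow
# Detoxify's unbiased model output; the threshold values are assumptions.
THRESHOLDS = {
    "toxicity": 0.5,
    "severe_toxicity": 0.2,
    "identity_attack": 0.3,
    "insult": 0.5,
    "threat": 0.2,
    "sexual_explicit": 0.4,
}

def flagged_categories(scores: dict) -> list:
    """Return the categories whose score meets or exceeds its threshold."""
    return [cat for cat, t in THRESHOLDS.items() if scores.get(cat, 0.0) >= t]

# Mocked model output; a real call would look like
# Detoxify("unbiased").predict(text) and return a dict of category scores.
scores = {"toxicity": 0.62, "identity_attack": 0.05, "threat": 0.01}
print(flagged_categories(scores))  # ["toxicity"]
```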
Each page receives a composite safety grade based on the most concerning content on the page.
Note: Pages without completed safety review default to Grade D to protect buyers.
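A sketch of how a "worst category wins" composite grade could work, including the default-to-D behavior for unreviewed pages. The grade boundaries below are illustrative assumptions, not our actual cutoffs.

```python
# Illustrative composite grading: the grade is driven by the single most
# concerning category score. Boundary values are assumptions.
def safety_grade(scores):
    if scores is None:            # no completed safety review
        return "D"                # default to Grade D to protect buyers
    worst = max(scores.values(), default=0.0)
    if worst < 0.1:
        return "A"
    if worst < 0.3:
        return "B"
    if worst < 0.6:
        return "C"
    return "D"

print(safety_grade(None))                               # "D"
print(safety_grade({"toxicity": 0.05}))                 # "A"
print(safety_grade({"toxicity": 0.45, "insult": 0.2}))  # "C"
```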
Urban Dictionary applies two distinct moderation layers to every piece of content — one for publication, and another for advertiser safety. This ensures ads never appear next to content that violates your risk tolerance.
- ✓ Only Grade A pages
- ✓ Ideal for family-safe or regulated brands

- ✓ Grades A & B
- ✓ Suitable for most mainstream advertisers

- ✓ Grades A–C
- ✓ Broadest reach, suitable for mature brands
Custom controls
Set category-level thresholds (e.g. exclude only identity-based attacks or sexual content) to tailor safety enforcement.
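Combining a tier with category-level exclusions could look like the sketch below. The tier names, grade sets, and page-record shape are all illustrative assumptions.

```python
# Hypothetical eligibility check combining a grade tier with
# category-level exclusions. Tier names and grade sets are assumptions.
ALLOWED = {
    "strict": {"A"},
    "standard": {"A", "B"},
    "broad": {"A", "B", "C"},
}

def eligible(page_grade, flagged, tier, excluded_categories):
    """A page is eligible when its grade is allowed by the chosen tier
    and none of its flagged categories are explicitly excluded."""
    return page_grade in ALLOWED[tier] and not (set(flagged) & set(excluded_categories))

# A Grade B page flagged only for obscenity passes a standard-tier buyer
# who excludes identity-based attacks; one flagged for identity_attack fails.
print(eligible("B", {"obscene"}, "standard", {"identity_attack"}))          # True
print(eligible("B", {"identity_attack"}, "standard", {"identity_attack"}))  # False
```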
✓ Open methodology
- Model details, thresholds, and scoring code are available for audit.
✓ Bias mitigation
- We use fairness-optimized models and conduct regular audits to reduce disproportionate impact on minority content.
✓ Full control
- CSV exports, dashboard tools, and override options let you tune your brand safety settings.
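Working with a grade export might look like the sketch below. The column names (`url`, `grade`) are assumptions about the export format, and the sample rows are fabricated for illustration only.

```python
import csv
import io

# Illustrative filtering of a brand-safety CSV export down to the pages a
# Grades A & B buyer would accept. Column names and rows are assumptions.
sample = io.StringIO("url,grade\n/define/foo,A\n/define/bar,C\n")
rows = [r for r in csv.DictReader(sample) if r["grade"] in {"A", "B"}]
print([r["url"] for r in rows])  # ["/define/foo"]
```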
Have a question about our brand safety policies?
Get detailed information about our moderation system, thresholds, and implementation.