CoPE-A: The COntent Policy Evaluator Model
Model Overview
CoPE-A-9B is a special-purpose content policy evaluation model designed to perform accurate content labeling based on developer-customizable criteria. Built by trust & safety veterans, CoPE-A represents a significant advancement in content classification technology, offering steerable and precise policy interpretation while maintaining a small and efficient model size.
Key Features
- Policy-adaptive content evaluation
- High-accuracy binary classification
- Steerable (no fixed taxonomy/definitions)
- Validated across multiple harm categories
Getting Started
For Developers
- Download the google/gemma-2-9b base model
- Merge it with the CoPE adapter (i.e., this model)
- Perform inference using the prompt template below (a minimal loading sketch follows this list)
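As a concrete starting point, the following is a minimal sketch of the first two steps using transformers and peft. The adapter repository id zentropi-ai/cope-a-9b, the bfloat16 dtype, and the device mapping are assumptions to adapt to your environment.

```python
# Minimal sketch: load the Gemma-2-9b base model and merge the CoPE LoRA adapter.
# Assumes you have access to the gated google/gemma-2-9b weights and that the
# adapter is available as zentropi-ai/cope-a-9b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "google/gemma-2-9b"
ADAPTER_ID = "zentropi-ai/cope-a-9b"  # this model

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter, then merge it into the base weights so inference
# runs as a single dense model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model = model.merge_and_unload()
model.eval()
```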
Try the Demo
- Test out a live version of the model in this HF Space
Technical Specifications
Model Architecture
CoPE-A-9B is a ~9B-parameter model built on Google's Gemma-2-9b architecture using LoRA fine-tuning. This architecture enables efficient deployment while maintaining high accuracy in content policy evaluation tasks.
Input Format
The model accepts input in a structured format:
INSTRUCTIONS
============
Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.
POLICY
======
# Policy
## Overview
[Short statement of what subject matter the policy is meant to classify]
## Definition of Terms
- [Definition Term 1]: Precise definition of a specific word or phrase that will be used repeatedly in the policy.
- Etc.
## Interpretation of Language
- [Instructions on how to handle ambiguous language for the purposes of this policy]
- [Other relevant guidance for how to go about applying this specific policy]
- Etc.
## Definition of Labels
### (HS): Hate Speech
#### Includes
- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to qualify for this label. These can, but do not need to, include caveats specific to this criterion
- Etc.
#### Excludes
- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to NOT qualify for this label. These can, but do not need to, include caveats specific to this criterion
- Etc.
CONTENT
=======
{content_text}
ANSWER
======
See our demo for an example of a full policy.
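For reference, here is one way to assemble that template programmatically. The helper below is illustrative; the exact whitespace between sections is an assumption, so check the demo if your results look off.

```python
# Illustrative helper that fills the prompt template above with a policy
# document and a piece of content.
PROMPT_TEMPLATE = """INSTRUCTIONS
============
Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.

POLICY
======
{policy_text}

CONTENT
=======
{content_text}

ANSWER
======
"""

def build_prompt(policy_text: str, content_text: str) -> str:
    return PROMPT_TEMPLATE.format(policy_text=policy_text, content_text=content_text)
```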
Output Format
CoPE-A provides binary classification outputs:
- 0: None of the policy labels apply
- 1: One or more policy labels apply
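Putting the pieces together, here is a sketch of one classification call. It assumes the `model`, `tokenizer`, and `build_prompt` objects from the snippets above; the policy and content strings are placeholders.

```python
# Sketch of running inference and reading the binary answer.
import torch

policy_text = "..."   # your full policy document (see the demo for an example)
content_text = "..."  # the content to classify

prompt = build_prompt(policy_text, content_text)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1, do_sample=False)

# The answer is the first generated token: "1" if any label applies, else "0".
answer = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
).strip()
label = int(answer) if answer in {"0", "1"} else None
```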
System Requirements
- Compatible with commodity GPU hardware
- Sub-second inference time without any optimization
- Deployment requirements match Gemma-2-9b base model specifications
Training Details
Training Methodology
The model draws upon research into designing more pluralistic policy interpreters. It employs a novel training methodology (see research talk) that moves beyond simple policy memorization and achieves true policy interpretation capabilities by:
- Training across conflicting policy formulations
- Focusing on generalizable policy understanding
- Emphasizing interpretation consistency
Training Data
CoPE-A's training dataset was carefully curated to ensure robust policy interpretation:
- Approximately 60,000 labels across unique policy / content pairs
- Policy texts created by the CoPE team
- Content data sourced from publicly-accessible internet forums
- Combined automated and manual processes to produce golden labels
- Coverage across multiple harm areas, including but not limited to:
  - Hate speech
  - Sexual content
  - Self-harm
  - Harassment
  - Toxicity
Performance Evaluation
Methodology
Evaluation in the policy interpreter space is extremely difficult since most benchmark datasets do not publish the exact policy criteria given to labelers. As such, with the exception of the Ethos hate speech benchmark, we largely had to validate our model on internal datasets. However, to make our analyses more rigorous, we only evaluated our model on unique policies and unique content samples that were held out and not present in our training corpus.
It was also difficult to evaluate certain peer models when the harm areas in question were outside those models’ taxonomies. We have accordingly dropped them from the relevant data tables as they performed very poorly (as expected). CoPE, by contrast, has no fixed taxonomy and can accept any arbitrary content policy, similar to foundation models like GPT-4o, which we were able to evaluate in all areas.
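The scores reported below are standard binary precision, recall, and F1 computed over these held-out policy/content pairs. As a point of reference, a minimal sketch of how such scores can be computed with scikit-learn (the label lists are placeholders):

```python
# Minimal sketch of computing the precision/recall/F1 scores reported below
# from gold labels and model predictions (both lists of 0/1 integers).
from sklearn.metrics import precision_score, recall_score, f1_score

gold = [1, 0, 1, 1, 0]         # placeholder gold labels from a held-out set
predicted = [1, 0, 0, 1, 1]    # placeholder model outputs

print(f"Precision: {precision_score(gold, predicted):.2%}")
print(f"Recall:    {recall_score(gold, predicted):.2%}")
print(f"F1 Score:  {f1_score(gold, predicted):.2%}")
```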
Benchmark Results
Hate Speech Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 89% | 93% | 91% |
GPT-4o | 97% | 78% | 87% |
Llama-3.1-8B | 59% | 96% | 73% |
LlamaGuard3-8B | 88% | 64% | 74% |
ShieldGemma-9B | 68% | 98% | 80% |
Hate Speech - Ethos Benchmark (External Set)
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 80% | 88% | 84% |
GPT-4o | 91% | 78% | 84% |
Llama-3.1-8B | 62% | 92% | 74% |
LlamaGuard3-8B | 87% | 79% | 83% |
ShieldGemma-9B | 82% | 82% | 82% |
Toxic Speech Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 93% | 87% | 90% |
GPT-4o | 64% | 89% | 75% |
Llama-3.1-8B | 33% | 96% | 49% |
Note: Toxicity is outside the ShieldGemma & LlamaGuard taxonomies
Sexual Content Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 96% | 83% | 89% |
GPT-4o | 95% | 72% | 82% |
Llama-3.1-8B | 48% | 95% | 64% |
LlamaGuard3-8B | 100% | 43% | 60% |
ShieldGemma-9B | 96% | 76% | 85% |
Self-Harm Content Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 83% | 93% | 88% |
GPT-4o | 84% | 93% | 88% |
Llama-3.1-8B | 56% | 96% | 70% |
LlamaGuard3-8B | 65% | 84% | 73% |
ShieldGemma-9B | 69% | 89% | 78% |
Harassment Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 69% | 78% | 73% |
GPT-4o | 100% | 17% | 30% |
Llama-3.1-8B | 35% | 87% | 50% |
ShieldGemma-9B | 49% | 55% | 52% |
Note: Harassment is outside the LlamaGuard taxonomy
Performance Analysis
CoPE-A demonstrates state-of-the-art performance across multiple policy interpretation evaluation metrics, with significant improvements over comparable models. It excels even in hate speech detection, typically one of the most subjective and difficult harm areas. Overall, the model shows balanced precision and recall, making it suitable for production deployment in at-scale content moderation and content classification systems.
Intended Applications
Primary Use Cases
Content Labeling
- Real-time content moderation
- Batch processing of content
LLM Guardrails
- Input prompt risk assessment
- Output answer risk assessment
- NB: Not yet optimized for chat format (see the guardrail sketch after these use cases)
Content Scoring
- Feature generation for social feed ranking
- Content quality assessment & measurement
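To make the LLM guardrails use case concrete, here is a minimal wiring sketch. The `classify` helper is assumed to wrap the inference snippet from the Output Format section and return 0 or 1, and `llm_generate` stands in for whichever model produces the answer.

```python
# Illustrative guardrail wiring around another LLM.
def classify(policy_text: str, text: str) -> int:
    """Assumed wrapper around the CoPE inference snippet above; returns 0 or 1."""
    ...

def guarded_generate(user_prompt: str, policy_text: str, llm_generate) -> str:
    # Screen the incoming prompt before it reaches the LLM.
    if classify(policy_text, user_prompt) == 1:
        return "Input rejected by content policy."
    answer = llm_generate(user_prompt)
    # Screen the LLM's answer before it is returned to the user.
    if classify(policy_text, answer) == 1:
        return "Response withheld by content policy."
    return answer
```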
Prohibited Uses
- Surveillance applications
- Applications requiring external fact verification
- Use cases beyond stated technical limitations
Limitations and Constraints
Current Limitations
- Text Processing: Limited to 8K tokens (combined policy and content); a length-check sketch follows this list
- Language Support: Currently optimized for US English only. Performance will degrade for other languages and locales.
- Knowledge Constraints: Cannot make classifications requiring external verification (e.g., misinformation) unless explicitly defined in the provided policy
- Scope: Binary classification only (i.e., presence/absence of matching labels)
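Since the 8K-token budget covers the policy and the content together, it can be worth checking prompt length before inference. A minimal sketch, assuming the `build_prompt` helper and `tokenizer` from the snippets above and an 8,192-token limit:

```python
# Check that a policy + content prompt fits within the context budget.
# The 8,192-token limit is an assumption; adjust it to your deployment.
def fits_context(policy_text: str, content_text: str, tokenizer, limit: int = 8192) -> bool:
    prompt = build_prompt(policy_text, content_text)
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= limit
```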
Ethical Considerations
Bias and Fairness
While comprehensive bias evaluation is still ongoing, users should:
- Implement careful policy design to mitigate potential biases
- Monitor classification patterns across different demographic groups
- Contribute problematic examples to our bias assessment efforts
Safety Measures
The model's binary classification nature inherently limits certain risks, but users should:
- Maintain appropriate human oversight
- Regularly audit classification decisions
- Implement robust observability systems
Maintenance and Updates
Update Schedule
- Quarterly releases planned
- Regular performance improvements
- Community-driven feature enhancements
Future Roadmap Focus
- Expansion to new harm areas
- Language and locale support
- Multi-modality (e.g., images)
- Performance optimizations
- Novel evaluation benchmarks
Community and Support
General Resources
For any technical questions or comments, please join our HuggingFace community forum. You can share your feedback, suggest new areas, or pick our brains about anything. If you’d prefer a more private discussion, you can also submit your feedback via this form.
Pilot Partner Program
We are currently accepting a limited number of pilot partners to test real-world deployments of CoPE. Partners receive early access to the model weights as well as technical training and support with custom policy development and fine-tuning. If you are interested in joining our trusted partner program, contact us at [email protected].
About the Developer
CoPE-A is developed and maintained by Zentropi, an AI Trust & Safety company focused on making content classification simple. The project represents a collaborative effort between industry experts and academic researchers to advance the state of the art in content labeling technology.
Last Updated: December 19, 2024