CoPE-A: The COntent Policy Evaluator Model
Model Overview
CoPE-A-9B is a special-purpose content policy evaluation model designed to perform accurate content labeling based on developer-customizable criteria. Built by trust & safety veterans, CoPE-A represents a significant advancement in content classification technology, offering steerable and precise policy interpretation while maintaining a small and efficient model size.
Key Features
- Policy-adaptive content evaluation
- High-accuracy binary classification
- Steerable (no fixed taxonomy/definitions)
- Validated across multiple harm categories
Getting Started
For Developers
- Download the google/gemma-2-9b base model
- Merge it with the CoPE adapter (i.e., this model)
- Perform inference using the prompt template below (a minimal loading sketch follows this list)
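As a concrete starting point, the following is a minimal sketch of the first two steps using transformers and peft. The adapter repository id zentropi-ai/cope-a-9b, the bfloat16 dtype, and the device mapping are assumptions to adapt to your environment.

```python
# Minimal sketch: load the Gemma-2-9b base model and merge the CoPE LoRA adapter.
# Assumes you have access to the gated google/gemma-2-9b weights and that the
# adapter is available as zentropi-ai/cope-a-9b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "google/gemma-2-9b"
ADAPTER_ID = "zentropi-ai/cope-a-9b"  # this model

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter, then merge it into the base weights so inference
# runs as a single dense model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model = model.merge_and_unload()
model.eval()
```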
Try the Demo
- Test out a live version of the model in this HF Space
Technical Specifications
Model Architecture
CoPE-A-9B is a ~9B-parameter model built on Google's Gemma-2-9b architecture using LoRA fine-tuning. This architecture enables efficient deployment while maintaining high accuracy in content policy evaluation tasks.
Input Format
The model accepts input in a structured format:
INSTRUCTIONS
============
Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.
POLICY
======
# Policy
## Overview
[Short statement of what subject matter the policy is meant to classify]
## Definition of Terms
- [Definition Term 1]: Precise definition of a specific word or phrase that will be used repeatedly in the policy.
- Etc.
## Interpretation of Language
- [Instructions on how to handle ambiguous language for the purposes of this policy]
- [Other relevant guidance for how to go about applying this specific policy]
- Etc.
## Definition of Labels
### (HS): Hate Speech
#### Includes
- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to qualify for this label. These can, but do not need to, include caveats specific to this criterion
- Etc.
#### Excludes
- [Criteria Name 1]: Precise definition of a specific characteristic of content that causes it to NOT qualify for this label. These can, but do not need to, include caveats specific to this criterion
- Etc.
CONTENT
=======
{content_text}
ANSWER
======
See our demo for an example of a full policy.
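For reference, here is one way to assemble that template programmatically. The helper below is illustrative; the exact whitespace between sections is an assumption, so check the demo if your results look off.

```python
# Illustrative helper that fills the prompt template above with a policy
# document and a piece of content.
PROMPT_TEMPLATE = """INSTRUCTIONS
============
Examine the given POLICY and determine if the given CONTENT meets the criteria for ANY of the LABELS. Answer "1" if yes, and "0" if no.

POLICY
======
{policy_text}

CONTENT
=======
{content_text}

ANSWER
======
"""

def build_prompt(policy_text: str, content_text: str) -> str:
    return PROMPT_TEMPLATE.format(policy_text=policy_text, content_text=content_text)
```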
Output Format
CoPE-A provides binary classification outputs:
- 0: None of the policy labels apply
- 1: One or more policy labels apply
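Putting the pieces together, here is a sketch of one classification call. It assumes the `model`, `tokenizer`, and `build_prompt` objects from the snippets above; the policy and content strings are placeholders.

```python
# Sketch of running inference and reading the binary answer.
import torch

policy_text = "..."   # your full policy document (see the demo for an example)
content_text = "..."  # the content to classify

prompt = build_prompt(policy_text, content_text)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1, do_sample=False)

# The answer is the first generated token: "1" if any label applies, else "0".
answer = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
).strip()
label = int(answer) if answer in {"0", "1"} else None
```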
System Requirements
- Compatible with commodity GPU hardware
- Sub-second inference time without any optimization
- Deployment requirements match Gemma-2-9b base model specifications
Training Details
Training Methodology
The model draws upon research into designing more pluralistic policy interpreters. It employs a novel training methodology (see research talk) that moves beyond simple policy memorization and achieves true policy interpretation capabilities by:
- Training across conflicting policy formulations
- Focusing on generalizable policy understanding
- Emphasizing interpretation consistency
Training Data
CoPE-A's training dataset was carefully curated to ensure robust policy interpretation:
- Approximately 60,000 labels across unique policy / content pairs
- Policy texts created by the CoPE team
- Content data sourced from publicly-accessible internet forums
- Combined automated and manual processes to produce golden labels
- Coverage across multiple harm areas, including but not limited to:
  - Hate speech
  - Sexual content
  - Self-harm
  - Harassment
  - Toxicity
Performance Evaluation
Methodology
Evaluation in the policy interpreter space is extremely difficult since most benchmark datasets do not publish the exact policy criteria given to labelers. As such, with the exception of the Ethos hate speech benchmark, we largely had to validate our model on internal datasets. However, to make our analyses more rigorous, we only evaluated our model on unique policies and unique content samples that were held out and not present in our training corpus.
It was also difficult to evaluate certain peer models when the harm areas in question were outside those models’ taxonomies. We have accordingly dropped them from the relevant data tables as they performed very poorly (as expected). CoPE, by contrast, has no fixed taxonomy and can accept any arbitrary content policy, similar to foundation models like GPT-4o, which we were able to evaluate in all areas.
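The scores reported below are standard binary precision, recall, and F1 computed over these held-out policy/content pairs. As a point of reference, a minimal sketch of how such scores can be computed with scikit-learn (the label lists are placeholders):

```python
# Minimal sketch of computing the precision/recall/F1 scores reported below
# from gold labels and model predictions (both lists of 0/1 integers).
from sklearn.metrics import precision_score, recall_score, f1_score

gold = [1, 0, 1, 1, 0]         # placeholder gold labels from a held-out set
predicted = [1, 0, 0, 1, 1]    # placeholder model outputs

print(f"Precision: {precision_score(gold, predicted):.2%}")
print(f"Recall:    {recall_score(gold, predicted):.2%}")
print(f"F1 Score:  {f1_score(gold, predicted):.2%}")
```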
Benchmark Results
Hate Speech Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 89% | 93% | 91% |
GPT-4o | 97% | 78% | 87% |
Llama-3.1-8B | 59% | 96% | 73% |
LlamaGuard3-8B | 88% | 64% | 74% |
ShieldGemma-9B | 68% | 98% | 80% |
Hate Speech - Ethos Benchmark (External Set)
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 80% | 88% | 84% |
GPT-4o | 91% | 78% | 84% |
Llama-3.1-8B | 62% | 92% | 74% |
LlamaGuard3-8B | 87% | 79% | 83% |
ShieldGemma-9B | 82% | 82% | 82% |
Toxic Speech Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 93% | 87% | 90% |
GPT-4o | 64% | 89% | 75% |
Llama-3.1-8B | 33% | 96% | 49% |
Note: Toxicity is outside the ShieldGemma & LlamaGuard taxonomies
Sexual Content Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 96% | 83% | 89% |
GPT-4o | 95% | 72% | 82% |
Llama-3.1-8B | 48% | 95% | 64% |
LlamaGuard3-8B | 100% | 43% | 60% |
ShieldGemma-9B | 96% | 76% | 85% |
Self-Harm Content Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 83% | 93% | 88% |
GPT-4o | 84% | 93% | 88% |
Llama-3.1-8B | 56% | 96% | 70% |
LlamaGuard3-8B | 65% | 84% | 73% |
ShieldGemma-9B | 69% | 89% | 78% |
Harassment Classification
Model | Precision | Recall | F1 Score |
---|---|---|---|
CoPE-A-9B | 69% | 78% | 73% |
GPT-4o | 100% | 17% | 30% |
Llama-3.1-8B | 35% | 87% | 50% |
ShieldGemma-9B | 49% | 55% | 52% |
Note: Harassment is outside the LlamaGuard taxonomy
Performance Analysis
CoPE-A demonstrates state-of-the-art performance across multiple policy interpretation evaluation metrics, with significant improvements over comparable models. It excels even in hate speech detection, typically one of the most subjective and difficult harm areas. Overall, the model shows balanced precision and recall, making it suitable for production deployment in at-scale content moderation and content classification systems.
Intended Applications
Primary Use Cases
Content Labeling
- Real-time content moderation
- Batch processing of content
LLM Guardrails
- Input prompt risk assessment
- Output answer risk assessment
- NB: Not yet optimized for chat format (see the guardrail sketch after these use cases)
Content Scoring
- Feature generation for social feed ranking
- Content quality assessment & measurement
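To make the LLM guardrails use case concrete, here is a minimal wiring sketch. The `classify` helper is assumed to wrap the inference snippet from the Output Format section and return 0 or 1, and `llm_generate` stands in for whichever model produces the answer.

```python
# Illustrative guardrail wiring around another LLM.
def classify(policy_text: str, text: str) -> int:
    """Assumed wrapper around the CoPE inference snippet above; returns 0 or 1."""
    ...

def guarded_generate(user_prompt: str, policy_text: str, llm_generate) -> str:
    # Screen the incoming prompt before it reaches the LLM.
    if classify(policy_text, user_prompt) == 1:
        return "Input rejected by content policy."
    answer = llm_generate(user_prompt)
    # Screen the LLM's answer before it is returned to the user.
    if classify(policy_text, answer) == 1:
        return "Response withheld by content policy."
    return answer
```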
Prohibited Uses
- Surveillance applications
- Applications requiring external fact verification
- Use cases beyond stated technical limitations
Limitations and Constraints
Current Limitations
- Text Processing: Limited to 8K tokens (combined policy and content); a length-check sketch follows this list
- Language Support: Currently optimized for US English only. Performance will degrade for other languages and locales.
- Knowledge Constraints: Cannot make classifications requiring external verification (e.g., misinformation) unless explicitly defined in the provided policy
- Scope: Binary classification only (i.e., presence/absence of matching labels)
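Since the 8K-token budget covers the policy and the content together, it can be worth checking prompt length before inference. A minimal sketch, assuming the `build_prompt` helper and `tokenizer` from the snippets above and an 8,192-token limit:

```python
# Check that a policy + content prompt fits within the context budget.
# The 8,192-token limit is an assumption; adjust it to your deployment.
def fits_context(policy_text: str, content_text: str, tokenizer, limit: int = 8192) -> bool:
    prompt = build_prompt(policy_text, content_text)
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= limit
```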
Ethical Considerations
Bias and Fairness
While comprehensive bias evaluation is still ongoing, users should:
- Implement careful policy design to mitigate potential biases
- Monitor classification patterns across different demographic groups
- Contribute problematic examples to our bias assessment efforts
Safety Measures
The model's binary classification nature inherently limits certain risks, but users should:
- Maintain appropriate human oversight
- Regularly audit classification decisions
- Implement robust observability systems
Maintenance and Updates
Update Schedule
- Quarterly releases planned
- Regular performance improvements
- Community-driven feature enhancements
Future Roadmap Focus
- Expansion to new harm areas
- Language and locale support
- Multi-modality (e.g., images)
- Performance optimizations
- Novel evaluation benchmarks
Community and Support
General Resources
For any technical questions or comments, please join our HuggingFace community forum. You can share your feedback, suggest new areas, or pick our brains about anything. If you’d prefer a more private discussion, you can also submit your feedback via this form.
Pilot Partner Program
We are currently accepting a limited number of pilot partners to test real-world deployments of CoPE. Partners receive early access to the model weights as well as technical training and support with custom policy development and fine-tuning. If you are interested in joining our trusted partner program, contact us at [email protected].
About the Developer
CoPE-A is developed and maintained by Zentropi, an AI Trust & Safety company focused on making content classification simple. The project represents a collaborative effort between industry experts and academic researchers to advance the state of the art in content labeling technology.
Last Updated: December 19, 2024