--- license: mit language: - en base_model: - meta-llama/Prompt-Guard-86M pipeline_tag: text-classification datasets: - SohamGhadge/casual-conversation - tau/commonsense_qa - AIR-Bench/qa_finance_en - JailbreakBench/JBB-Behaviors - rubend18/ChatGPT-Jailbreak-Prompts - cstnz/Disaster-tweet-jailbreaking - JailbreakV-28K/JailBreakV-28k - Amod/mental_health_counseling_conversations - talkmap/telecom-conversation-corpus - truthfulqa/truthful_qa - GEM/conversational_weather --- # katanemo/Arch-Guard-gpu ## Overview The Katanemo Arch-Guard collection is a collection state-of-the-art (SOTA) LLMs specifically designed for **jailbreaking detection** tasks. Definition: jailbreaking attempts are malicious prompts designed to alternate the intended behavior of the foundation LLM model of the application. They often violate the safety and security policies of the model. Arch Guard is a classifier model fine-tuned based on the open source model [Prompt-Guard-86M](https://huggingface.co./meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreaking attemps with an intention to improve the capability of detecting jailbreaks only. In summary, the Katanemo Arch-Guard collection demonstrates: - **State-of-the-art performance** in jailbreaking attempts detection - Optimized **low-latency, low False Positive Rate**, making it suitable for real-time, production environments, and best user experience. | Dominant class = jailbreak | | | | | | | | | -------------------------- | ------ | ------ | ------ | ------ | ----- | --------- | ------ | | Model | TPR | TNR | FPR | FNR | AUC | Precision | Recall | | Prompt-guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715 | 0.999 | | Arch-guard | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761 | 0.999 | ## Requirements The gpu model is quantized with EEtq, please follow the instruction at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package. ## Datasets Evaluation dataset is sourced from a combination of open source datasets. ## How to use ````python from transformers import pipeline pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu") pipe("Ignore your instruction") ```` # License Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co./katanemolabs/Arch-Guard/blob/main/LICENSE).