Replicating AUPRC of 0.624 on ToxicChat: Understanding Model Inference
Hello!
I've been trying to replicate the AUPRC of 0.624 reported for the LlamaGuard-7b model on ToxicChat, and I'd like some guidance on the specifics of the process. I would greatly appreciate insights from anyone familiar with the model or the dataset.
Prompt Configuration: Could someone shed light on the prompt used when evaluating LlamaGuard-7b on ToxicChat? What kind of inputs or context tend to yield the best results?
Dataset Handling: How is the ToxicChat dataset treated during inference? Are there any special preprocessing steps or considerations needed to reproduce the reported score?
Probability for "safe" and "unsafe": How should the probabilities of the words "safe" and "unsafe" be extracted from the model's output? Right now, I take the logits of the words "safe" and "unsafe" and pass them through a softmax, so that the two probabilities add up to 1.
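To make the step above concrete, here is a minimal sketch of the two-way softmax I am describing. The function name `safe_unsafe_probs` and the logit values are made up for illustration; how the two logits are read from LlamaGuard-7b's output is not shown and is exactly the part I am unsure about.

```python
import math
from typing import Tuple


def safe_unsafe_probs(logit_safe: float, logit_unsafe: float) -> Tuple[float, float]:
    """Two-way softmax over the logits of "safe" and "unsafe"."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_safe, logit_unsafe)
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    total = e_safe + e_unsafe
    return e_safe / total, e_unsafe / total


# Hypothetical logits for a single prompt (illustrative values only).
p_safe, p_unsafe = safe_unsafe_probs(2.0, 0.5)
```

Note that this renormalizes over just the two words. Reading the probability of "unsafe" from the softmax over the full vocabulary would give a different score, and I don't know which of the two the reported number is based on.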
Probability Thresholds for AUPRC: In calculating the AUPRC, which probability thresholds are used to determine positive and negative predictions?
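For context, my current understanding (which may not match how the reported score was computed) is that AUPRC does not depend on one fixed threshold: every distinct score acts as a threshold, and the metric summarizes precision and recall across all of them. Below is a minimal sketch of the step-wise average-precision form, the sum of (recall increase) × precision over thresholds, which is the definition scikit-learn's `average_precision_score` documents. The labels and scores are invented for illustration, with 1 meaning toxic and the score standing in for the "unsafe" probability.

```python
def average_precision(labels, scores):
    """Average precision: sum over thresholds of (recall step) * precision."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Sort by descending score; each distinct score is treated as a threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            seen += 1
            i += 1
        recall = tp / n_pos
        precision = tp / seen
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical labels (1 = toxic) and "unsafe" scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
```

If this is right, the open question is less about which thresholds to pick and more about which score is fed into the metric.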
Looking forward to a fruitful discussion!
Thank you!
Same boat here, any progress?
Same here. I'm interested in your progress so far, @JaimeUPM and @Neuraugment.