Capybara and SystemChat-1.1 Preferences with SOTA LLMs
This collection contains the preference versions of both `LDJnr/Capybara` and `abacusai/SystemChat-1.1`, in collaboration with Hugging Face and LDJnr
Viewer • Updated • 16k • 415 • 225Note Starting `LDJnr/Capybara` dataset
abacusai/SystemChat-1.1
Viewer • Updated • 20.2k • 103 • 30Note Starting `abacusai/SystemChat-1.1` dataset
distilabel-internal-testing/Capybara-and-SystemChat-1.1
Viewer • Updated • 36.2k • 34Note Dataset that combines both `LDJnr/Capybara` and `abacusai/SystemChat-1.1` but sharing the same format for the conversations (OpenAI-style), and defining the same columns while keeping source for Capybara, and adding `dataset` as the identifier of the origin dataset
distilabel-internal-testing/Capybara-and-SystemChat-1.1-Text
Viewer • Updated • 36.2k • 39Note Adds a new column on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1` which is `text` and contains the values for the column `messages` with the chat template applied using the ChatML format
distilabel-internal-testing/Capybara-and-SystemChat-1.1-MinHash
Viewer • Updated • 35.6k • 35Note Runs MinHash deduplication (threshold=0.95) on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1-Text` to remove 588 near duplicates from the dataset, before starting off with the generation
distilabel-internal-testing/Capybara-and-SystemChat-1.1-Filtered
Viewer • Updated • 35.2k • 37Note Runs URL filtering on the assistant responses, and also filters out the instances with ChatGPT-ish terms, as @LDJnr kindly provided a list of common ChatGPT-like terms that tend to appear within the generated responses that we want to avoid; on top of `distilabel-internal-testing/Capybara-and-SystemChat-1.1-MinHash`