metadata

license: other
tags:
  - merge
  - not-for-all-audiences
license_name: microsoft-research-license
model-index:
  - name: DarkForest-20B-v2.0
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 63.74
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 86.32
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 59.79
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 56.14
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 77.9
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 23.28
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/DarkForest-20B-v2.0
          name: Open LLM Leaderboard

DarkForest 20B v2.0

Model Details

This model was created in two steps. First, a new 20B model was built from microsoft/Orca-2-13b and KoboldAI/LLaMA2-13B-Erebus-v3; details of the merge are in darkforest_v2_step1.yml.
Then jebcarter/psyonic-cetacean-20B and TeeZee/BigMaid-20B-v1.0 were used to produce the final model; the merge config is in darkforest_v2_step2.yml.
The resulting model has approximately 20 billion parameters.
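As a rough sanity check on the "approximately 20 billion" figure, the sketch below counts parameters for a Llama-2-13B-style network with more decoder layers stacked. The architecture constants (hidden size 5120, MLP size 13824, vocabulary 32000) are the standard Llama-2-13B values. The 62-layer depth is an assumption borrowed from typical 20B stacked merges; it is not taken from darkforest_v2_step1.yml, and the base models may use a slightly different vocabulary size.

```python
# Parameter count for a Llama-2-13B-style decoder with a variable layer count.
# Constants are the standard Llama-2-13B values; the 62-layer depth used below
# is an ASSUMPTION for illustration, not read from the merge config.

HIDDEN = 5120         # model (embedding) dimension
INTERMEDIATE = 13824  # MLP inner dimension
VOCAB = 32000         # tokenizer vocabulary size (assumed; may differ slightly)


def llama_param_count(num_layers: int) -> int:
    """Return the total number of weights for the given number of decoder layers."""
    attention = 4 * HIDDEN * HIDDEN      # q, k, v and o projections
    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up and down projections
    norms = 2 * HIDDEN                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN      # input embeddings + output head
    final_norm = HIDDEN
    return num_layers * per_layer + embeddings + final_norm


if __name__ == "__main__":
    for layers in (40, 62):  # 40 = stock 13B depth, 62 = assumed merged depth
        print(f"{layers} layers: {llama_param_count(layers) / 1e9:.2f}B parameters")
```

Under these assumptions a 40-layer model lands at about 13B parameters and a 62-layer stack at roughly 20B, which is consistent with the size stated above.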

Warning: This model can produce NSFW content!

Results

Main difference from v1.0: the model has a much better sense of humor.
Produces SFW and NSFW content without issues and switches context seamlessly.
Good at following instructions.
Good at tracking multiple characters in one scene.
Very creative; the scenarios it produces are mature and complicated, and the model doesn't shy away from writing about PTSD, mental issues or complicated relationships.
NSFW output is more creative and surprising than typical limaRP output.
Definitely for mature audiences, not only because of the vivid NSFW content but also because of the overall maturity of the stories it produces.
This is NOT Harry Potter level storytelling.

All comments are greatly appreciated. Download it, test it, and if you appreciate my work, consider buying me my fuel:

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	61.19
AI2 Reasoning Challenge (25-Shot)	63.74
HellaSwag (10-Shot)	86.32
MMLU (5-Shot)	59.79
TruthfulQA (0-shot)	56.14
Winogrande (5-shot)	77.90
GSM8k (5-shot)	23.28
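
The Avg. row appears to be the unweighted mean of the six benchmark scores. A minimal check, assuming that is how it was computed (the dictionary below simply restates the table):

```python
# Recompute the leaderboard average from the six scores in the table.
# Assumption: "Avg." is a plain unweighted mean of the six benchmarks.

scores = {
    "AI2 Reasoning Challenge (25-Shot)": 63.74,
    "HellaSwag (10-Shot)": 86.32,
    "MMLU (5-Shot)": 59.79,
    "TruthfulQA (0-shot)": 56.14,
    "Winogrande (5-shot)": 77.90,
    "GSM8k (5-shot)": 23.28,
}

average = sum(scores.values()) / len(scores)

if __name__ == "__main__":
    print(f"Mean of {len(scores)} benchmarks: {average:.3f}")
```

The result agrees with the reported 61.19 to within rounding.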