openGPT-X
/

Teuken-7B-instruct-research-v0.4

+---
+language:
+- de
+- bg
+- cs
+- da
+- el
+- en
+- es
+- et
+- fi
+- fr
+- ga
+- hr
+- hu
+- it
+- lt
+- lv
+- mt
+- nl
+- pl
+- pt
+- ro
+- sl
+- sv
+- sk
+metrics:
+- accuracy
+- bleu
+pipeline_tag: text-generation
+library_name: transformers
+base_model:
+- openGPT-X/Teuken-7B-base-v0.4
+license: apache-2.0
+---
+# Model Card for Teuken-7B-chat-v0.4
+Teuken-7B-chat-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4.
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** Fraunhofer IAIS
+- **Funded by:** German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
+- **Model type:** Transformer-based decoder-only model
+- **Language(s) (NLP):** bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
+- **Shared by:** Fraunhofer IAIS
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+Teuken-7B-chat-v0.4 is intended for commercial and research use in all 24 official languages of the European Union. Because it focuses on covering all 24 EU languages, it produces more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for multilingual tasks.
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+The model is not intended for use in math and coding tasks.
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+Teuken-7B-chat-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 that is not completely free from biases and hallucinations.
+## How to Get Started with the Model
+### Usage
+The model requires the transformers, sentencepiece, and torch libraries.
+The prompt template for the fine-tuned model is defined below, followed by an example of how to load the model and generate a response:
+```python
+user = "Hi!"
+lang_code = "DE"
+# System message per language code; adjacent string literals are concatenated
+system_messages = {
+    "EN": "A chat between a human and an artificial intelligence assistant."
+    " The assistant gives helpful and polite answers to the human's questions.",
+    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
+    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
+}
+prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"
+```
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "openGPT-X/Teuken-7B-chat-v0.4"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+# Load tokenizer and model; trust_remote_code=True runs custom code shipped with the repository
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
+
+# `prompt` is the string built from the prompt template above
+inputs = tokenizer(prompt, return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to the same device as the model
+output = model.generate(input_ids=inputs["input_ids"], max_new_tokens=1000, do_sample=True)
+
+# generate() returns a batch of sequences; decode the first (and only) one
+result = tokenizer.decode(output[0])
+print(result)
+```
+This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
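For repeated use, the prompt construction shown above can be wrapped in a small function. The following is only a sketch: `build_prompt` and `SYSTEM_MESSAGES` are hypothetical names, not part of the model's code, and only the EN and DE system messages given in this card are included. It covers a single user turn, since the card does not specify a multi-turn format.

```python
# System messages copied from the prompt template in this card (only EN and DE are given there).
SYSTEM_MESSAGES = {
    "EN": "A chat between a human and an artificial intelligence assistant."
    " The assistant gives helpful and polite answers to the human's questions.",
    "DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
    " Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}


def build_prompt(user: str, lang_code: str = "DE") -> str:
    """Hypothetical helper: fill the single-turn prompt template for one user message."""
    code = lang_code.upper()
    if code not in SYSTEM_MESSAGES:
        # Fail loudly instead of silently falling back to another language.
        raise ValueError(f"No system message defined for language code {lang_code!r}")
    return f"System: {SYSTEM_MESSAGES[code]}\nUser: {user}\nAssistant:<s>"


prompt = build_prompt("Hi!", "DE")
```

The resulting `prompt` string can be passed to the tokenizer exactly as in the example above.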
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+To compose the final instruction-tuning dataset, termed "Honey", we first include all German examples. We then aim to include roughly the same number of English examples as German examples, selected as follows:
+  1. Add all multi-turn examples
+  2. Add the entire code_alpaca dataset subset
+  3. Add the entire lmsys_chat_1m_high_quality_train_en dataset subset
+  4. For the remaining dataset subsets ("open_orca", "evol_instruct_143k", "evol_instruct_70k", "bactrianx_EN"), add the examples with the highest reward scores ("quality score") so that each dataset subset contributes an equal number of high-quality examples
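Step 4 of the list above could be implemented roughly as follows. This is an illustrative sketch, not the authors' pipeline: the function name `top_scored_equal_share`, the record field `quality_score`, and the `budget` parameter are assumptions, and the toy records are made up for demonstration.

```python
from typing import Any, Dict, List

Example = Dict[str, Any]  # hypothetical record, e.g. {"text": ..., "quality_score": ...}


def top_scored_equal_share(subsets: Dict[str, List[Example]], budget: int) -> List[Example]:
    """Take the highest-scoring examples from each subset, the same number from each."""
    if not subsets:
        return []
    per_subset = budget // len(subsets)  # equal share of the remaining budget
    selected: List[Example] = []
    for examples in subsets.values():
        ranked = sorted(examples, key=lambda ex: ex["quality_score"], reverse=True)
        selected.extend(ranked[:per_subset])
    return selected


# Toy data (made up) standing in for two of the remaining English subsets.
toy_subsets = {
    "open_orca": [
        {"text": "a", "quality_score": 0.9},
        {"text": "b", "quality_score": 0.1},
        {"text": "c", "quality_score": 0.5},
    ],
    "evol_instruct_70k": [
        {"text": "d", "quality_score": 0.2},
        {"text": "e", "quality_score": 0.8},
    ],
}
picked = top_scored_equal_share(toy_subsets, budget=4)
```

Here `budget` would be the number of English examples still needed to match the German count after steps 1 to 3. A subset with fewer than its share of examples simply contributes everything it has.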
+## Dataset Sizes Before Composition
+### English
+### German
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+Instruction fine-tuned version of Teuken-7B-base-v0.4.
+#### Training Hyperparameters
+- **Training regime:** bf16 mixed precision
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the European LLM Leaderboard (https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
+## Technical Specifications
+### Model Architecture and Objective
+| Hyper-Parameter            | Value    |
+|----------------------------|----------|
+| Training Objective         | CLM      |
+| Activation Function        | SwiGLU   |
+| Seq Length                 | 4096     |
+| Position Embeddings        | Rotary   |
+| Num Layers                 | 32       |
+| Hidden Size                | 4096     |
+| FFN Hidden Size            | 13440    |
+| Num Attention Heads        | 32       |
+| Head Dim                   | 128      |
+| Group Query Attention      | yes      |
+| Num Query Groups           | 2        |
+| Normalization              | RMSNorm  |
+| Learning rate              | 3e-4     |
+| Min learning rate          | 3e-5     |
+| Disable bias in linear     | yes      |
+| Hidden dropout             | 0.0      |
+| Attention dropout          | 0.0      |
+| Optimizer                  | AdamW    |
+| Beta1                      | 0.9      |
+| Beta2                      | 0.95     |
+| Sequence-parallelism       |          |
+| Data-type                  | bf16     |
+| Recompute-activations      | yes      |
+| Distributed-optimizers     | yes      |
+| Model Initialization       |          |
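As a rough plausibility check, the table values can be turned into a parameter count for the transformer blocks. The sketch below assumes a conventional layout that the card does not spell out: query and output projections of size hidden x hidden, key and value projections shared across `Num Query Groups` groups, a SwiGLU feed-forward block with three weight matrices, no bias terms, and two RMSNorm weight vectors per layer. Embedding and output layers are left out because the vocabulary size is not listed here.

```python
# Values taken from the table above.
hidden_size = 4096
ffn_hidden_size = 13440
num_layers = 32
num_attention_heads = 32
head_dim = 128
num_query_groups = 2

q_dim = num_attention_heads * head_dim  # width of the query projection
kv_dim = num_query_groups * head_dim    # width of the shared key/value projections

# Attention: Q and output projections plus the (smaller) K and V projections.
attention_params = 2 * hidden_size * q_dim + 2 * hidden_size * kv_dim
# SwiGLU feed-forward: gate, up, and down projections (assumed layout).
ffn_params = 3 * hidden_size * ffn_hidden_size
# Two RMSNorm weight vectors per layer (assumed placement).
norm_params = 2 * hidden_size

per_layer = attention_params + ffn_params + norm_params
total_without_embeddings = num_layers * per_layer
print(f"{total_without_embeddings:,} parameters in the transformer blocks")
```

Under these assumptions the 32 blocks account for roughly 6.4 billion parameters; the embedding and output layers would add to that figure.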
+**BibTeX:**
+TODO
+**APA:**
+TODO
+## Model Card Contact
+<div class="hf-card">
+    <h2>Contact Information</h2>
+    <p>You can reach out to the following model card contact:</p>
+    <ul>
+        <li>
+            <a href="https://huggingface.co/iwendler" target="_blank">OpenGPT-X</a>
+            - <a href="[email protected]">[email protected]</a>
+        </li>
+    </ul>
+</div>