acul3 committed on
Commit 3ed04bb
1 Parent(s): 3c6a454

Update README.md

Files changed (1)
  1. README.md +56 -38
README.md CHANGED
@@ -1,58 +1,76 @@
  ---
  license: other
- base_model: "bahasa-4b"
- tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: v3
- results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # v3

- This model is a fine-tuned version of [/home/acul/data_nvme5/bahasa-4b/saves/Qwen1.5-4B/full/train_2024-03-27-23-212-110/checkpoint-210000](https://huggingface.co//home/acul/data_nvme5/bahasa-4b/saves/Qwen1.5-4B/full/train_2024-03-27-23-212-110/checkpoint-210000) on the indo_instruct_3 dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 2
- - total_train_batch_size: 4
- - total_eval_batch_size: 16
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - num_epochs: 4.0

- ### Training results

- ### Framework versions

- - Transformers 4.39.1
- - Pytorch 2.1.2+cu121
- - Datasets 2.18.0
- - Tokenizers 0.15.2
  ---
+ language:
+ - id
  license: other
+ license_name: tongyi-qianwen
  ---

+ # Bahasa-4b Model Report

+ ## Model Name
+ **Bahasa-4b**

+ ## Model Detail
+ Bahasa-4b is the result of continued training of qwen-4b on a 10-billion-scale corpus of high-quality Indonesian text. It outperforms some 4b models, and even some 7b models, on Indonesian tasks.

+ ## Model Developers
+ Bahasa AI

+ ## Intended Use
+ This model is intended for a range of NLP tasks that require understanding and generating Indonesian. It is suitable for applications such as question answering, sentiment analysis, document summarization, and more.
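
+ As a minimal sketch of such use (assuming a recent `transformers` version whose text-generation pipeline accepts chat-style messages, and reusing the `Bahasalab/Bahasa-4b-chat-v2` checkpoint from the example further down this card), an Indonesian summarization prompt might look like this:

+ ```python
+ from transformers import pipeline

+ # Illustrative quick-start only; see the full generate() example below for
+ # this card's own usage pattern.
+ pipe = pipeline(
+     "text-generation",
+     model="Bahasalab/Bahasa-4b-chat-v2",
+     torch_dtype="auto",
+     device_map="auto",
+ )

+ messages = [
+     {"role": "system", "content": "Kamu adalah asisten yang membantu"},  # "You are a helpful assistant"
+     {"role": "user", "content": "Ringkas dalam satu kalimat: Jakarta adalah ibu kota Indonesia dan pusat ekonomi terbesar di negara ini."},  # "Summarize in one sentence: ..."
+ ]
+ out = pipe(messages, max_new_tokens=128)
+ print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
+ ```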

+ ## Training Data
+ Bahasa-4b was trained on a 10 billion subset of Indonesian data drawn from a collected pool of 100 billion.

+ ## Benchmarks
+ The following table shows the performance of Bahasa-4b compared to the models Sailor_4b and Mistral-7B-v0.1 across several benchmarks:

+ | Dataset         | Version | Metric | Mode | Sailor_4b | Bahasa-4b-hf | Mistral-7B-v0.1 |
+ |-----------------|---------|--------|------|-----------|--------------|-----------------|
+ | tydiqa-id       | 0e9309  | EM     | gen  | 53.98     | 55.04        | 63.54           |
+ | tydiqa-id       | 0e9309  | F1     | gen  | 73.48     | 75.39        | 78.73           |
+ | xcopa-id        | 36c11c  | EM     | ppl  | 69.2      | 73.2         | 62.40           |
+ | xcopa-id        | 36c11c  | F1     | ppl  | 69.2      | 73.2         | -               |
+ | m3exam-id-ppl   | ede415  | EM     | ppl  | 31.27     | 44.47        | 26.68           |
+ | belebele-id-ppl | 7fe030  | EM     | ppl  | 41.33     | 42.33        | 41.33           |
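
+ For reference, EM and F1 here follow the usual extractive-QA conventions: EM is an exact (normalized) string match and F1 is token-overlap F1. The sketch below only illustrates how such scores are commonly computed; it is not the scoring script behind the numbers above:

+ ```python
+ from collections import Counter

+ def exact_match(prediction: str, reference: str) -> float:
+     # 1.0 if the normalized strings are identical, else 0.0
+     return float(prediction.strip().lower() == reference.strip().lower())

+ def token_f1(prediction: str, reference: str) -> float:
+     # Token-overlap F1 between prediction and reference
+     pred_tokens = prediction.lower().split()
+     ref_tokens = reference.lower().split()
+     common = Counter(pred_tokens) & Counter(ref_tokens)
+     overlap = sum(common.values())
+     if overlap == 0:
+         return 0.0
+     precision = overlap / len(pred_tokens)
+     recall = overlap / len(ref_tokens)
+     return 2 * precision * recall / (precision + recall)

+ print(exact_match("Soekarno", "soekarno"))  # 1.0
+ print(round(token_f1("presiden pertama Indonesia", "presiden Indonesia"), 2))  # 0.8
+ ```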

+ Example usage with the `transformers` library:

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ device = "cuda"  # the device the tokenized inputs are moved to

+ model = AutoModelForCausalLM.from_pretrained(
+     "Bahasalab/Bahasa-4b-chat-v2",
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("Bahasalab/Bahasa-4b-chat")

+ # Build a chat prompt with the model's chat template.
+ # System: "You are a helpful assistant"; user: "who are you"
+ messages = [
+     {"role": "system", "content": "Kamu adalah asisten yang membantu"},
+     {"role": "user", "content": "kamu siapa"}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )

+ model_inputs = tokenizer([text], return_tensors="pt").to(device)

+ generated_ids = model.generate(
+     input_ids=model_inputs.input_ids,
+     attention_mask=model_inputs.attention_mask,
+     max_new_tokens=512,
+     eos_token_id=tokenizer.eos_token_id
+ )
+ # Drop the prompt tokens so only the newly generated reply is decoded.
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]

+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
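
+ As an optional variant (a sketch, not part of the original example), the same call can stream tokens to stdout with `TextStreamer`, reusing `model`, `tokenizer`, and `model_inputs` from above:

+ ```python
+ from transformers import TextStreamer

+ # Prints tokens as they are generated, skipping the prompt and special tokens.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(
+     input_ids=model_inputs.input_ids,
+     attention_mask=model_inputs.attention_mask,
+     max_new_tokens=512,
+     eos_token_id=tokenizer.eos_token_id,
+     streamer=streamer,
+ )
+ ```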

+ These results show that Bahasa-4b consistently outperforms Sailor_4b on Indonesian language tasks, improving both EM (Exact Match) and F1 on the datasets above, and that it remains competitive with Mistral-7B-v0.1.