CCI3-HQ-Intermediate-Checkpoints

In the CCI3.0-HQ technical report, we compared different datasets directly through end-to-end pre-training experiments. We evaluated both the final checkpoints and the intermediate checkpoints of model training in two experiments: the Mixed Dataset Experiment and the Chinese Dataset Experiment.

To monitor and compare how the models trained on each dataset performed throughout training, we saved an intermediate checkpoint approximately every 20 billion training tokens.

Below, we list all checkpoints from the models trained in all comparison experiments. The suffix "-zh" marks checkpoints from the Chinese Dataset Experiment, and the suffix "-mix" marks checkpoints from the Mixed Dataset Experiment.
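
As an illustration of the suffix convention, the following minimal Python sketch groups checkpoint names by experiment. The names in `EXAMPLE_NAMES` and the helpers `experiment_for` and `group_by_experiment` are hypothetical placeholders introduced for this example; the actual checkpoint names in the repository may differ.

```python
from collections import defaultdict

# Hypothetical checkpoint names, for illustration only.
# The real names in the repository may differ.
EXAMPLE_NAMES = [
    "example-dataset-a-zh",
    "example-dataset-a-mix",
    "example-dataset-b-zh",
]

# Suffix convention described in the text above.
SUFFIX_TO_EXPERIMENT = {
    "-zh": "Chinese Dataset Experiment",
    "-mix": "Mixed Dataset Experiment",
}


def experiment_for(name):
    """Return the experiment a checkpoint name belongs to, or None if no suffix matches."""
    for suffix, experiment in SUFFIX_TO_EXPERIMENT.items():
        if name.endswith(suffix):
            return experiment
    return None


def group_by_experiment(names):
    """Group checkpoint names by the experiment implied by their suffix."""
    groups = defaultdict(list)
    for name in names:
        groups[experiment_for(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for experiment, members in group_by_experiment(EXAMPLE_NAMES).items():
        print(experiment, members)
```

Names that carry neither suffix are grouped under `None`, so they remain visible rather than being silently dropped.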

Citation Information

You can cite our paper:

@misc{wang2024cci30hqlargescalechinesedataset,
      title={CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models}, 
      author={Liangdong Wang and Bo-Wen Zhang and Chengwei Wu and Hanyu Zhao and Xiaofeng Shi and Shuhao Gu and Jijie Li and Quanyue Ma and TengFei Pan and Guang Liu},
      year={2024},
      eprint={2410.18505},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.18505}, 
}