Consider Reporting Performance of Sailor-Chat
Thank you for your hard work, and congratulations on the new release of sea-lion-7b-instruct! I appreciate you including Sailor as one of the baselines. However, I noticed that among all the compared models, only Sailor-7B is a base model that has not been fine-tuned on an instruction-following dataset. As is well known, it is more appropriate to compare chat models (i.e., instruction-tuned models) with each other than to compare a chat model against a base model that lacks instruction tuning.
If possible, could you please add the results for Sailor-7B-Chat, the instruction-tuned version of Sailor? That would allow for a fairer comparison across models. Thank you for considering this request.
Dear @SivilTaram,
Thank you very much for writing in, and congratulations on the release of Sailor; it is quite impressive.
We had every intention of including benchmarks for Sailor-7B-Chat; unfortunately, only the Sailor base model was available by the end of our evaluation phase.
We have queued this task in our backlog and will have the benchmarks for Sailor-7B-Chat added to the model card within a week or two. In the meantime, we will update the model card to clarify that the current Sailor benchmarks cover the base model only.
Thank you for your patience; we will update this thread once the new benchmarks are in.
@RaymondAISG Thanks for the quick response! Looking forward to the evaluation results :-)
Dear @SivilTaram,
Thank you for your patience! We have added the evaluation results for Sailor-7B-Chat to the table. There are significant improvements across the board compared to the base model, with the exception of QA; that gap is largely an artifact of the chat model answering in full sentences rather than extracted spans, which span-based QA metrics penalize.
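For context, here is a minimal sketch of standard SQuAD-style exact-match/F1 scoring (the actual evaluation harness used for these benchmarks may differ, and the question/answer pair below is hypothetical). It shows how a factually correct full-sentence answer scores 0 on exact match and only partial F1, while a span answer scores perfectly:

```python
# Sketch of SQuAD-style EM/F1 scoring: full-sentence answers are penalized
# relative to extracted spans even when they contain the correct answer.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "Jakarta"
span_answer = "Jakarta"                                   # base-model style
sentence_answer = "The capital of Indonesia is Jakarta."  # chat-model style

print(exact_match(span_answer, gold), f1_score(span_answer, gold))          # 1.0 1.0
print(exact_match(sentence_answer, gold), f1_score(sentence_answer, gold))  # 0.0 ~0.33
```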
Congratulations on the model as well as the publication!
@weiqipedia Thanks for your hard work! The results are very encouraging!