mbart25-multilingual-summarization-multilarge-cs

This model is a checkpoint of facebook/mbart-large-cc25 fine-tuned on the Multilingual large summarization dataset, which is focused on Czech texts, to produce multilingual summaries.

Task

The model generates multi-sentence summaries in eight different languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization in the Czech language. Supported languages (mBART code : short code): 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'. This list names six languages; the usage configuration below additionally accepts cs and zh.
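The language mapping listed in this section can be kept as a small lookup table. The sketch below is illustrative only: the names MBART_TO_SHORT, SHORT_TO_MBART and to_mbart_code are hypothetical and not taken from the repository, and the table holds only the six pairs spelled out above.

```python
# Hypothetical helper, not part of the repository.
# mBART language code -> short code, exactly as listed in this section.
MBART_TO_SHORT = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
}

# Inverted view: short code -> mBART language code.
SHORT_TO_MBART = {short: code for code, short in MBART_TO_SHORT.items()}


def to_mbart_code(language: str) -> str:
    """Return the mBART code for a short language code, or raise ValueError."""
    try:
        return SHORT_TO_MBART[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
```

For example, to_mbart_code("de") gives "de_DE", while a code that is not in the table raises ValueError instead of silently falling back to another language.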

Usage

The example below assumes that you are using the provided MultilingualSummarizer.ipynb notebook together with the accompanying files from the git repository.

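from collections import OrderedDict

# MultiSummarizer is assumed to be made available by the notebook and the repository files.

## Configuration of summarization pipeline
#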
def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        ## texts to summarize
        #   texts : list of strings, string, or dataset
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        ## OPTIONAL: target (gold) summaries
        #   golds : list of strings, string, or None
        ('golds',
         [
               "target english text1",
               "target english text2",
         ]),
        #('golds', None),
    ])
    return cfg

cfg = summ_config()
msummarizer = MultiSummarizer(**cfg)
ret = msummarizer(**cfg)
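The inference_cfg keys above match argument names of the Hugging Face generate method, where do_sample=True combined with num_beams greater than 1 selects beam-search sampling, so repeated runs can return different summaries. Assuming MultiSummarizer forwards these values to generate unchanged (this is not verified here), a deterministic beam-search variant could be derived as sketched below. The helper name deterministic_inference_cfg is hypothetical.

```python
from collections import OrderedDict

# Keys that only influence sampling-based decoding.
SAMPLING_ONLY_KEYS = ("top_k", "top_p", "temperature")


def deterministic_inference_cfg(inference_cfg):
    """Return a copy of inference_cfg set up for plain beam search.

    Hypothetical helper: it assumes the values end up as keyword
    arguments of the Hugging Face `generate` method.
    """
    cfg = OrderedDict(inference_cfg)  # shallow copy, the input stays untouched
    cfg["do_sample"] = False
    for key in SAMPLING_ONLY_KEYS:
        cfg.pop(key, None)
    return cfg


sampling_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.95),
    ("max_length", 128),
    ("min_length", 10),
])
beam_cfg = deterministic_inference_cfg(sampling_cfg)
```

The resulting beam_cfg keeps num_beams, max_length and min_length and could replace the "inference_cfg" entry when reproducible output matters more than variety.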

Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.

Train set:        3 464 563 docs
Validation set:     121 260 docs

dataset	fragment compression	fragment density	fragment coverage	avg document nsent	avg document nwords	avg summary nsent	avg summary nwords	document count
cnc	7.388	0.303	0.088	16.121	316.912	3.272	46.805	750K
sumeczech	11.769	0.471	0.115	27.857	415.711	2.765	38.644	1M
cnndm	13.688	2.983	0.538	32.783	676.026	4.134	54.036	300K
xsum	18.378	0.479	0.194	18.607	369.134	1.000	21.127	225K
mlsum/tu	8.666	5.418	0.461	14.271	214.496	1.793	25.675	274K
mlsum/de	24.741	8.235	0.469	32.544	539.653	1.951	23.077	243K
mlsum/fr	24.388	2.688	0.424	24.533	612.080	1.320	26.93	425K
mlsum/es	36.185	3.705	0.510	31.914	746.927	1.142	21.671	291K
mlsum/ru	78.909	1.194	0.246	62.141	948.079	1.012	11.976	27K
cnewsum	20.183	0.000	0.000	16.834	438.271	1.109	21.926	304K
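The fragment columns look like the extractive-fragment statistics introduced with the Newsroom dataset (Grusky et al., 2018). That reading is an assumption: the source does not say how the values were computed. Under it, the three numbers for a single tokenized document-summary pair can be sketched as follows. Whitespace tokenization and the function names extractive_fragments and fragment_stats are illustrative choices, not the authors' code.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token spans shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest article match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def fragment_stats(article, summary):
    """Compression, density and coverage for one (article, summary) pair."""
    n = len(summary)
    if n == 0:
        return {"compression": 0.0, "density": 0.0, "coverage": 0.0}
    fragments = extractive_fragments(article, summary)
    return {
        "compression": len(article) / n,                         # |A| / |S|
        "density": sum(len(f) ** 2 for f in fragments) / n,      # mean squared fragment length
        "coverage": sum(len(f) for f in fragments) / n,          # share of copied summary tokens
    }


article = "the cat sat on the mat today".split()
summary = "the cat sat today".split()
stats = fragment_stats(article, summary)
```

Here the summary is fully copied from the article in two fragments of lengths 3 and 1, so coverage is 1.0, density is 2.5 and compression is 1.75. The table reports such values averaged over each sub-dataset.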

Tokenization

Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
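The effect of these limits on a sequence of token ids can be sketched in plain Python. This is a simplified illustration, not the tokenizer used for training: it ignores special tokens such as end-of-sequence and language-code markers, and the padding id is an assumed placeholder that should be read from the actual tokenizer.

```python
ENCODER_MAX_LENGTH = 512  # input text
DECODER_MAX_LENGTH = 128  # summary
PAD_ID = 1  # assumed value; read it from the tokenizer in practice


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


long_input = list(range(2, 602))   # 600 made-up token ids, longer than the limit
short_summary = [5, 6, 7]          # 3 made-up token ids, shorter than the limit

encoder_ids = pad_or_truncate(long_input, ENCODER_MAX_LENGTH)
decoder_ids = pad_or_truncate(short_summary, DECODER_MAX_LENGTH)
```

Both results have a fixed length: the long input loses its last 88 tokens, and the short summary is filled with 125 padding ids.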

Training

The model was trained with a cross-entropy loss.
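For a sequence-to-sequence model, this loss is usually the mean negative log-likelihood of the reference summary tokens. The sketch below shows the computation in plain Python for clarity. It is not the training code, and the ignore_id convention of -100 for padded positions is borrowed from common deep learning libraries as an assumption.

```python
import math


def token_cross_entropy(logits, target_ids, ignore_id=-100):
    """Mean negative log-likelihood over target tokens, skipping ignore_id.

    logits: one list of vocabulary scores per decoder position.
    target_ids: the reference token id for each position.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits, target_ids):
        if target == ignore_id:
            continue  # padded position, does not contribute to the loss
        peak = max(step_logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0


# Two positions over a vocabulary of four tokens; the second one is padding.
uniform_loss = token_cross_entropy([[0.0] * 4, [0.0] * 4], [1, -100])
```

With uniform scores over four tokens the loss equals ln 4, which is the value a model with no preference among the tokens would obtain.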

Time: 3 days 8 hours
Steps: 860K, approx. 8 epochs (out of 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
Evaluation loss (eloss): 2.214 - 1.762
Training loss (tloss): 3.365 - 1.445

ROUGE results per individual dataset test set:

dataset	ROUGE-1 Precision	ROUGE-1 Recall	ROUGE-1 Fscore	ROUGE-2 Precision	ROUGE-2 Recall	ROUGE-2 Fscore	ROUGE-L Precision	ROUGE-L Recall	ROUGE-L Fscore
cnc	27.45	24.8	25.24	9.35	8.54	8.67	20.14	18.19	18.54
sumeczech	25.38	21.61	22.66	7.71	6.67	6.96	18.76	16.02	16.78
cnndm	41.97	42.61	41.05	19.64	19.88	19.16	29.38	29.85	28.73
xsum	39.18	39.8	38.83	16.59	16.98	16.5	31.25	31.74	30.96
mlsum-tu	51.02	47.95	47.72	36.15	34.07	33.9	44.59	41.9	41.74
mlsum-de	46.96	46.16	46.02	35.95	35.87	35.66	43.26	42.7	42.53
mlsum-fr	34.51	31.4	32.03	16.56	15.07	15.37	26.73	24.41	24.86
mlsum-es	32.62	29.66	30.21	13.3	12.2	12.39	26.24	24.02	24.4
mlsum-ru	1.25	1.54	1.31	0.46	0.46	0.44	1.25	1.54	1.31
cnewsum	26.43	29.44	26.38	7.38	8.52	7.46	25.99	28.94	25.92
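As a reminder of what the precision, recall and F-score columns measure, the following sketch computes ROUGE-N for one candidate and one reference summary from n-gram overlap. It is a minimal illustration under stated assumptions: whitespace tokens, no stemming, and a hypothetical function name rouge_n. The evaluation tool behind the table is not named in the source and may differ in tokenization and aggregation. ROUGE-L, which is based on the longest common subsequence, is not covered here.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, fscore) of n-gram overlap for token lists."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())  # clipped count of shared n-grams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()

rouge_1 = rouge_n(candidate, reference, n=1)
rouge_2 = rouge_n(candidate, reference, n=2)
```

In this example five of six unigrams and three of five bigrams are shared, so all three ROUGE-1 values are 5/6 and all three ROUGE-2 values are 3/5. The table lists such scores multiplied by 100.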