WARNING: APEX not installed - defaulting to deepspeed's fused adam
Time to load fused_adam op: 0.05144023895263672 seconds
Time to load utils op: 0.10138964653015137 seconds
Rank: 1 partition count [2, 2] and sizes[(253755392, False), (160768, False)]
Time to load utils op: 0.00027060508728027344 seconds
WARNING: shuffle index length (191133) is not equal to sample index length (191134)
WARNING: shuffle index length (38225) is not equal to sample index length (38226)
WARNING: shuffle index length (38225) is not equal to sample index length (38226)
WARNING: shuffle index length (58533) is not equal to sample index length (58534)
WARNING: shuffle index length (642) is not equal to sample index length (643)
WARNING: shuffle index length (642) is not equal to sample index length (643)
WARNING: shuffle index length (330778) is not equal to sample index length (330779)
WARNING: shuffle index length (165388) is not equal to sample index length (165389)
WARNING: shuffle index length (165388) is not equal to sample index length (165389)
WARNING: shuffle index length (224993) is not equal to sample index length (224994)
WARNING: shuffle index length (37498) is not equal to sample index length (37499)
WARNING: shuffle index length (37498) is not equal to sample index length (37499)
WARNING: shuffle index length (225273) is not equal to sample index length (225274)
WARNING: shuffle index length (3099) is not equal to sample index length (3100)
WARNING: shuffle index length (1032) is not equal to sample index length (1033)
WARNING: shuffle index length (59398) is not equal to sample index length (59399)
WARNING: shuffle index length (680) is not equal to sample index length (681)
WARNING: shuffle index length (112) is not equal to sample index length (113)
WARNING: shuffle index length (155804) is not equal to sample index length (155805)
WARNING: shuffle index length (19474) is not equal to sample index length (19475)
WARNING: shuffle index length (19474) is not equal to sample index length (19475)
WARNING: shuffle index length (233692) is not equal to sample index length (233693)
WARNING: shuffle index length (58422) is not equal to sample index length (58423)
WARNING: shuffle index length (58422) is not equal to sample index length (58423)
WARNING: shuffle index length (475425) is not equal to sample index length (475426)
WARNING: shuffle index length (237712) is not equal to sample index length (237713)
WARNING: shuffle index length (237712) is not equal to sample index length (237713)
WARNING: shuffle index length (287120) is not equal to sample index length (287121)
WARNING: shuffle index length (71779) is not equal to sample index length (71780)
WARNING: shuffle index length (71779) is not equal to sample index length (71780)
WARNING: shuffle index length (99876) is not equal to sample index length (99877)
WARNING: shuffle index length (4341) is not equal to sample index length (4342)
WARNING: shuffle index length (4341) is not equal to sample index length (4342)
WARNING: shuffle index length (536307) is not equal to sample index length (536308)
WARNING: shuffle index length (268153) is not equal to sample index length (268154)
WARNING: shuffle index length (268153) is not equal to sample index length (268154)
WARNING: shuffle index length (84118) is not equal to sample index length (84119)
WARNING: shuffle index length (2335) is not equal to sample index length (2336)
WARNING: shuffle index length (2335) is not equal to sample index length (2336)
WARNING: shuffle index length (243248) is not equal to sample index length (243249)
WARNING: shuffle index length (60811) is not equal to sample index length (60812)
WARNING: shuffle index length (60811) is not equal to sample index length (60812)
WARNING: shuffle index length (183462) is not equal to sample index length (183463)
WARNING: shuffle index length (26208) is not equal to sample index length (26209)
WARNING: shuffle index length (26208) is not equal to sample index length (26209)
WARNING: shuffle index length (45097) is not equal to sample index length (45098)
WARNING: shuffle index length (546) is not equal to sample index length (547)
WARNING: shuffle index length (181) is not equal to sample index length (182)
WARNING: shuffle index length (425408) is not equal to sample index length (425409)
WARNING: shuffle index length (212703) is not equal to sample index length (212704)
WARNING: shuffle index length (212703) is not equal to sample index length (212704)
WARNING: shuffle index length (62295) is not equal to sample index length (62296)
WARNING: shuffle index length (852) is not equal to sample index length (853)
WARNING: shuffle index length (852) is not equal to sample index length (853)
WARNING: shuffle index length (585511) is not equal to sample index length (585512)
WARNING: shuffle index length (292755) is not equal to sample index length (292756)
WARNING: shuffle index length (292755) is not equal to sample index length (292756)
WARNING: shuffle index length (201923) is not equal to sample index length (201924)
WARNING: shuffle index length (33653) is not equal to sample index length (33654)
WARNING: shuffle index length (33653) is not equal to sample index length (33654)
WARNING: shuffle index length (276245) is not equal to sample index length (276246)
WARNING: shuffle index length (69060) is not equal to sample index length (69061)
WARNING: shuffle index length (69060) is not equal to sample index length (69061)
WARNING: shuffle index length (65434) is not equal to sample index length (65435)
WARNING: shuffle index length (1294) is not equal to sample index length (1295)
WARNING: shuffle index length (646) is not equal to sample index length (647)
WARNING: shuffle index length (42670) is not equal to sample index length (42671)
WARNING: shuffle index length (446) is not equal to sample index length (447)
WARNING: shuffle index length (36) is not equal to sample index length (37)
WARNING: shuffle index length (353503) is not equal to sample index length (353504)
WARNING: shuffle index length (353503) is not equal to sample index length (353504)
WARNING: shuffle index length (353503) is not equal to sample index length (353504)
WARNING: shuffle index length (231277) is not equal to sample index length (231278)
WARNING: shuffle index length (13603) is not equal to sample index length (13604)
WARNING: shuffle index length (13603) is not equal to sample index length (13604)
WARNING: shuffle index length (391816) is not equal to sample index length (391817)
WARNING: shuffle index length (97953) is not equal to sample index length (97954)
WARNING: shuffle index length (97953) is not equal to sample index length (97954)
WARNING: shuffle index length (110499) is not equal to sample index length (110500)
WARNING: shuffle index length (4249) is not equal to sample index length (4250)
WARNING: shuffle index length (4249) is not equal to sample index length (4250)
WARNING: shuffle index length (101747) is not equal to sample index length (101748)
WARNING: shuffle index length (3390) is not equal to sample index length (3391)
WARNING: shuffle index length (3390) is not equal to sample index length (3391)
> RANK 1 elapsed time for building blendable dataset indices: 0.27 (sec)
> RANK 1 elapsed time for building blendable dataset indices: 0.09 (sec)
> RANK 1 elapsed time for building blendable dataset indices: 0.09 (sec)
...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_8: no. of documents:479525
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_8: no. of documents:479525
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_9: no. of documents:146329
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_9: no. of documents:146329
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_9: no. of documents:146329
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_10: no. of documents:8822
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_10: no. of documents:8822
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_10: no. of documents:8822
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_11: no. of documents:566715
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_11: no. of documents:566715
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_11: no. of documents:566715
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_12: no. of documents:4766
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_12: no. of documents:4766
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_12: no. of documents:4766
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_13: no. of documents:182247
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_13: no. of documents:182247
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_13: no. of documents:182247
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_14: no. of documents:52472
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_14: no. of documents:52472
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_14: no. of documents:52472
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_15: no. of documents:637
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_15: no. of documents:637
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_15: no. of documents:637
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_16: no. of documents:490404
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_16: no. of documents:490404
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_16: no. of documents:490404
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_17: no. of documents:1846
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_17: no. of documents:1846
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_17: no. of documents:1846
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_18: no. of documents:718649
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_18: no. of documents:718649
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_18: no. of documents:718649
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_19: no. of documents:76431
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_19: no. of documents:76431
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_19: no. of documents:76431
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_20: no. of documents:168453
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_20: no. of documents:168453
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_20: no. of documents:168453
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_21: no. of documents:2174
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_21: no. of documents:2174
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_21: no. of documents:2174
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_22: no. of documents:534
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_22: no. of documents:534
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_22: no. of documents:534
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_23: no. of documents:772965
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_23: no. of documents:772965
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_23: no. of documents:772965
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_24: no. of documents:147871
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_24: no. of documents:147871
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_24: no. of documents:147871
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_25: no. of documents:1028438
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_25: no. of documents:1028438
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_25: no. of documents:1028438
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_26: no. of documents:11763
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_26: no. of documents:11763
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_26: no. of documents:11763
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_27: no. of documents:8702
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_27: no. of documents:8702
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_27: no. of documents:8702
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_0: no. of documents:77653
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.012457
> elapsed time to build and save sample-idx mapping (seconds): 0.001877
> elapsed time to build and save shuffle-idx mapping (seconds): 0.004111
> loading doc-idx mapping from data-hfcm/adl/adl_text_document_train_0_indexmap_187446ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/adl/adl_text_document_train_0_indexmap_187446ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/adl/adl_text_document_train_0_indexmap_187446ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 191135
total number of epochs: 5
WARNING: shuffle index length (191133) is not equal to sample index length (191134)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_0: no. of documents:77653
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.002071
> elapsed time to build and save sample-idx mapping (seconds): 0.000397
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000847
> loading doc-idx mapping from data-hfcm/adl/adl_text_document_valid_0_indexmap_1881ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/adl/adl_text_document_valid_0_indexmap_1881ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/adl/adl_text_document_valid_0_indexmap_1881ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 38227
total number of epochs: 1
WARNING: shuffle index length (38225) is not equal to sample index length (38226)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_0: no. of documents:77653
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.002163
> elapsed time to build and save sample-idx mapping (seconds): 0.000442
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000862
> loading doc-idx mapping from data-hfcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/adl/adl_text_document_test_0_indexmap_6ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.120 seconds
total number of samples: 38227
total number of epochs: 1
WARNING: shuffle index length (38225) is not equal to sample index length (38226)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_1: no. of documents:1508
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.003722
> elapsed time to build and save sample-idx mapping (seconds): 0.000717
> elapsed time to build and save shuffle-idx mapping (seconds): 0.001240
> loading doc-idx mapping from data-hfcm/botxt/botxt_text_document_train_1_indexmap_58238ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/botxt/botxt_text_document_train_1_indexmap_58238ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/botxt/botxt_text_document_train_1_indexmap_58238ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 58535
total number of epochs: 91
WARNING: shuffle index length (58533) is not equal to sample index length (58534)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_1: no. of documents:1508
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.000162
> elapsed time to build and save sample-idx mapping (seconds): 0.000089
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000073
> loading doc-idx mapping from data-hfcm/botxt/botxt_text_document_valid_1_indexmap_585ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/botxt/botxt_text_document_valid_1_indexmap_585ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/botxt/botxt_text_document_valid_1_indexmap_585ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 644
total number of epochs: 1
WARNING: shuffle index length (642) is not equal to sample index length (643)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_1: no. of documents:1508
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.000186
> elapsed time to build and save sample-idx mapping (seconds): 0.000104
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000090
> loading doc-idx mapping from data-hfcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/botxt/botxt_text_document_test_1_indexmap_2ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 644
total number of epochs: 1
WARNING: shuffle index length (642) is not equal to sample index length (643)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_2: no. of documents:335038
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.021067
> elapsed time to build and save sample-idx mapping (seconds): 0.002939
> elapsed time to build and save shuffle-idx mapping (seconds): 0.007183
> loading doc-idx mapping from data-hfcm/cc/cc_text_document_train_2_indexmap_277303ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/cc/cc_text_document_train_2_indexmap_277303ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/cc/cc_text_document_train_2_indexmap_277303ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 330780
total number of epochs: 2
WARNING: shuffle index length (330778) is not equal to sample index length (330779)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_2: no. of documents:335038
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.009154
> elapsed time to build and save sample-idx mapping (seconds): 0.001631
> elapsed time to build and save shuffle-idx mapping (seconds): 0.003539
> loading doc-idx mapping from data-hfcm/cc/cc_text_document_valid_2_indexmap_2782ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/cc/cc_text_document_valid_2_indexmap_2782ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/cc/cc_text_document_valid_2_indexmap_2782ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 165390
total number of epochs: 1
WARNING: shuffle index length (165388) is not equal to sample index length (165389)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_2: no. of documents:335038
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.009207
> elapsed time to build and save sample-idx mapping (seconds): 0.001887
> elapsed time to build and save shuffle-idx mapping (seconds): 0.003764
> loading doc-idx mapping from data-hfcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/cc/cc_text_document_test_2_indexmap_9ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 165390
total number of epochs: 1
WARNING: shuffle index length (165388) is not equal to sample index length (165389)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_3: no. of documents:88414
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.014543
> elapsed time to build and save sample-idx mapping (seconds): 0.002882
> elapsed time to build and save shuffle-idx mapping (seconds): 0.005104
> loading doc-idx mapping from data-hfcm/danavis/danavis_text_document_train_3_indexmap_194514ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/danavis/danavis_text_document_train_3_indexmap_194514ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/danavis/danavis_text_document_train_3_indexmap_194514ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 224995
total number of epochs: 6
WARNING: shuffle index length (224993) is not equal to sample index length (224994)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_3: no. of documents:88414
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.002417
> elapsed time to build and save sample-idx mapping (seconds): 0.000560
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000833
> loading doc-idx mapping from data-hfcm/danavis/danavis_text_document_valid_3_indexmap_1952ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/danavis/danavis_text_document_valid_3_indexmap_1952ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/danavis/danavis_text_document_valid_3_indexmap_1952ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 37500
total number of epochs: 1
WARNING: shuffle index length (37498) is not equal to sample index length (37499)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_3: no. of documents:88414
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.003371
> elapsed time to build and save sample-idx mapping (seconds): 0.000754
> elapsed time to build and save shuffle-idx mapping (seconds): 0.001172
> loading doc-idx mapping from data-hfcm/danavis/danavis_text_document_test_3_indexmap_7ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/danavis/danavis_text_document_test_3_indexmap_7ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/danavis/danavis_text_document_test_3_indexmap_7ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 37500
total number of epochs: 1
WARNING: shuffle index length (37498) is not equal to sample index length (37499)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_4: no. of documents:147120
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 1.580486
> elapsed time to build and save sample-idx mapping (seconds): 0.031241
> elapsed time to build and save shuffle-idx mapping (seconds): 0.004666
> loading doc-idx mapping from data-hfcm/dannet/dannet_text_document_train_4_indexmap_224242ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/dannet/dannet_text_document_train_4_indexmap_224242ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/dannet/dannet_text_document_train_4_indexmap_224242ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 225275
total number of epochs: 218
WARNING: shuffle index length (225273) is not equal to sample index length (225274)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
valid_4: no. of documents:147120
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.011649
> elapsed time to build and save sample-idx mapping (seconds): 0.000628
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000150
> loading doc-idx mapping from data-hfcm/dannet/dannet_text_document_valid_4_indexmap_2250ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/dannet/dannet_text_document_valid_4_indexmap_2250ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/dannet/dannet_text_document_valid_4_indexmap_2250ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 3101
total number of epochs: 3
WARNING: shuffle index length (3099) is not equal to sample index length (3100)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
test_4: no. of documents:147120
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.003963
> elapsed time to build and save sample-idx mapping (seconds): 0.000298
> elapsed time to build and save shuffle-idx mapping (seconds): 0.000097
> loading doc-idx mapping from data-hfcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/dannet/dannet_text_document_test_4_indexmap_8ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 1034
total number of epochs: 1
WARNING: shuffle index length (1032) is not equal to sample index length (1033)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
train_5: no. of documents:1608
> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 0.023242
> elapsed time to build and save sample-idx mapping (seconds): 0.001569
> elapsed time to build and save shuffle-idx mapping (seconds): 0.001243
> loading doc-idx mapping from data-hfcm/depbank/depbank_text_document_train_5_indexmap_59370ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from data-hfcm/depbank/depbank_text_document_train_5_indexmap_59370ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from data-hfcm/depbank/depbank_text_document_train_5_indexmap_59370ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 59400
total number of epochs: 523
WARNING: shuffle index length (59398) is not equal to sample index length (59399)
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap... creating memory view of numpy buffer... valid_5: no. of documents:1608 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000360 > elapsed time to build and save sample-idx mapping (seconds): 0.000116 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000080 > loading doc-idx mapping from data-hfcm/depbank/depbank_text_document_valid_5_indexmap_596ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/depbank/depbank_text_document_valid_5_indexmap_596ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/depbank/depbank_text_document_valid_5_indexmap_596ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 682 total number of epochs: 6 WARNING: shuffle index length (680) is not equal to sample index length (681) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_5: no. of documents:1608 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000162 > elapsed time to build and save sample-idx mapping (seconds): 0.000094 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000070 > loading doc-idx mapping from data-hfcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/depbank/depbank_text_document_test_5_indexmap_2ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 114 total number of epochs: 1 WARNING: shuffle index length (112) is not equal to sample index length (113) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... train_6: no. of documents:38991 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.008179 > elapsed time to build and save sample-idx mapping (seconds): 0.001085 > elapsed time to build and save shuffle-idx mapping (seconds): 0.003320 > loading doc-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_153497ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_153497ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_train_6_indexmap_153497ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 155806 total number of epochs: 8 WARNING: shuffle index length (155804) is not equal to sample index length (155805) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_6: no. of documents:38991 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.001067 > elapsed time to build and save sample-idx mapping (seconds): 0.000222 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000462 > loading doc-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1540ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1540ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_valid_6_indexmap_1540ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 19476 total number of epochs: 1 WARNING: shuffle index length (19474) is not equal to sample index length (19475) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_6: no. of documents:38991 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.001048 > elapsed time to build and save sample-idx mapping (seconds): 0.000224 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000468 > loading doc-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/elrc-emea/elrc-emea_text_document_test_6_indexmap_5ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 19476 total number of epochs: 1 WARNING: shuffle index length (19474) is not equal to sample index length (19475) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_7: no. 
of documents:126377 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.013155 > elapsed time to build and save sample-idx mapping (seconds): 0.002235 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004901 > loading doc-idx mapping from data-hfcm/ep/ep_text_document_train_7_indexmap_215050ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ep/ep_text_document_train_7_indexmap_215050ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ep/ep_text_document_train_7_indexmap_215050ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 233694 total number of epochs: 4 WARNING: shuffle index length (233692) is not equal to sample index length (233693) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_7: no. of documents:126377 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.003597 > elapsed time to build and save sample-idx mapping (seconds): 0.000663 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001223 > loading doc-idx mapping from data-hfcm/ep/ep_text_document_valid_7_indexmap_2158ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ep/ep_text_document_valid_7_indexmap_2158ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ep/ep_text_document_valid_7_indexmap_2158ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 58424 total number of epochs: 1 WARNING: shuffle index length (58422) is not equal to sample index length (58423) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_7: no. 
of documents:126377 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.003337 > elapsed time to build and save sample-idx mapping (seconds): 0.000653 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001232 > loading doc-idx mapping from data-hfcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ep/ep_text_document_test_7_indexmap_7ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 58424 total number of epochs: 1 WARNING: shuffle index length (58422) is not equal to sample index length (58423) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_8: no. of documents:479525 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.026269 > elapsed time to build and save sample-idx mapping (seconds): 0.004217 > elapsed time to build and save shuffle-idx mapping (seconds): 0.009961 > loading doc-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_train_8_indexmap_300457ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_train_8_indexmap_300457ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_train_8_indexmap_300457ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 475427 total number of epochs: 2 WARNING: shuffle index length (475425) is not equal to sample index length (475426) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... 
valid_8: no. of documents:479525 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.012218 > elapsed time to build and save sample-idx mapping (seconds): 0.002321 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004894 > loading doc-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_valid_8_indexmap_3014ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_valid_8_indexmap_3014ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_valid_8_indexmap_3014ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 237714 total number of epochs: 1 WARNING: shuffle index length (237712) is not equal to sample index length (237713) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_8: no. of documents:479525 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.012302 > elapsed time to build and save sample-idx mapping (seconds): 0.002277 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004886 > loading doc-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/eubookshop/eubookshop_text_document_test_8_indexmap_10ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 237714 total number of epochs: 1 WARNING: shuffle index length (237712) is not equal to sample index length (237713) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... train_9: no. of documents:146329 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.014955 > elapsed time to build and save sample-idx mapping (seconds): 0.002172 > elapsed time to build and save shuffle-idx mapping (seconds): 0.005747 > loading doc-idx mapping from data-hfcm/ft/ft_text_document_train_9_indexmap_223911ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ft/ft_text_document_train_9_indexmap_223911ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ft/ft_text_document_train_9_indexmap_223911ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 287122 total number of epochs: 4 WARNING: shuffle index length (287120) is not equal to sample index length (287121) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_9: no. of documents:146329 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.003812 > elapsed time to build and save sample-idx mapping (seconds): 0.000650 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001498 > loading doc-idx mapping from data-hfcm/ft/ft_text_document_valid_9_indexmap_2247ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ft/ft_text_document_valid_9_indexmap_2247ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ft/ft_text_document_valid_9_indexmap_2247ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 71781 total number of epochs: 1 WARNING: shuffle index length (71779) is not equal to sample index length (71780) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... test_9: no. of documents:146329 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.003851 > elapsed time to build and save sample-idx mapping (seconds): 0.000741 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001501 > loading doc-idx mapping from data-hfcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/ft/ft_text_document_test_9_indexmap_7ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 71781 total number of epochs: 1 WARNING: shuffle index length (71779) is not equal to sample index length (71780) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_10: no. of documents:8822 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.005279 > elapsed time to build and save sample-idx mapping (seconds): 0.000727 > elapsed time to build and save shuffle-idx mapping (seconds): 0.002084 > loading doc-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_train_10_indexmap_98809ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_train_10_indexmap_98809ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_train_10_indexmap_98809ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 99878 total number of epochs: 23 WARNING: shuffle index length (99876) is not equal to sample index length (99877) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... valid_10: no. of documents:8822 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000322 > elapsed time to build and save sample-idx mapping (seconds): 0.000128 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000153 > loading doc-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_valid_10_indexmap_992ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_valid_10_indexmap_992ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_valid_10_indexmap_992ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 4343 total number of epochs: 1 WARNING: shuffle index length (4341) is not equal to sample index length (4342) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_10: no. of documents:8822 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000310 > elapsed time to build and save sample-idx mapping (seconds): 0.000121 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000152 > loading doc-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/gutenberg/gutenberg_text_document_test_10_indexmap_4ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 4343 total number of epochs: 1 WARNING: shuffle index length (4341) is not equal to sample index length (4342) reading sizes... 
reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_11: no. of documents:566715 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.032184 > elapsed time to build and save sample-idx mapping (seconds): 0.005663 > elapsed time to build and save shuffle-idx mapping (seconds): 0.010632 > loading doc-idx mapping from data-hfcm/hest/hest_text_document_train_11_indexmap_310609ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/hest/hest_text_document_train_11_indexmap_310609ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/hest/hest_text_document_train_11_indexmap_310609ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 536309 total number of epochs: 2 WARNING: shuffle index length (536307) is not equal to sample index length (536308) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_11: no. of documents:566715 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.014494 > elapsed time to build and save sample-idx mapping (seconds): 0.003128 > elapsed time to build and save shuffle-idx mapping (seconds): 0.005309 > loading doc-idx mapping from data-hfcm/hest/hest_text_document_valid_11_indexmap_3116ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/hest/hest_text_document_valid_11_indexmap_3116ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/hest/hest_text_document_valid_11_indexmap_3116ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 268155 total number of epochs: 1 WARNING: shuffle index length (268153) is not equal to sample index length (268154) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_11: no. of documents:566715 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.014843 > elapsed time to build and save sample-idx mapping (seconds): 0.003101 > elapsed time to build and save shuffle-idx mapping (seconds): 0.005427 > loading doc-idx mapping from data-hfcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/hest/hest_text_document_test_11_indexmap_10ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 268155 total number of epochs: 1 WARNING: shuffle index length (268153) is not equal to sample index length (268154) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_12: no. of documents:4766 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.004624 > elapsed time to build and save sample-idx mapping (seconds): 0.000650 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001815 > loading doc-idx mapping from data-hfcm/jvj/jvj_text_document_train_12_indexmap_82202ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/jvj/jvj_text_document_train_12_indexmap_82202ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/jvj/jvj_text_document_train_12_indexmap_82202ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 84120 total number of epochs: 36 WARNING: shuffle index length (84118) is not equal to sample index length (84119) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_12: no. of documents:4766 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000230 > elapsed time to build and save sample-idx mapping (seconds): 0.000108 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000115 > loading doc-idx mapping from data-hfcm/jvj/jvj_text_document_valid_12_indexmap_825ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/jvj/jvj_text_document_valid_12_indexmap_825ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/jvj/jvj_text_document_valid_12_indexmap_825ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 2337 total number of epochs: 1 WARNING: shuffle index length (2335) is not equal to sample index length (2336) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_12: no. of documents:4766 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.000232 > elapsed time to build and save sample-idx mapping (seconds): 0.000112 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000114 > loading doc-idx mapping from data-hfcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/jvj/jvj_text_document_test_12_indexmap_3ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 2337 total number of epochs: 1 WARNING: shuffle index length (2335) is not equal to sample index length (2336) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_13: no. of documents:182247 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.019148 > elapsed time to build and save sample-idx mapping (seconds): 0.004684 > elapsed time to build and save shuffle-idx mapping (seconds): 0.005194 > loading doc-idx mapping from data-hfcm/kb/kb_text_document_train_13_indexmap_237602ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/kb/kb_text_document_train_13_indexmap_237602ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/kb/kb_text_document_train_13_indexmap_237602ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 243250 total number of epochs: 4 WARNING: shuffle index length (243248) is not equal to sample index length (243249) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_13: no. of documents:182247 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.004885 > elapsed time to build and save sample-idx mapping (seconds): 0.001248 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001251 > loading doc-idx mapping from data-hfcm/kb/kb_text_document_valid_13_indexmap_2384ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/kb/kb_text_document_valid_13_indexmap_2384ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/kb/kb_text_document_valid_13_indexmap_2384ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 60813 total number of epochs: 1 WARNING: shuffle index length (60811) is not equal to sample index length (60812) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_13: no. of documents:182247 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.005075 > elapsed time to build and save sample-idx mapping (seconds): 0.001216 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001280 > loading doc-idx mapping from data-hfcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/kb/kb_text_document_test_13_indexmap_8ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 60813 total number of epochs: 1 WARNING: shuffle index length (60811) is not equal to sample index length (60812) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_14: no. of documents:52472 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.009659 > elapsed time to build and save sample-idx mapping (seconds): 0.001226 > elapsed time to build and save shuffle-idx mapping (seconds): 0.003896 > loading doc-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_train_14_indexmap_167399ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_train_14_indexmap_167399ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_train_14_indexmap_167399ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 183464 total number of epochs: 7 WARNING: shuffle index length (183462) is not equal to sample index length (183463) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_14: no. of documents:52472 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.001376 > elapsed time to build and save sample-idx mapping (seconds): 0.000270 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000619 > loading doc-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1680ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1680ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_valid_14_indexmap_1680ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 26210 total number of epochs: 1 WARNING: shuffle index length (26208) is not equal to sample index length (26209) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_14: no. 
of documents:52472 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.001320 > elapsed time to build and save sample-idx mapping (seconds): 0.000270 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000597 > loading doc-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/korpus2000/korpus2000_text_document_test_14_indexmap_6ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 26210 total number of epochs: 1 WARNING: shuffle index length (26208) is not equal to sample index length (26209) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_15: no. of documents:637 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.004050 > elapsed time to build and save sample-idx mapping (seconds): 0.000872 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001000 > loading doc-idx mapping from data-hfcm/naat/naat_text_document_train_15_indexmap_44978ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/naat/naat_text_document_train_15_indexmap_44978ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/naat/naat_text_document_train_15_indexmap_44978ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 45099 total number of epochs: 247 WARNING: shuffle index length (45097) is not equal to sample index length (45098) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... 
creating memory view of numpy buffer... valid_15: no. of documents:637 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000163 > elapsed time to build and save sample-idx mapping (seconds): 0.000097 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000074 > loading doc-idx mapping from data-hfcm/naat/naat_text_document_valid_15_indexmap_452ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/naat/naat_text_document_valid_15_indexmap_452ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/naat/naat_text_document_valid_15_indexmap_452ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 548 total number of epochs: 3 WARNING: shuffle index length (546) is not equal to sample index length (547) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_15: no. of documents:637 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000124 > elapsed time to build and save sample-idx mapping (seconds): 0.000089 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000068 > loading doc-idx mapping from data-hfcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/naat/naat_text_document_test_15_indexmap_2ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 183 total number of epochs: 1 WARNING: shuffle index length (181) is not equal to sample index length (182) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... 
train_16: no. of documents:490404 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.025975 > elapsed time to build and save sample-idx mapping (seconds): 0.005719 > elapsed time to build and save shuffle-idx mapping (seconds): 0.009405 > loading doc-idx mapping from data-hfcm/opensub/opensub_text_document_train_16_indexmap_301854ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/opensub/opensub_text_document_train_16_indexmap_301854ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/opensub/opensub_text_document_train_16_indexmap_301854ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 425410 total number of epochs: 2 WARNING: shuffle index length (425408) is not equal to sample index length (425409) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_16: no. of documents:490404 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.012905 > elapsed time to build and save sample-idx mapping (seconds): 0.003116 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004588 > loading doc-idx mapping from data-hfcm/opensub/opensub_text_document_valid_16_indexmap_3028ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/opensub/opensub_text_document_valid_16_indexmap_3028ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/opensub/opensub_text_document_valid_16_indexmap_3028ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 212705 total number of epochs: 1 WARNING: shuffle index length (212703) is not equal to sample index length (212704) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... test_16: no. of documents:490404 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.013090 > elapsed time to build and save sample-idx mapping (seconds): 0.003195 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004539 > loading doc-idx mapping from data-hfcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/opensub/opensub_text_document_test_16_indexmap_10ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 212705 total number of epochs: 1 WARNING: shuffle index length (212703) is not equal to sample index length (212704) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_17: no. of documents:1846 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.003466 > elapsed time to build and save sample-idx mapping (seconds): 0.000615 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001283 > loading doc-idx mapping from data-hfcm/relig/relig_text_document_train_17_indexmap_61877ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/relig/relig_text_document_train_17_indexmap_61877ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/relig/relig_text_document_train_17_indexmap_61877ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 62297 total number of epochs: 73 WARNING: shuffle index length (62295) is not equal to sample index length (62296) reading sizes... reading pointers... 
reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_17: no. of documents:1846 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000163 > elapsed time to build and save sample-idx mapping (seconds): 0.000085 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000071 > loading doc-idx mapping from data-hfcm/relig/relig_text_document_valid_17_indexmap_621ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/relig/relig_text_document_valid_17_indexmap_621ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/relig/relig_text_document_valid_17_indexmap_621ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 854 total number of epochs: 1 WARNING: shuffle index length (852) is not equal to sample index length (853) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_17: no. of documents:1846 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000149 > elapsed time to build and save sample-idx mapping (seconds): 0.000082 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000068 > loading doc-idx mapping from data-hfcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/relig/relig_text_document_test_17_indexmap_2ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.038 seconds total number of samples: 854 total number of epochs: 1 WARNING: shuffle index length (852) is not equal to sample index length (853) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... train_18: no. of documents:718649 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.041108 > elapsed time to build and save sample-idx mapping (seconds): 0.008988 > elapsed time to build and save shuffle-idx mapping (seconds): 0.012011 > loading doc-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323651ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323651ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_train_18_indexmap_323651ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 585513 total number of epochs: 2 WARNING: shuffle index length (585511) is not equal to sample index length (585512) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_18: no. of documents:718649 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.018758 > elapsed time to build and save sample-idx mapping (seconds): 0.004802 > elapsed time to build and save shuffle-idx mapping (seconds): 0.006146 > loading doc-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3247ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3247ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_valid_18_indexmap_3247ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 292757 total number of epochs: 1 WARNING: shuffle index length (292755) is not equal to sample index length (292756) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_18: no. of documents:718649 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.018870 > elapsed time to build and save sample-idx mapping (seconds): 0.004808 > elapsed time to build and save shuffle-idx mapping (seconds): 0.006050 > loading doc-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retsinformationdk/retsinformationdk_text_document_test_18_indexmap_11ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 292757 total number of epochs: 1 WARNING: shuffle index length (292755) is not equal to sample index length (292756) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... 
creating memory view of numpy buffer... train_19: no. of documents:76431 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.012111 > elapsed time to build and save sample-idx mapping (seconds): 0.002073 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004257 > loading doc-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_train_19_indexmap_186596ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_train_19_indexmap_186596ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_train_19_indexmap_186596ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 201925 total number of epochs: 6 WARNING: shuffle index length (201923) is not equal to sample index length (201924) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_19: no. of documents:76431 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.001914 > elapsed time to build and save sample-idx mapping (seconds): 0.000461 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000712 > loading doc-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1872ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1872ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_valid_19_indexmap_1872ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 33655 total number of epochs: 1 WARNING: shuffle index length (33653) is not equal to sample index length (33654) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_19: no. of documents:76431 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.001896 > elapsed time to build and save sample-idx mapping (seconds): 0.000431 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000706 > loading doc-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/retspraksis/retspraksis_text_document_test_19_indexmap_6ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 33655 total number of epochs: 1 WARNING: shuffle index length (33653) is not equal to sample index length (33654) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_20: no. 
of documents:168453 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.017652 > elapsed time to build and save sample-idx mapping (seconds): 0.003547 > elapsed time to build and save shuffle-idx mapping (seconds): 0.005333 > loading doc-idx mapping from data-hfcm/skat/skat_text_document_train_20_indexmap_232639ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/skat/skat_text_document_train_20_indexmap_232639ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/skat/skat_text_document_train_20_indexmap_232639ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 276247 total number of epochs: 4 WARNING: shuffle index length (276245) is not equal to sample index length (276246) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_20: no. of documents:168453 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.004540 > elapsed time to build and save sample-idx mapping (seconds): 0.001016 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001414 > loading doc-idx mapping from data-hfcm/skat/skat_text_document_valid_20_indexmap_2334ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/skat/skat_text_document_valid_20_indexmap_2334ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/skat/skat_text_document_valid_20_indexmap_2334ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 69062 total number of epochs: 1 WARNING: shuffle index length (69060) is not equal to sample index length (69061) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... 
test_20: no. of documents:168453 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.004508 > elapsed time to build and save sample-idx mapping (seconds): 0.000982 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001407 > loading doc-idx mapping from data-hfcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/skat/skat_text_document_test_20_indexmap_8ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 69062 total number of epochs: 1 WARNING: shuffle index length (69060) is not equal to sample index length (69061) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_21: no. of documents:2174 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.005652 > elapsed time to build and save sample-idx mapping (seconds): 0.001189 > elapsed time to build and save shuffle-idx mapping (seconds): 0.001316 > loading doc-idx mapping from data-hfcm/spont/spont_text_document_train_21_indexmap_64985ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/spont/spont_text_document_train_21_indexmap_64985ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/spont/spont_text_document_train_21_indexmap_64985ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 65436 total number of epochs: 101 WARNING: shuffle index length (65434) is not equal to sample index length (65435) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... 
valid_21: no. of documents:2174 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000208 > elapsed time to build and save sample-idx mapping (seconds): 0.000102 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000082 > loading doc-idx mapping from data-hfcm/spont/spont_text_document_valid_21_indexmap_652ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/spont/spont_text_document_valid_21_indexmap_652ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/spont/spont_text_document_valid_21_indexmap_652ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 1296 total number of epochs: 2 WARNING: shuffle index length (1294) is not equal to sample index length (1295) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_21: no. of documents:2174 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000153 > elapsed time to build and save sample-idx mapping (seconds): 0.000090 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000068 > loading doc-idx mapping from data-hfcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/spont/spont_text_document_test_21_indexmap_3ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 648 total number of epochs: 1 WARNING: shuffle index length (646) is not equal to sample index length (647) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_22: no. 
of documents:534 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.015704 > elapsed time to build and save sample-idx mapping (seconds): 0.001159 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000960 > loading doc-idx mapping from data-hfcm/synne/synne_text_document_train_22_indexmap_42660ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/synne/synne_text_document_train_22_indexmap_42660ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/synne/synne_text_document_train_22_indexmap_42660ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 42672 total number of epochs: 1143 WARNING: shuffle index length (42670) is not equal to sample index length (42671) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_22: no. of documents:534 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000270 > elapsed time to build and save sample-idx mapping (seconds): 0.000092 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000062 > loading doc-idx mapping from data-hfcm/synne/synne_text_document_valid_22_indexmap_428ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/synne/synne_text_document_valid_22_indexmap_428ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/synne/synne_text_document_valid_22_indexmap_428ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 448 total number of epochs: 12 WARNING: shuffle index length (446) is not equal to sample index length (447) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_22: no. 
of documents:534 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000120 > elapsed time to build and save sample-idx mapping (seconds): 0.000078 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000054 > loading doc-idx mapping from data-hfcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/synne/synne_text_document_test_22_indexmap_2ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 38 total number of epochs: 1 WARNING: shuffle index length (36) is not equal to sample index length (37) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_23: no. of documents:772965 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.020848 > elapsed time to build and save sample-idx mapping (seconds): 0.004543 > elapsed time to build and save shuffle-idx mapping (seconds): 0.007730 > loading doc-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_327186ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_327186ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_train_23_indexmap_327186ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 353505 total number of epochs: 1 WARNING: shuffle index length (353503) is not equal to sample index length (353504) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... 
creating memory view of numpy buffer... valid_23: no. of documents:772965 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.020564 > elapsed time to build and save sample-idx mapping (seconds): 0.004427 > elapsed time to build and save shuffle-idx mapping (seconds): 0.007776 > loading doc-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3283ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3283ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_valid_23_indexmap_3283ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 353505 total number of epochs: 1 WARNING: shuffle index length (353503) is not equal to sample index length (353504) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_23: no. of documents:772965 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.020609 > elapsed time to build and save sample-idx mapping (seconds): 0.004468 > elapsed time to build and save shuffle-idx mapping (seconds): 0.007710 > loading doc-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tidsskrift/tidsskrift_text_document_test_23_indexmap_11ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 353505 total number of epochs: 1 WARNING: shuffle index length (353503) is not equal to sample index length (353504) reading sizes... 
reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_24: no. of documents:147871 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.074308 > elapsed time to build and save sample-idx mapping (seconds): 0.006576 > elapsed time to build and save shuffle-idx mapping (seconds): 0.004911 > loading doc-idx mapping from data-hfcm/tv2r/tv2r_text_document_train_24_indexmap_224554ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tv2r/tv2r_text_document_train_24_indexmap_224554ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tv2r/tv2r_text_document_train_24_indexmap_224554ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 231279 total number of epochs: 17 WARNING: shuffle index length (231277) is not equal to sample index length (231278) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_24: no. of documents:147871 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.003917 > elapsed time to build and save sample-idx mapping (seconds): 0.000568 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000344 > loading doc-idx mapping from data-hfcm/tv2r/tv2r_text_document_valid_24_indexmap_2253ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tv2r/tv2r_text_document_valid_24_indexmap_2253ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tv2r/tv2r_text_document_valid_24_indexmap_2253ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 13605 total number of epochs: 1 WARNING: shuffle index length (13603) is not equal to sample index length (13604) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_24: no. of documents:147871 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.004082 > elapsed time to build and save sample-idx mapping (seconds): 0.000604 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000352 > loading doc-idx mapping from data-hfcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/tv2r/tv2r_text_document_test_24_indexmap_8ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 13605 total number of epochs: 1 WARNING: shuffle index length (13603) is not equal to sample index length (13604) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_25: no. of documents:1028438 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.138608 > elapsed time to build and save sample-idx mapping (seconds): 0.014802 > elapsed time to build and save shuffle-idx mapping (seconds): 0.009222 > loading doc-idx mapping from data-hfcm/wiki/wiki_text_document_train_25_indexmap_337918ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wiki/wiki_text_document_train_25_indexmap_337918ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wiki/wiki_text_document_train_25_indexmap_337918ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 391818 total number of epochs: 4 WARNING: shuffle index length (391816) is not equal to sample index length (391817) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_25: no. of documents:1028438 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.029253 > elapsed time to build and save sample-idx mapping (seconds): 0.004003 > elapsed time to build and save shuffle-idx mapping (seconds): 0.002205 > loading doc-idx mapping from data-hfcm/wiki/wiki_text_document_valid_25_indexmap_3390ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wiki/wiki_text_document_valid_25_indexmap_3390ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wiki/wiki_text_document_valid_25_indexmap_3390ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 97955 total number of epochs: 1 WARNING: shuffle index length (97953) is not equal to sample index length (97954) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_25: no. 
of documents:1028438 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.029280 > elapsed time to build and save sample-idx mapping (seconds): 0.004041 > elapsed time to build and save shuffle-idx mapping (seconds): 0.002186 > loading doc-idx mapping from data-hfcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wiki/wiki_text_document_test_25_indexmap_11ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 97955 total number of epochs: 1 WARNING: shuffle index length (97953) is not equal to sample index length (97954) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... train_26: no. of documents:11763 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.008348 > elapsed time to build and save sample-idx mapping (seconds): 0.001740 > elapsed time to build and save shuffle-idx mapping (seconds): 0.002411 > loading doc-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_train_26_indexmap_107660ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_train_26_indexmap_107660ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_train_26_indexmap_107660ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 110501 total number of epochs: 26 WARNING: shuffle index length (110499) is not equal to sample index length (110500) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... 
creating memory view of numpy buffer... valid_26: no. of documents:11763 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000418 > elapsed time to build and save sample-idx mapping (seconds): 0.000153 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000153 > loading doc-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1080ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1080ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_valid_26_indexmap_1080ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 4251 total number of epochs: 1 WARNING: shuffle index length (4249) is not equal to sample index length (4250) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_26: no. of documents:11763 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.000393 > elapsed time to build and save sample-idx mapping (seconds): 0.000137 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000145 > loading doc-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikibooks/wikibooks_text_document_test_26_indexmap_4ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 4251 total number of epochs: 1 WARNING: shuffle index length (4249) is not equal to sample index length (4250) reading sizes... reading pointers... reading document index... 
creating numpy buffer of mmap... creating memory view of numpy buffer... train_27: no. of documents:8702 > WARNING: could not find index map files, building the indices on rank 0 ... > elasped time to build and save doc-idx mapping (seconds): 0.006744 > elapsed time to build and save sample-idx mapping (seconds): 0.001333 > elapsed time to build and save shuffle-idx mapping (seconds): 0.002172 > loading doc-idx mapping from data-hfcm/wikisource/wikisource_text_document_train_27_indexmap_98406ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikisource/wikisource_text_document_train_27_indexmap_98406ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikisource/wikisource_text_document_train_27_indexmap_98406ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 101749 total number of epochs: 30 WARNING: shuffle index length (101747) is not equal to sample index length (101748) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... valid_27: no. of documents:8702 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.000318 > elapsed time to build and save sample-idx mapping (seconds): 0.000128 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000130 > loading doc-idx mapping from data-hfcm/wikisource/wikisource_text_document_valid_27_indexmap_988ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikisource/wikisource_text_document_valid_27_indexmap_988ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikisource/wikisource_text_document_valid_27_indexmap_988ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 3392 total number of epochs: 1 WARNING: shuffle index length (3390) is not equal to sample index length (3391) reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... test_27: no. of documents:8702 > WARNING: could not find index map files, building the indices on rank 0 ... 
> elasped time to build and save doc-idx mapping (seconds): 0.000320 > elapsed time to build and save sample-idx mapping (seconds): 0.000122 > elapsed time to build and save shuffle-idx mapping (seconds): 0.000130 > loading doc-idx mapping from data-hfcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data-hfcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data-hfcm/wikisource/wikisource_text_document_test_27_indexmap_4ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 3392 total number of epochs: 1 WARNING: shuffle index length (3390) is not equal to sample index length (3391) > RANK 0 elapsed time for building blendable dataset indices: 0.27 (sec) > RANK 0 elapsed time for building blendable dataset indices: 0.09 (sec) > RANK 0 elapsed time for building blendable dataset indices: 0.09 (sec) setting training data start iteration to 0 setting validation data start iteration to 0 done with setups ... time (ms) | model and optimizer: 1532.97 | train/valid/test data iterators: 4047.50 training ... 
samples/sec: 6.535 | iteration 100/ 320000 | elapsed time per iteration (ms): 2448.2 | learning rate: 9.375E-06 | approx flops per GPU: 40.6TFLOPS | lm_loss: 9.549332E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | after 100 iterations memory (MB) | allocated: 3902.71630859375 | max allocated: 14147.748046875 | reserved: 17338.0 | max reserved: 17338.0 time (ms) | forward: 579.69 | backward: 1812.59 | backward-backward: 1812.57 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.85 samples/sec: 6.590 | iteration 200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.875E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 8.082000E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.95 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.84 samples/sec: 6.591 | iteration 300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.812E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 7.059252E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 1.09 samples/sec: 6.594 | iteration 400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.750E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 6.603406E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.594 | iteration 500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.688E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 6.385542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.593 | iteration 600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.625E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 6.186406E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.82 samples/sec: 6.595 | iteration 700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 6.562E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 6.007888E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 samples/sec: 6.594 | iteration 800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 7.500E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.869252E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 samples/sec: 6.599 | iteration 900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 8.437E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.711679E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1802.58 | backward-backward: 1802.56 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.79 samples/sec: 6.587 | iteration 1000/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 9.375E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.629186E+00 | loss scale: 131072.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.80 --------------------------------------------------------------------------------------------------------- validation results at iteration 1000 | lm_loss value: 5.682423E+00 | lm_loss_ppl value: 2.936601E+02 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 1100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 1.031E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 5.534498E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.88 samples/sec: 6.597 | iteration 1200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.125E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.461071E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.82 samples/sec: 6.589 | iteration 1300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.219E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.385663E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81 samples/sec: 6.591 | iteration 1400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.312E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.330443E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | time (ms) | forward: 566.96 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 samples/sec: 6.596 | iteration 1500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.406E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.269926E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.598 | iteration 1600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.500E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.179033E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.80 samples/sec: 6.589 | iteration 1700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.594E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 5.137948E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.592 | iteration 1800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.687E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.089219E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 samples/sec: 6.598 | iteration 1900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.781E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 5.032278E+00 | loss scale: 131072.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1802.49 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.81 samples/sec: 6.591 | iteration 2000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.875E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.984029E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 --------------------------------------------------------------------------------------------------------- validation results at iteration 2000 | lm_loss value: 4.896064E+00 | lm_loss_ppl value: 1.337623E+02 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 2100/ 320000 | elapsed time per iteration (ms): 2485.6 | learning rate: 1.969E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.959482E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.86 samples/sec: 6.594 | iteration 2200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.062E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.907664E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.88 samples/sec: 6.588 | iteration 2300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.156E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.876400E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 566.93 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 samples/sec: 6.593 | iteration 2400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.250E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.819312E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.99 samples/sec: 6.591 | iteration 2500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.343E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.791298E+00 | loss scale: 262144.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.78 samples/sec: 6.588 | iteration 2600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.437E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.767423E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.588 | iteration 2700/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.530E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.716384E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1805.85 | backward-backward: 1805.83 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 samples/sec: 6.584 | iteration 2800/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.624E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.678875E+00 | loss scale: 262144.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1807.33 | backward-backward: 1807.31 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 samples/sec: 6.590 | iteration 2900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.718E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.643036E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.39 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 samples/sec: 6.584 | iteration 3000/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 2.812E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.603987E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1807.06 | backward-backward: 1807.04 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 --------------------------------------------------------------------------------------------------------- validation results at iteration 3000 | lm_loss value: 4.605845E+00 | lm_loss_ppl value: 1.000675E+02 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 3100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 2.905E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.591966E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.86 samples/sec: 6.581 | iteration 3200/ 320000 | elapsed time per iteration (ms): 2431.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.541979E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 566.86 | backward: 1807.91 | backward-backward: 1807.89 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.82 samples/sec: 6.590 | iteration 3300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.524418E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.81 samples/sec: 6.591 | iteration 3400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.507504E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1806.01 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.81 samples/sec: 6.586 | iteration 3500/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.463646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1806.15 | backward-backward: 1806.13 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.595 | iteration 3600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.425423E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.81 samples/sec: 6.585 | iteration 3700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.418594E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1806.07 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.79 samples/sec: 6.593 | iteration 3800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.384482E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79 samples/sec: 6.589 | iteration 3900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.384638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 samples/sec: 6.588 | iteration 4000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.319682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.87 --------------------------------------------------------------------------------------------------------- validation results at iteration 4000 | lm_loss value: 4.253010E+00 | lm_loss_ppl value: 7.031673E+01 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 4100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.296846E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.28 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.84 samples/sec: 6.585 | iteration 4200/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.282844E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.79 samples/sec: 6.594 | iteration 4300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.272869E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 samples/sec: 6.587 | iteration 4400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.245605E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.84 samples/sec: 6.591 | iteration 4500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.233589E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.82 samples/sec: 6.588 | iteration 4600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.206819E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.86 | backward-backward: 1805.83 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 samples/sec: 6.589 | iteration 4700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.204915E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 samples/sec: 6.590 | iteration 4800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.148910E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.23 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.77 samples/sec: 6.586 | iteration 4900/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.143568E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.22 | backward: 1806.24 | backward-backward: 1806.22 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.591 | iteration 5000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.154825E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.87 --------------------------------------------------------------------------------------------------------- validation results at iteration 5000 | lm_loss value: 4.111569E+00 | lm_loss_ppl value: 6.104244E+01 | 
--------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 5100/ 320000 | elapsed time per iteration (ms): 2486.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 4.134828E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1806.78 | backward-backward: 1806.76 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.86 samples/sec: 6.591 | iteration 5200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.089501E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.72 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 samples/sec: 6.585 | iteration 5300/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.105779E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.83 samples/sec: 6.585 | iteration 5400/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.083565E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.73 | backward: 1805.97 | backward-backward: 1805.94 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 samples/sec: 6.593 | iteration 5500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 4.059950E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 566.44 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 samples/sec: 6.577 | iteration 5600/ 320000 | elapsed time per iteration (ms): 2432.8 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.055535E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.36 | backward: 1807.60 | backward-backward: 1807.57 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.86 samples/sec: 6.590 | iteration 5700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.058848E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.84 samples/sec: 6.581 | iteration 5800/ 320000 | elapsed time per iteration (ms): 2431.3 | learning rate: 3.000E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.027263E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1807.91 | backward-backward: 1807.89 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 samples/sec: 6.582 | iteration 5900/ 320000 | elapsed time per iteration (ms): 2430.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.020742E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1807.04 | backward-backward: 1807.02 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.89 samples/sec: 6.580 | iteration 6000/ 320000 | elapsed time per iteration (ms): 2431.5 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 4.008304E+00 | loss scale: 131072.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1807.69 | backward-backward: 1807.67 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.84 --------------------------------------------------------------------------------------------------------- validation results at iteration 6000 | lm_loss value: 3.978245E+00 | lm_loss_ppl value: 5.342318E+01 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.430 | iteration 6100/ 320000 | elapsed time per iteration (ms): 2488.4 | learning rate: 2.999E-04 | approx flops per GPU: 39.9TFLOPS | lm_loss: 4.007632E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.92 | backward: 1807.68 | backward-backward: 1807.65 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.89 samples/sec: 6.586 | iteration 6200/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.984602E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1806.57 | backward-backward: 1806.54 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.83 samples/sec: 6.575 | iteration 6300/ 320000 | elapsed time per iteration (ms): 2433.5 | learning rate: 2.999E-04 | approx flops per GPU: 40.8TFLOPS | lm_loss: 3.979956E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.59 | backward: 1809.41 | backward-backward: 1809.39 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.87 samples/sec: 6.580 | iteration 6400/ 320000 | elapsed time per iteration (ms): 2431.6 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.948014E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 567.18 | backward: 1807.00 | backward-backward: 1806.97 | backward-allreduce: 0.00 | optimizer: 56.94 | batch generator: 0.88 samples/sec: 6.578 | iteration 6500/ 320000 | elapsed time per iteration (ms): 2432.4 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.953824E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1808.74 | backward-backward: 1808.72 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.85 samples/sec: 6.583 | iteration 6600/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.939685E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.69 | backward: 1806.45 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.83 samples/sec: 6.581 | iteration 6700/ 320000 | elapsed time per iteration (ms): 2431.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.918176E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1807.82 | backward-backward: 1807.80 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.82 samples/sec: 6.584 | iteration 6800/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.928960E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.52 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.83 samples/sec: 6.587 | iteration 6900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.922198E+00 | loss scale: 131072.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1806.25 | backward-backward: 1806.23 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.582 | iteration 7000/ 320000 | elapsed time per iteration (ms): 2430.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.892187E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1807.63 | backward-backward: 1807.60 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.82 --------------------------------------------------------------------------------------------------------- validation results at iteration 7000 | lm_loss value: 3.923717E+00 | lm_loss_ppl value: 5.058816E+01 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 7100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.999E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.868007E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.48 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.87 samples/sec: 6.578 | iteration 7200/ 320000 | elapsed time per iteration (ms): 2432.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.883707E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.27 | backward: 1808.33 | backward-backward: 1808.31 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.85 samples/sec: 6.587 | iteration 7300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.880604E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 566.84 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.80 samples/sec: 6.579 | iteration 7400/ 320000 | elapsed time per iteration (ms): 2431.9 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.871510E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1808.49 | backward-backward: 1808.47 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.87 samples/sec: 6.576 | iteration 7500/ 320000 | elapsed time per iteration (ms): 2433.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.850698E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.97 | backward: 1807.83 | backward-backward: 1807.80 | backward-allreduce: 0.00 | optimizer: 56.94 | batch generator: 0.92 samples/sec: 6.583 | iteration 7600/ 320000 | elapsed time per iteration (ms): 2430.4 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.852787E+00 | loss scale: 131072.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1807.47 | backward-backward: 1807.44 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.99 samples/sec: 6.578 | iteration 7700/ 320000 | elapsed time per iteration (ms): 2432.2 | learning rate: 2.999E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.840576E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1808.37 | backward-backward: 1808.34 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.86 samples/sec: 6.583 | iteration 7800/ 320000 | elapsed time per iteration (ms): 2430.6 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.826696E+00 | loss scale: 131072.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1806.81 | backward-backward: 1806.78 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.91 samples/sec: 6.575 | iteration 7900/ 320000 | elapsed time per iteration (ms): 2433.4 | learning rate: 2.998E-04 | approx flops per GPU: 40.8TFLOPS | lm_loss: 3.806460E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.64 | backward: 1809.01 | backward-backward: 1808.98 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.83 samples/sec: 6.584 | iteration 8000/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.827209E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1806.20 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.99 --------------------------------------------------------------------------------------------------------- validation results at iteration 8000 | lm_loss value: 3.797544E+00 | lm_loss_ppl value: 4.459155E+01 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.428 | iteration 8100/ 320000 | elapsed time per iteration (ms): 2489.3 | learning rate: 2.998E-04 | approx flops per GPU: 39.9TFLOPS | lm_loss: 3.816886E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.53 | backward: 1808.27 | backward-backward: 1808.25 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.97 samples/sec: 6.582 | iteration 8200/ 320000 | elapsed time per iteration (ms): 2430.7 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.789092E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time 
(ms) | forward: 567.51 | backward: 1807.18 | backward-backward: 1807.16 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.85 samples/sec: 6.584 | iteration 8300/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.792008E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1806.97 | backward-backward: 1806.94 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.85 samples/sec: 6.583 | iteration 8400/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.783204E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1806.88 | backward-backward: 1806.86 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.83 samples/sec: 6.584 | iteration 8500/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.776155E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1806.62 | backward-backward: 1806.59 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.89 samples/sec: 6.577 | iteration 8600/ 320000 | elapsed time per iteration (ms): 2432.7 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.790146E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.57 | backward: 1807.78 | backward-backward: 1807.75 | backward-allreduce: 0.00 | optimizer: 56.84 | batch generator: 0.83 samples/sec: 6.588 | iteration 8700/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.750791E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 samples/sec: 6.586 | iteration 8800/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.749613E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.92 samples/sec: 6.589 | iteration 8900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.759783E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.82 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.83 samples/sec: 6.582 | iteration 9000/ 320000 | elapsed time per iteration (ms): 2430.8 | learning rate: 2.998E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.764304E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1807.36 | backward-backward: 1807.33 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 --------------------------------------------------------------------------------------------------------- validation results at iteration 9000 | lm_loss value: 3.736098E+00 | lm_loss_ppl value: 4.193404E+01 | --------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 9100/ 320000 | elapsed time per iteration (ms): 2486.7 | learning rate: 2.997E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.731948E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
567.48 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.89 samples/sec: 6.584 | iteration 9200/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.745574E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1807.09 | backward-backward: 1807.07 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.587 | iteration 9300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.727777E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1805.83 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.87 samples/sec: 6.586 | iteration 9400/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.720867E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1806.24 | backward-backward: 1806.22 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.82 samples/sec: 6.578 | iteration 9500/ 320000 | elapsed time per iteration (ms): 2432.3 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.703245E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.51 | backward: 1808.40 | backward-backward: 1808.37 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.83 samples/sec: 6.589 | iteration 9600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.724936E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 samples/sec: 6.576 | iteration 9700/ 320000 | elapsed time per iteration (ms): 2432.9 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.704388E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.28 | backward: 1808.92 | backward-backward: 1808.89 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.81 samples/sec: 6.584 | iteration 9800/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.706010E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1806.56 | backward-backward: 1806.54 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.82 samples/sec: 6.580 | iteration 9900/ 320000 | elapsed time per iteration (ms): 2431.5 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.712389E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1807.96 | backward-backward: 1807.93 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.81 samples/sec: 6.581 | iteration 10000/ 320000 | elapsed time per iteration (ms): 2431.4 | learning rate: 2.997E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.693038E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1808.08 | backward-backward: 1808.05 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 10000 | lm_loss value: 3.724523E+00 | lm_loss_ppl value: 
4.145147E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.229 | iteration 10100/ 320000 | elapsed time per iteration (ms): 2568.6 | learning rate: 2.997E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.677791E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.54 | backward: 1808.48 | backward-backward: 1808.45 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.92 samples/sec: 6.579 | iteration 10200/ 320000 | elapsed time per iteration (ms): 2432.0 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.698020E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.05 | backward: 1807.80 | backward-backward: 1807.78 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 samples/sec: 6.591 | iteration 10300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.683244E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 samples/sec: 6.583 | iteration 10400/ 320000 | elapsed time per iteration (ms): 2430.6 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.660412E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1807.71 | backward-backward: 1807.69 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.580 | iteration 10500/ 320000 | elapsed time per iteration (ms): 2431.6 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.670549E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | time (ms) | forward: 567.70 | backward: 1807.26 | backward-backward: 1807.24 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.85 samples/sec: 6.580 | iteration 10600/ 320000 | elapsed time per iteration (ms): 2431.5 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.653632E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1808.05 | backward-backward: 1808.02 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.81 samples/sec: 6.580 | iteration 10700/ 320000 | elapsed time per iteration (ms): 2431.7 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.670483E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1808.06 | backward-backward: 1808.03 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.82 samples/sec: 6.587 | iteration 10800/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.662378E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1806.16 | backward-backward: 1806.14 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.79 samples/sec: 6.565 | iteration 10900/ 320000 | elapsed time per iteration (ms): 2437.0 | learning rate: 2.996E-04 | approx flops per GPU: 40.8TFLOPS | lm_loss: 3.642135E+00 | loss scale: 262144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.36 | backward: 1811.32 | backward-backward: 1811.30 | backward-allreduce: 0.00 | optimizer: 56.84 | batch generator: 0.88 samples/sec: 6.584 | iteration 11000/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 2.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.634459E+00 | loss scale: 262144.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1806.67 | backward-backward: 1806.65 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.96 ---------------------------------------------------------------------------------------------------------- validation results at iteration 11000 | lm_loss value: 3.654724E+00 | lm_loss_ppl value: 3.865686E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.433 | iteration 11100/ 320000 | elapsed time per iteration (ms): 2487.3 | learning rate: 2.996E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.646715E+00 | loss scale: 65536.0 | number of skipped iterations: 3 | number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1808.29 | backward-backward: 1808.26 | backward-allreduce: 0.00 | optimizer: 54.65 | batch generator: 0.88 samples/sec: 6.588 | iteration 11200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.636438E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.48 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.592 | iteration 11300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.995E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.648000E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 samples/sec: 6.582 | iteration 11400/ 320000 | elapsed time per iteration (ms): 2430.8 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.657173E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.16 | backward: 1806.83 | backward-backward: 1806.81 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.84 samples/sec: 6.588 | iteration 11500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.621507E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.82 samples/sec: 6.583 | iteration 11600/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.614926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1807.32 | backward-backward: 1807.29 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 samples/sec: 6.583 | iteration 11700/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.640282E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1806.18 | backward-backward: 1806.15 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.86 samples/sec: 6.581 | iteration 11800/ 320000 | elapsed time per iteration (ms): 2431.3 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.627943E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1807.71 | backward-backward: 1807.69 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.86 samples/sec: 6.587 | iteration 11900/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.995E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.621376E+00 | loss scale: 65536.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1805.33 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.77 samples/sec: 6.594 | iteration 12000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.603548E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 12000 | lm_loss value: 3.605436E+00 | lm_loss_ppl value: 3.679773E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.433 | iteration 12100/ 320000 | elapsed time per iteration (ms): 2487.2 | learning rate: 2.994E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.596931E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.34 | backward: 1807.05 | backward-backward: 1807.03 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.87 samples/sec: 6.595 | iteration 12200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.603660E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.79 samples/sec: 6.582 | iteration 12300/ 320000 | elapsed time per iteration (ms): 2430.7 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.588769E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.76 | backward: 1807.24 | backward-backward: 1807.21 | backward-allreduce: 0.00 | optimizer: 56.36 | batch generator: 0.81 samples/sec: 6.588 | iteration 12400/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.596487E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1805.79 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 samples/sec: 6.590 | iteration 12500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.576975E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 samples/sec: 6.588 | iteration 12600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.994E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.598080E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.83 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.591 | iteration 12700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.598098E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.590 | iteration 12800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.580490E+00 | loss scale: 131072.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 samples/sec: 6.592 | iteration 12900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.575678E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.81 samples/sec: 6.588 | iteration 13000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.567848E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 13000 | lm_loss value: 3.578448E+00 | lm_loss_ppl value: 3.581790E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 13100/ 320000 | elapsed time per iteration (ms): 2485.9 | learning rate: 2.993E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.578574E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.96 | backward-backward: 1805.94 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.92 samples/sec: 6.590 | iteration 13200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.565614E+00 | loss scale: 262144.0 | number of skipped iterations: 1 | number of 
nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.592 | iteration 13300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.993E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.583825E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.80 samples/sec: 6.587 | iteration 13400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.993E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.572882E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1806.29 | backward-backward: 1806.27 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.594 | iteration 13500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.563380E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 samples/sec: 6.584 | iteration 13600/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.553704E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1806.90 | backward-backward: 1806.88 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 samples/sec: 6.591 | iteration 13700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.554678E+00 | loss scale: 131072.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 samples/sec: 6.590 | iteration 13800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.557820E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1805.63 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.591 | iteration 13900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.556737E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.80 samples/sec: 6.589 | iteration 14000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.992E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.547551E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 14000 | lm_loss value: 3.528798E+00 | lm_loss_ppl value: 3.408296E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 14100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 2.991E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.557135E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.85 samples/sec: 6.588 | iteration 14200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.536500E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.56 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.77 samples/sec: 6.595 | iteration 14300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.991E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.542112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.81 samples/sec: 6.587 | iteration 14400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.540164E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1806.16 | backward-backward: 1806.14 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.594 | iteration 14500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.991E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.545260E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 samples/sec: 6.585 | iteration 14600/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.528848E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1806.73 | backward-backward: 1806.71 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 samples/sec: 6.591 | iteration 14700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.533107E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 samples/sec: 6.593 | iteration 14800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.990E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.530526E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.40 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 samples/sec: 6.583 | iteration 14900/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.536016E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1807.24 | backward-backward: 1807.22 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 samples/sec: 6.594 | iteration 15000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.990E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.544323E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 15000 | lm_loss 
value: 3.558048E+00 | lm_loss_ppl value: 3.509463E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 15100/ 320000 | elapsed time per iteration (ms): 2486.8 | learning rate: 2.990E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.515501E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1806.81 | backward-backward: 1806.79 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.86 samples/sec: 6.588 | iteration 15200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.540483E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.92 | backward-backward: 1805.89 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 samples/sec: 6.588 | iteration 15300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.520809E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1806.22 | backward-backward: 1806.20 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 samples/sec: 6.585 | iteration 15400/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.515193E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1807.05 | backward-backward: 1807.03 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.597 | iteration 15500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.989E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.526956E+00 | loss scale: 65536.0 | 
number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 54.33 | batch generator: 0.77 samples/sec: 6.585 | iteration 15600/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.529986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1806.74 | backward-backward: 1806.72 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.76 samples/sec: 6.594 | iteration 15700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.989E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.521941E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.84 samples/sec: 6.586 | iteration 15800/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.989E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.510869E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.96 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.88 samples/sec: 6.594 | iteration 15900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.988E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.510009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.587 | iteration 16000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.988E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.498010E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 16000 | lm_loss value: 3.484224E+00 | lm_loss_ppl value: 3.259711E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 16100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 2.988E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.500763E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.86 samples/sec: 6.592 | iteration 16200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.988E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.505605E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.82 samples/sec: 6.584 | iteration 16300/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.988E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.503309E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1806.75 | backward-backward: 1806.72 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 samples/sec: 6.594 | iteration 16400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.987E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.522551E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.586 | iteration 16500/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.495613E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1806.45 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.84 samples/sec: 6.591 | iteration 16600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.494635E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.588 | iteration 16700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.501434E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1806.04 | backward-backward: 1806.02 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84 samples/sec: 6.587 | iteration 16800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.987E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.485729E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1806.11 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.78 samples/sec: 6.595 | iteration 16900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.986E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.499949E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.78 samples/sec: 6.586 | iteration 17000/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.986E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.479489E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1807.13 | backward-backward: 1807.10 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 17000 | lm_loss value: 3.478612E+00 | lm_loss_ppl value: 3.241471E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 17100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.986E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.488137E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.88 samples/sec: 6.591 | iteration 17200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.986E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.476403E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 samples/sec: 6.587 | iteration 17300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.986E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.490224E+00 | loss scale: 131072.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.75 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.83 samples/sec: 6.591 | iteration 17400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.490039E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 samples/sec: 6.589 | iteration 17500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.486064E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.88 samples/sec: 6.587 | iteration 17600/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.465668E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1806.10 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 samples/sec: 6.595 | iteration 17700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.985E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.484441E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.92 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.74 samples/sec: 6.587 | iteration 17800/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.985E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.487324E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.99 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.83 samples/sec: 6.593 | iteration 17900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.468199E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.81 samples/sec: 6.594 | iteration 18000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.460568E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 18000 | lm_loss value: 3.487677E+00 | lm_loss_ppl value: 3.270988E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 18100/ 320000 | elapsed time per iteration (ms): 2486.0 | learning rate: 2.984E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.470311E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.86 samples/sec: 6.598 | iteration 18200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.984E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.459359E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.82 samples/sec: 6.589 | iteration 18300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.984E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.466425E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.590 | iteration 18400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.983E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.459520E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.80 samples/sec: 6.595 | iteration 18500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.983E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.468585E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.82 samples/sec: 6.585 | iteration 18600/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.983E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.449602E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.53 | backward: 1806.13 | backward-backward: 1806.11 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.96 samples/sec: 6.592 | iteration 18700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.983E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.449367E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.81 samples/sec: 6.595 | iteration 18800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.445179E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 samples/sec: 6.586 | iteration 18900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.982E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.454380E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.78 samples/sec: 6.594 | iteration 19000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.461644E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 19000 | lm_loss value: 3.442200E+00 | lm_loss_ppl value: 3.125565E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 19100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 2.982E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.447596E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.68 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.86 samples/sec: 6.593 | iteration 19200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.982E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.444178E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.76 samples/sec: 6.594 | iteration 19300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.981E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.456484E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.585 | iteration 19400/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.981E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.438658E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1806.34 | backward-backward: 1806.32 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.76 samples/sec: 6.583 | iteration 19500/ 320000 | elapsed time per iteration (ms): 2430.6 | learning rate: 2.981E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.450551E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1806.36 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.78 samples/sec: 6.596 | iteration 19600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.981E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.444273E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.588 | iteration 19700/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.980E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.427438E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1806.03 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 samples/sec: 6.585 | iteration 19800/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.980E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.426350E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1806.35 | backward-backward: 1806.32 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.78 samples/sec: 6.580 | iteration 19900/ 320000 | elapsed time per iteration (ms): 2431.6 | learning rate: 2.980E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.444643E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.42 | backward: 1806.66 | backward-backward: 1806.63 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.83 samples/sec: 6.594 | iteration 20000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.980E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.428382E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.83 ---------------------------------------------------------------------------------------------------------- 
validation results at iteration 20000 | lm_loss value: 3.369170E+00 | lm_loss_ppl value: 2.905439E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.223 | iteration 20100/ 320000 | elapsed time per iteration (ms): 2571.3 | learning rate: 2.979E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.445608E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.73 | backward: 1809.85 | backward-backward: 1809.82 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.94 samples/sec: 6.583 | iteration 20200/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.979E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.433979E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1806.74 | backward-backward: 1806.72 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.80 samples/sec: 6.588 | iteration 20300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.979E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.437804E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1806.09 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.82 samples/sec: 6.595 | iteration 20400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.979E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.447111E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.89 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.77 samples/sec: 6.586 | iteration 20500/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.978E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.439632E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.84 samples/sec: 6.584 | iteration 20600/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.978E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.438907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1806.87 | backward-backward: 1806.84 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.79 samples/sec: 6.592 | iteration 20700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.978E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.448674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.592 | iteration 20800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.978E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.440555E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.584 | iteration 20900/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 2.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.439475E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1806.92 | backward-backward: 1806.89 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 samples/sec: 6.587 | iteration 21000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 
2.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.429688E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.94 ---------------------------------------------------------------------------------------------------------- validation results at iteration 21000 | lm_loss value: 3.472486E+00 | lm_loss_ppl value: 3.221675E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 21100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 2.977E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.427542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 1.03 samples/sec: 6.589 | iteration 21200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.412682E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.77 samples/sec: 6.585 | iteration 21300/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.976E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.421915E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1806.44 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.84 samples/sec: 6.590 | iteration 21400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.976E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.425612E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.86 samples/sec: 6.597 | iteration 21500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.976E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.406616E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.77 samples/sec: 6.583 | iteration 21600/ 320000 | elapsed time per iteration (ms): 2430.4 | learning rate: 2.976E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.413427E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1807.08 | backward-backward: 1807.05 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.79 samples/sec: 6.583 | iteration 21700/ 320000 | elapsed time per iteration (ms): 2430.7 | learning rate: 2.975E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.437197E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.31 | backward: 1806.66 | backward-backward: 1806.64 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.79 samples/sec: 6.591 | iteration 21800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.975E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.426331E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.593 | iteration 21900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 
2.975E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.401378E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.83 samples/sec: 6.586 | iteration 22000/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.975E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.412116E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1806.50 | backward-backward: 1806.48 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 22000 | lm_loss value: 3.409003E+00 | lm_loss_ppl value: 3.023508E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 22100/ 320000 | elapsed time per iteration (ms): 2487.0 | learning rate: 2.974E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.409708E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.28 | backward: 1806.33 | backward-backward: 1806.30 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.84 samples/sec: 6.593 | iteration 22200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.974E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.414800E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.93 samples/sec: 6.591 | iteration 22300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.974E-04 | approx flops per GPU: 40.9TFLOPS 
| lm_loss: 3.407791E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 samples/sec: 6.585 | iteration 22400/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.418904E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1806.65 | backward-backward: 1806.62 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.82 samples/sec: 6.588 | iteration 22500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.408504E+00 | loss scale: 131072.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1806.14 | backward-backward: 1806.11 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.82 samples/sec: 6.588 | iteration 22600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.402657E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.63 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.82 samples/sec: 6.586 | iteration 22700/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.406193E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1806.63 | backward-backward: 1806.61 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.579 | iteration 22800/ 320000 | elapsed time per iteration (ms): 2431.9 | learning rate: 
2.972E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.415700E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.68 | backward: 1807.53 | backward-backward: 1807.51 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.79 samples/sec: 6.596 | iteration 22900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.972E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.411639E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 samples/sec: 6.591 | iteration 23000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.972E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.406105E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.16 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 ---------------------------------------------------------------------------------------------------------- validation results at iteration 23000 | lm_loss value: 3.386911E+00 | lm_loss_ppl value: 2.957445E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 23100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 2.972E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.431310E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1806.06 | backward-backward: 1806.04 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.85 samples/sec: 6.592 | iteration 23200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.971E-04 | approx flops per GPU: 41.0TFLOPS 
| lm_loss: 3.404346E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.85 samples/sec: 6.593 | iteration 23300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.971E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.415671E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 samples/sec: 6.583 | iteration 23400/ 320000 | elapsed time per iteration (ms): 2430.4 | learning rate: 2.971E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.396132E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1806.82 | backward-backward: 1806.79 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.79 samples/sec: 6.591 | iteration 23500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.970E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.400174E+00 | loss scale: 131072.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 54.69 | batch generator: 0.77 samples/sec: 6.594 | iteration 23600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.970E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.420445E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.84 samples/sec: 6.586 | iteration 23700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 
2.970E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.388505E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1806.60 | backward-backward: 1806.58 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 samples/sec: 6.585 | iteration 23800/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.970E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.398495E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.77 samples/sec: 6.595 | iteration 23900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.969E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.385289E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.592 | iteration 24000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.969E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.373112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 24000 | lm_loss value: 3.372802E+00 | lm_loss_ppl value: 2.916012E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 24100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 2.969E-04 | approx flops per GPU: 40.0TFLOPS | 
lm_loss: 3.401346E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1806.03 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.90 samples/sec: 6.590 | iteration 24200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.968E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.409446E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.598 | iteration 24300/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.968E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.404403E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.26 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 samples/sec: 6.589 | iteration 24400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.968E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.375716E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.588 | iteration 24500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.967E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.373438E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.85 samples/sec: 6.590 | iteration 24600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 
2.967E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.378137E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.594 | iteration 24700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.967E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.398987E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 samples/sec: 6.597 | iteration 24800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.966E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.375590E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 54.67 | batch generator: 0.90 samples/sec: 6.587 | iteration 24900/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.966E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.385600E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.91 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.79 samples/sec: 6.589 | iteration 25000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.966E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.386148E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.76 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 
---------------------------------------------------------------------------------------------------------- validation results at iteration 25000 | lm_loss value: 3.390328E+00 | lm_loss_ppl value: 2.967569E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 25100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.966E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.381085E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.29 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.85 samples/sec: 6.589 | iteration 25200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.965E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.390410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.56 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.590 | iteration 25300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.965E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.363958E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81 samples/sec: 6.591 | iteration 25400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.965E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.378372E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 samples/sec: 6.590 | iteration 25500/ 320000 | 
elapsed time per iteration (ms): 2427.9 | learning rate: 2.964E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.411517E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.84 samples/sec: 6.592 | iteration 25600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.964E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.372944E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.591 | iteration 25700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.964E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.387851E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 samples/sec: 6.591 | iteration 25800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.963E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.395983E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.81 samples/sec: 6.591 | iteration 25900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.963E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.358106E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 
samples/sec: 6.590 | iteration 26000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.963E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.371229E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.86 ---------------------------------------------------------------------------------------------------------- validation results at iteration 26000 | lm_loss value: 3.317383E+00 | lm_loss_ppl value: 2.758805E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 26100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 2.962E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.362961E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.68 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.86 samples/sec: 6.591 | iteration 26200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.962E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.372546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 samples/sec: 6.592 | iteration 26300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.962E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.383401E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.84 samples/sec: 6.592 | iteration 26400/ 320000 | 
elapsed time per iteration (ms): 2427.2 | learning rate: 2.961E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.369084E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 samples/sec: 6.593 | iteration 26500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.961E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.353497E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.592 | iteration 26600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.961E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.357881E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 samples/sec: 6.593 | iteration 26700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.960E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.352808E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.75 samples/sec: 6.592 | iteration 26800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.960E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.369270E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 
samples/sec: 6.591 | iteration 26900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.960E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.356558E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 samples/sec: 6.592 | iteration 27000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.959E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.377886E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 27000 | lm_loss value: 3.338592E+00 | lm_loss_ppl value: 2.817942E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 27100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 2.959E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.380413E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.83 samples/sec: 6.590 | iteration 27200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.959E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.343455E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 samples/sec: 6.589 | iteration 27300/ 320000 | 
elapsed time per iteration (ms): 2428.2 | learning rate: 2.958E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.370528E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.76 samples/sec: 6.593 | iteration 27400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.958E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.361313E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.78 samples/sec: 6.590 | iteration 27500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.958E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.340943E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.80 samples/sec: 6.591 | iteration 27600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.957E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.356842E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.84 samples/sec: 6.591 | iteration 27700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.957E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.383666E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.85 
samples/sec: 6.591 | iteration 27800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.957E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.362769E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 samples/sec: 6.591 | iteration 27900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.956E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.358910E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.79 samples/sec: 6.592 | iteration 28000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.956E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.352151E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 28000 | lm_loss value: 3.335882E+00 | lm_loss_ppl value: 2.810316E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 28100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.956E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.356265E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.83 samples/sec: 6.589 | iteration 28200/ 320000 | 
elapsed time per iteration (ms): 2428.2 | learning rate: 2.955E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.379067E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.76 samples/sec: 6.593 | iteration 28300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.955E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.383939E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.593 | iteration 28400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.954E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.361360E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.83 samples/sec: 6.593 | iteration 28500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.954E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.366101E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.592 | iteration 28600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.954E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.359407E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 
samples/sec: 6.590 | iteration 28700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.953E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.354857E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.15 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.77 samples/sec: 6.593 | iteration 28800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.953E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.368465E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1805.14 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 54.94 | batch generator: 0.75 samples/sec: 6.593 | iteration 28900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.953E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.345328E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.77 samples/sec: 6.592 | iteration 29000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.952E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.348986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 29000 | lm_loss value: 3.377374E+00 | lm_loss_ppl value: 2.929374E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 29100/ 320000 | 
elapsed time per iteration (ms): 2484.3 | learning rate: 2.952E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.364436E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.88 samples/sec: 6.592 | iteration 29200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.952E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.353659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.90 samples/sec: 6.590 | iteration 29300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.951E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.353422E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 samples/sec: 6.592 | iteration 29400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.951E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.339091E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 samples/sec: 6.593 | iteration 29500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.950E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.367910E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 
samples/sec: 6.592 | iteration 29600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.950E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.346667E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78
samples/sec: 6.591 | iteration 29700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.950E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.356226E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.81
samples/sec: 6.591 | iteration 29800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.949E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.353547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79
samples/sec: 6.593 | iteration 29900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.949E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.327869E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78
samples/sec: 6.592 | iteration 30000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.949E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.349870E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.79
----------------------------------------------------------------------------------------------------------
validation results at iteration 30000 | lm_loss value: 3.410203E+00 | lm_loss_ppl value: 3.027140E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.242 | iteration 30100/ 320000 | elapsed time per iteration (ms): 2563.4 | learning rate: 2.948E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.346676E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.86
samples/sec: 6.592 | iteration 30200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.948E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.353948E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.77
samples/sec: 6.594 | iteration 30300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.947E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.340773E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76
samples/sec: 6.588 | iteration 30400/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.947E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.346868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 56.68 | batch generator: 0.78
samples/sec: 6.592 | iteration 30500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.947E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.327868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80
samples/sec: 6.593 | iteration 30600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.946E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.321915E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81
samples/sec: 6.593 | iteration 30700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.946E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.349595E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80
samples/sec: 6.591 | iteration 30800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.945E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.329826E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.92
samples/sec: 6.592 | iteration 30900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.945E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.341872E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.76
samples/sec: 6.592 | iteration 31000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.945E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.349847E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 31000 | lm_loss value: 3.307766E+00 | lm_loss_ppl value: 2.732401E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.439 | iteration 31100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 2.944E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.323090E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.87
samples/sec: 6.590 | iteration 31200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.944E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.349490E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76
samples/sec: 6.591 | iteration 31300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.943E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.332287E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.590 | iteration 31400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.943E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.319173E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.74
samples/sec: 6.588 | iteration 31500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.943E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.346702E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.89
samples/sec: 6.592 | iteration 31600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.942E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.332429E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79
samples/sec: 6.592 | iteration 31700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.942E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.333305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.81
samples/sec: 6.590 | iteration 31800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.941E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.337776E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80
samples/sec: 6.594 | iteration 31900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.941E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.352589E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79
samples/sec: 6.592 | iteration 32000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.941E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.337958E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.75
----------------------------------------------------------------------------------------------------------
validation results at iteration 32000 | lm_loss value: 3.334552E+00 | lm_loss_ppl value: 2.806579E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 32100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 2.940E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.353282E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.83
samples/sec: 6.592 | iteration 32200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.940E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.348693E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
samples/sec: 6.593 | iteration 32300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.939E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.332858E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75
samples/sec: 6.593 | iteration 32400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.939E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.347071E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.76
samples/sec: 6.595 | iteration 32500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.939E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.336064E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77
samples/sec: 6.590 | iteration 32600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.938E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.322897E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.60 | batch generator: 0.85
samples/sec: 6.594 | iteration 32700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.938E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.308574E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.79
samples/sec: 6.593 | iteration 32800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.937E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.329342E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80
samples/sec: 6.593 | iteration 32900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.937E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.349044E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.83
samples/sec: 6.592 | iteration 33000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.936E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.330421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 33000 | lm_loss value: 3.309073E+00 | lm_loss_ppl value: 2.735975E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.442 | iteration 33100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.936E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.316125E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81
samples/sec: 6.587 | iteration 33200/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.936E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.326443E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.58 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.84
samples/sec: 6.591 | iteration 33300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.935E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.311514E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.79
samples/sec: 6.593 | iteration 33400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.935E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.317407E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80
samples/sec: 6.591 | iteration 33500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.934E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.349631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78
samples/sec: 6.589 | iteration 33600/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.934E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.309319E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.78
samples/sec: 6.589 | iteration 33700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.933E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.311014E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77
samples/sec: 6.591 | iteration 33800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.933E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.326368E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78
samples/sec: 6.591 | iteration 33900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.933E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.326279E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.591 | iteration 34000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.932E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.324206E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.81
----------------------------------------------------------------------------------------------------------
validation results at iteration 34000 | lm_loss value: 3.296870E+00 | lm_loss_ppl value: 2.702792E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.440 | iteration 34100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 2.932E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.323241E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84
samples/sec: 6.591 | iteration 34200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.931E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.316648E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80
samples/sec: 6.592 | iteration 34300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.931E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.309069E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.592 | iteration 34400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.930E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.324214E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78
samples/sec: 6.589 | iteration 34500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.930E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.326719E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.82
samples/sec: 6.591 | iteration 34600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.321427E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.77
samples/sec: 6.590 | iteration 34700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.318526E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.26 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.83
samples/sec: 6.595 | iteration 34800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.929E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.310091E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.80
samples/sec: 6.594 | iteration 34900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.928E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.327155E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79
samples/sec: 6.591 | iteration 35000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.928E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.315563E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79
----------------------------------------------------------------------------------------------------------
validation results at iteration 35000 | lm_loss value: 3.389436E+00 | lm_loss_ppl value: 2.964921E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 35100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.927E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.317277E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.85
samples/sec: 6.592 | iteration 35200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.927E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.314090E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76
samples/sec: 6.593 | iteration 35300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.926E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.314318E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78
samples/sec: 6.593 | iteration 35400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.926E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.316464E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77
samples/sec: 6.592 | iteration 35500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.925E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.294127E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79
samples/sec: 6.593 | iteration 35600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.925E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.329562E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76
samples/sec: 6.594 | iteration 35700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.925E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.311970E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78
samples/sec: 6.589 | iteration 35800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.924E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.308219E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.31 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.80
samples/sec: 6.592 | iteration 35900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.924E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.309868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.81
samples/sec: 6.592 | iteration 36000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.923E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.307813E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 36000 | lm_loss value: 3.306255E+00 | lm_loss_ppl value: 2.728277E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 36100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 2.923E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.319524E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.86
samples/sec: 6.593 | iteration 36200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.922E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.296093E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79
samples/sec: 6.597 | iteration 36300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.922E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.313030E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79
samples/sec: 6.589 | iteration 36400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.921E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.314016E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.58 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78
samples/sec: 6.591 | iteration 36500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.921E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.319534E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80
samples/sec: 6.590 | iteration 36600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.920E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.312068E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.77
samples/sec: 6.591 | iteration 36700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.920E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.298063E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.78
samples/sec: 6.591 | iteration 36800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.919E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.292935E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.77
samples/sec: 6.589 | iteration 36900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.919E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.305718E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.78
samples/sec: 6.593 | iteration 37000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.918E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.296155E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.46 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.80
----------------------------------------------------------------------------------------------------------
validation results at iteration 37000 | lm_loss value: 3.285924E+00 | lm_loss_ppl value: 2.673367E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.442 | iteration 37100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 2.918E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.319655E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.88
samples/sec: 6.591 | iteration 37200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.917E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.304612E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.83
samples/sec: 6.594 | iteration 37300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.917E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.304551E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.81
samples/sec: 6.594 | iteration 37400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.916E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.321405E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.76
samples/sec: 6.594 | iteration 37500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.916E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.286706E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 54.54 | batch generator: 0.78
samples/sec: 6.592 | iteration 37600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.916E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.303002E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79
samples/sec: 6.592 | iteration 37700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.915E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.303816E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.82
samples/sec: 6.590 | iteration 37800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.915E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.310947E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80
samples/sec: 6.591 | iteration 37900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.914E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.305179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.80
samples/sec: 6.592 | iteration 38000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.914E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.309701E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.85
----------------------------------------------------------------------------------------------------------
validation results at iteration 38000 | lm_loss value: 3.314566E+00 | lm_loss_ppl value: 2.751046E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.445 | iteration 38100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.913E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.304493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 
0.00 | optimizer: 55.32 | batch generator: 0.88 samples/sec: 6.589 | iteration 38200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.913E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.293611E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.38 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 samples/sec: 6.591 | iteration 38300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.912E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.288315E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 samples/sec: 6.591 | iteration 38400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.912E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.301026E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.83 samples/sec: 6.593 | iteration 38500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.911E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.299116E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81 samples/sec: 6.592 | iteration 38600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.911E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.287967E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.95 | 
backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.81 samples/sec: 6.594 | iteration 38700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.910E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.321618E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 samples/sec: 6.589 | iteration 38800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.910E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.282846E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.81 samples/sec: 6.589 | iteration 38900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.909E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.317735E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.77 samples/sec: 6.591 | iteration 39000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.909E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.296999E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 39000 | lm_loss value: 3.252206E+00 | lm_loss_ppl value: 2.584730E+01 | 
---------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 39100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 2.908E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.314054E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.92 samples/sec: 6.592 | iteration 39200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.908E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.280681E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.44 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 samples/sec: 6.597 | iteration 39300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.907E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.288549E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 samples/sec: 6.593 | iteration 39400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.907E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.286515E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.75 samples/sec: 6.589 | iteration 39500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.906E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.289086E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.58 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.81 samples/sec: 6.593 | iteration 39600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.905E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.278347E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.597 | iteration 39700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.905E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.302220E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.591 | iteration 39800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.904E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.300404E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.86 samples/sec: 6.592 | iteration 39900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.904E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.297825E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.39 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.78 samples/sec: 6.594 | iteration 40000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.903E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.270776E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 40000 | lm_loss value: 3.329822E+00 | lm_loss_ppl value: 2.793336E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.231 | iteration 40100/ 320000 | elapsed time per iteration (ms): 2568.0 | learning rate: 2.903E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.301219E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.93 | backward: 1807.02 | backward-backward: 1807.00 | backward-allreduce: 0.00 | optimizer: 56.26 | batch generator: 0.85 samples/sec: 6.592 | iteration 40200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.902E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.270639E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 samples/sec: 6.588 | iteration 40300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.902E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.271835E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1806.01 | backward-backward: 1805.99 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.75 samples/sec: 6.593 | iteration 40400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.901E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.272263E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | 
time (ms) | forward: 566.62 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 samples/sec: 6.593 | iteration 40500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.901E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.291687E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.586 | iteration 40600/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.900E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.314083E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.39 | backward: 1805.63 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 samples/sec: 6.594 | iteration 40700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.900E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.286881E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 samples/sec: 6.590 | iteration 40800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.899E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.289744E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.596 | iteration 40900/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.899E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.294540E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 samples/sec: 6.591 | iteration 41000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.898E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.295363E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.84 ---------------------------------------------------------------------------------------------------------- validation results at iteration 41000 | lm_loss value: 3.346437E+00 | lm_loss_ppl value: 2.840137E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 41100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.898E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.280298E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.85 samples/sec: 6.590 | iteration 41200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.897E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.297362E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.79 samples/sec: 6.588 | iteration 41300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.897E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.276998E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.57 | backward: 1805.60 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.76 samples/sec: 6.595 | iteration 41400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.896E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.291323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.82 samples/sec: 6.588 | iteration 41500/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.895E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.285616E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1806.03 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.75 samples/sec: 6.596 | iteration 41600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.895E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.294633E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 samples/sec: 6.590 | iteration 41700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.894E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.284065E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.75 samples/sec: 6.592 | iteration 41800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.894E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.284609E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.75 samples/sec: 6.593 | iteration 41900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.893E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.290269E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 samples/sec: 6.587 | iteration 42000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.893E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.294609E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1806.06 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 42000 | lm_loss value: 3.268299E+00 | lm_loss_ppl value: 2.626662E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 42100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 2.892E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.274986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.83 samples/sec: 6.585 | iteration 42200/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.892E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.296068E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.86 | backward: 1806.22 | backward-backward: 1806.20 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.83 samples/sec: 6.596 | iteration 42300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.891E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.279047E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.587 | iteration 42400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.891E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.275681E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.85 samples/sec: 6.589 | iteration 42500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.890E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.294459E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 samples/sec: 6.596 | iteration 42600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.889E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.276050E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.75 samples/sec: 6.586 | iteration 42700/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.889E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.287803E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1806.28 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 samples/sec: 6.593 | iteration 42800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.888E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.282441E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.585 | iteration 42900/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.888E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.278379E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1806.71 | backward-backward: 1806.69 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.78 samples/sec: 6.594 | iteration 43000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.887E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.297359E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 43000 | lm_loss value: 3.279119E+00 | lm_loss_ppl value: 2.655236E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 43100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.887E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.279263E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.49 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.83 samples/sec: 6.590 | iteration 43200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.886E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.286079E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 samples/sec: 6.591 | iteration 43300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.886E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.270399E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 samples/sec: 6.585 | iteration 43400/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.885E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.286746E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.80 samples/sec: 6.593 | iteration 43500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.884E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.279970E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 samples/sec: 6.589 | iteration 43600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.884E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.279883E+00 | loss scale: 65536.0 | number of skipped 
iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1806.58 | backward-backward: 1806.56 | backward-allreduce: 0.00 | optimizer: 54.72 | batch generator: 0.79 samples/sec: 6.592 | iteration 43700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.883E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.291905E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82 samples/sec: 6.585 | iteration 43800/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.883E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.269002E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1806.84 | backward-backward: 1806.82 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.591 | iteration 43900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.882E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.297493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.81 samples/sec: 6.590 | iteration 44000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.882E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.270192E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 44000 | lm_loss value: 3.239417E+00 
| lm_loss_ppl value: 2.551883E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 44100/ 320000 | elapsed time per iteration (ms): 2486.4 | learning rate: 2.881E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.262809E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1806.24 | backward-backward: 1806.22 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.95 samples/sec: 6.596 | iteration 44200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.880E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.273385E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.78 samples/sec: 6.585 | iteration 44300/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.880E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256344E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1806.54 | backward-backward: 1806.52 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.78 samples/sec: 6.595 | iteration 44400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.879E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.270520E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.81 samples/sec: 6.588 | iteration 44500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.879E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.268042E+00 | loss scale: 16384.0 | number of skipped 
iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1806.00 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.594 | iteration 44600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.878E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.278680E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 samples/sec: 6.593 | iteration 44700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.878E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.259483E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.81 samples/sec: 6.591 | iteration 44800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.877E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.256126E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76 samples/sec: 6.596 | iteration 44900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.876E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260494E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.80 samples/sec: 6.586 | iteration 45000/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.876E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.297653E+00 | 
loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1805.75 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.97 ---------------------------------------------------------------------------------------------------------- validation results at iteration 45000 | lm_loss value: 3.254574E+00 | lm_loss_ppl value: 2.590856E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 45100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 2.875E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.277897E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1802.75 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.87 samples/sec: 6.592 | iteration 45200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.875E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263303E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 samples/sec: 6.591 | iteration 45300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.874E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.290665E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.79 samples/sec: 6.593 | iteration 45400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.873E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.280866E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 samples/sec: 6.590 | iteration 45500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.873E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.259130E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.87 samples/sec: 6.592 | iteration 45600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.872E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.273474E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 56.58 | batch generator: 0.80 samples/sec: 6.585 | iteration 45700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.872E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263455E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1806.20 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.81 samples/sec: 6.590 | iteration 45800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.871E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.268053E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.84 samples/sec: 6.586 | iteration 45900/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.870E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.283255E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1806.40 | backward-backward: 1806.38 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81 samples/sec: 6.594 | iteration 46000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.870E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.257815E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 46000 | lm_loss value: 3.240120E+00 | lm_loss_ppl value: 2.553678E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 46100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 2.869E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.260155E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.88 samples/sec: 6.588 | iteration 46200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.869E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.286548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.81 samples/sec: 6.595 | iteration 46300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.868E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.263115E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 samples/sec: 6.586 | iteration 46400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.867E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.291322E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1806.35 | backward-backward: 1806.33 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 samples/sec: 6.595 | iteration 46500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.867E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.255900E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.68 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.78 samples/sec: 6.586 | iteration 46600/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.866E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.249855E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1806.58 | backward-backward: 1806.56 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.590 | iteration 46700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.866E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.270190E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.81 samples/sec: 6.591 | iteration 46800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.865E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.280140E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.78 samples/sec: 6.587 | iteration 46900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.864E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.263974E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 56.45 | batch generator: 0.77 samples/sec: 6.595 | iteration 47000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.864E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.256031E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 47000 | lm_loss value: 3.204565E+00 | lm_loss_ppl value: 2.464478E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.431 | iteration 47100/ 320000 | elapsed time per iteration (ms): 2488.0 | learning rate: 2.863E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.275240E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.50 | backward: 1806.74 | backward-backward: 1806.71 | backward-allreduce: 0.00 | optimizer: 56.51 | batch generator: 1.03 samples/sec: 6.595 | iteration 47200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.863E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.256296E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 samples/sec: 6.588 | iteration 47300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.862E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.267269E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1806.07 | backward-backward: 1806.04 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 samples/sec: 6.592 | iteration 47400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.861E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.261505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.591 | iteration 47500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.861E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.281366E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.95 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.592 | iteration 47600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.860E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.268159E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.75 samples/sec: 6.593 | iteration 47700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.859E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260878E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.86 samples/sec: 6.585 | iteration 47800/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 2.859E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.258181E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1806.56 | backward-backward: 1806.53 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.77 samples/sec: 6.596 | iteration 47900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.858E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.257437E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 samples/sec: 6.589 | iteration 48000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.858E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.265240E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.54 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 48000 | lm_loss value: 3.280769E+00 | lm_loss_ppl value: 2.659623E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 48100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.857E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.266076E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.85 samples/sec: 6.591 | iteration 48200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.856E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.268269E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.79 samples/sec: 6.587 | iteration 48300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.856E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.242260E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.44 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 samples/sec: 6.596 | iteration 48400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.855E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.257613E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 samples/sec: 6.586 | iteration 48500/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.854E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.249594E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1806.40 | backward-backward: 1806.37 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 samples/sec: 6.594 | iteration 48600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.854E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.259575E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 samples/sec: 6.589 | iteration 48700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.853E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.278095E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.81 samples/sec: 6.593 | iteration 48800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.853E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.245847E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.77 samples/sec: 6.587 | iteration 48900/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.852E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.253952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1805.80 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.80 samples/sec: 6.591 | iteration 49000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.851E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.253908E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1804.68 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.82 ---------------------------------------------------------------------------------------------------------- validation results at 
iteration 49000 | lm_loss value: 3.155888E+00 | lm_loss_ppl value: 2.347386E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 49100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 2.851E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.239802E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.86 samples/sec: 6.589 | iteration 49200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.850E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.260440E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.76 samples/sec: 6.596 | iteration 49300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.849E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.250363E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.76 samples/sec: 6.589 | iteration 49400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.849E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.253287E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 samples/sec: 6.596 | iteration 49500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.848E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.253913E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.49 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 samples/sec: 6.591 | iteration 49600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.847E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.254687E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.80 samples/sec: 6.593 | iteration 49700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.847E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.233920E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.83 samples/sec: 6.593 | iteration 49800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.846E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.275487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.588 | iteration 49900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.846E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.239694E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.98 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 samples/sec: 6.593 | iteration 50000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.845E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.233346E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.80 ---------------------------------------------------------------------------------------------------------- validation results at iteration 50000 | lm_loss value: 3.290895E+00 | lm_loss_ppl value: 2.686691E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.229 | iteration 50100/ 320000 | elapsed time per iteration (ms): 2568.5 | learning rate: 2.844E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.242060E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.89 | backward: 1807.21 | backward-backward: 1807.18 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.87 samples/sec: 6.593 | iteration 50200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.844E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.250886E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 samples/sec: 6.588 | iteration 50300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.843E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.264590E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.591 | iteration 50400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.842E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.238495E+00 | loss 
scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.79 samples/sec: 6.596 | iteration 50500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.842E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.255937E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 samples/sec: 6.587 | iteration 50600/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.841E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.245934E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.84 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.77 samples/sec: 6.597 | iteration 50700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.840E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.246624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.76 samples/sec: 6.590 | iteration 50800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.840E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.249048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 samples/sec: 6.593 | iteration 50900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.839E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.260016E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 samples/sec: 6.596 | iteration 51000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.838E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.253366E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.80 ---------------------------------------------------------------------------------------------------------- validation results at iteration 51000 | lm_loss value: 3.196629E+00 | lm_loss_ppl value: 2.444998E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.433 | iteration 51100/ 320000 | elapsed time per iteration (ms): 2487.0 | learning rate: 2.838E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.256830E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1806.59 | backward-backward: 1806.57 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.86 samples/sec: 6.593 | iteration 51200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.837E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.239461E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.91 samples/sec: 6.589 | iteration 51300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.836E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.242321E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 samples/sec: 6.592 | iteration 51400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.836E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.247846E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.80 samples/sec: 6.599 | iteration 51500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.835E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.237466E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 54.47 | batch generator: 0.82 samples/sec: 6.588 | iteration 51600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.834E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.252052E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.39 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.83 samples/sec: 6.597 | iteration 51700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.834E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.242857E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1802.71 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.87 samples/sec: 6.593 | iteration 51800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.833E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.236541E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.590 | iteration 51900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.832E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.248896E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 samples/sec: 6.597 | iteration 52000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.832E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.268438E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 52000 | lm_loss value: 3.260769E+00 | lm_loss_ppl value: 2.606959E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 52100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.831E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.259160E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.92 samples/sec: 6.592 | iteration 52200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.830E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.267081E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.87 samples/sec: 6.593 | iteration 52300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.830E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.243748E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.592 | iteration 52400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.829E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.255930E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.85 samples/sec: 6.595 | iteration 52500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.828E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.227540E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 samples/sec: 6.588 | iteration 52600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.827E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.251512E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.78 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.597 | iteration 52700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.827E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.259548E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 samples/sec: 6.590 | iteration 52800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.826E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.229273E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.90 samples/sec: 6.593 | iteration 52900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.825E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.247198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.85 samples/sec: 6.594 | iteration 53000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.825E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.256194E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 53000 | lm_loss value: 3.253098E+00 | lm_loss_ppl value: 2.587036E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 53100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 2.824E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.250617E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.60 | backward: 1805.45 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.86 samples/sec: 6.596 | iteration 53200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.823E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.255753E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.82 samples/sec: 6.586 | iteration 53300/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.823E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.251505E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.78 samples/sec: 6.594 | iteration 53400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.822E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.241434E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 samples/sec: 6.597 | iteration 53500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.821E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225008E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 54.63 | batch generator: 0.82 samples/sec: 6.588 | iteration 53600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.821E-04 | approx flops per GPU: 
40.9TFLOPS | lm_loss: 3.240923E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.20 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81 samples/sec: 6.596 | iteration 53700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.820E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.239907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.591 | iteration 53800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.819E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.249483E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.79 samples/sec: 6.595 | iteration 53900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.818E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232355E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 samples/sec: 6.596 | iteration 54000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.818E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.234094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.78 
---------------------------------------------------------------------------------------------------------- validation results at iteration 54000 | lm_loss value: 3.219492E+00 | lm_loss_ppl value: 2.501541E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 54100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.817E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.229659E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.85 samples/sec: 6.599 | iteration 54200/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.816E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.248117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.76 samples/sec: 6.587 | iteration 54300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.816E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.248514E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.79 samples/sec: 6.594 | iteration 54400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.815E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.271505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.596 | iteration 54500/ 320000 | 
elapsed time per iteration (ms): 2425.6 | learning rate: 2.814E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.237229E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.82 samples/sec: 6.591 | iteration 54600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.814E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233646E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 samples/sec: 6.597 | iteration 54700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.813E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232425E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 samples/sec: 6.590 | iteration 54800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.812E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.243950E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.85 samples/sec: 6.595 | iteration 54900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.811E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.254999E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 
samples/sec: 6.598 | iteration 55000/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.811E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.236828E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 54.37 | batch generator: 0.80 ---------------------------------------------------------------------------------------------------------- validation results at iteration 55000 | lm_loss value: 3.212023E+00 | lm_loss_ppl value: 2.482925E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 55100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 2.810E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.241522E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.60 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.92 samples/sec: 6.594 | iteration 55200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.809E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.260769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 samples/sec: 6.590 | iteration 55300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.809E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.234917E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.86 samples/sec: 6.589 | iteration 55400/ 320000 | 
elapsed time per iteration (ms): 2428.4 | learning rate: 2.808E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233477E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.76 samples/sec: 6.598 | iteration 55500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.807E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.239772E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.588 | iteration 55600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.806E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.250387E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82 samples/sec: 6.597 | iteration 55700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.806E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.228537E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.77 samples/sec: 6.594 | iteration 55800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.805E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.233900E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 
samples/sec: 6.590 | iteration 55900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.804E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.217724E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.80 samples/sec: 6.599 | iteration 56000/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.803E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.237220E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1802.79 | backward-backward: 1802.76 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.82 ---------------------------------------------------------------------------------------------------------- validation results at iteration 56000 | lm_loss value: 3.269206E+00 | lm_loss_ppl value: 2.629046E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 56100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.803E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.249896E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.84 samples/sec: 6.597 | iteration 56200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.802E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.218255E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.20 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.592 | iteration 56300/ 320000 | 
elapsed time per iteration (ms): 2427.1 | learning rate: 2.801E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.243900E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.593 | iteration 56400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.801E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.244637E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.593 | iteration 56500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.800E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.235101E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.81 samples/sec: 6.588 | iteration 56600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.799E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.252615E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.71 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 samples/sec: 6.594 | iteration 56700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.798E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.231829E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.87 
samples/sec: 6.594 | iteration 56800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.798E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.220171E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 samples/sec: 6.594 | iteration 56900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.797E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225293E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1804.80 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 54.55 | batch generator: 0.80 samples/sec: 6.598 | iteration 57000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.796E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.256384E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.11 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 57000 | lm_loss value: 3.165028E+00 | lm_loss_ppl value: 2.368941E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 57100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 2.795E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.240739E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.86 samples/sec: 6.593 | iteration 57200/ 320000 | 
elapsed time per iteration (ms): 2427.0 | learning rate: 2.795E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.221743E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.595 | iteration 57300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.794E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.227459E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.590 | iteration 57400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.793E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.221999E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.81 samples/sec: 6.597 | iteration 57500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.792E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.230146E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.588 | iteration 57600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.792E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.225335E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 56.53 | batch generator: 0.80 
samples/sec: 6.593 | iteration 57700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.791E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.222803E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 samples/sec: 6.596 | iteration 57800/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.790E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.227372E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.588 | iteration 57900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.789E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.219656E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.83 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.596 | iteration 58000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.789E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.238712E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 58000 | lm_loss value: 3.271872E+00 | lm_loss_ppl value: 2.636064E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 58100/ 320000 | 
elapsed time per iteration (ms): 2484.3 | learning rate: 2.788E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.221659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.90 samples/sec: 6.591 | iteration 58200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.787E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.241850E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.597 | iteration 58300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.786E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.240576E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.97 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 samples/sec: 6.588 | iteration 58400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.786E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.227897E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.92 samples/sec: 6.594 | iteration 58500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.785E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.223883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 
samples/sec: 6.596 | iteration 58600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.784E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.211090E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.78 samples/sec: 6.585 | iteration 58700/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.783E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.217493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.65 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.80 samples/sec: 6.597 | iteration 58800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.783E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.233257E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.591 | iteration 58900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.782E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.226630E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.591 | iteration 59000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.781E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208111E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | 
optimizer: 55.63 | batch generator: 0.83 ---------------------------------------------------------------------------------------------------------- validation results at iteration 59000 | lm_loss value: 3.163083E+00 | lm_loss_ppl value: 2.364338E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 59100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 2.780E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.232128E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.84 samples/sec: 6.591 | iteration 59200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.780E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.233714E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 samples/sec: 6.596 | iteration 59300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.779E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.213171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.596 | iteration 59400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.778E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.220905E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.17 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 
samples/sec: 6.590 | iteration 59500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.777E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.226154E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 samples/sec: 6.597 | iteration 59600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.776E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225659E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.77 samples/sec: 6.591 | iteration 59700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.776E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.213723E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.41 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.84 samples/sec: 6.590 | iteration 59800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.775E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.223622E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.82 samples/sec: 6.597 | iteration 59900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.774E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.231256E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | 
optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.590 | iteration 60000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.773E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.225829E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 ---------------------------------------------------------------------------------------------------------- validation results at iteration 60000 | lm_loss value: 3.177427E+00 | lm_loss_ppl value: 2.398495E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.248 | iteration 60100/ 320000 | elapsed time per iteration (ms): 2561.0 | learning rate: 2.773E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.222048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.84 samples/sec: 6.593 | iteration 60200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.772E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.236669E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81 samples/sec: 6.591 | iteration 60300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.771E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.217015E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.76 
samples/sec: 6.597 | iteration 60400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.770E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.223794E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.85
samples/sec: 6.592 | iteration 60500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.212409E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78
samples/sec: 6.592 | iteration 60600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.226401E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77
samples/sec: 6.595 | iteration 60700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.768E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.206890E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.24 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.80
samples/sec: 6.588 | iteration 60800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.767E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.237561E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.92 | backward-backward: 1805.90 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80
samples/sec: 6.595 | iteration 60900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.766E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.220703E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.78
samples/sec: 6.593 | iteration 61000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.766E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.221794E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.83
----------------------------------------------------------------------------------------------------------
validation results at iteration 61000 | lm_loss value: 3.196951E+00 | lm_loss_ppl value: 2.445784E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 61100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.765E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.238845E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.86
samples/sec: 6.597 | iteration 61200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.764E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
samples/sec: 6.590 | iteration 61300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.763E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.243437E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80
samples/sec: 6.595 | iteration 61400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.762E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.213774E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.77
samples/sec: 6.597 | iteration 61500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.762E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.227321E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79
samples/sec: 6.590 | iteration 61600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.761E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220639E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80
samples/sec: 6.598 | iteration 61700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.760E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.205122E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.83
samples/sec: 6.595 | iteration 61800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.759E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.212227E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79
samples/sec: 6.592 | iteration 61900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.758E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.243146E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78
samples/sec: 6.595 | iteration 62000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.758E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219973E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.81
----------------------------------------------------------------------------------------------------------
validation results at iteration 62000 | lm_loss value: 3.216501E+00 | lm_loss_ppl value: 2.494070E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.443 | iteration 62100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 2.757E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.221018E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.87
samples/sec: 6.593 | iteration 62200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.756E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.230749E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.80
samples/sec: 6.597 | iteration 62300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.755E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.232782E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77
samples/sec: 6.589 | iteration 62400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.754E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.224240E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76
samples/sec: 6.591 | iteration 62500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.754E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214712E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.78
samples/sec: 6.595 | iteration 62600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.753E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.224613E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80
samples/sec: 6.589 | iteration 62700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.752E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.228999E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76
samples/sec: 6.595 | iteration 62800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.751E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.231889E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.82
samples/sec: 6.596 | iteration 62900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.750E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.207621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.76
samples/sec: 6.588 | iteration 63000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.749E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.216727E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79
----------------------------------------------------------------------------------------------------------
validation results at iteration 63000 | lm_loss value: 3.193329E+00 | lm_loss_ppl value: 2.436941E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 63100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.749E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.214647E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.88
samples/sec: 6.592 | iteration 63200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.748E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.217624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77
samples/sec: 6.589 | iteration 63300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.747E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.219663E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80
samples/sec: 6.596 | iteration 63400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.746E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.221066E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79
samples/sec: 6.591 | iteration 63500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.745E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.220538E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.91
samples/sec: 6.593 | iteration 63600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.745E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.228148E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76
samples/sec: 6.598 | iteration 63700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.744E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.203540E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79
samples/sec: 6.590 | iteration 63800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.743E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214251E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79
samples/sec: 6.594 | iteration 63900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.742E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219657E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77
samples/sec: 6.596 | iteration 64000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.741E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.231366E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 64000 | lm_loss value: 3.146246E+00 | lm_loss_ppl value: 2.324863E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.436 | iteration 64100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 2.740E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.231191E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.84
samples/sec: 6.599 | iteration 64200/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.740E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.209615E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78
samples/sec: 6.594 | iteration 64300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.739E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.225256E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78
samples/sec: 6.592 | iteration 64400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.738E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.213970E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.80
samples/sec: 6.597 | iteration 64500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.737E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200124E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.87 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76
samples/sec: 6.592 | iteration 64600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.736E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.218416E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.76
samples/sec: 6.593 | iteration 64700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.735E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.222168E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77
samples/sec: 6.596 | iteration 64800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.735E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.223291E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80
samples/sec: 6.589 | iteration 64900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.734E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208735E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.45 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80
samples/sec: 6.597 | iteration 65000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.733E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.216362E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 65000 | lm_loss value: 3.202994E+00 | lm_loss_ppl value: 2.460609E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.443 | iteration 65100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 2.732E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.186476E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.83
samples/sec: 6.587 | iteration 65200/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.731E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214738E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.84
samples/sec: 6.598 | iteration 65300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.730E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.226334E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.87 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.77
samples/sec: 6.592 | iteration 65400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.730E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.221592E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.592 | iteration 65500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.729E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.218912E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.598 | iteration 65600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.728E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.217648E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.97 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.80
samples/sec: 6.588 | iteration 65700/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.727E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.195191E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78
samples/sec: 6.596 | iteration 65800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.726E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.203148E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.77
samples/sec: 6.596 | iteration 65900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.725E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.231768E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76
samples/sec: 6.590 | iteration 66000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.725E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.219772E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.80
----------------------------------------------------------------------------------------------------------
validation results at iteration 66000 | lm_loss value: 3.201762E+00 | lm_loss_ppl value: 2.457579E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 66100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.724E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.203145E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84
samples/sec: 6.594 | iteration 66200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.723E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.194385E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.78
samples/sec: 6.592 | iteration 66300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.722E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204377E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.79
samples/sec: 6.591 | iteration 66400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.721E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.223768E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77
samples/sec: 6.589 | iteration 66500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.720E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.213112E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78
samples/sec: 6.587 | iteration 66600/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.719E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.227490E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.90 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77
samples/sec: 6.597 | iteration 66700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.719E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.223289E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1802.86 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79
samples/sec: 6.595 | iteration 66800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.718E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.201969E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80
samples/sec: 6.590 | iteration 66900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.717E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208502E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78
samples/sec: 6.588 | iteration 67000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.716E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.185471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.41 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.77
----------------------------------------------------------------------------------------------------------
validation results at iteration 67000 | lm_loss value: 3.238892E+00 | lm_loss_ppl value: 2.550544E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.444 | iteration 67100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.715E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.198860E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.87
samples/sec: 6.593 | iteration 67200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.714E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.206497E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.81
samples/sec: 6.589 | iteration 67300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.713E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203495E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.80 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.76
samples/sec: 6.588 | iteration 67400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.713E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.209557E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.74
samples/sec: 6.589 | iteration 67500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.712E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203265E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.86
samples/sec: 6.593 | iteration 67600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.711E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.206510E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 1.05
samples/sec: 6.596 | iteration 67700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.710E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.201647E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77
samples/sec: 6.589 | iteration 67800/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.709E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210469E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.81
samples/sec: 6.590 | iteration 67900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.708E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210548E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.97 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.88
samples/sec: 6.591 | iteration 68000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.707E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.215033E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 54.86 | batch generator: 0.80
----------------------------------------------------------------------------------------------------------
validation results at iteration 68000 | lm_loss value: 3.158133E+00 | lm_loss_ppl value: 2.352663E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.438 | iteration 68100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.706E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.197274E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.90
samples/sec: 6.598 | iteration 68200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.706E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219820E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.85 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.76
samples/sec: 6.593 | iteration 68300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.705E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.199157E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.75
samples/sec: 6.587 | iteration 68400/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.704E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.197353E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76
samples/sec: 6.589 | iteration 68500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.703E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.218870E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.79
samples/sec: 6.591 | iteration 68600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.702E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192111E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.88
samples/sec: 6.592 | iteration 68700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.701E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.205505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78
samples/sec: 6.591 | iteration 68800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.700E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.196254E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79
samples/sec: 6.592 | iteration 68900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.699E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.203711E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78
samples/sec: 6.593 | iteration 69000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.699E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.192998E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79
----------------------------------------------------------------------------------------------------------
validation results at iteration 69000 | lm_loss value: 3.178095E+00 | lm_loss_ppl value: 2.400100E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 69100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 2.698E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.211982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.89
samples/sec: 6.594 | iteration 69200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.697E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196714E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78
samples/sec: 6.593 | iteration 69300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.696E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.214374E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77
samples/sec: 6.594 | iteration 69400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.695E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.55 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.90 samples/sec: 6.591 | iteration 69500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.694E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.208918E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.75 samples/sec: 6.587 | iteration 69600/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.693E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210676E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.82 samples/sec: 6.594 | iteration 69700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.692E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186623E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.80 samples/sec: 6.591 | iteration 69800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.691E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191218E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 samples/sec: 6.592 | iteration 69900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.691E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.222297E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.595 | iteration 70000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.690E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.210601E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 54.92 | batch generator: 0.77 ---------------------------------------------------------------------------------------------------------- validation results at iteration 70000 | lm_loss value: 3.214946E+00 | lm_loss_ppl value: 2.490195E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.245 | iteration 70100/ 320000 | elapsed time per iteration (ms): 2562.0 | learning rate: 2.689E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.207356E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.85 samples/sec: 6.593 | iteration 70200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.688E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.207410E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.593 | iteration 70300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.687E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.219283E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.30 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.593 | iteration 70400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.686E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.193169E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 samples/sec: 6.593 | iteration 70500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.685E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200254E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 samples/sec: 6.595 | iteration 70600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.684E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.86 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.595 | iteration 70700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.683E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.216900E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.81 samples/sec: 6.590 | iteration 70800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.682E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.199843E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 samples/sec: 6.587 | iteration 70900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.682E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210083E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1805.51 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.80 samples/sec: 6.595 | iteration 71000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.681E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.198021E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.63 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 54.59 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 71000 | lm_loss value: 3.225305E+00 | lm_loss_ppl value: 2.516126E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 71100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.680E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.194190E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.85 samples/sec: 6.587 | iteration 71200/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.679E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.204758E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.70 | backward: 1805.94 | backward-backward: 1805.92 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.80 samples/sec: 6.594 | iteration 71300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.678E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.202560E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.93 samples/sec: 6.593 | iteration 71400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.677E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189879E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.86 samples/sec: 6.591 | iteration 71500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.676E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.179264E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 samples/sec: 6.597 | iteration 71600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.675E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.226166E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.590 | iteration 71700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.674E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.194912E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.88 samples/sec: 6.594 | iteration 71800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.673E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.207286E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.75 samples/sec: 6.595 | iteration 71900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.672E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.201055E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 samples/sec: 6.590 | iteration 72000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.671E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.209268E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 72000 | lm_loss value: 3.158572E+00 | lm_loss_ppl value: 2.353696E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 72100/ 320000 | elapsed time per iteration (ms): 2482.6 | learning rate: 2.671E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.184396E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 566.12 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.83 samples/sec: 6.590 | iteration 72200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.670E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.189465E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.597 | iteration 72300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.669E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204576E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.76 samples/sec: 6.592 | iteration 72400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.668E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191662E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.89 samples/sec: 6.590 | iteration 72500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.667E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.213322E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.90 samples/sec: 6.599 | iteration 72600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.666E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.207765E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.78 samples/sec: 6.589 | iteration 72700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.665E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.199929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.76 samples/sec: 6.597 | iteration 72800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.664E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.203205E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.77 samples/sec: 6.592 | iteration 72900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.663E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.195619E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.77 samples/sec: 6.593 | iteration 73000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.662E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196782E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 73000 | lm_loss value: 3.222079E+00 
| lm_loss_ppl value: 2.508021E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.449 | iteration 73100/ 320000 | elapsed time per iteration (ms): 2480.9 | learning rate: 2.661E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.192364E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.92 | backward: 1802.56 | backward-backward: 1802.54 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.85 samples/sec: 6.591 | iteration 73200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.660E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.206586E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.595 | iteration 73300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.659E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.194344E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.19 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 samples/sec: 6.592 | iteration 73400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.659E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.218098E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.78 samples/sec: 6.593 | iteration 73500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.658E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196882E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.596 | iteration 73600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.657E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.190002E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.592 | iteration 73700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.656E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189712E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.598 | iteration 73800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.655E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.192291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1802.91 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78 samples/sec: 6.589 | iteration 73900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.654E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.214104E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.594 | iteration 74000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.653E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191720E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.73 ---------------------------------------------------------------------------------------------------------- validation results at iteration 74000 | lm_loss value: 3.152706E+00 | lm_loss_ppl value: 2.339930E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 74100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 2.652E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.210204E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.86 samples/sec: 6.592 | iteration 74200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.651E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.199562E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 samples/sec: 6.597 | iteration 74300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.650E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.202240E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 samples/sec: 6.589 | iteration 74400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.649E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.210505E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.46 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 samples/sec: 6.596 | iteration 74500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.648E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196573E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.590 | iteration 74600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.647E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.187168E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77 samples/sec: 6.594 | iteration 74700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.646E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.187488E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.86 samples/sec: 6.593 | iteration 74800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.645E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200475E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.588 | iteration 74900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.644E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.182850E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.593 | iteration 75000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.643E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.187916E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 75000 | lm_loss value: 3.214430E+00 | lm_loss_ppl value: 2.488909E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 75100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 2.642E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.182092E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.86 samples/sec: 6.595 | iteration 75200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.642E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.205387E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 samples/sec: 6.591 | iteration 75300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.641E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.186383E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1805.38 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.590 | iteration 75400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.640E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.184060E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 samples/sec: 6.596 | iteration 75500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.639E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.188686E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 samples/sec: 6.587 | iteration 75600/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.638E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.192377E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.31 | backward: 1805.62 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.598 | iteration 75700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.637E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.193410E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 samples/sec: 6.591 | iteration 75800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.636E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.196602E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 samples/sec: 6.596 | iteration 75900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.635E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186381E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.74 samples/sec: 6.594 | iteration 76000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.634E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.175833E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.74 ---------------------------------------------------------------------------------------------------------- validation results at iteration 76000 | lm_loss value: 3.222251E+00 | lm_loss_ppl value: 2.508453E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 76100/ 320000 | elapsed time per iteration (ms): 2485.3 | learning rate: 2.633E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.176752E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.86 samples/sec: 6.596 | iteration 76200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.632E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189550E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1803.43 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.87 samples/sec: 6.587 | iteration 76300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.631E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.197633E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1806.12 | backward-backward: 1806.10 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.597 | iteration 76400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.630E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.189893E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.592 | iteration 76500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.629E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200304E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.79 samples/sec: 6.592 | iteration 76600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.628E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.204260E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.88 samples/sec: 6.591 | iteration 76700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.627E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.194413E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 samples/sec: 6.589 | iteration 76800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.626E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.202957E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 samples/sec: 6.591 | iteration 76900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.625E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.168351E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.84 samples/sec: 6.590 | iteration 77000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.624E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.195857E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.82 ---------------------------------------------------------------------------------------------------------- validation results at iteration 77000 | lm_loss value: 3.155496E+00 | lm_loss_ppl value: 2.346468E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 77100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.623E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.183968E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.83 samples/sec: 6.586 | iteration 77200/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.622E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.198158E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1806.10 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.79 samples/sec: 6.594 | iteration 77300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.621E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183985E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.80 samples/sec: 6.589 | iteration 77400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.620E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.203013E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.592 | iteration 77500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.619E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191857E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 samples/sec: 6.597 | iteration 77600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.618E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183832E+00 | 
loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.83 samples/sec: 6.590 | iteration 77700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.617E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.195821E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 samples/sec: 6.598 | iteration 77800/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.616E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.197376E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.03 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.591 | iteration 77900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.615E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.190579E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.10 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.80 samples/sec: 6.595 | iteration 78000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.614E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.175598E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1803.29 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at 
iteration 78000 | lm_loss value: 3.125446E+00 | lm_loss_ppl value: 2.277004E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 78100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 2.613E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.183948E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.83 samples/sec: 6.593 | iteration 78200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.612E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.180526E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 samples/sec: 6.591 | iteration 78300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.611E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188159E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.77 samples/sec: 6.589 | iteration 78400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.610E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.197767E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 samples/sec: 6.596 | iteration 78500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.609E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186751E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 samples/sec: 6.587 | iteration 78600/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.608E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.189009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.93 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 samples/sec: 6.594 | iteration 78700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.607E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.191260E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.587 | iteration 78800/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.606E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.181686E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1806.24 | backward-backward: 1806.22 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.79 samples/sec: 6.594 | iteration 78900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.605E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.196209E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.593 | iteration 79000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.604E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.188609E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 79000 | lm_loss value: 3.155281E+00 | lm_loss_ppl value: 2.345963E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 79100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 2.603E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.185381E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.84 samples/sec: 6.593 | iteration 79200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.602E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.180567E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1804.56 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 samples/sec: 6.587 | iteration 79300/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.601E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188322E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.85 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.595 | iteration 79400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.600E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.176060E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.591 | iteration 79500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.599E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188872E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.84 samples/sec: 6.592 | iteration 79600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.598E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166220E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.596 | iteration 79700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.597E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.184237E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.79 samples/sec: 6.590 | iteration 79800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.596E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.178338E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.598 | iteration 79900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.595E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.170538E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.591 | iteration 80000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.594E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188634E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 80000 | lm_loss value: 3.173866E+00 | lm_loss_ppl value: 2.389971E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.248 | iteration 80100/ 320000 | elapsed time per iteration (ms): 2560.7 | learning rate: 2.593E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.191499E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.86 samples/sec: 6.594 | iteration 80200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.592E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.187076E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.82 samples/sec: 6.591 | iteration 80300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.591E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.166912E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.595 | iteration 80400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.590E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.172651E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.83 samples/sec: 6.588 | iteration 80500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.589E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.171784E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.79 samples/sec: 6.599 | iteration 80600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.588E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163859E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1802.88 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 samples/sec: 6.593 | iteration 80700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.587E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.187343E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.594 | iteration 80800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.586E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.178166E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.57 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.78 samples/sec: 6.597 | iteration 80900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.585E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.182049E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.76 samples/sec: 6.589 | iteration 81000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.584E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.177003E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.71 | backward-backward: 1805.69 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 81000 | lm_loss value: 3.190201E+00 | lm_loss_ppl value: 2.429330E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 81100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.583E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.181110E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.84 samples/sec: 6.592 | iteration 81200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.582E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.200572E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 samples/sec: 6.593 | iteration 81300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.581E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.171389E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.76 samples/sec: 6.597 | iteration 81400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.580E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.178480E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76 samples/sec: 6.589 | iteration 81500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.579E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.196807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.81 samples/sec: 6.594 | iteration 81600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.578E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.178592E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.80 samples/sec: 6.591 | iteration 81700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.577E-04 | approx flops per GPU: 
40.9TFLOPS | lm_loss: 3.174253E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.74 samples/sec: 6.594 | iteration 81800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.576E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.194128E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 samples/sec: 6.594 | iteration 81900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.575E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.184395E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.84 samples/sec: 6.589 | iteration 82000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.574E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 ---------------------------------------------------------------------------------------------------------- validation results at iteration 82000 | lm_loss value: 3.169994E+00 | lm_loss_ppl value: 2.380734E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 82100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 2.573E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.158089E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.86 samples/sec: 6.590 | iteration 82200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.572E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.187859E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.597 | iteration 82300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.571E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.179200E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.76 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 54.65 | batch generator: 0.79 samples/sec: 6.594 | iteration 82400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.570E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166886E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 samples/sec: 6.590 | iteration 82500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.569E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.160280E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 samples/sec: 6.597 | iteration 82600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.568E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.182464E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 samples/sec: 6.587 | iteration 82700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.567E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.187014E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.77 samples/sec: 6.594 | iteration 82800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.566E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.178228E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.595 | iteration 82900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.565E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173566E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.590 | iteration 83000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.564E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.191358E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.50 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 
---------------------------------------------------------------------------------------------------------- validation results at iteration 83000 | lm_loss value: 3.164551E+00 | lm_loss_ppl value: 2.367811E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 83100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.563E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.160366E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.85 samples/sec: 6.591 | iteration 83200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.562E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.168655E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.82 samples/sec: 6.592 | iteration 83300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.561E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170126E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.82 samples/sec: 6.597 | iteration 83400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.560E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.174152E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.91 samples/sec: 6.587 | iteration 83500/ 320000 | 
elapsed time per iteration (ms): 2429.1 | learning rate: 2.559E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.163420E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.85 | backward-backward: 1805.83 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.76 samples/sec: 6.596 | iteration 83600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.558E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164569E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.49 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 samples/sec: 6.591 | iteration 83700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.557E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.188335E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 samples/sec: 6.589 | iteration 83800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.555E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.163058E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.79 samples/sec: 6.597 | iteration 83900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.554E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170161E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 
samples/sec: 6.590 | iteration 84000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.553E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.167448E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.27 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.75 ---------------------------------------------------------------------------------------------------------- validation results at iteration 84000 | lm_loss value: 3.111575E+00 | lm_loss_ppl value: 2.245638E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 84100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.552E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.180006E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.85 samples/sec: 6.595 | iteration 84200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.551E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.159954E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 samples/sec: 6.588 | iteration 84300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.550E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.161135E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.94 | backward-backward: 1805.92 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.81 samples/sec: 6.597 | iteration 84400/ 320000 | 
elapsed time per iteration (ms): 2425.4 | learning rate: 2.549E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183119E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 samples/sec: 6.595 | iteration 84500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.548E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.168308E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 samples/sec: 6.592 | iteration 84600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.547E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150393E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.86 samples/sec: 6.597 | iteration 84700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.546E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.90 samples/sec: 6.586 | iteration 84800/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.545E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.170983E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1805.84 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.78 
samples/sec: 6.598 | iteration 84900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.544E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.165421E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1802.73 | backward-backward: 1802.71 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.594 | iteration 85000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.543E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170595E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 85000 | lm_loss value: 3.171565E+00 | lm_loss_ppl value: 2.384478E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 85100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 2.542E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.171020E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.87 samples/sec: 6.597 | iteration 85200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.541E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.183347E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1803.60 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 samples/sec: 6.590 | iteration 85300/ 320000 | 
elapsed time per iteration (ms): 2428.0 | learning rate: 2.540E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.160792E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.77 samples/sec: 6.595 | iteration 85400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.539E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.171513E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.76 samples/sec: 6.596 | iteration 85500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.538E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173251E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.85 samples/sec: 6.590 | iteration 85600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.537E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.170909E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 samples/sec: 6.599 | iteration 85700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.535E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.167260E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1802.39 | backward-backward: 1802.37 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 
samples/sec: 6.594 | iteration 85800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.534E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164525E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81 samples/sec: 6.589 | iteration 85900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.533E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.167343E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.79 samples/sec: 6.596 | iteration 86000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.532E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.160208E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 86000 | lm_loss value: 3.211562E+00 | lm_loss_ppl value: 2.481783E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 86100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 2.531E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.168373E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.85 samples/sec: 6.594 | iteration 86200/ 320000 | 
elapsed time per iteration (ms): 2426.5 | learning rate: 2.530E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.177252E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.76 samples/sec: 6.595 | iteration 86300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.529E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.160843E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.79 samples/sec: 6.589 | iteration 86400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.528E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.165807E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1806.09 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.78 samples/sec: 6.597 | iteration 86500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.527E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170130E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 samples/sec: 6.592 | iteration 86600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.526E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161390E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 
samples/sec: 6.589 | iteration 86700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.525E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.159032E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.83 | backward-backward: 1805.81 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 samples/sec: 6.599 | iteration 86800/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 2.524E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166180E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1802.68 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.78 samples/sec: 6.592 | iteration 86900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.523E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.177769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 samples/sec: 6.590 | iteration 87000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.522E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.173921E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.82 ---------------------------------------------------------------------------------------------------------- validation results at iteration 87000 | lm_loss value: 3.197587E+00 | lm_loss_ppl value: 2.447342E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 87100/ 320000 | 
elapsed time per iteration (ms): 2481.7 | learning rate: 2.520E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.161129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.84 samples/sec: 6.588 | iteration 87200/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.519E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.161631E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.74 samples/sec: 6.595 | iteration 87300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.518E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164774E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.74 samples/sec: 6.598 | iteration 87400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.517E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.176385E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.14 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.77 samples/sec: 6.589 | iteration 87500/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.516E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.157504E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.22 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 
samples/sec: 6.597 | iteration 87600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.515E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150890E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.81 samples/sec: 6.592 | iteration 87700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.514E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.186179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.82 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.75 samples/sec: 6.589 | iteration 87800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.513E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.164232E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.76 samples/sec: 6.599 | iteration 87900/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.512E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154839E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.75 samples/sec: 6.590 | iteration 88000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.511E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.155442E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | 
optimizer: 55.96 | batch generator: 0.78 ---------------------------------------------------------------------------------------------------------- validation results at iteration 88000 | lm_loss value: 3.139551E+00 | lm_loss_ppl value: 2.309350E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 88100/ 320000 | elapsed time per iteration (ms): 2485.5 | learning rate: 2.510E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.157777E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.84 samples/sec: 6.597 | iteration 88200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.509E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.170030E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.591 | iteration 88300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.507E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.150869E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.593 | iteration 88400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.506E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.180996E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.82 
samples/sec: 6.597 | iteration 88500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.505E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.140287E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.78 samples/sec: 6.589 | iteration 88600/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.504E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.162487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 samples/sec: 6.597 | iteration 88700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.503E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155170E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 54.48 | batch generator: 0.78 samples/sec: 6.595 | iteration 88800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.502E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161862E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.85 samples/sec: 6.588 | iteration 88900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.501E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.166875E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | 
optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.596 | iteration 89000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.500E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155128E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76 ---------------------------------------------------------------------------------------------------------- validation results at iteration 89000 | lm_loss value: 3.168436E+00 | lm_loss_ppl value: 2.377028E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 89100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 2.499E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.157219E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.90 samples/sec: 6.588 | iteration 89200/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.498E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.160415E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.598 | iteration 89300/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.497E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.163226E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 
samples/sec: 6.594 | iteration 89400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.495E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150146E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 samples/sec: 6.590 | iteration 89500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.494E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.155840E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 samples/sec: 6.599 | iteration 89600/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.493E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155780E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.81 samples/sec: 6.592 | iteration 89700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.492E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164142E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.76 samples/sec: 6.594 | iteration 89800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.491E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.157707E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | 
optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.600 | iteration 89900/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.490E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138062E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.91 | backward: 1802.51 | backward-backward: 1802.48 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.74 samples/sec: 6.591 | iteration 90000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.489E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.161230E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.75 ---------------------------------------------------------------------------------------------------------- validation results at iteration 90000 | lm_loss value: 3.133420E+00 | lm_loss_ppl value: 2.295234E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.247 | iteration 90100/ 320000 | elapsed time per iteration (ms): 2561.3 | learning rate: 2.488E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.150463E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.13 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.86 samples/sec: 6.597 | iteration 90200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.487E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.159293E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 
samples/sec: 6.587 | iteration 90300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.485E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.151565E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.80 samples/sec: 6.595 | iteration 90400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.484E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138879E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 samples/sec: 6.598 | iteration 90500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.483E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.158046E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 54.84 | batch generator: 0.77 samples/sec: 6.591 | iteration 90600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.482E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.157926E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.81 samples/sec: 6.597 | iteration 90700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.481E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161741E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | 
optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.597 | iteration 90800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.480E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.164449E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.591 | iteration 90900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.479E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.164070E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 samples/sec: 6.597 | iteration 91000/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.478E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.162447E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.75 ---------------------------------------------------------------------------------------------------------- validation results at iteration 91000 | lm_loss value: 3.146791E+00 | lm_loss_ppl value: 2.326129E+01 | ---------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 91100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 2.477E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.132745E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.86 
samples/sec: 6.591 | iteration 91200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.475E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.134507E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80
samples/sec: 6.596 | iteration 91300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.474E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161000E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78
samples/sec: 6.591 | iteration 91400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.473E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.151769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.78
samples/sec: 6.592 | iteration 91500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.472E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150492E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81
samples/sec: 6.597 | iteration 91600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.471E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.151971E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.01 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79
samples/sec: 6.594 | iteration 91700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.470E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156445E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.77
samples/sec: 6.593 | iteration 91800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.469E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.167152E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.85 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.74
samples/sec: 6.597 | iteration 91900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.468E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156797E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80
samples/sec: 6.593 | iteration 92000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.466E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.161810E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.82
----------------------------------------------------------------------------------------------------------
validation results at iteration 92000 | lm_loss value: 3.139031E+00 | lm_loss_ppl value: 2.308149E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.439 | iteration 92100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 2.465E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.144295E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.85
samples/sec: 6.599 | iteration 92200/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.464E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153248E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77
samples/sec: 6.592 | iteration 92300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.463E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.143183E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75
samples/sec: 6.590 | iteration 92400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.462E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.149341E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77
samples/sec: 6.596 | iteration 92500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.461E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.147384E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.96 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78
samples/sec: 6.593 | iteration 92600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.460E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.158649E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.81
samples/sec: 6.591 | iteration 92700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.459E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.152407E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.95 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77
samples/sec: 6.599 | iteration 92800/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.457E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.166384E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 54.84 | batch generator: 0.79
samples/sec: 6.596 | iteration 92900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.456E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76
samples/sec: 6.591 | iteration 93000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.455E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.151213E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 93000 | lm_loss value: 3.150698E+00 | lm_loss_ppl value: 2.335236E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.447 | iteration 93100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 2.454E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 3.146280E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1802.87 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.87
samples/sec: 6.595 | iteration 93200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.453E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.157567E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.82
samples/sec: 6.591 | iteration 93300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.452E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.175349E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.74
samples/sec: 6.599 | iteration 93400/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.451E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154363E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1802.78 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.90
samples/sec: 6.591 | iteration 93500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.449E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.143394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.12 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78
samples/sec: 6.590 | iteration 93600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.448E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.154014E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.93 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.74
samples/sec: 6.597 | iteration 93700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.447E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154136E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1802.77 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80
samples/sec: 6.595 | iteration 93800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.446E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.139427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79
samples/sec: 6.590 | iteration 93900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.445E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.139304E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.65 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79
samples/sec: 6.594 | iteration 94000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.444E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.173400E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76
----------------------------------------------------------------------------------------------------------
validation results at iteration 94000 | lm_loss value: 3.153888E+00 | lm_loss_ppl value: 2.342697E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.444 | iteration 94100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 2.443E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.150541E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.85
samples/sec: 6.590 | iteration 94200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.441E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.133774E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77
samples/sec: 6.594 | iteration 94300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.440E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.157657E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81
samples/sec: 6.599 | iteration 94400/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 2.439E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152146E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1802.81 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.80
samples/sec: 6.590 | iteration 94500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.438E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.149305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.89
samples/sec: 6.589 | iteration 94600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.437E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.152496E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.75
samples/sec: 6.599 | iteration 94700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.436E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155341E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1802.85 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.81
samples/sec: 6.595 | iteration 94800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.435E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152647E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.78
samples/sec: 6.592 | iteration 94900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.433E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.129557E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.87
samples/sec: 6.599 | iteration 95000/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 2.432E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128496E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.96 | backward: 1802.76 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77
----------------------------------------------------------------------------------------------------------
validation results at iteration 95000 | lm_loss value: 3.124268E+00 | lm_loss_ppl value: 2.274325E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.443 | iteration 95100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 2.431E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.140984E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.88
samples/sec: 6.592 | iteration 95200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.430E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150500E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78
samples/sec: 6.599 | iteration 95300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.429E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.145917E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1802.72 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78
samples/sec: 6.596 | iteration 95400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.428E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.75
samples/sec: 6.591 | iteration 95500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.427E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.146064E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.01 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78
samples/sec: 6.596 | iteration 95600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.425E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.159815E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78
samples/sec: 6.594 | iteration 95700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.424E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.149319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.79
samples/sec: 6.590 | iteration 95800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.423E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.172126E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.63 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79
samples/sec: 6.594 | iteration 95900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.422E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.150391E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.49 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.79
samples/sec: 6.595 | iteration 96000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.421E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156203E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 96000 | lm_loss value: 3.111786E+00 | lm_loss_ppl value: 2.246112E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.439 | iteration 96100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 2.420E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.141287E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1805.60 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.85
samples/sec: 6.595 | iteration 96200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.418E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152064E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.81
samples/sec: 6.597 | iteration 96300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.417E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.154124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.88 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.591 | iteration 96400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.416E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.136651E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79
samples/sec: 6.590 | iteration 96500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.415E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.145542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.82
samples/sec: 6.599 | iteration 96600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.414E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126860E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.81 | backward: 1803.00 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77
samples/sec: 6.593 | iteration 96700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.413E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142593E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.84
samples/sec: 6.594 | iteration 96800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.411E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.156677E+00 | loss scale: 32768.0 | number of skipped iterations: 3 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 54.64 | batch generator: 0.78
samples/sec: 6.597 | iteration 96900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.410E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142737E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.80
samples/sec: 6.597 | iteration 97000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.409E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.136678E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.81
----------------------------------------------------------------------------------------------------------
validation results at iteration 97000 | lm_loss value: 3.049885E+00 | lm_loss_ppl value: 2.111291E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 97100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 2.408E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.143678E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81
samples/sec: 6.596 | iteration 97200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.407E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.139281E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.29 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79
samples/sec: 6.597 | iteration 97300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.406E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.145815E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.28 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.75
samples/sec: 6.590 | iteration 97400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.404E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.131417E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.81
samples/sec: 6.594 | iteration 97500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.403E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.151443E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.87
samples/sec: 6.600 | iteration 97600/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 2.402E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153010E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.97 | backward: 1802.50 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.80
samples/sec: 6.592 | iteration 97700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.401E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142023E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78
samples/sec: 6.592 | iteration 97800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.400E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.160511E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80
samples/sec: 6.595 | iteration 97900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.399E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152551E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.84
samples/sec: 6.593 | iteration 98000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.397E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134702E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 98000 | lm_loss value: 3.120373E+00 | lm_loss_ppl value: 2.265484E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.440 | iteration 98100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 2.396E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.165978E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81
samples/sec: 6.596 | iteration 98200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.395E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.140521E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.24 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80
samples/sec: 6.596 | iteration 98300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.394E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142148E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78
samples/sec: 6.590 | iteration 98400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.393E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.146277E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78
samples/sec: 6.596 | iteration 98500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.391E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.133373E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.78
samples/sec: 6.598 | iteration 98600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.390E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.153006E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.75
samples/sec: 6.589 | iteration 98700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.389E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.123730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.98
samples/sec: 6.594 | iteration 98800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.388E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.129089E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 54.57 | batch generator: 0.85
samples/sec: 6.597 | iteration 98900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.387E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148252E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78
samples/sec: 6.595 | iteration 99000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.386E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137578E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78
----------------------------------------------------------------------------------------------------------
validation results at iteration 99000 | lm_loss value: 3.094367E+00 | lm_loss_ppl value: 2.207326E+01 |
----------------------------------------------------------------------------------------------------------
samples/sec: 6.442 | iteration 99100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.384E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.159865E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.84
samples/sec: 6.596 | iteration 99200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.383E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148736E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80
samples/sec: 6.600 | iteration 99300/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.382E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.157631E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1802.37 | backward-backward: 1802.34 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81
samples/sec: 6.592 | iteration 99400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.381E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131637E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.86
samples/sec: 6.594 | iteration 99500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.380E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144533E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77
samples/sec: 6.593 | iteration 99600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.378E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148504E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.600 | iteration 99700/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.377E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131561E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.91 | backward: 1802.69 | backward-backward: 1802.67 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.76
samples/sec: 6.596 | iteration 99800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.376E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.155168E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1803.03 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78
samples/sec: 6.592 | iteration 99900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.375E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.137651E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.32 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.76
samples/sec: 6.593 | iteration 100000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.374E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125206E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78
-----------------------------------------------------------------------------------------------------------
validation results at iteration 100000 | lm_loss value: 3.103189E+00 | lm_loss_ppl value: 2.226886E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.247 | iteration 100100/ 320000 | elapsed time per iteration (ms): 2561.1 | learning rate: 2.372E-04 | approx flops per GPU: 38.8TFLOPS | lm_loss: 3.133826E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.85
samples/sec: 6.594 | iteration 100200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.371E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127592E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0
| time (ms) | forward: 566.36 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.75 samples/sec: 6.592 | iteration 100300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.370E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.145913E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.82 samples/sec: 6.594 | iteration 100400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.369E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.152013E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 samples/sec: 6.598 | iteration 100500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.368E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.149110E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.79 samples/sec: 6.591 | iteration 100600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.366E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.111824E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 samples/sec: 6.594 | iteration 100700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.365E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124661E+00 | loss scale: 32768.0 | number of 
skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.79 samples/sec: 6.599 | iteration 100800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.364E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.149614E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 samples/sec: 6.595 | iteration 100900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.363E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.113727E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.75 samples/sec: 6.592 | iteration 101000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.362E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144401E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 101000 | lm_loss value: 3.140973E+00 | lm_loss_ppl value: 2.312636E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 101100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 2.360E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.115204E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.67 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.99 samples/sec: 6.595 | iteration 101200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.359E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138934E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 56.36 | batch generator: 0.78 samples/sec: 6.592 | iteration 101300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.358E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.115951E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 samples/sec: 6.593 | iteration 101400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.357E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123527E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.85 samples/sec: 6.600 | iteration 101500/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.356E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.147403E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.87 | backward: 1802.67 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 samples/sec: 6.594 | iteration 101600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.354E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134050E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.81 samples/sec: 6.591 | iteration 101700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.353E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.137082E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.76 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.598 | iteration 101800/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.352E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130049E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 54.86 | batch generator: 0.79 samples/sec: 6.597 | iteration 101900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.351E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142708E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 samples/sec: 6.589 | iteration 102000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.350E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.114050E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1806.16 | backward-backward: 1806.14 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 102000 | 
lm_loss value: 3.138784E+00 | lm_loss_ppl value: 2.307579E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 102100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.348E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.115122E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.86 samples/sec: 6.598 | iteration 102200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.347E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.133423E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 samples/sec: 6.590 | iteration 102300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.346E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.132624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.63 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.78 samples/sec: 6.591 | iteration 102400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.345E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.115848E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 samples/sec: 6.595 | iteration 102500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.343E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130582E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.78 samples/sec: 6.597 | iteration 102600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.342E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.138290E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82 samples/sec: 6.591 | iteration 102700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.341E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.135483E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 samples/sec: 6.596 | iteration 102800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.340E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.134073E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 54.48 | batch generator: 0.78 samples/sec: 6.598 | iteration 102900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.339E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120694E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.97 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.84 samples/sec: 6.591 | iteration 103000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.337E-04 | approx flops per GPU: 
40.9TFLOPS | lm_loss: 3.134282E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 103000 | lm_loss value: 3.097703E+00 | lm_loss_ppl value: 2.214703E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 103100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 2.336E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.116528E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.84 samples/sec: 6.596 | iteration 103200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.335E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125468E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.83 samples/sec: 6.594 | iteration 103300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.334E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131291E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.86 samples/sec: 6.589 | iteration 103400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.332E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.115432E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.85 samples/sec: 6.595 | iteration 103500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.331E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.130763E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.75 samples/sec: 6.597 | iteration 103600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.330E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127283E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 samples/sec: 6.589 | iteration 103700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.329E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.126182E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 samples/sec: 6.592 | iteration 103800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.328E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.135748E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.73 samples/sec: 6.598 | iteration 103900/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.326E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.118567E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.73 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76 samples/sec: 6.593 | iteration 104000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.325E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126984E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1803.86 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 104000 | lm_loss value: 3.177735E+00 | lm_loss_ppl value: 2.399235E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 104100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 2.324E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.124503E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.91 samples/sec: 6.597 | iteration 104200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.323E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.142881E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.29 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.598 | iteration 104300/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.321E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.118152E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1802.60 | backward-backward: 1802.58 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.589 | iteration 104400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.320E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.130111E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.78 samples/sec: 6.595 | iteration 104500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.319E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120774E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 samples/sec: 6.598 | iteration 104600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 2.318E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.139033E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1802.86 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 samples/sec: 6.590 | iteration 104700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.317E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.141771E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 samples/sec: 6.592 | iteration 104800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.315E-04 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 3.123546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 samples/sec: 6.598 | iteration 104900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.314E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131695E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.88 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.79 samples/sec: 6.592 | iteration 105000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.313E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125592E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 105000 | lm_loss value: 3.133254E+00 | lm_loss_ppl value: 2.294853E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 105100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 2.312E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.146112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.92 samples/sec: 6.598 | iteration 105200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.310E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.119759E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.78 samples/sec: 6.593 | iteration 105300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.309E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.103782E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.589 | iteration 105400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.308E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.128111E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 samples/sec: 6.592 | iteration 105500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.307E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.144687E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.77 samples/sec: 6.593 | iteration 105600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.305E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.114608E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.78 samples/sec: 6.590 | iteration 105700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.304E-04 
| approx flops per GPU: 40.9TFLOPS | lm_loss: 3.130442E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.64 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.79 samples/sec: 6.597 | iteration 105800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.303E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.82 samples/sec: 6.592 | iteration 105900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.302E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122598E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.95 samples/sec: 6.591 | iteration 106000/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.300E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.136300E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.61 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 106000 | lm_loss value: 3.147155E+00 | lm_loss_ppl value: 2.326977E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 106100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 2.299E-04 | approx flops per GPU: 40.0TFLOPS | 
lm_loss: 3.140908E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.38 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.96 samples/sec: 6.590 | iteration 106200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.298E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.125402E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 samples/sec: 6.594 | iteration 106300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.297E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126628E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.595 | iteration 106400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.295E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112168E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.75 samples/sec: 6.590 | iteration 106500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.294E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.137392E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.76 samples/sec: 6.596 | iteration 106600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 
2.293E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.128005E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.79 samples/sec: 6.592 | iteration 106700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.292E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125280E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.81 samples/sec: 6.590 | iteration 106800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.290E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.119307E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 samples/sec: 6.597 | iteration 106900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.289E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.124781E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.77 samples/sec: 6.591 | iteration 107000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.288E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121125E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.31 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 
----------------------------------------------------------------------------------------------------------- validation results at iteration 107000 | lm_loss value: 3.129923E+00 | lm_loss_ppl value: 2.287221E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 107100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.287E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.130344E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.85 samples/sec: 6.593 | iteration 107200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.285E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109968E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.85 samples/sec: 6.588 | iteration 107300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.284E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.124724E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.63 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.77 samples/sec: 6.594 | iteration 107400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.283E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.129730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.77 samples/sec: 6.590 | iteration 107500/ 
320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.282E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.136299E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 samples/sec: 6.592 | iteration 107600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.280E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109465E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.66 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.81 samples/sec: 6.593 | iteration 107700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.279E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.148380E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.82 samples/sec: 6.589 | iteration 107800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.278E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.115858E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.41 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.595 | iteration 107900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.277E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.117443E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.67 | batch 
generator: 0.82 samples/sec: 6.595 | iteration 108000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.275E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.131894E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 54.49 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 108000 | lm_loss value: 3.119994E+00 | lm_loss_ppl value: 2.264625E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 108100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 2.274E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.128427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.86 samples/sec: 6.592 | iteration 108200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.273E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119238E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.75 samples/sec: 6.587 | iteration 108300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.272E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.128508E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.76 samples/sec: 6.594 | 
iteration 108400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.270E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119738E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 samples/sec: 6.587 | iteration 108500/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.269E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.115623E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.68 | backward-backward: 1805.65 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.77 samples/sec: 6.594 | iteration 108600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.268E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.123563E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 samples/sec: 6.588 | iteration 108700/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.267E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.123449E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1805.32 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.87 samples/sec: 6.590 | iteration 108800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.265E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.125409E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 56.47 
| batch generator: 0.95 samples/sec: 6.592 | iteration 108900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.264E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 samples/sec: 6.595 | iteration 109000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.263E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.125878E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 54.51 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 109000 | lm_loss value: 3.064795E+00 | lm_loss_ppl value: 2.143007E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 109100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 2.261E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.102434E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.86 samples/sec: 6.587 | iteration 109200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.260E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.119810E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 samples/sec: 6.599 
| iteration 109300/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 2.259E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.126929E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.73 samples/sec: 6.592 | iteration 109400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.258E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122780E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.593 | iteration 109500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.256E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.132538E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 samples/sec: 6.595 | iteration 109600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.255E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.118848E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 samples/sec: 6.587 | iteration 109700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.254E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.122750E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.53 | backward: 1805.18 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 
55.85 | batch generator: 0.80 samples/sec: 6.596 | iteration 109800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.253E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.115951E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 samples/sec: 6.586 | iteration 109900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.251E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121631E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.76 samples/sec: 6.596 | iteration 110000/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.250E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127298E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.83 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step10000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 110000 | lm_loss value: 3.134111E+00 | lm_loss_ppl value: 2.296822E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.205 | iteration 110100/ 320000 | elapsed time per iteration (ms): 2578.5 | learning rate: 2.249E-04 | approx flops per GPU: 38.5TFLOPS | lm_loss: 3.120486E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.98 | backward: 1808.95 | backward-backward: 1808.93 | 
backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.88 samples/sec: 6.596 | iteration 110200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.248E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119691E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.593 | iteration 110300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.246E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120641E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.586 | iteration 110400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.245E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.117405E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 56.81 | batch generator: 0.78 samples/sec: 6.592 | iteration 110500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.244E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.121624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 samples/sec: 6.585 | iteration 110600/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.242E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.125991E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.22 | 
backward: 1805.94 | backward-backward: 1805.92 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.79 samples/sec: 6.593 | iteration 110700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.241E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104675E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.75 samples/sec: 6.592 | iteration 110800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.240E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.108428E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.589 | iteration 110900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.239E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.130839E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.65 | backward-backward: 1805.62 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 samples/sec: 6.588 | iteration 111000/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.237E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.121263E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 111000 | lm_loss value: 3.063169E+00 | lm_loss_ppl value: 2.139526E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 111100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 2.236E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.124957E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.82 samples/sec: 6.588 | iteration 111200/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.235E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.132281E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.80 samples/sec: 6.596 | iteration 111300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.233E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112280E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.78 samples/sec: 6.586 | iteration 111400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.232E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.124782E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1806.56 | backward-backward: 1806.53 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.592 | iteration 111500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.231E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104122E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.80 samples/sec: 6.593 | iteration 111600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.230E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120751E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.593 | iteration 111700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.228E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112110E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1804.46 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.81 samples/sec: 6.594 | iteration 111800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.227E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.111371E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76 samples/sec: 6.592 | iteration 111900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.226E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.119091E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.75 samples/sec: 6.588 | iteration 112000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.224E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.108592E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 112000 | lm_loss value: 3.084214E+00 | lm_loss_ppl value: 2.185029E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 112100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 2.223E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.112469E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.84 samples/sec: 6.585 | iteration 112200/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.222E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.114779E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1806.48 | backward-backward: 1806.46 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.78 samples/sec: 6.595 | iteration 112300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.221E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.127374E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.84 samples/sec: 6.587 | iteration 112400/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.219E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.116086E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number 
of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1806.61 | backward-backward: 1806.59 | backward-allreduce: 0.00 | optimizer: 54.94 | batch generator: 0.81 samples/sec: 6.594 | iteration 112500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.218E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.116925E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.588 | iteration 112600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.217E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.094925E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1806.19 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.593 | iteration 112700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.215E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096577E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76 samples/sec: 6.594 | iteration 112800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.214E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.110537E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 samples/sec: 6.590 | iteration 112900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.213E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.104515E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 samples/sec: 6.596 | iteration 113000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.212E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.105647E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 113000 | lm_loss value: 3.116553E+00 | lm_loss_ppl value: 2.256846E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 113100/ 320000 | elapsed time per iteration (ms): 2485.9 | learning rate: 2.210E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.107832E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1806.23 | backward-backward: 1806.21 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.85 samples/sec: 6.595 | iteration 113200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.209E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122780E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 56.57 | batch generator: 0.76 samples/sec: 6.590 | iteration 113300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.208E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.124217E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.75 samples/sec: 6.591 | iteration 113400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.206E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.113313E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 samples/sec: 6.598 | iteration 113500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 2.205E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120231E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.83 samples/sec: 6.588 | iteration 113600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.204E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.111140E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.79 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.74 samples/sec: 6.594 | iteration 113700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.202E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.106973E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.591 | iteration 113800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.201E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.096728E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 samples/sec: 6.595 | iteration 113900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.200E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.109965E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.78 samples/sec: 6.592 | iteration 114000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.199E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102942E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 114000 | lm_loss value: 3.069239E+00 | lm_loss_ppl value: 2.152552E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 114100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 2.197E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.098866E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.85 samples/sec: 6.592 | iteration 114200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.196E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.106554E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.584 | iteration 114300/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 2.195E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.123420E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1805.73 | backward-backward: 1805.71 | backward-allreduce: 0.00 | optimizer: 57.00 | batch generator: 0.83 samples/sec: 6.596 | iteration 114400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.193E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.105179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.585 | iteration 114500/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 2.192E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.105471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.41 | backward: 1805.95 | backward-backward: 1805.92 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.79 samples/sec: 6.593 | iteration 114600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.191E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.122457E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 samples/sec: 6.591 | iteration 114700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.189E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
3.107464E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.85 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.86 samples/sec: 6.595 | iteration 114800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.188E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.112031E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 samples/sec: 6.592 | iteration 114900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.187E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102815E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 samples/sec: 6.594 | iteration 115000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.186E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096528E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.96 ----------------------------------------------------------------------------------------------------------- validation results at iteration 115000 | lm_loss value: 3.119918E+00 | lm_loss_ppl value: 2.264453E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 115100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 2.184E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.117982E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.91 samples/sec: 6.588 | iteration 115200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.183E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.103008E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 samples/sec: 6.595 | iteration 115300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.182E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.101171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.83 samples/sec: 6.589 | iteration 115400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.180E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.091690E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.80 samples/sec: 6.592 | iteration 115500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.179E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104701E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.76 samples/sec: 6.595 | iteration 115600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.178E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.116031E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.82 samples/sec: 6.588 | iteration 115700/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.176E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.095998E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1806.07 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.596 | iteration 115800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.175E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.120874E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.586 | iteration 115900/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 2.174E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.114410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.76 samples/sec: 6.595 | iteration 116000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.172E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.091146E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 
----------------------------------------------------------------------------------------------------------- validation results at iteration 116000 | lm_loss value: 3.095906E+00 | lm_loss_ppl value: 2.210727E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 116100/ 320000 | elapsed time per iteration (ms): 2486.5 | learning rate: 2.171E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.093949E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1806.51 | backward-backward: 1806.48 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.90 samples/sec: 6.595 | iteration 116200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.170E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.085372E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.588 | iteration 116300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.168E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.089325E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.99 | backward-backward: 1805.96 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 samples/sec: 6.592 | iteration 116400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 2.167E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.092198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.77 samples/sec: 6.591 | iteration 116500/ 
320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 2.166E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.082921E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.89 samples/sec: 6.588 | iteration 116600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.164E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.101029E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1806.14 | backward-backward: 1806.12 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.75 samples/sec: 6.597 | iteration 116700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.163E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102455E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 54.54 | batch generator: 0.76 samples/sec: 6.590 | iteration 116800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.162E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.090819E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.83 samples/sec: 6.594 | iteration 116900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.161E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097575E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.68 | batch 
generator: 0.75 samples/sec: 6.594 | iteration 117000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.159E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096306E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 117000 | lm_loss value: 3.064446E+00 | lm_loss_ppl value: 2.142259E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 117100/ 320000 | elapsed time per iteration (ms): 2485.6 | learning rate: 2.158E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.103384E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.90 samples/sec: 6.593 | iteration 117200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.157E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.079222E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.77 samples/sec: 6.587 | iteration 117300/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.155E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.118185E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.76 samples/sec: 6.594 | 
iteration 117400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 2.154E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.098085E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.84 samples/sec: 6.584 | iteration 117500/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 2.153E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.110058E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1806.54 | backward-backward: 1806.52 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.83 samples/sec: 6.595 | iteration 117600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.151E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097877E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.76 samples/sec: 6.588 | iteration 117700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.150E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.112383E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.82 samples/sec: 6.593 | iteration 117800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.149E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.092476E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.71 
| batch generator: 0.77 samples/sec: 6.590 | iteration 117900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.147E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.099235E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.81 samples/sec: 6.590 | iteration 118000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.146E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.086079E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 118000 | lm_loss value: 3.034067E+00 | lm_loss_ppl value: 2.078158E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 118100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 2.145E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.093061E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.87 samples/sec: 6.587 | iteration 118200/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.143E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.088177E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.76 samples/sec: 6.596 
| iteration 118300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.142E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.090476E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 samples/sec: 6.587 | iteration 118400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 2.141E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.102928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 samples/sec: 6.597 | iteration 118500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.139E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.098099E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.77 samples/sec: 6.586 | iteration 118600/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 2.138E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.103239E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1806.18 | backward-backward: 1806.15 | backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.80 samples/sec: 6.592 | iteration 118700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.137E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.087921E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 
56.39 | batch generator: 0.85 samples/sec: 6.592 | iteration 118800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.135E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.104571E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 samples/sec: 6.593 | iteration 118900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.134E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.086606E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 samples/sec: 6.594 | iteration 119000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.133E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.116094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 119000 | lm_loss value: 3.056569E+00 | lm_loss_ppl value: 2.125450E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 119100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 2.131E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.071130E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.87 samples/sec: 
6.593 | iteration 119200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.130E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.093381E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.93 samples/sec: 6.586 | iteration 119300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.129E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.095957E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.84 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.82 samples/sec: 6.597 | iteration 119400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.127E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.102194E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.588 | iteration 119500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.126E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.092291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 samples/sec: 6.595 | iteration 119600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.125E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067271E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | 
optimizer: 55.49 | batch generator: 0.77 samples/sec: 6.585 | iteration 119700/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 2.123E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.099124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1806.62 | backward-backward: 1806.60 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.78 samples/sec: 6.594 | iteration 119800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.122E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.100370E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.591 | iteration 119900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.121E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.109012E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.588 | iteration 120000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.119E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.072692E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.86 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step20000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 120000 | lm_loss value: 3.022144E+00 | lm_loss_ppl value: 2.053528E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.221 | iteration 120100/ 320000 | elapsed time per iteration (ms): 2572.0 | learning rate: 2.118E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 3.077621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.88 samples/sec: 6.591 | iteration 120200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.117E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.088855E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.76 samples/sec: 6.596 | iteration 120300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.115E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.090181E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.63 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80 samples/sec: 6.589 | iteration 120400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.114E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.093166E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.78 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76 samples/sec: 6.590 | iteration 120500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.113E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.089457E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.19 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.588 | iteration 120600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.111E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.083755E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.84 samples/sec: 6.590 | iteration 120700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.110E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.089332E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.06 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.83 samples/sec: 6.593 | iteration 120800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.109E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.084532E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.80 samples/sec: 6.586 | iteration 120900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.107E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081898E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.70 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.74 samples/sec: 6.595 | iteration 121000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.106E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.087039E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 121000 | lm_loss value: 3.022154E+00 | lm_loss_ppl value: 2.053548E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 121100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 2.105E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.102296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.87 samples/sec: 6.595 | iteration 121200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.103E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.079854E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.46 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.591 | iteration 121300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.102E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.086245E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.78 samples/sec: 6.589 | iteration 121400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.100E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084354E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.75 samples/sec: 6.598 | iteration 121500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 2.099E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.088575E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1802.98 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.82 samples/sec: 6.587 | iteration 121600/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 2.098E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.101573E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 samples/sec: 6.594 | iteration 121700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.096E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.074033E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.76 samples/sec: 6.588 | iteration 121800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.095E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073558E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 samples/sec: 6.590 | iteration 121900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.094E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.092599E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.81 samples/sec: 6.591 | iteration 122000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.092E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.093189E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 122000 | lm_loss value: 3.070959E+00 | lm_loss_ppl value: 2.156258E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 122100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 2.091E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.086191E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.89 samples/sec: 6.596 | iteration 122200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.090E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.100107E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.88 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.585 | iteration 122300/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 2.088E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.082443E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1806.32 | backward-backward: 1806.30 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77 samples/sec: 6.594 | iteration 122400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.087E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.084052E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 samples/sec: 6.593 | iteration 122500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 2.086E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.087305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.74 samples/sec: 6.589 | iteration 122600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.084E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.086535E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.75 samples/sec: 6.596 | iteration 122700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 2.083E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.077744E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.89 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.75 samples/sec: 6.586 | iteration 122800/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.082E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.091965E+00 | loss 
scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1806.30 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.595 | iteration 122900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 2.080E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.110667E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 54.70 | batch generator: 0.77 samples/sec: 6.591 | iteration 123000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.079E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.078969E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 123000 | lm_loss value: 3.119326E+00 | lm_loss_ppl value: 2.263112E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 123100/ 320000 | elapsed time per iteration (ms): 2486.0 | learning rate: 2.077E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.071359E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1805.64 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.85 samples/sec: 6.597 | iteration 123200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 2.076E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.081655E+00 | loss scale: 32768.0 | number of skipped 
iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.79 samples/sec: 6.588 | iteration 123300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 2.075E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084671E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1805.68 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.83 samples/sec: 6.597 | iteration 123400/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.073E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097367E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 samples/sec: 6.593 | iteration 123500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.072E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.085467E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.590 | iteration 123600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.071E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.091798E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 samples/sec: 6.597 | iteration 123700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.069E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.095840E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.80 samples/sec: 6.590 | iteration 123800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 2.068E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.083546E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.75 samples/sec: 6.593 | iteration 123900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.067E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.101005E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.81 samples/sec: 6.593 | iteration 124000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 2.065E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.113238E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.62 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 124000 | lm_loss value: 3.071016E+00 | lm_loss_ppl value: 2.156380E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 124100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 2.064E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.077357E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.86 samples/sec: 6.597 | iteration 124200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 2.063E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.074804E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 samples/sec: 6.587 | iteration 124300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 2.061E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.087747E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1806.40 | backward-backward: 1806.38 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 samples/sec: 6.592 | iteration 124400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 2.060E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096384E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.595 | iteration 124500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 2.058E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.097603E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.81 samples/sec: 6.586 | iteration 124600/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 2.057E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 3.077510E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1806.38 | backward-backward: 1806.36 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.594 | iteration 124700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 2.056E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.079122E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 samples/sec: 6.591 | iteration 124800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 2.054E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.079206E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.78 samples/sec: 6.588 | iteration 124900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.053E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.098342E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.595 | iteration 125000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 2.052E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.081217E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.81 
----------------------------------------------------------------------------------------------------------- validation results at iteration 125000 | lm_loss value: 3.078744E+00 | lm_loss_ppl value: 2.173109E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.432 | iteration 125100/ 320000 | elapsed time per iteration (ms): 2487.4 | learning rate: 2.050E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.083378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1806.60 | backward-backward: 1806.57 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.85 samples/sec: 6.598 | iteration 125200/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 2.049E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.077004E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 samples/sec: 6.591 | iteration 125300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.047E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.064909E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.86 samples/sec: 6.591 | iteration 125400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.046E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084066E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.87 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.82 samples/sec: 6.597 | iteration 125500/ 
320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 2.045E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.083432E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.76 samples/sec: 6.591 | iteration 125600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 2.043E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.069535E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.57 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.77 samples/sec: 6.590 | iteration 125700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.042E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.075305E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.82 samples/sec: 6.600 | iteration 125800/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 2.041E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056347E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1802.39 | backward-backward: 1802.37 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.81 samples/sec: 6.590 | iteration 125900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 2.039E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.067981E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 56.01 | batch 
generator: 0.84 samples/sec: 6.589 | iteration 126000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.038E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.063672E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 126000 | lm_loss value: 3.050312E+00 | lm_loss_ppl value: 2.112193E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 126100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 2.037E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.085309E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.84 samples/sec: 6.593 | iteration 126200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.035E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.076188E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.86 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.77 samples/sec: 6.589 | iteration 126300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.034E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.080347E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 samples/sec: 6.589 | 
iteration 126400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.032E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.066930E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76 samples/sec: 6.598 | iteration 126500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.031E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.087142E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.79 samples/sec: 6.594 | iteration 126600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 2.030E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.062640E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.588 | iteration 126700/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 2.028E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1806.04 | backward-backward: 1806.02 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.589 | iteration 126800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 2.027E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.066232E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.48 
| batch generator: 0.76 samples/sec: 6.597 | iteration 126900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 2.025E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.078873E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 samples/sec: 6.593 | iteration 127000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 2.024E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.076885E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 127000 | lm_loss value: 3.084415E+00 | lm_loss_ppl value: 2.185467E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 127100/ 320000 | elapsed time per iteration (ms): 2485.8 | learning rate: 2.023E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.065677E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.99 samples/sec: 6.590 | iteration 127200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 2.021E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.088005E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.75 samples/sec: 6.596 
| iteration 127300/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 2.020E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.077164E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.77 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.591 | iteration 127400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 2.019E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.065232E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.589 | iteration 127500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 2.017E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.065774E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.77 samples/sec: 6.593 | iteration 127600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 2.016E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.052685E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 samples/sec: 6.596 | iteration 127700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 2.014E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.080493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.78 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 
55.48 | batch generator: 0.78 samples/sec: 6.588 | iteration 127800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 2.013E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074638E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1805.96 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.587 | iteration 127900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.012E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.055861E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 samples/sec: 6.592 | iteration 128000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 2.010E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.096235E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.91 ----------------------------------------------------------------------------------------------------------- validation results at iteration 128000 | lm_loss value: 3.042003E+00 | lm_loss_ppl value: 2.094717E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 128100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 2.009E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.092009E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.85 samples/sec: 
6.587 | iteration 128200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 2.008E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.060804E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1806.17 | backward-backward: 1806.15 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.81 samples/sec: 6.590 | iteration 128300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 2.006E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.074201E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.75 samples/sec: 6.595 | iteration 128400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 2.005E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.071242E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.15 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.78 samples/sec: 6.594 | iteration 128500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 2.003E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.075129E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.76 samples/sec: 6.588 | iteration 128600/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 2.002E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.087572E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1806.03 | backward-backward: 1806.00 | backward-allreduce: 0.00 | 
optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.589 | iteration 128700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 2.001E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.061379E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.34 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.88 samples/sec: 6.595 | iteration 128800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.999E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.073888E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.76 samples/sec: 6.594 | iteration 128900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.998E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065279E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.82 samples/sec: 6.589 | iteration 129000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.996E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073433E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 129000 | lm_loss value: 3.043628E+00 | lm_loss_ppl value: 2.098121E+01 | ----------------------------------------------------------------------------------------------------------- 
samples/sec: 6.439 | iteration 129100/ 320000 | elapsed time per iteration (ms): 2484.9 | learning rate: 1.995E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.078174E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.84 samples/sec: 6.598 | iteration 129200/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.994E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.067695E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.82 samples/sec: 6.595 | iteration 129300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.992E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.066571E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.82 samples/sec: 6.590 | iteration 129400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.991E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.084184E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 samples/sec: 6.586 | iteration 129500/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 1.990E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.059732E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.27 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 
0.00 | optimizer: 56.40 | batch generator: 0.78 samples/sec: 6.595 | iteration 129600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.988E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065894E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.596 | iteration 129700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.987E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.068958E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 samples/sec: 6.590 | iteration 129800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.985E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.064581E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 samples/sec: 6.590 | iteration 129900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.984E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.079006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 samples/sec: 6.593 | iteration 130000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.983E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.090594E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.69 | 
backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step30000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 130000 | lm_loss value: 3.086703E+00 | lm_loss_ppl value: 2.190474E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.226 | iteration 130100/ 320000 | elapsed time per iteration (ms): 2569.9 | learning rate: 1.981E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.065611E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1802.51 | backward-backward: 1802.49 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.85 samples/sec: 6.592 | iteration 130200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.980E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.046880E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.77 samples/sec: 6.590 | iteration 130300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.978E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.078893E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.591 | iteration 130400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.977E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.072608E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
567.04 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 samples/sec: 6.595 | iteration 130500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.976E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.061287E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 samples/sec: 6.590 | iteration 130600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.974E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.045481E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.77 samples/sec: 6.582 | iteration 130700/ 320000 | elapsed time per iteration (ms): 2430.9 | learning rate: 1.973E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.080473E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1807.39 | backward-backward: 1807.37 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79 samples/sec: 6.585 | iteration 130800/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 1.971E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.066716E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.80 samples/sec: 6.590 | iteration 130900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.970E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.062238E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 samples/sec: 6.587 | iteration 131000/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.969E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.053495E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.88 | backward-backward: 1805.85 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 131000 | lm_loss value: 3.112295E+00 | lm_loss_ppl value: 2.247256E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 131100/ 320000 | elapsed time per iteration (ms): 2486.7 | learning rate: 1.967E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.061818E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1806.58 | backward-backward: 1806.56 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.86 samples/sec: 6.583 | iteration 131200/ 320000 | elapsed time per iteration (ms): 2430.5 | learning rate: 1.966E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.068578E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.77 | backward: 1806.31 | backward-backward: 1806.28 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.80 samples/sec: 6.590 | iteration 131300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.964E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.058050E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | 
forward: 567.08 | backward: 1805.47 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.96 samples/sec: 6.589 | iteration 131400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.963E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.071386E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.584 | iteration 131500/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 1.962E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.049335E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1806.61 | backward-backward: 1806.58 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.83 samples/sec: 6.585 | iteration 131600/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.960E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.072898E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1806.28 | backward-backward: 1806.25 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 samples/sec: 6.586 | iteration 131700/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 1.959E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.055979E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.45 | batch generator: 0.81 samples/sec: 6.586 | iteration 131800/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.957E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.067788E+00 | loss scale: 32768.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1805.44 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.82 samples/sec: 6.585 | iteration 131900/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 1.956E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.050641E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1806.04 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 samples/sec: 6.587 | iteration 132000/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.955E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.070415E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1805.77 | backward-backward: 1805.74 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 132000 | lm_loss value: 3.057759E+00 | lm_loss_ppl value: 2.127981E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 132100/ 320000 | elapsed time per iteration (ms): 2486.2 | learning rate: 1.953E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.069102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.88 samples/sec: 6.587 | iteration 132200/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.952E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.061030E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 567.34 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 samples/sec: 6.586 | iteration 132300/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 1.950E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.068590E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.31 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.80 samples/sec: 6.585 | iteration 132400/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 1.949E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.051036E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1806.01 | backward-backward: 1805.98 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.80 samples/sec: 6.587 | iteration 132500/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.948E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.081978E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1805.95 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 samples/sec: 6.587 | iteration 132600/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.946E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.050364E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.23 | backward: 1805.69 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 samples/sec: 6.584 | iteration 132700/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 1.945E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.059780E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1806.18 | backward-backward: 1806.15 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.87 samples/sec: 6.584 | iteration 132800/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 1.943E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.071407E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1806.37 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.80 samples/sec: 6.585 | iteration 132900/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 1.942E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.058595E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1805.97 | backward-backward: 1805.93 | backward-allreduce: 0.00 | optimizer: 56.26 | batch generator: 0.84 samples/sec: 6.586 | iteration 133000/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.941E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073339E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 133000 | lm_loss value: 3.026960E+00 | lm_loss_ppl value: 2.063441E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 133100/ 320000 | elapsed time per iteration (ms): 2486.5 | learning rate: 1.939E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.053462E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | time (ms) | forward: 567.12 | backward: 1806.00 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.92 samples/sec: 6.592 | iteration 133200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.938E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.059422E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.78 samples/sec: 6.592 | iteration 133300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.936E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.065145E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.39 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 samples/sec: 6.589 | iteration 133400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.935E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.062161E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.81 samples/sec: 6.584 | iteration 133500/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 1.934E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.059736E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1806.66 | backward-backward: 1806.63 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.81 samples/sec: 6.585 | iteration 133600/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.932E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.046633E+00 | loss scale: 65536.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1806.22 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.82 samples/sec: 6.584 | iteration 133700/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 1.931E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.073206E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.46 | backward: 1806.44 | backward-backward: 1806.42 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81 samples/sec: 6.587 | iteration 133800/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.929E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.043623E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.32 | backward: 1806.71 | backward-backward: 1806.68 | backward-allreduce: 0.00 | optimizer: 54.70 | batch generator: 0.80 samples/sec: 6.586 | iteration 133900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 1.928E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.053161E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 56.82 | batch generator: 0.82 samples/sec: 6.580 | iteration 134000/ 320000 | elapsed time per iteration (ms): 2431.4 | learning rate: 1.926E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.041233E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1807.60 | backward-backward: 1807.58 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.81 ----------------------------------------------------------------------------------------------------------- validation results at iteration 134000 | lm_loss 
value: 3.074132E+00 | lm_loss_ppl value: 2.163109E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 134100/ 320000 | elapsed time per iteration (ms): 2486.8 | learning rate: 1.925E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.044921E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.48 | backward: 1806.09 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.88 samples/sec: 6.584 | iteration 134200/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 1.924E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.065352E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.33 | backward: 1806.37 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.77 samples/sec: 6.585 | iteration 134300/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.922E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.065573E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.54 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 samples/sec: 6.592 | iteration 134400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.921E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055439E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 samples/sec: 6.592 | iteration 134500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.919E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051775E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.77 samples/sec: 6.593 | iteration 134600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.918E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.059305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 samples/sec: 6.593 | iteration 134700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.917E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044677E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.595 | iteration 134800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.915E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055835E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 samples/sec: 6.596 | iteration 134900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.914E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058415E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.77 samples/sec: 6.595 | iteration 135000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.912E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.051456E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 135000 | lm_loss value: 3.070428E+00 | lm_loss_ppl value: 2.155112E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 135100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 1.911E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.056295E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.06 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.86 samples/sec: 6.595 | iteration 135200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.910E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.060187E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.88 samples/sec: 6.596 | iteration 135300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.908E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055689E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.596 | iteration 135400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.907E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047019E+00 | loss scale: 
16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 samples/sec: 6.594 | iteration 135500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.905E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051680E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.87 samples/sec: 6.595 | iteration 135600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.904E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.049390E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.596 | iteration 135700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.902E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.059099E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.595 | iteration 135800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.901E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051161E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.595 | iteration 135900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.900E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 3.049094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.591 | iteration 136000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.898E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.067031E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 136000 | lm_loss value: 3.050942E+00 | lm_loss_ppl value: 2.113524E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 136100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 1.897E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.042839E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.82 samples/sec: 6.593 | iteration 136200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.895E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058321E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.85 samples/sec: 6.594 | iteration 136300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.894E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056953E+00 | loss 
scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.80 samples/sec: 6.593 | iteration 136400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.893E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.062739E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.594 | iteration 136500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.891E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.046387E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 samples/sec: 6.593 | iteration 136600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.890E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.058197E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.594 | iteration 136700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.888E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.049178E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1803.91 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.594 | iteration 136800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.887E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 3.076348E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.74 samples/sec: 6.594 | iteration 136900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.885E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031886E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 samples/sec: 6.595 | iteration 137000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.884E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054580E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 137000 | lm_loss value: 3.052997E+00 | lm_loss_ppl value: 2.117872E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 137100/ 320000 | elapsed time per iteration (ms): 2486.1 | learning rate: 1.883E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.048406E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 57.02 | batch generator: 0.92 samples/sec: 6.591 | iteration 137200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.881E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.056695E+00 | 
loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.92 samples/sec: 6.594 | iteration 137300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.880E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.078279E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.590 | iteration 137400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.878E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.033837E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.88 samples/sec: 6.593 | iteration 137500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.877E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044055E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.81 samples/sec: 6.591 | iteration 137600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.875E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.061606E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.592 | iteration 137700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.874E-04 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 3.036219E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.86 samples/sec: 6.593 | iteration 137800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.873E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.059850E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 samples/sec: 6.594 | iteration 137900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.871E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.044859E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.83 samples/sec: 6.593 | iteration 138000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.870E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043589E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 138000 | lm_loss value: 3.091660E+00 | lm_loss_ppl value: 2.201359E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 138100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.868E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
3.061575E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.97 samples/sec: 6.591 | iteration 138200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.867E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.036143E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.79 samples/sec: 6.593 | iteration 138300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.865E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055357E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.75 samples/sec: 6.594 | iteration 138400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.864E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.045198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 samples/sec: 6.594 | iteration 138500/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.863E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037054E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.77 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.593 | iteration 138600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.861E-04 
| approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041249E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.75 samples/sec: 6.594 | iteration 138700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.860E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.046075E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 samples/sec: 6.593 | iteration 138800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.858E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.027180E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.593 | iteration 138900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.857E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043286E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 samples/sec: 6.591 | iteration 139000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.855E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.032041E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.78 
----------------------------------------------------------------------------------------------------------- validation results at iteration 139000 | lm_loss value: 3.007798E+00 | lm_loss_ppl value: 2.024278E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 139100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 1.854E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.030677E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.85 samples/sec: 6.591 | iteration 139200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.853E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.038467E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.79 samples/sec: 6.589 | iteration 139300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.851E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.033901E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77 samples/sec: 6.593 | iteration 139400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.850E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.052567E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 samples/sec: 6.593 | iteration 139500/ 
320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.848E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055073E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.89 samples/sec: 6.594 | iteration 139600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.847E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039230E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 samples/sec: 6.593 | iteration 139700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.845E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.042206E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.595 | iteration 139800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.844E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.053268E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.89 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.76 samples/sec: 6.591 | iteration 139900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.843E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.056171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.85 | batch 
generator: 0.79 samples/sec: 6.593 | iteration 140000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.841E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047426E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step40000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 140000 | lm_loss value: 3.028448E+00 | lm_loss_ppl value: 2.066514E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.222 | iteration 140100/ 320000 | elapsed time per iteration (ms): 2571.3 | learning rate: 1.840E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 3.031694E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.84 samples/sec: 6.594 | iteration 140200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.838E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.043356E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.79 samples/sec: 6.590 | iteration 140300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.837E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.039525E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 
| optimizer: 55.93 | batch generator: 0.76 samples/sec: 6.591 | iteration 140400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.835E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.037931E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.79 samples/sec: 6.592 | iteration 140500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.834E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047817E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.593 | iteration 140600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.833E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.057196E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.592 | iteration 140700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.831E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039465E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 samples/sec: 6.593 | iteration 140800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.830E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.056598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.34 | 
backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.83 samples/sec: 6.593 | iteration 140900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.828E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.045098E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.81 samples/sec: 6.594 | iteration 141000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.827E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021065E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 141000 | lm_loss value: 3.051776E+00 | lm_loss_ppl value: 2.115287E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 141100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 1.825E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.049136E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.84 samples/sec: 6.592 | iteration 141200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.824E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031889E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.34 | backward-backward: 1804.32 | 
backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.595 | iteration 141300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.823E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037527E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 54.64 | batch generator: 0.77 samples/sec: 6.591 | iteration 141400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.821E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.052532E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 samples/sec: 6.592 | iteration 141500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.820E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.054896E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.80 samples/sec: 6.592 | iteration 141600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.818E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.055653E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 samples/sec: 6.596 | iteration 141700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.817E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.052300E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.49 | 
backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.81 samples/sec: 6.594 | iteration 141800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.815E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.047600E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.593 | iteration 141900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.814E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013339E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 samples/sec: 6.593 | iteration 142000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.812E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041589E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 ----------------------------------------------------------------------------------------------------------- validation results at iteration 142000 | lm_loss value: 3.009537E+00 | lm_loss_ppl value: 2.027801E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 142100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 1.811E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.033456E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.68 | backward-backward: 
1803.65 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.88 samples/sec: 6.595 | iteration 142200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.810E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.049727E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 samples/sec: 6.593 | iteration 142300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.808E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.040757E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.75 samples/sec: 6.594 | iteration 142400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.807E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035513E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 samples/sec: 6.593 | iteration 142500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.805E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026842E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.76 samples/sec: 6.592 | iteration 142600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.804E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.045529E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.87 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.80 samples/sec: 6.592 | iteration 142700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.802E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.051479E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.76 samples/sec: 6.594 | iteration 142800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.801E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037427E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.80 samples/sec: 6.592 | iteration 142900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.800E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016642E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.81 samples/sec: 6.596 | iteration 143000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.798E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.034414E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.73 ----------------------------------------------------------------------------------------------------------- validation results at iteration 143000 | lm_loss value: 3.028264E+00 | lm_loss_ppl value: 2.066133E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 143100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.797E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.023559E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.34 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.83 samples/sec: 6.593 | iteration 143200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.795E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.040875E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.75 samples/sec: 6.593 | iteration 143300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.794E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024303E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.81 samples/sec: 6.595 | iteration 143400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.792E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035533E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 samples/sec: 6.594 | iteration 143500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.791E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021259E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.592 | iteration 143600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.789E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.057126E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.592 | iteration 143700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.788E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024351E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.57 | batch generator: 0.83 samples/sec: 6.594 | iteration 143800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.787E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024432E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.595 | iteration 143900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.785E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.593 | iteration 144000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.784E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.046983E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 144000 | lm_loss value: 3.106998E+00 | lm_loss_ppl value: 2.235384E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 144100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 1.782E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.053500E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.86 samples/sec: 6.597 | iteration 144200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.781E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.033965E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 54.50 | batch generator: 0.77 samples/sec: 6.594 | iteration 144300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.779E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.048312E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.595 | iteration 144400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.778E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021844E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 samples/sec: 6.594 | iteration 144500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.776E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.039352E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.594 | iteration 144600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.775E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.041687E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.593 | iteration 144700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.774E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.036742E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 samples/sec: 6.593 | iteration 144800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.772E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037513E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.78 samples/sec: 6.595 | iteration 144900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.771E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.025012E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 samples/sec: 6.594 | iteration 145000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.769E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013197E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 145000 | lm_loss value: 3.024555E+00 | lm_loss_ppl value: 2.058484E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 145100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 1.768E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.031324E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.88 samples/sec: 6.594 | iteration 145200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.766E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024501E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.85 samples/sec: 6.594 | iteration 145300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.765E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.053079E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1804.33 | backward-backward: 1804.30 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.595 | iteration 145400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.763E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018714E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.75 samples/sec: 6.593 | iteration 145500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.762E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.029797E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.85 samples/sec: 6.593 | iteration 145600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.761E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031650E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.75 samples/sec: 6.593 | iteration 145700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.759E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.023445E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.74 samples/sec: 6.592 | iteration 145800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.758E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008823E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.81 samples/sec: 6.593 | iteration 145900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.756E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 samples/sec: 6.594 | iteration 146000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.755E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.025652E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 146000 | lm_loss value: 3.027365E+00 | lm_loss_ppl value: 2.064277E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 146100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 1.753E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.013431E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.85 samples/sec: 6.594 | iteration 146200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.752E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.023485E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.595 | iteration 146300/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.750E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016788E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 samples/sec: 6.593 | iteration 146400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.749E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.037700E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 samples/sec: 6.593 | iteration 146500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.747E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.015233E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.78 samples/sec: 6.594 | iteration 146600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.746E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.024631E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.77 samples/sec: 6.597 | iteration 146700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.745E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
3.011344E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 54.79 | batch generator: 0.73 samples/sec: 6.593 | iteration 146800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.743E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.034887E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.591 | iteration 146900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.742E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.021133E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.82 samples/sec: 6.595 | iteration 147000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.740E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.038637E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 147000 | lm_loss value: 3.076801E+00 | lm_loss_ppl value: 2.168891E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 147100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.739E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.036346E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.85 samples/sec: 6.594 | iteration 147200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.737E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016544E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 samples/sec: 6.594 | iteration 147300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.736E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.032325E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 samples/sec: 6.593 | iteration 147400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.734E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.023206E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.592 | iteration 147500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.733E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.049706E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.87 samples/sec: 6.592 | iteration 147600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.732E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.029768E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 samples/sec: 6.592 | iteration 147700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.730E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.010648E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.593 | iteration 147800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.729E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76 samples/sec: 6.594 | iteration 147900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.727E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.031998E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.83 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.77 samples/sec: 6.587 | iteration 148000/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.726E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.013422E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.39 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.76 
----------------------------------------------------------------------------------------------------------- validation results at iteration 148000 | lm_loss value: 3.059664E+00 | lm_loss_ppl value: 2.132038E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 148100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.724E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.039422E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.89 samples/sec: 6.593 | iteration 148200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.723E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.021249E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80 samples/sec: 6.591 | iteration 148300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.721E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.026466E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.77 samples/sec: 6.592 | iteration 148400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.720E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.593 | iteration 148500/ 
320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.718E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.029101E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 samples/sec: 6.593 | iteration 148600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.717E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006409E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.593 | iteration 148700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.716E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013894E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.75 samples/sec: 6.593 | iteration 148800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.714E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018489E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.77 samples/sec: 6.596 | iteration 148900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.713E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.014476E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 54.47 | batch 
generator: 0.76 samples/sec: 6.591 | iteration 149000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.711E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.030229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.81 ----------------------------------------------------------------------------------------------------------- validation results at iteration 149000 | lm_loss value: 3.034751E+00 | lm_loss_ppl value: 2.079579E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 149100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 1.710E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.027383E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.85 samples/sec: 6.591 | iteration 149200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.708E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.006511E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.591 | iteration 149300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.707E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.028918E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.81 samples/sec: 6.592 | 
iteration 149400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.705E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.015161E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.82 samples/sec: 6.592 | iteration 149500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.704E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001823E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.84 samples/sec: 6.594 | iteration 149600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.702E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001932E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.39 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.84 samples/sec: 6.593 | iteration 149700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.701E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.010489E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.75 samples/sec: 6.594 | iteration 149800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.700E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013773E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.66 
| batch generator: 0.87 samples/sec: 6.595 | iteration 149900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.698E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.035664E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 samples/sec: 6.594 | iteration 150000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.697E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009889E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.81 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step50000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 150000 | lm_loss value: 2.974780E+00 | lm_loss_ppl value: 1.958532E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.220 | iteration 150100/ 320000 | elapsed time per iteration (ms): 2572.2 | learning rate: 1.695E-04 | approx flops per GPU: 38.6TFLOPS | lm_loss: 3.018773E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.88 samples/sec: 6.592 | iteration 150200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.694E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.007109E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.39 | backward-backward: 1804.36 | 
backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.80 samples/sec: 6.592 | iteration 150300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.692E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.014760E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.92 samples/sec: 6.593 | iteration 150400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.691E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006078E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.82 samples/sec: 6.593 | iteration 150500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.689E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.002942E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 samples/sec: 6.593 | iteration 150600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.688E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018105E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.593 | iteration 150700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.686E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.009311E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | 
backward: 1804.62 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 samples/sec: 6.592 | iteration 150800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.685E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.011689E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1804.92 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.594 | iteration 150900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.683E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995972E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 samples/sec: 6.592 | iteration 151000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.682E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.016360E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 151000 | lm_loss value: 3.022918E+00 | lm_loss_ppl value: 2.055116E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 151100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.681E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.023494E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.25 | backward-backward: 
1804.23 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.84 samples/sec: 6.590 | iteration 151200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.679E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.019942E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.81 samples/sec: 6.590 | iteration 151300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.678E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.990773E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.79 samples/sec: 6.593 | iteration 151400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.676E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.015884E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 samples/sec: 6.592 | iteration 151500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.675E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.004505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.83 samples/sec: 6.593 | iteration 151600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.673E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001127E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.51 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 samples/sec: 6.594 | iteration 151700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.672E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.020629E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.594 | iteration 151800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.670E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008045E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 samples/sec: 6.593 | iteration 151900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.669E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.018357E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 samples/sec: 6.593 | iteration 152000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.667E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.012874E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 152000 | lm_loss value: 2.954201E+00 | lm_loss_ppl value: 1.918640E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 152100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 1.666E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.015667E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.85 samples/sec: 6.593 | iteration 152200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.664E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.019675E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.593 | iteration 152300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.663E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.003974E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.591 | iteration 152400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.662E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.985063E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.79 samples/sec: 6.593 | iteration 152500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.660E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.591 | iteration 152600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.659E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.008100E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.81 samples/sec: 6.593 | iteration 152700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.657E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.013616E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.82 samples/sec: 6.592 | iteration 152800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.656E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.000176E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.75 samples/sec: 6.592 | iteration 152900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.654E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.011522E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1805.09 | backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.71 samples/sec: 6.593 | iteration 153000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.653E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.992163E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 153000 | lm_loss value: 3.031534E+00 | lm_loss_ppl value: 2.072901E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 153100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 1.651E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.009099E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 samples/sec: 6.593 | iteration 153200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.650E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.003361E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.83 samples/sec: 6.594 | iteration 153300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.648E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.022211E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 samples/sec: 6.592 | iteration 153400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.647E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.993040E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number 
of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.01 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.80 samples/sec: 6.592 | iteration 153500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.645E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.017674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.75 samples/sec: 6.593 | iteration 153600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.644E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997401E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.75 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.74 samples/sec: 6.595 | iteration 153700/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.643E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998173E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.76 samples/sec: 6.595 | iteration 153800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.641E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.002377E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 samples/sec: 6.594 | iteration 153900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.640E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.026907E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.593 | iteration 154000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.638E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007756E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 154000 | lm_loss value: 3.051081E+00 | lm_loss_ppl value: 2.113819E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 154100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.637E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.014007E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.88 samples/sec: 6.593 | iteration 154200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.635E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007681E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.88 samples/sec: 6.593 | iteration 154300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.634E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997709E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.19 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 samples/sec: 6.594 | iteration 154400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.632E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.004828E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.591 | iteration 154500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.631E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.998044E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.84 samples/sec: 6.591 | iteration 154600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.629E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.000352E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.76 samples/sec: 6.592 | iteration 154700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.628E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.999859E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.594 | iteration 154800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.626E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.000351E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.00 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.84 samples/sec: 6.592 | iteration 154900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.625E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.012040E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 samples/sec: 6.595 | iteration 155000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.623E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.991129E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 155000 | lm_loss value: 2.982948E+00 | lm_loss_ppl value: 1.974594E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 155100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.622E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.997955E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.30 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.81 samples/sec: 6.595 | iteration 155200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.621E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998732E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 samples/sec: 6.594 | iteration 155300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.619E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.002869E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.592 | iteration 155400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.618E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.987270E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 samples/sec: 6.593 | iteration 155500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.616E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.002252E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.591 | iteration 155600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.615E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.988682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.80 samples/sec: 6.593 | iteration 155700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.613E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.997135E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.82 samples/sec: 6.594 | iteration 155800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.612E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.988332E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 samples/sec: 6.594 | iteration 155900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.610E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995996E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.593 | iteration 156000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.609E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995316E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 156000 | lm_loss value: 2.959697E+00 | lm_loss_ppl value: 1.929212E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 156100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 1.607E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 3.014117E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.86 samples/sec: 6.592 | iteration 156200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.606E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997733E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.591 | iteration 156300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.604E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.999171E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 samples/sec: 6.593 | iteration 156400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.603E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998018E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 samples/sec: 6.592 | iteration 156500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.601E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.994243E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 samples/sec: 6.593 | iteration 156600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.600E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 3.005865E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.81 samples/sec: 6.587 | iteration 156700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 1.599E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.004431E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 56.74 | batch generator: 0.77 samples/sec: 6.594 | iteration 156800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.597E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.004960E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.593 | iteration 156900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.596E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.994632E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.594 | iteration 157000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.594E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.008017E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.79 
----------------------------------------------------------------------------------------------------------- validation results at iteration 157000 | lm_loss value: 2.940694E+00 | lm_loss_ppl value: 1.892897E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 157100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 1.593E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.996066E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.85 samples/sec: 6.595 | iteration 157200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.591E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981118E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.78 samples/sec: 6.592 | iteration 157300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.590E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.005805E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.78 samples/sec: 6.593 | iteration 157400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.588E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995114E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 samples/sec: 6.593 | iteration 157500/ 
320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.587E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.987516E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 samples/sec: 6.593 | iteration 157600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.585E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.001489E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.593 | iteration 157700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.584E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007001E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.12 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.80 samples/sec: 6.593 | iteration 157800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.582E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.004645E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 samples/sec: 6.595 | iteration 157900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.581E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.983952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.70 | batch 
generator: 0.79 samples/sec: 6.593 | iteration 158000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.579E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.983028E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 158000 | lm_loss value: 2.997899E+00 | lm_loss_ppl value: 2.004338E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 158100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 1.578E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.999596E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.84 samples/sec: 6.592 | iteration 158200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.577E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.007680E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.76 samples/sec: 6.592 | iteration 158300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.575E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.992825E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.56 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 samples/sec: 6.592 | 
iteration 158400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.574E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997531E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.592 | iteration 158500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.572E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.987425E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 samples/sec: 6.594 | iteration 158600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.571E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982928E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.76 samples/sec: 6.593 | iteration 158700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.569E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.973705E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.25 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 samples/sec: 6.593 | iteration 158800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.568E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.989608E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.78 
| batch generator: 0.77 samples/sec: 6.587 | iteration 158900/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.566E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.984009E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.79 | batch generator: 0.78 samples/sec: 6.597 | iteration 159000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.565E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.990480E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.74 ----------------------------------------------------------------------------------------------------------- validation results at iteration 159000 | lm_loss value: 3.024084E+00 | lm_loss_ppl value: 2.057515E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 159100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.563E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.980807E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.84 samples/sec: 6.588 | iteration 159200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.562E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.982689E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.81 samples/sec: 6.597 
| iteration 159300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.560E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006416E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1803.10 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.593 | iteration 159400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.559E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981994E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.82 samples/sec: 6.591 | iteration 159500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.557E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 3.001057E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.597 | iteration 159600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.556E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.974427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 samples/sec: 6.590 | iteration 159700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.554E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.984948E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.24 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 
55.51 | batch generator: 0.76 samples/sec: 6.587 | iteration 159800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.553E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.984574E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.64 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 samples/sec: 6.597 | iteration 159900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.552E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.998404E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.82 samples/sec: 6.585 | iteration 160000/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 1.550E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.982385E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1806.34 | backward-backward: 1806.32 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.78 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step60000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 160000 | lm_loss value: 2.976429E+00 | lm_loss_ppl value: 1.961763E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.227 | iteration 160100/ 320000 | elapsed time per iteration (ms): 2569.3 | learning rate: 1.549E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.998669E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.11 | backward-backward: 1803.09 | 
backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.82 samples/sec: 6.594 | iteration 160200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.547E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 3.006082E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.589 | iteration 160300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.546E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.993439E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.79 samples/sec: 6.599 | iteration 160400/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.544E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.990240E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1802.77 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.88 samples/sec: 6.592 | iteration 160500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.543E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.961081E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 samples/sec: 6.589 | iteration 160600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.541E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.991292E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | 
backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.600 | iteration 160700/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.540E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.986893E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1802.63 | backward-backward: 1802.60 | backward-allreduce: 0.00 | optimizer: 55.17 | batch generator: 0.78 samples/sec: 6.593 | iteration 160800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.538E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981256E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.590 | iteration 160900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.537E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.994413E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.593 | iteration 161000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.535E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981331E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 161000 | lm_loss value: 2.923371E+00 | lm_loss_ppl value: 1.860390E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 161100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 1.534E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.979581E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.85 samples/sec: 6.588 | iteration 161200/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.532E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.995244E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.90 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.597 | iteration 161300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.531E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.982285E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.80 samples/sec: 6.593 | iteration 161400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.529E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.977187E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.585 | iteration 161500/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 1.528E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.989604E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.03 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.81 samples/sec: 6.595 | iteration 161600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.527E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.985354E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 samples/sec: 6.595 | iteration 161700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.525E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.975111E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.76 samples/sec: 6.588 | iteration 161800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.524E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.987317E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1806.09 | backward-backward: 1806.06 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.78 samples/sec: 6.593 | iteration 161900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.522E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981028E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.75 samples/sec: 6.595 | iteration 162000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.521E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.995286E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 162000 | lm_loss value: 2.948351E+00 | lm_loss_ppl value: 1.907448E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 162100/ 320000 | elapsed time per iteration (ms): 2486.8 | learning rate: 1.519E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.968936E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1806.89 | backward-backward: 1806.87 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.83 samples/sec: 6.592 | iteration 162200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.518E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.999277E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.76 samples/sec: 6.597 | iteration 162300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.516E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.991430E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.84 samples/sec: 6.589 | iteration 162400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.515E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.974163E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.595 | iteration 162500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.513E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.958438E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.75 samples/sec: 6.597 | iteration 162600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.512E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997252E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.72 samples/sec: 6.588 | iteration 162700/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.510E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.992465E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 samples/sec: 6.592 | iteration 162800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.509E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.979658E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.81 samples/sec: 6.598 | iteration 162900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.507E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960946E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.75 samples/sec: 6.588 | iteration 163000/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.506E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.981319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.75 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 163000 | lm_loss value: 2.906969E+00 | lm_loss_ppl value: 1.830124E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 163100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 1.504E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.992463E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.82 samples/sec: 6.596 | iteration 163200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.503E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.997758E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 samples/sec: 6.588 | iteration 163300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.502E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.980252E+00 | loss scale: 32768.0 | number of skipped iterations: 2 
| number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1806.37 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 samples/sec: 6.590 | iteration 163400/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.500E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.966051E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 samples/sec: 6.597 | iteration 163500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.499E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.988117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 samples/sec: 6.590 | iteration 163600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.497E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.980496E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.09 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 samples/sec: 6.590 | iteration 163700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.496E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.975281E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.600 | iteration 163800/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.494E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.968513E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1802.66 | backward-backward: 1802.63 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.80 samples/sec: 6.593 | iteration 163900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.493E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.981061E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.590 | iteration 164000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.491E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.984370E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 164000 | lm_loss value: 3.008506E+00 | lm_loss_ppl value: 2.025710E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 164100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 1.490E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.953906E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.595 | iteration 164200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.488E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.975520E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 samples/sec: 6.588 | iteration 164300/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.487E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.971483E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.79 samples/sec: 6.594 | iteration 164400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.485E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.971605E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.597 | iteration 164500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.484E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969607E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.79 samples/sec: 6.589 | iteration 164600/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.482E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.993456E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.595 | iteration 164700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.481E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.983800E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 samples/sec: 6.599 | iteration 164800/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 1.479E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.980310E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1802.88 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 samples/sec: 6.589 | iteration 164900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.478E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.989977E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 samples/sec: 6.590 | iteration 165000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.477E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.970767E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 165000 | lm_loss value: 2.982138E+00 | lm_loss_ppl value: 1.972996E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 165100/ 320000 | elapsed time per iteration (ms): 2481.3 | learning rate: 1.475E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.963963E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1802.84 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.85 samples/sec: 6.591 | iteration 165200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.474E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.985519E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 samples/sec: 6.589 | iteration 165300/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.472E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.981418E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78 samples/sec: 6.596 | iteration 165400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.471E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972857E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.49 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.81 samples/sec: 6.591 | iteration 165500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.469E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.979313E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.70 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.78 samples/sec: 6.587 | iteration 165600/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.468E-04 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 2.968164E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.79 samples/sec: 6.596 | iteration 165700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.466E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976829E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.592 | iteration 165800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.465E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.946321E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 samples/sec: 6.588 | iteration 165900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.463E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.970059E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.78 | backward-backward: 1805.76 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.590 | iteration 166000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.462E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.979019E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 
----------------------------------------------------------------------------------------------------------- validation results at iteration 166000 | lm_loss value: 2.983183E+00 | lm_loss_ppl value: 1.975058E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 166100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 1.460E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.980103E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 1.00 samples/sec: 6.588 | iteration 166200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.459E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.981524E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.588 | iteration 166300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.457E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.973037E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 samples/sec: 6.590 | iteration 166400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.456E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.978907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.31 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.84 samples/sec: 6.597 | iteration 166500/ 
320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.454E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.973003E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.82 samples/sec: 6.591 | iteration 166600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.453E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.968264E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.589 | iteration 166700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.452E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.968095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.29 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.84 samples/sec: 6.592 | iteration 166800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.450E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.983925E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.79 samples/sec: 6.593 | iteration 166900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.449E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.969024E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.52 | batch 
generator: 0.75 samples/sec: 6.593 | iteration 167000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.447E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.988711E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 167000 | lm_loss value: 2.951571E+00 | lm_loss_ppl value: 1.913598E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 167100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 1.446E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.953082E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 1.05 samples/sec: 6.594 | iteration 167200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.444E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.967480E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.595 | iteration 167300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.443E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.86 samples/sec: 6.595 | 
iteration 167400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.441E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972765E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 samples/sec: 6.597 | iteration 167500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.440E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.949778E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.26 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.80 samples/sec: 6.590 | iteration 167600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.438E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.961230E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.77 samples/sec: 6.595 | iteration 167700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.437E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976617E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.594 | iteration 167800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.435E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.964568E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.68 
| batch generator: 0.79 samples/sec: 6.594 | iteration 167900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.434E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.971714E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 samples/sec: 6.594 | iteration 168000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.432E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.976982E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.85 ----------------------------------------------------------------------------------------------------------- validation results at iteration 168000 | lm_loss value: 2.930950E+00 | lm_loss_ppl value: 1.874543E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 168100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.431E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.976493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.88 samples/sec: 6.593 | iteration 168200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.429E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.965325E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.77 samples/sec: 6.593 
| iteration 168300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.428E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.978864E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.594 | iteration 168400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.427E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.946640E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 samples/sec: 6.595 | iteration 168500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.425E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972582E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 54.96 | batch generator: 0.77 samples/sec: 6.594 | iteration 168600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.424E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.975205E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 samples/sec: 6.590 | iteration 168700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.422E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.954539E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 
56.01 | batch generator: 0.78 samples/sec: 6.594 | iteration 168800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.421E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.970365E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.593 | iteration 168900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.419E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.954410E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.76 samples/sec: 6.594 | iteration 169000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.418E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.972017E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 169000 | lm_loss value: 2.936843E+00 | lm_loss_ppl value: 1.885623E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 169100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.416E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.955041E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.87 samples/sec: 
6.594 | iteration 169200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.415E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947872E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.593 | iteration 169300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.413E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.934462E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 samples/sec: 6.593 | iteration 169400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.412E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.978593E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.03 | batch generator: 0.78 samples/sec: 6.593 | iteration 169500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.410E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.973011E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.76 samples/sec: 6.594 | iteration 169600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.409E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.962646E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | 
optimizer: 55.77 | batch generator: 0.77 samples/sec: 6.595 | iteration 169700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.407E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953558E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78 samples/sec: 6.592 | iteration 169800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.406E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960134E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.78 samples/sec: 6.594 | iteration 169900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.405E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953017E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.594 | iteration 170000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.403E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959350E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step70000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 170000 | lm_loss value: 2.951192E+00 | lm_loss_ppl value: 1.912875E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.222 | iteration 170100/ 320000 | elapsed time per iteration (ms): 2571.6 | learning rate: 1.402E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.953705E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84 samples/sec: 6.595 | iteration 170200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.400E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941761E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.76 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.596 | iteration 170300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.399E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.934691E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.596 | iteration 170400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.397E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.956355E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.63 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 samples/sec: 6.593 | iteration 170500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.396E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.963199E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 samples/sec: 6.594 | iteration 170600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.394E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.956244E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.593 | iteration 170700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.393E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960375E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.591 | iteration 170800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.391E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.956396E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1805.08 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 samples/sec: 6.592 | iteration 170900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.390E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.960852E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.33 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.75 samples/sec: 6.593 | iteration 171000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.388E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.956722E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 171000 | lm_loss value: 2.994478E+00 | lm_loss_ppl value: 1.997492E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 171100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.387E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.946405E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.88 samples/sec: 6.593 | iteration 171200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.385E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959759E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.83 samples/sec: 6.593 | iteration 171300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.384E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.971077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.595 | iteration 171400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.383E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.955389E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number 
of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 54.67 | batch generator: 0.75 samples/sec: 6.592 | iteration 171500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.381E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.948012E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 samples/sec: 6.594 | iteration 171600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.380E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.946745E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 samples/sec: 6.593 | iteration 171700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.378E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.954058E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.91 samples/sec: 6.592 | iteration 171800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.377E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.961812E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 samples/sec: 6.590 | iteration 171900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.375E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.976431E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 samples/sec: 6.593 | iteration 172000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.374E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.949615E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.44 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 172000 | lm_loss value: 2.889027E+00 | lm_loss_ppl value: 1.797582E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 172100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.372E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.922720E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.82 samples/sec: 6.594 | iteration 172200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.371E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.948749E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 samples/sec: 6.593 | iteration 172300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.369E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.957791E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 samples/sec: 6.593 | iteration 172400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.368E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947236E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.76 samples/sec: 6.592 | iteration 172500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.366E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943646E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 samples/sec: 6.594 | iteration 172600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.365E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.964934E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.74 samples/sec: 6.593 | iteration 172700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.363E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.963261E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.78 samples/sec: 6.593 | iteration 172800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.362E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945473E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.592 | iteration 172900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.361E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945754E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 samples/sec: 6.592 | iteration 173000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.359E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.936829E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1805.33 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 173000 | lm_loss value: 3.009814E+00 | lm_loss_ppl value: 2.028363E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 173100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.358E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.942831E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 samples/sec: 6.594 | iteration 173200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.356E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.952524E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 samples/sec: 6.593 | iteration 173300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.355E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.940877E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.13 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 samples/sec: 6.592 | iteration 173400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.353E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951345E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.79 samples/sec: 6.593 | iteration 173500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.352E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.939355E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 samples/sec: 6.594 | iteration 173600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.350E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 samples/sec: 6.594 | iteration 173700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.349E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.939625E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.594 | iteration 173800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.347E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.942234E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78 samples/sec: 6.593 | iteration 173900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.346E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.954233E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.593 | iteration 174000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.344E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.936789E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 174000 | lm_loss value: 2.899767E+00 | lm_loss_ppl value: 1.816991E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 174100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 1.343E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.960142E+00 | loss scale: 32768.0 | 
number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 54.68 | batch generator: 0.82 samples/sec: 6.592 | iteration 174200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.342E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.84 samples/sec: 6.593 | iteration 174300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.340E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.940969E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.34 | backward-backward: 1804.31 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.85 samples/sec: 6.594 | iteration 174400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.339E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.938221E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.83 samples/sec: 6.593 | iteration 174500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.337E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951516E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 samples/sec: 6.592 | iteration 174600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.336E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.944487E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.81 samples/sec: 6.592 | iteration 174700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.334E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950092E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.77 samples/sec: 6.593 | iteration 174800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.333E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951087E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.76 samples/sec: 6.594 | iteration 174900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.331E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.959577E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.595 | iteration 175000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.330E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.938952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.76 
----------------------------------------------------------------------------------------------------------- validation results at iteration 175000 | lm_loss value: 2.887935E+00 | lm_loss_ppl value: 1.795619E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 175100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 1.328E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.939624E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.81 samples/sec: 6.590 | iteration 175200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.327E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.943563E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 samples/sec: 6.590 | iteration 175300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.325E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.943899E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.77 samples/sec: 6.592 | iteration 175400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.324E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.934443E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.75 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.83 samples/sec: 6.592 | iteration 175500/ 
320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.323E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.961180E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.592 | iteration 175600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.321E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943044E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.75 samples/sec: 6.592 | iteration 175700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.320E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.941252E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.592 | iteration 175800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.318E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922613E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.591 | iteration 175900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.317E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.939370E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.66 | batch 
generator: 0.81 samples/sec: 6.594 | iteration 176000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.315E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.947991E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 176000 | lm_loss value: 2.949436E+00 | lm_loss_ppl value: 1.909518E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 176100/ 320000 | elapsed time per iteration (ms): 2481.6 | learning rate: 1.314E-04 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.905543E+00 | loss scale: 32768.0 | number of skipped iterations: 3 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 54.02 | batch generator: 0.85 samples/sec: 6.594 | iteration 176200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.312E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.966075E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.83 samples/sec: 6.595 | iteration 176300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.311E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.928431E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.77 samples/sec: 6.594 | 
iteration 176400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.309E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.945763E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.76 samples/sec: 6.595 | iteration 176500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.308E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937508E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.596 | iteration 176600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.306E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951646E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.596 | iteration 176700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.305E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.917621E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.74 samples/sec: 6.597 | iteration 176800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.304E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.953846E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.44 
| batch generator: 0.78 samples/sec: 6.599 | iteration 176900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.302E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937358E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1802.66 | backward-backward: 1802.63 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.76 samples/sec: 6.597 | iteration 177000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.301E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.942372E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 177000 | lm_loss value: 2.923351E+00 | lm_loss_ppl value: 1.860352E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 177100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 1.299E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.933376E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.83 samples/sec: 6.600 | iteration 177200/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 1.298E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.928284E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1802.20 | backward-backward: 1802.18 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.78 samples/sec: 6.592 
| iteration 177300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.296E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.951679E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.80 samples/sec: 6.591 | iteration 177400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.295E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.947990E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.78 samples/sec: 6.598 | iteration 177500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 1.293E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.936619E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1802.88 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.77 samples/sec: 6.591 | iteration 177600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.292E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.935933E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.84 samples/sec: 6.599 | iteration 177700/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 1.290E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926637E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1802.88 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 
55.29 | batch generator: 0.78 samples/sec: 6.591 | iteration 177800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.289E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.928698E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 samples/sec: 6.596 | iteration 177900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.288E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.944129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.78 samples/sec: 6.595 | iteration 178000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.286E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.933685E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 178000 | lm_loss value: 2.942332E+00 | lm_loss_ppl value: 1.896001E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 178100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 1.285E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.939944E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.86 samples/sec: 
6.600 | iteration 178200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 1.283E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922949E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.14 | backward-backward: 1803.11 | backward-allreduce: 0.00 | optimizer: 54.83 | batch generator: 0.78 samples/sec: 6.590 | iteration 178300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.282E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.916752E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.594 | iteration 178400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.280E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.937656E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.75 samples/sec: 6.593 | iteration 178500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.279E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.943488E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.81 samples/sec: 6.591 | iteration 178600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.277E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.925703E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | 
optimizer: 55.26 | batch generator: 0.81 samples/sec: 6.593 | iteration 178700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.276E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.939691E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 samples/sec: 6.592 | iteration 178800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.274E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.938304E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.75 samples/sec: 6.598 | iteration 178900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.273E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.936214E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.91 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.75 samples/sec: 6.593 | iteration 179000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.272E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.944816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 179000 | lm_loss value: 2.976671E+00 | lm_loss_ppl value: 1.962239E+01 | ----------------------------------------------------------------------------------------------------------- 
samples/sec: 6.443 | iteration 179100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 1.270E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.924724E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 samples/sec: 6.596 | iteration 179200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.269E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911094E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.82 samples/sec: 6.589 | iteration 179300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.267E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.935549E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.44 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81 samples/sec: 6.596 | iteration 179400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.266E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.939466E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.590 | iteration 179500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.264E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.948564E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 
0.00 | optimizer: 55.77 | batch generator: 0.80 samples/sec: 6.594 | iteration 179600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.263E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.912855E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.80 samples/sec: 6.596 | iteration 179700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.261E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.930904E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.07 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 54.76 | batch generator: 0.84 samples/sec: 6.592 | iteration 179800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.260E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.932272E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.598 | iteration 179900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.258E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.918658E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.05 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 samples/sec: 6.590 | iteration 180000/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.257E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.936505E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.09 | 
backward-backward: 1805.07 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step80000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 180000 | lm_loss value: 2.926335E+00 | lm_loss_ppl value: 1.865912E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.225 | iteration 180100/ 320000 | elapsed time per iteration (ms): 2570.3 | learning rate: 1.256E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.930938E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1802.47 | backward-backward: 1802.44 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.83 samples/sec: 6.592 | iteration 180200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.254E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.913394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.596 | iteration 180300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.253E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.930243E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.78 samples/sec: 6.598 | iteration 180400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.251E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.933478E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.43 | backward: 1802.92 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.76 samples/sec: 6.592 | iteration 180500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.250E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.916119E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.74 samples/sec: 6.597 | iteration 180600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 1.248E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.921178E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.81 samples/sec: 6.590 | iteration 180700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.247E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.918084E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.77 samples/sec: 6.597 | iteration 180800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.245E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922067E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1802.90 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 samples/sec: 6.597 | iteration 180900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.244E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.920130E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.75 samples/sec: 6.592 | iteration 181000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.243E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.902901E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.92 ----------------------------------------------------------------------------------------------------------- validation results at iteration 181000 | lm_loss value: 2.890402E+00 | lm_loss_ppl value: 1.800054E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 181100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 1.241E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.939847E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.84 samples/sec: 6.588 | iteration 181200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.240E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.924091E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.52 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.81 samples/sec: 6.598 | iteration 181300/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.238E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.922973E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | 
forward: 566.18 | backward: 1802.91 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.79 samples/sec: 6.593 | iteration 181400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.237E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.909583E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.71 samples/sec: 6.593 | iteration 181500/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.235E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.950050E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.79 samples/sec: 6.597 | iteration 181600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.234E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.910693E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.587 | iteration 181700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.232E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919912E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1806.28 | backward-backward: 1806.26 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.596 | iteration 181800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.231E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.923745E+00 | loss scale: 32768.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.82 samples/sec: 6.587 | iteration 181900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.229E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.924237E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.598 | iteration 182000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 1.228E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.928519E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1802.72 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.86 ----------------------------------------------------------------------------------------------------------- validation results at iteration 182000 | lm_loss value: 2.878111E+00 | lm_loss_ppl value: 1.778065E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 182100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 1.227E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.917550E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.86 samples/sec: 6.591 | iteration 182200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 1.225E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.924937E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 567.21 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 samples/sec: 6.598 | iteration 182300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.224E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.932738E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.86 | backward: 1803.55 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.76 samples/sec: 6.589 | iteration 182400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.222E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.906497E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.81 samples/sec: 6.597 | iteration 182500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.221E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.909006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.76 samples/sec: 6.596 | iteration 182600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.219E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.923148E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.592 | iteration 182700/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.218E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911183E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.590 | iteration 182800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.216E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.927574E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.77 samples/sec: 6.591 | iteration 182900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.215E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.930432E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 56.64 | batch generator: 0.78 samples/sec: 6.585 | iteration 183000/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.214E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.911062E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.67 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 183000 | lm_loss value: 2.907209E+00 | lm_loss_ppl value: 1.830563E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 183100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 1.212E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.914056E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | time (ms) | forward: 566.50 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.87 samples/sec: 6.587 | iteration 183200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.211E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.924135E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.90 | backward-backward: 1805.88 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 samples/sec: 6.593 | iteration 183300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.209E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.905582E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 samples/sec: 6.589 | iteration 183400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.208E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.903845E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.70 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.593 | iteration 183500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.206E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.913902E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1803.93 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.81 samples/sec: 6.596 | iteration 183600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 1.205E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.926850E+00 | loss scale: 65536.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.86 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.74 samples/sec: 6.589 | iteration 183700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.203E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.906180E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.72 | backward-backward: 1805.70 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.597 | iteration 183800/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.202E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911381E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 54.46 | batch generator: 0.81 samples/sec: 6.591 | iteration 183900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.201E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919857E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.73 samples/sec: 6.589 | iteration 184000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.199E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.912133E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 184000 | lm_loss 
value: 2.955103E+00 | lm_loss_ppl value: 1.920369E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 184100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 1.198E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.913323E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.86 samples/sec: 6.590 | iteration 184200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.196E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.909435E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 samples/sec: 6.600 | iteration 184300/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 1.195E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.908343E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.23 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.77 samples/sec: 6.589 | iteration 184400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.193E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919521E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 samples/sec: 6.598 | iteration 184500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 1.192E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898281E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 samples/sec: 6.588 | iteration 184600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.191E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.929295E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1805.59 | backward-backward: 1805.56 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.74 samples/sec: 6.594 | iteration 184700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.189E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.910114E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 samples/sec: 6.595 | iteration 184800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.188E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911338E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.75 samples/sec: 6.587 | iteration 184900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 1.186E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.901369E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.80 samples/sec: 6.597 | iteration 185000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.185E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.905968E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 185000 | lm_loss value: 2.949658E+00 | lm_loss_ppl value: 1.909942E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 185100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 1.183E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.893925E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.85 samples/sec: 6.592 | iteration 185200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.182E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.906694E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.77 samples/sec: 6.594 | iteration 185300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.180E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.919806E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 samples/sec: 6.586 | iteration 185400/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 1.179E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.913897E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1805.86 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.84 samples/sec: 6.597 | iteration 185500/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.178E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.898366E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.589 | iteration 185600/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.176E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.902570E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.75 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 samples/sec: 6.593 | iteration 185700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.175E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.912851E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.76 samples/sec: 6.593 | iteration 185800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.173E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.919192E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.73 samples/sec: 6.593 | iteration 185900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.172E-04 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 2.912756E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.84 | backward-backward: 1804.82 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.86 samples/sec: 6.593 | iteration 186000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.170E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.905645E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.84 ----------------------------------------------------------------------------------------------------------- validation results at iteration 186000 | lm_loss value: 2.905843E+00 | lm_loss_ppl value: 1.828066E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 186100/ 320000 | elapsed time per iteration (ms): 2485.6 | learning rate: 1.169E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.916734E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1806.18 | backward-backward: 1806.16 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.82 samples/sec: 6.595 | iteration 186200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.168E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.906291E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.30 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.79 samples/sec: 6.589 | iteration 186300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.166E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.919513E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.76 samples/sec: 6.596 | iteration 186400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.165E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.900855E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 samples/sec: 6.590 | iteration 186500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.163E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.902611E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.594 | iteration 186600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.162E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.909851E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 samples/sec: 6.590 | iteration 186700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.160E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.910154E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.77 samples/sec: 6.592 | iteration 186800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.159E-04 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.914744E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.74 samples/sec: 6.596 | iteration 186900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.157E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.888773E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 samples/sec: 6.589 | iteration 187000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.156E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.900011E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 187000 | lm_loss value: 2.900805E+00 | lm_loss_ppl value: 1.818879E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 187100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 1.155E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.871877E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.88 samples/sec: 6.585 | iteration 187200/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 1.153E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.912438E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.28 | backward: 1806.65 | backward-backward: 1806.63 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 samples/sec: 6.594 | iteration 187300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.152E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.899158E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.77 samples/sec: 6.592 | iteration 187400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 1.150E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.911316E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.85 samples/sec: 6.587 | iteration 187500/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 1.149E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.890419E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.80 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 samples/sec: 6.595 | iteration 187600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.147E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.915982E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.51 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.78 samples/sec: 6.587 | iteration 187700/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.146E-04 | approx flops 
per GPU: 40.9TFLOPS | lm_loss: 2.896455E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1805.72 | backward-backward: 1805.70 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.599 | iteration 187800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.145E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.887451E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1802.81 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.78 samples/sec: 6.591 | iteration 187900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.143E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.892942E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.76 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.75 samples/sec: 6.594 | iteration 188000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.142E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.885229E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 188000 | lm_loss value: 2.931022E+00 | lm_loss_ppl value: 1.874677E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 188100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 1.140E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
2.897960E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.85 samples/sec: 6.593 | iteration 188200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.139E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.897905E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.594 | iteration 188300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.137E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.904960E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.81 samples/sec: 6.588 | iteration 188400/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.136E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.899377E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.84 samples/sec: 6.597 | iteration 188500/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.135E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.891153E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.78 samples/sec: 6.589 | iteration 188600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.133E-04 
| approx flops per GPU: 40.9TFLOPS | lm_loss: 2.904573E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.81 samples/sec: 6.597 | iteration 188700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 1.132E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.906653E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.80 samples/sec: 6.588 | iteration 188800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.130E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.896935E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.85 samples/sec: 6.592 | iteration 188900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.129E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.921957E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.74 samples/sec: 6.598 | iteration 189000/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 1.128E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.873430E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.89 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.76 
----------------------------------------------------------------------------------------------------------- validation results at iteration 189000 | lm_loss value: 2.806434E+00 | lm_loss_ppl value: 1.655080E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.434 | iteration 189100/ 320000 | elapsed time per iteration (ms): 2486.9 | learning rate: 1.126E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.902368E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.67 | backward: 1806.47 | backward-backward: 1806.45 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.84 samples/sec: 6.595 | iteration 189200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.125E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.904568E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.81 samples/sec: 6.588 | iteration 189300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.123E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.887297E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1806.16 | backward-backward: 1806.14 | backward-allreduce: 0.00 | optimizer: 55.16 | batch generator: 0.76 samples/sec: 6.594 | iteration 189400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.122E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.893681E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.98 samples/sec: 6.591 | iteration 189500/ 
320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.120E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.882438E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.84 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 samples/sec: 6.593 | iteration 189600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.119E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.902131E+00 | loss scale: 16384.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1804.52 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.75 samples/sec: 6.593 | iteration 189700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.118E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.883688E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.75 samples/sec: 6.592 | iteration 189800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.116E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.899958E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.80 samples/sec: 6.595 | iteration 189900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.115E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.894361E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.37 | batch 
generator: 0.78 samples/sec: 6.592 | iteration 190000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.113E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.896594E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.38 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step90000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 190000 | lm_loss value: 2.942744E+00 | lm_loss_ppl value: 1.896782E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.228 | iteration 190100/ 320000 | elapsed time per iteration (ms): 2569.0 | learning rate: 1.112E-04 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.886036E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1802.23 | backward-backward: 1802.20 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.85 samples/sec: 6.589 | iteration 190200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.110E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.893201E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.83 samples/sec: 6.596 | iteration 190300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.109E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.883813E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 
| optimizer: 55.37 | batch generator: 0.77 samples/sec: 6.591 | iteration 190400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.108E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.881096E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.91 samples/sec: 6.594 | iteration 190500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.106E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.879266E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 samples/sec: 6.591 | iteration 190600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.105E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.895333E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.88 samples/sec: 6.591 | iteration 190700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.103E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.896576E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 samples/sec: 6.594 | iteration 190800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.102E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.884185E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.64 | 
backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.77 samples/sec: 6.590 | iteration 190900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 1.100E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.886545E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 samples/sec: 6.595 | iteration 191000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 1.099E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.891495E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.43 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 191000 | lm_loss value: 2.931430E+00 | lm_loss_ppl value: 1.875442E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 191100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 1.098E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.893353E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.49 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.93 samples/sec: 6.593 | iteration 191200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 1.096E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.876401E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1803.84 | backward-backward: 1803.82 | 
backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.83 samples/sec: 6.590 | iteration 191300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.095E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.896714E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 samples/sec: 6.592 | iteration 191400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.093E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.876837E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.44 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.591 | iteration 191500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 1.092E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.901294E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.74 samples/sec: 6.584 | iteration 191600/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 1.091E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.892244E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.73 | backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 56.70 | batch generator: 0.85 samples/sec: 6.595 | iteration 191700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.089E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.881456E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | 
backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 samples/sec: 6.590 | iteration 191800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.088E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.881934E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 samples/sec: 6.591 | iteration 191900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.086E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.885275E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.590 | iteration 192000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.085E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.886986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 192000 | lm_loss value: 2.833315E+00 | lm_loss_ppl value: 1.700173E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 192100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 1.083E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.900186E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.33 | backward: 1804.85 | backward-backward: 
1804.83 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.86 samples/sec: 6.592 | iteration 192200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.082E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877361E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.588 | iteration 192300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 1.081E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.883498E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.78 samples/sec: 6.595 | iteration 192400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.079E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.890304E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.45 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 54.63 | batch generator: 0.79 samples/sec: 6.589 | iteration 192500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.078E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.889785E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 samples/sec: 6.594 | iteration 192600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 1.076E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.888711E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.55 | backward: 1804.18 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.587 | iteration 192700/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 1.075E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.879196E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.46 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.76 samples/sec: 6.595 | iteration 192800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.074E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.887023E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 samples/sec: 6.592 | iteration 192900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.072E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.866826E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.80 samples/sec: 6.590 | iteration 193000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.071E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.881120E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.48 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 193000 | lm_loss value: 2.853888E+00 | lm_loss_ppl value: 1.735513E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 193100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 1.069E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.881653E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.68 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.80 samples/sec: 6.589 | iteration 193200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.068E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.889540E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1805.14 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.83 samples/sec: 6.597 | iteration 193300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 1.067E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877888E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.08 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.587 | iteration 193400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.065E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.867823E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1805.70 | backward-backward: 1805.68 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.90 samples/sec: 6.595 | iteration 193500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 1.064E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.876234E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.89 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.592 | iteration 193600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 1.062E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882170E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.78 samples/sec: 6.589 | iteration 193700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.061E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.883452E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.58 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 samples/sec: 6.594 | iteration 193800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.060E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.865060E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 56.16 | batch generator: 0.77 samples/sec: 6.586 | iteration 193900/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.058E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.880760E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1806.09 | backward-backward: 1806.07 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.76 samples/sec: 6.596 | iteration 194000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.057E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.870717E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.84 ----------------------------------------------------------------------------------------------------------- validation results at iteration 194000 | lm_loss value: 2.894890E+00 | lm_loss_ppl value: 1.808151E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 194100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 1.055E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.860239E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.83 samples/sec: 6.592 | iteration 194200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 1.054E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.883917E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 samples/sec: 6.595 | iteration 194300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.053E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874817E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.586 | iteration 194400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 1.051E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.876257E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.75 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.77 samples/sec: 6.595 | iteration 194500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.050E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.872150E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.29 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.84 samples/sec: 6.591 | iteration 194600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 1.048E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.879685E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.80 samples/sec: 6.593 | iteration 194700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 1.047E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874678E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 samples/sec: 6.591 | iteration 194800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.046E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.876344E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.98 samples/sec: 6.585 | iteration 194900/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 1.044E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.875509E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1805.55 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 56.63 | batch generator: 0.84 samples/sec: 6.595 | iteration 195000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.043E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874517E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 195000 | lm_loss value: 2.800706E+00 | lm_loss_ppl value: 1.645626E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 195100/ 320000 | elapsed time per iteration (ms): 2486.4 | learning rate: 1.041E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.875151E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1806.57 | backward-backward: 1806.55 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.86 samples/sec: 6.596 | iteration 195200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.040E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.866458E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.80 samples/sec: 6.589 | iteration 195300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.039E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.878257E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.63 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.81 samples/sec: 6.591 | iteration 195400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 1.037E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.882995E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.75 samples/sec: 6.594 | iteration 195500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.036E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877170E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.75 samples/sec: 6.594 | iteration 195600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.034E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.869288E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.77 samples/sec: 6.594 | iteration 195700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.033E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.883563E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 samples/sec: 6.590 | iteration 195800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 1.032E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.879724E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.10 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.594 | iteration 195900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.030E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.868319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.98 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.75 samples/sec: 6.589 | iteration 196000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 1.029E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.870657E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.14 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 196000 | lm_loss value: 2.813854E+00 | lm_loss_ppl value: 1.667405E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 196100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 1.027E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.872852E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1802.73 | backward-backward: 1802.71 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.84 samples/sec: 6.594 | iteration 196200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 1.026E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874950E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 samples/sec: 6.592 | iteration 196300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 1.025E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.850206E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.06 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.84 samples/sec: 6.599 | iteration 196400/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 1.023E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.858491E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1802.84 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75 samples/sec: 6.589 | iteration 196500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 1.022E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.864555E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.596 | iteration 196600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 1.020E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.864643E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.80 samples/sec: 6.593 | iteration 196700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 1.019E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.846960E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.589 | iteration 196800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 1.018E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.871263E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.81 samples/sec: 6.594 | iteration 196900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 1.016E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.873144E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.74 samples/sec: 6.587 | iteration 197000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 1.015E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.895490E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1805.56 | backward-backward: 1805.53 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.86 ----------------------------------------------------------------------------------------------------------- validation results at iteration 197000 | lm_loss value: 2.846826E+00 | lm_loss_ppl value: 1.723300E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 197100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 1.013E-04 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.874953E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 56.39 | batch generator: 0.85 samples/sec: 6.590 | iteration 197200/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 1.012E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.874832E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.22 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.595 | iteration 197300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 1.011E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.870186E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.73 samples/sec: 6.595 | iteration 197400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 1.009E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.862314E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.81 samples/sec: 6.589 | iteration 197500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 1.008E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.869876E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.83 samples/sec: 6.597 | iteration 197600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 1.006E-04 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.872791E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 54.41 | batch generator: 0.78 samples/sec: 6.588 | iteration 197700/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 1.005E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.853130E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1805.23 | backward-backward: 1805.21 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.75 samples/sec: 6.596 | iteration 197800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 1.004E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.882266E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.61 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80 samples/sec: 6.590 | iteration 197900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 1.002E-04 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.855889E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.78 samples/sec: 6.596 | iteration 198000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 1.001E-04 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.865151E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 
----------------------------------------------------------------------------------------------------------- validation results at iteration 198000 | lm_loss value: 2.858880E+00 | lm_loss_ppl value: 1.744198E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 198100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 9.995E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.862487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 56.44 | batch generator: 0.83 samples/sec: 6.591 | iteration 198200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 9.981E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.857746E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78 samples/sec: 6.597 | iteration 198300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 9.967E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.865623E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.87 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.76 samples/sec: 6.590 | iteration 198400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 9.953E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.869846E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 samples/sec: 6.596 | iteration 198500/ 
320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 9.940E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.840215E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.27 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.595 | iteration 198600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 9.926E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.877109E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.73 samples/sec: 6.591 | iteration 198700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 9.912E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.849883E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.28 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.77 samples/sec: 6.594 | iteration 198800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 9.898E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.874938E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.65 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.588 | iteration 198900/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 9.884E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.848520E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 56.01 | batch 
generator: 0.82 samples/sec: 6.596 | iteration 199000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 9.871E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.843685E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.72 ----------------------------------------------------------------------------------------------------------- validation results at iteration 199000 | lm_loss value: 2.901250E+00 | lm_loss_ppl value: 1.819688E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 199100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 9.857E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.854572E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 1.09 samples/sec: 6.589 | iteration 199200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 9.843E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.858744E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.01 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.79 samples/sec: 6.596 | iteration 199300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 9.829E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.859268E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.53 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.588 | 
iteration 199400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 9.815E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.848360E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.81 samples/sec: 6.598 | iteration 199500/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 9.801E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.847390E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.39 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.75 samples/sec: 6.590 | iteration 199600/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 9.788E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.850941E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.78 samples/sec: 6.596 | iteration 199700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 9.774E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.846189E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 54.86 | batch generator: 0.81 samples/sec: 6.595 | iteration 199800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 9.760E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.862650E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.59 
| batch generator: 0.75 samples/sec: 6.587 | iteration 199900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 9.747E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.839540E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1806.08 | backward-backward: 1806.06 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.75 samples/sec: 6.597 | iteration 200000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.733E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.846803E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step100000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 200000 | lm_loss value: 2.858487E+00 | lm_loss_ppl value: 1.743513E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.206 | iteration 200100/ 320000 | elapsed time per iteration (ms): 2577.9 | learning rate: 9.719E-05 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.851064E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 568.45 | backward: 1809.18 | backward-backward: 1809.16 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.83 samples/sec: 6.597 | iteration 200200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 9.705E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.864427E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.02 | backward-backward: 1803.00 | 
backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 samples/sec: 6.590 | iteration 200300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 9.692E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.856422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1805.00 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.76 samples/sec: 6.595 | iteration 200400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 9.678E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.866338E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.596 | iteration 200500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 9.664E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.852659E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.589 | iteration 200600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 9.650E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.849985E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.81 samples/sec: 6.598 | iteration 200700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 9.636E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.847309E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | 
backward: 1802.76 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.87 samples/sec: 6.591 | iteration 200800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 9.623E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.854982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 samples/sec: 6.596 | iteration 200900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 9.609E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.851122E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.85 samples/sec: 6.594 | iteration 201000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 9.595E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.833287E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 201000 | lm_loss value: 2.796851E+00 | lm_loss_ppl value: 1.639294E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 201100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 9.582E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.841221E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.14 | backward-backward: 
1805.12 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.86 samples/sec: 6.598 | iteration 201200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 9.568E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.844435E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.74 samples/sec: 6.585 | iteration 201300/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 9.554E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.853267E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1806.65 | backward-backward: 1806.63 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 samples/sec: 6.593 | iteration 201400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 9.540E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.855822E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.77 samples/sec: 6.590 | iteration 201500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 9.527E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.843596E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.77 samples/sec: 6.593 | iteration 201600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 9.513E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.832040E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.65 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.80 samples/sec: 6.598 | iteration 201700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 9.499E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.860461E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.40 | backward-backward: 1803.37 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.77 samples/sec: 6.590 | iteration 201800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 9.486E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.849138E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.05 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.598 | iteration 201900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 9.472E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.847576E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1802.81 | backward-backward: 1802.79 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.86 samples/sec: 6.589 | iteration 202000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 9.458E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.825201E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.84 ----------------------------------------------------------------------------------------------------------- validation results at iteration 202000 | lm_loss value: 2.839590E+00 | lm_loss_ppl value: 1.710875E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 202100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 9.445E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.849583E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.90 samples/sec: 6.596 | iteration 202200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 9.431E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.856394E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.14 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 samples/sec: 6.592 | iteration 202300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 9.417E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.842900E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.79 samples/sec: 6.595 | iteration 202400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 9.404E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.830025E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.64 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.93 samples/sec: 6.589 | iteration 202500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 9.390E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.836286E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 56.25 | batch generator: 0.75 samples/sec: 6.599 | iteration 202600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 9.376E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.855296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.591 | iteration 202700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 9.363E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.852473E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.83 samples/sec: 6.592 | iteration 202800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 9.349E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.853751E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.49 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.76 samples/sec: 6.595 | iteration 202900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 9.335E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.854634E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.75 samples/sec: 6.589 | iteration 203000/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 9.322E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.833404E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 203000 | lm_loss value: 2.793117E+00 | lm_loss_ppl value: 1.633185E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 203100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 9.308E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.850141E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.86 samples/sec: 6.589 | iteration 203200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 9.295E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.842355E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.86 samples/sec: 6.594 | iteration 203300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 9.281E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.848521E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.65 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79 samples/sec: 6.596 | iteration 203400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 9.267E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.836229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1803.61 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.588 | iteration 203500/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 9.254E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.848741E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.65 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.77 samples/sec: 6.595 | iteration 203600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 9.240E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.844409E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.80 samples/sec: 6.593 | iteration 203700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 9.227E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.840764E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 54.58 | batch generator: 0.81 samples/sec: 6.597 | iteration 203800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.213E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.851332E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.88 samples/sec: 6.595 | iteration 203900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 9.200E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.843549E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.592 | iteration 204000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 9.186E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.846371E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 204000 | lm_loss value: 2.887904E+00 | lm_loss_ppl value: 1.795563E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 204100/ 320000 | elapsed time per iteration (ms): 2481.5 | learning rate: 9.173E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.839841E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1802.87 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.84 samples/sec: 6.589 | iteration 204200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 9.159E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.853775E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.93 samples/sec: 6.601 | iteration 204300/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 9.146E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.836523E+00 | loss scale: 16384.0 | number of skipped iterations: 1 
| number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1802.42 | backward-backward: 1802.40 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.81 samples/sec: 6.593 | iteration 204400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 9.132E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.843987E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 samples/sec: 6.597 | iteration 204500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 9.119E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.832281E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.13 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80 samples/sec: 6.597 | iteration 204600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.105E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.854664E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.75 samples/sec: 6.592 | iteration 204700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 9.091E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.839626E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.08 | backward-backward: 1804.05 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.80 samples/sec: 6.597 | iteration 204800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.078E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.840267E+00 | loss 
scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.01 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.592 | iteration 204900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 9.064E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.836918E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.81 samples/sec: 6.597 | iteration 205000/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.051E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.848750E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 205000 | lm_loss value: 2.917443E+00 | lm_loss_ppl value: 1.849394E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 205100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 9.037E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.833185E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.85 samples/sec: 6.593 | iteration 205200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 9.024E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.839823E+00 | loss scale: 16384.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.81 samples/sec: 6.597 | iteration 205300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 9.010E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.847297E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1802.89 | backward-backward: 1802.87 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 samples/sec: 6.590 | iteration 205400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 8.997E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.844380E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.95 samples/sec: 6.598 | iteration 205500/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 8.983E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.837637E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1802.78 | backward-backward: 1802.76 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.75 samples/sec: 6.594 | iteration 205600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 8.970E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.851175E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.593 | iteration 205700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 8.956E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.849739E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 samples/sec: 6.594 | iteration 205800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 8.943E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.835285E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.74 samples/sec: 6.589 | iteration 205900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 8.929E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.841388E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.85 samples/sec: 6.599 | iteration 206000/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 8.916E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.838356E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1802.92 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 206000 | lm_loss value: 2.854188E+00 | lm_loss_ppl value: 1.736033E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 206100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 8.903E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.827210E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.86 samples/sec: 6.593 | iteration 206200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 8.889E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.833841E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.80 samples/sec: 6.597 | iteration 206300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 8.876E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.829783E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.75 samples/sec: 6.589 | iteration 206400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 8.862E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.843696E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 samples/sec: 6.596 | iteration 206500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 8.849E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.821791E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.85 samples/sec: 6.593 | iteration 206600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 8.835E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.818038E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 1.08 samples/sec: 6.589 | iteration 206700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 8.822E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.823350E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.89 samples/sec: 6.595 | iteration 206800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 8.809E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.827261E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.88 samples/sec: 6.595 | iteration 206900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 8.795E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.845583E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78 samples/sec: 6.592 | iteration 207000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 8.782E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.850736E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.59 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.75 
----------------------------------------------------------------------------------------------------------- validation results at iteration 207000 | lm_loss value: 2.748863E+00 | lm_loss_ppl value: 1.562486E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 207100/ 320000 | elapsed time per iteration (ms): 2481.9 | learning rate: 8.769E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.830620E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1802.77 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.82 samples/sec: 6.593 | iteration 207200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 8.755E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.826467E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 samples/sec: 6.596 | iteration 207300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 8.742E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.842319E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.81 samples/sec: 6.597 | iteration 207400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 8.728E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.835752E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1802.79 | backward-backward: 1802.76 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 1.02 samples/sec: 6.591 | iteration 207500/ 
320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 8.715E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.826314E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.76 samples/sec: 6.598 | iteration 207600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 8.702E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.835082E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1802.73 | backward-backward: 1802.71 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 samples/sec: 6.596 | iteration 207700/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 8.688E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.817727E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.38 | backward-backward: 1803.36 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 samples/sec: 6.592 | iteration 207800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 8.675E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.829916E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.595 | iteration 207900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 8.662E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.836194E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.92 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 56.39 | batch 
generator: 0.77 samples/sec: 6.593 | iteration 208000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 8.648E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.805206E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 208000 | lm_loss value: 2.831977E+00 | lm_loss_ppl value: 1.697899E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 208100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 8.635E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.833283E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.83 samples/sec: 6.600 | iteration 208200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 8.622E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.831178E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.17 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.76 samples/sec: 6.589 | iteration 208300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 8.608E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.833559E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.84 samples/sec: 6.594 | 
iteration 208400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 8.595E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.820368E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.80 samples/sec: 6.596 | iteration 208500/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 8.582E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.839627E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 samples/sec: 6.589 | iteration 208600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 8.568E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.819772E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1805.29 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.76 samples/sec: 6.597 | iteration 208700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 8.555E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.821215E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 samples/sec: 6.596 | iteration 208800/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 8.542E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.818560E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.55 
| batch generator: 0.74 samples/sec: 6.590 | iteration 208900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 8.529E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.840672E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.594 | iteration 209000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 8.515E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.812456E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 209000 | lm_loss value: 2.780476E+00 | lm_loss_ppl value: 1.612670E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 209100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 8.502E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.819534E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1802.95 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.94 samples/sec: 6.591 | iteration 209200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 8.489E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.829411E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1805.17 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.598 
| iteration 209300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 8.476E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.819524E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1802.97 | backward-backward: 1802.95 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 samples/sec: 6.598 | iteration 209400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 8.462E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.826232E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1802.74 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 samples/sec: 6.590 | iteration 209500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 8.449E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.823405E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.97 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.84 samples/sec: 6.598 | iteration 209600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 8.436E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.824936E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1802.66 | backward-backward: 1802.64 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.83 samples/sec: 6.596 | iteration 209700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 8.423E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.819929E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 
55.72 | batch generator: 0.79 samples/sec: 6.591 | iteration 209800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 8.409E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.826456E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1805.03 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.76 samples/sec: 6.598 | iteration 209900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 8.396E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.812462E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1802.77 | backward-backward: 1802.74 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 samples/sec: 6.598 | iteration 210000/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 8.383E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.826378E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step110000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 210000 | lm_loss value: 2.786085E+00 | lm_loss_ppl value: 1.621740E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.206 | iteration 210100/ 320000 | elapsed time per iteration (ms): 2578.1 | learning rate: 8.370E-05 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.798781E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.97 | backward: 1807.46 | backward-backward: 1807.44 | 
backward-allreduce: 0.00 | optimizer: 56.32 | batch generator: 0.85 samples/sec: 6.593 | iteration 210200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 8.357E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.813144E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 samples/sec: 6.598 | iteration 210300/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 8.343E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.818721E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.591 | iteration 210400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 8.330E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.822578E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.595 | iteration 210500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 8.317E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.805124E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.27 | batch generator: 0.82 samples/sec: 6.598 | iteration 210600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 8.304E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.829164E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | 
backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 samples/sec: 6.592 | iteration 210700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 8.291E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.817079E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.81 samples/sec: 6.596 | iteration 210800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 8.278E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.821971E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 samples/sec: 6.599 | iteration 210900/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 8.265E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.814009E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1802.75 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.589 | iteration 211000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 8.251E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.831830E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.17 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 211000 | lm_loss value: 2.828266E+00 | lm_loss_ppl value: 1.691611E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 211100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 8.238E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.814595E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.85 samples/sec: 6.592 | iteration 211200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 8.225E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.812917E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.78 samples/sec: 6.592 | iteration 211300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 8.212E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.813546E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 samples/sec: 6.600 | iteration 211400/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 8.199E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.810233E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1802.36 | backward-backward: 1802.34 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.80 samples/sec: 6.591 | iteration 211500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 8.186E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.827729E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.68 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.596 | iteration 211600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 8.173E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.815220E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.72 samples/sec: 6.597 | iteration 211700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 8.160E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.811539E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.09 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.15 | batch generator: 0.83 samples/sec: 6.588 | iteration 211800/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 8.146E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.787294E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.97 samples/sec: 6.597 | iteration 211900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 8.133E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.820916E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.03 | backward-backward: 1803.01 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 samples/sec: 6.594 | iteration 212000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 8.120E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.809944E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 212000 | lm_loss value: 2.771506E+00 | lm_loss_ppl value: 1.598268E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 212100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 8.107E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.807163E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.32 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.87 samples/sec: 6.594 | iteration 212200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 8.094E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.806263E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.76 samples/sec: 6.586 | iteration 212300/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 8.081E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.814668E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.81 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.77 samples/sec: 6.596 | iteration 212400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 8.068E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.810582E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.78 samples/sec: 6.589 | iteration 212500/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 8.055E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.818044E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.58 | backward-backward: 1805.56 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.594 | iteration 212600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 8.042E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.803832E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82 samples/sec: 6.594 | iteration 212700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 8.029E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.803479E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.92 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.591 | iteration 212800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 8.016E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.804441E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1805.79 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 54.63 | batch generator: 0.80 samples/sec: 6.596 | iteration 212900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 8.003E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.818412E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.80 samples/sec: 6.589 | iteration 213000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 7.990E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.803716E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.62 | backward-backward: 1805.60 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 213000 | lm_loss value: 2.803329E+00 | lm_loss_ppl value: 1.649949E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 213100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 7.977E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.820198E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.89 samples/sec: 6.594 | iteration 213200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 7.964E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.811161E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.589 | iteration 213300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 7.951E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.791296E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.48 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.83 samples/sec: 6.595 | iteration 213400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 7.938E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.804435E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 56.11 | batch generator: 0.77 samples/sec: 6.594 | iteration 213500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 7.925E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.808982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.76 samples/sec: 6.598 | iteration 213600/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 7.912E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.814500E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.96 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79 samples/sec: 6.592 | iteration 213700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 7.899E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.809180E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.62 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.595 | iteration 213800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 7.886E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.789474E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.81 samples/sec: 6.596 | iteration 213900/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 7.873E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.800332E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.79 samples/sec: 6.590 | iteration 214000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 7.860E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.808278E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 214000 | lm_loss value: 2.784805E+00 | lm_loss_ppl value: 1.619666E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 214100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 7.847E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.795650E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.09 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.87 samples/sec: 6.592 | iteration 214200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 7.835E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.791644E+00 | loss scale: 65536.0 | number of skipped 
iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.78 samples/sec: 6.590 | iteration 214300/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 7.822E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.802989E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.80 samples/sec: 6.593 | iteration 214400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 7.809E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.805388E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.84 samples/sec: 6.587 | iteration 214500/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 7.796E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.810659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.77 samples/sec: 6.596 | iteration 214600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 7.783E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.799274E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.75 samples/sec: 6.589 | iteration 214700/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 7.770E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.784055E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.79 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.79 samples/sec: 6.596 | iteration 214800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 7.757E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.800480E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.92 samples/sec: 6.594 | iteration 214900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 7.744E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.819302E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.590 | iteration 215000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 7.731E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.796681E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 215000 | lm_loss value: 2.818605E+00 | lm_loss_ppl value: 1.675346E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 215100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 7.719E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.810033E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.86 samples/sec: 6.591 | iteration 215200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 7.706E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.809696E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.75 samples/sec: 6.597 | iteration 215300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 7.693E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.802079E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.589 | iteration 215400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 7.680E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.799725E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.83 samples/sec: 6.594 | iteration 215500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 7.667E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.789125E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.71 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.88 samples/sec: 6.591 | iteration 215600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 7.654E-05 | approx flops per GPU: 40.9TFLOPS | 
lm_loss: 2.781138E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.78 samples/sec: 6.599 | iteration 215700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 7.641E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.796307E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1802.68 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.77 samples/sec: 6.592 | iteration 215800/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 7.629E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.792079E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 samples/sec: 6.592 | iteration 215900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 7.616E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.800379E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.593 | iteration 216000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 7.603E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.807486E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 
----------------------------------------------------------------------------------------------------------- validation results at iteration 216000 | lm_loss value: 2.805528E+00 | lm_loss_ppl value: 1.653581E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 216100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 7.590E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.813310E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.84 samples/sec: 6.589 | iteration 216200/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 7.577E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.787771E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.42 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.76 samples/sec: 6.598 | iteration 216300/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 7.565E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.796005E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1802.72 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.80 samples/sec: 6.591 | iteration 216400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 7.552E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.800721E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.593 | iteration 216500/ 
320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 7.539E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.791657E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 samples/sec: 6.590 | iteration 216600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 7.526E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.784952E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.73 samples/sec: 6.587 | iteration 216700/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 7.513E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.781029E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.30 | backward-backward: 1805.28 | backward-allreduce: 0.00 | optimizer: 56.39 | batch generator: 0.82 samples/sec: 6.600 | iteration 216800/ 320000 | elapsed time per iteration (ms): 2424.1 | learning rate: 7.501E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.794182E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 54.48 | batch generator: 0.79 samples/sec: 6.590 | iteration 216900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 7.488E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.785225E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.94 | batch 
generator: 0.83 samples/sec: 6.594 | iteration 217000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 7.475E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.799529E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 217000 | lm_loss value: 2.743475E+00 | lm_loss_ppl value: 1.554089E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 217100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 7.463E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.800377E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.83 samples/sec: 6.589 | iteration 217200/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 7.450E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.796234E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.80 samples/sec: 6.594 | iteration 217300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 7.437E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.795112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 samples/sec: 6.591 | 
iteration 217400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 7.425E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.792578E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.83 samples/sec: 6.596 | iteration 217500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 7.412E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.789649E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.18 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 samples/sec: 6.589 | iteration 217600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 7.399E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.789028E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.593 | iteration 217700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 7.386E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.804594E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.591 | iteration 217800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 7.374E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.792996E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.34 
| batch generator: 0.88 samples/sec: 6.596 | iteration 217900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 7.361E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.794062E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.82 samples/sec: 6.590 | iteration 218000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 7.349E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.792565E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.37 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 218000 | lm_loss value: 2.779558E+00 | lm_loss_ppl value: 1.611190E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 218100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 7.336E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.782136E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.67 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.91 samples/sec: 6.593 | iteration 218200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 7.323E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.785983E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.78 samples/sec: 6.586 
| iteration 218300/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 7.311E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.771649E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.81 | backward-backward: 1805.78 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.79 samples/sec: 6.594 | iteration 218400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 7.298E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.788412E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.69 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.84 samples/sec: 6.585 | iteration 218500/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 7.285E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.778736E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1806.76 | backward-backward: 1806.74 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.593 | iteration 218600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 7.273E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.788150E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 samples/sec: 6.590 | iteration 218700/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 7.260E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.789305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 
55.77 | batch generator: 0.82 samples/sec: 6.577 | iteration 218800/ 320000 | elapsed time per iteration (ms): 2432.7 | learning rate: 7.248E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.798412E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.59 | backward: 1808.33 | backward-backward: 1808.29 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.79 samples/sec: 6.587 | iteration 218900/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 7.235E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.783631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.61 | backward-backward: 1805.58 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.78 samples/sec: 6.594 | iteration 219000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 7.223E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.794135E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 219000 | lm_loss value: 2.757401E+00 | lm_loss_ppl value: 1.575884E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 219100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 7.210E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.792747E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.84 samples/sec: 
6.597 | iteration 219200/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 7.198E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.752775E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.15 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.78 samples/sec: 6.589 | iteration 219300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 7.185E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.782396E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.80 samples/sec: 6.598 | iteration 219400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 7.172E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.777851E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1802.97 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.591 | iteration 219500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 7.160E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.788046E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.88 samples/sec: 6.594 | iteration 219600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 7.147E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.771344E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | 
optimizer: 55.77 | batch generator: 0.81 samples/sec: 6.590 | iteration 219700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 7.135E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.777267E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.91 | backward-backward: 1804.89 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.584 | iteration 219800/ 320000 | elapsed time per iteration (ms): 2430.2 | learning rate: 7.122E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.785469E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.56 | backward: 1806.17 | backward-backward: 1806.14 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.80 samples/sec: 6.579 | iteration 219900/ 320000 | elapsed time per iteration (ms): 2431.9 | learning rate: 7.110E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.792609E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1807.18 | backward-backward: 1807.15 | backward-allreduce: 0.00 | optimizer: 56.88 | batch generator: 0.83 samples/sec: 6.591 | iteration 220000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 7.097E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.783473E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.76 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step120000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 220000 | lm_loss value: 2.832596E+00 | lm_loss_ppl value: 1.698951E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.197 | iteration 220100/ 320000 | elapsed time per iteration (ms): 2581.8 | learning rate: 7.085E-05 | approx flops per GPU: 38.5TFLOPS | lm_loss: 2.787599E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 56.03 | batch generator: 0.86 samples/sec: 6.585 | iteration 220200/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 7.072E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.791324E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.78 | backward: 1805.75 | backward-backward: 1805.73 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.77 samples/sec: 6.597 | iteration 220300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 7.060E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.795425E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 samples/sec: 6.589 | iteration 220400/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 7.047E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.793063E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 samples/sec: 6.596 | iteration 220500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 7.035E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.800036E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.26 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.595 | iteration 220600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 7.022E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.776077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.74 samples/sec: 6.593 | iteration 220700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 7.010E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.790384E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 samples/sec: 6.594 | iteration 220800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 6.997E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.773863E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.88 samples/sec: 6.593 | iteration 220900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 6.985E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.788637E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.81 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.74 samples/sec: 6.593 | iteration 221000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.972E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.762545E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 221000 | lm_loss value: 2.750136E+00 | lm_loss_ppl value: 1.564475E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 221100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 6.960E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.771130E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.92 samples/sec: 6.594 | iteration 221200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 6.948E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.782462E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.82 samples/sec: 6.580 | iteration 221300/ 320000 | elapsed time per iteration (ms): 2431.7 | learning rate: 6.935E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.773837E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.69 | backward: 1807.46 | backward-backward: 1807.43 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.79 samples/sec: 6.592 | iteration 221400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 6.923E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.783598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 567.32 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.79 samples/sec: 6.595 | iteration 221500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 6.911E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.783026E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.73 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.77 samples/sec: 6.593 | iteration 221600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.898E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.786018E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.74 samples/sec: 6.595 | iteration 221700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 6.886E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759960E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.77 samples/sec: 6.588 | iteration 221800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 6.873E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.771013E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 samples/sec: 6.596 | iteration 221900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 6.861E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.772403E+00 | loss scale: 
32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.587 | iteration 222000/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 6.849E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.757396E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1806.08 | backward-backward: 1806.06 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 222000 | lm_loss value: 2.851840E+00 | lm_loss_ppl value: 1.731961E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 222100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 6.836E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.795343E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 56.40 | batch generator: 0.86 samples/sec: 6.597 | iteration 222200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.824E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.780659E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.76 samples/sec: 6.595 | iteration 222300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 6.812E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.803277E+00 | loss scale: 32768.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 samples/sec: 6.590 | iteration 222400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 6.799E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.779285E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76 samples/sec: 6.596 | iteration 222500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 6.787E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.787581E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.85 samples/sec: 6.588 | iteration 222600/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 6.775E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.776649E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1805.49 | backward-backward: 1805.47 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.77 samples/sec: 6.595 | iteration 222700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 6.762E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.779587E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1803.14 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 samples/sec: 6.589 | iteration 222800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 6.750E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.749223E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 samples/sec: 6.592 | iteration 222900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 6.738E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.756665E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.22 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.82 samples/sec: 6.594 | iteration 223000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 6.726E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.776718E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 223000 | lm_loss value: 2.744097E+00 | lm_loss_ppl value: 1.555056E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.432 | iteration 223100/ 320000 | elapsed time per iteration (ms): 2487.6 | learning rate: 6.713E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.760411E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.99 | backward: 1806.37 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 1.00 samples/sec: 6.597 | iteration 223200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.701E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.784316E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.77 samples/sec: 6.592 | iteration 223300/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 6.689E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759524E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 samples/sec: 6.589 | iteration 223400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 6.676E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.782536E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.18 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80 samples/sec: 6.594 | iteration 223500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 6.664E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759742E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80 samples/sec: 6.593 | iteration 223600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.652E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.770838E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.84 samples/sec: 6.594 | iteration 223700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 6.640E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.768008E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 samples/sec: 6.589 | iteration 223800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 6.628E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.761618E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.43 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.590 | iteration 223900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 6.615E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.773209E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.77 samples/sec: 6.594 | iteration 224000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 6.603E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.768821E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 54.76 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 224000 | lm_loss value: 2.759386E+00 | lm_loss_ppl value: 1.579014E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 224100/ 320000 | elapsed time per iteration (ms): 2483.2 | learning rate: 6.591E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.761041E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.84 samples/sec: 6.597 | iteration 224200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 6.579E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.748634E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1803.69 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.588 | iteration 224300/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 6.567E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.770290E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 56.20 | batch generator: 0.77 samples/sec: 6.596 | iteration 224400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 6.555E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.748346E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.12 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.597 | iteration 224500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.543E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.745575E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.76 samples/sec: 6.595 | iteration 224600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 6.530E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.762674E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.77 samples/sec: 6.593 | iteration 224700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 6.518E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.761523E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.76 samples/sec: 6.599 | iteration 224800/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 6.506E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.767048E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1802.72 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.75 samples/sec: 6.594 | iteration 224900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 6.494E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.752663E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 samples/sec: 6.595 | iteration 225000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 6.482E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.765651E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.76 
----------------------------------------------------------------------------------------------------------- validation results at iteration 225000 | lm_loss value: 2.757615E+00 | lm_loss_ppl value: 1.576221E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 225100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 6.470E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.751848E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.14 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.93 samples/sec: 6.595 | iteration 225200/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 6.458E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.765128E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.45 | backward-backward: 1803.43 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 samples/sec: 6.594 | iteration 225300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 6.446E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.768570E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 samples/sec: 6.590 | iteration 225400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 6.433E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.772401E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 56.66 | batch generator: 0.79 samples/sec: 6.596 | iteration 225500/ 
320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 6.421E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.775476E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.76 samples/sec: 6.589 | iteration 225600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 6.409E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.751468E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.41 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.599 | iteration 225700/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 6.397E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.781364E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1802.48 | backward-backward: 1802.46 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.78 samples/sec: 6.591 | iteration 225800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 6.385E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.758481E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.593 | iteration 225900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 6.373E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.771733E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.07 | backward-backward: 1804.04 | backward-allreduce: 0.00 | optimizer: 55.46 | batch 
generator: 0.76 samples/sec: 6.595 | iteration 226000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 6.361E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.765064E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 226000 | lm_loss value: 2.693229E+00 | lm_loss_ppl value: 1.477932E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 226100/ 320000 | elapsed time per iteration (ms): 2484.1 | learning rate: 6.349E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.768588E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.83 samples/sec: 6.597 | iteration 226200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.337E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759586E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1803.63 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.78 samples/sec: 6.590 | iteration 226300/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 6.325E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.752615E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.80 samples/sec: 6.593 | 
iteration 226400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.313E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.742524E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.17 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.78 samples/sec: 6.591 | iteration 226500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 6.301E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.762515E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.598 | iteration 226600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 6.289E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.753635E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1802.80 | backward-backward: 1802.78 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 samples/sec: 6.594 | iteration 226700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 6.277E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.771872E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1803.93 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.593 | iteration 226800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 6.265E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.751002E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.72 
| batch generator: 0.76 samples/sec: 6.595 | iteration 226900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 6.253E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.764084E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.83 samples/sec: 6.597 | iteration 227000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.241E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.750660E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 227000 | lm_loss value: 2.725383E+00 | lm_loss_ppl value: 1.526226E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 227100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 6.229E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.747637E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.34 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.86 samples/sec: 6.594 | iteration 227200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 6.217E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.753936E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.596 
| iteration 227300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 6.206E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.747083E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.80 samples/sec: 6.591 | iteration 227400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 6.194E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.755164E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 samples/sec: 6.593 | iteration 227500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 6.182E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.745743E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.591 | iteration 227600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 6.170E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.753399E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.87 samples/sec: 6.591 | iteration 227700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 6.158E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.753108E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.58 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 
55.62 | batch generator: 0.78 samples/sec: 6.592 | iteration 227800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 6.146E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.749351E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.65 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.80 samples/sec: 6.589 | iteration 227900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 6.134E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.732365E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.69 | backward: 1804.69 | backward-backward: 1804.66 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.597 | iteration 228000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 6.122E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.763722E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1802.99 | backward-backward: 1802.97 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 228000 | lm_loss value: 2.788276E+00 | lm_loss_ppl value: 1.625297E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 228100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 6.110E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.758531E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.87 samples/sec: 
6.591 | iteration 228200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 6.098E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.758223E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 samples/sec: 6.594 | iteration 228300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 6.087E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.754001E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.80 samples/sec: 6.586 | iteration 228400/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 6.075E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.750123E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.69 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.81 samples/sec: 6.596 | iteration 228500/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 6.063E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759639E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.588 | iteration 228600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 6.051E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.737249E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1806.07 | backward-backward: 1806.05 | backward-allreduce: 0.00 | 
optimizer: 55.52 | batch generator: 0.78 samples/sec: 6.593 | iteration 228700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.040E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.756644E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.79 samples/sec: 6.593 | iteration 228800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 6.028E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.742794E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.42 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.83 samples/sec: 6.593 | iteration 228900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 6.016E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.751244E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.594 | iteration 229000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 6.004E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.728914E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.87 ----------------------------------------------------------------------------------------------------------- validation results at iteration 229000 | lm_loss value: 2.716910E+00 | lm_loss_ppl value: 1.513349E+01 | ----------------------------------------------------------------------------------------------------------- 
samples/sec: 6.439 | iteration 229100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 5.992E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.758645E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.84 samples/sec: 6.593 | iteration 229200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 5.981E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.756267E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.80 samples/sec: 6.594 | iteration 229300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 5.969E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.757164E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.594 | iteration 229400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.957E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.756346E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1803.85 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.81 samples/sec: 6.589 | iteration 229500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 5.945E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.746256E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 
0.00 | optimizer: 55.75 | batch generator: 0.75 samples/sec: 6.595 | iteration 229600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 5.934E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.734819E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.79 samples/sec: 6.593 | iteration 229700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 5.922E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.744673E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.81 samples/sec: 6.589 | iteration 229800/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 5.910E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.742413E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.78 samples/sec: 6.595 | iteration 229900/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 5.899E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.747764E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 samples/sec: 6.594 | iteration 230000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.887E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.759525E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.91 | 
backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step130000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 230000 | lm_loss value: 2.823469E+00 | lm_loss_ppl value: 1.683515E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.225 | iteration 230100/ 320000 | elapsed time per iteration (ms): 2570.1 | learning rate: 5.875E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.745953E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.45 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.87 samples/sec: 6.595 | iteration 230200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.864E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.748477E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.77 samples/sec: 6.589 | iteration 230300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 5.852E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.748264E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1805.47 | backward-backward: 1805.45 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.594 | iteration 230400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.840E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.754018E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.91 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 samples/sec: 6.592 | iteration 230500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 5.829E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.752930E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.80 samples/sec: 6.589 | iteration 230600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 5.817E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.737130E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.591 | iteration 230700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 5.805E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.751166E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.83 samples/sec: 6.587 | iteration 230800/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 5.794E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.746848E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 1.00 samples/sec: 6.592 | iteration 230900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.782E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.719106E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 samples/sec: 6.587 | iteration 231000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 5.771E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.762615E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 231000 | lm_loss value: 2.770120E+00 | lm_loss_ppl value: 1.596056E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 231100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 5.759E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.733612E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1802.89 | backward-backward: 1802.86 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.85 samples/sec: 6.592 | iteration 231200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.748E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.723493E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.591 | iteration 231300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 5.736E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.759964E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | 
forward: 566.85 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79 samples/sec: 6.595 | iteration 231400/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.724E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.746694E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75 samples/sec: 6.588 | iteration 231500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 5.713E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.724331E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.77 samples/sec: 6.596 | iteration 231600/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 5.701E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.726225E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1802.93 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.84 samples/sec: 6.590 | iteration 231700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 5.690E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.742117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.92 | backward-backward: 1804.90 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.78 samples/sec: 6.593 | iteration 231800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 5.678E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.745255E+00 | loss scale: 32768.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.590 | iteration 231900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 5.667E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.746128E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 56.63 | batch generator: 0.80 samples/sec: 6.590 | iteration 232000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 5.655E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.731274E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.84 ----------------------------------------------------------------------------------------------------------- validation results at iteration 232000 | lm_loss value: 2.713244E+00 | lm_loss_ppl value: 1.507811E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 232100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 5.644E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.745145E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1802.72 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.88 samples/sec: 6.591 | iteration 232200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 5.632E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.747982E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time 
(ms) | forward: 566.97 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.597 | iteration 232300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 5.621E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.730949E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.590 | iteration 232400/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 5.609E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.741195E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81 samples/sec: 6.592 | iteration 232500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 5.598E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.724316E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 samples/sec: 6.594 | iteration 232600/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 5.586E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.733464E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 samples/sec: 6.589 | iteration 232700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 5.575E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.731762E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.90 samples/sec: 6.594 | iteration 232800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 5.563E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.743212E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 samples/sec: 6.586 | iteration 232900/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 5.552E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.728574E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1806.21 | backward-backward: 1806.19 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 samples/sec: 6.593 | iteration 233000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 5.540E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.735064E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 233000 | lm_loss value: 2.788218E+00 | lm_loss_ppl value: 1.625203E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 233100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 5.529E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.758500E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | time (ms) | forward: 567.14 | backward: 1805.35 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.85 samples/sec: 6.592 | iteration 233200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 5.518E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.724757E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.85 samples/sec: 6.594 | iteration 233300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 5.506E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.740111E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.82 samples/sec: 6.588 | iteration 233400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 5.495E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.742692E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.595 | iteration 233500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 5.483E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.741235E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.43 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.588 | iteration 233600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 5.472E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.723048E+00 | loss scale: 65536.0 | number of 
skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1806.08 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.82 samples/sec: 6.594 | iteration 233700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 5.461E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.735476E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1803.58 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.591 | iteration 233800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 5.450E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.741182E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.90 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.75 samples/sec: 6.591 | iteration 233900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 5.438E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.749383E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.78 samples/sec: 6.594 | iteration 234000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.427E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.731332E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 234000 | lm_loss 
value: 2.713469E+00 | lm_loss_ppl value: 1.508150E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.435 | iteration 234100/ 320000 | elapsed time per iteration (ms): 2486.6 | learning rate: 5.416E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.739590E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.41 | backward: 1805.52 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 56.48 | batch generator: 0.95 samples/sec: 6.598 | iteration 234200/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 5.404E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.715305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 samples/sec: 6.591 | iteration 234300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 5.393E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.722883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.76 samples/sec: 6.592 | iteration 234400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.382E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.735547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.85 samples/sec: 6.595 | iteration 234500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 5.370E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.728419E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 samples/sec: 6.592 | iteration 234600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 5.359E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.703845E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 54.77 | batch generator: 0.78 samples/sec: 6.597 | iteration 234700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 5.348E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.722599E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.76 samples/sec: 6.589 | iteration 234800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 5.337E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.745035E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.597 | iteration 234900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 5.325E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.733756E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.36 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.24 | batch generator: 0.77 samples/sec: 6.592 | iteration 235000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 5.314E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.734606E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 235000 | lm_loss value: 2.691060E+00 | lm_loss_ppl value: 1.474729E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 235100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 5.303E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.716174E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.83 samples/sec: 6.592 | iteration 235200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.292E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.734035E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.74 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.89 samples/sec: 6.587 | iteration 235300/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 5.280E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.716158E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.15 | backward: 1805.44 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.88 samples/sec: 6.598 | iteration 235400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 5.269E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.719626E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.76 samples/sec: 6.590 | iteration 235500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 5.258E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.722123E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.81 samples/sec: 6.594 | iteration 235600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 5.247E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.730175E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.78 samples/sec: 6.597 | iteration 235700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 5.236E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.712450E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.79 samples/sec: 6.589 | iteration 235800/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 5.225E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.723305E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.20 | batch generator: 0.80 samples/sec: 6.596 | iteration 235900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 5.214E-05 | approx flops per GPU: 
41.0TFLOPS | lm_loss: 2.736081E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.48 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.592 | iteration 236000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 5.202E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.702437E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 236000 | lm_loss value: 2.761008E+00 | lm_loss_ppl value: 1.581578E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 236100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 5.191E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.731689E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.50 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.84 samples/sec: 6.594 | iteration 236200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 5.180E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.739073E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.588 | iteration 236300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 5.169E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.730917E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.77 samples/sec: 6.598 | iteration 236400/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 5.158E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.717496E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1802.93 | backward-backward: 1802.91 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.74 samples/sec: 6.592 | iteration 236500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 5.147E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.707153E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.73 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77 samples/sec: 6.595 | iteration 236600/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.136E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.716986E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.76 samples/sec: 6.595 | iteration 236700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 5.125E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.699745E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.79 samples/sec: 6.589 | iteration 236800/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 5.114E-05 | approx flops per 
GPU: 40.9TFLOPS | lm_loss: 2.721167E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.81 samples/sec: 6.600 | iteration 236900/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 5.103E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.714954E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 54.46 | batch generator: 0.76 samples/sec: 6.590 | iteration 237000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 5.092E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.726127E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.90 ----------------------------------------------------------------------------------------------------------- validation results at iteration 237000 | lm_loss value: 2.704316E+00 | lm_loss_ppl value: 1.494409E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 237100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 5.081E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.733974E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.90 samples/sec: 6.596 | iteration 237200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 5.070E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.732924E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.79 samples/sec: 6.588 | iteration 237300/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 5.058E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.701049E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.51 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.88 samples/sec: 6.597 | iteration 237400/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 5.047E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.714175E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1803.11 | backward-backward: 1803.09 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.79 samples/sec: 6.590 | iteration 237500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 5.036E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.730139E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.77 samples/sec: 6.596 | iteration 237600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 5.025E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.697535E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78 samples/sec: 6.595 | iteration 237700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 5.014E-05 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 2.713621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.591 | iteration 237800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 5.003E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.736107E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 samples/sec: 6.599 | iteration 237900/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 4.993E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.713349E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 54.88 | batch generator: 0.79 samples/sec: 6.591 | iteration 238000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 4.982E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.703772E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 238000 | lm_loss value: 2.753046E+00 | lm_loss_ppl value: 1.569035E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 238100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 4.971E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 
2.725006E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.00 | backward-backward: 1802.98 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.95 samples/sec: 6.593 | iteration 238200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.960E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.698601E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.85 samples/sec: 6.593 | iteration 238300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.949E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.731270E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.595 | iteration 238400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 4.938E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.694816E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 56.12 | batch generator: 0.79 samples/sec: 6.589 | iteration 238500/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 4.927E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.727235E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.80 samples/sec: 6.598 | iteration 238600/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 4.916E-05 
| approx flops per GPU: 41.0TFLOPS | lm_loss: 2.697990E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.76 samples/sec: 6.593 | iteration 238700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.905E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.704695E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.76 samples/sec: 6.592 | iteration 238800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 4.894E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.700107E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.48 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.597 | iteration 238900/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 4.883E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.716730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1803.13 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.589 | iteration 239000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 4.873E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.716874E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.74 
----------------------------------------------------------------------------------------------------------- validation results at iteration 239000 | lm_loss value: 2.705721E+00 | lm_loss_ppl value: 1.496511E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 239100/ 320000 | elapsed time per iteration (ms): 2482.0 | learning rate: 4.862E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.699721E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.07 | batch generator: 0.87 samples/sec: 6.595 | iteration 239200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.851E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.703474E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.589 | iteration 239300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 4.840E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.715621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.43 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.86 samples/sec: 6.600 | iteration 239400/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 4.829E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.718670E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1802.57 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.71 samples/sec: 6.590 | iteration 239500/ 
320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.819E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.702203E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 56.65 | batch generator: 0.81 samples/sec: 6.592 | iteration 239600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 4.808E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.700984E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 samples/sec: 6.598 | iteration 239700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 4.797E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.714182E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1802.80 | backward-backward: 1802.78 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.94 samples/sec: 6.594 | iteration 239800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 4.786E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.701785E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.76 samples/sec: 6.591 | iteration 239900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 4.775E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.724051E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.80 | batch 
generator: 0.78 samples/sec: 6.597 | iteration 240000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 4.765E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.725995E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.19 | backward-backward: 1803.16 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step140000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 240000 | lm_loss value: 2.677252E+00 | lm_loss_ppl value: 1.454507E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.214 | iteration 240100/ 320000 | elapsed time per iteration (ms): 2574.7 | learning rate: 4.754E-05 | approx flops per GPU: 38.6TFLOPS | lm_loss: 2.705043E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 568.37 | backward: 1809.31 | backward-backward: 1809.29 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.83 samples/sec: 6.591 | iteration 240200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 4.743E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.714879E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.77 samples/sec: 6.594 | iteration 240300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.733E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.693105E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 
0.00 | optimizer: 55.68 | batch generator: 0.77 samples/sec: 6.599 | iteration 240400/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 4.722E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.717834E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.77 samples/sec: 6.587 | iteration 240500/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 4.711E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.704707E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1806.11 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.78 samples/sec: 6.592 | iteration 240600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 4.700E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.690056E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.04 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.79 samples/sec: 6.598 | iteration 240700/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 4.690E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.707750E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.87 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.83 samples/sec: 6.593 | iteration 240800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.679E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.723352E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.59 | 
backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.84 samples/sec: 6.592 | iteration 240900/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 4.669E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.702019E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.77 samples/sec: 6.600 | iteration 241000/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 4.658E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.708173E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1802.71 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.19 | batch generator: 0.73 ----------------------------------------------------------------------------------------------------------- validation results at iteration 241000 | lm_loss value: 2.705850E+00 | lm_loss_ppl value: 1.496703E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 241100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 4.647E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.706801E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.06 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.87 samples/sec: 6.592 | iteration 241200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 4.637E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.684836E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.31 | backward-backward: 1804.28 | 
backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.81 samples/sec: 6.598 | iteration 241300/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 4.626E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.709907E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.597 | iteration 241400/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 4.615E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.702860E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.591 | iteration 241500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 4.605E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.712340E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.77 samples/sec: 6.596 | iteration 241600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 4.594E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695821E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.83 samples/sec: 6.596 | iteration 241700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 4.583E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.700454E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | 
backward: 1803.05 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 56.38 | batch generator: 0.76 samples/sec: 6.591 | iteration 241800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 4.573E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.697565E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1805.20 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.594 | iteration 241900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 4.562E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.710031E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.03 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.82 samples/sec: 6.599 | iteration 242000/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 4.552E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.693907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.02 | backward: 1802.86 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 242000 | lm_loss value: 2.736271E+00 | lm_loss_ppl value: 1.542934E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 242100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 4.541E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.691336E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1805.64 | backward-backward: 
1805.62 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.82 samples/sec: 6.596 | iteration 242200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 4.531E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.689686E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.52 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.85 samples/sec: 6.596 | iteration 242300/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 4.520E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.708373E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.74 samples/sec: 6.590 | iteration 242400/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 4.510E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.699362E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.75 samples/sec: 6.593 | iteration 242500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.499E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.687162E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 samples/sec: 6.598 | iteration 242600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 4.488E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.698579E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
565.91 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.80 samples/sec: 6.591 | iteration 242700/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 4.478E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.694874E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.90 samples/sec: 6.591 | iteration 242800/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 4.468E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.702237E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78 samples/sec: 6.597 | iteration 242900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 4.457E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.685580E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.01 | batch generator: 0.79 samples/sec: 6.597 | iteration 243000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 4.447E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.697531E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.96 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 243000 | lm_loss value: 2.678370E+00 | lm_loss_ppl value: 1.456134E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 243100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 4.436E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.695235E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.49 | backward-backward: 1805.46 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.86 samples/sec: 6.594 | iteration 243200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 4.426E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.690933E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.81 samples/sec: 6.597 | iteration 243300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 4.415E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.704198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.17 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.91 samples/sec: 6.592 | iteration 243400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 4.405E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.700610E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.79 samples/sec: 6.595 | iteration 243500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.395E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.688445E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.69 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.84 samples/sec: 6.599 | iteration 243600/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 4.384E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.684412E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1802.91 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 samples/sec: 6.598 | iteration 243700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 4.374E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.701894E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.76 samples/sec: 6.591 | iteration 243800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 4.363E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.697761E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.76 samples/sec: 6.593 | iteration 243900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.353E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.682357E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.77 samples/sec: 6.597 | iteration 244000/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 4.343E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.698982E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.44 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 244000 | lm_loss value: 2.682792E+00 | lm_loss_ppl value: 1.462587E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 244100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 4.332E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.702554E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1802.77 | backward-backward: 1802.75 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.85 samples/sec: 6.593 | iteration 244200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.322E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.671779E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.76 samples/sec: 6.595 | iteration 244300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.312E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695796E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.28 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 samples/sec: 6.594 | iteration 244400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 4.301E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.684875E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.77 samples/sec: 6.600 | iteration 244500/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 4.291E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.692135E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.79 samples/sec: 6.592 | iteration 244600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 4.281E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.680473E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.76 samples/sec: 6.593 | iteration 244700/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 4.270E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.689572E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.75 samples/sec: 6.594 | iteration 244800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.260E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.689074E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.79 samples/sec: 6.597 | iteration 244900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 4.250E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.674841E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.91 | backward: 1803.43 | backward-backward: 1803.41 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.86 samples/sec: 6.592 | iteration 245000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 4.240E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.676530E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 56.22 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 245000 | lm_loss value: 2.693595E+00 | lm_loss_ppl value: 1.478473E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 245100/ 320000 | elapsed time per iteration (ms): 2484.5 | learning rate: 4.229E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.687701E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.84 samples/sec: 6.593 | iteration 245200/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 4.219E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669124E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 samples/sec: 6.600 | iteration 245300/ 320000 | elapsed time per iteration (ms): 2424.2 | learning rate: 4.209E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.688631E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 565.84 | backward: 1802.59 | backward-backward: 1802.57 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78 samples/sec: 6.595 | iteration 245400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 4.199E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.703270E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 54.67 | batch generator: 0.79 samples/sec: 6.592 | iteration 245500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 4.189E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.681367E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 1.01 samples/sec: 6.593 | iteration 245600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.178E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.678141E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 samples/sec: 6.599 | iteration 245700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 4.168E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.685599E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1802.71 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 samples/sec: 6.595 | iteration 245800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 4.158E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.673221E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.82 samples/sec: 6.592 | iteration 245900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 4.148E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.674567E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.83 samples/sec: 6.590 | iteration 246000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 4.138E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.668605E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.86 ----------------------------------------------------------------------------------------------------------- validation results at iteration 246000 | lm_loss value: 2.724094E+00 | lm_loss_ppl value: 1.524260E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 246100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 4.128E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.689904E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.90 samples/sec: 6.592 | iteration 246200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 4.117E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.676909E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.593 | iteration 246300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 4.107E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.679104E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.75 samples/sec: 6.598 | iteration 246400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 4.097E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.676583E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.31 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 54.93 | batch generator: 0.76 samples/sec: 6.598 | iteration 246500/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 4.087E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.692239E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1802.68 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.592 | iteration 246600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 4.077E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.683626E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 samples/sec: 6.595 | iteration 246700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 4.067E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.678777E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.25 | backward-backward: 1803.23 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.90 samples/sec: 6.599 | iteration 246800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 4.057E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.680826E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.81 | backward: 1803.10 | backward-backward: 1803.08 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.75 samples/sec: 6.590 | iteration 246900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 4.047E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.692816E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.90 samples/sec: 6.594 | iteration 247000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 4.037E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.691890E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.66 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 247000 | lm_loss value: 2.685040E+00 | lm_loss_ppl value: 1.465879E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 247100/ 320000 | elapsed time per iteration (ms): 2481.4 | learning rate: 4.027E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.665516E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.88 samples/sec: 6.590 | iteration 247200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 4.017E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.689487E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.78 samples/sec: 6.594 | iteration 247300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 4.007E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.695116E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.75 samples/sec: 6.599 | iteration 247400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.997E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.684220E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1802.83 | backward-backward: 1802.81 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78 samples/sec: 6.592 | iteration 247500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.987E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.674095E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.79 samples/sec: 6.592 | iteration 247600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.977E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.669522E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.598 | iteration 247700/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.967E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.679477E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1802.73 | backward-backward: 1802.70 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.80 samples/sec: 6.593 | iteration 247800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.957E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.685939E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.79 samples/sec: 6.592 | iteration 247900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.947E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.660878E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.77 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.80 samples/sec: 6.599 | iteration 248000/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.937E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.657171E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1802.71 | backward-backward: 1802.68 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 
----------------------------------------------------------------------------------------------------------- validation results at iteration 248000 | lm_loss value: 2.661885E+00 | lm_loss_ppl value: 1.432326E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 248100/ 320000 | elapsed time per iteration (ms): 2483.7 | learning rate: 3.927E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.658633E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1804.08 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.86 samples/sec: 6.591 | iteration 248200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.917E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.678560E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.77 samples/sec: 6.597 | iteration 248300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.907E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.663152E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.79 samples/sec: 6.589 | iteration 248400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.897E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.677003E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1805.13 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.76 samples/sec: 6.597 | iteration 248500/ 
320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.887E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.681410E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.596 | iteration 248600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.878E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.655598E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.92 samples/sec: 6.590 | iteration 248700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.868E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.669682E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.80 samples/sec: 6.600 | iteration 248800/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.858E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669091E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1802.85 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.75 samples/sec: 6.590 | iteration 248900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.848E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.678873E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1804.91 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.44 | batch 
generator: 0.76 samples/sec: 6.593 | iteration 249000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.838E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644965E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 249000 | lm_loss value: 2.673376E+00 | lm_loss_ppl value: 1.448880E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 249100/ 320000 | elapsed time per iteration (ms): 2482.6 | learning rate: 3.828E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.674373E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.84 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.85 samples/sec: 6.590 | iteration 249200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.819E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.677266E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.89 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.79 samples/sec: 6.595 | iteration 249300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.809E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.674853E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.92 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.77 samples/sec: 6.589 | 
iteration 249400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.799E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.660229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.22 | backward-backward: 1805.20 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.81 samples/sec: 6.594 | iteration 249500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.789E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.657848E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.76 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 samples/sec: 6.601 | iteration 249600/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.779E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.678182E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1802.78 | backward-backward: 1802.76 | backward-allreduce: 0.00 | optimizer: 54.82 | batch generator: 0.78 samples/sec: 6.589 | iteration 249700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.770E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.686300E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.81 samples/sec: 6.599 | iteration 249800/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.760E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.653446E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1802.82 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.41 
| batch generator: 0.77 samples/sec: 6.593 | iteration 249900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.750E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.666006E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.75 samples/sec: 6.593 | iteration 250000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.740E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.658907E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.82 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step150000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 250000 | lm_loss value: 2.713522E+00 | lm_loss_ppl value: 1.508230E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.230 | iteration 250100/ 320000 | elapsed time per iteration (ms): 2568.2 | learning rate: 3.731E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.663464E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1803.97 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.89 samples/sec: 6.591 | iteration 250200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.721E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.654070E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.06 | backward-backward: 1805.04 | 
backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.78 samples/sec: 6.597 | iteration 250300/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.711E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.682838E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.75 samples/sec: 6.589 | iteration 250400/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.702E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.655833E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.53 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.74 samples/sec: 6.597 | iteration 250500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.692E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.652030E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1802.96 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.594 | iteration 250600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.682E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.671863E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.82 samples/sec: 6.589 | iteration 250700/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.673E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.667605E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.94 | 
backward: 1804.29 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.77 samples/sec: 6.597 | iteration 250800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.663E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.674926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.74 samples/sec: 6.587 | iteration 250900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.653E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.678223E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.82 samples/sec: 6.597 | iteration 251000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.644E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.665537E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 251000 | lm_loss value: 2.724601E+00 | lm_loss_ppl value: 1.525033E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 251100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 3.634E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.659660E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.35 | backward-backward: 
1805.33 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.84 samples/sec: 6.594 | iteration 251200/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.624E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.664871E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.83 samples/sec: 6.595 | iteration 251300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.615E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.659312E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 samples/sec: 6.586 | iteration 251400/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.605E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.651647E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1806.41 | backward-backward: 1806.39 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 samples/sec: 6.596 | iteration 251500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.596E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.673328E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.90 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.76 samples/sec: 6.587 | iteration 251600/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.586E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.655240E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 
566.88 | backward: 1806.04 | backward-backward: 1806.01 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.76 samples/sec: 6.597 | iteration 251700/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.577E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.661420E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.79 samples/sec: 6.591 | iteration 251800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.567E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.665675E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.75 samples/sec: 6.591 | iteration 251900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.558E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.660114E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.76 samples/sec: 6.596 | iteration 252000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.548E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669045E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.18 | backward: 1803.83 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 252000 | lm_loss value: 2.617984E+00 | lm_loss_ppl value: 1.370806E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 252100/ 320000 | elapsed time per iteration (ms): 2486.0 | learning rate: 3.539E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.665584E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.89 samples/sec: 6.597 | iteration 252200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.529E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.652804E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1803.09 | backward-backward: 1803.07 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77 samples/sec: 6.590 | iteration 252300/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.520E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.642893E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.80 samples/sec: 6.598 | iteration 252400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.510E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669879E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 samples/sec: 6.590 | iteration 252500/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.501E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.656080E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.02 | backward-backward: 1804.99 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.83 samples/sec: 6.592 | iteration 252600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.491E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.666407E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 56.33 | batch generator: 0.77 samples/sec: 6.597 | iteration 252700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.482E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.661565E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76 samples/sec: 6.590 | iteration 252800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.472E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.662334E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.598 | iteration 252900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.463E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.642861E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1802.96 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.76 samples/sec: 6.590 | iteration 253000/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.454E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652412E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.98 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.84 ----------------------------------------------------------------------------------------------------------- validation results at iteration 253000 | lm_loss value: 2.659381E+00 | lm_loss_ppl value: 1.428744E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 253100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.444E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.663165E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.85 samples/sec: 6.596 | iteration 253200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.435E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.663109E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.67 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.72 samples/sec: 6.591 | iteration 253300/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.426E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.657439E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.78 samples/sec: 6.593 | iteration 253400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.416E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.668945E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.78 samples/sec: 6.586 | iteration 253500/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.407E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.644699E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 samples/sec: 6.596 | iteration 253600/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.397E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.658753E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.51 | backward-backward: 1803.49 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 samples/sec: 6.587 | iteration 253700/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.388E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.649125E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.53 | backward-backward: 1805.51 | backward-allreduce: 0.00 | optimizer: 56.71 | batch generator: 0.78 samples/sec: 6.592 | iteration 253800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.379E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.645621E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 samples/sec: 6.590 | iteration 253900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.370E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.658275E+00 | loss scale: 
65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.90 | backward-backward: 1804.88 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79 samples/sec: 6.592 | iteration 254000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.360E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.649459E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 254000 | lm_loss value: 2.673647E+00 | lm_loss_ppl value: 1.449272E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 254100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 3.351E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.664099E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.64 | backward-backward: 1803.62 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.86 samples/sec: 6.591 | iteration 254200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.342E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.647928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.596 | iteration 254300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.333E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.667393E+00 | loss scale: 65536.0 | number of skipped iterations: 2 
| number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 54.53 | batch generator: 0.79 samples/sec: 6.594 | iteration 254400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.323E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.651205E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.586 | iteration 254500/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.314E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652517E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.56 | backward: 1805.30 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.88 samples/sec: 6.595 | iteration 254600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.305E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.662771E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1803.41 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 samples/sec: 6.584 | iteration 254700/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 3.296E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645928E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1806.14 | backward-backward: 1806.11 | backward-allreduce: 0.00 | optimizer: 56.36 | batch generator: 0.80 samples/sec: 6.591 | iteration 254800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.286E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.659246E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.79 samples/sec: 6.594 | iteration 254900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.277E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.669189E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.24 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.75 samples/sec: 6.591 | iteration 255000/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.268E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.643936E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1804.79 | backward-backward: 1804.77 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 255000 | lm_loss value: 2.601423E+00 | lm_loss_ppl value: 1.348290E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 255100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.259E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.672719E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.05 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.84 samples/sec: 6.590 | iteration 255200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.250E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.649057E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.74 samples/sec: 6.596 | iteration 255300/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.241E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.641667E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1802.82 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.77 samples/sec: 6.596 | iteration 255400/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.232E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.638946E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76 samples/sec: 6.592 | iteration 255500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.222E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.654334E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77 samples/sec: 6.595 | iteration 255600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.213E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.654885E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.89 samples/sec: 6.588 | iteration 255700/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.204E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.650035E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.27 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.79 samples/sec: 6.596 | iteration 255800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.195E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.637616E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.590 | iteration 255900/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.186E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.647789E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.38 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.79 samples/sec: 6.591 | iteration 256000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.177E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.638560E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 256000 | lm_loss value: 2.613021E+00 | lm_loss_ppl value: 1.364019E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 256100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 3.168E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.638071E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.83 samples/sec: 6.587 | iteration 256200/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.159E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.653433E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.91 | backward: 1805.19 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.93 samples/sec: 6.597 | iteration 256300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.150E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.645851E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.05 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 samples/sec: 6.589 | iteration 256400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.141E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.636873E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.93 samples/sec: 6.594 | iteration 256500/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.132E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.645460E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1803.70 | backward-backward: 1803.67 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.592 | iteration 256600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.123E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.639103E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.79 samples/sec: 6.589 | iteration 256700/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.114E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.656709E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.26 | backward: 1804.99 | backward-backward: 1804.96 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.597 | iteration 256800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.105E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644236E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.79 samples/sec: 6.588 | iteration 256900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.096E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652751E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1806.07 | backward-backward: 1806.04 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.76 samples/sec: 6.595 | iteration 257000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.087E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.634290E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 
----------------------------------------------------------------------------------------------------------- validation results at iteration 257000 | lm_loss value: 2.646342E+00 | lm_loss_ppl value: 1.410236E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 257100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 3.078E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.634675E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.86 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.84 samples/sec: 6.595 | iteration 257200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.069E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644108E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1803.28 | backward-backward: 1803.25 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.88 samples/sec: 6.594 | iteration 257300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.060E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.651939E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.589 | iteration 257400/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.051E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.636288E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.27 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.594 | iteration 257500/ 
320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.043E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644746E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76 samples/sec: 6.589 | iteration 257600/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.034E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.635151E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.61 | backward: 1805.34 | backward-backward: 1805.32 | backward-allreduce: 0.00 | optimizer: 55.05 | batch generator: 0.80 samples/sec: 6.595 | iteration 257700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.025E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633356E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1803.16 | backward-backward: 1803.14 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.89 samples/sec: 6.594 | iteration 257800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.016E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.647102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1803.53 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.79 samples/sec: 6.594 | iteration 257900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.007E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636674E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 55.95 | batch 
generator: 0.81 samples/sec: 6.592 | iteration 258000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.646078E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 258000 | lm_loss value: 2.675063E+00 | lm_loss_ppl value: 1.451327E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 258100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.638025E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.47 | backward-backward: 1803.45 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.85 samples/sec: 6.595 | iteration 258200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.645801E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 samples/sec: 6.591 | iteration 258300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.640803E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.09 | backward: 1804.67 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.81 samples/sec: 6.597 | 
iteration 258400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631449E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1802.95 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.75 samples/sec: 6.592 | iteration 258500/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.629810E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.592 | iteration 258600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.650591E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.45 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.84 samples/sec: 6.590 | iteration 258700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637146E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1805.29 | backward-backward: 1805.27 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 samples/sec: 6.588 | iteration 258800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.633659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.48 | backward: 1805.03 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.92 
| batch generator: 0.78 samples/sec: 6.595 | iteration 258900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633613E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.24 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 samples/sec: 6.586 | iteration 259000/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.627650E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.43 | backward: 1805.69 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 259000 | lm_loss value: 2.664446E+00 | lm_loss_ppl value: 1.435999E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 259100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.627427E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.64 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.85 samples/sec: 6.588 | iteration 259200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626682E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.595 
| iteration 259300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.647105E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1803.62 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 samples/sec: 6.595 | iteration 259400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621334E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.78 samples/sec: 6.586 | iteration 259500/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626471E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.20 | backward: 1806.32 | backward-backward: 1806.30 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.78 samples/sec: 6.597 | iteration 259600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.634537E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.68 | backward-backward: 1803.65 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.78 samples/sec: 6.589 | iteration 259700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626572E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1806.23 | backward-backward: 1806.21 | backward-allreduce: 0.00 | optimizer: 
55.14 | batch generator: 0.78 samples/sec: 6.595 | iteration 259800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.640982E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.78 samples/sec: 6.593 | iteration 259900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.639997E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.591 | iteration 260000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637998E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.13 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.96 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step160000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 260000 | lm_loss value: 2.582207E+00 | lm_loss_ppl value: 1.322629E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.226 | iteration 260100/ 320000 | elapsed time per iteration (ms): 2569.7 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.632487E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1804.97 | backward-backward: 1804.95 | 
backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.84 samples/sec: 6.588 | iteration 260200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.636014E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.09 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.77 samples/sec: 6.594 | iteration 260300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.73 | backward-backward: 1803.71 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.77 samples/sec: 6.586 | iteration 260400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.620883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.62 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.87 samples/sec: 6.596 | iteration 260500/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.648598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1803.07 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.85 samples/sec: 6.590 | iteration 260600/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.647334E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | 
backward: 1805.51 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.81 samples/sec: 6.590 | iteration 260700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637316E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 54.95 | batch generator: 0.76 samples/sec: 6.595 | iteration 260800/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.643345E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.89 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.81 samples/sec: 6.592 | iteration 260900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.630883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.77 samples/sec: 6.595 | iteration 261000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635566E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 261000 | lm_loss value: 2.639818E+00 | lm_loss_ppl value: 1.401065E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.441 | iteration 261100/ 320000 | elapsed time per iteration (ms): 2484.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.629482E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.52 | backward-backward: 1804.49 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.93 samples/sec: 6.589 | iteration 261200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.76 samples/sec: 6.589 | iteration 261300/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.636870E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.78 samples/sec: 6.595 | iteration 261400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.654547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.97 | backward-backward: 1803.95 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 samples/sec: 6.588 | iteration 261500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.632112E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.06 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.75 samples/sec: 6.596 | iteration 261600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626945E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.29 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.586 | iteration 261700/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.625858E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.96 | backward-backward: 1805.94 | backward-allreduce: 0.00 | optimizer: 56.02 | batch generator: 0.84 samples/sec: 6.597 | iteration 261800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635333E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.97 | backward: 1803.40 | backward-backward: 1803.38 | backward-allreduce: 0.00 | optimizer: 54.72 | batch generator: 0.81 samples/sec: 6.592 | iteration 261900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.638753E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.78 samples/sec: 6.592 | iteration 262000/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628753E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.39 | backward-backward: 1804.36 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 262000 | lm_loss value: 2.602781E+00 | lm_loss_ppl value: 1.350123E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 262100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.626034E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.19 | backward-backward: 1804.17 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.89 samples/sec: 6.594 | iteration 262200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636656E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78 samples/sec: 6.591 | iteration 262300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.652104E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.587 | iteration 262400/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.643214E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 56.37 | batch generator: 0.81 samples/sec: 6.595 | iteration 262500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.639419E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.49 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.79 samples/sec: 6.587 | iteration 262600/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.620926E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.78 samples/sec: 6.595 | iteration 262700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.637571E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.77 samples/sec: 6.586 | iteration 262800/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.631400E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1806.37 | backward-backward: 1806.35 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.76 samples/sec: 6.592 | iteration 262900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.647985E+00 | loss scale: 
65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.24 | backward: 1804.96 | backward-backward: 1804.94 | backward-allreduce: 0.00 | optimizer: 54.59 | batch generator: 0.76 samples/sec: 6.594 | iteration 263000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.630998E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 263000 | lm_loss value: 2.554792E+00 | lm_loss_ppl value: 1.286863E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 263100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.636675E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.83 samples/sec: 6.587 | iteration 263200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.634670E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.37 | backward: 1805.14 | backward-backward: 1805.12 | backward-allreduce: 0.00 | optimizer: 56.00 | batch generator: 0.86 samples/sec: 6.597 | iteration 263300/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.634238E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.79 samples/sec: 6.587 | iteration 263400/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613133E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1806.13 | backward-backward: 1806.11 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.593 | iteration 263500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.658441E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.79 samples/sec: 6.590 | iteration 263600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623465E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1805.41 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.80 samples/sec: 6.592 | iteration 263700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624368E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.15 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.82 samples/sec: 6.597 | iteration 263800/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631992E+00 | loss 
scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.77 samples/sec: 6.587 | iteration 263900/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623446E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.49 | backward: 1805.46 | backward-backward: 1805.44 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.83 samples/sec: 6.595 | iteration 264000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625187E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 264000 | lm_loss value: 2.600442E+00 | lm_loss_ppl value: 1.346969E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 264100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.635901E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.53 | backward-backward: 1804.50 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.83 samples/sec: 6.594 | iteration 264200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.648513E+00 | loss scale: 32768.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.595 | iteration 264300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628999E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.74 samples/sec: 6.592 | iteration 264400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637428E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1804.38 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.595 | iteration 264500/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620783E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.30 | batch generator: 0.78 samples/sec: 6.588 | iteration 264600/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.630475E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.49 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.78 samples/sec: 6.594 | iteration 264700/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.646866E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.86 samples/sec: 6.588 | iteration 264800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.659629E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1805.65 | backward-backward: 1805.63 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.594 | iteration 264900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617983E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.80 samples/sec: 6.593 | iteration 265000/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.640137E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 265000 | lm_loss value: 2.637312E+00 | lm_loss_ppl value: 1.397559E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 265100/ 320000 | elapsed time per iteration (ms): 2483.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.628506E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.81 samples/sec: 6.589 | iteration 265200/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.637288E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.30 | backward: 1805.04 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.77 samples/sec: 6.596 | iteration 265300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631703E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1803.12 | backward-backward: 1803.10 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.76 samples/sec: 6.591 | iteration 265400/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609397E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.591 | iteration 265500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.620068E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.75 samples/sec: 6.593 | iteration 265600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.623968E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 54.55 | batch generator: 0.77 samples/sec: 6.586 | iteration 265700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623334E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.52 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.86 samples/sec: 6.595 | iteration 265800/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.638253E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 samples/sec: 6.588 | iteration 265900/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.625927E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1805.57 | backward-backward: 1805.55 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.72 samples/sec: 6.596 | iteration 266000/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.632286E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.76 
----------------------------------------------------------------------------------------------------------- validation results at iteration 266000 | lm_loss value: 2.643552E+00 | lm_loss_ppl value: 1.406306E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 266100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.618720E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.32 | backward-backward: 1805.30 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.90 samples/sec: 6.593 | iteration 266200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633432E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.20 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.594 | iteration 266300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.643376E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.80 samples/sec: 6.588 | iteration 266400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.634568E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.46 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.78 samples/sec: 6.595 | iteration 266500/ 
320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635289E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77 samples/sec: 6.587 | iteration 266600/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621262E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.32 | backward: 1805.69 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.78 samples/sec: 6.595 | iteration 266700/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625753E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 54.68 | batch generator: 0.78 samples/sec: 6.590 | iteration 266800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.634841E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 56.28 | batch generator: 0.80 samples/sec: 6.590 | iteration 266900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.622892E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.28 | backward: 1804.58 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 55.50 | batch 
generator: 0.78 samples/sec: 6.596 | iteration 267000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624777E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 267000 | lm_loss value: 2.605260E+00 | lm_loss_ppl value: 1.353475E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 267100/ 320000 | elapsed time per iteration (ms): 2485.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.636609E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.32 | backward: 1805.67 | backward-backward: 1805.65 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.83 samples/sec: 6.600 | iteration 267200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631650E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1802.41 | backward-backward: 1802.38 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.77 samples/sec: 6.593 | iteration 267300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625676E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.10 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.596 | 
iteration 267400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.75 samples/sec: 6.594 | iteration 267500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622932E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.597 | iteration 267600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.646952E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.79 samples/sec: 6.597 | iteration 267700/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625236E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 samples/sec: 6.589 | iteration 267800/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.629580E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.50 | backward-backward: 1805.48 | backward-allreduce: 0.00 | optimizer: 55.50 
| batch generator: 0.85 samples/sec: 6.598 | iteration 267900/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.639323E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.75 samples/sec: 6.592 | iteration 268000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612642E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.74 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 268000 | lm_loss value: 2.633399E+00 | lm_loss_ppl value: 1.392100E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 268100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.627336E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.84 samples/sec: 6.593 | iteration 268200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.637863E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.80 samples/sec: 6.591 
| iteration 268300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614168E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.83 samples/sec: 6.596 | iteration 268400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618159E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.75 | backward-backward: 1803.73 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.75 samples/sec: 6.590 | iteration 268500/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.623811E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 samples/sec: 6.596 | iteration 268600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625735E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.50 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 samples/sec: 6.590 | iteration 268700/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.634304E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.98 | backward-backward: 1804.95 | backward-allreduce: 0.00 | optimizer: 
55.83 | batch generator: 0.82 samples/sec: 6.593 | iteration 268800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622945E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 1.03 samples/sec: 6.593 | iteration 268900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611967E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.11 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.75 samples/sec: 6.584 | iteration 269000/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.633491E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1806.02 | backward-backward: 1806.00 | backward-allreduce: 0.00 | optimizer: 56.52 | batch generator: 0.93 ----------------------------------------------------------------------------------------------------------- validation results at iteration 269000 | lm_loss value: 2.593280E+00 | lm_loss_ppl value: 1.337357E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 269100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.625581E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.21 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.81 samples/sec: 
6.592 | iteration 269200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624622E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 54.66 | batch generator: 0.85 samples/sec: 6.594 | iteration 269300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.644542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77 samples/sec: 6.596 | iteration 269400/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612613E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 samples/sec: 6.589 | iteration 269500/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.636003E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1805.07 | backward-backward: 1805.05 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 samples/sec: 6.597 | iteration 269600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631122E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1803.15 | backward-backward: 1803.12 | backward-allreduce: 0.00 | 
optimizer: 55.39 | batch generator: 0.76 samples/sec: 6.588 | iteration 269700/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.631674E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 samples/sec: 6.592 | iteration 269800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.632061E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.78 samples/sec: 6.596 | iteration 269900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623291E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.75 samples/sec: 6.587 | iteration 270000/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626239E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.86 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step170000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 270000 | lm_loss value: 2.594187E+00 | lm_loss_ppl value: 1.338569E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.234 | iteration 270100/ 320000 | elapsed time per iteration (ms): 2566.5 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.623166E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1802.21 | backward-backward: 1802.19 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.85 samples/sec: 6.590 | iteration 270200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.645308E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.14 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.77 samples/sec: 6.595 | iteration 270300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.627852E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 54.45 | batch generator: 0.83 samples/sec: 6.599 | iteration 270400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622331E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.20 | backward-backward: 1803.18 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.74 samples/sec: 6.589 | iteration 270500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.626585E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.597 | iteration 270600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628419E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1803.19 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 samples/sec: 6.592 | iteration 270700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.582142E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.81 samples/sec: 6.593 | iteration 270800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622310E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.27 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.75 samples/sec: 6.595 | iteration 270900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605578E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 samples/sec: 6.587 | iteration 271000/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613550E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1806.12 | backward-backward: 1806.09 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 271000 | lm_loss value: 2.604464E+00 | lm_loss_ppl value: 1.352397E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.445 | iteration 271100/ 320000 | elapsed time per iteration (ms): 2482.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.626990E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.84 samples/sec: 6.590 | iteration 271200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.638429E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.00 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 samples/sec: 6.597 | iteration 271300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624365E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.36 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.04 | batch generator: 0.78 samples/sec: 6.598 | iteration 271400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.630674E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.05 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.76 samples/sec: 6.591 | iteration 271500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.619414E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77 samples/sec: 6.598 | iteration 271600/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636181E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1802.68 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77 samples/sec: 6.590 | iteration 271700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.617231E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.44 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 56.15 | batch generator: 0.78 samples/sec: 6.596 | iteration 271800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628433E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.22 | backward-backward: 1803.19 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.79 samples/sec: 6.596 | iteration 271900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610026E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.25 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.79 samples/sec: 6.590 | iteration 272000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.617370E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.03 | backward-backward: 1805.01 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 272000 | lm_loss value: 2.592838E+00 | lm_loss_ppl value: 1.336765E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 272100/ 320000 | elapsed time per iteration (ms): 2482.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.598911E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.92 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.84 samples/sec: 6.588 | iteration 272200/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612413E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.79 samples/sec: 6.595 | iteration 272300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636093E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.82 samples/sec: 6.596 | iteration 272400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.631741E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 samples/sec: 6.587 | iteration 272500/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.634570E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 56.04 | batch generator: 0.79 samples/sec: 6.599 | iteration 272600/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622637E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 54.85 | batch generator: 0.78 samples/sec: 6.591 | iteration 272700/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.635111E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.592 | iteration 272800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.632802E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.80 samples/sec: 6.596 | iteration 272900/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626512E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.59 | backward-backward: 1803.57 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.588 | iteration 273000/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.641829E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 273000 | lm_loss value: 2.607935E+00 | lm_loss_ppl value: 1.357099E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 273100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.623015E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1803.35 | backward-backward: 1803.32 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.83 samples/sec: 6.591 | iteration 273200/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.639727E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.76 samples/sec: 6.589 | iteration 273300/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.629545E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 56.47 | batch generator: 0.86 samples/sec: 6.596 | iteration 273400/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635016E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.41 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.76 samples/sec: 6.593 | iteration 273500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619366E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1804.81 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 54.91 | batch generator: 0.78 samples/sec: 6.597 | iteration 273600/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.634764E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1802.60 | backward-backward: 1802.58 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.78 samples/sec: 6.595 | iteration 273700/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.615967E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79 samples/sec: 6.593 | iteration 273800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620994E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.14 | backward-backward: 1804.12 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.78 samples/sec: 6.599 | iteration 273900/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610711E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1802.68 | backward-backward: 1802.66 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.76 samples/sec: 6.591 | iteration 274000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.619950E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 274000 | lm_loss value: 2.677615E+00 | lm_loss_ppl value: 1.455035E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 274100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.616389E+00 | loss scale: 32768.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1803.01 | backward-backward: 1802.99 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.86 samples/sec: 6.597 | iteration 274200/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625290E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.80 samples/sec: 6.587 | iteration 274300/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.615401E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1806.08 | backward-backward: 1806.05 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80 samples/sec: 6.596 | iteration 274400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.629214E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1802.70 | backward-backward: 1802.67 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.80 samples/sec: 6.591 | iteration 274500/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.630982E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.85 samples/sec: 6.592 | iteration 274600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.622842E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.57 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.599 | iteration 274700/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601285E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1802.87 | backward-backward: 1802.85 | backward-allreduce: 0.00 | optimizer: 55.18 | batch generator: 0.76 samples/sec: 6.589 | iteration 274800/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.617141E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.21 | backward-backward: 1805.19 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.79 samples/sec: 6.594 | iteration 274900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626546E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.596 | iteration 275000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636101E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.79 
-----------------------------------------------------------------------------------------------------------
validation results at iteration 275000 | lm_loss value: 2.591216E+00 | lm_loss_ppl value: 1.334599E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.438 | iteration 275100/ 320000 | elapsed time per iteration (ms): 2485.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.600880E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.16 | backward-backward: 1805.14 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.83
samples/sec: 6.595 | iteration 275200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617896E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.51 | backward-backward: 1803.48 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79
samples/sec: 6.595 | iteration 275300/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.76
samples/sec: 6.588 | iteration 275400/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604871E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1805.97 | backward-backward: 1805.95 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78
samples/sec: 6.595 | iteration 275500/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.627907E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.99 | backward-backward: 1803.97 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.74
samples/sec: 6.595 | iteration 275600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624977E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 54.97 | batch generator: 0.78
samples/sec: 6.589 | iteration 275700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616866E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.52 | backward-backward: 1805.49 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81
samples/sec: 6.594 | iteration 275800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618209E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.79
samples/sec: 6.596 | iteration 275900/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612403E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1803.46 | backward-backward: 1803.44 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.79
samples/sec: 6.589 | iteration 276000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618902E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1805.44 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78
-----------------------------------------------------------------------------------------------------------
validation results at iteration 276000 | lm_loss value: 2.657088E+00 | lm_loss_ppl value: 1.425471E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.445 | iteration 276100/ 320000 | elapsed time per iteration (ms): 2482.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.613836E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.42 | backward-backward: 1803.39 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.84
samples/sec: 6.592 | iteration 276200/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615405E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.81
samples/sec: 6.590 | iteration 276300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.628997E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.19 | backward-backward: 1805.17 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77
samples/sec: 6.598 | iteration 276400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625348E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1803.14 | backward-backward: 1803.12 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.75
samples/sec: 6.588 | iteration 276500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618023E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.45 | backward-backward: 1805.42 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.80
samples/sec: 6.587 | iteration 276600/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.620975E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1805.36 | backward-backward: 1805.34 | backward-allreduce: 0.00 | optimizer: 56.34 | batch generator: 0.79
samples/sec: 6.601 | iteration 276700/ 320000 | elapsed time per iteration (ms): 2423.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628365E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.26 | backward-backward: 1803.24 | backward-allreduce: 0.00 | optimizer: 54.24 | batch generator: 0.77
samples/sec: 6.590 | iteration 276800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.625536E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77
samples/sec: 6.592 | iteration 276900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607866E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.592 | iteration 277000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611594E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78
-----------------------------------------------------------------------------------------------------------
validation results at iteration 277000 | lm_loss value: 2.603708E+00 | lm_loss_ppl value: 1.351375E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.445 | iteration 277100/ 320000 | elapsed time per iteration (ms): 2482.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.623868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.84
samples/sec: 6.595 | iteration 277200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628037E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.77
samples/sec: 6.592 | iteration 277300/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619352E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80
samples/sec: 6.592 | iteration 277400/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633431E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1804.64 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77
samples/sec: 6.594 | iteration 277500/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.79 | backward-backward: 1803.77 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.80
samples/sec: 6.597 | iteration 277600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623665E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.62 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.75
samples/sec: 6.588 | iteration 277700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.632357E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.76
samples/sec: 6.596 | iteration 277800/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619451E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1804.26 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 54.47 | batch generator: 0.79
samples/sec: 6.593 | iteration 277900/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617098E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.82
samples/sec: 6.596 | iteration 278000/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.55 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.74
-----------------------------------------------------------------------------------------------------------
validation results at iteration 278000 | lm_loss value: 2.586487E+00 | lm_loss_ppl value: 1.328302E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.447 | iteration 278100/ 320000 | elapsed time per iteration (ms): 2481.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.624359E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.21 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.85
samples/sec: 6.593 | iteration 278200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635259E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
samples/sec: 6.594 | iteration 278300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612983E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.78
samples/sec: 6.595 | iteration 278400/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596241E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.90 | backward-backward: 1803.88 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77
samples/sec: 6.599 | iteration 278500/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626043E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1802.90 | backward-backward: 1802.88 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.78
samples/sec: 6.598 | iteration 278600/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612795E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1802.71 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78
samples/sec: 6.591 | iteration 278700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.630538E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.76
samples/sec: 6.593 | iteration 278800/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609627E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.60 | backward-backward: 1804.57 | backward-allreduce: 0.00 | optimizer: 55.23 | batch generator: 0.88
samples/sec: 6.592 | iteration 278900/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625852E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1804.16 | backward-backward: 1804.14 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.86
samples/sec: 6.594 | iteration 279000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623457E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.81
-----------------------------------------------------------------------------------------------------------
validation results at iteration 279000 | lm_loss value: 2.629321E+00 | lm_loss_ppl value: 1.386435E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.448 | iteration 279100/ 320000 | elapsed time per iteration (ms): 2481.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.624380E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1802.95 | backward-backward: 1802.93 | backward-allreduce: 0.00 | optimizer: 55.33 | batch generator: 0.83
samples/sec: 6.592 | iteration 279200/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604749E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.80
samples/sec: 6.593 | iteration 279300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620376E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.79
samples/sec: 6.593 | iteration 279400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621475E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.44 | backward-backward: 1804.42 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.75
samples/sec: 6.593 | iteration 279500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614299E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.79
samples/sec: 6.600 | iteration 279600/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609743E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1802.22 | backward-backward: 1802.20 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78
samples/sec: 6.592 | iteration 279700/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.636393E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.77 | backward-backward: 1804.74 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.80
samples/sec: 6.595 | iteration 279800/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620989E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 54.63 | batch generator: 0.85
samples/sec: 6.589 | iteration 279900/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614728E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.69 | backward-backward: 1804.67 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.78
samples/sec: 6.595 | iteration 280000/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.633015E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.57 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.84
WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step180000
-----------------------------------------------------------------------------------------------------------
validation results at iteration 280000 | lm_loss value: 2.591592E+00 | lm_loss_ppl value: 1.335100E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.229 | iteration 280100/ 320000 | elapsed time per iteration (ms): 2568.7 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.639858E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.19 | backward: 1803.58 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.83
samples/sec: 6.591 | iteration 280200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604257E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.57 | backward-backward: 1804.55 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.78
samples/sec: 6.593 | iteration 280300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619667E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.83
samples/sec: 6.594 | iteration 280400/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620037E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.30 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.78
samples/sec: 6.599 | iteration 280500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607970E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1802.47 | backward-backward: 1802.45 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78
samples/sec: 6.592 | iteration 280600/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614759E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.78
samples/sec: 6.595 | iteration 280700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611769E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.82 | backward-backward: 1803.80 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.77
samples/sec: 6.592 | iteration 280800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617734E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.91
samples/sec: 6.597 | iteration 280900/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611873E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1802.74 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.79
samples/sec: 6.594 | iteration 281000/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615429E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
-----------------------------------------------------------------------------------------------------------
validation results at iteration 281000 | lm_loss value: 2.571984E+00 | lm_loss_ppl value: 1.309177E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.442 | iteration 281100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.619509E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.74 | batch generator: 0.86
samples/sec: 6.596 | iteration 281200/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613125E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1803.28 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77
samples/sec: 6.600 | iteration 281300/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611342E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1802.61 | backward-backward: 1802.58 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.75
samples/sec: 6.594 | iteration 281400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.624325E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.72
samples/sec: 6.591 | iteration 281500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616415E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.86 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.75
samples/sec: 6.594 | iteration 281600/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.628597E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 54.98 | batch generator: 0.78
samples/sec: 6.595 | iteration 281700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625002E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.80
samples/sec: 6.597 | iteration 281800/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616302E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1803.18 | backward-backward: 1803.15 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
samples/sec: 6.589 | iteration 281900/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612314E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.24 | backward-backward: 1805.22 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.78
samples/sec: 6.592 | iteration 282000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615814E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.98 | batch generator: 0.76
-----------------------------------------------------------------------------------------------------------
validation results at iteration 282000 | lm_loss value: 2.639042E+00 | lm_loss_ppl value: 1.399979E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.444 | iteration 282100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.620384E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81
samples/sec: 6.597 | iteration 282200/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589808E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1802.85 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78
samples/sec: 6.590 | iteration 282300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621821E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.75
samples/sec: 6.594 | iteration 282400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604361E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77
samples/sec: 6.593 | iteration 282500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603216E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.75
samples/sec: 6.600 | iteration 282600/ 320000 | elapsed time per iteration (ms): 2424.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604660E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1802.92 | backward-backward: 1802.89 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.77
samples/sec: 6.598 | iteration 282700/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616640E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 54.90 | batch generator: 0.76
samples/sec: 6.594 | iteration 282800/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.629823E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.99 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80
samples/sec: 6.595 | iteration 282900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612117E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76
samples/sec: 6.592 | iteration 283000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592638E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.82
-----------------------------------------------------------------------------------------------------------
validation results at iteration 283000 | lm_loss value: 2.576276E+00 | lm_loss_ppl value: 1.314808E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.446 | iteration 283100/ 320000 | elapsed time per iteration (ms): 2482.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.606422E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.27 | backward: 1802.26 | backward-backward: 1802.24 | backward-allreduce: 0.00 | optimizer: 56.61 | batch generator: 0.88
samples/sec: 6.595 | iteration 283200/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.615243E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78
samples/sec: 6.595 | iteration 283300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604448E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1803.81 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.595 | iteration 283400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609538E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.80
samples/sec: 6.594 | iteration 283500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618770E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79
samples/sec: 6.601 | iteration 283600/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613715E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.22 | backward: 1801.85 | backward-backward: 1801.83 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.79
samples/sec: 6.594 | iteration 283700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605435E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.26 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.76
samples/sec: 6.592 | iteration 283800/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621542E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1804.72 | backward-backward: 1804.70 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.591 | iteration 283900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.624734E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.31 | backward-backward: 1804.28 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80
samples/sec: 6.595 | iteration 284000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623863E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.61 | backward-backward: 1803.59 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79
----------------------------------------------------------------------------------------------------------- validation results at iteration 284000 | lm_loss value: 2.595888E+00 | lm_loss_ppl value: 1.340848E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 284100/ 320000 | elapsed time per iteration (ms): 2482.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.603976E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 55.29 | batch generator: 0.85 samples/sec: 6.587 | iteration 284200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597646E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 56.46 | batch generator: 0.78 samples/sec: 6.597 | iteration 284300/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625438E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1802.98 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.595 | iteration 284400/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601157E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.10 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.75 samples/sec: 6.590 | iteration 284500/ 
320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614456E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.03 | backward: 1804.68 | backward-backward: 1804.65 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 samples/sec: 6.598 | iteration 284600/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620919E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1802.85 | backward-backward: 1802.83 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.598 | iteration 284700/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620927E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.83 | backward-backward: 1803.81 | backward-allreduce: 0.00 | optimizer: 54.50 | batch generator: 0.78 samples/sec: 6.590 | iteration 284800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606726E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.80 samples/sec: 6.599 | iteration 284900/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613496E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1802.74 | backward-backward: 1802.72 | backward-allreduce: 0.00 | optimizer: 55.54 | batch 
generator: 0.74 samples/sec: 6.590 | iteration 285000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611547E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.01 | backward: 1804.79 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.82 ----------------------------------------------------------------------------------------------------------- validation results at iteration 285000 | lm_loss value: 2.632782E+00 | lm_loss_ppl value: 1.391242E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 285100/ 320000 | elapsed time per iteration (ms): 2481.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.625723E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1802.94 | backward-backward: 1802.92 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.80 samples/sec: 6.594 | iteration 285200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599673E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.09 | backward-backward: 1804.06 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 samples/sec: 6.591 | iteration 285300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.617321E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.99 | batch generator: 0.75 samples/sec: 6.599 | 
iteration 285400/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.621773E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1802.86 | backward-backward: 1802.84 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.79 samples/sec: 6.592 | iteration 285500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.634519E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.599 | iteration 285600/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603129E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1802.64 | backward-backward: 1802.62 | backward-allreduce: 0.00 | optimizer: 55.31 | batch generator: 0.78 samples/sec: 6.595 | iteration 285700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617954E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1803.54 | backward-backward: 1803.52 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.77 samples/sec: 6.593 | iteration 285800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614853E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.48 
| batch generator: 0.77 samples/sec: 6.599 | iteration 285900/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.619084E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1802.37 | backward-backward: 1802.35 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.77 samples/sec: 6.591 | iteration 286000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.615359E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.63 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 286000 | lm_loss value: 2.602390E+00 | lm_loss_ppl value: 1.349596E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.447 | iteration 286100/ 320000 | elapsed time per iteration (ms): 2481.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.591654E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.35 | backward: 1802.31 | backward-backward: 1802.29 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.85 samples/sec: 6.594 | iteration 286200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614003E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.23 | backward-backward: 1804.21 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.75 samples/sec: 6.590 
| iteration 286300/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614412E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.83 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.80 samples/sec: 6.593 | iteration 286400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607440E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.78 samples/sec: 6.591 | iteration 286500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597561E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.06 | backward: 1804.61 | backward-backward: 1804.59 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.77 samples/sec: 6.597 | iteration 286600/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613889E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.06 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.73 samples/sec: 6.586 | iteration 286700/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612900E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.23 | backward: 1805.66 | backward-backward: 1805.64 | backward-allreduce: 0.00 | optimizer: 
55.99 | batch generator: 0.77 samples/sec: 6.596 | iteration 286800/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612937E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.78 samples/sec: 6.588 | iteration 286900/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621200E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.598 | iteration 287000/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622266E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.53 | backward-backward: 1803.51 | backward-allreduce: 0.00 | optimizer: 54.75 | batch generator: 0.74 ----------------------------------------------------------------------------------------------------------- validation results at iteration 287000 | lm_loss value: 2.578154E+00 | lm_loss_ppl value: 1.317280E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 287100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.607744E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.90 samples/sec: 
6.592 | iteration 287200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626852E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 samples/sec: 6.595 | iteration 287300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616668E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.78 samples/sec: 6.584 | iteration 287400/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592692E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.51 | backward: 1806.50 | backward-backward: 1806.48 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.76 samples/sec: 6.595 | iteration 287500/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587550E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.76 samples/sec: 6.587 | iteration 287600/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604056E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1806.01 | backward-backward: 1805.99 | backward-allreduce: 0.00 | 
optimizer: 55.64 | batch generator: 0.76 samples/sec: 6.596 | iteration 287700/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626589E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.57 | backward-backward: 1803.55 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.78 samples/sec: 6.593 | iteration 287800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591304E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.75 samples/sec: 6.590 | iteration 287900/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618286E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79 samples/sec: 6.594 | iteration 288000/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591945E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 288000 | lm_loss value: 2.541649E+00 | lm_loss_ppl value: 1.270060E+01 | ----------------------------------------------------------------------------------------------------------- 
samples/sec: 6.437 | iteration 288100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.594491E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.86 | backward: 1804.71 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.87 samples/sec: 6.595 | iteration 288200/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614877E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.55 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 samples/sec: 6.588 | iteration 288300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611864E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.65 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.596 | iteration 288400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613517E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1803.42 | backward-backward: 1803.40 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.76 samples/sec: 6.591 | iteration 288500/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618828E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1805.27 | backward-backward: 1805.25 | backward-allreduce: 
0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.592 | iteration 288600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600819E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1803.63 | backward-backward: 1803.60 | backward-allreduce: 0.00 | optimizer: 56.27 | batch generator: 0.79 samples/sec: 6.593 | iteration 288700/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617385E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.77 samples/sec: 6.587 | iteration 288800/ 320000 | elapsed time per iteration (ms): 2428.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609808E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.20 | backward: 1805.38 | backward-backward: 1805.36 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.77 samples/sec: 6.595 | iteration 288900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609518E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.585 | iteration 289000/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609041E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.48 | backward: 1805.74 | 
backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.86 ----------------------------------------------------------------------------------------------------------- validation results at iteration 289000 | lm_loss value: 2.657918E+00 | lm_loss_ppl value: 1.426656E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 289100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.603954E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1803.75 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.84 | batch generator: 0.87 samples/sec: 6.595 | iteration 289200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617504E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.06 | batch generator: 0.75 samples/sec: 6.594 | iteration 289300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607827E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.76 samples/sec: 6.594 | iteration 289400/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610159E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.14 | backward-backward: 1804.11 | 
backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.82 samples/sec: 6.594 | iteration 289500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607700E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1803.23 | backward-backward: 1803.21 | backward-allreduce: 0.00 | optimizer: 55.90 | batch generator: 0.97 samples/sec: 6.591 | iteration 289600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612709E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77 samples/sec: 6.587 | iteration 289700/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.610158E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.49 | backward: 1804.86 | backward-backward: 1804.84 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.78 samples/sec: 6.598 | iteration 289800/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606183E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1802.59 | backward-backward: 1802.57 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.589 | iteration 289900/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598514E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | 
backward: 1804.94 | backward-backward: 1804.92 | backward-allreduce: 0.00 | optimizer: 56.18 | batch generator: 0.77 samples/sec: 6.594 | iteration 290000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617593E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 55.41 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step190000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 290000 | lm_loss value: 2.589099E+00 | lm_loss_ppl value: 1.331776E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.224 | iteration 290100/ 320000 | elapsed time per iteration (ms): 2570.5 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.609029E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.28 | backward-backward: 1805.26 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.86 samples/sec: 6.593 | iteration 290200/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588088E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1803.78 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.595 | iteration 290300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.635673E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
time (ms) | forward: 567.02 | backward: 1803.31 | backward-backward: 1803.28 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.591 | iteration 290400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593863E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81 samples/sec: 6.591 | iteration 290500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.607203E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.73 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.80 samples/sec: 6.584 | iteration 290600/ 320000 | elapsed time per iteration (ms): 2430.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613809E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.87 | backward: 1806.40 | backward-backward: 1806.38 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.76 samples/sec: 6.593 | iteration 290700/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610975E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1803.65 | backward-backward: 1803.63 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.77 samples/sec: 6.592 | iteration 290800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600819E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.31 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.596 | iteration 290900/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609332E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.82 | backward-backward: 1803.79 | backward-allreduce: 0.00 | optimizer: 55.00 | batch generator: 0.78 samples/sec: 6.593 | iteration 291000/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605110E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.77 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.74 ----------------------------------------------------------------------------------------------------------- validation results at iteration 291000 | lm_loss value: 2.573569E+00 | lm_loss_ppl value: 1.311254E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 291100/ 320000 | elapsed time per iteration (ms): 2483.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.610536E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.86 samples/sec: 6.587 | iteration 291200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.608887E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | time (ms) | forward: 567.53 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 samples/sec: 6.596 | iteration 291300/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609445E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1803.06 | backward-backward: 1803.04 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.82 samples/sec: 6.586 | iteration 291400/ 320000 | elapsed time per iteration (ms): 2429.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611846E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1806.11 | backward-backward: 1806.08 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.79 samples/sec: 6.592 | iteration 291500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608083E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.16 | backward: 1803.95 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.79 samples/sec: 6.591 | iteration 291600/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616662E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.10 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.78 samples/sec: 6.588 | iteration 291700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.610384E+00 | loss scale: 65536.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1805.26 | backward-backward: 1805.24 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 samples/sec: 6.593 | iteration 291800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605952E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.56 | backward-backward: 1803.54 | backward-allreduce: 0.00 | optimizer: 56.19 | batch generator: 0.79 samples/sec: 6.594 | iteration 291900/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607183E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.50 | backward-backward: 1804.48 | backward-allreduce: 0.00 | optimizer: 55.10 | batch generator: 0.77 samples/sec: 6.593 | iteration 292000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613098E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 292000 | lm_loss value: 2.592676E+00 | lm_loss_ppl value: 1.336549E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 292100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.597817E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan 
iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 54.79 | batch generator: 0.84 samples/sec: 6.594 | iteration 292200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606750E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.593 | iteration 292300/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.623711E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.57 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.76 samples/sec: 6.592 | iteration 292400/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593409E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.90 samples/sec: 6.589 | iteration 292500/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588794E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.42 | backward: 1804.70 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.74 samples/sec: 6.593 | iteration 292600/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617947E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.586 | iteration 292700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.622742E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 568.20 | backward: 1805.69 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.21 | batch generator: 0.79 samples/sec: 6.597 | iteration 292800/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608244E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1802.75 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.77 samples/sec: 6.591 | iteration 292900/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602787E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.54 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.78 samples/sec: 6.592 | iteration 293000/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613234E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1804.31 | backward-backward: 1804.29 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 293000 | 
lm_loss value: 2.564501E+00 | lm_loss_ppl value: 1.299417E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 293100/ 320000 | elapsed time per iteration (ms): 2482.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.618989E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.60 | backward-backward: 1803.58 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.84 samples/sec: 6.595 | iteration 293200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.627802E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1803.70 | backward-backward: 1803.68 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.75 samples/sec: 6.594 | iteration 293300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609097E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.78 samples/sec: 6.591 | iteration 293400/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606176E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.12 | backward: 1804.49 | backward-backward: 1804.47 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.77 samples/sec: 6.592 | iteration 293500/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599037E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.56 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.78 samples/sec: 6.597 | iteration 293600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.584431E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1803.32 | backward-backward: 1803.30 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.585 | iteration 293700/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601882E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.21 | backward: 1806.35 | backward-backward: 1806.33 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.85 samples/sec: 6.591 | iteration 293800/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599518E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.592 | iteration 293900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589050E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.79 samples/sec: 6.591 | iteration 294000/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 
40.9TFLOPS | lm_loss: 2.589406E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.45 | backward-backward: 1804.43 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 294000 | lm_loss value: 2.599610E+00 | lm_loss_ppl value: 1.345849E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 294100/ 320000 | elapsed time per iteration (ms): 2484.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.606353E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1805.11 | backward-backward: 1805.08 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.82 samples/sec: 6.586 | iteration 294200/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597341E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.19 | backward: 1804.75 | backward-backward: 1804.73 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.81 samples/sec: 6.594 | iteration 294300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603325E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.79 | backward-backward: 1803.76 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.74 samples/sec: 6.593 | iteration 294400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613786E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.81 samples/sec: 6.593 | iteration 294500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608609E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.77 samples/sec: 6.589 | iteration 294600/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612556E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.76 | backward: 1804.64 | backward-backward: 1804.62 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.594 | iteration 294700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607259E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.31 | backward: 1803.03 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.588 | iteration 294800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.615799E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.14 | backward: 1806.29 | backward-backward: 1806.27 | backward-allreduce: 0.00 | optimizer: 54.71 | batch generator: 0.79 samples/sec: 6.593 | iteration 294900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per 
GPU: 41.0TFLOPS | lm_loss: 2.609955E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1803.87 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.78 samples/sec: 6.587 | iteration 295000/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.617341E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.89 | backward: 1806.21 | backward-backward: 1806.18 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 295000 | lm_loss value: 2.527471E+00 | lm_loss_ppl value: 1.252179E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.436 | iteration 295100/ 320000 | elapsed time per iteration (ms): 2486.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.601844E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.86 | backward: 1804.57 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.86 samples/sec: 6.592 | iteration 295200/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.622268E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.82 samples/sec: 6.588 | iteration 295300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598959E+00 | 
loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.66 | backward: 1805.12 | backward-backward: 1805.10 | backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.79 samples/sec: 6.593 | iteration 295400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603222E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.91 | backward: 1803.94 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.78 samples/sec: 6.593 | iteration 295500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602046E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.43 | backward-backward: 1804.41 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.81 samples/sec: 6.596 | iteration 295600/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616945E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1803.20 | backward-backward: 1803.17 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.80 samples/sec: 6.584 | iteration 295700/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606451E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.87 | backward: 1806.36 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.78 samples/sec: 6.593 | iteration 295800/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops 
per GPU: 41.0TFLOPS | lm_loss: 2.621702E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77 samples/sec: 6.594 | iteration 295900/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608846E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1804.16 | backward-backward: 1804.13 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75 samples/sec: 6.594 | iteration 296000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594517E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.94 | backward-backward: 1803.91 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.75 ----------------------------------------------------------------------------------------------------------- validation results at iteration 296000 | lm_loss value: 2.580728E+00 | lm_loss_ppl value: 1.320675E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.438 | iteration 296100/ 320000 | elapsed time per iteration (ms): 2485.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.620393E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.85 | backward: 1805.44 | backward-backward: 1805.41 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.89 samples/sec: 6.589 | iteration 296200/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 
2.607383E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1804.54 | backward-backward: 1804.51 | backward-allreduce: 0.00 | optimizer: 56.29 | batch generator: 0.77 samples/sec: 6.594 | iteration 296300/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589888E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.12 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.72 samples/sec: 6.586 | iteration 296400/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.616548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.17 | backward: 1805.88 | backward-backward: 1805.86 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.79 samples/sec: 6.595 | iteration 296500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594659E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1803.81 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78 samples/sec: 6.584 | iteration 296600/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591261E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1807.05 | backward-backward: 1807.03 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.74 samples/sec: 6.594 | iteration 296700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 
| approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608925E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.56 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 56.08 | batch generator: 0.80 samples/sec: 6.588 | iteration 296800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603795E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 samples/sec: 6.594 | iteration 296900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612071E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.77 samples/sec: 6.590 | iteration 297000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.594035E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1805.57 | backward-backward: 1805.54 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 297000 | lm_loss value: 2.583128E+00 | lm_loss_ppl value: 1.323848E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.437 | iteration 297100/ 320000 | elapsed time per iteration (ms): 2485.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | 
lm_loss: 2.607562E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.49 | backward: 1804.94 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.85 samples/sec: 6.591 | iteration 297200/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.607997E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.40 | backward: 1805.18 | backward-backward: 1805.16 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.79 samples/sec: 6.585 | iteration 297300/ 320000 | elapsed time per iteration (ms): 2429.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.622734E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.11 | backward: 1806.11 | backward-backward: 1806.09 | backward-allreduce: 0.00 | optimizer: 56.35 | batch generator: 0.80 samples/sec: 6.593 | iteration 297400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612820E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 55.65 | batch generator: 0.81 samples/sec: 6.588 | iteration 297500/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592132E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1806.72 | backward-backward: 1806.70 | backward-allreduce: 0.00 | optimizer: 54.66 | batch generator: 0.74 samples/sec: 6.596 | iteration 297600/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 
3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.625898E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.52 | backward-backward: 1803.50 | backward-allreduce: 0.00 | optimizer: 55.34 | batch generator: 0.88
samples/sec: 6.584 | iteration 297700/ 320000 | elapsed time per iteration (ms): 2430.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606198E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1807.01 | backward-backward: 1806.98 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.76
samples/sec: 6.590 | iteration 297800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.29 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.86
samples/sec: 6.589 | iteration 297900/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602524E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.51 | backward: 1804.88 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.77
samples/sec: 6.592 | iteration 298000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596305E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.37 | backward-backward: 1804.35 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.82
-----------------------------------------------------------------------------------------------------------
validation results at iteration 298000 | lm_loss value: 2.594529E+00 | lm_loss_ppl value: 1.339028E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.440 | iteration 298100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.607775E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.71 | backward-backward: 1804.68 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.86
samples/sec: 6.588 | iteration 298200/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621907E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.00 | backward: 1805.45 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.85
samples/sec: 6.592 | iteration 298300/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587513E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.96 | batch generator: 0.78
samples/sec: 6.592 | iteration 298400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618744E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 56.41 | batch generator: 0.80
samples/sec: 6.591 | iteration 298500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614689E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.76
samples/sec: 6.589 | iteration 298600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613406E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.92 | backward-backward: 1805.90 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.80
samples/sec: 6.587 | iteration 298700/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593887E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 567.07 | backward: 1806.31 | backward-backward: 1806.29 | backward-allreduce: 0.00 | optimizer: 55.22 | batch generator: 0.75
samples/sec: 6.590 | iteration 298800/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603172E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.18 | backward: 1804.43 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.83
samples/sec: 6.591 | iteration 298900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600094E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.16 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.74
samples/sec: 6.592 | iteration 299000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599397E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.42 | backward-backward: 1804.40 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.82
-----------------------------------------------------------------------------------------------------------
validation results at iteration 299000 | lm_loss value: 2.563101E+00 | lm_loss_ppl value: 1.297599E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.441 | iteration 299100/ 320000 | elapsed time per iteration (ms): 2484.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.611730E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1804.63 | backward-backward: 1804.61 | backward-allreduce: 0.00 | optimizer: 55.56 | batch generator: 0.82
samples/sec: 6.593 | iteration 299200/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601367E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.96 | backward-backward: 1803.93 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78
samples/sec: 6.584 | iteration 299300/ 320000 | elapsed time per iteration (ms): 2430.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603514E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1807.11 | backward-backward: 1807.09 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.77
samples/sec: 6.591 | iteration 299400/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.590378E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.76
samples/sec: 6.591 | iteration 299500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591883E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.35 | backward-backward: 1804.33 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.76
samples/sec: 6.590 | iteration 299600/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599590E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.12 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78
samples/sec: 6.597 | iteration 299700/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608386E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 54.65 | batch generator: 0.77
samples/sec: 6.587 | iteration 299800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599617E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1806.05 | backward-backward: 1806.03 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79
samples/sec: 6.594 | iteration 299900/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596501E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.78
samples/sec: 6.589 | iteration 300000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.609462E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.64 | backward-backward: 1805.61 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.78
WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step200000
-----------------------------------------------------------------------------------------------------------
validation results at iteration 300000 | lm_loss value: 2.638751E+00 | lm_loss_ppl value: 1.399571E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.227 | iteration 300100/ 320000 | elapsed time per iteration (ms): 2569.3 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.604468E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.85
samples/sec: 6.591 | iteration 300200/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598382E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.77
samples/sec: 6.585 | iteration 300300/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618266E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 568.04 | backward: 1805.69 | backward-backward: 1805.66 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.77
samples/sec: 6.594 | iteration 300400/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606820E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77
samples/sec: 6.589 | iteration 300500/ 320000 | elapsed time per iteration (ms): 2428.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611683E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1805.42 | backward-backward: 1805.40 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.77
samples/sec: 6.590 | iteration 300600/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603422E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.48 | backward: 1804.02 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.79
samples/sec: 6.590 | iteration 300700/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600578E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.04 | backward: 1804.75 | backward-backward: 1804.72 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.79
samples/sec: 6.590 | iteration 300800/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.572556E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.84 | backward: 1804.55 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.12 | batch generator: 0.83
samples/sec: 6.594 | iteration 300900/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618747E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1803.80 | backward-backward: 1803.78 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.80
samples/sec: 6.592 | iteration 301000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590358E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.96 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.89 | batch generator: 0.96
-----------------------------------------------------------------------------------------------------------
validation results at iteration 301000 | lm_loss value: 2.619112E+00 | lm_loss_ppl value: 1.372354E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.444 | iteration 301100/ 320000 | elapsed time per iteration (ms): 2483.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.597401E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.63 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.86
samples/sec: 6.587 | iteration 301200/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.619598E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.40 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78
samples/sec: 6.593 | iteration 301300/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.613900E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.39 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.75
samples/sec: 6.588 | iteration 301400/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592202E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.87 | backward: 1805.77 | backward-backward: 1805.75 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77
samples/sec: 6.585 | iteration 301500/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592391E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.89 | backward: 1805.37 | backward-backward: 1805.35 | backward-allreduce: 0.00 | optimizer: 56.10 | batch generator: 0.78
samples/sec: 6.594 | iteration 301600/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603857E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1803.77 | backward-backward: 1803.75 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.78
samples/sec: 6.586 | iteration 301700/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606798E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.55 | backward: 1805.93 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.77
samples/sec: 6.592 | iteration 301800/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606375E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1804.05 | backward-backward: 1804.03 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.76
samples/sec: 6.588 | iteration 301900/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611265E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.35 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 54.81 | batch generator: 0.76
samples/sec: 6.595 | iteration 302000/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592595E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.65 | backward: 1803.29 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77
-----------------------------------------------------------------------------------------------------------
validation results at iteration 302000 | lm_loss value: 2.571509E+00 | lm_loss_ppl value: 1.308556E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.442 | iteration 302100/ 320000 | elapsed time per iteration (ms): 2483.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.593477E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1804.47 | backward-backward: 1804.45 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.85
samples/sec: 6.587 | iteration 302200/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588958E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.74 | backward: 1805.02 | backward-backward: 1805.00 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.81
samples/sec: 6.594 | iteration 302300/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600175E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.60 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78
samples/sec: 6.593 | iteration 302400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605359E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.28 | backward-backward: 1804.26 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.78
samples/sec: 6.594 | iteration 302500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617360E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1803.74 | backward-backward: 1803.72 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77
samples/sec: 6.589 | iteration 302600/ 320000 | elapsed time per iteration (ms): 2428.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606588E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.36 | backward-backward: 1805.33 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.82
samples/sec: 6.595 | iteration 302700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592628E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1802.84 | backward-backward: 1802.82 | backward-allreduce: 0.00 | optimizer: 56.31 | batch generator: 0.78
samples/sec: 6.592 | iteration 302800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618683E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.07 | backward-backward: 1805.04 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.74
samples/sec: 6.590 | iteration 302900/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600045E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.58 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.81
samples/sec: 6.592 | iteration 303000/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590580E+00 | loss scale: 131072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1805.05 | backward-backward: 1805.03 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.78
-----------------------------------------------------------------------------------------------------------
validation results at iteration 303000 | lm_loss value: 2.563759E+00 | lm_loss_ppl value: 1.298454E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.439 | iteration 303100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.586377E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.59 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 54.65 | batch generator: 0.83
samples/sec: 6.594 | iteration 303200/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612485E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1803.85 | backward-backward: 1803.83 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.85
samples/sec: 6.586 | iteration 303300/ 320000 | elapsed time per iteration (ms): 2429.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603385E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.95 | backward: 1805.40 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.78
samples/sec: 6.594 | iteration 303400/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.579690E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.73 | backward: 1803.72 | backward-backward: 1803.70 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.80
samples/sec: 6.591 | iteration 303500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600323E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.62 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.79
samples/sec: 6.585 | iteration 303600/ 320000 | elapsed time per iteration (ms): 2429.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612003E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.72 | backward: 1805.80 | backward-backward: 1805.77 | backward-allreduce: 0.00 | optimizer: 55.87 | batch generator: 0.81
samples/sec: 6.593 | iteration 303700/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589155E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.01 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.76
samples/sec: 6.587 | iteration 303800/ 320000 | elapsed time per iteration (ms): 2428.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.610490E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1805.84 | backward-backward: 1805.82 | backward-allreduce: 0.00 | optimizer: 56.24 | batch generator: 0.76
samples/sec: 6.591 | iteration 303900/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.596044E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.04 | backward-backward: 1805.02 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.76
samples/sec: 6.590 | iteration 304000/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.606059E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.69 | batch generator: 0.84
-----------------------------------------------------------------------------------------------------------
validation results at iteration 304000 | lm_loss value: 2.588999E+00 | lm_loss_ppl value: 1.331644E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.440 | iteration 304100/ 320000 | elapsed time per iteration (ms): 2484.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.611974E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.28 | backward: 1805.13 | backward-backward: 1805.11 | backward-allreduce: 0.00 | optimizer: 54.99 | batch generator: 0.94
samples/sec: 6.599 | iteration 304200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597064E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.09 | backward: 1802.79 | backward-backward: 1802.77 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.84
samples/sec: 6.586 | iteration 304300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.596660E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1805.99 | backward-backward: 1805.97 | backward-allreduce: 0.00 | optimizer: 56.05 | batch generator: 0.76
samples/sec: 6.592 | iteration 304400/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603194E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.81 | backward: 1804.29 | backward-backward: 1804.27 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77
samples/sec: 6.592 | iteration 304500/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611425E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.46 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.76
samples/sec: 6.585 | iteration 304600/ 320000 | elapsed time per iteration (ms): 2429.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600327E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.99 | backward: 1805.93 | backward-backward: 1805.91 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.78
samples/sec: 6.594 | iteration 304700/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599837E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.66 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.77
samples/sec: 6.586 | iteration 304800/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.599974E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.90 | backward: 1806.37 | backward-backward: 1806.34 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77
samples/sec: 6.592 | iteration 304900/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585710E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1803.50 | backward-backward: 1803.47 | backward-allreduce: 0.00 | optimizer: 56.59 | batch generator: 0.85
samples/sec: 6.590 | iteration 305000/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.608622E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1805.25 | backward-backward: 1805.23 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.76
-----------------------------------------------------------------------------------------------------------
validation results at iteration 305000 | lm_loss value: 2.574010E+00 | lm_loss_ppl value: 1.311833E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.439 | iteration 305100/ 320000 | elapsed time per iteration (ms): 2484.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.625141E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1805.40 | backward-backward: 1805.38 | backward-allreduce: 0.00 | optimizer: 55.02 | batch generator: 0.82
samples/sec: 6.597 | iteration 305200/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606796E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.14 | backward: 1803.34 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78
samples/sec: 6.588 | iteration 305300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.618584E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.79 | backward: 1805.87 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.80
samples/sec: 6.598 | iteration 305400/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604397E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1802.93 | backward-backward: 1802.90 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78
samples/sec: 6.592 | iteration 305500/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605221E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.78 | backward-backward: 1804.75 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.77
samples/sec: 6.592 | iteration 305600/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610457E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80
samples/sec: 6.594 | iteration 305700/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609554E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.33 | backward: 1804.11 | backward-backward: 1804.08 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.78
samples/sec: 6.588 | iteration 305800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611088E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.89 | backward-backward: 1805.87 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.77
samples/sec: 6.595 | iteration 305900/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585675E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.98 | backward: 1803.84 | backward-backward: 1803.82 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.79
samples/sec: 6.586 | iteration 306000/ 320000 | elapsed time per iteration (ms): 2429.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.608691E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.74 | backward-backward: 1805.72 | backward-allreduce: 0.00 | optimizer: 56.63 | batch generator: 0.75
-----------------------------------------------------------------------------------------------------------
validation results at iteration 306000 | lm_loss value: 2.671803E+00 | lm_loss_ppl value: 1.446603E+01 |
-----------------------------------------------------------------------------------------------------------
samples/sec: 6.443 | iteration 306100/ 320000 | elapsed time per iteration (ms): 2483.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.610115E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.49 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.89
samples/sec: 6.596 | iteration 306200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587989E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1804.18 | backward-backward: 1804.16 | backward-allreduce: 0.00 | optimizer: 55.11 | batch generator: 0.76
samples/sec: 6.589 | iteration 306300/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.621048E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.51 | backward: 1805.86 | backward-backward: 1805.84 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.74
samples/sec: 6.597 | iteration 306400/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.07 | backward: 1803.33 | backward-backward: 1803.31 | backward-allreduce: 0.00 | optimizer: 55.42 | batch generator: 0.76
samples/sec: 6.588 | iteration 306500/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.587088E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1805.83 | backward-backward: 1805.80 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.78
samples/sec: 6.591 | iteration 306600/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS |
lm_loss: 2.602861E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1804.82 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.86 | batch generator: 0.79 samples/sec: 6.591 | iteration 306700/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591800E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.80 | backward-backward: 1804.78 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.82 samples/sec: 6.590 | iteration 306800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612688E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.69 | backward: 1805.20 | backward-backward: 1805.18 | backward-allreduce: 0.00 | optimizer: 55.70 | batch generator: 0.76 samples/sec: 6.597 | iteration 306900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.608026E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.03 | backward: 1803.37 | backward-backward: 1803.34 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.78 samples/sec: 6.585 | iteration 307000/ 320000 | elapsed time per iteration (ms): 2429.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612965E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1806.46 | backward-backward: 1806.43 | backward-allreduce: 0.00 | optimizer: 56.01 | batch generator: 0.79 
----------------------------------------------------------------------------------------------------------- validation results at iteration 307000 | lm_loss value: 2.588826E+00 | lm_loss_ppl value: 1.331414E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 307100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.617179E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1803.55 | backward-backward: 1803.53 | backward-allreduce: 0.00 | optimizer: 56.17 | batch generator: 0.89 samples/sec: 6.595 | iteration 307200/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604513E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 54.89 | batch generator: 0.92 samples/sec: 6.587 | iteration 307300/ 320000 | elapsed time per iteration (ms): 2429.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597817E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1806.20 | backward-backward: 1806.17 | backward-allreduce: 0.00 | optimizer: 55.78 | batch generator: 0.81 samples/sec: 6.601 | iteration 307400/ 320000 | elapsed time per iteration (ms): 2423.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616958E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1802.50 | backward-backward: 1802.47 | backward-allreduce: 0.00 | optimizer: 54.87 | batch generator: 0.76 samples/sec: 6.593 | iteration 307500/ 
320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612487E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.77 samples/sec: 6.591 | iteration 307600/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.612401E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.83 | backward-backward: 1804.81 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.79 samples/sec: 6.599 | iteration 307700/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.583848E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.10 | backward: 1802.58 | backward-backward: 1802.56 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.593 | iteration 307800/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.618226E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.04 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.80 samples/sec: 6.591 | iteration 307900/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598102E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1804.74 | backward-backward: 1804.71 | backward-allreduce: 0.00 | optimizer: 55.65 | batch 
generator: 0.81 samples/sec: 6.599 | iteration 308000/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590889E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1802.82 | backward-backward: 1802.80 | backward-allreduce: 0.00 | optimizer: 55.45 | batch generator: 0.78 ----------------------------------------------------------------------------------------------------------- validation results at iteration 308000 | lm_loss value: 2.561873E+00 | lm_loss_ppl value: 1.296007E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 308100/ 320000 | elapsed time per iteration (ms): 2484.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.604801E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1805.18 | backward-backward: 1805.15 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.84 samples/sec: 6.590 | iteration 308200/ 320000 | elapsed time per iteration (ms): 2427.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601600E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.59 | backward-backward: 1804.56 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.80 samples/sec: 6.599 | iteration 308300/ 320000 | elapsed time per iteration (ms): 2424.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592459E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.16 | backward: 1802.61 | backward-backward: 1802.59 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.80 samples/sec: 6.589 | 
iteration 308400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.597729E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1805.41 | backward-backward: 1805.39 | backward-allreduce: 0.00 | optimizer: 55.97 | batch generator: 0.79 samples/sec: 6.590 | iteration 308500/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.588279E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.01 | backward-backward: 1804.98 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 samples/sec: 6.599 | iteration 308600/ 320000 | elapsed time per iteration (ms): 2424.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592878E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1802.99 | backward-backward: 1802.96 | backward-allreduce: 0.00 | optimizer: 55.32 | batch generator: 0.81 samples/sec: 6.587 | iteration 308700/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.594853E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1805.53 | backward-backward: 1805.50 | backward-allreduce: 0.00 | optimizer: 56.30 | batch generator: 0.79 samples/sec: 6.592 | iteration 308800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594934E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1804.24 | backward-backward: 1804.22 | backward-allreduce: 0.00 | optimizer: 55.71 
| batch generator: 0.78 samples/sec: 6.598 | iteration 308900/ 320000 | elapsed time per iteration (ms): 2425.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598338E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.22 | backward-backward: 1803.20 | backward-allreduce: 0.00 | optimizer: 55.38 | batch generator: 0.77 samples/sec: 6.588 | iteration 309000/ 320000 | elapsed time per iteration (ms): 2428.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593876E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.83 | backward: 1805.59 | backward-backward: 1805.57 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.81 ----------------------------------------------------------------------------------------------------------- validation results at iteration 309000 | lm_loss value: 2.581536E+00 | lm_loss_ppl value: 1.321743E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 309100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.601587E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.39 | backward-backward: 1804.37 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.85 samples/sec: 6.596 | iteration 309200/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609782E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.88 | backward: 1803.92 | backward-backward: 1803.90 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.77 samples/sec: 6.588 
| iteration 309300/ 320000 | elapsed time per iteration (ms): 2428.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.601483E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1805.55 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.77 samples/sec: 6.593 | iteration 309400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605293E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.78 | backward: 1804.95 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 54.82 | batch generator: 0.77 samples/sec: 6.597 | iteration 309500/ 320000 | elapsed time per iteration (ms): 2425.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.617630E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1803.49 | backward-backward: 1803.46 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.79 samples/sec: 6.592 | iteration 309600/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.611771E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1805.09 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.13 | batch generator: 0.75 samples/sec: 6.594 | iteration 309700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605004E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.17 | backward-backward: 1804.15 | backward-allreduce: 0.00 | optimizer: 
55.49 | batch generator: 0.77 samples/sec: 6.599 | iteration 309800/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607318E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.90 | backward: 1802.75 | backward-backward: 1802.73 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.76 samples/sec: 6.592 | iteration 309900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.616454E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1804.62 | backward-backward: 1804.60 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.77 samples/sec: 6.595 | iteration 310000/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.609328E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.66 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.54 | batch generator: 0.77 WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step210000 ----------------------------------------------------------------------------------------------------------- validation results at iteration 310000 | lm_loss value: 2.644738E+00 | lm_loss_ppl value: 1.407976E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.229 | iteration 310100/ 320000 | elapsed time per iteration (ms): 2568.6 | learning rate: 3.000E-05 | approx flops per GPU: 38.7TFLOPS | lm_loss: 2.602393E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.38 | backward: 1802.38 | backward-backward: 1802.36 | 
backward-allreduce: 0.00 | optimizer: 55.43 | batch generator: 0.83 samples/sec: 6.590 | iteration 310200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614357E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1805.33 | backward-backward: 1805.31 | backward-allreduce: 0.00 | optimizer: 55.49 | batch generator: 0.81 samples/sec: 6.594 | iteration 310300/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604395E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.80 samples/sec: 6.598 | iteration 310400/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.589481E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.01 | backward: 1802.57 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.93 | batch generator: 0.77 samples/sec: 6.590 | iteration 310500/ 320000 | elapsed time per iteration (ms): 2427.9 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.584879E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.84 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 samples/sec: 6.595 | iteration 310600/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598371E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.86 | 
backward: 1803.91 | backward-backward: 1803.89 | backward-allreduce: 0.00 | optimizer: 54.79 | batch generator: 0.76 samples/sec: 6.599 | iteration 310700/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597442E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.94 | backward: 1802.71 | backward-backward: 1802.69 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.75 samples/sec: 6.590 | iteration 310800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604438E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.88 | batch generator: 0.82 samples/sec: 6.595 | iteration 310900/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590535E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.64 | backward: 1803.68 | backward-backward: 1803.66 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.78 samples/sec: 6.597 | iteration 311000/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599418E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.91 | backward: 1803.15 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 311000 | lm_loss value: 2.610161E+00 | lm_loss_ppl value: 1.360124E+01 | 
----------------------------------------------------------------------------------------------------------- samples/sec: 6.439 | iteration 311100/ 320000 | elapsed time per iteration (ms): 2485.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.622416E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1805.31 | backward-backward: 1805.29 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.86 samples/sec: 6.596 | iteration 311200/ 320000 | elapsed time per iteration (ms): 2425.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614030E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.68 | backward: 1803.32 | backward-backward: 1803.29 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.595 | iteration 311300/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603634E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1804.22 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.587 | iteration 311400/ 320000 | elapsed time per iteration (ms): 2429.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.600247E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.94 | backward: 1805.70 | backward-backward: 1805.67 | backward-allreduce: 0.00 | optimizer: 56.06 | batch generator: 0.80 samples/sec: 6.593 | iteration 311500/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607474E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | time (ms) | forward: 567.07 | backward: 1803.04 | backward-backward: 1803.02 | backward-allreduce: 0.00 | optimizer: 56.23 | batch generator: 0.79 samples/sec: 6.597 | iteration 311600/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604827E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.97 | backward: 1803.37 | backward-backward: 1803.35 | backward-allreduce: 0.00 | optimizer: 55.39 | batch generator: 0.77 samples/sec: 6.588 | iteration 311700/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592860E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.92 | backward: 1805.15 | backward-backward: 1805.13 | backward-allreduce: 0.00 | optimizer: 56.09 | batch generator: 0.85 samples/sec: 6.594 | iteration 311800/ 320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.626400E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.79 samples/sec: 6.597 | iteration 311900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606004E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.41 | backward: 1803.35 | backward-backward: 1803.33 | backward-allreduce: 0.00 | optimizer: 55.28 | batch generator: 0.79 samples/sec: 6.589 | iteration 312000/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.590693E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1805.54 | backward-backward: 1805.52 | backward-allreduce: 0.00 | optimizer: 55.82 | batch generator: 0.79 ----------------------------------------------------------------------------------------------------------- validation results at iteration 312000 | lm_loss value: 2.481931E+00 | lm_loss_ppl value: 1.196435E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.442 | iteration 312100/ 320000 | elapsed time per iteration (ms): 2483.8 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.596661E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.72 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.55 | batch generator: 0.85 samples/sec: 6.598 | iteration 312200/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591088E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 565.99 | backward: 1803.06 | backward-backward: 1803.03 | backward-allreduce: 0.00 | optimizer: 55.40 | batch generator: 0.77 samples/sec: 6.592 | iteration 312300/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.604492E+00 | loss scale: 32768.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1804.89 | backward-backward: 1804.87 | backward-allreduce: 0.00 | optimizer: 55.09 | batch generator: 0.75 samples/sec: 6.591 | iteration 312400/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.596834E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.60 | backward-backward: 1804.58 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.79 samples/sec: 6.599 | iteration 312500/ 320000 | elapsed time per iteration (ms): 2424.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601281E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.93 | backward: 1802.39 | backward-backward: 1802.37 | backward-allreduce: 0.00 | optimizer: 55.92 | batch generator: 0.79 samples/sec: 6.591 | iteration 312600/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.620010E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.56 | backward: 1804.78 | backward-backward: 1804.76 | backward-allreduce: 0.00 | optimizer: 55.66 | batch generator: 0.77 samples/sec: 6.589 | iteration 312700/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602548E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.25 | backward: 1804.82 | backward-backward: 1804.79 | backward-allreduce: 0.00 | optimizer: 55.91 | batch generator: 0.81 samples/sec: 6.591 | iteration 312800/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.592072E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.05 | backward: 1804.23 | backward-backward: 1804.20 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.601 | iteration 312900/ 320000 | elapsed time per iteration (ms): 2424.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598201E+00 | loss scale: 
32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.00 | backward: 1802.03 | backward-backward: 1802.01 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.593 | iteration 313000/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.606093E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.63 | backward: 1804.11 | backward-backward: 1804.09 | backward-allreduce: 0.00 | optimizer: 55.77 | batch generator: 0.87 ----------------------------------------------------------------------------------------------------------- validation results at iteration 313000 | lm_loss value: 2.628281E+00 | lm_loss_ppl value: 1.384994E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 313100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.601330E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.96 | backward-backward: 1804.93 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.84 samples/sec: 6.594 | iteration 313200/ 320000 | elapsed time per iteration (ms): 2426.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591858E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.76 | backward: 1803.71 | backward-backward: 1803.69 | backward-allreduce: 0.00 | optimizer: 55.52 | batch generator: 0.77 samples/sec: 6.598 | iteration 313300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.610126E+00 | loss scale: 65536.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | time (ms) | forward: 566.04 | backward: 1803.08 | backward-backward: 1803.06 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.77 samples/sec: 6.589 | iteration 313400/ 320000 | elapsed time per iteration (ms): 2428.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.610417E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.67 | backward: 1805.81 | backward-backward: 1805.79 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 samples/sec: 6.587 | iteration 313500/ 320000 | elapsed time per iteration (ms): 2429.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.614016E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.98 | backward: 1806.12 | backward-backward: 1806.09 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.81 samples/sec: 6.593 | iteration 313600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.614077E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.88 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 56.21 | batch generator: 0.79 samples/sec: 6.597 | iteration 313700/ 320000 | elapsed time per iteration (ms): 2425.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.596313E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.08 | backward: 1803.16 | backward-backward: 1803.13 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.588 | iteration 313800/ 320000 | elapsed time per iteration (ms): 2428.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.603129E+00 | loss 
scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.70 | backward: 1805.61 | backward-backward: 1805.59 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.79 samples/sec: 6.592 | iteration 313900/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600395E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.08 | backward: 1804.21 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.60 | batch generator: 0.79 samples/sec: 6.592 | iteration 314000/ 320000 | elapsed time per iteration (ms): 2427.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595892E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.36 | backward: 1803.67 | backward-backward: 1803.64 | backward-allreduce: 0.00 | optimizer: 55.75 | batch generator: 0.95 ----------------------------------------------------------------------------------------------------------- validation results at iteration 314000 | lm_loss value: 2.575979E+00 | lm_loss_ppl value: 1.314419E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.446 | iteration 314100/ 320000 | elapsed time per iteration (ms): 2482.2 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.603253E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.24 | backward-backward: 1803.22 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.84 samples/sec: 6.590 | iteration 314200/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.602779E+00 | loss scale: 65536.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.20 | backward: 1804.87 | backward-backward: 1804.85 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.79 samples/sec: 6.591 | iteration 314300/ 320000 | elapsed time per iteration (ms): 2427.6 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.604853E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.82 | backward: 1805.11 | backward-backward: 1805.09 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.78 samples/sec: 6.592 | iteration 314400/ 320000 | elapsed time per iteration (ms): 2427.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.612665E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.95 | backward: 1804.93 | backward-backward: 1804.91 | backward-allreduce: 0.00 | optimizer: 55.08 | batch generator: 0.79 samples/sec: 6.591 | iteration 314500/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.596182E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.93 | backward: 1804.55 | backward-backward: 1804.53 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.84 samples/sec: 6.595 | iteration 314600/ 320000 | elapsed time per iteration (ms): 2426.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.594902E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1803.76 | backward-backward: 1803.74 | backward-allreduce: 0.00 | optimizer: 55.48 | batch generator: 0.76 samples/sec: 6.596 | iteration 314700/ 320000 | elapsed time per iteration (ms): 2425.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 
2.614825E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.20 | backward: 1803.07 | backward-backward: 1803.05 | backward-allreduce: 0.00 | optimizer: 56.07 | batch generator: 0.78 samples/sec: 6.590 | iteration 314800/ 320000 | elapsed time per iteration (ms): 2428.0 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.589894E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1805.45 | backward-backward: 1805.43 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 samples/sec: 6.593 | iteration 314900/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588818E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.53 | backward: 1804.26 | backward-backward: 1804.24 | backward-allreduce: 0.00 | optimizer: 55.72 | batch generator: 0.79 samples/sec: 6.594 | iteration 315000/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593605E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.71 | backward: 1803.89 | backward-backward: 1803.87 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 315000 | lm_loss value: 2.644554E+00 | lm_loss_ppl value: 1.407716E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.443 | iteration 315100/ 320000 | elapsed time per iteration (ms): 2483.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.604982E+00 | loss scale: 65536.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.48 | backward: 1803.96 | backward-backward: 1803.94 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.85 samples/sec: 6.591 | iteration 315200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613780E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.65 | backward-backward: 1804.63 | backward-allreduce: 0.00 | optimizer: 55.94 | batch generator: 0.77 samples/sec: 6.597 | iteration 315300/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591585E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.13 | backward: 1803.29 | backward-backward: 1803.26 | backward-allreduce: 0.00 | optimizer: 55.59 | batch generator: 0.80 samples/sec: 6.596 | iteration 315400/ 320000 | elapsed time per iteration (ms): 2425.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.586901E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.23 | backward: 1803.88 | backward-backward: 1803.86 | backward-allreduce: 0.00 | optimizer: 55.14 | batch generator: 0.78 samples/sec: 6.595 | iteration 315500/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599877E+00 | loss scale: 32768.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.82 | backward-backward: 1804.80 | backward-allreduce: 0.00 | optimizer: 54.56 | batch generator: 0.77 samples/sec: 6.594 | iteration 315600/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | 
lm_loss: 2.587597E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1803.98 | backward-backward: 1803.96 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.591 | iteration 315700/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.598109E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1805.08 | backward-backward: 1805.06 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.76 samples/sec: 6.591 | iteration 315800/ 320000 | elapsed time per iteration (ms): 2427.7 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.591140E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.61 | backward: 1803.59 | backward-backward: 1803.56 | backward-allreduce: 0.00 | optimizer: 57.15 | batch generator: 0.81 samples/sec: 6.593 | iteration 315900/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.605495E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1803.86 | backward-backward: 1803.84 | backward-allreduce: 0.00 | optimizer: 56.13 | batch generator: 0.76 samples/sec: 6.596 | iteration 316000/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601495E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.75 | backward: 1803.30 | backward-backward: 1803.27 | backward-allreduce: 0.00 | optimizer: 55.44 | batch generator: 0.79 
----------------------------------------------------------------------------------------------------------- validation results at iteration 316000 | lm_loss value: 2.495879E+00 | lm_loss_ppl value: 1.213239E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.444 | iteration 316100/ 320000 | elapsed time per iteration (ms): 2483.1 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.608991E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.54 | backward: 1803.63 | backward-backward: 1803.61 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.85 samples/sec: 6.600 | iteration 316200/ 320000 | elapsed time per iteration (ms): 2424.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598338E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1802.59 | backward-backward: 1802.57 | backward-allreduce: 0.00 | optimizer: 55.36 | batch generator: 0.77 samples/sec: 6.598 | iteration 316300/ 320000 | elapsed time per iteration (ms): 2425.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591390E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1802.96 | backward-backward: 1802.94 | backward-allreduce: 0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.593 | iteration 316400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.595369E+00 | loss scale: 32768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.27 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.64 | batch generator: 0.77 samples/sec: 6.594 | iteration 316500/ 
320000 | elapsed time per iteration (ms): 2426.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597776E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.03 | backward-backward: 1804.01 | backward-allreduce: 0.00 | optimizer: 55.61 | batch generator: 0.82 samples/sec: 6.593 | iteration 316600/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592190E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.46 | backward: 1804.22 | backward-backward: 1804.19 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.77 samples/sec: 6.594 | iteration 316700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592773E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.02 | backward-backward: 1804.00 | backward-allreduce: 0.00 | optimizer: 55.73 | batch generator: 0.80 samples/sec: 6.593 | iteration 316800/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592555E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.39 | backward: 1804.85 | backward-backward: 1804.83 | backward-allreduce: 0.00 | optimizer: 55.25 | batch generator: 0.78 samples/sec: 6.591 | iteration 316900/ 320000 | elapsed time per iteration (ms): 2427.5 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.611229E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.19 | backward: 1804.47 | backward-backward: 1804.44 | backward-allreduce: 0.00 | optimizer: 56.43 | batch 
generator: 0.77 samples/sec: 6.594 | iteration 317000/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601868E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.01 | backward-backward: 1803.99 | backward-allreduce: 0.00 | optimizer: 55.62 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 317000 | lm_loss value: 2.548714E+00 | lm_loss_ppl value: 1.279065E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 317100/ 320000 | elapsed time per iteration (ms): 2481.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.596201E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1802.57 | backward-backward: 1802.55 | backward-allreduce: 0.00 | optimizer: 55.37 | batch generator: 0.88 samples/sec: 6.593 | iteration 317200/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592696E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.56 | backward-backward: 1804.54 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.85 samples/sec: 6.593 | iteration 317300/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599185E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1803.95 | backward-backward: 1803.92 | backward-allreduce: 0.00 | optimizer: 55.83 | batch generator: 0.81 samples/sec: 6.593 | 
iteration 317400/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.599567E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.50 | backward: 1804.00 | backward-backward: 1803.98 | backward-allreduce: 0.00 | optimizer: 55.85 | batch generator: 0.78 samples/sec: 6.593 | iteration 317500/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.587690E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.58 | backward: 1804.12 | backward-backward: 1804.10 | backward-allreduce: 0.00 | optimizer: 55.80 | batch generator: 0.80 samples/sec: 6.592 | iteration 317600/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.607831E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.80 | backward: 1804.20 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.63 | batch generator: 0.95 samples/sec: 6.595 | iteration 317700/ 320000 | elapsed time per iteration (ms): 2426.2 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590544E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.05 | backward-backward: 1804.02 | backward-allreduce: 0.00 | optimizer: 55.46 | batch generator: 0.83 samples/sec: 6.595 | iteration 317800/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597896E+00 | loss scale: 131072.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.25 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 
55.24 | batch generator: 0.76 samples/sec: 6.595 | iteration 317900/ 320000 | elapsed time per iteration (ms): 2425.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598209E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.41 | backward-backward: 1804.39 | backward-allreduce: 0.00 | optimizer: 54.77 | batch generator: 0.77 samples/sec: 6.591 | iteration 318000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.605758E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.47 | backward: 1804.09 | backward-backward: 1804.07 | backward-allreduce: 0.00 | optimizer: 56.52 | batch generator: 0.76 ----------------------------------------------------------------------------------------------------------- validation results at iteration 318000 | lm_loss value: 2.562325E+00 | lm_loss_ppl value: 1.296593E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.448 | iteration 318100/ 320000 | elapsed time per iteration (ms): 2481.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.1TFLOPS | lm_loss: 2.583717E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 565.95 | backward: 1802.67 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.58 | batch generator: 0.85 samples/sec: 6.594 | iteration 318200/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.591432E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.36 | backward: 1804.37 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.50 | batch generator: 0.80 samples/sec: 
6.592 | iteration 318300/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.593581E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.88 | backward-backward: 1804.86 | backward-allreduce: 0.00 | optimizer: 55.67 | batch generator: 0.79 samples/sec: 6.593 | iteration 318400/ 320000 | elapsed time per iteration (ms): 2426.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.585911E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.32 | backward: 1804.40 | backward-backward: 1804.38 | backward-allreduce: 0.00 | optimizer: 55.81 | batch generator: 0.79 samples/sec: 6.594 | iteration 318500/ 320000 | elapsed time per iteration (ms): 2426.3 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.600398E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.34 | backward: 1804.28 | backward-backward: 1804.25 | backward-allreduce: 0.00 | optimizer: 55.35 | batch generator: 0.75 samples/sec: 6.593 | iteration 318600/ 320000 | elapsed time per iteration (ms): 2426.7 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.593790E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.44 | backward: 1804.21 | backward-backward: 1804.18 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.81 samples/sec: 6.594 | iteration 318700/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.620253E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.45 | backward: 1804.34 | backward-backward: 1804.32 | backward-allreduce: 0.00 | 
optimizer: 55.44 | batch generator: 0.80 samples/sec: 6.598 | iteration 318800/ 320000 | elapsed time per iteration (ms): 2424.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588499E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1802.68 | backward-backward: 1802.65 | backward-allreduce: 0.00 | optimizer: 55.47 | batch generator: 0.79 samples/sec: 6.597 | iteration 318900/ 320000 | elapsed time per iteration (ms): 2425.4 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.601134E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 566.30 | backward: 1804.25 | backward-backward: 1804.23 | backward-allreduce: 0.00 | optimizer: 54.48 | batch generator: 0.75 samples/sec: 6.591 | iteration 319000/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.610588E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.37 | backward: 1804.99 | backward-backward: 1804.97 | backward-allreduce: 0.00 | optimizer: 55.68 | batch generator: 0.77 ----------------------------------------------------------------------------------------------------------- validation results at iteration 319000 | lm_loss value: 2.584090E+00 | lm_loss_ppl value: 1.325123E+01 | ----------------------------------------------------------------------------------------------------------- samples/sec: 6.440 | iteration 319100/ 320000 | elapsed time per iteration (ms): 2484.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.0TFLOPS | lm_loss: 2.602000E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.43 | backward: 1804.66 | backward-backward: 1804.64 | backward-allreduce: 0.00 | optimizer: 56.14 | batch generator: 0.85 
samples/sec: 6.591 | iteration 319200/ 320000 | elapsed time per iteration (ms): 2427.4 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.585670E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.59 | backward: 1804.48 | backward-backward: 1804.46 | backward-allreduce: 0.00 | optimizer: 55.95 | batch generator: 0.78 samples/sec: 6.594 | iteration 319300/ 320000 | elapsed time per iteration (ms): 2426.6 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.590407E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.52 | backward: 1803.87 | backward-backward: 1803.85 | backward-allreduce: 0.00 | optimizer: 55.79 | batch generator: 0.78 samples/sec: 6.593 | iteration 319400/ 320000 | elapsed time per iteration (ms): 2426.8 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.602609E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.31 | backward: 1804.36 | backward-backward: 1804.34 | backward-allreduce: 0.00 | optimizer: 55.71 | batch generator: 0.74 samples/sec: 6.596 | iteration 319500/ 320000 | elapsed time per iteration (ms): 2425.5 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.588853E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.88 | backward: 1803.02 | backward-backward: 1803.00 | backward-allreduce: 0.00 | optimizer: 55.26 | batch generator: 0.80 samples/sec: 6.595 | iteration 319600/ 320000 | elapsed time per iteration (ms): 2426.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.592412E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.74 | backward: 1803.44 | backward-backward: 1803.42 | backward-allreduce: 
0.00 | optimizer: 55.51 | batch generator: 0.80 samples/sec: 6.592 | iteration 319700/ 320000 | elapsed time per iteration (ms): 2427.1 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.597655E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.42 | backward: 1804.72 | backward-backward: 1804.69 | backward-allreduce: 0.00 | optimizer: 55.57 | batch generator: 0.78 samples/sec: 6.593 | iteration 319800/ 320000 | elapsed time per iteration (ms): 2427.0 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.598595E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 566.29 | backward: 1804.55 | backward-backward: 1804.52 | backward-allreduce: 0.00 | optimizer: 55.76 | batch generator: 0.74 samples/sec: 6.598 | iteration 319900/ 320000 | elapsed time per iteration (ms): 2424.9 | learning rate: 3.000E-05 | approx flops per GPU: 41.0TFLOPS | lm_loss: 2.603832E+00 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 | time (ms) | forward: 565.86 | backward: 1804.13 | backward-backward: 1804.11 | backward-allreduce: 0.00 | optimizer: 54.57 | batch generator: 0.76 samples/sec: 6.589 | iteration 320000/ 320000 | elapsed time per iteration (ms): 2428.3 | learning rate: 3.000E-05 | approx flops per GPU: 40.9TFLOPS | lm_loss: 2.613633E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 | time (ms) | forward: 567.02 | backward: 1805.39 | backward-backward: 1805.37 | backward-allreduce: 0.00 | optimizer: 55.53 | batch generator: 0.80 ----------------------------------------------------------------------------------------------------------- validation results at iteration 320000 | lm_loss value: 2.575310E+00 | lm_loss_ppl value: 1.313539E+01 | 
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
 validation results at the end of training for val data | lm_loss value: 2.577925E+00 | lm_loss_ppl value: 1.316979E+01 |
---------------------------------------------------------------------------------------------------------------------------
WARNING: Deleting old checkpoints: checkpoints-hfcm/global_step220000
----------------------------------------------------------------------------------------------------------------------
 test results at the end of training for test data | lm_loss value: 2.623284E+00 | lm_loss_ppl value: 1.378091E+01 |
----------------------------------------------------------------------------------------------------------------------