Error when loading the model (error: Killed)
GPU: A800 (80 GB)
System memory (RAM): 64 GB
The following log appears while the model is loading, and the process is then killed:
'''
[2023-10-25 09:14:47,988] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-25 09:14:50,747] [INFO] building CogVLMModel model ...
[2023-10-25 09:14:50,749] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-25 09:14:50,750] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-25 09:15:05,366] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2023-10-25 09:15:11,894] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/user/CogVLM/main/CogVLM-main/cogvlm-chat/1/mp_rank_00_model_states.pt
Killed
'''
How should this situation be handled?
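Sounds like OOM (out of memory). A bare "Killed" with no Python traceback usually means the Linux OOM killer ended the process because host RAM ran out, not GPU memory; running out of GPU memory would normally raise a CUDA error in Python instead. The log reports 17,639,685,376 parameters, so the weights alone take about 35 GB in 16-bit precision or about 71 GB in 32-bit, which leaves little or no headroom in 64 GB of RAM while the checkpoint is being read.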
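A rough estimate of how much memory the weights need can be computed from the parameter count in the log. This is only a sketch: the bytes-per-parameter values are assumptions about which dtype the checkpoint uses (the log does not say), and real peak usage while loading can be higher than the raw weight size.

```python
# Parameter count copied from the log line
# "number of parameters on model parallel rank 0: 17639685376".
NUM_PARAMS = 17_639_685_376

# Bytes per parameter for common dtypes. Which dtype the checkpoint
# actually uses is an assumption; the log does not state it.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2}


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weights_gb(NUM_PARAMS, nbytes):.1f} GB")
```

If the result for the dtype in use is close to or above the installed RAM, an OOM kill during checkpoint loading is the likely explanation.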
Troubleshooting confirmed that an OOM did occur.
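The thread does not say how the OOM was confirmed. One common way, sketched here under assumptions, is to look for OOM-killer messages in the Linux kernel log. The helper names `find_oom_lines` and `read_kernel_log` are hypothetical, the marker strings are typical kernel wording that can vary between kernel versions, and reading `dmesg` may require elevated privileges on some systems.

```python
import subprocess
from typing import List

# Substrings the Linux kernel typically logs when the OOM killer acts.
# Exact wording varies by kernel version, so this list is an assumption.
OOM_MARKERS = ("out of memory", "oom-kill", "invoked oom-killer")


def find_oom_lines(kernel_log: str) -> List[str]:
    """Return the lines of a kernel log that mention the OOM killer."""
    return [
        line
        for line in kernel_log.splitlines()
        if any(marker in line.lower() for marker in OOM_MARKERS)
    ]


def read_kernel_log() -> str:
    """Return the output of `dmesg`, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout
```

Calling `find_oom_lines(read_kernel_log())` right after the process is killed should return any matching kernel messages; an empty list means either that no OOM kill was logged or that the kernel log could not be read.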