Multinode operation

#20 opened by Wilkymagic

What an incredible model, with enormous potential applications in my work as a historian! However, my limited technical background seems to be holding me back from running it effectively in my environment.

The server command at https://huggingface.co./mistralai/Pixtral-Large-Instruct-2411 appears to assume a single-node setup, but I need to run the model across multiple nodes (specifically two, or maybe three, nodes, each with four NVIDIA A100 GPUs, i.e. 160 GB of GPU memory per node). Despite my efforts, I have been struggling to get it working properly in this multi-node configuration.

I've successfully converted the Docker image into a .sif Apptainer file and downloaded the model to a local directory. However, I consistently run into issues, usually after all shards have loaded, for example:

tried to allocate 2.93 GiB. GPU 0 has a total capacity of 39.50 GiB of which 106.12 MiB is free.

or

/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ')

Has anyone successfully deployed this model in a multi-node environment who might be able to offer some guidance? Any advice would be greatly appreciated!

My current attempt is below.

#!/bin/bash -l
#SBATCH -A
#SBATCH -q default
#SBATCH -p gpu
#SBATCH --time=01:00:00
#SBATCH -N 3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gpus-per-task=4
#SBATCH --error="debug_local_model-%j.err"
#SBATCH --output="debug_local_model-%j.out"

# Setup
module --force purge
module load env/release/2023.1
module load Apptainer/1.3.1-GCCcore-12.3.0

# PMIx
export PMIX_MCA_psec=native

# Local model
export LOCAL_MODEL_DIR="/workspace/models--mistralai--Pixtral-Large-Instruct-2411/snapshots/6aea62fd4e842bb7981339519accf19c7120ccd3"

# Apptainer SIF image
export SIF_IMAGE="vllm-openai.sif"

# Apptainer bind arguments
export APPTAINER_ARGS="--nvccli -B /mnt/tier2/project/:/workspace"
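# (--nvccli sets up GPU access via nvidia-container-cli; -B binds the project tree to /workspace inside the container)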

# Ray

export HEAD_HOSTNAME="$(hostname)"
export HEAD_IPADDRESS="$(hostname --ip-address)"
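# Ask the OS for a free TCP port for the Ray head (bind to port 0 and read back the assigned port)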
export RANDOM_PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()')

export RAY_CMD_HEAD="ray start --block --head --port=${RANDOM_PORT} --num-cpus=${SLURM_CPUS_PER_TASK} --verbose"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_IPADDRESS}:${RANDOM_PORT} --num-cpus=${SLURM_CPUS_PER_TASK} --verbose"

export TENSOR_PARALLEL_SIZE=4
export PIPELINE_PARALLEL_SIZE=${SLURM_NNODES}
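# My understanding: tensor parallel spans the 4 GPUs within a node, pipeline parallel spans the nodes,
# so TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE should equal the job's total GPU count.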

# End of setup

# Logging to try to identify problems

echo "========== ENVIRONMENT VARIABLES =========="
env

echo "========== SLURM VARs =========="
echo "SLURM_JOBID: ${SLURM_JOBID}"
echo "SLURM_NNODES: ${SLURM_NNODES}"
echo "SLURM_NODELIST: ${SLURM_NODELIST}"
echo "SLURM_CPUS_PER_TASK: ${SLURM_CPUS_PER_TASK}"
echo "SLURM_GPUS_PER_TASK: ${SLURM_GPUS_PER_TASK}"

echo "========== NODE & GPU INFORMATION (HOST) =========="
srun -N ${SLURM_NNODES} -l hostname
srun -N ${SLURM_NNODES} -l nvidia-smi -L

echo "========== NODE & GPU INFORMATION (INSIDE APPTAINER) =========="
srun -N ${SLURM_NNODES} -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} nvidia-smi

echo "========== CHECKING MODEL DIRECTORY ACCESS INSIDE APPTAINER =========="
srun -N ${SLURM_NNODES} -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ls -l ${LOCAL_MODEL_DIR}

# Additional check: print disk usage to confirm full model presence

srun -N 1 -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} du -sh ${LOCAL_MODEL_DIR}

# Start Ray

echo "========== STARTING RAY HEAD NODE =========="
srun -J "head_ray_node_step_%J" -N 1 --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_HEAD} &
sleep 20

echo "========== STARTING RAY WORKERS =========="
srun -J "worker_ray_node_step_%J" -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c ${SLURM_CPUS_PER_TASK} -x ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 30
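# Sanity check (my addition): confirm that all nodes have registered with the Ray head
srun -N 1 -w ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ray status --address=${HEAD_IPADDRESS}:${RANDOM_PORT}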

# Test vLLM

echo "HEAD NODE: ${HEAD_HOSTNAME}"
echo "IP ADDRESS: ${HEAD_IPADDRESS}"
echo "RANDOM PORT (RAY): ${RANDOM_PORT}"
echo "SSH TUNNEL CMD: ssh -p 8822 ${USER}@login.lxp.lu -NL 8000:${HEAD_IPADDRESS}:8000"

echo "========== TESTING VLLM SERVE COMMAND LOCALLY =========="

# Attempt to run vllm serve against the local model directory.

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
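# ^ added while debugging the OOM above: lets the CUDA caching allocator grow existing segments instead of failing on fragmentation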

apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} vllm serve ${LOCAL_MODEL_DIR} \
    --config-format mistral \
    --load-format mistral \
    --tokenizer_mode mistral \
    --limit_mm_per_prompt 'image=10' \
    --max-model-len 4096 \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
    --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
    --port 8000 \
    --host 0.0.0.0

echo "========== SCRIPT COMPLETE =========="
