Determinism Challenges with Microsoft Phi-3.5-mini-instruct Across Different GPU Architectures
I’ve been working on generating deterministic outputs from Microsoft’s Phi-3.5-mini-instruct model on two different GPU setups: a single A6000 and a pair of A5000s. Despite taking every precaution, such as matching library versions, disabling sources of randomness, forcing strict precision settings, and using greedy decoding, the outputs still diverge after a certain number of tokens. A sketch of the kind of setup I mean is below.
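For reference, this is roughly the configuration I’m describing; the prompt and generation parameters are illustrative rather than my exact script:

```python
# Illustrative determinism setup: fixed seeds, deterministic kernels where
# PyTorch supports them, full float32, and pure greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Requires the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 to be set for CUDA >= 10.2.
torch.use_deterministic_algorithms(True)

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32, device_map="auto"
)

inputs = tokenizer("Explain determinism in LLM inference.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding: no sampling randomness at all
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```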
I’ve ensured that all configurations, including PyTorch and CUDA versions and model parameters, are identical across the two setups. While greedy decoding does remove sampling randomness, for many prompts the wording and content of the responses still start to diverge noticeably. Based on my research, the underlying issue appears to be floating-point arithmetic, which is handled differently across GPU architectures: differences in precision, rounding, and the order in which operations are executed seem to produce these discrepancies.
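As a toy illustration of the order-of-operations point (not specific to Phi-3.5): float32 addition is not associative, so two kernels that reduce the same values in different orders can produce bit-different results even from identical inputs:

```python
# Float32 addition is not associative, so the reduction order matters.
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

s_sequential = x.sum()                          # one reduction order
s_chunked = torch.stack(
    [chunk.sum() for chunk in x.chunk(64)]      # a different, tree-like order
).sum()

print(s_sequential.item(), s_chunked.item())
print("bitwise equal:", torch.equal(s_sequential, s_chunked))  # often False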
These architecture-dependent differences seem to make perfect determinism impossible unless the same GPU model (e.g., A6000) or one with an identical architecture is used. It might theoretically be possible to trace the inference at a low level (e.g., inspecting logits and individual floating-point operations), but this would be computationally intensive and impractical for a model the size of Phi-3.5-mini-instruct.
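If I did go down that route, the comparison itself is simple; it’s the volume of data that makes it impractical. A hypothetical helper (function name and file paths are made up for illustration) that locates the first divergent decoding step from per-step logits dumped on each machine might look like:

```python
# Hypothetical offline comparison: run greedy decoding on each machine,
# save the per-step logits (e.g. via output_scores=True in generate()),
# then find the first step where the argmax token or the raw scores differ.
import torch

def first_divergence(logits_a, logits_b, atol=0.0):
    """logits_a, logits_b: [num_steps, vocab_size] tensors saved on each setup."""
    steps = min(logits_a.shape[0], logits_b.shape[0])
    for step in range(steps):
        tok_a = logits_a[step].argmax().item()
        tok_b = logits_b[step].argmax().item()
        max_diff = (logits_a[step] - logits_b[step]).abs().max().item()
        if tok_a != tok_b or max_diff > atol:
            return step, tok_a, tok_b, max_diff
    return None  # no divergence within the compared steps

# Usage (file names are placeholders):
# print(first_divergence(torch.load("logits_a6000.pt"), torch.load("logits_a5000.pt")))
```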
Has anyone encountered similar challenges with this or other LLMs? If so, have you found any strategies to mitigate these discrepancies across different GPU architectures? Your insights and experiences would be greatly appreciated!
It's probably not possible to get completely consistent results across different hardware, no. CUDA's math libraries document ULP error bounds far larger than 1 for a number of operations, and even IEEE 754 does not guarantee errors of at most 0.5 ULP for non-basic operations such as transcendental functions.
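To make the consequence concrete, here is a toy sketch (not tied to Phi-3.5 specifically) of how a one-ULP difference in a single logit can flip a greedy argmax; once one token changes, every later token conditions on a different prefix, so the outputs keep drifting apart:

```python
# A single-ULP rounding difference is enough to flip argmax on a near-tie.
import torch

logits = torch.tensor([3.0, 3.0, -1.0], dtype=torch.float32)  # exact tie between tokens 0 and 1

perturbed = logits.clone()
# Nudge one logit up by exactly one float32 ULP, mimicking a rounding
# difference from a differently ordered reduction on another GPU.
perturbed[1] = torch.nextafter(perturbed[1], torch.tensor(4.0))

print(logits.argmax().item())     # 0 (ties resolve to the first index)
print(perturbed.argmax().item())  # 1 -> a different token, hence a different continuation
```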