ash
@aes
Testing my luck here, let me see if the Farcaster community can help. I have a setup with 4 A6000s. Every time I run a training job that uses all 4 GPUs, my machine shuts down. When I use 1-3 it works fine. I had an electrician come out, and the socket and power supply are fine. What could be the problem?
Claus Wilke
@clauswilke
We've had tons of problems with RTX 6000 Adas locking up our machines, requiring a hard power cycle. It only happens during training. I'm not sure what the latest status of the situation is. The problem is that these machines are not easily accessible for us, so a power cycle is complicated and debugging is slow.
Choong Ng
@choong
Try setting the lowest allowable power limit via nvidia-smi, then running a 4-GPU job known to trigger the shutdown?
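
A rough sketch of that test in Python, in case it helps. It assumes `nvidia-smi` is on the PATH and that the driver reports the `power.min_limit` query field; the helper names `parse_min_limits` and `build_set_commands` are made up for illustration. It only prints the `nvidia-smi -i <gpu> -pl <watts>` commands rather than running them, since setting a power limit normally needs root.

```python
import math
import shutil
import subprocess

# Ask each GPU for its index, minimum allowed power limit, and current limit.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.min_limit,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_min_limits(csv_text):
    """Parse 'index, min_limit, current_limit' rows into {gpu_index: min_watts}."""
    limits = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            limits[int(fields[0])] = float(fields[1])
        except ValueError:
            # Some GPUs report [N/A] for power limits; skip those rows.
            continue
    return limits


def build_set_commands(limits):
    """Build one 'nvidia-smi -i <gpu> -pl <watts>' command per GPU."""
    # Round up so the requested value never falls below the reported minimum.
    return [
        ["nvidia-smi", "-i", str(gpu), "-pl", str(math.ceil(watts))]
        for gpu, watts in sorted(limits.items())
    ]


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode != 0:
            print("nvidia-smi query failed:", result.stderr.strip())
        else:
            # Dry run: print the commands to run as root before the 4-GPU job.
            for cmd in build_set_commands(parse_min_limits(result.stdout)):
                print(" ".join(cmd))
```

If the 4-GPU job survives with every card capped at its minimum, that would point toward total power draw rather than the GPUs themselves, though it would not prove it.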