ash
@aes
testing my luck here, lemme see if the farcaster community can help. I have a setup with 4 A6000s. Every time I run a training job that uses all 4 GPUs, the machine shuts down. With 1-3 GPUs it works fine. I had an electrician come out, and the socket and power supply are fine. What could be the problem?
2 replies
2 recasts
1 reaction
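A shutdown that only happens when all four GPUs are loaded points at the power budget, since training drives every card to its power limit at once. A back-of-envelope estimate makes that concrete; a minimal sketch, where the 300 W TDP per A6000 is the published spec but the system draw and spike headroom factor are illustrative assumptions, not measurements:

```python
# Rough power budget for a 4x A6000 workstation.
# GPU_TDP_W = 300 is NVIDIA's published A6000 TDP; SYSTEM_W and
# SPIKE_FACTOR are illustrative assumptions for CPU/board/drives
# and transient load spikes, which can briefly exceed nominal TDP.

GPU_TDP_W = 300
N_GPUS = 4
SYSTEM_W = 300          # assumed CPU + motherboard + storage draw
SPIKE_FACTOR = 1.5      # assumed headroom for transient spikes

steady_state_w = N_GPUS * GPU_TDP_W + SYSTEM_W
recommended_psu_w = steady_state_w * SPIKE_FACTOR

print(f"steady-state draw: ~{steady_state_w} W")
print(f"PSU size with spike headroom: ~{recommended_psu_w:.0f} W")
```

Under these assumptions the box draws around 1500 W sustained, so a PSU that looks "fine" at the wall can still trip its over-current protection when all four cards spike together. Capping per-GPU power (e.g. `nvidia-smi -pl 250`) is one way to test this hypothesis cheaply.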
Claus Wilke
@clauswilke
We've had tons of problems with RTX 6000 ADAs locking up our machines, requiring a hard power cycle. It only happens during training. I'm not sure what the latest status of the situation is; the problem is that these machines are not easily accessible for us, so a power cycle is complicated and debugging is slow.
2 replies
0 recast
0 reaction
Claus Wilke
@clauswilke
The best I can tell you is that there are version incompatibilities between the NVIDIA drivers and PyTorch, and if you've got a mismatch you will experience problems. Make sure the versions match. My students would know more, but they're not on Farcaster.
1 reply
0 recast
1 reaction
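One concrete check along these lines: compare the driver version from `nvidia-smi` against the CUDA runtime PyTorch was built with (`python -c "import torch; print(torch.version.cuda)"`). A minimal sketch of that comparison, where the minimum-driver table is a small subset taken from NVIDIA's CUDA release notes and is not exhaustive:

```python
# Check whether an installed NVIDIA driver is new enough for the
# CUDA runtime a PyTorch build targets. The minimums below are the
# Linux values from NVIDIA's CUDA release notes (subset only).

MIN_DRIVER_FOR_CUDA = {
    "11.8": (450, 80, 2),   # CUDA 11.8 needs driver >= 450.80.02
    "12.1": (525, 60, 13),  # CUDA 12.x needs driver >= 525.60.13
    "12.4": (525, 60, 13),
}

def parse_driver(version: str) -> tuple:
    """Turn '535.104.05' into (535, 104, 5) for tuple comparison."""
    return tuple(int(part) for part in version.split("."))

def driver_supports(cuda: str, driver: str) -> bool:
    """True if the driver meets the minimum for this CUDA version."""
    minimum = MIN_DRIVER_FOR_CUDA[cuda]
    parsed = parse_driver(driver)
    width = max(len(minimum), len(parsed))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(parsed) >= pad(minimum)

# e.g. driver 535.104.05 with a PyTorch build targeting CUDA 12.1:
print(driver_supports("12.1", "535.104.05"))
```

If the driver is older than the minimum for the CUDA version baked into the PyTorch wheel, upgrading the driver (or installing a wheel built for an older CUDA) is the usual fix.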
ash
@aes
yes, exactly, only during training.
0 reply
0 recast
0 reaction