
"loss Is Nan" Error In Keras, How To Debug?

I know there are other questions here about the "Loss is NaN" error, but I'm working with example code provided by François Chollet (the author of Keras), which is supposed to be the simplest …

Solution 1:

Over the last few days I've learned some relevant things:

  • This is a known issue with PlaidML, apparently still unresolved. Here is the discussion on GitHub.
  • While many places state that TensorFlow's GPU acceleration only works with Nvidia GPUs, Intel's website confirms that TensorFlow runs on the Mac's Intel CPU.
  • The pip install commands I found here and there for TensorFlow did not work for me, but conda did: "conda install tensorflow" or "conda install tensorflow -c anaconda".
  • Since I now have both TensorFlow's Keras and PlaidML's Keras installed, I can import either "from tensorflow.keras import XXX" or "from keras import XXX" to choose the CPU or GPU version (see the sketch after this list). That's useful.
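A minimal sketch of that import-based switch, assuming both the tensorflow and plaidml-keras packages are installed; the layer sizes and the 784-feature input shape are just placeholders, not part of the original test code:

    # CPU path: TensorFlow's bundled Keras (runs on the Intel CPU).
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(784,)),  # placeholder sizes
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # GPU path: standalone Keras on the PlaidML backend. Either set
    # KERAS_BACKEND=plaidml.keras.backend in the environment, or run this
    # before the first "import keras":
    #
    #     import plaidml.keras
    #     plaidml.keras.install_backend()
    #     from keras import layers, models

The only difference between the two builds is which package the layers come from; the model-building code itself is unchanged.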

Not yet investigated: According to many websites, if I have a Thunderbolt 3 port (one Mac does, one does not), it is possible to add an external Nvidia eGPU to a Mac. Whether it will work is unknown. Gamer discussion boards seem to say it doesn't work for their purposes, but Keras-related websites say that Tensorflow will use an eGPU just fine.

At any rate, I can proceed with my projects using tensorflow.keras on the CPU. It's not ideal, but it works. On the test code above I was seeing about 100 μs/sample and 6 s/epoch.
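If you want to confirm that the TensorFlow build is in fact running on the CPU only, a quick check like the following works, assuming a TensorFlow 2.x install (conda currently gives you one):

    import tensorflow as tf

    # List the devices TensorFlow can see; on these Macs only a CPU entry
    # is expected, since there is no Nvidia GPU for it to use.
    print(tf.__version__)
    print(tf.config.list_physical_devices())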

Bottom line for people experiencing PlaidML-Keras issues: this is a widespread problem. PlaidML is currently broken for many GPUs. For now, use TensorFlow on the Intel CPU, and keep an eye on Issue #168 for a fix.
