Making TensorFlow work with an Nvidia RTX 3090 on Windows 10

Maria Dobromyslova
5 min read · Nov 10, 2020


Nvidia RTX 3090

I recently acquired a new Nvidia RTX 3090, and since most of my ML algorithms are based on TensorFlow, I decided to make it work with the latest version.

If you want to skip the whole testing and installation process and just want to know which version of TF works with an RTX video card, here is the comparison table of the build configurations I tested in the process:

The initial build configurations for TensorFlow 2.3.0 (latest release) were taken from the https://www.tensorflow.org/install/source_windows#gpu page.

There weren’t any error messages during the installation. However, after installing, I decided to test it on a small style-transfer example. Here is a small and quick code snippet that applies artistic style transfer with TensorFlow:

This code is based on the https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2 model, and if you want to run it on your machine, here are the images that I used in testing:

Content (Left) and Style (Right) images

After running this code on a fresh TensorFlow 2.3.0 installation, I noticed that the first execution took around 10 minutes. This is because of the compilation of the PTX operations, which is also noted on the https://www.tensorflow.org/install/gpu#hardware_requirements build page. I also increased the cache size with set CUDA_CACHE_MAXSIZE=2147483648, as recommended; the total cache size after the launch was about 1.2 GB.
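To see how big the JIT cache actually is, a few lines of Python are enough. The commented-out cache location is an assumption (the usual Windows default, which can be moved with CUDA_CACHE_PATH); adjust it to your system.

```python
import os
from pathlib import Path

def dir_size_bytes(path):
    """Total size in bytes of all files under a directory tree."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())

# The usual CUDA JIT cache location on Windows (an assumption; it can be
# relocated with the CUDA_CACHE_PATH environment variable):
# cache = os.path.expandvars(r"%APPDATA%\NVIDIA\ComputeCache")
# print(f"{dir_size_bytes(cache) / 1024**3:.2f} GiB")
```
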

But there were several problems: the cache wasn’t cleared after execution, and the resulting image didn’t contain much of the content:

To compare it with the expected result, we need to generate one somehow. Luckily for us, it’s easy to do on the CPU. Just run set CUDA_VISIBLE_DEVICES=-1 before executing the code, and you will get a result like this:
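As an aside, the same CPU fallback can be set from inside the script instead of the shell; the only catch is that it has to happen before TensorFlow is imported.

```python
import os

# Hide all GPUs from CUDA. This must run before `import tensorflow`,
# because TensorFlow reads CUDA_VISIBLE_DEVICES once at initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # imported after this point, TF will see no GPUs
```
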

As you can see, there is a huge difference between the results. So, I decided to test the second build:

You can install the latest dev version of TensorFlow with the following commands:

pip3 install tensorflow-gpu
pip3 install tensorflow-hub
pip3 install tf-nightly-gpu

I added tensorflow-hub because the code example provided above needs it.

Here is how you can validate the installed version:

pip3 show tensorflow-gpu
pip3 show tf-nightly-gpu

The working combination of the CUDA toolkit and the cuDNN library was found by trial and error, so I’m posting only the versions that work together.

After the installation and some more tests, I finally got the expected results.

But something was still slightly off. The execution time was 16 seconds, while on the CPU it was around 8 seconds. You would expect the GPU to be faster than the CPU, or at least about the same on such a small example. Additionally, there was a warning message about PTX compilation:

[tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code -1, output:Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.

and a lot of:

[tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 4294967295

I knew that the path to ptxas was already in PATH (I double-checked this with echo %PATH%), and after some research, I found that you can enable explicit logging for the PTX compiler: add set TF_CPP_VMODULE="asm_compiler=3" before running the code.

So, now I saw a more detailed message:

[I tensorflow/stream_executor/gpu/asm_compiler.cc:157] Looking for ptxas.exe at C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/ptxas.exe
[I tensorflow/stream_executor/gpu/asm_compiler.cc:166] Using ptxas.exe at C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/ptxas.exe
[I tensorflow/stream_executor/gpu/asm_compiler.cc:186] ptx written to: C:\Users\user\AppData\Local\Temp\/tempfile-ALM-68c0-15812-5b3a6f65ff7e3
[I tensorflow/stream_executor/gpu/asm_compiler.cc:215] C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.0/bin/ptxas.exe C:\Users\user\AppData\Local\Temp\/tempfile-ALM-68c0-15812-5b3a6f65ff7e3 -o C:\Users\user\AppData\Local\Temp\/tempfile-ALM-68c0-15812-5b3a6f65ffba1 -arch=sm_86 -v
[I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 4294967295

Great! Now we have the command that TensorFlow uses to compile the PTX operations. Did you notice the -arch=sm_86 flag? It tells the PTX compiler to target compute capability 8.6, which is the one the Nvidia RTX 3090 uses.
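If you want to double-check which compute capability TensorFlow sees for your card, the nightly builds expose it. Note that tf.config.experimental.get_device_details was a brand-new experimental API at the time and may be missing from older releases.

```python
import tensorflow as tf

# Print each visible GPU together with its compute capability.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("compute_capability"))  # an RTX 3090 reports (8, 6)
```
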

In case you can’t find the file that the PTX compiler is trying to compile, here it is:

Save it as C:\Users\user\AppData\Local\Temp\ALM_test and run the PTX compilation with the following command:

cd "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\" && ptxas.exe C:\Users\user\AppData\Local\Temp\ALM_test -o C:\Users\user\AppData\Local\Temp\ALM_test_output -arch=sm_86 -v 

And the result is the following error message:

ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Getting an error message is a good sign: it means we got somewhere! This error tells us that sm_86 (compute capability 8.6) is not supported by this version of the PTX compiler.

I wondered: where can I find a compiler that does support it?

The answer is, of course, the latest CUDA release, which is 11.1.1_456.81 at the moment. But here is the issue: the latest dev version of TensorFlow uses CUDA 11.0. So what is the solution? It's easy: keep 11.0 for TensorFlow and copy ptxas.exe from the 11.1 version. To do this, just copy C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin\ptxas.exe into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\ptxas.exe. You can move or rename the 11.0 binary before copying the 11.1 one.
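That backup-and-copy step can also be scripted. Here is a small Python sketch of the same idea; the commented-out call uses the default CUDA install paths mentioned above, so adjust them to your setup.

```python
import shutil
from pathlib import Path

def swap_ptxas(newer_bin_dir, older_bin_dir):
    """Back up the older ptxas.exe, then copy the newer one over it."""
    src = Path(newer_bin_dir) / "ptxas.exe"
    dst = Path(older_bin_dir) / "ptxas.exe"
    if dst.exists():
        # Keep the original around so the swap can be rolled back.
        shutil.move(str(dst), str(dst) + ".bak")
    shutil.copy2(str(src), str(dst))

# The default install locations from above:
# swap_ptxas(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin",
#            r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin")
```
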

Okay, let’s run our compilation once again with the new ptxas:

cd "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\" && ptxas.exe C:\Users\user\AppData\Local\Temp\ALM_test -o C:\Users\user\AppData\Local\Temp\ALM_test_output -arch=sm_86 -v

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'redzone_checker' for 'sm_86'
ptxas info : Function properties for redzone_checker
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 384 bytes cmem[0]

And it looks like it works! After running the style-transfer example again, there are no warning messages about PTX, and the execution time is down to 8 seconds, as expected. The cache is also cleared now, which likewise indicates success.

After running TensorFlow on more advanced examples, I found that execution time with the PTX warnings is about 2 hours, and with the compilation fixed it’s down to 20 minutes. That’s roughly a 6× speedup, and I’m very happy with the results.

Conclusions

I think this issue will be gone in future releases of TensorFlow; the RTX 30 series is simply too new to expect native support already.

I hope this hotfix works for you as well.

Let me know if this was helpful or if you have any additional questions. Feel free to reach me on Twitter.
