Training models with Ubuntu 22 + TensorFlow 2 + ROCm

This article describes how to install TensorFlow 2 and ROCm on Ubuntu 22.04.2 in order to train models on a Radeon RX 6600 XT.

Installing ROCm

This article draws on the following references:

https://blog.csdn.net/qq_51403540/article/details/123951460

https://github.com/RadeonOpenCompute/ROCm-docker/blob/master/quick-start.md

The official guide did not work for me: on my Ubuntu 22.04.2 the installation failed no matter what I tried, with this error:

The following packages have unmet dependencies:
 openmp-extras : Depends: libstdc++-5-dev but it is not installable or
                          libstdc++-7-dev but it is not installable
                 Depends: libgcc-5-dev but it is not installable or
                          libgcc-7-dev but it is not installable
                 Recommends: gcc but it is not going to be installed
                 Recommends: g++ but it is not going to be installed
 rocm-gdb : Depends: libpython3.8 but it is not installable
 rocm-llvm : Depends: python but it is not installable
             Depends: libstdc++-5-dev but it is not installable or
                      libstdc++-7-dev but it is not installable
             Depends: libgcc-5-dev but it is not installable or
                      libgcc-7-dev but it is not installable
             Recommends: gcc but it is not going to be installed
             Recommends: g++ but it is not going to be installed
             Recommends: gcc-multilib but it is not going to be installed
             Recommends: g++-multilib but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I then spent a long time reading through the related issue, https://github.com/RadeonOpenCompute/ROCm/issues/1713, but nothing there solved the problem.

The only useful part was a comment near the end:

@karim789
Please try latest ROCm-5.4
https://repo.radeon.com/amdgpu/5.4/ubuntu/

The address given there is wrong, though. The correct one is https://repo.radeon.com/amdgpu-install/5.4/ubuntu/

Following that trail, the correct download address turns out to be https://repo.radeon.com/amdgpu-install/5.4/ubuntu/jammy/amdgpu-install_5.4.50400-1_all.deb

Download and install this package:

wget https://repo.radeon.com/amdgpu-install/5.4/ubuntu/jammy/amdgpu-install_5.4.50400-1_all.deb
sudo apt install ./amdgpu-install_5.4.50400-1_all.deb
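
The download address has three variable parts: the release directory, the Ubuntu codename, and the package version. The sketch below only shows how those parts fit together. The function name `amdgpu_install_url` is made up for illustration, and only the 5.4 / jammy combination is confirmed by this article; other releases are assumed, not verified, to follow the same layout.

```python
def amdgpu_install_url(release: str, codename: str, package_version: str) -> str:
    """Build the amdgpu-install .deb URL from its three variable parts.

    Hypothetical helper: the path layout is taken from the one URL
    confirmed in this article and may not hold for other releases.
    """
    base = "https://repo.radeon.com/amdgpu-install"
    filename = f"amdgpu-install_{package_version}_all.deb"
    return f"{base}/{release}/ubuntu/{codename}/{filename}"


# The combination used in this article.
url = amdgpu_install_url("5.4", "jammy", "5.4.50400-1")
print(url)
```

If a different release is needed, it is probably safer to browse the directory listing under https://repo.radeon.com/amdgpu-install/ than to guess the package version.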

Install ROCm:

sudo amdgpu-install --usecase=rocm,hip,mllib --no-dkms

Add your user to the video and render groups:

sudo usermod -a -G video,render $LOGNAME
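
Group changes made with usermod only apply to new login sessions, so it can help to check whether the current session already has them. The following is a minimal sketch using Python's standard `grp` and `os` modules; the helper names `missing_groups` and `current_group_names` are invented for this example.

```python
import grp
import os

REQUIRED_GROUPS = ("video", "render")


def missing_groups(required, current):
    """Return the required group names that are absent from `current`."""
    current_set = set(current)
    return [name for name in required if name not in current_set]


def current_group_names():
    """Names of the supplementary groups of the running process."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid without an entry in the group database has no name.
            continue
    return names


if __name__ == "__main__":
    missing = missing_groups(REQUIRED_GROUPS, current_group_names())
    print("missing groups:", missing or "none")
```

If a group is still reported as missing right after running usermod, logging out and back in is usually enough.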

If your card is fairly new, such as the 6600 XT, GPU acceleration still will not work at this point:

hongdou@ubuntu:~$ python3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-24 06:58:54.678214: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-03-24 06:59:46.151811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-24 06:59:46.339405: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 06:59:47.186830: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 06:59:47.186879: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 06:59:47.186906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, pci bus id: 0000:07:00.0) with AMDGPU version : gfx1032. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
False
>>> tf.test.is_gpu_available()
2023-03-24 07:00:26.074446: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:00:26.074583: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:00:26.074638: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:00:26.074666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, pci bus id: 0000:07:00.0) with AMDGPU version : gfx1032. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
False
>>> quit()

Roughly, this says that my card's target is gfx1032, which is not in the supported list; the closest supported target is gfx1030.
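
To make the message easier to read, the detected target and the supported list can be pulled out of the log line with two regular expressions. This is only an illustrative sketch: `parse_gfx_message` is a hypothetical helper, and it assumes the log wording shown in the session above, which may differ in other TensorFlow versions.

```python
import re

# The relevant log line from the session above, without the timestamp prefix.
LOG_LINE = (
    "Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, "
    "pci bus id: 0000:07:00.0) with AMDGPU version : gfx1032. "
    "The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a."
)


def parse_gfx_message(line):
    """Return (detected_target, supported_targets), or None if no match."""
    detected = re.search(r"AMDGPU version : (gfx[0-9a-f]+)", line)
    supported = re.search(r"supported AMDGPU versions are (.+?)\.?$", line)
    if not detected or not supported:
        return None
    targets = [t.strip() for t in supported.group(1).split(",")]
    return detected.group(1), targets


detected, supported = parse_gfx_message(LOG_LINE)
print(detected, "supported" if detected in supported else "not supported")
```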

The fix is to set an environment variable so that the card is treated as gfx1030:

export HSA_OVERRIDE_GFX_VERSION=10.3.0

Remember to add it to your .profile file so that the variable is set automatically in every new login session.
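
Appending the export line by hand is fine, but doing it more than once leaves duplicates in the file. The sketch below shows one way to add the line only when it is missing. `ensure_line` is a hypothetical helper, and the demo writes to a temporary file rather than the real .profile.

```python
import tempfile
from pathlib import Path

EXPORT_LINE = "export HSA_OVERRIDE_GFX_VERSION=10.3.0"


def ensure_line(path, line):
    """Append `line` to `path` unless an identical line already exists.

    Returns True if the file was modified.
    """
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}{line}\n")
    return True


# Demo on a throwaway file; for real use, pass Path.home() / ".profile".
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "profile"
    first = ensure_line(demo, EXPORT_LINE)
    second = ensure_line(demo, EXPORT_LINE)
    content = demo.read_text()

print(first, second)
```

The second call returns False because the line is already present, so the helper is safe to run repeatedly.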

You can check the GPU with this command:

hongdou@ubuntu:~$ rocm-smi


======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK  MCLK   Fan     Perf  PwrCap  VRAM%  GPU%
0    32.0c           3.0W    0Mhz  96Mhz  29.41%  auto  130.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================
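
If you want these values in a script, the two table rows can be split into fields. The following is a rough sketch that assumes the concise layout printed above, where columns are separated by at least two spaces. `parse_concise_info` is a made-up helper, and the layout may differ in other rocm-smi versions.

```python
import re

# Header and data row copied from the rocm-smi output above.
SAMPLE = """\
GPU  Temp (DieEdge)  AvgPwr  SCLK  MCLK   Fan     Perf  PwrCap  VRAM%  GPU%
0    32.0c           3.0W    0Mhz  96Mhz  29.41%  auto  130.0W    0%   0%
"""


def parse_concise_info(text):
    """Turn the header row and each GPU row into a list of dicts."""
    rows = [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


gpus = parse_concise_info(SAMPLE)
print(gpus[0]["Temp (DieEdge)"], gpus[0]["PwrCap"])
```

Splitting on runs of spaces is fragile: an empty cell would shift every later column, so treat this as a quick inspection aid rather than a robust parser.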

Using TensorFlow

Using TensorFlow from the official Docker image

Official guide: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream

It is simple to use; just run the following:

alias drun='sudo docker run \
      -it \
      --network=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --ipc=host \
      --shm-size 16G \
      --group-add video \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      -v $HOME/dockerx:/dockerx'

drun rocm/tensorflow:latest

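The alias is lost when the shell is closed. To keep it, add it to a shell startup file such as .profile (or .bashrc, which is the file bash reads for interactive non-login shells).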

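If your card is fairly new, such as the 6600 XT, and the GPU is not found inside the container, remember to set the environment variable there as well, since the shell inside Docker does not read the host's .profile: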

export HSA_OVERRIDE_GFX_VERSION=10.3.0
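
An alternative to exporting the variable by hand in every container is to pass it on the docker run command line with the `-e` flag. The sketch below rebuilds the same options as the drun alias as an argument list. `docker_run_argv` is a hypothetical helper, and the home directory is only an example value taken from the prompt in the sessions shown here.

```python
import os
import shlex


def docker_run_argv(image, gfx_override=None, home=None):
    """Build the docker run argument list used by the drun alias.

    When `gfx_override` is given, it is passed into the container as
    HSA_OVERRIDE_GFX_VERSION via docker's -e option.
    """
    home = home or os.path.expanduser("~")
    argv = [
        "sudo", "docker", "run",
        "-it",
        "--network=host",
        "--device=/dev/kfd",
        "--device=/dev/dri",
        "--ipc=host",
        "--shm-size", "16G",
        "--group-add", "video",
        "--cap-add=SYS_PTRACE",
        "--security-opt", "seccomp=unconfined",
        "-v", f"{home}/dockerx:/dockerx",
    ]
    if gfx_override:
        argv += ["-e", f"HSA_OVERRIDE_GFX_VERSION={gfx_override}"]
    argv.append(image)
    return argv


argv = docker_run_argv(
    "rocm/tensorflow:latest",
    gfx_override="10.3.0",
    home="/home/hongdou",  # example value only
)
# Print the command; subprocess.run(argv) would actually start the container.
print(shlex.join(argv))
```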

It works reasonably well:

root@ubuntu:~# python3
Python 3.9.16 (main, Dec  7 2022, 01:11:51)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-24 07:08:56.052778: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
True
>>>

Installing TensorFlow on the host

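I am not used to developing inside Docker: starting and stopping containers is a hassle, data has to be persisted explicitly, and running Jupyter means mapping ports.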

So I chose to install TensorFlow directly on the host:

pip3 install tensorflow-rocm

It is pleasant to use:

hongdou@ubuntu:~$ python3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-03-24 07:01:12.820135: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-03-24 07:01:22.728177: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-24 07:01:22.784620: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.832643: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.832695: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.833405: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.833440: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.833472: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-24 07:01:22.833501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:0 with 7676 MB memory:  -> device: 0, name: AMD Radeon RX 6600 XT, pci bus id: 0000:07:00.0
True
>>>
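
The deprecation warning in the sessions above recommends `tf.config.list_physical_devices('GPU')` instead of `tf.test.is_gpu_available()`. A minimal sketch of the same check with the newer call follows. `list_gpu_names` is a hypothetical wrapper, and it returns None when TensorFlow is not installed so that the snippet still runs on a machine without it.

```python
def list_gpu_names():
    """Return the names of visible GPUs, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


names = list_gpu_names()
if names is None:
    print("TensorFlow is not installed")
elif names:
    print("GPUs:", names)
else:
    print("No GPU visible; check HSA_OVERRIDE_GFX_VERSION")
```
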
Licensed under CC BY-NC-SA 4.0