GPU Machine Deployment and Initialization in the China Unicom AI Environment

1) Install dependencies

# Build dependencies required by the installer

apt install gcc

apt install make
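
Before moving on to the driver, it can help to confirm that both tools actually ended up on PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of any package.

```shell
# List any of the given commands that are not available on PATH.
# (missing_tools is a hypothetical helper name used only for illustration.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools gcc make)
if [ -n "$missing" ]; then
    # Unquoted on purpose so that several names print on one line.
    echo "Still missing:" $missing
else
    echo "gcc and make are both available"
fi
```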

2) Install the driver

System:

root@ubuntu:~# lsb_release -a

No LSB modules are available.

Distributor ID: Ubuntu

Description: Ubuntu 20.04.6 LTS

Release: 20.04

Codename: focal

Download:

wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run

Make the installer executable:

chmod +x cuda_12.2.2_535.104.05_linux.run

Install:

./cuda_12.2.2_535.104.05_linux.run

root@ubuntu:~# ./cuda_12.2.2_535.104.05_linux.run

===========

= Summary =

===========

Driver: Installed

Toolkit: Installed in /usr/local/cuda-12.2/

Please make sure that

- PATH includes /usr/local/cuda-12.2/bin

- LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin

To uninstall the NVIDIA Driver, run nvidia-uninstall

Logfile is /var/log/cuda-installer.log
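
The installer summary asks for PATH and LD_LIBRARY_PATH to include the toolkit directories. One way to do that for the current shell session is sketched below; the variable name CUDA_HOME is an assumption chosen for readability, and the prefix is the one reported in the summary.

```shell
# Assumed install prefix, taken from the installer summary.
CUDA_HOME=/usr/local/cuda-12.2

# Prepend the toolkit directories for the current shell session.
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Only query the compiler if it is actually present.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi
```

To keep the setting across logins, the two export lines could be appended to a shell startup file such as ~/.bashrc; which file is appropriate depends on how the machine is used.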

3) Install nvidia-container-toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
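
The distribution variable decides which repository list is downloaded, so it is worth printing before running curl. The sketch below wraps the same $ID$VERSION_ID lookup in a function; distribution_of is a hypothetical helper name, and the optional file argument exists only so the logic can be tried against a copy of os-release.

```shell
# Print "$ID$VERSION_ID" from an os-release style file (default: /etc/os-release).
# The file is sourced in a subshell so its variables do not leak into the caller.
distribution_of() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

if [ -r /etc/os-release ]; then
    distribution_of
fi
```

On the Ubuntu 20.04 system shown in step 2 this prints ubuntu20.04.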

4) Check the GPU cards

nvidia-smi

View card information

nvidia-smi --query-gpu=gpu_name,gpu_bus_id --format=csv
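
The CSV output is easy to post-process, for example to check that the expected number of cards is visible. A minimal sketch follows; count_gpus is a hypothetical helper that assumes the first line of the CSV is the header row.

```shell
# Count data rows in nvidia-smi CSV output read from stdin.
# The first line is treated as the header and skipped; blank lines are ignored.
count_gpus() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Only run the query when the driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=gpu_name,gpu_bus_id --format=csv | count_gpus
fi
```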

View GPU interconnect topology information

root@ubuntu:~# nvidia-smi topo -m

root@ubuntu:~# nvidia-smi nvlink --status

root@ubuntu:~# nvidia-smi topo -p2p n

Management tool for direct GPU-to-GPU connections (nvidia-fabricmanager)

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-535_535.104.05-1_amd64.deb

sudo apt-get install ./nvidia-fabricmanager-535_535.104.05-1_amd64.deb

sudo systemctl start nvidia-fabricmanager

sudo systemctl enable nvidia-fabricmanager

systemctl status nvidia-fabricmanager
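
Fabric Manager is generally expected to match the installed driver version (535.104.05 in this guide), so a mismatch is a common reason for the service failing to start. The check below is a sketch under assumptions: versions_match is a hypothetical helper, and the package name nvidia-fabricmanager-535 is inferred from the .deb file name above.

```shell
# Succeed when a package version such as "535.104.05-1" matches
# a driver version such as "535.104.05".
versions_match() {
    driver_version=$1
    package_version=${2%%-*}   # drop the Debian revision suffix ("-1")
    [ "$driver_version" = "$package_version" ]
}

# Only run the comparison when both tools are available on this machine.
if command -v nvidia-smi >/dev/null 2>&1 && command -v dpkg-query >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # Package name assumed from the downloaded .deb file name.
    fabric=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-535 2>/dev/null)
    if versions_match "$driver" "$fabric"; then
        echo "Fabric Manager matches driver $driver"
    else
        echo "Version mismatch: driver=$driver fabricmanager=$fabric"
    fi
fi
```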