I. Preparation

1. Add the Aliyun package repositories

curl -o /etc/yum.repos.d/epel.repo http://mirrors.aliyun.com/repo/epel-7.repo
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
sed -i -e '/mirrors.cloud.aliyuncs.com/d' -e '/mirrors.aliyuncs.com/d' /etc/yum.repos.d/CentOS-Base.repo
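The sed command above strips the mirrors.cloud.aliyuncs.com and mirrors.aliyuncs.com entries from the repo file. A minimal sketch of a check that the cleanup worked is shown below; the function name count_internal_mirrors is hypothetical and only for illustration.

```shell
# Hypothetical helper (not part of any package): count the lines in a repo
# file that still mention the two mirror hosts removed by the sed command.
count_internal_mirrors() {
    local repo_file="$1"
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's exit status clean.
    grep -c -F -e 'mirrors.cloud.aliyuncs.com' -e 'mirrors.aliyuncs.com' "$repo_file" || true
}

# Usage on the real machine (expected to print 0 after the sed step):
#   count_internal_mirrors /etc/yum.repos.d/CentOS-Base.repo
```

A count of 0 suggests the file now references only the public mirrors.aliyun.com host.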

2. Install the base packages

yum -y install apr autoconf automake bash bash-completion bind-utils bzip2 bzip2-devel chrony cmake coreutils curl curl-devel dbus dbus-libs dhcp-common dos2unix e2fsprogs e2fsprogs-devel file file-libs freetype freetype-devel gcc gcc-c++ gdb glib2 glib2-devel glibc glibc-devel gmp gmp-devel gnupg iotop kernel kernel-devel kernel-doc kernel-firmware kernel-headers krb5-devel libaio-devel libcurl libcurl-devel libevent libevent-devel libffi-devel libidn libidn-devel libjpeg libjpeg-devel libmcrypt libmcrypt-devel libpng libpng-devel libxml2 libxml2-devel libxslt libxslt-devel libzip libzip-devel lrzsz lsof make microcode_ctl mysql mysql-devel ncurses ncurses-devel net-snmp net-snmp-libs net-snmp-utils net-tools nfs-utils nss nss-sysinit nss-tools openldap-clients openldap-devel openssh openssh-clients openssh-server openssl openssl-devel patch policycoreutils polkit procps readline-devel rpm rpm-build rpm-libs rsync sos sshpass strace sysstat tar tmux tree unzip uuid uuid-devel vim wget yum-utils zip zlib* jq

3. Enable time synchronization

systemctl start chronyd && systemctl enable chronyd

4. Reboot

reboot

5. Update all packages

yum update -y

6. Reboot again

reboot

II. Install the GPU driver

1. Disable the nouveau driver that ships with the system

# Write the blacklist configuration
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf

# Back up the current initramfs image
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

# Rebuild the initramfs image
sudo dracut --force

# Reboot
reboot

# Check whether nouveau is loaded; empty output means it is disabled
lsmod | grep nouveau
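For scripted installs, the "empty output" check can be turned into an exit status. The following is a minimal sketch; nouveau_disabled is a hypothetical helper name, and it reads lsmod output on stdin so that it can be tried without root.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and succeeds (exit 0)
# only if no module line starts with "nouveau".
nouveau_disabled() {
    ! grep -q '^nouveau[[:space:]]'
}

# Usage on the real machine:
#   lsmod | nouveau_disabled && echo "nouveau is disabled"
```

Anchoring the pattern at the start of the line avoids matching other modules that merely list nouveau in their "Used by" column.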

2. Install DKMS

DKMS stands for Dynamic Kernel Module Support. It maintains out-of-tree kernel drivers and automatically rebuilds their modules when the kernel version changes.

yum -y install dkms

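3. Copy the driver installer

If you have not downloaded the installer in advance, get it from the official NVIDIA driver download page.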

cp NVIDIA-Linux-x86_64-418.226.00.run /data/

4. Run the installer

sudo sh NVIDIA-Linux-x86_64-418.226.00.run -no-x-check -no-nouveau-check -no-opengl-files
# -no-x-check         skip the check for a running X server
# -no-nouveau-check   skip the check for a loaded nouveau driver
# -no-opengl-files    install only the driver files, not the OpenGL files

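5. Follow the installer prompts, choosing yes / OK at each step

6. Verify the installation

nvidia-smi

7. Output like the following means the driver is installed. This sample shows driver 410.129, so it appears to have been captured with a different installer than the 418.226.00 package used above; the version fields reflect whatever you installed.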

Wed Jul  7 11:11:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:41:00.0 Off |                    0 |
| N/A   94C    P0    36W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
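To compare the installed driver against the package you intended to install, the version can be pulled out of the header line of this table. This is a small sketch under the assumption that the header keeps the "Driver Version: X" layout shown above; driver_version is a hypothetical helper name.

```shell
# Hypothetical helper: reads the `nvidia-smi` table on stdin and prints the
# value that follows "Driver Version:" on the header line.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p' | head -n 1
}

# Usage on the real machine:
#   nvidia-smi | driver_version
# nvidia-smi can also print this field directly:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```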

8. Verify that the card is detected

lspci | grep -i nvidia

41:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

8.1 If lspci is reported as "command not found", install pciutils

yum install -y pciutils

III. Download, build and install a newer gcc from source

1. Build and install

cd /data/
wget https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-8.5.0/gcc-8.5.0.tar.gz 
tar -xvf gcc-8.5.0.tar.gz
cd gcc-8.5.0
./contrib/download_prerequisites
mkdir build
cd build
../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
make -j 16
make install

2. Point the libstdc++ symlink at the new library

cp /usr/local/lib64/libstdc++.so.6.0.25 /lib64
cd /lib64
rm -rf libstdc++.so.6
ln -s libstdc++.so.6.0.25 libstdc++.so.6
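The copy, remove and relink steps above can be wrapped in one function so they are easier to repeat on several machines. This is only a sketch: install_libstdcxx is a hypothetical name, and the paths in the usage comment are the ones from this guide.

```shell
# Hypothetical helper: copy a libstdc++ build into a directory and repoint
# that directory's libstdc++.so.6 symlink at the copied file.
install_libstdcxx() {
    local src="$1"       # e.g. /usr/local/lib64/libstdc++.so.6.0.25
    local dest_dir="$2"  # e.g. /lib64
    local name
    name=$(basename "$src")
    cp "$src" "$dest_dir/$name" || return 1
    # -f replaces an existing link; -n stops ln from following the old link.
    ln -sfn "$name" "$dest_dir/libstdc++.so.6"
}

# Usage on the real machine (as root):
#   install_libstdcxx /usr/local/lib64/libstdc++.so.6.0.25 /lib64
```

Using ln -sfn removes the need for the separate rm -rf step, which narrows the window in which the symlink is missing.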

3. Check the version

gcc -v

4. Output like the following means the installation succeeded

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/8.5.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
Thread model: posix
gcc version 8.5.0 (GCC)
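Instead of reading the gcc -v output by eye, a script can compare the reported version against a minimum. The sketch below relies on sort -V from GNU coreutils; version_ge is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to
# version $2 under `sort -V` ordering (GNU coreutils).
version_ge() {
    # After sorting, the smaller version comes first; $1 >= $2 exactly
    # when that first line is $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on the real machine (gcc -dumpversion may print only the major
# number on some builds, which still compares correctly against "8"):
#   version_ge "$(gcc -dumpversion)" 8 && echo "gcc is new enough"
```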

IV. Install NVIDIA CUDA

1. Confirm that nouveau is disabled

No output means nouveau is already disabled.

[root@localhost opt]# lsmod | grep nouveau

2. Set the default boot target to multi-user (text mode)

systemctl set-default multi-user.target

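3. Download the CUDA installer

You can also download it on another machine from the official CUDA download page and copy it over.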

wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run

4. Run the installer

sudo sh cuda_10.1.243_418.87.00_linux.run

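5. In the installer, type accept on the first screen, then choose Install on the second. Because the driver was already installed earlier, you may want to deselect the bundled Driver component before choosing Install.

6. Add CUDA to the environment variables

6.0 Open the configuration file

vim /etc/profile

6.1 Add the following four lines at the top

Press i, paste the four lines below, press Esc, then type :wq to save and quit.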

PATH=$PATH:/usr/local/cuda-10.1/bin/
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64/
export PATH
export LD_LIBRARY_PATH

6.2 Apply the changes

source /etc/profile
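Because /etc/profile is sourced on every login (and again by the source command above), plain PATH=$PATH:... assignments can add the same directory more than once, and appending to an empty LD_LIBRARY_PATH leaves a leading ":" (an empty entry, which the dynamic loader treats as the current directory). A minimal sketch of an idempotent alternative follows; path_append is a hypothetical helper, not something CUDA provides.

```shell
# Hypothetical helper: print list $1 with directory $2 appended, unless $2
# is already one of its colon-separated entries.
path_append() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;            # already present
        *)        printf '%s\n' "${1:+$1:}$2" ;;   # no leading ":" if $1 is empty
    esac
}

# Usage in /etc/profile (paths as used in this guide):
#   PATH=$(path_append "$PATH" /usr/local/cuda-10.1/bin)
#   LD_LIBRARY_PATH=$(path_append "$LD_LIBRARY_PATH" /usr/local/cuda-10.1/lib64)
#   export PATH LD_LIBRARY_PATH
```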

7. Verify the installation

The command below prints the installed CUDA compiler version.

nvcc -V

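V. Install NVIDIA cuDNN

1. Download cuDNN

Download the cuDNN release that matches your CUDA version from the NVIDIA cuDNN download page (an NVIDIA developer account is required).

2. Upload and extract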

cd /data/
tar xzvf cudnn-10.1-linux-x64-v7.6.5.32.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
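To confirm which cuDNN release ended up under /usr/local/cuda, the version macros in the copied header can be read back. This sketch assumes the cuDNN 7.x layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn.h (later major releases keep them in a separate cudnn_version.h); cudnn_version is a hypothetical helper name.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from the #define lines
# of a cuDNN header file given as $1.
cudnn_version() {
    local header="$1"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}

# Usage on the real machine:
#   cudnn_version /usr/local/cuda/include/cudnn.h
```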

VI. Install Docker

1. Remove old versions

See the official Docker installation guide for reference.

sudo yum remove docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-engine

2. Set up the Docker repository

sudo yum install -y yum-utils

sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

3. Configure the repositories

Keep the nightly and test repositories disabled (they are disabled by default)

sudo yum-config-manager --disable docker-ce-nightly
sudo yum-config-manager --disable docker-ce-test

4. Install the latest Docker Engine

sudo yum install docker-ce docker-ce-cli containerd.io

5. Start Docker

sudo systemctl start docker

6. Verify that Docker is installed correctly

The installation succeeded if the command below prints the "Hello from Docker!" greeting.

sudo docker run hello-world

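VII. Install nvidia-docker

See the official NVIDIA installation guide for reference.
Stock Docker cannot give containers access to the GPU on its own, so NVIDIA provides nvidia-docker, an add-on for Docker that exposes the host GPUs to containers.

1. Install dependencies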

sudo dnf install -y tar bzip2 make automake gcc gcc-c++ vim pciutils elfutils-libelf-devel libglvnd-devel iptables

1.1 Possible error

sudo: dnf: command not found
If this appears, run the command below and then repeat the install above

yum install dnf

2. Install Docker CE and nvidia-docker2

sudo yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo

sudo yum repolist -v

sudo yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm

sudo yum install docker-ce -y

sudo systemctl --now enable docker

sudo docker run --rm hello-world

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum clean expire-cache

sudo yum install -y nvidia-docker2

sudo systemctl restart docker

sudo docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi

3. If the last command prints the nvidia-smi table from inside the container, the installation succeeded
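The nvidia-docker repository URL used in step 2 is built from the ID and VERSION_ID fields of /etc/os-release. If the curl step returns an error page instead of a repo file, it can help to print that value first. The sketch below does the same lookup in a subshell; distribution_id is a hypothetical helper name.

```shell
# Hypothetical helper: print ID followed by VERSION_ID from an os-release
# style file given as $1 (use an absolute path).
distribution_id() {
    # Source the file in a subshell so its variables do not leak into
    # the calling shell.
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Usage on the real machine:
#   distribution_id /etc/os-release
```

On a CentOS 7 host this should match the path segment that appears in the repository URL.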