# Kubernetes GPU Sharing in Practice

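## Environment Setup
### 1. Prepare the GPU nodes
GPU sharing depends on the NVIDIA driver and nvidia-docker2, both of which must be installed in advance. For NVIDIA driver installation, refer to nvidia-docker.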

Install the NVIDIA driver and nvidia-docker2
# If nvidia-docker is already installed, remove it first
>$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
>$ sudo yum remove nvidia-docker -y

# Add the nvidia-docker2 repo
>$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
>$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
     sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 and reload the Docker configuration
>$ sudo yum install -y nvidia-docker2
>$ sudo pkill -SIGHUP dockerd

# Test the nvidia-smi command in a cuda:9.0 container
>$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Set the NVIDIA runtime as Docker's default runtime

Edit the Docker daemon config file, creating it if it does not exist.
File path: /etc/docker/daemon.json
File content:
```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
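
If /etc/docker/daemon.json already holds other settings, replacing it wholesale with the content above would drop them. The following is a minimal Python sketch of merging the NVIDIA runtime into an existing configuration instead. The helper name `with_nvidia_default_runtime` and the `log-driver` entry in the example are hypothetical and used only for illustration; they are not part of Docker or nvidia-docker2.

```python
import copy
import json

# Runtime entry taken from the daemon.json content shown above.
NVIDIA_RUNTIME = {
    "path": "/usr/bin/nvidia-container-runtime",
    "runtimeArgs": [],
}


def with_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a daemon.json dict with nvidia as the default runtime.

    Existing keys and any other registered runtimes are preserved.
    """
    merged = copy.deepcopy(config)
    merged["default-runtime"] = "nvidia"
    merged.setdefault("runtimes", {})["nvidia"] = copy.deepcopy(NVIDIA_RUNTIME)
    return merged


# Example with a made-up existing setting that must survive the merge.
existing = {"log-driver": "json-file"}
print(json.dumps(with_nvidia_default_runtime(existing), indent=4))
```

Reading the file, writing the result back, and reloading the Docker daemon are deliberately left out of the sketch.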
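### 2. Deploy the GPU share scheduler
Clone the gpushare-scheduler-extender project from its git repository to your local machine.

Go to the project's config directory and copy scheduler-policy-config.json to the /etc/kubernetes/ directory on the k8s master node.

Deploy gpushare-schd-extender on Kubernetes: copy gpushare-schd-extender.yaml from the config directory to the cluster's master node, then run kubectl apply -f gpushare-schd-extender.yaml to complete the deployment.
### 3. Modify the scheduler configuration
The goal of this step is to add scheduler-policy-config.json to the default scheduler's configuration (/etc/kubernetes/manifests/kube-scheduler.yaml).

Step 1: add the policy config file parameter to the scheduler's arguments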

```yaml
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json

```
Step 2: mount the policy file as a volume in the Pod spec

```yaml
# Under the scheduler container's volumeMounts:
- mountPath: /etc/kubernetes/scheduler-policy-config.json
  name: scheduler-policy-config
  readOnly: true
# Under the Pod spec's volumes:
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.json
    type: FileOrCreate
  name: scheduler-policy-config
```
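If the scheduler has not been modified or configured before, you can instead use the kube-scheduler.yaml in the config directory directly (copy it to /etc/kubernetes/manifests).

⚠️ Note:

If the default Kubernetes scheduler is deployed as a static pod, do not edit the yaml file in place inside /etc/kubernetes/manifests. Prepare the edited yaml file outside that directory, then copy it into /etc/kubernetes/manifests/; Kubernetes will then automatically update the default static pod from the file.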
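
The two scheduler changes described in this section can be sketched as a small Python function that patches an already-parsed manifest. This is an illustration only: it assumes the scheduler's arguments live in the `command` list of the first container (a common layout for static pod manifests, but not confirmed by this guide), and the helper name `add_policy_config` is hypothetical. Loading and dumping the YAML file is omitted.

```python
import copy

POLICY_PATH = "/etc/kubernetes/scheduler-policy-config.json"
VOLUME_NAME = "scheduler-policy-config"


def add_policy_config(manifest: dict) -> dict:
    """Return a copy of a kube-scheduler Pod manifest with the policy file wired in."""
    patched = copy.deepcopy(manifest)
    spec = patched["spec"]
    # Assumption: kube-scheduler is the first container and takes its flags via "command".
    container = spec["containers"][0]

    # Step 1: add the policy config file flag (only once).
    flag = f"--policy-config-file={POLICY_PATH}"
    command = container.setdefault("command", [])
    if flag not in command:
        command.append(flag)

    # Step 2a: mount the file into the container.
    mounts = container.setdefault("volumeMounts", [])
    if not any(m.get("name") == VOLUME_NAME for m in mounts):
        mounts.append({"mountPath": POLICY_PATH, "name": VOLUME_NAME, "readOnly": True})

    # Step 2b: declare the matching hostPath volume on the Pod.
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == VOLUME_NAME for v in volumes):
        volumes.append({
            "hostPath": {"path": POLICY_PATH, "type": "FileOrCreate"},
            "name": VOLUME_NAME,
        })
    return patched


# Stripped-down stand-in for the parsed kube-scheduler.yaml.
manifest = {
    "spec": {
        "containers": [{"name": "kube-scheduler", "command": ["kube-scheduler"]}],
    }
}
patched = add_policy_config(manifest)
print(patched["spec"]["containers"][0]["command"])
```

The function returns a patched copy and leaves its input untouched, which fits the advice to prepare the file outside the manifests directory and copy it in afterwards. Applying it twice gives the same result as applying it once.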

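### 4. Deploy the Device Plugin
Clone the gpushare-device-plugin project from its git repository to your local machine.

Copy device-plugin-rbac.yaml and device-plugin-ds.yaml from the repository root to the master node, then run kubectl apply -f device-plugin-rbac.yaml and kubectl apply -f device-plugin-ds.yaml to complete the deployment.

⚠️ Note:

Remove the default GPU device plugin before deploying. For example, if you are currently using nvidia-device-plugin, run kubectl delete ds -n kube-system nvidia-device-plugin-daemonset to delete it.

### 5. Add the gpushare label to nodes that need GPU sharing
Add the label gpushare=true to every node where the device plugin should be installed (that is, every node that shares its GPUs).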

>$ kubectl label node <target_node> gpushare=true
### 6. Install the kubectl extension tool
Download kubectl-inspect-gpushare:
>$ wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare

Install kubectl-inspect-gpushare on the k8s master node: copy it to the /usr/bin directory and make it executable.
>$ chmod u+x /usr/bin/kubectl-inspect-gpushare

⚠️ Note:

If your kubectl version is lower than 1.12, upgrade kubectl first.

## Service Deployment and Usage
### 1. Query shared GPU memory allocation
>$ kubectl inspect gpushare

For more details, please run kubectl inspect gpushare -d

### 2. Request and use a shared GPU in a container
To request a share of a GPU, specify the aliyun.com/gpu-mem resource in the container's resource limits:

```yaml
apiVersion: apps/v1beta1  # use apps/v1 on Kubernetes 1.16+, where apps/v1beta1 is no longer served
kind: StatefulSet
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 3
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the StatefulSet finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GPU memory in GiB
            aliyun.com/gpu-mem: 3
```
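
To make the arithmetic of memory-based sharing concrete, here is a small first-fit sketch in Python. It is not the extender's actual placement algorithm, which this guide does not describe; it only shows how several requests measured in GiB can share one GPU. The node layout (two GPUs with 15 GiB each) and the function name `first_fit` are assumptions made for the example.

```python
from typing import List, Optional


def first_fit(free_gib: List[int], request_gib: int) -> Optional[int]:
    """Place a request on the first GPU with enough free memory.

    Returns the index of the chosen GPU, or None if no GPU can hold
    the request. free_gib is updated in place.
    """
    for index, free in enumerate(free_gib):
        if free >= request_gib:
            free_gib[index] = free - request_gib
            return index
    return None


# Hypothetical node: two GPUs with 15 GiB each.
# Three replicas each request 3 GiB, as in the StatefulSet above.
free = [15, 15]
placements = [first_fit(free, 3) for _ in range(3)]
print(placements, free)
```

With these numbers all three replicas land on GPU 0, leaving 6 GiB free there and GPU 1 untouched.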
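### 3. Limit GPU memory usage
To limit GPU memory usage from inside the application, use the following environment variables:

ALIYUN_COM_GPU_MEM_DEV: total GPU memory of the current physical device (in GiB)

ALIYUN_COM_GPU_MEM_CONTAINER: GPU memory allocated to the current container (in GiB)

Example: limit GPU memory by setting a fraction through the TensorFlow API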

```python
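import tensorflow as tf  # TensorFlow 1.x API (tf.ConfigProto, tf.Session)

# 3 and 15 presumably stand for the container's share and the device total, in GiB.
fraction = round(3 * 0.7 / 15, 1)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = fraction
sess = tf.Session(config=config)

# Placeholder op so the loop is runnable; replace it with the real workload.
c = tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))
# Runs the op.
while True:
    sess.run(c)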
⚠️ Note:

The factor 0.7 is there because TensorFlow's control over GPU memory is not exact; multiplying by 0.7 is recommended so that the upper limit is not exceeded.
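
The TensorFlow example hard-codes 3 and 15. A minimal sketch of deriving the fraction from the two environment variables instead could look like the following. The helper name `gpu_memory_fraction` and the choice to round down to two decimals are assumptions made here, not part of the device plugin; rounding down is used so that rounding never pushes the limit above the intended value.

```python
import math
import os


def gpu_memory_fraction(environ=os.environ, headroom: float = 0.7) -> float:
    """Compute a per-process GPU memory fraction from the gpushare variables.

    The result is rounded down to two decimals so that rounding does not
    raise the limit above headroom * container / device.
    """
    container_gib = float(environ["ALIYUN_COM_GPU_MEM_CONTAINER"])
    device_gib = float(environ["ALIYUN_COM_GPU_MEM_DEV"])
    if device_gib <= 0:
        raise ValueError("ALIYUN_COM_GPU_MEM_DEV must be positive")
    raw = headroom * container_gib / device_gib
    return math.floor(raw * 100) / 100


# Stand-in values matching the example above: 3 GiB out of 15 GiB.
example_env = {"ALIYUN_COM_GPU_MEM_CONTAINER": "3", "ALIYUN_COM_GPU_MEM_DEV": "15"}
print(gpu_memory_fraction(example_env))
```

Inside a container scheduled through gpushare, calling `gpu_memory_fraction()` with no arguments reads the real environment, and the returned value can be assigned to `per_process_gpu_memory_fraction`.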