nvidia-docker containers fail to start

Overview

Containers based on nvidia-docker do not start.

Starting a container produces the following error:

nvidia-docker run -it -p 9999:7777 tensorflow/tensorflow:latest-gpu /bin/bash
docker: Error response from daemon: create nvidia_driver_384.130: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/VolumeDriver.Create: dial unix /var/lib/nvidia-docker/nvidia-docker.sock: connect: no such file or directory.
See 'docker run --help'.

Diagnosis

The error shows that the nvidia-docker socket, /var/lib/nvidia-docker/nvidia-docker.sock, cannot be found:

root@deeplearning:~# cd /var/lib/nvidia-docker/
root@deeplearning:/var/lib/nvidia-docker# ls
  • Listing that directory shows that the nvidia-docker socket does not exist.
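
The same check can be scripted. The following is a minimal sketch, assuming a POSIX shell; check_socket is a hypothetical helper name, and the path is the one quoted in the error message. The -S test is true only when the path exists and is a UNIX domain socket, so a stale regular file with the same name would not pass.

```shell
#!/bin/sh
# check_socket is a hypothetical helper, not part of nvidia-docker.
# It reports whether the given path exists and is a UNIX domain socket.
check_socket() {
    if [ -S "$1" ]; then
        echo "socket present: $1"
    else
        echo "socket missing: $1"
        return 1
    fi
}

# Path taken from the error message. "|| true" keeps the script
# going when the socket is absent.
check_socket /var/lib/nvidia-docker/nvidia-docker.sock || true
```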

Next, I checked the service status:

root@deeplearning:/var/lib/nvidia-docker# systemctl status nvidia-docker
● nvidia-docker.service - NVIDIA Docker plugin
   Loaded: loaded (/lib/systemd/system/nvidia-docker.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-03-12 23:01:13 KST; 4 months 18 days ago
     Docs: https://github.com/NVIDIA/nvidia-docker/wiki
 Main PID: 1375 (code=exited, status=0/SUCCESS)

Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Provisi
Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Serving
Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Serving
Dec 08 16:55:24 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:24 Receive
Dec 08 16:55:24 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:24 Plugins
Dec 27 12:24:50 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 12:24:50 Receive
Dec 27 12:25:14 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 12:25:14 Receive
Dec 27 14:06:31 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 14:06:31 Receive
Dec 27 14:09:01 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 14:09:01 Receive
Mar 12 23:01:13 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2018/03/12 23:01:13 Success
  • The service is stopped (inactive since 2018-03-12). Why did it exit?
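
For use in a script, systemctl can report the two states separately. This is a minimal sketch, assuming systemd's is-active and is-enabled subcommands; unit_state is a hypothetical helper name. Note that "enabled" only means the unit is started at boot. It does not bring the service back after it has been stopped, which is consistent with the enabled-but-inactive state in the status output.

```shell
#!/bin/sh
# unit_state is a hypothetical helper: it prints
# "<unit>: <active state>, <enabled state>".
unit_state() {
    # is-active and is-enabled exit non-zero for states such as
    # "inactive" or "disabled", so only their text output is used.
    active=$(systemctl is-active "$1" 2>/dev/null) || true
    enabled=$(systemctl is-enabled "$1" 2>/dev/null) || true
    echo "$1: ${active:-unknown}, ${enabled:-unknown}"
}

unit_state nvidia-docker
```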

I checked the logs:

root@deeplearning:/var/log/upstart# journalctl -u nvidia-docker
-- Logs begin at Fri 2017-12-08 16:53:51 KST, end at Tue 2018-07-31 16:00:18 KST. --
Dec 08 16:55:20 deeplearning systemd[1]: Starting NVIDIA Docker plugin...
Dec 08 16:55:20 deeplearning systemd[1]: Started NVIDIA Docker plugin.
Dec 08 16:55:20 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:20 Loading NVIDIA unified memory
Dec 08 16:55:20 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:20 Loading NVIDIA management library
Dec 08 16:55:21 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:21 Discovering GPU devices
Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Provisioning volumes at /var/lib/nvidia-docker/volumes
Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Serving plugin API at /var/lib/nvidia-docker
Dec 08 16:55:22 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:22 Serving remote API at localhost:3476
Dec 08 16:55:24 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:24 Received activate request
Dec 08 16:55:24 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/08 16:55:24 Plugins activated [VolumeDriver]
Dec 27 12:24:50 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 12:24:50 Received mount request for volume 'nvidia_driver_384.98'
Dec 27 12:25:14 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 12:25:14 Received unmount request for volume 'nvidia_driver_384.98'
Dec 27 14:06:31 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 14:06:31 Received mount request for volume 'nvidia_driver_384.98'
Dec 27 14:09:01 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2017/12/27 14:09:01 Received mount request for volume 'nvidia_driver_384.98'
Mar 12 23:01:13 deeplearning nvidia-docker-plugin[1375]: /usr/bin/nvidia-docker-plugin | 2018/03/12 23:01:13 Successfully terminated
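  • The log shows nothing unusual: the only entry after December is the plugin's clean shutdown ("Successfully terminated") on 2018-03-12, and no error is recorded.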
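
Because the unit's own log does not say why it stopped, one option is to look at the entries of every unit around the shutdown time. This is a sketch only, assuming the journal still covers that period; journal_window is a hypothetical helper name, and the ten-minute window is an arbitrary choice.

```shell
#!/bin/sh
# journal_window is a hypothetical helper: it prints journal entries
# from all units between two timestamps.
journal_window() {
    journalctl --no-pager --since "$1" --until "$2"
}

# Window around the "Successfully terminated" entry (2018-03-12 23:01:13).
# "|| true" tolerates hosts where journalctl is unavailable.
journal_window "2018-03-12 22:56:00" "2018-03-12 23:06:00" || true
```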

Resolution

I started the nvidia-docker service and confirmed that it is running:

root@deeplearning:/var/lib/nvidia-docker# systemctl start nvidia-docker

root@deeplearning:/var/lib/nvidia-docker# systemctl status nvidia-docker
● nvidia-docker.service - NVIDIA Docker plugin
   Loaded: loaded (/lib/systemd/system/nvidia-docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-07-31 15:50:09 KST; 8s ago
     Docs: https://github.com/NVIDIA/nvidia-docker/wiki
  Process: 13348 ExecStartPost=/bin/sh -c /bin/echo unix://$SOCK_DIR/nvidia-docker.sock > $SPEC_FILE (code=exited, s
  Process: 13339 ExecStartPost=/bin/sh -c /bin/mkdir -p $( dirname $SPEC_FILE ) (code=exited, status=0/SUCCESS)
 Main PID: 13331 (nvidia-docker-p)
    Tasks: 9
   Memory: 37.5M
      CPU: 2.148s
   CGroup: /system.slice/nvidia-docker.service
           └─13331 /usr/bin/nvidia-docker-plugin -s /var/lib/nvidia-docker

Jul 31 15:50:09 deeplearning systemd[1]: Starting NVIDIA Docker plugin...
Jul 31 15:50:09 deeplearning systemd[1]: Started NVIDIA Docker plugin.
Jul 31 15:50:09 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:09 Loadin
Jul 31 15:50:09 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:09 Loadin
Jul 31 15:50:10 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:10 Discov
Jul 31 15:50:11 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:11 Provis
Jul 31 15:50:11 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:11 Servin
Jul 31 15:50:11 deeplearning nvidia-docker-plugin[13331]: /usr/bin/nvidia-docker-plugin | 2018/07/31 15:50:11 Servin

The socket file has been created:

root@deeplearning:/var/lib/nvidia-docker# cd /var/lib/nvidia-docker/
root@deeplearning:/var/lib/nvidia-docker# ls -l
total 4
srwxr-xr-x 1 nvidia-docker nvidia-docker    0 Jul 31 15:50 nvidia-docker.sock
drwxr-xr-x 3 nvidia-docker nvidia-docker 4096 Apr 12  2017 volumes
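
In a script, starting the service and checking for the socket may need a short wait, since the status output shows roughly two seconds between the service start and the plugin's last start-up messages. This is a minimal sketch, assuming a POSIX shell; wait_for_socket is a hypothetical helper name, and the default timeout of 10 seconds is an arbitrary choice.

```shell
#!/bin/sh
# wait_for_socket is a hypothetical helper: it polls once per second
# until PATH is a UNIX domain socket, and gives up after TIMEOUT seconds.
# Usage: wait_for_socket PATH [TIMEOUT]
wait_for_socket() {
    path=$1
    timeout=${2:-10}
    elapsed=0
    while [ ! -S "$path" ]; do
        # Stop waiting once the timeout has been reached.
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

It could be combined with the start command as: systemctl start nvidia-docker && wait_for_socket /var/lib/nvidia-docker/nvidia-docker.sock 10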

I started the container:

root@deeplearning:/var/lib/nvidia-docker# nvidia-docker run -it -p 9999:7777 tensorflow/tensorflow:latest-gpu /bin/bash

root@6ef7cb742eac:/notebooks#
  • The prompt shows that bash is running inside the container.

The container is listed as running:

root@deeplearning:/var/lib/nvidia-docker# nvidia-docker ps -a
CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS                       PORTS                                                                                                                  NAMES
6ef7cb742eac        tensorflow/tensorflow:latest-gpu          "/bin/bash"              42 minutes ago      Up 59 seconds                6006/tcp, 8888/tcp, 0.0.0.0:9999->7777/tcp                                                                             adoring_mirzakhani
...
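
A running bash prompt confirms that the container starts, but not that the GPU is visible inside it. One possible extra check, sketched here under assumptions, is to run nvidia-smi in a throwaway container. gpu_smoke_test is a hypothetical helper name, and the sketch assumes that nvidia-smi is reachable on the PATH of the tensorflow/tensorflow:latest-gpu image once the driver volume is mounted, which was not verified in this note.

```shell
#!/bin/sh
# gpu_smoke_test is a hypothetical helper: it runs nvidia-smi in a
# temporary container (--rm deletes the container when it exits).
gpu_smoke_test() {
    if ! command -v nvidia-docker >/dev/null 2>&1; then
        echo "nvidia-docker not found" >&2
        return 1
    fi
    # Assumption: nvidia-smi is on PATH inside this image.
    nvidia-docker run --rm tensorflow/tensorflow:latest-gpu nvidia-smi
}

gpu_smoke_test || echo "GPU smoke test did not pass"
```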

I stopped and removed the test container I had just created:

root@deeplearning:/var/lib/nvidia-docker# nvidia-docker stop 6ef7cb742eac
6ef7cb742eac

root@deeplearning:/var/lib/nvidia-docker# nvidia-docker rm 6ef7cb742eac
6ef7cb742eac

root@deeplearning:/var/lib/nvidia-docker# nvidia-docker ps -a |grep 6ef7cb742eac

Containers now start normally.