Troubleshooting

Some of the problems you could run into when using the agent, along with solutions

Failed to initialize NVML: Unknown Error

This error affects any utility or library that uses NVML (PyTorch, nvidia-smi, pynvml, etc.). It often appears unexpectedly and prevents applications from using the GPU until it is fixed.

Quick solution

If the error only appears inside the container, you can quickly fix it by restarting the container. If the error also appears on the host after running nvidia-smi, fix it by rebooting the host.
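
A minimal sketch of both quick fixes; <your_container_name> is a placeholder for the name of the affected container (you can find it with docker ps):

~$ docker restart <your_container_name>   # if the error appears only inside the container
~$ sudo reboot                            # if nvidia-smi fails on the host as well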

Proper solution

  1. If the error has not yet appeared, you can check whether your system is affected by this problem:

    • run the agent's Docker container on the host PC;

    • run sudo systemctl daemon-reload on the host;

    • execute nvidia-smi inside the agent container and check whether the NVML initialization error appears.

  2. Add the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] to the /etc/docker/daemon.json file:

~$ cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
  3. Restart Docker with sudo systemctl restart docker (the full command sequence is sketched below).
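
Put together, the whole procedure looks roughly like this; <agent_container_name> is a placeholder for the name of your agent container:

~$ sudo systemctl daemon-reload                        # step 1: reload systemd units on the host
~$ docker exec -it <agent_container_name> nvidia-smi   # step 1: check NVML inside the agent container
Failed to initialize NVML: Unknown Error
# ^ this output means the system is affected
# step 2: add "exec-opts": ["native.cgroupdriver=cgroupfs"] to /etc/docker/daemon.json (see above)
~$ sudo systemctl restart docker                       # step 3: restart Docker to apply the change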

You can also try other official NVIDIA fixes for a specific Docker container, or dig deeper into the problem by reading this discussion or this official description.

CUDA Out Of Memory Error

This error can appear in any training app.

Solution

  1. Check the amount of free GPU memory by running the nvidia-smi command in your machine's terminal. This shows how much GPU memory needs to be freed before you can train your machine learning model (see the command sketch after this list).

  2. Stop unnecessary app sessions in Supervisely: START button → App Sessions → stop all unnecessary app sessions by clicking the Stop button next to each undesired session.

  3. Stop unnecessary processes from your machine's terminal by running sudo kill <put_your_process_id_here>.

  4. Select a lighter machine learning model (check the "Memory" column in the model table, which shows how much GPU memory the model requires for training).

If this information is not provided, use a simple rule: the higher the model appears in the table, the lighter it is.

  5. Reduce the batch size or the model input resolution:

    MMsegmentation image resolution/batch size

    MMdetection v3 image resolution/batch size
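
A rough sketch of steps 1 and 3 from the terminal; the process ID is a placeholder taken from the Processes table at the bottom of the nvidia-smi output:

~$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv   # how much GPU memory is currently used
~$ nvidia-smi                                                     # the Processes table lists the PIDs holding GPU memory
~$ sudo kill <put_your_process_id_here>                           # stop an unneeded process from that table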

Additionally, you can stop a process via Docker:

  1. Run docker ps to list all Docker containers running on this machine.

  2. Run docker stop <put_your_container_id_here> to stop the unneeded container.
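
For example (the container ID is a placeholder copied from the docker ps output; the --format flag is optional and only trims the table to the most useful columns):

~$ docker ps --format "table {{.ID}}\t{{.Image}}\t{{.Status}}"   # compact list of running containers
~$ docker stop <put_your_container_id_here>                      # stop the container that is holding GPU memory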
