How to fix dist.init_process_group hanging or deadlocking

Written by Aionlinecourse

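dist.init_process_group hangs because it is a blocking rendezvous: the call does not return until every process in the group has connected to the master.

A hang therefore usually means the processes are waiting for something that never becomes available, either peers that were never started or a master address and port they cannot reach. The fix is to make the number of launched processes match the expected world size and to give every process the same reachable address and free port.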



Adding system resources may shorten the time processes spend waiting on shared resources, but a hang in dist.init_process_group is more often a configuration problem. The two issues below are common causes.

The following fixes are based on the "Initialization Methods" section of the PyTorch tutorial "Writing Distributed Applications with PyTorch".

Issue 1:

It will hang unless you pass in nprocs=world_size to mp.spawn(). In other words, it's waiting for the "whole world" to show up, process-wise.
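The "whole world" behaviour can be illustrated without PyTorch. The sketch below is only an analogy built on Python's threading.Barrier, not how PyTorch implements its rendezvous; the function name rendezvous and the timeout value are made up for illustration. A barrier sized for world_size only releases once that many workers arrive, so starting fewer leaves every worker waiting until the timeout expires.

```python
import threading


def rendezvous(world_size, started, timeout=1.0):
    """Start `started` workers that all wait on a barrier sized for `world_size`."""
    barrier = threading.Barrier(world_size)
    results = []

    def worker(rank):
        try:
            # Blocks until `world_size` workers have called wait(), or the timeout expires.
            barrier.wait(timeout=timeout)
            results.append((rank, "joined"))
        except threading.BrokenBarrierError:
            # Raised in every waiting worker once one of them times out.
            results.append((rank, "timed out"))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Under these assumptions, rendezvous(4, 4) reports every worker as joined, while rendezvous(4, 2) reports both workers as timed out. Without the timeout, the second call would block forever, which mirrors the hang described above.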


Issue 2:

The MASTER_ADDR and MASTER_PORT need to be the same in each process' environment and need to be a free address:port combination on the machine where the process with rank 0 will be run.
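One way to check the "free port" requirement before launching is to try binding to the port. This is a minimal sketch using only the standard library; the helper name port_is_free is hypothetical, the check has to run on the machine that will host rank 0, and another process could still claim the port between the check and the actual launch.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, int(port)))
        except OSError:
            # Typically "address already in use" or an invalid address.
            return False
    return True
```

If port_is_free(master_port) returns False for the value you plan to export as MASTER_PORT, pick another port before starting any process.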


Both issues are stated or implied by the following passage from that tutorial:

Environment Variable

We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

MASTER_PORT: A free port on the machine that will host the process with rank 0.

MASTER_ADDR: IP address of the machine that will host the process with rank 0.

WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

RANK: Rank of each process, so they will know whether it is the master or a worker.
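Because a missing or inconsistent variable shows up as a silent hang rather than an error, it can help to validate the four variables before calling dist.init_process_group. The following is a sketch under that assumption; check_rendezvous_env and its messages are hypothetical and not part of PyTorch.

```python
import os

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_rendezvous_env(env=None):
    """Return a list of problems found in the rendezvous environment variables."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    if problems:
        return problems
    try:
        port = int(env["MASTER_PORT"])
        world_size = int(env["WORLD_SIZE"])
        rank = int(env["RANK"])
    except ValueError:
        return ["MASTER_PORT, WORLD_SIZE and RANK must be integers"]
    if not 0 < port < 65536:
        problems.append(f"MASTER_PORT {port} is not a valid port")
    if world_size < 1:
        problems.append(f"WORLD_SIZE {world_size} must be at least 1")
    elif not 0 <= rank < world_size:
        problems.append(f"RANK {rank} is outside 0..{world_size - 1}")
    return problems
```

An empty list means the variables are at least self-consistent in this process. It cannot confirm that the other processes use the same MASTER_ADDR and MASTER_PORT, so comparing the printed values across processes is still worthwhile.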


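Here's some code that demonstrates both fixes in action, passing nprocs=world_size to mp.spawn() and giving every process the same MASTER_ADDR and a free MASTER_PORT: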

import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

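def find_free_port():
    """Ask the OS for a currently unused TCP port and return it as a string.

    Based on https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number
    Another process could still claim the port before it is used.
    """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        # Set the option before bind() so that it applies to this bind.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Port 0 lets the OS choose a free port.
        s.bind(('', 0))
        return str(s.getsockname()[1])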


def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')

    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")

    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")
        
if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    # nprocs must equal world_size, otherwise init_process_group keeps waiting for missing ranks
    mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)
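If a peer never arrives, the demo blocks for a long time, because the default timeout of init_process_group is on the order of minutes. One option is to pass a shorter timeout so that a misconfiguration surfaces as an error. The sketch below relies on the timeout keyword of dist.init_process_group, which takes a datetime.timedelta; the helper names init_kwargs and init_with_timeout are hypothetical, and the exact exception raised on expiry depends on the PyTorch version and backend.

```python
from datetime import timedelta


def init_kwargs(rank, world_size, timeout_s=30, backend="gloo"):
    """Build keyword arguments for dist.init_process_group with a short timeout."""
    return {
        "backend": backend,
        "rank": rank,
        "world_size": world_size,
        "timeout": timedelta(seconds=timeout_s),
    }


def init_with_timeout(rank, world_size, timeout_s=30):
    """Join the default process group, giving up after roughly `timeout_s` seconds."""
    # Imported here so init_kwargs can be used without PyTorch installed.
    import torch.distributed as dist

    # MASTER_ADDR and MASTER_PORT must already be set in os.environ.
    dist.init_process_group(**init_kwargs(rank, world_size, timeout_s))
```

Calling init_with_timeout(rank, world_size) in place of the dist.init_process_group line in setup_process keeps the rest of the demo unchanged. A short timeout is mainly useful while debugging, since a real multi-machine job may legitimately need longer to start.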

Thank you for reading the article.