How to stop dist.init_process_group from hanging or deadlocking

Written by - Aionlinecourse

The main reason for the hang is that dist.init_process_group blocks until every process in the group has joined. If some processes never start, or cannot reach the master, the call keeps waiting.

A hang can also come from a large number of processes waiting for a resource, for example, a disk drive. In that case it can be eased by adding more resources to the system or by reducing the number of processes in the system.

How to stop dist.init_process_group from hanging or deadlocking

If resource contention is the cause, adding more resources to the system reduces the time processes spend waiting for shared resources. In many cases, though, the hang comes from one of the two configuration issues described below.

The following fixes are based on the "Initialization Methods" section of the PyTorch tutorial "Writing Distributed Applications with PyTorch".

Issue 1:

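dist.init_process_group will hang unless you pass nprocs=world_size to mp.spawn(). In other words, it waits for the "whole world" to show up, process-wise: if fewer than world_size processes are started, the ones that did start keep waiting for the missing ranks.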
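One way to catch this mismatch before anything blocks is to compare the numbers up front. The sketch below is not part of PyTorch: `expected_nprocs` and `check_spawn_config` are hypothetical helpers, and they assume every node launches the same number of processes.

```python
def expected_nprocs(world_size: int, num_nodes: int = 1) -> int:
    """Return how many processes each node must spawn (hypothetical helper).

    Assumes every node launches the same number of processes.
    """
    if world_size < 1 or num_nodes < 1:
        raise ValueError("world_size and num_nodes must be positive")
    if world_size % num_nodes != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by num_nodes={num_nodes}"
        )
    return world_size // num_nodes


def check_spawn_config(world_size: int, nprocs: int, num_nodes: int = 1) -> None:
    """Raise early if the spawned processes cannot fill the whole world."""
    needed = expected_nprocs(world_size, num_nodes)
    if nprocs != needed:
        raise ValueError(
            f"nprocs={nprocs} but each node must spawn {needed} processes "
            f"to reach world_size={world_size}"
        )
```

Calling `check_spawn_config(world_size, nprocs)` right before `mp.spawn()` turns a silent hang into an immediate error message on a single machine.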


Issue 2:

MASTER_ADDR and MASTER_PORT must be the same in every process's environment, and together they must be a free address:port combination on the machine where the process with rank 0 will run.
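A quick way to test the "free port" half of this requirement is to try binding the port yourself. This is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper, and it only gives a meaningful answer when run on the machine that will host rank 0. Another program could still grab the port between the check and the real launch.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `port` can be bound on `host` at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # bind() fails with OSError if another socket already holds the port.
            s.bind((host, port))
        except OSError:
            return False
    return True
```

If the check returns False for the chosen MASTER_PORT, pick another port and make sure every process receives the new value.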


Both of these fixes are implied by, or can be read directly from, the following passage in the tutorial:

Environment Variable

We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

MASTER_PORT: A free port on the machine that will host the process with rank 0.

MASTER_ADDR: IP address of the machine that will host the process with rank 0.

WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

RANK: Rank of each process, so they will know whether it is the master or a worker.
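The four variables above can be assembled in one place so that every process gets a consistent set. The sketch below is an illustration, not the tutorial's code: `rendezvous_env` and `init_from_env` are hypothetical helpers, the address and port values are placeholders, and torch is imported lazily so the first helper works without it.

```python
def rendezvous_env(rank, world_size, master_addr, master_port):
    """Build the four variables as strings (os.environ only accepts str)."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


def init_from_env(backend="gloo"):
    """Join the default process group using variables already in os.environ."""
    import torch.distributed as dist  # lazy import; requires PyTorch

    # With init_method="env://", rank and world size are read from the
    # RANK and WORLD_SIZE environment variables.
    dist.init_process_group(backend, init_method="env://")


# Illustrative values for the process with rank 1 in a world of 4.
env = rendezvous_env(rank=1, world_size=4, master_addr="127.0.0.1", master_port=29500)
```

Each process would apply its own dictionary with `os.environ.update(env)` and then call `init_from_env()`. Only RANK should differ between processes.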


Here's some code to demonstrate both of those in action:

import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

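def find_free_port():
    """Ask the OS for a free port and return it as a string.

    Source: https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number
    """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        # Set SO_REUSEADDR before bind() so the option applies to this bind.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Port 0 tells the OS to choose any free port.
        s.bind(('', 0))
        return str(s.getsockname()[1])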


def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')

    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")

    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")
        
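if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    # nprocs must equal world_size, otherwise init_process_group blocks
    # while waiting for the missing ranks.
    mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)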

Thank you for reading the article.
