How to fix dist.init_process_group hanging or deadlocking

Written by - Aionlinecourse

dist.init_process_group hangs, or appears to deadlock, when the process group it is setting up never becomes complete.

The call blocks until every process counted in world_size has connected to the process with rank 0. The problem is therefore usually caused by processes waiting for peers that never arrive: either fewer processes were started than world_size says, or the processes do not agree on the address and port at which they should meet.

A solution to this issue is to make sure that every expected process is actually started and that all of them use the same master address and port.

The following fixes are based on the "Initialization Methods" section of the PyTorch tutorial "Writing Distributed Applications with PyTorch".

Issue 1:

dist.init_process_group will hang unless you pass nprocs=world_size to mp.spawn(). In other words, it waits for the "whole world" to show up, process-wise.
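The "whole world" behaviour can be pictured without PyTorch. The sketch below is only an analogy and not what init_process_group does internally: it uses a standard-library threading.Barrier sized for world_size, and the helper name count_arrivals and the timeout value are made up for illustration. When fewer workers than world_size are started, nobody gets past the barrier.

```python
import threading


def count_arrivals(world_size, nprocs, timeout=1.0):
    """Start `nprocs` workers that wait on a barrier sized for `world_size`.

    Returns how many workers got past the barrier. Hypothetical helper,
    used only to illustrate "waiting for the whole world".
    """
    barrier = threading.Barrier(world_size)
    passed = []
    lock = threading.Lock()

    def worker(rank):
        try:
            # Blocks until `world_size` workers are waiting, or until the timeout.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # Somebody never showed up, so the barrier was broken by the timeout.
            return
        with lock:
            passed.append(rank)

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(passed)
```

With count_arrivals(4, 4) all four workers get through, while count_arrivals(4, 3) lets none through. Without the timeout the three workers would wait forever, which is what the hang looks like from the outside.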


Issue 2:

The MASTER_ADDR and MASTER_PORT need to be the same in each process' environment and need to be a free address:port combination on the machine where the process with rank 0 will be run.
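One way to check the "free address:port" part before starting any workers is to try binding the port on the machine that will run rank 0. This is a minimal sketch using only the standard library; the helper name is_port_free is an assumption for illustration and is not part of PyTorch. Note that it takes the port as an int, whereas MASTER_PORT is stored as a string.

```python
import socket


def is_port_free(host, port):
    """Return True if (host, port) can currently be bound on this machine.

    Hypothetical helper: run it on the machine that will host rank 0.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or an address that is not local.
            return False
        return True
```

A True result only describes the moment of the check. Another program can still take the port before rank 0 binds it, so this narrows the problem down rather than ruling it out.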


Both of these are implied by, or can be read directly from, the following passage of that tutorial:

Environment Variable

We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

MASTER_PORT: A free port on the machine that will host the process with rank 0.

MASTER_ADDR: IP address of the machine that will host the process with rank 0.

WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

RANK: Rank of each process, so they will know whether it is the master or a worker.
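Because a missing or inconsistent variable shows up as a silent hang rather than an error, it can help to validate the four variables before calling dist.init_process_group. The following is a rough sketch under that assumption; check_dist_env and REQUIRED_VARS are hypothetical names, not PyTorch API, and the checks are limited to what the list above states.

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def check_dist_env(env=None):
    """Return a list of problems found in the four variables (empty if none)."""
    env = os.environ if env is None else env

    # Every variable has to be present and non-empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if problems:
        return problems

    try:
        port = int(env["MASTER_PORT"])
        world_size = int(env["WORLD_SIZE"])
        rank = int(env["RANK"])
    except ValueError:
        return ["MASTER_PORT, WORLD_SIZE and RANK must be integers"]

    if not 0 < port < 65536:
        problems.append(f"MASTER_PORT {port} is not a valid port number")
    if world_size < 1:
        problems.append(f"WORLD_SIZE {world_size} must be at least 1")
    elif not 0 <= rank < world_size:
        # Ranks run from 0 to world_size - 1.
        problems.append(f"RANK {rank} is outside 0..{world_size - 1}")
    return problems
```

Each process can call check_dist_env() on its own environment and print the result. The check cannot see other machines, so whether MASTER_ADDR and MASTER_PORT match across processes still has to be compared by hand, for example from the printed values.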


Here's some code that demonstrates both fixes (Issue 1 and Issue 2) in action:

import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

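def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        # set the option before bind() so that it applies to this bind
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # port 0 asks the OS to pick any free port
        s.bind(('', 0))
        # the port is released when the socket closes, so it is only "probably" still free afterwards
        return str(s.getsockname()[1])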


def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')

    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")

    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")
        
if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    # nprocs must equal world_size, otherwise init_process_group keeps waiting for the missing ranks
    mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)

Thank you for reading the article.
