How to stop dist.init_process_group from hanging or deadlocking?
dist.init_process_group() blocks until every process in the group has joined, so a hang or deadlock usually means one of two things: not all of the expected processes were actually started, or the processes do not agree on (or cannot reach) the MASTER_ADDR:MASTER_PORT rendezvous address of the rank-0 process. Both situations, and how to fix them, are covered below.
The following fixes are based on Writing Distributed Applications with PyTorch, Initialization Methods.
Issue 1:
It will hang unless you pass in nprocs=world_size to mp.spawn(). In other words, it is waiting for the "whole world" of processes to show up.
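Here is a minimal, self-contained sketch of that failure mode (the worker function, the 127.0.0.1 address, and port 29500 are illustrative placeholders, not part of the article's example further down):

import os
import torch.multiprocessing as mp
import torch.distributed as dist

def worker(rank, world_size):
    # Every process must see the same rendezvous address and port.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # Blocks until world_size processes have called init_process_group().
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4
    # Hangs forever: only one process is spawned, but four are expected.
    # mp.spawn(worker, args=(world_size,), nprocs=1)
    # Completes: the "whole world" of four processes shows up.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)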
Issue 2:
The MASTER_ADDR and MASTER_PORT need to be the same in each process' environment and need to be a free address:port combination on the machine where the process with rank 0 will be run.
Both of these points are implied by, or stated directly in, the following quote from the link above:
Environment Variable
We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.
MASTER_PORT: A free port on the machine that will host the process with rank 0.
MASTER_ADDR: IP address of the machine that will host the process with rank 0.
WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.
RANK: Rank of each process, so they will know whether it is the master or a worker.
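To make the environment-variable method concrete, here is a minimal runnable sketch; the address, port, and a world size of 1 are placeholders so the snippet completes on its own, and in a real job every process would set the same MASTER_ADDR, MASTER_PORT, and WORLD_SIZE but its own RANK:

import os
import torch.distributed as dist

# All four variables must be set in every process's environment.
os.environ['MASTER_ADDR'] = '127.0.0.1'  # machine that hosts the rank-0 process
os.environ['MASTER_PORT'] = '29500'      # a free port on that machine
os.environ['WORLD_SIZE'] = '1'           # total number of processes (1 so this runs standalone)
os.environ['RANK'] = '0'                 # this process's rank; 0 is the master

# With the default init_method='env://', the rank and world size are read
# from RANK and WORLD_SIZE instead of being passed as arguments.
dist.init_process_group(backend='gloo')
print('rank', dist.get_rank(), 'of', dist.get_world_size(), 'initialized')
dist.destroy_process_group()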
Here's some code to demonstrate both of those in action:
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')
    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")
    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")

if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)
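As a side note: if you launch processes with PyTorch's torchrun launcher instead of mp.spawn() (for example, torchrun --nproc_per_node=4 on a single machine), it sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE in each process's environment for you, so the script only needs to call dist.init_process_group() and dist.destroy_process_group(), much like the environment-variable sketch above.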
Thank you for reading the article.