persia.distributed
Module Contents
- class persia.distributed.BaguaDistributedOption(algorithm, **options)
Bases:
DistributedBaseOption
Implements an option to convert a torch model to a Bagua distributed model.
Example for BaguaDistributedOption:

    from persia.distributed import BaguaDistributedOption

    kwargs = {"enable_bagua_net": True}
    bagua_option = BaguaDistributedOption("gradient_allreduce", **kwargs)
Algorithms supported in Bagua:
- gradient_allreduce
- decentralized
- low_precision_decentralized
- qadam
- bytegrad
- async
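Any algorithm name from the list above can be passed as the first argument; algorithm-specific keyword arguments, if any, are forwarded to Bagua. A minimal sketch using the decentralized algorithm with no extra options:

    from persia.distributed import BaguaDistributedOption

    # "decentralized" is one of the supported algorithm names listed above;
    # no algorithm-specific options are passed in this sketch
    decentralized_option = BaguaDistributedOption("decentralized")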
Note
You can review Bagua Algorithm for more details, especially the arguments of each algorithm.
Note
BaguaDistributedOption only supports the CUDA environment. If you want to run a PERSIA task on a CPU cluster, try DDPOption with backend="gloo" instead of BaguaDistributedOption, as in the sketch below.
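A minimal sketch of the CPU-only fallback suggested by this note:

    from persia.distributed import DDPOption

    # gloo backend for a CPU-only cluster, used instead of BaguaDistributedOption
    cpu_option = DDPOption(backend="gloo")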
- Parameters:
algorithm (str) – name of Bagua algorithm.
options (dict) – options for the Bagua algorithm.
- convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters:
model (torch.nn.Module) – the PyTorch model that needs to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of current process.
device_id (int, optional) – device id for current process.
master_addr (str, optional) – IP address of the master node for collective communication.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
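In a PERSIA job the training context normally performs this conversion; the hand-written sketch below is only illustrative. The toy model, optimizer, and single-process sizes are assumptions, and the return value is not documented here, so it is not used:

    import torch
    from persia.distributed import BaguaDistributedOption

    # toy model and optimizer (illustrative only); Bagua requires CUDA,
    # hence the .cuda() call
    model = torch.nn.Linear(8, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    option = BaguaDistributedOption("gradient_allreduce")
    option.convert2distributed_model(
        model,
        world_size=1,           # total number of processes
        rank_id=0,              # rank of the current process
        device_id=0,            # CUDA device for this process
        master_addr="127.0.0.1",
        optimizer=optimizer,
    )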
- init_with_env_file()
Check if the current option is initialized with a DDP environment file.
- Returns:
Whether the distributed option is initialized with an environment file.
- Return type:
bool
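A small sketch of checking the flag; the printed messages are illustrative assumptions about how the result might be used:

    from persia.distributed import BaguaDistributedOption

    option = BaguaDistributedOption("gradient_allreduce")
    if option.init_with_env_file():
        print("option is initialized from a DDP environment file")
    else:
        print("option is initialized without a DDP environment file")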
- class persia.distributed.DDPOption(initialization_method='tcp', backend='nccl', **options)
Bases:
DistributedBaseOption
Implements an option to convert a torch model to a DDP model.
The backend in DDPOption currently supports only nccl and gloo. Set backend="nccl" if your PERSIA task trains on a cluster with CUDA devices, or backend="gloo" if your PERSIA task trains on a CPU-only cluster.
For example:

    from persia.distributed import DDPOption

    ddp_option = DDPOption(backend="nccl")

If you want to change the default master_port or master_addr, pass the kwargs to DDPOption:

    from persia.distributed import DDPOption

    ddp_option = DDPOption(backend="nccl", master_port=23333, master_addr="localhost")
- Parameters:
initialization_method (str) – the PyTorch distributed initialization method; tcp and file are currently supported. See PyTorch initialization for more details.
backend (str) – backend of collective communication. Currently nccl and gloo are supported.
options (dict) – options that include the master_port or master_addr.
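A sketch of picking the backend from CUDA availability, following the guidance above:

    import torch
    from persia.distributed import DDPOption

    # nccl for CUDA clusters, gloo for CPU-only clusters
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    ddp_option = DDPOption(backend=backend)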
- convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters:
model (torch.nn.Module) – the PyTorch model that needs to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of current process.
device_id (int, optional) – device id for current process.
master_addr (str, optional) – IP address of the master node for collective communication.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
- init_with_env_file()
Check if the current option was initialized with a DDP environment file or not.
- Returns:
True if the current option was initialized with a DDP environment file.
- Return type:
bool
- class persia.distributed.DistributedBaseOption(master_port, master_addr=None)
Bases:
abc.ABC
Implements a common option to convert a torch model to a distributed data-parallel model, e.g. Bagua distributed or PyTorch DDP.
This class should not be instantiated directly.
- Parameters:
master_port (int) – port of the master node for collective communication.
master_addr (str, optional) – IP address of the master node for collective communication.
- abstract convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters:
model (torch.nn.Module) – the PyTorch model that needs to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of current process.
device_id (int, optional) – device id for current process.
master_addr (str, optional) – IP address of the master node for collective communication.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
- abstract init_with_env_file()
Check if the current option was initialized with a DDP environment file or not.
- Returns:
True if the current option was initialized with a DDP environment file.
- Return type:
bool
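Since DistributedBaseOption is abstract, a concrete option must implement both methods above. A hypothetical minimal subclass, purely to illustrate the interface (not part of PERSIA):

    from persia.distributed import DistributedBaseOption


    class MyDistributedOption(DistributedBaseOption):
        """Hypothetical example subclass; not part of PERSIA."""

        def __init__(self, master_port, master_addr=None):
            super().__init__(master_port, master_addr)

        def convert2distributed_model(
            self,
            model,
            world_size,
            rank_id,
            device_id=None,
            master_addr=None,
            optimizer=None,
        ):
            # wrap ``model`` with a custom data-parallel implementation here
            raise NotImplementedError

        def init_with_env_file(self):
            # this hypothetical option does not read a DDP environment file
            return False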
- persia.distributed.get_default_distributed_option(device_id=None)
Get default distributed option.
- Parameters:
device_id (int, optional) – CUDA device id. backend="nccl" is applied to the DDPOption if device_id is not None; otherwise backend="gloo" is used for CPU-only mode.
- Returns:
Default distributed option.
- Return type:
DDPOption
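A minimal sketch of the two cases described above:

    from persia.distributed import get_default_distributed_option

    # with a CUDA device id: an nccl-backed DDPOption
    gpu_option = get_default_distributed_option(device_id=0)

    # without a device id: a gloo-backed DDPOption for CPU-only mode
    cpu_option = get_default_distributed_option()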