persia.distributed

Module Contents

class persia.distributed.BaguaDistributedOption(algorithm, **options)

Bases: DistributedBaseOption

Implements an option to convert a torch model to a Bagua distributed model.

Example for BaguaDistributedOption:

from persia.distributed import BaguaDistributedOption

kwargs = {
    "enable_bagua_net": True
}
bagua_option = BaguaDistributedOption("gradient_allreduce", **kwargs)

Algorithms supported in Bagua:

  • gradient_allreduce

  • decentralized

  • low_precision_decentralized

  • qadam

  • bytegrad

  • async

Note

See Bagua Algorithm for more details, especially the arguments each algorithm accepts.

Note

BaguaDistributedOption only supports CUDA environments. If you want to run a PERSIA task on a CPU cluster, use DDPOption with backend="gloo" instead of BaguaDistributedOption.
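For example, a CPU-only cluster could fall back to DDP over Gloo. This is a minimal sketch using only the documented DDPOption constructor (see DDPOption below):

from persia.distributed import DDPOption

# Gloo-backed DDP runs on CPU-only clusters where Bagua (CUDA-only) cannot.
cpu_option = DDPOption(backend="gloo")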

Parameters:
  • algorithm (str) – name of the Bagua algorithm.

  • options (dict) – options for the Bagua algorithm.

convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
Parameters:
  • model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.

  • world_size (int) – total number of processes.

  • rank_id (int) – rank of current process.

  • device_id (int, optional) – device id for current process.

  • master_addr (str, optional) – IP address of the collective communication master.

  • optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
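A hypothetical sketch of calling this method directly, assuming a toy single-process job pinned to CUDA device 0. In practice the conversion is typically driven by the surrounding PERSIA training setup, and the return value is not documented here, so it is captured opaquely:

import torch

from persia.distributed import BaguaDistributedOption

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

bagua_option = BaguaDistributedOption("gradient_allreduce")
# Hypothetical direct call: world_size/rank_id/device_id describe a
# single-process job on CUDA device 0 (Bagua requires CUDA).
result = bagua_option.convert2distributed_model(
    model,
    world_size=1,
    rank_id=0,
    device_id=0,
    optimizer=optimizer,
)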

init_with_env_file()

Check if the current option was initialized with a DDP environment file.

Returns:

Whether the distributed option was initialized with an environment file.

Return type:

bool

class persia.distributed.DDPOption(initialization_method='tcp', backend='nccl', **options)

Bases: DistributedBaseOption

Implements an option to convert a torch model to a DDP model.

The backend in DDPOption currently supports only nccl and gloo. Set backend="nccl" if your PERSIA task trains on a cluster with CUDA devices, or backend="gloo" if it trains on a CPU-only cluster.

For example:

from persia.distributed import DDPOption

ddp_option = DDPOption(backend="nccl")

If you want to change the default master_port or master_addr, pass them as keyword arguments to DDPOption:

from persia.distributed import DDPOption

ddp_option = DDPOption(backend="nccl", master_port=23333, master_addr="localhost")
Parameters:
  • initialization_method (str) – the PyTorch distributed initialization method; tcp and file are currently supported. See PyTorch initialization for more details.

  • backend (str) – backend for collective communication. Currently nccl and gloo are supported.

  • options (dict) – additional options, such as master_port or master_addr.

convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
Parameters:
  • model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.

  • world_size (int) – total number of processes.

  • rank_id (int) – rank of current process.

  • device_id (int, optional) – device id for current process.

  • master_addr (str, optional) – IP address of the collective communication master.

  • optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.

init_with_env_file()

Check if the current option was initialized with a DDP environment file.

Returns:

True if the current option was initialized with a DDP environment file.

Return type:

bool

class persia.distributed.DistributedBaseOption(master_port, master_addr=None)

Bases: abc.ABC

Implements a common option to convert a torch model to a distributed data-parallel model, e.g. Bagua distributed or PyTorch DDP.

This class should not be instantiated directly.

Parameters:
  • master_port (int) – service port of the collective communication master.

  • master_addr (str, optional) – IP address of the collective communication master.

abstract convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
Parameters:
  • model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.

  • world_size (int) – total number of processes.

  • rank_id (int) – rank of current process.

  • device_id (int, optional) – device id for current process.

  • master_addr (str, optional) – IP address of the collective communication master.

  • optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.

abstract init_with_env_file()

Check if the current option was initialized with a DDP environment file.

Returns:

True if the current option was initialized with a DDP environment file.

Return type:

bool
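A hypothetical sketch of how the abstract interface could be filled in. The class name MyDistributedOption and the returned (model, optimizer) pair are illustrative assumptions, not part of the documented contract:

from persia.distributed import DistributedBaseOption

class MyDistributedOption(DistributedBaseOption):
    def __init__(self, master_port, master_addr=None):
        super().__init__(master_port, master_addr)

    def convert2distributed_model(
        self,
        model,
        world_size,
        rank_id,
        device_id=None,
        master_addr=None,
        optimizer=None,
    ):
        # Wrap `model` with your own data-parallel strategy here.
        # Returning the (possibly wrapped) model and optimizer is an
        # assumption made purely for illustration.
        return model, optimizer

    def init_with_env_file(self):
        # This sketch never reads a DDP environment file.
        return False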

persia.distributed.get_default_distributed_option(device_id=None)

Get default distributed option.

Parameters:

device_id (int, optional) – CUDA device id. backend="nccl" is applied to the DDPOption if device_id is not None; otherwise backend="gloo" is used for CPU-only mode.

Returns:

Default distributed option.

Return type:

DDPOption
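A minimal usage sketch, following the backend selection described above:

from persia.distributed import get_default_distributed_option

gpu_option = get_default_distributed_option(device_id=0)  # DDPOption with backend="nccl"
cpu_option = get_default_distributed_option()             # DDPOption with backend="gloo"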