mtrl.agent package

Submodules

mtrl.agent.abstract module

Interface for the agent.

class mtrl.agent.abstract.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device)[source]

Bases: abc.ABC

Abstract agent class that every other agent should extend.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

abstract complete_init(cfg_to_load_model: omegaconf.dictconfig.DictConfig) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]

Get the list of tuples of (model, name) from the agent to checkpoint.

Returns

list of tuples of (model, name).

Return type

List[Tuple[ModelType, str]]

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]

Get the list of tuples of (optimizer, name) from the agent to checkpoint.

Returns

list of tuples of (optimizer, name).

Return type

List[Tuple[OptimizerType, str]]

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

load_metadata(model_dir: str) → Optional[Dict[Any, Any]][source]

Load the metadata of the agent.

Parameters

model_dir (str) – directory to load the model from.

Returns

metadata.

Return type

Optional[Dict[Any, Any]]

abstract sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
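
A minimal usage sketch of the checkpointing helpers is shown below. It assumes only that the caller supplies a concrete agent instance and a writable model directory; the step values and the retain_last_n choice are illustrative, not library defaults:

   from mtrl.agent.abstract import Agent


   def checkpoint_and_resume(agent: Agent, model_dir: str) -> int:
       """Save a few checkpoints, then reload from the most recent one."""
       for step in range(0, 30000, 10000):
           # ... a chunk of training would run here ...
           agent.save(model_dir, step=step, retain_last_n=3)
       # Only the last 3 checkpoints are kept on disk; resume from the latest.
       return agent.load_latest_step(model_dir)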

save_components(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different components of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

save_components_or_optimizers(component_or_optimizer_list: Union[List[Tuple[torch.nn.modules.module.Module, str]], List[Tuple[torch.optim.optimizer.Optimizer, str]]], model_dir: str, step: int, retain_last_n: int, suffix: str = '') → None[source]

Save the components and optimizers from the given list.

Parameters
  • component_or_optimizer_list (Union[List[Tuple[ComponentType, str]], List[Tuple[OptimizerType, str]]]) – list of components and optimizers to save.

  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • suffix (str, optional) – suffix to add to the name of the model before checkpointing. Defaults to “”.

save_metadata(model_dir: str, step: int) → None[source]

Save the metadata.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different optimizers of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

abstract select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

abstract train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

abstract update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray
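
Since every agent must subclass this interface, the following sketch shows a deliberately trivial concrete agent that implements the abstract methods with a random policy. It is an illustration only (the class name and behaviour are not part of the library); real agents such as mtrl.agent.sac.Agent implement these hooks with actual learning logic:

   from typing import Any, Dict, List, Optional, Tuple

   import numpy as np
   import torch
   from omegaconf import DictConfig

   from mtrl.agent.abstract import Agent


   class RandomAgent(Agent):
       """Toy agent that samples uniform random actions and never learns."""

       def __init__(
           self,
           env_obs_shape: List[int],
           action_shape: List[int],
           action_range: Tuple[int, int],
           multitask_cfg: DictConfig,
           device: torch.device,
       ):
           super().__init__(
               env_obs_shape=env_obs_shape,
               action_shape=action_shape,
               action_range=action_range,
               multitask_cfg=multitask_cfg,
               device=device,
           )
           self._action_shape = action_shape
           self._action_range = action_range

       def complete_init(self, cfg_to_load_model: DictConfig) -> None:
           # Nothing to post-process for this toy agent.
           pass

       def train(self, training: bool = True) -> None:
           self.training = training

       def sample_action(self, multitask_obs, modes) -> np.ndarray:
           low, high = self._action_range
           return np.random.uniform(low, high, size=self._action_shape)

       def select_action(self, multitask_obs, modes) -> np.ndarray:
           # Deterministic variant: the midpoint of the action range.
           low, high = self._action_range
           return np.full(self._action_shape, (low + high) / 2.0)

       def update(
           self,
           replay_buffer,
           logger,
           step: int,
           kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None,
           buffer_index_to_sample: Optional[np.ndarray] = None,
       ) -> np.ndarray:
           # A random agent has nothing to learn; echo the indices back.
           return buffer_index_to_sample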

mtrl.agent.deepmdp module

class mtrl.agent.deepmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.sac_ae.Agent

DeepMDP Agent

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the transition model and reward decoder.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.distral module

class mtrl.agent.distral.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, distral_alpha: float, distral_beta: float, agent_index_to_task_index: List[str], distilled_agent_cfg: omegaconf.dictconfig.DictConfig, task_agent_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

Distral algorithm.

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Used during training

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Used during testing

train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.distral.DistilledAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

Centroid policy for Distral.

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training=True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.distral.TaskAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, index: int, env_index: int, distral_alpha: float, distral_beta: float, distilled_agent: mtrl.agent.distral.DistilledAgent, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.wrapper.Agent

Wrapper class for the task-specific agent.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

patch_agent() → None[source]

Change some function definitions at runtime.

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the actor and alpha component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.distral.gaussian_kld(mean1: torch.Tensor, logvar1: torch.Tensor, mean2: torch.Tensor, logvar2: torch.Tensor) → torch.Tensor[source]
Compute the KL divergence between univariate Gaussian distributions with the given means and log-variances, i.e. KL(N(mean1, logvar1) || N(mean2, logvar2)).

Parameters
  • mean1 (TensorType) –

  • logvar1 (TensorType) –

  • mean2 (TensorType) –

  • logvar2 (TensorType) –

Returns

the KL divergence values.

Return type

TensorType
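
For reference, the closed-form KL divergence between two univariate Gaussians parameterized by mean and log-variance is implemented below. This is an illustrative restatement of the standard formula, not necessarily the library's exact code:

   import torch


   def reference_gaussian_kld(
       mean1: torch.Tensor,
       logvar1: torch.Tensor,
       mean2: torch.Tensor,
       logvar2: torch.Tensor,
   ) -> torch.Tensor:
       # KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))), elementwise.
       return 0.5 * (
           logvar2
           - logvar1
           + (torch.exp(logvar1) + (mean1 - mean2) ** 2) / torch.exp(logvar2)
           - 1.0
       )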

mtrl.agent.grad_manipulation module

class mtrl.agent.grad_manipulation.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.wrapper.Agent

Base Class for Gradient Manipulation Algorithms.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.grad_manipulation.EnvMetadata(env_index: torch.Tensor, unique_env_index: torch.Tensor, env_index_count: torch.Tensor)[source]

Bases: object

env_index: torch.Tensor
env_index_count: torch.Tensor
unique_env_index: torch.Tensor
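
The docstring does not spell out how the three fields relate. A plausible construction, shown below as an assumption rather than the library's code, derives the unique indices and their counts from the raw per-sample env_index tensor with torch.unique:

   import torch

   from mtrl.agent.grad_manipulation import EnvMetadata

   # Hypothetical batch where samples come from environments 0, 1 and 2.
   env_index = torch.tensor([0, 0, 1, 2, 2, 2])
   unique_env_index, env_index_count = torch.unique(env_index, return_counts=True)

   metadata = EnvMetadata(
       env_index=env_index,
       unique_env_index=unique_env_index,
       env_index_count=env_index_count,
   )
   # unique_env_index -> tensor([0, 1, 2]), env_index_count -> tensor([2, 1, 3])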

mtrl.agent.gradnorm module

class mtrl.agent.gradnorm.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.grad_manipulation.Agent

GradNorm algorithm.

mtrl.agent.hipbmdp module

class mtrl.agent.hipbmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.deepmdp.Agent

HiPBMDP Agent

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool)[source]

Get the task encoding for the different environments.

Parameters
  • env_index (TensorType) – environment index.

  • modes (List[str]) –

  • disable_grad (bool) – should disable tracking gradient.

Returns

task encodings.

Return type

TensorType

update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger, step, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the task encoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.pcgrad module

class mtrl.agent.pcgrad.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.grad_manipulation.Agent

PCGrad algorithm.

get_shuffled_task_indices(num_tasks: int) → numpy.ndarray[source]
mtrl.agent.pcgrad.apply_vector_grad_to_parameters(vec: torch.Tensor, parameters: Iterable[torch.Tensor], accumulate: bool = False)[source]

Apply vector gradients to the parameters

Parameters
  • vec (TensorType) – a single vector represents the gradients of a model.

  • parameters (Iterable[TensorType]) – an iterator of Tensors that are the parameters of a model.
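
To clarify what the helper does, here is a hypothetical re-implementation (the name demo_apply_vector_grad and its exact behaviour are assumptions): it slices a flat gradient vector into per-parameter chunks and writes them into .grad, mirroring torch.nn.utils.vector_to_parameters but targeting gradients. In a PCGrad-style update, per-task gradients are flattened, projected to remove conflicting components, and then written back with a helper of this form:

   import torch
   from torch import nn


   def demo_apply_vector_grad(
       parameters, flat_grad: torch.Tensor, accumulate: bool = False
   ) -> None:
       # Walk the parameters in order, slicing the flat vector into chunks of
       # matching size and writing each chunk into the parameter's .grad field.
       offset = 0
       for param in parameters:
           numel = param.numel()
           chunk = flat_grad[offset : offset + numel].view_as(param)
           if accumulate and param.grad is not None:
               param.grad = param.grad + chunk
           else:
               param.grad = chunk.clone()
           offset += numel


   # Example: distribute a zero gradient over a tiny model's parameters.
   model = nn.Linear(3, 2)
   total = sum(p.numel() for p in model.parameters())
   demo_apply_vector_grad(model.parameters(), torch.zeros(total))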

mtrl.agent.sac module

class mtrl.agent.sac.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

SAC algorithm.

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

act(multitask_obs: Dict[str, torch.Tensor], modes: List[str], sample: bool) → numpy.ndarray[source]

Select/sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes in which to select the action.

  • sample (bool) – sample (if True) or select (if False) an action.

Returns

selected/sampled action.

Return type

np.ndarray

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_alpha(env_index: torch.Tensor) → torch.Tensor[source]

Get the alpha value for the given environments.

Parameters

env_index (TensorType) – environment index.

Returns

alpha values.

Return type

TensorType

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_parameters(name: str) → List[torch.nn.parameter.Parameter][source]

Get parameters corresponding to a given component.

Parameters

name (str) – name of the component.

Returns

list of parameters.

Return type

List[torch.nn.parameter.Parameter]

get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool) → torch.Tensor[source]

Get the task encoding for the different environments.

Parameters
  • env_index (TensorType) – environment index.

  • modes (List[str]) –

  • disable_grad (bool) – should disable tracking gradient.

Returns

task encodings.

Return type

TensorType

get_task_info(task_encoding: torch.Tensor, component_name: str, env_index: torch.Tensor) → mtrl.agent.ds.task_info.TaskInfo[source]

Encode task encoding into task info.

Parameters
  • task_encoding (TensorType) – encoding of the task.

  • component_name (str) – name of the component.

  • env_index (TensorType) – index of the environment.

Returns

TaskInfo object.

Return type

TaskInfo

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray
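
A hypothetical sketch of how sample_action and update fit together in a training loop is given below. The env, replay_buffer, and logger objects are assumed to follow mtrl's usual interfaces (a vectorised multitask environment, ReplayBuffer, and Logger); only the agent-facing calls are taken from the documented API, and the mode string "train" is illustrative:

   from mtrl.agent.sac import Agent as SACAgent


   def run_training(agent: SACAgent, env, replay_buffer, logger, num_steps: int) -> None:
       # The surrounding experiment code is assumed to push transitions into
       # replay_buffer; this sketch only shows the agent-facing calls.
       multitask_obs = env.reset()  # assumed gym-style multitask env
       for step in range(num_steps):
           # Stochastic actions while training.
           action = agent.sample_action(multitask_obs, modes=["train"])
           multitask_obs = env.step(action)[0]
           # One gradient update per environment step; the indices of the
           # sampled batch are returned and could be reused across updates.
           sampled_indices = agent.update(replay_buffer, logger, step)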

update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the actor and alpha component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_critic(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the critic component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the task encoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the transition model and reward decoder.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.sac_ae module

class mtrl.agent.sac_ae.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.sac.Agent

SAC+AE algorithm.

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.utils module

mtrl.agent.utils.build_mlp(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]

Utility function to build an MLP model. This assumes all the hidden layers use the same dimensionality.

Parameters
  • input_dim (int) – input dimension.

  • hidden_dim (int) – dimension of the hidden layers.

  • output_dim (int) – dimension of the output layer.

  • num_layers (int) – number of layers in the mlp.

Returns

the constructed MLP model.

Return type

ModelType
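
The return value is not described above. As an illustration of the kind of network such a helper typically produces (an assumption about its shape, not the library's exact code), an equivalent construction would interleave Linear layers with ReLU activations:

   from torch import nn


   def illustrative_build_mlp(
       input_dim: int, hidden_dim: int, output_dim: int, num_layers: int
   ) -> nn.Module:
       # num_layers is treated here as the number of hidden layers; with
       # num_layers == 0 the model collapses to a single Linear layer.
       if num_layers == 0:
           return nn.Linear(input_dim, output_dim)
       layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
       for _ in range(num_layers - 1):
           layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
       layers.append(nn.Linear(hidden_dim, output_dim))
       return nn.Sequential(*layers)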

mtrl.agent.utils.build_mlp_as_module_list(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]

Utility function to build an MLP as a module list of layers. This assumes all the hidden layers use the same dimensionality.

Parameters
  • input_dim (int) – input dimension.

  • hidden_dim (int) – dimension of the hidden layers.

  • output_dim (int) – dimension of the output layer.

  • num_layers (int) – number of layers in the mlp.

Returns

the constructed MLP, built as a module list of layers.

Return type

ModelType

class mtrl.agent.utils.eval_mode(*models)[source]

Bases: object

Put the given models in evaluation mode.
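
eval_mode is typically used as a context manager around evaluation-time action selection, temporarily switching the wrapped modules to eval mode and restoring their previous mode on exit (this usage pattern is an assumption based on similar SAC codebases; check the source for the exact semantics):

   import torch
   from torch import nn

   from mtrl.agent.utils import eval_mode

   actor = nn.Linear(4, 2)  # stand-in for an actual actor network
   with eval_mode(actor):
       # Inside the block the module is expected to be in eval mode
       # (actor.training == False); the previous mode is restored on exit.
       action = actor(torch.zeros(4))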

mtrl.agent.utils.preprocess_obs(obs: torch.Tensor, bits=5) → torch.Tensor[source]

Preprocess the image observation; see https://arxiv.org/abs/1807.03039.
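
The referenced paper uses bit-depth reduction with uniform dequantisation noise. An illustrative version of that standard preprocessing is sketched below; the library's implementation may differ in details such as the assumed input range:

   import torch


   def illustrative_preprocess_obs(obs: torch.Tensor, bits: int = 5) -> torch.Tensor:
       # Assumes a float tensor of raw 8-bit pixel values in [0, 255].
       # Quantise down to `bits` bits, add uniform dequantisation noise,
       # and centre the result around zero.
       bins = 2 ** bits
       obs = torch.floor(obs / 2 ** (8 - bits))
       obs = obs / bins
       obs = obs + torch.rand_like(obs) / bins
       return obs - 0.5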

mtrl.agent.utils.set_seed_everywhere(seed: int) → None[source]

Set seed for reproducibility.

Parameters

seed (int) – seed.

mtrl.agent.utils.soft_update_params(net: torch.nn.modules.module.Module, target_net: torch.nn.modules.module.Module, tau: float) → None[source]

Perform a soft update on the net using the target net.

Parameters
  • net ([ModelType]) – model to update.

  • target_net (ModelType) – model to update with.

  • tau (float) – control the extent of update.
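
In SAC-style code the soft (Polyak) update is target ← tau * net + (1 - tau) * target. The sketch below restates that standard formula for reference; it is illustrative and not necessarily the exact mtrl implementation:

   import torch
   from torch import nn


   @torch.no_grad()
   def illustrative_soft_update(net: nn.Module, target_net: nn.Module, tau: float) -> None:
       # Move each target parameter a fraction `tau` of the way toward the
       # corresponding online-network parameter (Polyak averaging).
       for param, target_param in zip(net.parameters(), target_net.parameters()):
           target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)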

mtrl.agent.utils.weight_init(m: torch.nn.modules.module.Module)[source]

Custom weight init for Conv2D and Linear layers.

mtrl.agent.utils.weight_init_conv(m: torch.nn.modules.module.Module)[source]
mtrl.agent.utils.weight_init_linear(m: torch.nn.modules.module.Module)[source]
mtrl.agent.utils.weight_init_moe_layer(m: torch.nn.modules.module.Module)[source]

mtrl.agent.wrapper module

class mtrl.agent.wrapper.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

This wrapper agent wraps other agents. It is useful for algorithms like PCGrad and GradNorm that can be used with many policies.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • agent_cfg (ConfigType) – config for the agent that is wrapped over.

  • device (torch.device) – device for the agent.

  • cfg_to_load_model (Optional[ConfigType], optional) – config to load the model from filesystem. Defaults to None.

  • should_complete_init (bool, optional) – should call complete_init method. Defaults to True.
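
Because the wrapped policy is described entirely by agent_cfg, wrapper-style agents are naturally built through Hydra. The sketch below is hypothetical: the config file path and its contents are assumptions, and only the constructor arguments come from the documentation above:

   import hydra
   import torch
   from omegaconf import OmegaConf


   def build_wrapper_agent(cfg_path: str):
       # cfg_path is assumed to point at a config whose `_target_` is
       # mtrl.agent.wrapper.Agent (or a subclass such as the PCGrad agent) and
       # which nests the agent_cfg / multitask_cfg entries.
       cfg = OmegaConf.load(cfg_path)
       return hydra.utils.instantiate(
           cfg,
           env_obs_shape=[12],
           action_shape=[4],
           action_range=(-1, 1),
           device=torch.device("cpu"),
       )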

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]

Get the list of tuples of (model, name) from the agent to checkpoint.

Returns

list of tuples of (model, name).

Return type

List[Tuple[ModelType, str]]

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]

Get the list of tuples of (optimizer, name) from the agent to checkpoint.

Returns

list of tuples of (optimizer, name).

Return type

List[Tuple[OptimizerType, str]]

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

save_components(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different components of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different optimizers of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training: bool = True)[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

Module contents