mtrl.agent package

Submodules

mtrl.agent.abstract module

Interface for the agent.

class mtrl.agent.abstract.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device)[source]

Bases: abc.ABC

Abstract agent class that every other agent should extend.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

abstract complete_init(cfg_to_load_model: omegaconf.dictconfig.DictConfig) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]

Get the list of tuples of (model, name) from the agent to checkpoint.

Returns

list of tuples of (model, name).

Return type

List[Tuple[ModelType, str]]

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]

Get the list of tuples of (optimizer, name) from the agent to checkpoint.

Returns

list of tuples of (optimizer, name).

Return type

List[Tuple[OptimizerType, str]]

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

load_metadata(model_dir: str) → Optional[Dict[Any, Any]][source]

Load the metadata of the agent.

Parameters

model_dir (str) – directory to load the model from.

Returns

metadata.

Return type

Optional[Dict[Any, Any]]

abstract sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
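
A minimal usage sketch of the checkpointing helpers is shown below. It assumes only that the caller supplies a concrete agent instance and a writable model directory; the step values and the retain_last_n choice are illustrative, not library defaults:

   from mtrl.agent.abstract import Agent


   def checkpoint_and_resume(agent: Agent, model_dir: str) -> int:
       """Save a few checkpoints, then reload from the most recent one."""
       for step in range(0, 30000, 10000):
           # ... a chunk of training would run here ...
           agent.save(model_dir, step=step, retain_last_n=3)
       # Only the last 3 checkpoints are kept on disk; resume from the latest.
       return agent.load_latest_step(model_dir)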

save_components(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different components of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

save_components_or_optimizers(component_or_optimizer_list: Union[List[Tuple[torch.nn.modules.module.Module, str]], List[Tuple[torch.optim.optimizer.Optimizer, str]]], model_dir: str, step: int, retain_last_n: int, suffix: str = '') → None[source]

Save the components and optimizers from the given list.

Parameters
  • component_or_optimizer_list (Union[List[Tuple[ComponentType, str]], List[Tuple[OptimizerType, str]]]) – list of components and optimizers to save.

  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • suffix (str, optional) – suffix to add to the name of the model before checkpointing. Defaults to “”.

save_metadata(model_dir: str, step: int) → None[source]

Save the metadata.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different optimizers of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

abstract select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

abstract train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

abstract update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray
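
Since every agent must subclass this interface, the following sketch shows a deliberately trivial concrete agent that implements the abstract methods with a random policy. It is an illustration only (the class name and behaviour are not part of the library); real agents such as mtrl.agent.sac.Agent implement these hooks with actual learning logic:

   from typing import Any, Dict, List, Optional, Tuple

   import numpy as np
   import torch
   from omegaconf import DictConfig

   from mtrl.agent.abstract import Agent


   class RandomAgent(Agent):
       """Toy agent that samples uniform random actions and never learns."""

       def __init__(
           self,
           env_obs_shape: List[int],
           action_shape: List[int],
           action_range: Tuple[int, int],
           multitask_cfg: DictConfig,
           device: torch.device,
       ):
           super().__init__(
               env_obs_shape=env_obs_shape,
               action_shape=action_shape,
               action_range=action_range,
               multitask_cfg=multitask_cfg,
               device=device,
           )
           self._action_shape = action_shape
           self._action_range = action_range

       def complete_init(self, cfg_to_load_model: DictConfig) -> None:
           # Nothing to post-process for this toy agent.
           pass

       def train(self, training: bool = True) -> None:
           self.training = training

       def sample_action(self, multitask_obs, modes) -> np.ndarray:
           low, high = self._action_range
           return np.random.uniform(low, high, size=self._action_shape)

       def select_action(self, multitask_obs, modes) -> np.ndarray:
           # Deterministic variant: the midpoint of the action range.
           low, high = self._action_range
           return np.full(self._action_shape, (low + high) / 2.0)

       def update(
           self,
           replay_buffer,
           logger,
           step: int,
           kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None,
           buffer_index_to_sample: Optional[np.ndarray] = None,
       ) -> np.ndarray:
           # A random agent has nothing to learn; echo the indices back.
           return buffer_index_to_sample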

mtrl.agent.deepmdp module

class mtrl.agent.deepmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.sac_ae.Agent

DeepMDP Agent

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the transition model and reward decoder.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.distral module

class mtrl.agent.distral.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, distral_alpha: float, distral_beta: float, agent_index_to_task_index: List[str], distilled_agent_cfg: omegaconf.dictconfig.DictConfig, task_agent_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

Distral algorithm.

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Used during training

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Used during testing

train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.distral.DistilledAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

Centroid policy for Distral.

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training=True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.distral.TaskAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, index: int, env_index: int, distral_alpha: float, distral_beta: float, distilled_agent: mtrl.agent.distral.DistilledAgent, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.wrapper.Agent

Wrapper class for the task-specific agent.

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

load_latest_step(model_dir: str) → int[source]

Load the agent using the latest training step.

Parameters

model_dir (str) – directory to load the model from.

Returns

step for tracking the training of the agent.

Return type

int

patch_agent() → None[source]

Change some function definitions at runtime.

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the actor and alpha component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.distral.gaussian_kld(mean1: torch.Tensor, logvar1: torch.Tensor, mean2: torch.Tensor, logvar2: torch.Tensor) → torch.Tensor[source]
Compute the KL divergence between univariate Gaussian distributions with the given means and log-variances, i.e. KL(N(mean1, logvar1) || N(mean2, logvar2)).

Parameters
  • mean1 (TensorType) –

  • logvar1 (TensorType) –

  • mean2 (TensorType) –

  • logvar2 (TensorType) –

Returns

the KL divergence values.

Return type

TensorType
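
For reference, the closed-form KL divergence between two univariate Gaussians parameterized by mean and log-variance is implemented below. This is an illustrative restatement of the standard formula, not necessarily the library's exact code:

   import torch


   def reference_gaussian_kld(
       mean1: torch.Tensor,
       logvar1: torch.Tensor,
       mean2: torch.Tensor,
       logvar2: torch.Tensor,
   ) -> torch.Tensor:
       # KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))), elementwise.
       return 0.5 * (
           logvar2
           - logvar1
           + (torch.exp(logvar1) + (mean1 - mean2) ** 2) / torch.exp(logvar2)
           - 1.0
       )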

mtrl.agent.grad_manipulation module

class mtrl.agent.grad_manipulation.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.wrapper.Agent

Base Class for Gradient Manipulation Algorithms.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

class mtrl.agent.grad_manipulation.EnvMetadata(env_index: torch.Tensor, unique_env_index: torch.Tensor, env_index_count: torch.Tensor)[source]

Bases: object

env_index: torch.Tensor
env_index_count: torch.Tensor
unique_env_index: torch.Tensor
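
The docstring does not spell out how the three fields relate. A plausible construction, shown below as an assumption rather than the library's code, derives the unique indices and their counts from the raw per-sample env_index tensor with torch.unique:

   import torch

   from mtrl.agent.grad_manipulation import EnvMetadata

   # Hypothetical batch where samples come from environments 0, 1 and 2.
   env_index = torch.tensor([0, 0, 1, 2, 2, 2])
   unique_env_index, env_index_count = torch.unique(env_index, return_counts=True)

   metadata = EnvMetadata(
       env_index=env_index,
       unique_env_index=unique_env_index,
       env_index_count=env_index_count,
   )
   # unique_env_index -> tensor([0, 1, 2]), env_index_count -> tensor([2, 1, 3])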

mtrl.agent.gradnorm module

class mtrl.agent.gradnorm.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.grad_manipulation.Agent

GradNorm algorithm.

mtrl.agent.hipbmdp module

class mtrl.agent.hipbmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.deepmdp.Agent

HiPBMDP Agent

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool)[source]

Get the task encoding for the different environments.

Parameters
  • env_index (TensorType) – environment index.

  • modes (List[str]) –

  • disable_grad (bool) – should disable tracking gradient.

Returns

task encodings.

Return type

TensorType

update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger, step, kwargs_to_compute_gradient: Dict[str, Any])[source]

Update the task encoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.pcgrad module

class mtrl.agent.pcgrad.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.grad_manipulation.Agent

PCGrad algorithm.

get_shuffled_task_indices(num_tasks: int) → numpy.ndarray[source]
mtrl.agent.pcgrad.apply_vector_grad_to_parameters(vec: torch.Tensor, parameters: Iterable[torch.Tensor], accumulate: bool = False)[source]

Apply vector gradients to the parameters

Parameters
  • vec (TensorType) – a single vector represents the gradients of a model.

  • parameters (Iterable[TensorType]) – an iterator of Tensors that are the parameters of a model.
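
To clarify what the helper does, here is a hypothetical re-implementation (the name demo_apply_vector_grad and its exact behaviour are assumptions): it slices a flat gradient vector into per-parameter chunks and writes them into .grad, mirroring torch.nn.utils.vector_to_parameters but targeting gradients. In a PCGrad-style update, per-task gradients are flattened, projected to remove conflicting components, and then written back with a helper of this form:

   import torch
   from torch import nn


   def demo_apply_vector_grad(
       parameters, flat_grad: torch.Tensor, accumulate: bool = False
   ) -> None:
       # Walk the parameters in order, slicing the flat vector into chunks of
       # matching size and writing each chunk into the parameter's .grad field.
       offset = 0
       for param in parameters:
           numel = param.numel()
           chunk = flat_grad[offset : offset + numel].view_as(param)
           if accumulate and param.grad is not None:
               param.grad = param.grad + chunk
           else:
               param.grad = chunk.clone()
           offset += numel


   # Example: distribute a zero gradient over a tiny model's parameters.
   model = nn.Linear(3, 2)
   total = sum(p.numel() for p in model.parameters())
   demo_apply_vector_grad(model.parameters(), torch.zeros(total))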

mtrl.agent.sac module

class mtrl.agent.sac.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

SAC algorithm.

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

act(multitask_obs: Dict[str, torch.Tensor], modes: List[str], sample: bool) → numpy.ndarray[source]

Select/sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes in which to select the action.

  • sample (bool) – sample (if True) or select (if False) an action.

Returns

selected/sampled action.

Return type

np.ndarray

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_alpha(env_index: torch.Tensor) → torch.Tensor[source]

Get the alpha value for the given environments.

Parameters

env_index (TensorType) – environment index.

Returns

alpha values.

Return type

TensorType

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_parameters(name: str) → List[torch.nn.parameter.Parameter][source]

Get parameters corresponding to a given component.

Parameters

name (str) – name of the component.

Returns

list of parameters.

Return type

List[torch.nn.parameter.Parameter]

get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool) → torch.Tensor[source]

Get the task encoding for the different environments.

Parameters
  • env_index (TensorType) – environment index.

  • modes (List[str]) –

  • disable_grad (bool) – should disable tracking gradient.

Returns

task encodings.

Return type

TensorType

get_task_info(task_encoding: torch.Tensor, component_name: str, env_index: torch.Tensor) → mtrl.agent.ds.task_info.TaskInfo[source]

Encode task encoding into task info.

Parameters
  • task_encoding (TensorType) – encoding of the task.

  • component_name (str) – name of the component.

  • env_index (TensorType) – index of the environment.

Returns

TaskInfo object.

Return type

TaskInfo

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training: bool = True) → None[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray
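
A hypothetical sketch of how sample_action and update fit together in a training loop is given below. The env, replay_buffer, and logger objects are assumed to follow mtrl's usual interfaces (a vectorised multitask environment, ReplayBuffer, and Logger); only the agent-facing calls are taken from the documented API, and the mode string "train" is illustrative:

   from mtrl.agent.sac import Agent as SACAgent


   def run_training(agent: SACAgent, env, replay_buffer, logger, num_steps: int) -> None:
       # The surrounding experiment code is assumed to push transitions into
       # replay_buffer; this sketch only shows the agent-facing calls.
       multitask_obs = env.reset()  # assumed gym-style multitask env
       for step in range(num_steps):
           # Stochastic actions while training.
           action = agent.sample_action(multitask_obs, modes=["train"])
           multitask_obs = env.step(action)[0]
           # One gradient update per environment step; the indices of the
           # sampled batch are returned and could be reused across updates.
           sampled_indices = agent.update(replay_buffer, logger, step)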

update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the actor and alpha component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_critic(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the critic component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the task encoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the transition model and reward decoder.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.sac_ae module

class mtrl.agent.sac_ae.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.sac.Agent

SAC+AE algorithm.

The key constructor parameters, shared with the abstract agent class, are listed below.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • device (torch.device) – device for the agent.

update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]

Update the decoder component.

Parameters
  • batch (ReplayBufferSample) – batch from the replay buffer.

  • task_info (TaskInfo) – task_info object.

  • logger ([Logger]) – logger object.

  • step (int) – step for tracking the training of the agent.

  • kwargs_to_compute_gradient (Dict[str, Any]) –

mtrl.agent.utils module

mtrl.agent.utils.build_mlp(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]

Utility function to build an MLP model. This assumes all the hidden layers use the same dimensionality.

Parameters
  • input_dim (int) – input dimension.

  • hidden_dim (int) – dimension of the hidden layers.

  • output_dim (int) – dimension of the output layer.

  • num_layers (int) – number of layers in the mlp.

Returns

the constructed MLP model.

Return type

ModelType
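
The return value is not described above. As an illustration of the kind of network such a helper typically produces (an assumption about its shape, not the library's exact code), an equivalent construction would interleave Linear layers with ReLU activations:

   from torch import nn


   def illustrative_build_mlp(
       input_dim: int, hidden_dim: int, output_dim: int, num_layers: int
   ) -> nn.Module:
       # num_layers is treated here as the number of hidden layers; with
       # num_layers == 0 the model collapses to a single Linear layer.
       if num_layers == 0:
           return nn.Linear(input_dim, output_dim)
       layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
       for _ in range(num_layers - 1):
           layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
       layers.append(nn.Linear(hidden_dim, output_dim))
       return nn.Sequential(*layers)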

mtrl.agent.utils.build_mlp_as_module_list(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]

Utility function to build an MLP as a module list of layers. This assumes all the hidden layers use the same dimensionality.

Parameters
  • input_dim (int) – input dimension.

  • hidden_dim (int) – dimension of the hidden layers.

  • output_dim (int) – dimension of the output layer.

  • num_layers (int) – number of layers in the mlp.

Returns

the constructed MLP, built as a module list of layers.

Return type

ModelType

class mtrl.agent.utils.eval_mode(*models)[source]

Bases: object

Put the given models in evaluation mode.
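
eval_mode is typically used as a context manager around evaluation-time action selection, temporarily switching the wrapped modules to eval mode and restoring their previous mode on exit (this usage pattern is an assumption based on similar SAC codebases; check the source for the exact semantics):

   import torch
   from torch import nn

   from mtrl.agent.utils import eval_mode

   actor = nn.Linear(4, 2)  # stand-in for an actual actor network
   with eval_mode(actor):
       # Inside the block the module is expected to be in eval mode
       # (actor.training == False); the previous mode is restored on exit.
       action = actor(torch.zeros(4))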

mtrl.agent.utils.preprocess_obs(obs: torch.Tensor, bits=5) → torch.Tensor[source]

Preprocess the image observation; see https://arxiv.org/abs/1807.03039.
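
The referenced paper uses bit-depth reduction with uniform dequantisation noise. An illustrative version of that standard preprocessing is sketched below; the library's implementation may differ in details such as the assumed input range:

   import torch


   def illustrative_preprocess_obs(obs: torch.Tensor, bits: int = 5) -> torch.Tensor:
       # Assumes a float tensor of raw 8-bit pixel values in [0, 255].
       # Quantise down to `bits` bits, add uniform dequantisation noise,
       # and centre the result around zero.
       bins = 2 ** bits
       obs = torch.floor(obs / 2 ** (8 - bits))
       obs = obs / bins
       obs = obs + torch.rand_like(obs) / bins
       return obs - 0.5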

mtrl.agent.utils.set_seed_everywhere(seed: int) → None[source]

Set seed for reproducibility.

Parameters

seed (int) – seed.

mtrl.agent.utils.soft_update_params(net: torch.nn.modules.module.Module, target_net: torch.nn.modules.module.Module, tau: float) → None[source]

Perform a soft update on the net using the target net.

Parameters
  • net ([ModelType]) – model to update.

  • target_net (ModelType) – model to update with.

  • tau (float) – control the extent of update.
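
In SAC-style code the soft (Polyak) update is target ← tau * net + (1 - tau) * target. The sketch below restates that standard formula for reference; it is illustrative and not necessarily the exact mtrl implementation:

   import torch
   from torch import nn


   @torch.no_grad()
   def illustrative_soft_update(net: nn.Module, target_net: nn.Module, tau: float) -> None:
       # Move each target parameter a fraction `tau` of the way toward the
       # corresponding online-network parameter (Polyak averaging).
       for param, target_param in zip(net.parameters(), target_net.parameters()):
           target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)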

mtrl.agent.utils.weight_init(m: torch.nn.modules.module.Module)[source]

Custom weight init for Conv2D and Linear layers.

mtrl.agent.utils.weight_init_conv(m: torch.nn.modules.module.Module)[source]
mtrl.agent.utils.weight_init_linear(m: torch.nn.modules.module.Module)[source]
mtrl.agent.utils.weight_init_moe_layer(m: torch.nn.modules.module.Module)[source]

mtrl.agent.wrapper module

class mtrl.agent.wrapper.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]

Bases: mtrl.agent.abstract.Agent

This wrapper agent wraps other agents. It is useful for algorithms like PCGrad and GradNorm that can be used with many policies.

Parameters
  • env_obs_shape (List[int]) – shape of the environment observation that the actor gets.

  • action_shape (List[int]) – shape of the action vector that the actor produces.

  • action_range (Tuple[int, int]) – min and max values for the action vector.

  • multitask_cfg (ConfigType) – config for encoding the multitask knowledge.

  • agent_cfg (ConfigType) – config for the agent that is wrapped over.

  • device (torch.device) – device for the agent.

  • cfg_to_load_model (Optional[ConfigType], optional) – config to load the model from filesystem. Defaults to None.

  • should_complete_init (bool, optional) – should call complete_init method. Defaults to True.
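
Because the wrapped policy is described entirely by agent_cfg, wrapper-style agents are naturally built through Hydra. The sketch below is hypothetical: the config file path and its contents are assumptions, and only the constructor arguments come from the documentation above:

   import hydra
   import torch
   from omegaconf import OmegaConf


   def build_wrapper_agent(cfg_path: str):
       # cfg_path is assumed to point at a config whose `_target_` is
       # mtrl.agent.wrapper.Agent (or a subclass such as the PCGrad agent) and
       # which nests the agent_cfg / multitask_cfg entries.
       cfg = OmegaConf.load(cfg_path)
       return hydra.utils.instantiate(
           cfg,
           env_obs_shape=[12],
           action_shape=[4],
           action_range=(-1, 1),
           device=torch.device("cpu"),
       )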

complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]

Complete the init process.

The derived classes should implement this to perform different post-processing steps.

Parameters

cfg_to_load_model (ConfigType) – config to load the model.

get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]

Get the list of tuples of (model, name) from the agent to checkpoint.

Returns

list of tuples of (model, name).

Return type

List[Tuple[ModelType, str]]

get_last_shared_layers(component_name: str) → Optional[List[torch.nn.modules.module.Module]][source]

Get the last shared layer for any given component.

Parameters

component_name (str) – given component.

Returns

list of layers.

Return type

List[ModelType]

get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]

Get the list of tuples of (optimizer, name) from the agent to checkpoint.

Returns

list of tuples of (optimizer, name).

Return type

List[Tuple[OptimizerType, str]]

load(model_dir: Optional[str], step: Optional[int]) → None[source]

Load the agent.

Parameters
  • model_dir (Optional[str]) – directory to load the model from.

  • step (Optional[int]) – step for tracking the training of the agent.

sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Sample the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for sampling the action.

Returns

sampled action.

Return type

np.ndarray

save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]

Save the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

  • should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.

save_components(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different components of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]

Save the different optimizers of the agent.

Parameters
  • model_dir (str) – directory to save.

  • step (int) – step for tracking the training of the agent.

  • retain_last_n (int) – number of models to retain.

select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]

Select the action to perform.

Parameters
  • multitask_obs (ObsType) – Observation from the multitask environment.

  • modes (List[str]) – modes for selecting the action.

Returns

selected action.

Return type

np.ndarray

train(training: bool = True)[source]

Set the agent in training/evaluation mode

Parameters

training (bool, optional) – should set in training mode. Defaults to True.

update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]

Update the agent.

Parameters
  • replay_buffer (ReplayBuffer) – replay buffer to sample the data.

  • logger (Logger) – logger for logging.

  • step (int) – step for tracking the training progress.

  • kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – Defaults to None.

  • buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample a fresh batch from the replay buffer. Defaults to None.

Returns

indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, it is returned unchanged.

Return type

np.ndarray

Module contents