mtrl.agent package
Subpackages
- mtrl.agent.components package
- Submodules
- mtrl.agent.components.actor module
- mtrl.agent.components.base module
- mtrl.agent.components.critic module
- mtrl.agent.components.decoder module
- mtrl.agent.components.encoder module
- mtrl.agent.components.hipbmdp_theta module
- mtrl.agent.components.moe_layer module
- mtrl.agent.components.reward_decoder module
- mtrl.agent.components.scripted_soft_modularization module
- mtrl.agent.components.soft_modularization module
- mtrl.agent.components.task_encoder module
- mtrl.agent.components.transition_model module
- Module contents
- mtrl.agent.ds package
Submodules
mtrl.agent.abstract module
Interface for the agent.
class mtrl.agent.abstract.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device)[source]
Bases: abc.ABC
Abstract agent class that every other agent should extend.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
device (torch.device) – device for the agent.
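To make the contract concrete, here is a minimal sketch of a concrete subclass that implements every abstract method. The RandomAgent name and the uniform-random bodies are illustrative assumptions, not part of mtrl:

    import numpy as np

    from mtrl.agent.abstract import Agent


    class RandomAgent(Agent):
        """Illustration only: a toy agent that acts uniformly at random."""

        def __init__(self, env_obs_shape, action_shape, action_range, multitask_cfg, device):
            super().__init__(env_obs_shape, action_shape, action_range, multitask_cfg, device)
            # Stored locally so the sketch does not rely on base-class attributes.
            self._action_shape = action_shape
            self._action_range = action_range

        def complete_init(self, cfg_to_load_model) -> None:
            pass  # e.g. load a checkpoint here if the config asks for one

        def sample_action(self, multitask_obs, modes):
            # Stochastic action (training).
            low, high = self._action_range
            return np.random.uniform(low, high, size=self._action_shape)

        def select_action(self, multitask_obs, modes):
            # Deterministic action (evaluation); here just the midpoint.
            low, high = self._action_range
            return np.full(self._action_shape, (low + high) / 2.0)

        def train(self, training: bool = True) -> None:
            pass  # no learnable components in this toy agent

        def update(self, replay_buffer, logger, step, kwargs_to_compute_gradient=None, buffer_index_to_sample=None):
            return buffer_index_to_sample  # nothing to learn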
abstract complete_init(cfg_to_load_model: omegaconf.dictconfig.DictConfig) → None[source]
Complete the init process.
The derived classes should implement this to perform different post-processing steps.
- Parameters
cfg_to_load_model (ConfigType) – config to load the model.
get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]
Get the list of tuples of (model, name) from the agent to checkpoint.
- Returns
list of tuples of (model, name).
- Return type
List[Tuple[ModelType, str]]
get_last_shared_layers(component_name: str)[source]
Get the last shared layer for any given component.
- Parameters
component_name (str) – given component.
- Returns
list of layers.
- Return type
List[ModelType]
get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]
Get the list of tuples of (optimizer, name) from the agent to checkpoint.
- Returns
list of tuples of (optimizer, name).
- Return type
List[Tuple[OptimizerType, str]]
load(model_dir: Optional[str], step: Optional[int]) → None[source]
Load the agent.
- Parameters
model_dir (Optional[str]) – directory to load the model from.
step (Optional[int]) – step for tracking the training of the agent.
load_latest_step(model_dir: str) → int[source]
Load the agent using the latest training step.
- Parameters
model_dir (str) – directory to load the model from.
- Returns
step for tracking the training of the agent.
- Return type
int
load_metadata(model_dir: str) → Optional[Dict[Any, Any]][source]
Load the metadata of the agent.
- Parameters
model_dir (str) – directory to load the model from.
- Returns
metadata.
- Return type
Optional[Dict[Any, Any]]
abstract sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Sample the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for sampling the action.
- Returns
sampled action.
- Return type
np.ndarray
save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]
Save the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
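The on-disk checkpoint layout is not documented on this page. As a standalone illustration of retain_last_n-style pruning, here is a sketch that assumes a hypothetical <name>_<step>.pt naming scheme, not mtrl's actual format:

    from pathlib import Path


    def prune_checkpoints(model_dir: str, name: str, retain_last_n: int) -> None:
        # Assumes retain_last_n > 0 and files saved as f"{name}_{step}.pt".
        paths = sorted(
            Path(model_dir).glob(f"{name}_*.pt"),
            key=lambda p: int(p.stem.rsplit("_", 1)[1]),
        )
        for stale in paths[:-retain_last_n]:
            stale.unlink()  # keep only the newest retain_last_n checkpoints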
save_components(model_dir: str, step: int, retain_last_n: int) → None[source]
Save the different components of the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
save_components_or_optimizers(component_or_optimizer_list: Union[List[Tuple[torch.nn.modules.module.Module, str]], List[Tuple[torch.optim.optimizer.Optimizer, str]]], model_dir: str, step: int, retain_last_n: int, suffix: str = '') → None[source]
Save the components and optimizers from the given list.
- Parameters
component_or_optimizer_list (Union[List[Tuple[ComponentType, str]], List[Tuple[OptimizerType, str]]]) – list of components and optimizers to save.
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
suffix (str, optional) – suffix to add to the name of the model before checkpointing. Defaults to “”.
save_metadata(model_dir: str, step: int) → None[source]
Save the metadata.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]
Save the different optimizers of the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
abstract select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Select the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for selecting the action.
- Returns
selected action.
- Return type
np.ndarray
abstract train(training: bool = True) → None[source]
Set the agent in training/evaluation mode.
- Parameters
training (bool, optional) – should set in training mode. Defaults to True.
abstract update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray
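Putting the interface together, a training loop would drive an Agent roughly as below. This is an assumed usage sketch: num_train_steps, save_freq, the "train" mode string, and the environment-stepping details are placeholders, not mtrl code.

    # Hypothetical outer loop over a concrete Agent implementation.
    for step in range(num_train_steps):
        action = agent.sample_action(multitask_obs, modes=["train"])
        # ... step the environment and add the transition to replay_buffer ...
        agent.update(replay_buffer, logger, step)
        if step % save_freq == 0:
            agent.save(model_dir, step=step, retain_last_n=5)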
mtrl.agent.deepmdp module
class mtrl.agent.deepmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.sac_ae.Agent
DeepMDP Agent.
Abstract agent class that every other agent should extend.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
device (torch.device) – device for the agent.
update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]
Update the decoder component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any])[source]
Update the transition model and reward decoder.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
mtrl.agent.distral module
class mtrl.agent.distral.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, distral_alpha: float, distral_beta: float, agent_index_to_task_index: List[str], distilled_agent_cfg: omegaconf.dictconfig.DictConfig, task_agent_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.abstract.Agent
Distral algorithm.
complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]
Complete the init process.
The derived classes should implement this to perform different post-processing steps.
- Parameters
cfg_to_load_model (ConfigType) – config to load the model.
load(model_dir: Optional[str], step: Optional[int]) → None[source]
Load the agent.
- Parameters
model_dir (Optional[str]) – directory to load the model from.
step (Optional[int]) – step for tracking the training of the agent.
load_latest_step(model_dir: str) → int[source]
Load the agent using the latest training step.
- Parameters
model_dir (str) – directory to load the model from.
- Returns
step for tracking the training of the agent.
- Return type
int
sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Used during training.
save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]
Save the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Used during testing.
train(training: bool = True) → None[source]
Set the agent in training/evaluation mode.
- Parameters
training (bool, optional) – should set in training mode. Defaults to True.
update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray
class mtrl.agent.distral.DistilledAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.abstract.Agent
Centroid policy for Distral.
complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig]) → None[source]
Complete the init process.
The derived classes should implement this to perform different post-processing steps.
- Parameters
cfg_to_load_model (ConfigType) – config to load the model.
load(model_dir: Optional[str], step: Optional[int]) → None[source]
Load the agent.
- Parameters
model_dir (Optional[str]) – directory to load the model from.
step (Optional[int]) – step for tracking the training of the agent.
load_latest_step(model_dir: str) → int[source]
Load the agent using the latest training step.
- Parameters
model_dir (str) – directory to load the model from.
- Returns
step for tracking the training of the agent.
- Return type
int
sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]
Sample the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for sampling the action.
- Returns
sampled action.
- Return type
np.ndarray
save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]
Save the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]
Select the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for selecting the action.
- Returns
selected action.
- Return type
np.ndarray
train(training=True) → None[source]
Set the agent in training/evaluation mode.
- Parameters
training (bool, optional) – should set in training mode. Defaults to True.
update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray
class mtrl.agent.distral.TaskAgent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, index: int, env_index: int, distral_alpha: float, distral_beta: float, distilled_agent: mtrl.agent.distral.DistilledAgent, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.wrapper.Agent
Wrapper class for the task-specific agent.
load(model_dir: Optional[str], step: Optional[int]) → None[source]
Load the agent.
- Parameters
model_dir (Optional[str]) – directory to load the model from.
step (Optional[int]) – step for tracking the training of the agent.
load_latest_step(model_dir: str) → int[source]
Load the agent using the latest training step.
- Parameters
model_dir (str) – directory to load the model from.
- Returns
step for tracking the training of the agent.
- Return type
int
save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]
Save the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the actor and alpha component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
mtrl.agent.distral.gaussian_kld(mean1: torch.Tensor, logvar1: torch.Tensor, mean2: torch.Tensor, logvar2: torch.Tensor) → torch.Tensor[source]
Compute the KL divergence between univariate Gaussian distributions with the given means and log-variances, i.e. KL(N(mean1, logvar1) || N(mean2, logvar2)).
- Parameters
mean1 (TensorType) – mean of the first Gaussian.
logvar1 (TensorType) – log-variance of the first Gaussian.
mean2 (TensorType) – mean of the second Gaussian.
logvar2 (TensorType) – log-variance of the second Gaussian.
- Returns
the KL divergence.
- Return type
TensorType
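The closed-form KL between two univariate Gaussians parameterized by mean and log-variance is standard; below is a minimal PyTorch sketch of the elementwise computation (mtrl's exact reduction over dimensions may differ):

    import torch


    def gaussian_kld_sketch(mean1, logvar1, mean2, logvar2):
        # KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))), elementwise:
        # 0.5 * (logvar2 - logvar1 + (var1 + (mean1 - mean2)^2) / var2 - 1)
        var1, var2 = logvar1.exp(), logvar2.exp()
        return 0.5 * (logvar2 - logvar1 + (var1 + (mean1 - mean2) ** 2) / var2 - 1.0)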
mtrl.agent.grad_manipulation module
class mtrl.agent.grad_manipulation.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.wrapper.Agent
Base class for gradient manipulation algorithms.
update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray
mtrl.agent.gradnorm module
class mtrl.agent.gradnorm.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.grad_manipulation.Agent
GradNorm algorithm.
mtrl.agent.hipbmdp module
class mtrl.agent.hipbmdp.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_cfg: omegaconf.dictconfig.DictConfig, transition_model_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, reward_decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, transition_model_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float = 0.99, init_temperature: float = 0.01, actor_update_freq: int = 2, critic_tau: float = 0.005, critic_target_update_freq: int = 2, encoder_tau: float = 0.005, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.deepmdp.Agent
HiPBMDP Agent.
Abstract agent class that every other agent should extend.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
device (torch.device) – device for the agent.
get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool)[source]
Get the task encoding for the different environments.
- Parameters
env_index (TensorType) – environment index.
modes (List[str]) –
disable_grad (bool) – should disable tracking gradient.
- Returns
task encodings.
- Return type
TensorType
update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger, step, kwargs_to_compute_gradient: Dict[str, Any])[source]
Update the task encoder component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
mtrl.agent.pcgrad module
class mtrl.agent.pcgrad.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, agent_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.grad_manipulation.Agent
PCGrad algorithm.
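For orientation, PCGrad (Yu et al., 2020) resolves conflicts between per-task gradients by projecting each gradient off any other task gradient it conflicts with (negative dot product). A minimal standalone sketch of that projection, not mtrl's implementation:

    import random

    import torch


    def pcgrad_sketch(task_grads):
        """task_grads: list of flattened (1-D) per-task gradient tensors."""
        projected = [g.clone() for g in task_grads]
        for g_i in projected:
            others = list(task_grads)
            random.shuffle(others)  # the paper projects in random task order
            for g_j in others:
                dot = torch.dot(g_i, g_j)
                if dot < 0.0:  # conflict: remove the component of g_i along g_j
                    g_i.sub_(g_j * (dot / g_j.norm() ** 2))
        # Combine the projected gradients for the shared parameters.
        return torch.stack(projected).sum(dim=0)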
mtrl.agent.pcgrad.apply_vector_grad_to_parameters(vec: torch.Tensor, parameters: Iterable[torch.Tensor], accumulate: bool = False)[source]
Apply vector gradients to the parameters.
- Parameters
vec (TensorType) – a single vector representing the gradients of a model.
parameters (Iterable[TensorType]) – an iterator of Tensors that are the parameters of a model.
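This mirrors torch.nn.utils.vector_to_parameters, except that slices of the flat vector are written into each parameter's .grad field rather than its data. A sketch under that assumption:

    import torch


    def apply_vector_grad_sketch(vec, parameters, accumulate: bool = False):
        # Walk the parameters in order, carving out matching slices of `vec`.
        pointer = 0
        for param in parameters:
            num = param.numel()
            grad = vec[pointer : pointer + num].view_as(param).detach().clone()
            if accumulate and param.grad is not None:
                param.grad.add_(grad)  # add to the existing gradients
            else:
                param.grad = grad      # overwrite (or set) the gradients
            pointer += num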
mtrl.agent.sac module
class mtrl.agent.sac.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.abstract.Agent
SAC algorithm.
Abstract agent class that every other agent should extend.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
device (torch.device) – device for the agent.
act(multitask_obs: Dict[str, torch.Tensor], modes: List[str], sample: bool) → numpy.ndarray[source]
Select/sample the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes in which to select the action.
sample (bool) – sample (if True) or select (if False) an action.
- Returns
selected/sampled action.
- Return type
np.ndarray
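As a usage sketch: judging from the signatures, sample_action(...) and select_action(...) below wrap act(...) with sample=True and sample=False respectively. The observation keys used here ("env_obs", "task_obs") and the shapes are assumptions about the multitask observation format:

    import torch

    # Hypothetical batched multitask observation: one row per environment.
    multitask_obs = {
        "env_obs": torch.randn(10, 9),   # assumed key and shape
        "task_obs": torch.arange(10),    # assumed key: one task id per env
    }
    action = agent.sample_action(multitask_obs, modes=["train"])  # stochastic
    action = agent.select_action(multitask_obs, modes=["eval"])   # deterministic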
complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]
Complete the init process.
The derived classes should implement this to perform different post-processing steps.
- Parameters
cfg_to_load_model (ConfigType) – config to load the model.
get_alpha(env_index: torch.Tensor) → torch.Tensor[source]
Get the alpha value for the given environments.
- Parameters
env_index (TensorType) – environment index.
- Returns
alpha values.
- Return type
TensorType
get_last_shared_layers(component_name: str)[source]
Get the last shared layer for any given component.
- Parameters
component_name (str) – given component.
- Returns
list of layers.
- Return type
List[ModelType]
get_parameters(name: str) → List[torch.nn.parameter.Parameter][source]
Get parameters corresponding to a given component.
- Parameters
name (str) – name of the component.
- Returns
list of parameters.
- Return type
List[torch.nn.parameter.Parameter]
get_task_encoding(env_index: torch.Tensor, modes: List[str], disable_grad: bool) → torch.Tensor[source]
Get the task encoding for the different environments.
- Parameters
env_index (TensorType) – environment index.
modes (List[str]) –
disable_grad (bool) – should disable tracking gradient.
- Returns
task encodings.
- Return type
TensorType
get_task_info(task_encoding: torch.Tensor, component_name: str, env_index: torch.Tensor) → mtrl.agent.ds.task_info.TaskInfo[source]
Encode task encoding into task info.
- Parameters
task_encoding (TensorType) – encoding of the task.
component_name (str) – name of the component.
env_index (TensorType) – index of the environment.
- Returns
TaskInfo object.
- Return type
TaskInfo
sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Sample the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for sampling the action.
- Returns
sampled action.
- Return type
np.ndarray
select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str]) → numpy.ndarray[source]
Select the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for selecting the action.
- Returns
selected action.
- Return type
np.ndarray
train(training: bool = True) → None[source]
Set the agent in training/evaluation mode.
- Parameters
training (bool, optional) – should set in training mode. Defaults to True.
update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None) → numpy.ndarray[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray
update_actor_and_alpha(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the actor and alpha component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
update_critic(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the critic component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the decoder component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
update_task_encoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the task encoder component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
update_transition_reward_model(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the transition model and reward decoder.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
mtrl.agent.sac_ae module
class mtrl.agent.sac_ae.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], device: torch.device, actor_cfg: omegaconf.dictconfig.DictConfig, critic_cfg: omegaconf.dictconfig.DictConfig, decoder_cfg: omegaconf.dictconfig.DictConfig, alpha_optimizer_cfg: omegaconf.dictconfig.DictConfig, actor_optimizer_cfg: omegaconf.dictconfig.DictConfig, critic_optimizer_cfg: omegaconf.dictconfig.DictConfig, multitask_cfg: omegaconf.dictconfig.DictConfig, decoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, encoder_optimizer_cfg: omegaconf.dictconfig.DictConfig, discount: float, init_temperature: float, actor_update_freq: int, critic_tau: float, critic_target_update_freq: int, encoder_tau: float, loss_reduction: str = 'mean', decoder_update_freq: int = 1, decoder_latent_lambda: float = 0.0, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.sac.Agent
SAC+AE algorithm.
Abstract agent class that every other agent should extend.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
device (torch.device) – device for the agent.
update_decoder(batch: mtrl.replay_buffer.ReplayBufferSample, task_info: mtrl.agent.ds.task_info.TaskInfo, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Dict[str, Any]) → None[source]
Update the decoder component.
- Parameters
batch (ReplayBufferSample) – batch from the replay buffer.
task_info (TaskInfo) – task_info object.
logger (Logger) – logger object.
step (int) – step for tracking the training of the agent.
kwargs_to_compute_gradient (Dict[str, Any]) – keyword arguments to use when computing the gradient.
mtrl.agent.utils module
mtrl.agent.utils.build_mlp(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]
Utility function to build an MLP model. All hidden layers are assumed to use the same dimensionality.
- Parameters
input_dim (int) – input dimension.
hidden_dim (int) – dimension of the hidden layers.
output_dim (int) – dimension of the output layer.
num_layers (int) – number of layers in the MLP.
- Returns
the constructed MLP model.
- Return type
ModelType
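A conventional implementation of such a builder is sketched below; the ReLU activation and the reading of num_layers as the number of hidden layers (assumed >= 1) are assumptions that may differ from mtrl's version.

    import torch.nn as nn


    def build_mlp_sketch(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) -> nn.Module:
        # input -> hidden (x num_layers, all of width hidden_dim) -> output.
        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, output_dim))
        return nn.Sequential(*layers)

For example, build_mlp_sketch(10, 64, 4, 2) produces a 10 → 64 → 64 → 4 network.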
mtrl.agent.utils.build_mlp_as_module_list(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int) → torch.nn.modules.module.Module[source]
Utility function to build an MLP as a module list of layers. All hidden layers are assumed to use the same dimensionality.
- Parameters
input_dim (int) – input dimension.
hidden_dim (int) – dimension of the hidden layers.
output_dim (int) – dimension of the output layer.
num_layers (int) – number of layers in the MLP.
- Returns
the constructed model.
- Return type
ModelType
mtrl.agent.utils.preprocess_obs(obs: torch.Tensor, bits=5) → torch.Tensor[source]
Preprocess an image observation; see https://arxiv.org/abs/1807.03039.
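The cited reference (Glow) preprocesses images by reducing bit depth and adding dequantization noise. The sketch below follows that recipe for uint8-range pixels and may differ in detail from mtrl's version:

    import torch


    def preprocess_obs_sketch(obs: torch.Tensor, bits: int = 5) -> torch.Tensor:
        # Quantize uint8-range pixels to 2**bits levels, add uniform noise,
        # and roughly center the result.
        bins = 2 ** bits
        obs = obs.float()
        obs = torch.floor(obs / 2 ** (8 - bits))  # drop the low-order bits
        obs = obs / bins
        obs = obs + torch.rand_like(obs) / bins   # dequantization noise
        return obs - 0.5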
mtrl.agent.utils.set_seed_everywhere(seed: int) → None[source]
Set seed for reproducibility.
- Parameters
seed (int) – seed.
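A conventional implementation seeds Python's random module, NumPy, and PyTorch (including all CUDA devices); a sketch:

    import random

    import numpy as np
    import torch


    def set_seed_everywhere_sketch(seed: int) -> None:
        torch.manual_seed(seed)           # CPU RNG (also seeds CUDA RNGs)
        torch.cuda.manual_seed_all(seed)  # all visible GPUs, explicitly
        np.random.seed(seed)
        random.seed(seed)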
mtrl.agent.utils.soft_update_params(net: torch.nn.modules.module.Module, target_net: torch.nn.modules.module.Module, tau: float) → None[source]
Perform a soft (Polyak) update of target_net using net.
- Parameters
net (ModelType) – model to update from.
target_net (ModelType) – model that is updated (moved toward net).
tau (float) – controls the extent of the update.
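This is the usual Polyak averaging step; a sketch consistent with the description above:

    import torch


    @torch.no_grad()
    def soft_update_params_sketch(net, target_net, tau: float) -> None:
        # target <- tau * online + (1 - tau) * target, parameter by parameter.
        for param, target_param in zip(net.parameters(), target_net.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)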
mtrl.agent.wrapper module
class mtrl.agent.wrapper.Agent(env_obs_shape: List[int], action_shape: List[int], action_range: Tuple[int, int], multitask_cfg: omegaconf.dictconfig.DictConfig, agent_cfg: omegaconf.dictconfig.DictConfig, device: torch.device, cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig] = None, should_complete_init: bool = True)[source]
Bases: mtrl.agent.abstract.Agent
This wrapper agent wraps over other agents. It is useful for algorithms like PCGrad and GradNorm that can be used with many policies; the delegation pattern is sketched after the parameter list below.
- Parameters
env_obs_shape (List[int]) – shape of the environment observation that the actor gets.
action_shape (List[int]) – shape of the action vector that the actor produces.
action_range (Tuple[int, int]) – min and max values for the action vector.
multitask_cfg (ConfigType) – config for encoding the multitask knowledge.
agent_cfg (ConfigType) – config for the agent that is wrapped over.
device (torch.device) – device for the agent.
cfg_to_load_model (Optional[ConfigType], optional) – config to load the model from filesystem. Defaults to None.
should_complete_init (bool, optional) – should call complete_init method. Defaults to True.
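The methods below mostly forward to the wrapped agent. A minimal sketch of that pattern, where the self.agent attribute name and its construction from agent_cfg are assumptions:

    class DelegatingWrapperSketch:
        """Illustration only: forwards calls to the wrapped agent."""

        def __init__(self, agent):
            self.agent = agent  # assumed attribute holding the agent built from agent_cfg

        def select_action(self, multitask_obs, modes):
            return self.agent.select_action(multitask_obs, modes)

        def update(self, replay_buffer, logger, step, **kwargs):
            return self.agent.update(replay_buffer, logger, step, **kwargs)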
complete_init(cfg_to_load_model: Optional[omegaconf.dictconfig.DictConfig])[source]
Complete the init process.
The derived classes should implement this to perform different post-processing steps.
- Parameters
cfg_to_load_model (ConfigType) – config to load the model.
get_component_name_list_for_checkpointing() → List[Tuple[torch.nn.modules.module.Module, str]][source]
Get the list of tuples of (model, name) from the agent to checkpoint.
- Returns
list of tuples of (model, name).
- Return type
List[Tuple[ModelType, str]]
get_last_shared_layers(component_name: str)[source]
Get the last shared layer for any given component.
- Parameters
component_name (str) – given component.
- Returns
list of layers.
- Return type
List[ModelType]
get_optimizer_name_list_for_checkpointing() → List[Tuple[torch.optim.optimizer.Optimizer, str]][source]
Get the list of tuples of (optimizer, name) from the agent to checkpoint.
- Returns
list of tuples of (optimizer, name).
- Return type
List[Tuple[OptimizerType, str]]
load(model_dir: Optional[str], step: Optional[int]) → None[source]
Load the agent.
- Parameters
model_dir (Optional[str]) – directory to load the model from.
step (Optional[int]) – step for tracking the training of the agent.
sample_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]
Sample the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for sampling the action.
- Returns
sampled action.
- Return type
np.ndarray
save(model_dir: str, step: int, retain_last_n: int, should_save_metadata: bool = True) → None[source]
Save the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
should_save_metadata (bool, optional) – should training metadata be saved. Defaults to True.
save_components(model_dir: str, step: int, retain_last_n: int) → None[source]
Save the different components of the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
save_optimizers(model_dir: str, step: int, retain_last_n: int) → None[source]
Save the different optimizers of the agent.
- Parameters
model_dir (str) – directory to save.
step (int) – step for tracking the training of the agent.
retain_last_n (int) – number of models to retain.
select_action(multitask_obs: Dict[str, torch.Tensor], modes: List[str])[source]
Select the action to perform.
- Parameters
multitask_obs (ObsType) – Observation from the multitask environment.
modes (List[str]) – modes for selecting the action.
- Returns
selected action.
- Return type
np.ndarray
train(training: bool = True)[source]
Set the agent in training/evaluation mode.
- Parameters
training (bool, optional) – should set in training mode. Defaults to True.
update(replay_buffer: mtrl.replay_buffer.ReplayBuffer, logger: mtrl.logger.Logger, step: int, kwargs_to_compute_gradient: Optional[Dict[str, Any]] = None, buffer_index_to_sample: Optional[numpy.ndarray] = None)[source]
Update the agent.
- Parameters
replay_buffer (ReplayBuffer) – replay buffer to sample the data.
logger (Logger) – logger for logging.
step (int) – step for tracking the training progress.
kwargs_to_compute_gradient (Optional[Dict[str, Any]], optional) – keyword arguments to use when computing the gradient. Defaults to None.
buffer_index_to_sample (Optional[np.ndarray], optional) – if specified, use these indices instead of sampling from the replay buffer; if None, sample from the replay buffer. Defaults to None.
- Returns
indices sampled (from the replay buffer) to train the model. If buffer_index_to_sample is not None, buffer_index_to_sample is returned.
- Return type
np.ndarray