Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)
Overview
PPO-DNA is a more sample-efficient variant of PPO that uses separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.
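As a rough illustration of what "separate optimizers and hyperparameters" means, the toy sketch below keeps two disjoint parameter sets, each updated by its own optimizer with its own learning rate and epoch count. Everything in it is hypothetical: PlainSGD, quadratic_grads, the learning rates, and the epoch counts are stand-ins chosen for the example, not values or names from ppo_dna_atari_envpool.py. In a PyTorch script the same idea would typically be two network modules, each passed to its own torch.optim optimizer.

```python
from dataclasses import dataclass


@dataclass
class PlainSGD:
    """Minimal stand-in for an optimizer: owns one parameter list and one learning rate."""
    params: list
    lr: float

    def step(self, grads):
        # In-place gradient descent on this optimizer's parameters only.
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g


def quadratic_grads(params, target):
    # Gradient of sum((p - target)^2); a toy loss standing in for the real objectives.
    return [2.0 * (p - target) for p in params]


# Two disjoint parameter sets: nothing is shared between actor and critic.
actor_params = [0.0, 0.0]
critic_params = [0.0, 0.0]

# Hypothetical, illustrative hyperparameters (not the script's defaults).
actor_opt = PlainSGD(actor_params, lr=0.1)
critic_opt = PlainSGD(critic_params, lr=0.25)
ACTOR_EPOCHS, CRITIC_EPOCHS = 2, 1

# Policy phase: only the actor's parameters move.
for _ in range(ACTOR_EPOCHS):
    actor_opt.step(quadratic_grads(actor_params, target=1.0))

# Value phase: only the critic's parameters move.
actor_snapshot = list(actor_params)
for _ in range(CRITIC_EPOCHS):
    critic_opt.step(quadratic_grads(critic_params, target=1.0))
```

Because each optimizer holds a reference to only one parameter list, tuning the critic's learning rate or epoch count cannot perturb the policy update, and vice versa.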
Original paper:
Implemented Variants
Variants Implemented | Description |
---|---|
ppo_dna_atari_envpool.py, docs | Uses the blazing fast Envpool Atari vectorized environment. |
Below are our single-file implementations of PPO-DNA:
ppo_dna_atari_envpool.py
ppo_dna_atari_envpool.py has the following features:
- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
- Works with Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
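To make the "convolutional layers and common pre-processing" point concrete, the sketch below works through the output-size arithmetic for one widely used setup. It assumes the frames are resized to 84x84 and passed through a Nature-DQN style stack of three unpadded convolutions (kernel/stride 8/4, 4/2, 3/1, ending in 64 channels). These sizes are assumptions made for the example; the script's actual pre-processing and layers may differ.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed (kernel, stride) pairs of a Nature-DQN style stack; not read from the script.
ASSUMED_LAYERS = [(8, 4), (4, 2), (3, 1)]
ASSUMED_INPUT_SIDE = 84      # assumed side length after resizing the 210x160 frame
ASSUMED_OUT_CHANNELS = 64    # assumed channel count of the last convolution

side = ASSUMED_INPUT_SIDE
sides = [side]
for kernel, stride in ASSUMED_LAYERS:
    side = conv_out(side, kernel, stride)
    sides.append(side)

# Number of features fed to the first fully connected layer.
flat_features = ASSUMED_OUT_CHANNELS * side * side
print(sides, flat_features)
```

Under these assumptions the spatial size shrinks 84, 20, 9, 7, so the flattened feature vector has 64 * 7 * 7 = 3136 entries.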
Warning
Note that ppo_dna_atari_envpool.py does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
Usage
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_dna_atari_envpool.py uses a customized RecordEpisodeStatistics to work with envpool but otherwise shares the implementation details of ppo_atari.py (see related docs).
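Since envpool steps all environments at once and returns batched rewards and done flags, episode statistics have to be tracked per environment index. The sketch below shows one way such bookkeeping could look. EpisodeStats and its update method are hypothetical names invented for this example; this is not the wrapper used in the script, which may handle further details.

```python
class EpisodeStats:
    """Hypothetical per-environment return/length tracker for a batched environment."""

    def __init__(self, num_envs):
        self.returns = [0.0] * num_envs
        self.lengths = [0] * num_envs

    def update(self, rewards, dones):
        """Accumulate one batched step; return (env_index, return, length) for finished episodes."""
        finished = []
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.returns[i] += reward
            self.lengths[i] += 1
            if done:
                finished.append((i, self.returns[i], self.lengths[i]))
                # Reset the counters so the next episode in this slot starts from zero.
                self.returns[i] = 0.0
                self.lengths[i] = 0
        return finished


# Usage with two environments and three batched steps of made-up rewards.
stats = EpisodeStats(num_envs=2)
log = []
log += stats.update([1.0, 0.0], [False, False])
log += stats.update([2.0, 5.0], [True, False])
log += stats.update([0.5, 1.0], [False, True])
```

After these three steps, log holds one entry per finished episode, which is the kind of record that would be written out as an episodic return and length.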
Experiment results
Below are the average episodic returns for ppo_dna_atari_envpool.py compared to ppo_atari_envpool.py.
Environment | ppo_dna_atari_envpool.py | ppo_atari_envpool.py |
---|---|---|
BattleZone-v5 (40M steps) | 94800 ± 18300 | 28800 ± 6800 |
BeamRider-v5 (10M steps) | 5470 ± 850 | 1990 ± 560 |
Breakout-v5 (10M steps) | 321 ± 63 | 352 ± 52 |
DoubleDunk-v5 (40M steps) | -4.9 ± 0.3 | -2.0 ± 0.8 |
NameThisGame-v5 (40M steps) | 8500 ± 2600 | 4400 ± 1200 |
Phoenix-v5 (45M steps) | 184000 ± 58000 | 10200 ± 2700 |
Pong-v5 (3M steps) | 19.5 ± 1.1 | 16.6 ± 2.3 |
Qbert-v5 (45M steps) | 12600 ± 4600 | 10800 ± 3300 |
Tennis-v5 (10M steps) | 13.0 ± 2.3 | -12.4 ± 2.9 |
Learning curves:
Tracked experiments: