Procgen Benchmark RL Generalization

Generalization is a key problem in deep reinforcement learning (RL). Although agents tuned to a particular level or task can often outperform their human counterparts, they frequently struggle to replicate the same level of maneuverability and flexibility when placed in a different environment. Thus, one of the hurdles that must be overcome in training an agent is overfitting to a specific task or training set. Agents that become overly sensitive to signals specific to a particular environment tend to suffer significant drops in performance in even slightly different environments, creating a perceived tradeoff curve between performance and generalizability. Pushing this curve further, enhancing an agent's ability to adapt without sacrificing its performance, is therefore a key challenge towards achieving human-level artificial intelligence.

One of the benefits of Procgen is that it can generate distinct environments (levels) for testing and training. This allows us to measure the generalizability of our agent by training on one set of levels and seeing how the agent performs on a different set of levels. We see that our agent does indeed seem to generalize across different environments, suggesting that it is learning meaningful skills as opposed to memorizing trajectories. Ultimately, RL experiments should keep generalizability in mind, and new methods for improving the generalizability of agents are invaluable.

Clearly, the random agent performs erratically, disregarding the objective of the game (to gather fruit items, avoid non-fruit items, and reach the end of the level) and hitting a terminal state prematurely. Taking a closer look at the loop above, we can see that the Gym environment has provided us with a wealth of information about the game that our agent is, at this point, programmed to ignore.

In Gym, the `env.step()` method is intended to be our main way of interacting with and extracting information from the environment. In particular, `env.step` takes in a ‘valid’ action (we will get to that shortly), uses that action to update the environment, and returns a 4-tuple consisting of the resultant state, rewards gained by the action, an indicator of whether the resultant state is terminal (ends the game), and an open variable to process any other information (often unused).
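As a rough illustration, here is a minimal random-agent loop of the kind referenced above, written against Procgen's Gym registration. This is a sketch for exposition, not our exact code:

```python
import gym

# Create a single Fruitbot environment through Procgen's Gym registration.
env = gym.make("procgen:procgen-fruitbot-v0", distribution_mode="easy")

obs = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()           # a 'valid' random action
    obs, reward, done, info = env.step(action)   # state, reward, terminal flag, extra info
    total_reward += reward

print("episode reward:", total_reward)
```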

Game states: Game states returned by the `env.reset` and `env.step` methods come in the form of integer NumPy arrays of shape (64, 64, 3), with values ranging from 0 to 255. We can render this state array as an image. We provide a side-by-side of the rendered game state and the full game image below:
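For instance, the raw observation can be rendered directly with matplotlib. This is a quick sketch; our actual side-by-side figure came from our own visualization tooling:

```python
import gym
import matplotlib.pyplot as plt

env = gym.make("procgen:procgen-fruitbot-v0", distribution_mode="easy")
obs = env.reset()            # uint8 NumPy array of shape (64, 64, 3), values 0-255

print(obs.shape, obs.dtype)  # (64, 64, 3) uint8
plt.imshow(obs)              # render the state array as an image
plt.axis("off")
plt.show()
```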

From our exploration with the states, we made several deductions pertaining to the task at hand for our eventual agent:

Reward space: We discovered the following reward function:

Because there is no passive reward for actions that extend the game, the reward function is sparse, and sufficient exploration is crucial to encouraging an agent to both seek out fruit and finish the level.

The training process for our models was as follows: we had the agent step through a vectorized environment for a set number of steps to form each batch. Whenever an individual environment reached a terminal state, it would automatically reset and output its ending reward and duration (game lifetime). To evaluate model performance, we tracked both the reward and the duration of the agent within the parallel environments. In the plots below, mean episode reward ('eprewmean') and mean episode length ('eplenmean') are averages over the past 100 episodes. At test time, we loaded model checkpoints saved along the training process and removed the model's ability to update, simply running the agent in the Fruitbot environment to collect information about reward and episode length. We also used Procgen's built-in functionality for specifying level seeds to make sure that the levels at test time were fully distinct from the levels the agent saw during training, allowing us to test how well the agent generalizes to unseen levels.
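Below is a sketch of how this train/test split can be set up with Procgen's level-seed options. The 50-level training set matches what we used later; the test-time `start_level` offset and the environment count are illustrative choices for the example:

```python
from procgen import ProcgenEnv

# Training levels: 50 distinct level seeds starting at 0.
train_venv = ProcgenEnv(num_envs=32, env_name="fruitbot",
                        num_levels=50, start_level=0,
                        distribution_mode="easy")

# Test levels: a disjoint block of seeds beyond the training range
# (offset chosen for illustration), so no level is shared with training.
test_venv = ProcgenEnv(num_envs=32, env_name="fruitbot",
                       num_levels=50, start_level=1000,
                       distribution_mode="easy")
```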

Note: in plots where the x-axis is marked as "Training Iterations", this refers to the total timesteps taken across all environments during training. For testing plots, the x-axis is the number of timesteps run at test time.

Conclusions

As our findings clearly show, PPO2 was the most promising baseline model that we tested. Although we did not extensively experiment with hyperparameters in this phase of exploration, we expected that tuned models would likely retain a similar relative ranking in terms of training and generalization. Therefore, in trying to deepen our exploration, we decided to take the PPO2 model and modify it further to improve its generalization. Based on our readings, PPO seemed to be the easiest algorithm to get results from without extensive hyperparameter tuning, and we hoped this would lead to more consistent results for quick generalization within our timeframe. We therefore chose the OpenAI implementation of PPO as our baseline model.

Here we highlight some key takeaways from the training of our agent.

We altered the architecture in an attempt to achieve better generalization over our limited training steps and levels. We tried convolutional-sequence depths of [16, 32, 64], [32, 32] (only 2 convolutional sequences), and [16, 64, 64]. Generally, larger networks seemed to help: they train very quickly thanks to the additional model capacity, but then stop improving early on. We decided to go with the largest net and add regularization to hedge against overfitting.

It's clear the regularized net had better testing performance than the baseline after training for 50M timesteps, so we proceeded with the regularized net for our final model. We believe it does indeed help counter overfitting during training.

Below is a graphical representation of our final convolutional network architecture:

Note that we also replaced the ReLU layers of the original Impala Net architecture with Leaky ReLU layers, in an effort to avoid issues with vanishing gradients and stalled model updates at a given time step. As noted before, this change seemed to lead to faster reward increases during training, as well as higher rewards during testing.
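Our real network lives inside the OpenAI baselines (TensorFlow) codebase; the PyTorch sketch below is only meant to illustrate the shape of the final architecture: an IMPALA-style trunk with conv-sequence depths [16, 64, 64], batch normalization, dropout, and Leaky ReLU. Everything beyond those choices (hidden size, dropout rate, and so on) is an assumption made for the example.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """IMPALA-style residual block, with Leaky ReLU instead of ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        out = self.conv0(self.act(x))
        out = self.conv1(self.act(out))
        return out + x

class ConvSequence(nn.Module):
    """Conv + batch norm + max-pool followed by two residual blocks."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.pool(self.bn(self.conv(x)))
        return self.res1(self.res0(x))

class RegularizedImpalaTrunk(nn.Module):
    """Policy trunk with depths [16, 64, 64], dropout, and Leaky ReLU (illustrative)."""
    def __init__(self, depths=(16, 64, 64), hidden=256, dropout=0.1):
        super().__init__()
        seqs, in_ch = [], 3
        for depth in depths:
            seqs.append(ConvSequence(in_ch, depth))
            in_ch = depth
        self.convs = nn.Sequential(*seqs)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(in_ch * 8 * 8, hidden),  # 64x64 input -> 8x8 after three stride-2 pools
            nn.LeakyReLU(),
        )

    def forward(self, obs):
        # obs: float tensor of shape (batch, 3, 64, 64), scaled to [0, 1]
        return self.head(self.convs(obs))
```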

The pseudocode:

The way BAIR implemented EPOpt for PPO and Gym environments was, at each batch, to have the agent take n trajectories in an environment (controlled by the hyperparameter "paths") and then keep only the trajectories whose reward falls below a given percentile of rewards (controlled by the hyperparameter "epsilon"). Unfortunately, their implementation was built to take only one environment at a time. We decided to rewrite the implementation to take a set of vectorized environments (a rather tedious process), but in order to preserve dimensionality, we could only take one trajectory from each environment. Therefore, we settled on taking the minimum-reward trajectory at each batch instead. We believe this method preserves the adversarial and transfer benefits of the original EPOpt algorithm.

Initially, we used only 1 normal update for every 4 EPOpt updates when training our model, but ultimately found that at test time the model had completely unlearned everything by 50M timesteps. Therefore, for our final run, we used only 1 EPOpt update per every 9 normal updates, in an effort to only slightly shift the sample space of the episodes. Additionally, we had the EPOpt step sample the worst of 5 trajectories (we would have tried more, but unfortunately ran into memory constraints).
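A minimal sketch of this modified sampling step (the function names and array shapes here are hypothetical, not our actual implementation): on an EPOpt update, only the transitions from the worst-performing parallel environment are kept for the policy update, and such updates occur once every 10 batches.

```python
import numpy as np

def select_worst_trajectory(obs, actions, returns, episode_rewards):
    """Keep only the rollout from the lowest-reward environment (sketch).

    obs, actions, returns: arrays of shape (n_steps, n_envs, ...), as produced
        by a vectorized rollout.
    episode_rewards: array of shape (n_envs,) with each environment's total
        reward over the rollout.
    """
    worst = int(np.argmin(episode_rewards))
    return obs[:, worst], actions[:, worst], returns[:, worst]

def is_epopt_update(update_idx):
    # Final schedule: 1 EPOpt update per every 9 normal updates.
    return update_idx % 10 == 0
```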

The baseline PPO2 model comes with a variety of hyperparameters. We left the majority as default due to time constraints, but decided to explore the number of environments and learning rate as 2 hyperparameters with potentially large effects.

In a final comparison of our 2 best-performing hyperparameter regimes ({lr: 1e-3, num_envs: 64} and {lr: 5e-4, num_envs: 32}), we saw very little difference in test performance, and ultimately chose {lr: 5e-4, num_envs: 32} as the final hyperparameter scheme in consideration of wall-clock speed, since having 32 environments reduces the matrix computations in the batched implementation.
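For reference, here is a hedged sketch of how the chosen regime plugs into the baselines PPO2 training call, using the vectorized Procgen environment and baselines' vec-env wrappers. `network="cnn"` is a stand-in for our modified IMPALA-style policy net, and any hyperparameters not shown were left at their defaults:

```python
from procgen import ProcgenEnv
from baselines.common.vec_env import VecExtractDictObs, VecMonitor
from baselines.ppo2 import ppo2

venv = ProcgenEnv(num_envs=32, env_name="fruitbot",
                  num_levels=50, start_level=0, distribution_mode="easy")
venv = VecExtractDictObs(venv, "rgb")  # ProcgenEnv returns dict observations
venv = VecMonitor(venv)                # tracks episode rewards/lengths (eprewmean, eplenmean)

model = ppo2.learn(
    network="cnn",             # stand-in for the modified IMPALA-style net
    env=venv,
    total_timesteps=50_000_000,
    lr=5e-4,                   # chosen learning rate
)
```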

In our investigation, we tracked several different models and model variations. We summarize our findings here:

The best model was PPO with depths [16, 64, 64], along with batch normalization and dropout. We used the EPOpt policy-ensemble method to improve results and trained the model with our best-found learning rate of 0.0005 and 32 environments.

We hoped that this model would produce more regularized test results overall, and that the agent would continue to train just as quickly on its own set of levels while generalizing better to a far larger set of levels.

In our final model runs, we decided to analyze 3 separate models: one trained with the EPOpt algorithm, the regularized policy network, and updated hyperparameters on 50 levels; one trained without EPOpt but with all other updates on 50 levels; and one trained without EPOpt but with all other updates on 100 levels. Below are our results from checkpoints saved every 5M timesteps during the course of training, compared against checkpoints saved from the baseline model during the course of training (every 10M). We notice an early boost in test performance in both variations of our final model compared to the baseline. It seems, therefore, that the final model is capable of producing better test results with fewer training iterations, at least initially. As the number of training iterations increases, all models seem to converge to similarly high levels of test performance, with considerable variance in the results. We believe the variance is due to the fact that the levels were randomized during testing, so given more time we would generate multiple test runs and average the results.

In the following plot, it's difficult to tell whether the addition of EPOpt was useful for generalization in this case. Perhaps using only one EPOpt sample for every 9 normal samples from the environment is not adversarial enough. Additionally, sampling a larger number of trajectories to take the minimum of could increase the adversarial aspect of the method. These hyperparameters offer a route for further exploration.

We also include our extra run with our regularized policy network trained on 100 levels compared to the one trained on 50 levels, to demonstrate that no matter the amount of regularization added to a model, it still generalizes better given a larger number of levels to work with.

One reading of these results is that, given enough time on the same levels, a good agent model (such as the baseline) will eventually learn the strategies needed to generalize well to more levels. With additional regularization features such as EPOpt and a regularized policy network, however, an agent learns useful strategies that generalize well to more levels much earlier on, which is supported by our far higher early test performance.

Generalization is still a central goal of reinforcement learning algorithms. Models can generally perform very well on their training data, but it is difficult for them to perform well on unseen data, in our case different game levels. Our final model learned quickly on limited training levels, and we think that performance and generalization would have improved with more training and further investigation of the hyperparameters specific to our new methods. However, the model's ability to generalize admirably from just a handful of training levels shows its potential, despite the limited scope of investigation in this specific project.

The OpenAI Procgen Benchmark consists of 16 games with various levels of difficulty. For the purposes of our project, we zeroed in on a single game, Fruitbot, at the easiest difficulty. This narrow scope limits the number of variables at play, letting us more clearly see the effect of our algorithms on performance. Our analysis seems to indicate that the addition of regularization methods, such as regularizing layers in policy networks and training-sample-space shaping with the EPOpt algorithm, can have a large effect on the generalizing abilities of models trained for a small number of iterations. However, as is the theme of this project, it is not yet clear whether our method will generalize well to improving performance on other games, and further testing is required across levels of different games.

As reinforcement learning continues to grow and be further researched, it has become clear that innumerable variables affect model performance in ways that are not fully understood. These variables range from the relatively straightforward, such as the learning rate, to the difficult to predict, such as random seeds. A successful model must be able to account for a wide range of these values and generalize to new scenarios quickly, despite initial variability. The method presented here offers one promising avenue for improving model performance and generalization, but further research is needed to thoroughly gauge its effect on a wider range of games and agents.

Ana Tudor. Ana is a 3rd year EECS student with research interests in applied ML and data/information sciences. Her background is in quantum computing, optimization algorithms, and applied data science research.

Contributions. Ana did model ideation, development, testing, and analysis. She specifically contributed to developing and training the ACER agent for the Fruitbot environment, designing and testing the convolutional policy network architecture, finalizing the agent model and training process, and reporting results. Responsible for about 30% of the project.

Richard Zhang. Richard is a 4th year CS and Stats student with research interests mainly in computational biology and bioinformatics. His current research is in the reconstruction of CRISPR-mediated single-cell phylogenies, working with the Yosef Lab at Berkeley.

Contributions. Richard did model ideation, development, testing, and analysis. His contributions included the ideation of generalization techniques in network architecture and EPOpt. He also helped implement EPOpt and make it compatible with multiple environments, and created a training process for the A2C algorithm. In addition, he generated plots, performed model testing (including creating the testing pipeline), explored hyperparameters, and helped with design decision-making. Responsible for about 29% of the project.

William Wu. William is a 4th year CS and Stats student with an academic background in theoretical statistics and statistical learning and a professional background in software engineering. He is interested in the efficient production and deployment of machine learning applications.

Contributions. William worked on model testing, visualization, and analysis. He built a framework to test and visualize models independently of their training processes, and is responsible for many of the images and gifs shown in the report. He also worked on exploring and understanding the Gym API, and in particular how to access and interact with the environment. Responsible for about 18% of the project.

Quentin Delepine. Quentin is a 3rd year EECS student with a background in mechanical design and software development. His interests include robotics and automation.

Contributions. Quentin worked on the initial environment setup, experimenting with the Procgen environments and understanding the codebase. In addition to helping with model implementation and documentation, he created flexible tools for visualizing training and testing data that were leveraged for decision making and analysis. Responsible for about 23% of the project.
