Q-learning in Python
Q-Learning is quite simple, and we aim to implement it to learn to play the 421 game.
Implement a new PlayerQ
At initialization, Q is created empty along with the other required variables.
At each perception step, the player updates its Q-values and chooses a new action to perform, according to the Q-Learning algorithm.
Hints for implementing Q
A simple way to implement Q in Python is as a dictionary of dictionaries.
Initializing an empty Q, typically in the constructor method __init__ (together with the other Q-Learning attributes), will look like:
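A minimal sketch of such a constructor (the attribute names lastState and lastAction are assumptions of this sketch; only qvalues itself comes from the exercise):

```python
class PlayerQ:
    def __init__(self):
        # Q as a dictionary of dictionaries: qvalues[stateStr][actionStr] -> float
        self.qvalues = {}
        # Memory of the last transition (names are an assumption),
        # needed later by the Q-Learning update.
        self.lastState = None
        self.lastAction = None
```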
Initializing the action values for a given state works the same way.
A new state must be added to qvalues whenever necessary, in both the wakeUp and the perceive methods.
To check the growing dictionary, you can simply pass it to the Python print function.
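One possible sketch of that step, written here as a plain function (the state and action strings are invented for the example; in the real player they come from the game):

```python
def addState(qvalues, stateStr, actionStrs):
    # Add a zero-initialized value for every possible action,
    # but only if the state has never been visited before.
    if stateStr not in qvalues:
        qvalues[stateStr] = {a: 0.0 for a in actionStrs}

qvalues = {}
addState(qvalues, "roll-1:4-2-1", ["keep", "reroll"])  # hypothetical state/actions
print(qvalues)  # inspect the growing dictionary
```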
Finally, modifying a value in qvalues will look like:
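For instance (state and action strings invented for illustration):

```python
qvalues = {"421": {"keep": 0.0}}
qvalues["421"]["keep"] = 0.8   # plain nested-dictionary assignment
print(qvalues["421"]["keep"])  # 0.8
```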
To summarize
First, you have to extend the Q dictionary with a new entry each time a new state is visited.
Then you can implement the update of the Q-value for the last visited state (Q[stateStr][actionStr]).
Note that updateQ will require another method that returns the maximal value in Q for a given state.
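A sketch of those two pieces, written as plain functions (the reward argument and the parameter names alpha and gamma are assumptions about the surrounding player code):

```python
def maxQ(qvalues, stateStr):
    # Maximal Q-value reachable from the given state.
    return max(qvalues[stateStr].values())

def updateQ(qvalues, lastState, lastAction, reward, newState, alpha=0.1, gamma=0.9):
    # Q-Learning update rule:
    # Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
    old = qvalues[lastState][lastAction]
    qvalues[lastState][lastAction] = old + alpha * (reward + gamma * maxQ(qvalues, newState) - old)
```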
Now the action method can randomly choose between an exploration action and an exploitation action.
Note that action will require another method that returns the action with the maximal value in Q for a given state.
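A possible sketch of the epsilon-greedy selection (the helper name bestAction is a choice of this sketch):

```python
import random

def bestAction(qvalues, stateStr):
    # Action with the maximal Q-value for the given state.
    return max(qvalues[stateStr], key=qvalues[stateStr].get)

def chooseAction(qvalues, stateStr, epsilon=0.1):
    # Exploration with probability epsilon, exploitation otherwise.
    if random.random() < epsilon:
        return random.choice(list(qvalues[stateStr]))
    return bestAction(qvalues, stateStr)
```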
A proper PlayerQ class lets users customize the algorithm's parameters $\epsilon$, $\gamma$, ... Let's do it in the
__init__
method (with default parameter values). See w3schools on handling default parameter values in Python.
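A sketch of such a constructor (alpha, the learning rate, is an extra parameter assumed here alongside those named in the text):

```python
class PlayerQ:
    def __init__(self, epsilon=0.1, gamma=0.9, alpha=0.1):
        self.epsilon = epsilon  # exploration rate
        self.gamma = gamma      # discount factor
        self.alpha = alpha      # learning rate (an assumed extra parameter)
        self.qvalues = {}

defaultPlayer = PlayerQ()            # uses the default values
greedyPlayer = PlayerQ(epsilon=0.0)  # pure exploitation
```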
You must test your code at each development step, by executing it over a few games and validating that the output is as expected (a good Python tool for testing: pytest).
You can now try to answer the question: how many episodes are required to learn a good-enough policy?
Going further:
PlayerQ saves its learned Q-values to a file.
PlayerQ initializes its Q-values by loading a file.
A new PlayerBestQ always plays the best action from a given Q-values dictionary (without updating Q).
Plot the sum over Q, with one point per episode (with pyplot, for instance).
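These extensions might be sketched as follows (JSON serialization and all function names are choices of this sketch, not requirements of the exercise):

```python
import json

def saveQ(qvalues, fileName):
    # Persist the learned Q-values; JSON maps directly onto nested dicts.
    with open(fileName, "w") as f:
        json.dump(qvalues, f)

def loadQ(fileName):
    # Restore Q-values previously saved by saveQ.
    with open(fileName) as f:
        return json.load(f)

def bestAction(qvalues, stateStr):
    # PlayerBestQ policy: always the action with the maximal Q-value.
    return max(qvalues[stateStr], key=qvalues[stateStr].get)

def sumQ(qvalues):
    # Sum of all Q-values: one scalar per episode for the learning curve.
    return sum(v for actions in qvalues.values() for v in actions.values())
```

Collecting sumQ(player.qvalues) after each episode and passing the resulting list to matplotlib.pyplot.plot gives the learning curve with one point per episode.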