Learning performance

The agent did not solve the environment.
Its best 100-episode average reward was -793.86 ± 13.36.
(Taxi-v1 is considered "solved" when the agent obtains an average reward
of at least 9.7 over 100 consecutive episodes.)
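
The solved criterion above can be checked with a short script. The following is a minimal sketch, not the evaluation code that produced the numbers here. It assumes the per-episode rewards are available as a plain Python list, and it assumes the ± figure is the standard error of the mean over the best window, which the write-up does not state. The names best_window_average and is_solved are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

SOLVED_THRESHOLD = 9.7  # Taxi-v1 threshold quoted in the text
WINDOW = 100            # number of consecutive episodes averaged


def best_window_average(rewards, window=WINDOW):
    """Return (best_mean, std_error) over all runs of `window` consecutive episodes.

    Assumption: the spread is reported as the standard error of the mean.
    """
    if len(rewards) < window:
        raise ValueError("need at least `window` episode rewards")
    best_mean, best_err = None, None
    for start in range(len(rewards) - window + 1):
        chunk = rewards[start:start + window]
        m = mean(chunk)
        if best_mean is None or m > best_mean:
            best_mean = m
            # stdev needs at least two samples; a single-episode window has no spread
            best_err = stdev(chunk) / sqrt(window) if window > 1 else 0.0
    return best_mean, best_err


def is_solved(rewards, threshold=SOLVED_THRESHOLD, window=WINDOW):
    """True if any window of consecutive episodes averages at least `threshold`."""
    best_mean, _ = best_window_average(rewards, window)
    return best_mean >= threshold
```

For the run reported here, the best window averages -793.86, far below 9.7, so is_solved would return False.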