After some thought about the Poisson distribution implementation and the unfair testing I was performing, I think I might have to return to the Normal distribution. While the Poisson simplifies things by using only one parameter, it is insufficient for choosing exact slot values.
I have been unfairly testing my agent by fixing values in place, making the learning process different from the testing process (turning the slot selection from a randomised voting process into a decision list and temporarily ignoring the low-probability rules). While this isn’t too big a deal, the idea of an online learning agent is that it will have a result at any point – not only after being told “it’s testing time, so gimme your best policy”. The reason the Poisson distribution is ill-suited for this is that its spread cannot be controlled independently of its mean: the variance of a Poisson equals its rate. For example, if a particular slot was to be used with selection probability 1, the Poisson would morph this into: the slot will be used 0 times with probability ~0.37, once per policy with probability ~0.37, twice with probability ~0.18… A Normal distribution with a given SD, however, can shape the spread of its distribution, so a slot with selection probability 1 and SD near 0 will be used exactly once with near-100% probability.
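The contrast can be sketched numerically (a minimal illustration, not the agent's actual code – the rounding-to-a-count step is my own assumption about how a Normal sample would map to a usage count):

```python
import math
import random

def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Poisson with rate 1: the mass is spread over several usage counts
# even though the intended "selection probability" is 1.
print([round(poisson_pmf(k, 1.0), 3) for k in range(4)])
# → [0.368, 0.368, 0.184, 0.061]

# Normal with mean 1 and SD near 0: rounding a sample to the nearest
# non-negative integer gives a usage count of exactly 1 almost surely.
random.seed(0)
samples = [max(0, round(random.gauss(1.0, 0.01))) for _ in range(10_000)]
print(sum(s == 1 for s in samples) / len(samples))
# → 1.0
```

With the SD pinned near 0, the Normal collapses onto its mean, which is exactly the "use this slot once, deterministically" behaviour the Poisson cannot express.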
Also, the results suggest that the Poisson version learns more slowly than the Normal one, but this could be due to other implementation changes made alongside it.