On the theory of reinforcement learning with once-per-episode feedback