A key finding of our work is empirical support for the Attempting Control Goal (ACG) hypothesis: data collected while actively trying to perform a control task is more informative for learning than randomly collected data.
Our experiments support a "weak" version of this hypothesis: even without a perfect policy, the iterative loop of computing the best controller available from the current data and then using that controller to attempt the task yields data that is highly valuable for learning. On the SSP, 20 seconds of online data gathered while attempting to balance the puck improved the controller more than several minutes of manually collected demonstration data.
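To make the loop concrete, the following is a minimal sketch in which a linear least-squares model and an LQR-style controller stand in for the specific model class and controller synthesis used in our experiments; all function names and parameters (fit_model, synthesize_controller, run_episode, n_rounds) are illustrative placeholders rather than our actual implementation.

```python
import numpy as np

def fit_model(data):
    """Least-squares fit of linear dynamics x' = A x + B u from (x, u, x_next) tuples."""
    X  = np.array([x  for x, u, xn in data])
    U  = np.array([u  for x, u, xn in data])
    Xn = np.array([xn for x, u, xn in data])
    Z = np.hstack([X, U])                               # regressor matrix [x u]
    theta, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    n = X.shape[1]
    return theta[:n].T, theta[n:].T                     # A, B

def synthesize_controller(A, B, Q, R, iters=200):
    """Discrete-time LQR gain via an iterated Riccati recursion (a stand-in for
    'the best controller available from the current data')."""
    P = Q
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

def attempt_control_loop(run_episode, x_dim, u_dim, seed_data, n_rounds=5):
    """Alternate between fitting a model from all data so far, computing the best
    controller for that model, and attempting the task to gather more data.
    seed_data is a handful of initial transitions so the first fit is well-posed."""
    data = list(seed_data)
    Q, R = np.eye(x_dim), 0.1 * np.eye(u_dim)
    for _ in range(n_rounds):
        A, B = fit_model(data)
        K = synthesize_controller(A, B, Q, R)
        # run_episode executes u = -K x on the real system and returns the new
        # (x, u, x_next) transitions collected while attempting the task
        data.extend(run_episode(lambda x: -K @ x))
    return K, data
```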
Our formal convergence analysis helps explain this phenomenon. For the learned model to converge to the correct one, the data must satisfy ergodicity-like conditions: it must sufficiently explore the relevant parts of the system's state space. Data collected while pursuing a control goal is more likely to satisfy these conditions than randomly collected data, which accounts for the faster learning we observe. This suggests that data collection strategies explicitly guided by these theoretical requirements could make learning more efficient still.
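As a rough illustration of the kind of diagnostic these conditions suggest (a common proxy, not the formal ergodicity condition from our analysis), one can check whether the collected (state, input) samples actually span the regressor space used by the model fit above, for example via the rank and condition number of the data matrix:

```python
import numpy as np

def excitation_report(data):
    """Report how well the data [x u] excites the system: full column rank and a
    moderate condition number suggest the least-squares fit is well-posed, while a
    very large condition number flags directions the data has barely explored."""
    Z = np.array([np.concatenate([x, u]) for x, u, _ in data])
    svals = np.linalg.svd(Z, compute_uv=False)
    return {
        "rank": int(np.sum(svals > 1e-8 * svals[0])),
        "dim": Z.shape[1],
        "condition_number": float(svals[0] / max(svals[-1], 1e-300)),
    }
```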