The thing about continuous-space solutions is that they are typically differentiable, which means you can use gradient descent or Levenberg-Marquardt optimization rather than needing to fully explore the solution space. Typically there are large regions which are heuristically excludable, which I think is what you are getting at, but even unbiased sampling plus gradient descent often makes these problems much more tractable than discrete ones.
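A minimal sketch of what I mean, on a toy quadratic landscape (everything here is invented for illustration):

```python
import numpy as np

# A toy differentiable landscape, invented for illustration:
# a smooth bowl with its minimum at (3, -2).
def grad(p):
    return np.array([2.0 * (p[0] - 3.0), 2.0 * (p[1] + 2.0)])

# Unbiased random start, then plain gradient descent: the local slope
# points the way, so we never enumerate the continuous space.
p = np.random.uniform(-10.0, 10.0, size=2)
for _ in range(200):
    p = p - 0.1 * grad(p)

print(p)  # lands near (3, -2) from almost any starting point
```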
The type of learning problem where I agree with your point is something like learning to classify handwritten digits. My point about the continuous nature being unsearchable in practice is about recursive forms: if I choose this policy, my opponent will react to the fact that I chose that policy.
In your learning problem, where things were made tractable by differentiation, you have something like an elevation map that you are following, but in the multi-stage decision problem you have something more like a fractal elevation map. When you want to know the value of a particular point on the elevation map, you have to look for the highest or lowest point on the elevation map you get by zooming in on the area that results from your having chosen that particular policy.
The problem is that since this is a multi-agent environment, the other agents can react to your policy choice. They can, for example, choose to give you a high value only if you enter the correct password on a form. That elevation map is designed to be a plain everywhere, with another fractal zoom corresponding to high utility or a low error term only at the point where you enter the right password.
Choose a random point and you won't have any information about what the password was. The optimization process won't help you, so you have to search. One way to do that is random search; do that and you eventually find a differing elevation, assuming one exists. But what if there were two passwords? One takes you to a low-elevation fractal world corresponding to a low reward, because it is a honeypot. The other takes you to the fractal zoom where the elevation map is conditioned on your having root access to the system.
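To make the flat-plain point concrete, a toy sketch (the secret and the payoff function are invented for illustration):

```python
import numpy as np

SECRET = 0.731294  # a hypothetical password, encoded as a point on [0, 1]

# Utility is a plain: flat everywhere except at the single secret point.
def payoff(x):
    return 1.0 if abs(x - SECRET) < 1e-12 else 0.0

# The gradient is zero almost everywhere, so gradient methods get no signal.
x = np.random.uniform(0.0, 1.0)
eps = 1e-6
numeric_grad = (payoff(x + eps) - payoff(x - eps)) / (2 * eps)
print(numeric_grad)  # 0.0 with probability 1: nothing to slide down

# All that's left is blind search over a continuum - and a honeypot could
# plant a second, poisoned spike that the search cannot distinguish.
```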
This argument shows that we would actually need to search over every point to get the best possible answer. Yet to do that we have to search over the entire continuous distribution for our policy. Since by definition there are infinitely many states, even a computer with infinite search speed can't enumerate them: under every policy choice there is another infinite fractal that also needs full enumeration. We get non-termination, by a diagonalization-style argument, even for a computer with infinite speed.
Now observe that in our reality, passwords exist. Less extreme: notice that reacting to a policy choice in general, for example moving out of the way of a car that drives toward you but not changing how you would walk if it doesn't, isn't actually an unusual property of decision problems. It is normal.
I get what you're saying about recursive adversarial problems and their fractal nature, but this is exactly what GANs do, with great success, despite the fact that it's hard. Yes, they have to train a lot slower, but learning general strategies and patterns in opponent behaviour still works.
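Not a GAN, but the same alternating-gradient idea in miniature, on an assumed toy minimax objective f(x, y) = x*y:

```python
# A toy of adversarial training: minimize over x, maximize over y of f(x, y) = x * y.
# The equilibrium is the saddle at (0, 0); GANs run this same alternating-gradient
# pattern on learned, high-dimensional objectives.
x, y = 1.0, 1.0
lr = 0.05
for _ in range(2000):
    x = x - lr * y        # minimizer descends: df/dx = y
    y = y + lr * x        # maximizer ascends:  df/dy = x
print(x, y)  # the pair orbits the saddle instead of settling onto it -
             # one reason adversarial training is slow and finicky
```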
Your password example, on the other hand, is a discrete, non-differentiable example. If it were differentiable - for example, if instead of a true/false you got an edit distance to the real password - then passwords would be trivial to crack.
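A sketch of why the graded signal breaks it (the distance oracle and the secret are hypothetical):

```python
import string

def edit_distance(a, b):
    # Standard Levenshtein dynamic program, kept to one row.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ca != cb))
    return d[len(b)]

SECRET = "hunter2"  # hypothetical; the attacker only ever sees distances

# Greedy hill-climb: keep any single-character change that lowers the
# reported distance. The graded signal turns an exponential search into
# a roughly (length x alphabet)-sized one.
guess = "a" * len(SECRET)
best = edit_distance(guess, SECRET)
for i in range(len(guess)):
    for c in string.ascii_lowercase + string.digits:
        trial = guess[:i] + c + guess[i + 1:]
        d = edit_distance(trial, SECRET)
        if d < best:
            guess, best = trial, d

print(guess)  # "hunter2", recovered without ever enumerating the space
```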
I am talking about decision problems; you are talking about learning problems. These are different. Skip past the idea that you need to learn something - you've finished doing so.
What happens once we learn an approximation of that landscape - a map that has error and doesn't correspond fully with the territory?
The cognitive-bias framing calls the map biased, but generalizing from that to a more global sense of irrationality is an error in reasoning. In a more particular situation you have a simpler game tree, because it is just the subtree under the current node. The lifting of constraints produces the ability to have further insight - the map has to be an approximation.
Don't reach for edit distance; make the boolean a Maybe Boolean which needs further resolution. See that the approximation is demanded because the world isn't set up to allow all things to be learnable. My honeypot example is simpler than reality - there exist passwords for which guessing at the password but hitting the honeypot resolves to the learner being jailed, and the learner in the actual game generally wouldn't even get infinite guesses. But I made the problem simpler to expose its complexity in terms learning theory is more familiar with: the elevation maps of the error landscape that learners like to slide down.
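Roughly what I mean by a Maybe Boolean, sketched with Python's Optional (all names and values here are mine, invented for illustration):

```python
from typing import Optional

# Hypothetical secrets, standing in for values the learner cannot see.
REAL_PASSWORD = "correct horse"
HONEYPOT_PASSWORD = "battery staple"

def try_password(guess: str) -> Optional[bool]:
    """A Maybe Boolean oracle: True/False when the world answers now, None
    when the answer only resolves later in the territory (the honeypot
    that eventually lands the guesser in jail)."""
    if guess == REAL_PASSWORD:
        return True
    if guess == HONEYPOT_PASSWORD:
        return None   # at guess time this looks no different from any failure
    return False

# There is no edit distance to slide down here, and one branch doesn't even
# resolve to a boolean until the world reacts.
print(try_password("battery staple"))  # None
```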
Decision problems are a subset of learning problems. As soon as someone can simulate your environment there is no negative consequence to further exploring the solution space via differentiable evaluation methods which allow efficiently training an optimal player.
Your intuitions are steering you wrong. Think about this from first principles in light of some of the corrections I'm going to provide:
> which allow efficiently training an optimal player.
Training an optimal player is not possible in practice. We have known the mathematics of optimal play for decades, and because we know it, we can calculate the amount of memory such a solution would take up. Again, this is a studied thing. Here are Russell and Norvig in Artificial Intelligence: A Modern Approach (p. 173) telling you the same thing: "Because calculating optimal decisions in complex games is intractable, all algorithms must make some assumptions and approximations."
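The back-of-envelope version, using the standard estimates (Shannon's ~10^43 legal chess positions; the global storage figure is a rough assumption):

```python
# Rough arithmetic on the memory claim. Shannon's classic estimate puts the
# chess state space around 10^43 positions (the game tree is far larger, ~10^120).
positions = 10 ** 43
bytes_per_entry = 1                   # absurdly generous: one byte per solved position

# All storage on Earth is on the order of a few hundred zettabytes (rough assumption).
earth_storage_bytes = 2 * 10 ** 23

print(positions * bytes_per_entry / earth_storage_bytes)  # ~5e19
# You would need tens of quintillions of Earths' worth of storage for the table.
```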
> Decision problems are a subset of learning problems.
This framing has some benefits - it makes generalization simpler. It has some downsides too - in complicated environments it will only approximate the solution, and because of that there will be times when it gets things wrong.
In theory you at first have an intractable problem, at initial training time. Then, once the game begins and play has progressed, you have a more tractable problem, because the information available to you eliminates parts of the game tree from consideration. The result is that we actually have two learning problems, not one. One is computed prior to the game. The other is computed during the game.
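A toy sketch of the two problems, using Nim with a random stub standing in for the offline-learned heuristic (all of this is illustrative, not any lab's actual method):

```python
import random

# Toy game: 21 stones, players alternate taking 1-3, taking the last stone wins.
# "Before the game" problem: an offline-learned value function. Here a random
# stub stands in for it, purely for illustration.
def heuristic(stones):
    return random.uniform(-1.0, 1.0)

# "During the game" problem: the subtree under the current state is small
# enough to search, with the heuristic consulted only at the depth frontier.
def search(stones, depth):
    if stones == 0:
        return -1.0                # the opponent took the last stone: loss
    if depth == 0:
        return heuristic(stones)   # fall back on the offline map
    return max(-search(stones - take, depth - 1)
               for take in (1, 2, 3) if take <= stones)

# Pick the move whose resulting position is worst for the opponent.
best = max((1, 2, 3), key=lambda take: -search(21 - take, depth=6))
print(best)  # search at play time corrects some of the heuristic's errors
```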
This theoretical issue has been studied and found to hold in practice by DeepMind. They tried training agents that didn't use tree search and relied on the learned heuristic alone. These lost to agents that also used tree search.
Here is a section from a talk by Noam Brown - he briefly covers your intuition and why it breaks down.
You can also see this without reference to theory by looking at physical progress on optimal solutions. Chess, for example, has exact solutions via endgame tablebases, but only for the specific positions you reach near the end of the game tree. It is widely understood that we don't have enough memory to store the full solution to the game.
> As soon as someone can simulate your environment there is no negative consequence
This is a non-physical claim. There is obviously a cost to computation: it consumes both energy and time, and our best understanding is that we have a finite amount of each. Your theoretical approach isn't physically realizable.
> As soon as someone can simulate your environment...
It doesn't become easy at this point. It remains intractable.
A very simple example of why it doesn't get easy is the halting problem from computer science.
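The classic self-reference, sketched with a hypothetical `halts` oracle (assumed, for contradiction, to exist):

```python
# Assume, for contradiction, a perfect simulator: halts(f, x) decides
# whether f(x) would halt. (Hypothetical - this is the thing being disproved.)
def halts(f, x) -> bool:
    ...

def diagonal(f):
    # React to the simulator's prediction by doing the opposite of it.
    if halts(f, f):
        while True:       # predicted to halt, so loop forever
            pass
    return None           # predicted to loop, so halt immediately

# Consider diagonal(diagonal): whatever halts predicts, diagonal does the
# opposite, so no such simulator can exist. An agent that reacts to your
# policy choice is this same structure, one level down.
```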
A more complicated example, which you will have to really think about in order to understand, is the nature of the equilibrium adversarial strategy. It is defined with respect to an oracle - something that can perfectly simulate its strategy. And it is trying not to lose to an oracle; it is assuming you have a very good map.
You've got to remember: your simulation is your map - it isn't the territory. When you play, you aren't playing on your map; you are playing in the territory via your map. The equilibrium strategies already assumed you had a map. So they aren't trying to make it easy for your map to give you the right answer. They are trying to make some places unmappable.
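Matching pennies makes this concrete: the equilibrium strategy is the one that concedes nothing even to an opponent who reads your map perfectly (a standard textbook example, sketched here):

```python
# Matching pennies: you score +1 if the coins match, -1 otherwise. The
# opponent is an oracle: it reads your strategy p (your map), not your flip,
# and plays the best response against it.
def value_against_oracle(p_heads):
    if_opponent_heads = 2 * p_heads - 1
    if_opponent_tails = 1 - 2 * p_heads
    return min(if_opponent_heads, if_opponent_tails)  # oracle picks your worst case

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, value_against_oracle(p))
# Every lopsided strategy gets exploited (value < 0). Only p = 0.5, the
# equilibrium, concedes nothing to a perfect simulator: it doesn't hide
# the map, it makes the map useless.
```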
Again - remember the real world. Do I know your password? Why not? And what is my password, if it is so easy to know it?