Immediately after a client-side page transition, the page is rendered with CSR. Reload the browser to see how it behaves.
Intra-distribution generalization seems like the only rigorously defined kind of generalization we have. Can you provide any references that describe this other kind of generalization? I'd love to learn more.
This is a false dichotomy; functionally, the reality is somewhere in the middle. They "memorize" training data in the sense that the model is fit to those points, but at test time they are asked to interpolate (and extrapolate) to new points. How well they generalize depends on how well interpolation between training points works. If it works reliably, you could say the interpolation is a good approximation of, say, a grammar rule. It's all about the data.
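A toy sketch of that point (not from the thread, and everything here is made up for illustration: the `rule` function, the sampling ranges, and the polynomial degree): fit a flexible model to points drawn from a hypothetical underlying rule, then compare error on test points inside the training range (interpolation) versus outside it (extrapolation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical underlying rule the data follows (assumption for the sketch).
def rule(x):
    return np.sin(x)

# Training points: the model is fit to these.
x_train = np.sort(rng.uniform(-3.0, 3.0, size=40))
y_train = rule(x_train) + rng.normal(scale=0.05, size=x_train.shape)

# Fit a flexible model (degree-9 polynomial) to the training points.
model = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# Interpolation: test points between training points.
x_interp = rng.uniform(-3.0, 3.0, size=200)
# Extrapolation: test points outside the training range.
x_extrap = rng.uniform(4.0, 6.0, size=200)

def mse(x):
    return np.mean((model(x) - rule(x)) ** 2)

print(f"interpolation MSE: {mse(x_interp):.4f}")  # small: interpolating between training points tracks the rule
print(f"extrapolation MSE: {mse(x_extrap):.4f}")  # large: the fit says little outside the training range
```

The point is only that whether the fitted model "generalizes" depends on whether interpolating between training points happens to approximate the underlying rule; outside the data's support there is no such guarantee.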