I hope the following excerpts will provide insight into what my question is about. They are from http://neuralnetworksanddeeplearning.com/chap3.html
"The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it's not useful learning. We say the network is overfitting or overtraining beyond epoch 280."
---We are training a neural network: the cost on the training data keeps dropping until epoch 400, but the classification accuracy on the test data becomes static (barring a few stochastic fluctuations) after epoch 280, so we conclude that the model is overfitting the training data beyond epoch 280.
"We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. This is another sign that our model is overfitting. It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network."
---Whereas previously we compared classification accuracy on the test data against the training cost, we are now comparing the cost on the test data against the training cost.
The book then goes on to explain why epoch 280 is the right point at which overfitting starts. That is what I have an issue with; I can't wrap my head around it.
We are asking the model to minimize the cost, so the cost is the metric it uses as a measure of its own strength to classify correctly. If we take epoch 280 as the point where overfitting starts, haven't we in a way created a biased model: one that is a better classifier on this particular test data, but is nonetheless making decisions with low confidence, and is hence more prone to deviate from the results it shows on that test data?
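To make the low-confidence point concrete, here is a tiny numeric illustration of my own (the probabilities are made up, not from the book) of how test cost can worsen while accuracy stays flat: the network keeps predicting the right class, but with shrinking margin, so the cross-entropy cost rises.

```python
import math

def cross_entropy(p_correct):
    # Cross-entropy cost contributed by one test example, given the
    # probability the network assigns to the true class.
    return -math.log(p_correct)

def is_correct(p_correct, p_max_other):
    # The example counts as correctly classified iff the true class
    # receives the highest probability.
    return p_correct > p_max_other

# Around epoch 15: a confident correct prediction on a test example.
early = 0.9
# Around epoch 280: still correct (the true class just edges out the
# runner-up at 0.35), but far less confident.
late = 0.4

print(is_correct(early, 0.05), cross_entropy(early))  # correct, low cost
print(is_correct(late, 0.35), cross_entropy(late))    # correct, high cost
```

Both predictions count identically toward accuracy, yet the second contributes a much larger cost: this is exactly how the test cost can start rising at epoch 15 while test accuracy keeps improving until 280.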