
Exercise 7.1 - Behavior of training set error with increasing sample size

Answers

The claim that "the error on the test set will always decrease as we get more training data, since the model will be better estimated" does not hold unconditionally. It rests on at least the following assumptions:

  • The test data is drawn from the same distribution as the training data. This is plausible in general, but it fails in settings such as adversarial training (where an adversary deliberately crafts samples that mislead the model) or defense against zero-day attacks (where the model must detect a previously unseen threat after being trained only on benign samples).
  • The model is expressive enough to capture the structure in the data. As a basic counterexample, consider a tabular predictor that stores only the most recent fixed number of samples and predicts by table lookup: for such a model, adding more data hardly helps, and a violation of the i.i.d. assumption could even worsen its performance (see the sketch after this list).
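
To make the second point concrete, here is a minimal sketch (not from the book) of such a capacity-limited predictor. The class name TableLookupPredictor, the capacity parameter, and the toy labeling rule are all illustrative assumptions; the point is only that a model which memorizes a bounded table cannot benefit from more data.

```python
# Sketch: a predictor that memorizes only its most recent `capacity` examples
# and answers by exact-match table lookup. All names here are illustrative.
import random

class TableLookupPredictor:
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.table = {}      # maps input -> label
        self.order = []      # insertion order, used for eviction

    def fit(self, xs, ys):
        for x, y in zip(xs, ys):
            if x not in self.table:
                self.order.append(x)
            self.table[x] = y
            if len(self.order) > self.capacity:   # forget the oldest entry
                oldest = self.order.pop(0)
                del self.table[oldest]

    def predict(self, x, default=0):
        return self.table.get(x, default)         # unseen inputs get a default guess

def training_accuracy(model, xs, ys):
    return sum(model.predict(x) == y for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
for n in [50, 500, 5000]:
    xs = [random.randrange(10_000) for _ in range(n)]
    ys = [x % 2 for x in xs]                      # a trivially learnable rule
    model = TableLookupPredictor(capacity=50)
    model.fit(xs, ys)
    print(n, round(training_accuracy(model, xs, ys), 3))
```

Since the table holds at most 50 entries, the fraction of the training set the model can recall shrinks as n grows, so its training accuracy decays toward chance level no matter how much data is supplied.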

We now restrict the discussion to the setting where the PAC-learning assumptions hold. When the training set is small, a highly flexible model typically overfits the available data, so the training accuracy can be very high (and the training error correspondingly low). As the training set grows, the model can no longer fit every sample and must settle on more general-purpose parameters; the overfitting diminishes, the training accuracy drops, and the training error rises toward the noise floor. As pointed out in Section 7.5.4, enlarging the training set is, alongside adding regularizers, an important way to counter overfitting. Collecting more data is also, in a sense, the only fundamental route to robustness, since one cannot exhaust all possible data-augmentation strategies.
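
The following short simulation (my own sketch, not from the book) illustrates this trend: a fixed-capacity polynomial model is fit to increasingly large training sets, and its training error rises toward the noise level while its test error falls. The target function, polynomial degree, and noise level are arbitrary choices for illustration.

```python
# Sketch: training error of a fixed-capacity model rises toward the noise
# floor as the training set grows, while test error falls.
import numpy as np

rng = np.random.default_rng(0)
degree = 10          # a fairly flexible polynomial model (fixed capacity)
noise_std = 0.3

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, noise_std, size=n)
    return x, y

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_test, y_test = make_data(10_000)
for n in [20, 50, 200, 1000, 5000]:
    x_train, y_train = make_data(n)
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    print(f"n={n:5d}  train MSE={mse(coeffs, x_train, y_train):.3f}  "
          f"test MSE={mse(coeffs, x_test, y_test):.3f}")
```

With n close to the number of coefficients the fit nearly interpolates the training points (training MSE near zero, test MSE large); as n grows, both errors converge toward the irreducible noise variance of about 0.09.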
