13.1 Optimizing Model Fit
You have just learned A LOT. By now, some of you may be feeling a bit lost and wondering how this all fits together. Basically, you've been learning a variety of techniques to optimize model fit. Or, in other words, to maximize your fit statistics (accuracy, AUC, R squared, RMSE) without overfitting. Overfitting occurs when a model is excessively complex relative to the data--for example, when too many variables are measured for the number of cases observed--so the model ends up fitting the noise in your sample rather than the underlying pattern. See the image below for a visual depiction of model fit.
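To make this concrete, here is a minimal sketch of overfitting using only numpy (this is an illustrative example, not part of the Azure ML Studio workflow described in this chapter). We fit two polynomials to a small, noisy dataset: a simple one-slope model, and a model with as many coefficients as there are training cases--exactly the "too many variables for the number of cases" situation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small, noisy dataset: 12 training points and 50 test points
# drawn from the same underlying relationship y = 2x + noise.
x_train = rng.uniform(-1, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.3, 12)
x_test = rng.uniform(-1, 1, 50)
y_test = 2 * x_test + rng.normal(0, 0.3, 50)

def fit_rmse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then measure RMSE on both the training and the test set.
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.sqrt(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_err = np.sqrt(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    return train_err, test_err

simple_train, simple_test = fit_rmse(1)    # one slope: about the right complexity
complex_train, complex_test = fit_rmse(11) # 12 coefficients for 12 cases: overfit

# The overly complex model looks better on the data it was fit to,
# but generalizes worse to cases it has never seen.
assert complex_train < simple_train
assert complex_test > simple_test
```

The complex model drives its training error to nearly zero by memorizing the noise, which is exactly why its test error gets worse, not better.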
The negative outcome of overfitting is that you may get predictions that are too "custom-fit" to the cases you have, so they won't generalize well to the entire population (and to future predictions). So, how do you prevent overfitting? There are actually two primary strategies that can be jointly implemented. The first involves reducing the number of features in your model. That's why you learned filter-based feature selection, permutation feature importance, and PCA. The second involves how you sample your data for training your model versus testing your model. That's why you learned a variety of new algorithms besides linear regression--many of which rely on more sophisticated sampling techniques, such as bootstrapping or cross-validation.
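The two strategies can be sketched together in a few lines of numpy. This is a hedged, minimal illustration on synthetic data: the correlation cutoff of 0.3 and the 75/25 split are arbitrary choices for the example, and the filter used here (absolute correlation with the target) is just one simple form of filter-based feature selection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic dataset: 100 cases, 10 features, but only the first
# two features actually drive the outcome.
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)

# Strategy 1 (filter-based feature selection): score each feature by the
# absolute value of its correlation with the target, and keep only the
# features above a threshold (0.3 is an illustrative cutoff).
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
keep = scores > 0.3
X_reduced = X[:, keep]

# Strategy 2 (train/test sampling): hold out 25% of the cases so the
# model is scored on data it never saw during fitting.
idx = rng.permutation(100)
train, test = idx[:75], idx[75:]

# Fit ordinary least squares on the reduced training data...
coefs, *_ = np.linalg.lstsq(X_reduced[train], y[train], rcond=None)

# ...and evaluate R squared on the held-out test cases only.
pred = X_reduced[test] @ coefs
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Because the model is both small (few features) and scored on held-out cases, a high R squared here is evidence of genuine fit rather than memorization.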
Therefore, the two primary (and somewhat competing) objectives that you need to achieve--and that you have been building your Azure ML Studio skills toward--are:
Maximize the accuracy (or R squared) of the model
Minimize the likelihood of overfitting your model
To sum up all of the techniques you've learned thus far for meeting these two objectives, you may want to carefully review the image below. It is a (relatively) quick cheat sheet for optimizing model fit. It represents the four general categories, or techniques, for improving model fit. There is some ordering to these techniques, but often you'll need to iteratively try and retry all of them.