Practice: Hyper-parameter tuning for an overfitted decision tree classifier
This project was initiated as part of a machine learning course from Udacity where I was tasked with selecting a model that accurately predicts whether an individual makes more than $50,000. The dataset for this project originates from the UCI Machine Learning Repository.
Below is a glimpse of the features in the dataset:
The data was preprocessed by scaling the continuous variables and encoding the categorical ones. I then split the data into training and test sets, trained 3 classifiers, and obtained the following accuracy, F-score, and duration results:
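The preprocessing steps above can be sketched as follows. This is a minimal illustration on a hypothetical frame; the column names and values are assumptions, not the actual census features:

```python
# Hedged sketch of the preprocessing described above: scale continuous
# columns, one-hot encode categorical ones, then split train/test.
# Column names here are hypothetical stand-ins for the census features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "hours-per-week": [40, 50, 60, 35],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
    "income": [0, 1, 1, 0],
})

preprocess = ColumnTransformer([
    ("scale", MinMaxScaler(), ["age", "hours-per-week"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])

X = preprocess.fit_transform(df.drop(columns="income"))
y = df["income"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```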
The decision tree and random forest classifiers overfit the data: they performed well on the training set but failed to generalize to the test set. I decided to explore hyper-parameter tuning as a way to optimize model performance.
1. Review and understand the available model parameters
Below are the default parameter values:
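One way to inspect these defaults programmatically is with `get_params()`; the values printed below reflect recent scikit-learn releases:

```python
# Inspect the default hyper-parameters of a DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier

params = DecisionTreeClassifier().get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value}")
# Notably, max_depth defaults to None and min_samples_leaf to 1,
# which is what allows the tree to grow until it memorizes the data.
```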
2. Specify an objective metric
By default, sklearn uses accuracy_score for classification and the R² score for regression. Alternative metrics can be selected based on what matters for the classification problem in question, as defined here.
For example, when classifying a patient as sick or healthy, we care more about the model's ability to correctly identify sick patients, so that they are not sent away without treatment, than about a healthy person being occasionally misclassified.
In such cases, we may select recall as the objective metric, since it measures the fraction of actual positive cases that are correctly identified, or the F1 score, which gives a better measure of incorrectly classified cases than accuracy does.
3. Select an approach for the parameter search
Commonly used approaches include GridSearchCV, which exhaustively evaluates every combination in a parameter grid, and RandomizedSearchCV, which samples a given number of candidates from a parameter space with specified distributions.
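As a quick sketch of the randomized variant, the snippet below samples candidates from integer distributions on synthetic data; the parameter ranges are illustrative assumptions, not values from this project:

```python
# RandomizedSearchCV draws n_iter candidates from distributions instead
# of enumerating every combination the way GridSearchCV does.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": randint(2, 12),       # sampled uniformly from [2, 12)
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=10,   # only 10 candidates tried, regardless of space size
    cv=4,
    random_state=0,
).fit(X, y)

print(search.best_params_)
```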
For tuning the decision tree classifier, I chose to optimize the F1 score metric using a grid search cross validation approach and tuned the following parameters:
- max_depth : if left as None, the tree's nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples, which often leads to overfitting, as shown in the example below:
- min_samples_leaf : I tuned this parameter to prevent the model from fitting a single data point in a leaf node, which can be another symptom of overfitting.
Lastly, I set the cross-validation (cv) strategy to 4 so that the search generates stratified folds for the model to use when validating results.
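The tuning setup described above can be sketched as follows. It runs on synthetic data since the census features are not reproduced here, and the candidate values in the grid are illustrative assumptions:

```python
# Grid search over max_depth and min_samples_leaf, optimizing F1 with
# 4-fold cross-validation (StratifiedKFold is the default for classifiers).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=42)

param_grid = {
    "max_depth": [3, 5, 8, None],    # None lets leaves expand until pure
    "min_samples_leaf": [1, 5, 10],  # larger values smooth noisy leaves
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # objective metric chosen for this problem
    cv=4,          # 4 stratified folds
).fit(X, y)

print(grid.best_params_, grid.best_score_)
```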
The tuned model has an improved F-score, which shows that selecting better hyper-parameter values mitigated the overfitting: the model now generalizes better when presented with unseen data.
This concludes my tuning practice session, where I learnt how to improve a classification model through hyper-parameter tuning.