How to avoid overfitting in a random forest in Python?
How to solve the overfitting problem in a random forest?
Approach: Work through the checklist below to diagnose and reduce overfitting in a random forest model.
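Step 1: Data size: Is the dataset too small? With too few samples the trees can memorize the training data, so gather more data if possible before tuning anything else.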
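Step 2: n_estimators: The more trees there are, the less likely the forest is to overfit and the better it tends to generalize, because averaging many trees reduces variance. Keep time complexity in mind, though: training and prediction cost grow with the number of trees. With very few trees, the model behaves much like a single decision tree with a restricted feature set.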
Step 3: max_features: The maximum number of features each tree considers when looking for the best split. The smaller the value, the less likely the model is to overfit, but a value that is too small will start to introduce underfitting. Rule of thumb for classification: the square root of the number of features in the dataset (max_features="sqrt" in scikit-learn).
Step 4: max_depth: You will need to experiment with this. Start with a value in the 5-10 range. Increasing the depth can improve the fit, but it also increases model complexity and with it the risk of overfitting.
Step 5: min_samples_leaf: Set a value greater than 1. This has a similar effect to the max_depth parameter: a node is split only if every resulting leaf would still contain at least min_samples_leaf samples, so branches stop growing earlier.
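
The checklist above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not a tuned setup: the synthetic dataset from make_classification, the parameter values (200 trees, max_depth=8, min_samples_leaf=5), and the helper name train_test_gap are all assumptions made for the example. A large gap between training and test accuracy is a common sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def train_test_gap(**params):
    """Hypothetical helper: fit a forest, return (train accuracy, test accuracy)."""
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Baseline: fully grown trees (max_depth=None, min_samples_leaf=1 by default).
baseline = train_test_gap(n_estimators=200)

# Constrained forest following steps 2-5 of the checklist.
constrained = train_test_gap(
    n_estimators=200,      # step 2: enough trees to average out variance
    max_features="sqrt",   # step 3: square-root rule of thumb
    max_depth=8,           # step 4: start in the 5-10 range
    min_samples_leaf=5,    # step 5: greater than 1
)

for name, (train_acc, test_acc) in [("baseline", baseline), ("constrained", constrained)]:
    gap = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

The values shown are starting points only. In practice they would usually be chosen with cross-validation, for example with scikit-learn's GridSearchCV, rather than set by hand.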