Data Mining: Decision Trees
Decision trees are an interpretable model for classification and regression. They are built top-down: starting from the root node, each parent node selects a feature to split on. Feature selection is based on variance reduction, information gain, or Gini impurity, and all examples at a node are used for feature selection. Decision trees are limited by overfitting, sensitivity to the data, and the difficulty of parallelizing their training. Random forests improve robustness by training multiple decision trees: the trees are trained in parallel, with majority voting for classification and averaging for regression. The randomness comes from bagging and random feature selection. Decision trees are widely used in industry, but they are sensitive to data, and ensemble methods can help improve performance.
Table of Contents
- Building Decision Trees
- Limitations of Decision Trees
- Random Forest
- Summary

Building Decision Trees
Adopt a top-down approach, starting from the root node, where all features are available.
At each parent node, select one feature to split the examples on.
For continuous targets, select the feature that maximizes variance reduction.
For categorical targets, select the feature that maximizes information gain, i.e. the reduction in entropy.
Alternatively, select the feature that maximizes the reduction in Gini impurity, where Gini impurity is computed as $1 - \sum_{i=1}^{n} p_i^2$ (see the sketch after this list).
All examples are used for feature selection at each node.
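The three criteria above can be computed directly from the target values at a node. Below is a minimal Python sketch; the function names and toy data are illustrative and not part of the original notes.

```python
# Minimal sketch of the three split criteria: variance reduction (continuous
# targets), information gain (reduction in entropy), and Gini impurity.
import numpy as np

def variance_reduction(parent, left, right):
    """Drop in variance after a split -- used for continuous targets."""
    n = len(parent)
    weighted = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return np.var(parent) - weighted

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy after a split -- used for categorical targets."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def gini_impurity(labels):
    """Gini impurity: 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy example: a split that separates the two classes fairly well.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([1, 1])
print(information_gain(parent, left, right))  # ~0.46: entropy decreases after the split
print(gini_impurity(parent))                  # 0.5 for a balanced binary node
```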
Limitations of Decision Trees
Over-complex trees can overfit the data
* Limit the number of levels of splitting
* Prune branches (a sketch of both mitigations follows below)
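Both mitigations are available in common libraries; the sketch below assumes scikit-learn, where `max_depth` caps the number of split levels and `ccp_alpha` enables cost-complexity (post-)pruning. The parameter values are illustrative.

```python
# Limiting tree depth vs. pruning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print("max_depth=3:", shallow.score(X_test, y_test))
print("ccp_alpha=0.02:", pruned.score(X_test, y_test))
```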
Sensitive to data
Modifying a few examples can result in different features being selected, which in turn results in a different tree (a small illustration follows below). Random forest is an ensemble learning method that constructs multiple decision trees at training time and outputs the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression).
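As a small, hypothetical illustration of this sensitivity, one can retrain the same tree after dropping a handful of examples and compare the feature chosen at the root; the dropped indices below are arbitrary.

```python
# Perturbing the training data may change the root split (and hence the tree).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

keep = np.ones(len(y), dtype=bool)
keep[[70, 77, 83, 106, 119]] = False           # drop a few examples
perturbed = DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep])

# The root feature (and the overall tree shape) may differ between the two fits.
print("root feature, full data:", full.tree_.feature[0])
print("root feature, perturbed data:", perturbed.tree_.feature[0])
```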
Training is not easy to parallelize
Random Forest
Train multiple decision trees to improve robustness
Trees are trained independently in parallel
Majority voting for classification, average for regression
Where is the randomness from?
* Bagging: randomly sample training examples with replacement
* E.g. [1,2,3,4,5] → [1,2,2,3,4]
* Randomly select a subset of features (a minimal sketch of both sources follows below)
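To make both sources of randomness concrete, here is a minimal from-scratch sketch with illustrative names. For simplicity it samples one feature subset per tree, whereas typical random forest implementations re-sample the feature subset at every split.

```python
# Random forest sketch: bagging + random feature subsets + majority voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=25, seed=0):
    """Each tree sees a bootstrap sample of rows and a random subset of columns."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                                  # bagging: sample with replacement
        cols = rng.choice(d, size=max(1, int(np.sqrt(d))), replace=False)  # random feature subset
        tree = DecisionTreeClassifier(random_state=0).fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_random_forest(forest, X):
    """Majority vote over trees (assumes non-negative integer class labels)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])  # (n_trees, n_samples)
    return np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])

# Example usage on Iris (scikit-learn assumed available).
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
forest = fit_random_forest(X, y)
print((predict_random_forest(forest, X) == y).mean())   # training accuracy
```

A library implementation such as `sklearn.ensemble.RandomForestClassifier` handles the bootstrap sampling and per-split feature sampling internally.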
Summary
- Decision trees: a comprehensible model for classification and regression tasks
- Straightforward to train and tune, ubiquitous in industrial applications
- Highly sensitive to data nuances
- Ensembles can significantly enhance performance, particularly through techniques like bagging and boosting, which are covered in more detail later.
