A decision tree is a multi-class classification tool: it assigns a data point to one of two or more available classes. It does so by dividing the sample space into rectilinear regions, which is easiest to see with an example. Suppose we have the auto-insurance claim data shown in the table below, and we want to predict which customer profiles are more likely to lead to a claim payout. The decision tree model may first divide the sample space based on Age, giving two regions. One of those regions may then be subdivided based on Marital_status, and that newly created sub-region may in turn be divided based on Num_of_vehicle_owned.
A decision tree is made up of a root node followed by intermediate nodes and leaf nodes. Each leaf node represents one of the classes into which data points are classified. An intermediate node represents the decision rule by which the parent node's data points are divided among its child nodes. This rule is chosen by computing a measure of impurity, either the Gini index or entropy, for each candidate predictor variable and selecting the predictor that yields child nodes purer than the parent node.
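To make the impurity measure concrete, here is a minimal R sketch of a Gini-index calculation for a candidate split; the helper names gini and split_impurity are hypothetical, introduced only for this illustration, not part of any package.

# Gini index of a set of class labels: 1 - sum over classes of p_k^2
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a candidate split;
# 'left' is a logical vector marking which rows fall into the left child
split_impurity <- function(labels, left) {
  n <- length(labels)
  (sum(left) / n) * gini(labels[left]) +
    (sum(!left) / n) * gini(labels[!left])
}

# Toy data from the table below: split on Marital_status == "Single"
claim   <- c(0, 0, 1)
marital <- c("Single", "Married", "Single")
gini(claim)                                  # parent impurity: 4/9 ~ 0.444
split_impurity(claim, marital == "Single")   # child impurity:  1/3 ~ 0.333

Since the weighted impurity of the children (about 0.333) is lower than that of the parent (about 0.444), this split produces purer child nodes.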
| Claim | Age | Num_of_vehicle_owned | Marital_status |
|-------|-----|----------------------|----------------|
| 0     | 25  | 1                    | Single         |
| 0     | 30  | 2                    | Married        |
| 1     | 25  | 1                    | Single         |
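For the R code that follows to be self-contained, this table can be loaded as a data frame named train.data (the name is an assumption chosen to match the code below; a real training set would have far more rows, since the split thresholds in the code require at least 100 observations per node):

# Toy version of the training data from the table above
train.data <- data.frame(
  Claim                = c(0, 0, 1),
  Age                  = c(25, 30, 25),
  Num_of_vehicle_owned = c(1, 2, 1),
  Marital_status       = factor(c("Single", "Married", "Single"))
)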
The advantages of a decision tree are:
- Multi-class classifier
- Handles heterogeneous attributes (categorical and continuous)
- No data normalization needed
- Fast training speed
- Model interpretability
- Fast testing speed

The disadvantages are:
- Weak classifier on its own
- Large training data set needed
- Overfitting may happen unless the tree is pruned (see the pruning sketch after the code below)
R code to build the decision tree, using the rpart package:

library(rpart)

# Grow a classification tree predicting whether a claim is paid out
model <- rpart(Claim > 0 ~ Age + Num_of_vehicle_owned + Marital_status,
               data = train.data,
               method = "class",
               control = rpart.control(cp = 0.001,       # complexity parameter
                                       minsplit = 100,   # min observations needed to attempt a split
                                       minbucket = 100,  # min observations allowed in a leaf
                                       maxdepth = 5))    # maximum depth of the tree
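As noted in the disadvantages above, an unpruned tree may overfit. A minimal pruning sketch, assuming the model above has been fitted: rpart stores a table of cross-validated errors in model$cptable, and prune() cuts the tree back to a chosen complexity parameter. The names best_cp, pruned, and new.data are hypothetical.

# Pick the cp value with the lowest cross-validated error
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]

# Prune the tree back to that complexity parameter
pruned <- prune(model, cp = best_cp)

# Classify new observations (new.data is a hypothetical data frame
# with the same columns as train.data)
predict(pruned, newdata = new.data, type = "class")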