I developed a method for generating understandable rule-based predictive models from complex ensemble models.  The details of the project are presented in the following table:
Project Synopsis
Title:RuleCOSI: Rule extraction for interpreting classification tree ensembles
Highlights:
  • Development of a post-hoc explainability method for accurate tree ensemble models
  • The size of the ensemble models is reduced by around 96% without decreasing model performance, with an average F-measure of 91.82
Period: 2018 – Present
Outputs:
  • Academic article about the extended version of the algorithm published in the top journal Information Fusion
  • Academic article about the first version of the algorithm published in the journal Expert Systems with Applications
  • Book chapter about the explanation of ensemble models
  • Python library compatible with scikit-learn, available on GitHub
Skills & Technologies:
  • Ensemble learning
  • Rule-based classification
  • Explainable Machine Learning
  • Software engineering
  • Academic writing

Research details

Machine learning is now widely used in practical applications to solve problems that require predictive analytics. New methods are constantly being proposed in the field, each incrementally improving on the performance of older models. However, better predictive performance usually comes with greater model complexity, which makes the decision mechanisms of the models difficult for humans to understand. Figure 1 illustrates this trade-off. The purpose of this research project was therefore to increase the interpretability of tree ensembles for classification.

Figure 1. Trade off between accuracy and interpretability of machine learning models.

"Interpretability is the degree to which a human can understand the cause of a decision." (Miller, 2017)

For this purpose, I proposed RuleCOSI (Rule COmbination and SImplification), a novel heuristic method that extracts, combines and simplifies decision rules from ensembles. The initial algorithm was published in 2019 [1]. My research has evolved since then, and it was the main topic of my doctoral dissertation in 2020. I recently published the extended version, RuleCOSI+, in the top-ranked journal Information Fusion [2]. In this short post I introduce the main characteristics of the most recent version of the algorithm and show a small example result.

RuleCOSI+ algorithm

The algorithm has three basic steps, as depicted in Figure 2.

Figure 2. Overview of the algorithm

The first step is to extract a ruleset from each of the trees that form the ensemble. This is done with a simple procedure in which one rule is created for each path from the root node of the tree to a leaf node.
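
The path-to-rule idea can be sketched in a few lines of plain Python. This is only an illustration, not the library's implementation: the nested-dictionary tree format, the feature names and the `extract_rules` function are all assumptions made for the example.

```python
# Hypothetical toy tree (illustrative format, not the library's):
# internal nodes hold a feature, a threshold and two children;
# leaves hold only a predicted class label.
toy_tree = {
    "feature": "x1", "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "x2", "threshold": 2.0,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}


def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf: the accumulated path becomes a rule
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + ((feature, "<=", threshold),))
    right = extract_rules(node["right"], conditions + ((feature, ">", threshold),))
    return left + right


rules = extract_rules(toy_tree)
for conds, label in rules:
    body = " AND ".join(f"{f} {op} {t}" for f, op, t in conds)
    print(f"IF {body} THEN class = {label}")
```

A tree with three leaves yields three rules, so the ruleset of a single tree is exactly as large as its number of leaves.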

The second step is to combine the extracted rulesets, merging the regions of the feature space covered by their rules. The final step is to generalize and simplify the combined rules based on their pessimistic error.
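
As a rough sketch of what combining rules in the feature space can mean, the snippet below intersects the conditions of two rules coming from different trees. The interval representation, the `combine` function and the example rules are assumptions for illustration. How the class of a combined rule is decided depends on the ensemble type and is not shown here.

```python
import math


def combine(rule_a, rule_b):
    """Intersect two rule bodies given as {feature: (low, high)} intervals,
    where each interval means low < value <= high.
    Returns None when the two rules cover no common region."""
    combined = dict(rule_a)
    for feature, (low_b, high_b) in rule_b.items():
        low_a, high_a = combined.get(feature, (-math.inf, math.inf))
        low, high = max(low_a, low_b), min(high_a, high_b)
        if low >= high:  # empty interval: the rules never fire together
            return None
        combined[feature] = (low, high)
    return combined


# Hypothetical rule bodies taken from different trees.
rule_1 = {"x1": (0.5, math.inf)}                            # x1 > 0.5
rule_2 = {"x1": (-math.inf, 3.0), "x2": (-math.inf, 2.0)}   # x1 <= 3.0 AND x2 <= 2.0
rule_3 = {"x1": (-math.inf, 0.2)}                           # x1 <= 0.2

merged = combine(rule_1, rule_2)    # 0.5 < x1 <= 3.0 AND x2 <= 2.0
disjoint = combine(rule_1, rule_3)  # None: the conditions contradict each other
```

Contradictory combinations such as `rule_1` with `rule_3` cover no part of the feature space, so they can be discarded immediately.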

The output of the algorithm is a single set of decision rules that are much simpler and have a similar performance to that of the tree ensemble.
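
To give an intuition for the pessimistic error that drives the simplification step, here is a minimal sketch of the upper confidence bound popularized by C4.5 pruning. Treat it as an assumption: the exact definition used by RuleCOSI+ is given in the paper [2] and may differ. The function name and the default `z` value are chosen for this example.

```python
import math


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on the error rate of a rule that misclassifies
    `errors` of the `n` training examples it covers.
    z = 0.69 is the value usually quoted for C4.5's default 25% confidence."""
    if n == 0:
        return 1.0  # assumption: a rule covering nothing gets the worst estimate
    f = errors / n  # observed error rate
    spread = math.sqrt(f / n - f**2 / n + z**2 / (4 * n**2))
    return (f + z**2 / (2 * n) + z * spread) / (1 + z**2 / n)


small = pessimistic_error(2, 6)      # few covered examples: strong penalty
large = pessimistic_error(200, 600)  # same error rate, much more evidence
print(round(small, 3), round(large, 3))
```

Both rules have an observed error rate of one third, but the estimate for the rule covering only six examples is noticeably more pessimistic. This is why removing a condition can lower the pessimistic error even when the raw error stays the same: the generalized rule covers more examples.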

RuleCOSI+ can handle two types of tree ensembles: boosting and bagging. The Python library works with several implementations of these ensemble types, such as Random Forests, XGBoost, CatBoost and LightGBM.

Example

Here I present an example using the UCI Steel Plates Faults dataset. The dataset contains 27 indicators that approximately describe the geometric shape of the defect and its outline. The task is to classify the type of surface defect. Because RuleCOSI+ works only with binary classification problems, I considered only the dirtiness fault type for this example.

Figure 3. Steel plate

The first step is to train a tree ensemble. In this case I trained an XGBoost ensemble with 50 trees. The F-measure of this model is 0.9958. The first 10 trees are shown in Figure 4.

Figure 4. First 10 trees of the XGBoost model for classifying the dirtiness fault type in the steel plates faults dataset.

After applying RuleCOSI+ to the result, the simplified ruleset has just 7 rules, with an F-measure value of 0.9926. The generated rules are presented in Figure 5.

Figure 5. Combined and simplified ruleset obtained from the XGBoost tree ensemble from Figure 4.
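
For reference, the F-measure used to compare the two models is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the experiment above.

```python
def f_measure(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # convention when no positive example is predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
score = f_measure(tp=90, fp=10, fn=10)
print(round(score, 4))
```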

Python library

The library is implemented as a Python package, available on GitHub. Its documentation is available online.

Conclusions

Tree ensembles are widely used to improve classification performance in many domains, including fault detection in manufacturing. However, the complexity of these ensembles makes them very hard for humans to interpret.

RuleCOSI+ gave satisfactory results, improving the interpretability of tree ensembles without a significant decrease in their classification performance.

References

[1] Obregon, J., Kim, A., & Jung, J. Y. (2019). RuleCOSI: Combination and simplification of production rules from boosted decision trees for imbalanced classification. Expert Systems with Applications, 126, 64-82.

[2] Obregon, J. & Jung, J. Y. (2022). RuleCOSI+: Rule Extraction for Interpreting Classification Tree Ensembles. Information Fusion, in press.