GPT2-small-spanish: a Language Model for Spanish text generation

Last year I was thrilled by all the articles about GPT-3 (Generative Pre-trained Transformer 3), a language model created by OpenAI with billions of parameters that is capable of generating human-like text. Because accessing that model was hard, a colleague at INU in Korea and I decided to train a GPT-2 model and see its capabilities firsthand.

Because our mother tongue is Spanish, we decided to fine-tune a pre-trained English model using the Spanish Wikipedia. The process was not as simple as it looked on paper because of the complexity of the model. Luckily, we found a very nice article written by Pierre Guillou on how to fine-tune the GPT-2 model for Portuguese.

The result was a good learning experience on different topics related to NLP such as tokenization, language generation, transformer models and the Hugging Face platform, which is an AI community that provides very useful libraries to build, train and deploy state-of-the-art models in the NLP area.

The model was uploaded to the Hugging Face platform and so far (June 2021) has been downloaded more than 2,500 times. It is published under the user ‘datificate’, a personal side project that my friend and I started to create machine learning content in Spanish.

The model can be used for different tasks, but here I show the result (in Spanish) of text generation based on the classic unicorn text used in the original presentation of the GPT-2 model. The notebook used to generate the results can be found in our GitHub repository.

>> Texto Generado 1
En un hallazgo impactante, el científico descubrió un rebaño de unicornios que vivían en un valle remoto, anteriormente inexplorado, en las montañas de los Andes. Aún más sorprendente para los investigadores fue el hecho de que los unicornios hablaban perfecto español. A su vez, el descubrimiento de la especie “N. scelanoides” había llevado al editor de ciencia de la Universidad de Toronto, William King, a la conclusión de que “nadie está de acuerdo en quién es el “Unicornio del Pacífico Norte del mundo”. En consecuencia, King, en su artículo de 1996 “La hipótesis de King y las dificultades de la hipótesis de King”, sugirió que el nombre del género “Aeolophora” no debe ser confundido con las lenguas de los indígenas de Norteamérica.

Translation (using Google Translate)

>> Generated text 1

In a shocking find, the scientist discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect Spanish. In turn, the discovery of the species “N. scelanoides” had led the science editor of the University of Toronto, William King, to the conclusion that “no one agrees on who is the ‘Unicorn of the North Pacific of the world’”. Consequently, King, in his 1996 article “The King Hypothesis and the Difficulties of the King Hypothesis,” suggested that the genus name “Aeolophora” should not be confused with the languages of the Native Americans of North America.
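A sample like the one above can be produced with the Hugging Face transformers library. The following is only a minimal sketch, not the notebook from our repository. It assumes that the model id on the Hub is `datificate/gpt2-small-spanish` (user name plus model name), that the first two sentences of the sample are the prompt, and that the sampling values are reasonable defaults; all three are illustrative choices.

```python
# Minimal sketch; requires `pip install transformers torch`.
# MODEL_ID, PROMPT and the sampling values below are assumptions for illustration.

MODEL_ID = "datificate/gpt2-small-spanish"  # assumed Hub id: user name + model name

PROMPT = (
    "En un hallazgo impactante, el científico descubrió un rebaño de unicornios "
    "que vivían en un valle remoto, anteriormente inexplorado, en las montañas "
    "de los Andes. Aún más sorprendente para los investigadores fue el hecho de "
    "que los unicornios hablaban perfecto español."
)


def sampling_config(max_length=200, top_k=50, top_p=0.95, n_samples=1):
    """Return keyword arguments for sampled (non-greedy) generation."""
    return {
        "do_sample": True,
        "max_length": max_length,
        "top_k": top_k,
        "top_p": top_p,
        "num_return_sequences": n_samples,
    }


def generate(prompt, model_id=MODEL_ID, **overrides):
    """Generate continuations of `prompt`; downloads the model on first use."""
    from transformers import pipeline  # imported lazily because it is a heavy dependency

    generator = pipeline("text-generation", model=model_id)
    config = {**sampling_config(), **overrides}
    outputs = generator(prompt, **config)
    return [out["generated_text"] for out in outputs]


# Example (downloads the model weights):
# for i, text in enumerate(generate(PROMPT, num_return_sequences=2), start=1):
#     print(f">> Texto Generado {i}\n{text}\n")
```

Because generation is sampled, every run gives a different continuation, so the text will not match the sample above.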

The model needs more fine-tuning with other types of text (news articles, movie subtitles, etc.). However, the result was fun, and it was very exciting to learn about NLP tasks in the field of artificial intelligence.

Machine learning interpretability of tree ensembles

Nowadays, machine learning is widely used in practical applications that require predictive analytics. New methods are constantly being presented in the field, incrementally improving on the performance of older models. However, the improvement in predictive performance usually comes with an increase in model complexity, which makes the decision mechanisms of the models hard for humans to understand, as shown in Figure 1. Therefore, the purpose of this research project was to increase the interpretability of tree ensembles for classification.

Figure 1. Trade off between accuracy and interpretability of machine learning models.

Interpretability is the degree to which a human can understand the cause of a decision

Miller, 2017

For this purpose, I presented RuleCOSI (Rule COmbination and SImplification), a novel heuristic method that extracts, combines and simplifies decision rules from ensembles. The initial algorithm was published in an academic paper [1] in 2019. My research has evolved since then, and it was the main topic of my doctoral dissertation in 2020. In this short post I introduce the main characteristics of the algorithm and show a small example result.

RuleCOSI algorithm

The algorithm has three basic steps, as depicted in Figure 2.

Figure 2. Overview of the algorithm

The first step is to extract a ruleset from each of the trees forming the tree ensemble. This is done with a simple procedure in which one rule is created for each path from the root node of the tree to a leaf node.
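The extraction step can be sketched in a few lines. This is not the code of the library; it assumes a hypothetical tree stored as nested dictionaries, where an internal node has a `feature`, a `threshold` and `left`/`right` children (left meaning "less than or equal"), and a leaf has only a `label`.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf node: the accumulated path becomes a rule
        return [(list(conditions), node["label"])]
    feat, thr = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + ((feat, "<=", thr),))
    right = extract_rules(node["right"], conditions + ((feat, ">", thr),))
    return left + right


# Hypothetical toy tree with three leaves.
tree = {
    "feature": "x1", "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "x2", "threshold": 2.0,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

rules = extract_rules(tree)
for conds, label in rules:
    body = " AND ".join(f"{f} {op} {t}" for f, op, t in conds)
    print(f"IF {body} THEN class = {label}")
# IF x1 <= 0.5 THEN class = 0
# IF x1 > 0.5 AND x2 <= 2.0 THEN class = 1
# IF x1 > 0.5 AND x2 > 2.0 THEN class = 0
```

A tree with three leaves yields exactly three rules, so an ensemble of many trees quickly produces a large number of rules. This is why the later steps are needed.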

The second step is to combine the rules from the different rulesets over the feature space they cover. The final step is to generalize and simplify the combined rules based on their pessimistic error.
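To give an intuition for these two steps, here is a simplified sketch. It is not the procedure from the paper or the library. It assumes that a rule is a dictionary mapping each feature to an interval `(low, high)` meaning `low < x <= high`, that two rules are combined by intersecting their intervals, and that the pessimistic error is the plain error rate with a 0.5 continuity correction. The data rows are made up.

```python
import math


def combine(rule_a, rule_b):
    """Intersect two rules; return None if they share no region of the feature space."""
    merged = dict(rule_a)
    for feat, (low_b, high_b) in rule_b.items():
        low_a, high_a = merged.get(feat, (-math.inf, math.inf))
        low, high = max(low_a, low_b), min(high_a, high_b)
        if low >= high:  # empty interval: the rules never fire together
            return None
        merged[feat] = (low, high)
    return merged


def pessimistic_error(n_covered, n_errors):
    """Error rate with a 0.5 continuity correction (assumed form of the estimate)."""
    if n_covered == 0:
        return 1.0
    return min(1.0, (n_errors + 0.5) / n_covered)


def covers(rule, row):
    return all(low < row[feat] <= high for feat, (low, high) in rule.items())


def rule_error(rule, label, rows):
    covered = [r for r in rows if covers(rule, r)]
    errors = sum(1 for r in covered if r["y"] != label)
    return pessimistic_error(len(covered), errors)


def generalize(rule, label, rows):
    """Greedily drop conditions while the pessimistic error does not increase."""
    current = dict(rule)
    improved = True
    while improved and current:
        improved = False
        base = rule_error(current, label, rows)
        for feat in list(current):
            candidate = {f: iv for f, iv in current.items() if f != feat}
            if rule_error(candidate, label, rows) <= base:
                current, improved = candidate, True
                break
    return current


# Made-up data: y is the class label.
rows = [
    {"x1": 0.2, "x2": 1.0, "y": 1},
    {"x1": 0.3, "x2": 3.0, "y": 1},
    {"x1": 0.4, "x2": 5.0, "y": 1},
    {"x1": 0.8, "x2": 1.5, "y": 0},
    {"x1": 0.9, "x2": 4.0, "y": 0},
]

combined = combine({"x1": (-math.inf, 0.5)},
                   {"x1": (-math.inf, 0.6), "x2": (-math.inf, 3.5)})
simplified = generalize(combined, label=1, rows=rows)
print(combined)    # {'x1': (-inf, 0.5), 'x2': (-inf, 3.5)}
print(simplified)  # {'x1': (-inf, 0.5)}
```

In this toy case the condition on `x2` is dropped because the shorter rule covers one more correct example, which lowers the corrected error from 0.25 to about 0.17.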

The output of the algorithm is a single set of decision rules that is much simpler than the tree ensemble and has similar performance.

RuleCOSI is able to handle two types of tree ensembles: boosting and bagging. The Python library can work with several implementations of these ensemble types, such as Random Forests, XGBoost, CatBoost and LightGBM.

Example

Here I present an example using the UCI steel plates faults dataset. The dataset contains 27 indicators that approximately describe the geometric shape of the defect and its outline. The task is to classify the type of surface defect. Because RuleCOSI can only work with binary classification problems, for this example I considered only the dirtiness fault type.
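Turning the multi-class task into a binary one amounts to a one-vs-rest relabeling. A minimal sketch, assuming the fault type of each plate is available as a string (the five labels below are made up, not rows from the dataset):

```python
def one_vs_rest(labels, positive):
    """Map multi-class labels to 1 (positive class) or 0 (every other class)."""
    return [1 if label == positive else 0 for label in labels]


# Hypothetical fault labels for five plates.
faults = ["Bumps", "Dirtiness", "Stains", "Dirtiness", "Pastry"]
y = one_vs_rest(faults, positive="Dirtiness")
print(y)  # [0, 1, 0, 1, 0]
```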

Figure 3. Steel plate

The first step is to train a tree ensemble. In this case I trained an XGBoost ensemble with 50 trees. The F-measure of this model is 0.9958. The first 10 trees are shown in Figure 4.

Figure 4. First 10 trees of the XGBoost model for classifying dirtiness fault type in the steel plates faults dataset.

After applying RuleCOSI to the result, the simplified ruleset has just 7 rules, with an F-measure value of 0.9926. The generated rules are presented in Figure 5.

Figure 5. Combined and simplified ruleset obtained from the XGBoost tree ensemble from Figure 4.
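For readers who are not familiar with the F-measure used to compare the ensemble and the ruleset, it is the harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant computed on the positive class; the labels and predictions are made up and do not reproduce the values reported above.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """F-measure of the positive class (label 1); beta=1 gives the usual F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:  # no true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels and predictions for eight plates.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_ensemble = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical ensemble predictions
y_rules = [1, 1, 0, 0, 0, 0, 0, 1]     # hypothetical ruleset predictions

print(f_measure(y_true, y_ensemble))
print(round(f_measure(y_true, y_rules), 4))
```

Because the measure ignores true negatives, it is a common choice when the positive class is rare.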

Python library

The library was implemented as a Python package available on GitHub. The documentation of the library is available at this link.

Conclusions

Tree ensembles are widely used for improving classification performance in many domains, including fault detection using manufacturing data. However, the complexity of the ensembles makes them very hard for humans to interpret.

The results of RuleCOSI were satisfactory: it improved the interpretability of the tree ensemble while keeping classification performance close to that of the original model.

References

[1] Obregon, J., Kim, A., & Jung, J. Y. (2019). RuleCOSI: Combination and simplification of production rules from boosted decision trees for imbalanced classification. Expert Systems with Applications, 126, 64-82.