GPT2-small-spanish: a Language Model for Spanish text generation

Last year I was thrilled by all the articles about GPT-3 (Generative Pre-trained Transformer 3), a language model created by OpenAI with billions of parameters that is capable of generating human-like text. Because access to that model was hard to get, a colleague at INU in Korea and I decided to train a GPT-2 model and see its capabilities firsthand.

Because our mother tongue is Spanish, we decided to fine-tune a pre-trained English model on the Spanish Wikipedia. The process was not as simple as it looked on paper because of the complexity of the model. Luckily, we found a very nice article written by Pierre Guillou on how to fine-tune the GPT-2 model for Portuguese.

The result was a good learning experience in several NLP topics, such as tokenization, language generation, transformer models and the Hugging Face platform, an AI community that provides very useful libraries to build, train and deploy state-of-the-art NLP models.

The model was uploaded to the Hugging Face platform and so far (June 2021) has been downloaded more than 2,500 times. It is published under the user ‘datificate’, a personal side project my friend and I started to create machine learning content in Spanish.

The model can be used for different tasks, but here I show the result (in Spanish) of text generation based on the classic unicorn text used in the original presentation of the GPT-2 model. The notebook used to generate the results can be found in our GitHub repository.

>> Texto Generado 1
En un hallazgo impactante, el científico descubrió un rebaño de unicornios que vivían en un valle remoto, anteriormente inexplorado, en las montañas de los Andes. Aún más sorprendente para los investigadores fue el hecho de que los unicornios hablaban perfecto español. A su vez, el descubrimiento de la especie “N. scelanoides” había llevado al editor de ciencia de la Universidad de Toronto, William King, a la conclusión de que “nadie está de acuerdo en quién es el “Unicornio del Pacífico Norte del mundo”. En consecuencia, King, en su artículo de 1996 “La hipótesis de King y las dificultades de la hipótesis de King”, sugirió que el nombre del género “Aeolophora” no debe ser confundido con las lenguas de los indígenas de Norteamérica.

Translation (using Google Translate)

>> Generated text 1

In a shocking find, the scientist discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect Spanish. In turn, the discovery of the species “N. scelanoides” had led the science editor of the University of Toronto, William King, to the conclusion that “no one agrees on who is the “Unicorn of the North Pacific of the world”. Consequently, King, in his 1996 article “The King Hypothesis and the Difficulties of the King Hypothesis,” suggested that the genus name “Aeolophora” should not be confused with the languages of the Native Americans of North America.
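
For readers who want to try something similar, a sample like the one above could be produced with a short script along these lines. This is only a minimal sketch and not the notebook from our repository: the Hub id `datificate/gpt2-small-spanish` is assumed from the user and model names mentioned in this post, the prompt is assumed to be the first two sentences of the sample, and the sampling settings (`top_k`, `top_p`, `max_new_tokens`, the seed) are illustrative values rather than the ones we used.

```python
# Assumed Hub id: the user name "datificate" plus the model name from the title.
MODEL_ID = "datificate/gpt2-small-spanish"

# Assumed prompt: the first two sentences of the Spanish unicorn sample.
PROMPT = (
    "En un hallazgo impactante, el científico descubrió un rebaño de "
    "unicornios que vivían en un valle remoto, anteriormente inexplorado, "
    "en las montañas de los Andes. Aún más sorprendente para los "
    "investigadores fue el hecho de que los unicornios hablaban perfecto "
    "español."
)


def generate_samples(prompt=PROMPT, n_samples=1, max_new_tokens=100, seed=42):
    """Return n_samples sampled continuations of prompt as a list of strings."""
    # Imported here so the file can be loaded without transformers installed.
    from transformers import pipeline, set_seed

    set_seed(seed)  # makes the random sampling repeatable
    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # illustrative length limit
        do_sample=True,                 # sample instead of greedy decoding
        top_k=50,                       # illustrative sampling settings
        top_p=0.95,
        num_return_sequences=n_samples,
    )
    # Each output is a dict whose "generated_text" holds prompt + continuation.
    return [out["generated_text"] for out in outputs]
```

Calling `generate_samples(n_samples=3)` downloads the model weights on first use and returns three texts that each start with the prompt. Because the output is sampled, different seeds or settings will give different stories.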

The model needs more fine-tuning with other types of text (news articles, movie subtitles, etc.). However, the result was fun, and it was very exciting to learn about NLP tasks in the field of artificial intelligence.

