The National Information Processing Institute (OPI PIB) has developed two new statistical models of the Polish language. Both models currently top a ranking maintained by Allegro. One of the models was trained on the largest text corpus assembled in Poland.
The two Polish RoBERTa models are statistical representations of natural language created by means of machine learning. Because they are trained on large data sets, they capture the syntax and semantics of the Polish language precisely. The models developed and made available by OPI PIB will make it easier to build advanced language processing tools for Polish, for purposes such as text classification and detection of the emotions conveyed by a text.
The models use the BERT architecture presented by Google last year. The larger one, Polish RoBERTa large, was trained on 130 GB of data, while the smaller one, Polish RoBERTa base, was trained on a data set of 20 GB.
Users can choose between the two models depending on their needs and technical capacity: the larger one is more precise but requires more computing power, while the smaller one is faster but less accurate.
The models have been tested with the Comprehensive List for Language Evaluation (the KLEJ benchmark) developed by Allegro. The benchmark evaluates a model's performance on nine tasks, including sentiment analysis and analysis of the semantic similarity of texts. At the moment, our two models are ranked first and second in Allegro’s ranking.
“Unidirectional language models attempt to guess what the next word will be in a given text”, explains Sławomir Dadas from the Laboratory of Intelligent Information Systems at OPI PIB. “The BERT architecture, on the other hand, enables the model to learn a language on a slightly different principle: a few words are randomly removed from a sentence, and the model’s task is to learn the best way to fill in the blanks. If the model has access to a large text corpus, it is able to gradually learn more about the semantic relationships between words.”
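The fill-in-the-blanks principle described above can be illustrated with a minimal sketch in Python. It shows only how a training example might be prepared: some words are hidden at random, and the hidden words become the answers the model has to predict. The function name mask_tokens, the "[MASK]" placeholder, the whitespace tokenisation and the masking probability are assumptions made for illustration; they are not taken from OPI PIB's training code, and real BERT-style training operates on subword tokens with a more elaborate replacement scheme.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a hidden word


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide each token independently with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model should recover.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets


# Illustrative sentence, split on whitespace for simplicity.
sentence = "Ala ma kota a kot ma Alę".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(0))
print(masked)   # the input the model would see
print(targets)  # the words it would have to predict
```

During training, the model would be rewarded for predicting the words in targets from the context left visible in masked; repeating this over a large corpus is how it picks up the relationships between words.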
With a training corpus of 130 GB (equivalent to over 400,000 books), Polish RoBERTa large is currently the largest model trained in Poland.
Both models were created at OPI PIB’s Intelligent Information Systems Laboratory. In 2018, the Laboratory’s team responsible for the POL-on System of Information on Science and Higher Education was awarded the prestigious EUNIS Elite Award.