Machine learning models



Neural language models power services such as machine translation and chatbots, but most of them are designed specifically for English. To broaden the range of supported languages, experts at OPI PIB have made Polish-language models available free of charge to all programmers. Download them now and use them to your advantage!

Neural language models

RoBERTa

A set of Polish neural language models based on the Transformer architecture and trained with masked language modelling, following the techniques described in RoBERTa: A Robustly Optimized BERT Pretraining Approach. Two sizes of model are available: base and large. The base models contain approximately 100 million parameters and the large models approximately 350 million; the large models offer higher prediction quality in practical use, but require more computational resources. Large Polish text corpora (20–200 GB) were used to train the models. Each model is published in two variants, so it can be loaded in the popular machine learning libraries Fairseq and Hugging Face Transformers.

Fairseq models: base (version 1), base (version 2), large (version 1), large (version 2)

Hugging Face Transformers models: base (version 1), base (version 2), large (version 1), large (version 2)
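
The snippet below is a minimal sketch of loading one of the RoBERTa models with the Hugging Face Transformers library and filling in a masked token. The local path is a placeholder for the directory extracted from the downloaded archive, and the example sentence is illustrative only.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder path to the downloaded base model files.
tokenizer = AutoTokenizer.from_pretrained("path/to/roberta_base")
model = AutoModelForMaskedLM.from_pretrained("path/to/roberta_base")

# Predict the masked token in a Polish sentence.
text = f"Stolica Polski to {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))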

BART

A Transformer neural language model that uses the encoder–decoder architecture. The model was trained on a 200 GB set of Polish documents using the method described in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. The model can be adapted to predictive tasks, but it is primarily designed for sequence-to-sequence tasks in which both the input and the output are documents, for example machine translation or chatbots. The model is published in two variants, so it can be loaded in the popular machine learning libraries Fairseq and Hugging Face Transformers.

Download: Fairseq model, Hugging Face Transformers model
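
The following is a minimal sketch of running the BART model on a sequence-to-sequence task with Hugging Face Transformers. The local path and the input sentence are placeholders; generation parameters are illustrative.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path to the downloaded model files.
tokenizer = AutoTokenizer.from_pretrained("path/to/bart_base")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/bart_base")

# Encode an input document and generate an output sequence.
inputs = tokenizer("Przykładowy dokument wejściowy.", return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))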

GPT-2

A neural language model that is based on the Transformer architecture and trained using the autoregressive language modelling method. The neural network architecture follows that of the English-language GPT-2 models described in Language Models are Unsupervised Multitask Learners. OPI PIB offers the models in two sizes: medium, which contains approximately 350 million parameters, and large, which contains approximately 700 million parameters. The files are compatible with the Fairseq library.

Download: medium model, large model
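
A minimal sketch of loading the medium GPT-2 model with the Fairseq library and sampling a continuation is shown below. The directory and checkpoint file names are placeholders and may differ in the downloaded archive; depending on how the checkpoint was prepared, additional tokenizer or BPE arguments may be required.

from fairseq.models.transformer_lm import TransformerLanguageModel

# Placeholder directory containing the checkpoint and dictionary files.
model = TransformerLanguageModel.from_pretrained(
    "path/to/gpt2_medium",
    checkpoint_file="model.pt",  # assumed checkpoint file name
)
model.eval()

# Autoregressive generation from a Polish prompt (sampling settings are illustrative).
print(model.sample("Polska to kraj", sampling=True, sampling_topk=10, max_len_b=50))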

ELMo

A language model that is based on the long short-term memory (LSTM) recurrent neural networks presented in Deep contextualized word representations. The Polish-language model can be loaded with the AllenNLP library.

Download: model
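
Below is a minimal sketch of computing contextual embeddings with the Polish ELMo model in AllenNLP. The options and weights file names are placeholders for the files in the downloaded archive.

import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder file names for the downloaded model.
elmo = Elmo(
    options_file="path/to/options.json",
    weight_file="path/to/weights.hdf5",
    num_output_representations=1,
    dropout=0.0,
)

# Convert tokenized sentences to character ids and compute embeddings.
sentences = [["Ala", "ma", "kota"]]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)["elmo_representations"][0]
print(embeddings.shape)  # (batch, tokens, embedding dimension)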

Static representations of words

Word2Vec

Classic vector representations of words for the Polish language, trained using the method described in Distributed Representations of Words and Phrases and their Compositionality. A large corpus of Polish-language documents was used to train the vectors. The set contains approximately 2 million entries: words that appear at least three times in the corpus, as well as other defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are compatible with the Gensim library. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d
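
A minimal sketch of loading the 300-dimensional Word2Vec vectors with Gensim follows. The file name is a placeholder, and the exact loading call depends on whether the download is a natively saved Gensim model or a word2vec-format file.

from gensim.models import KeyedVectors

# Placeholder file name; assumes a natively saved Gensim KeyedVectors file.
vectors = KeyedVectors.load("path/to/word2vec_300d.model")

# Nearest neighbours of a Polish word in the vector space.
print(vectors.most_similar("kot", topn=5))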

GloVe

Vector representations of words for the Polish language trained using the GloVe method developed by experts at Stanford University. A large corpus of Polish-language documents was used to train the vectors. The set contains approximately 2 million entries: words that appear at least three times in the corpus, as well as other defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are saved in a text format that is compatible with various libraries designed for this type of model. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d
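
The GloVe vectors are distributed as plain text, so they can be read with Gensim as shown in this minimal sketch. The file name is a placeholder, and whether the file starts with a "count dimension" header line is an assumption to verify against the downloaded file.

from gensim.models import KeyedVectors

# Placeholder file name; no_header=True assumes a classic GloVe-style text file
# (requires Gensim 4 or later). Set no_header=False if a header line is present.
vectors = KeyedVectors.load_word2vec_format(
    "path/to/glove_300d.txt",
    binary=False,
    no_header=True,
)
print(vectors.most_similar("kot", topn=5))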

FastText

A model that contains vector representations of words and word parts (subwords) in the Polish language. Unlike traditional static word representations, the model can generate vectors for words that do not appear in its dictionary by summing the representations of their subword parts. The model was trained on a large corpus of Polish-language documents using the method described in Enriching Word Vectors with Subword Information. The set contains approximately 2 million entries: words that appear at least three times in the corpus, as well as other defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are compatible with the Gensim library. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d (part 1), 800d (part 2)
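
Below is a minimal sketch of loading the FastText model with Gensim and obtaining a vector for a word that is absent from the dictionary, built from its subword parts. The file name is a placeholder, and the loading call assumes a natively saved Gensim model.

from gensim.models import FastText

# Placeholder file name; assumes a natively saved Gensim FastText model.
model = FastText.load("path/to/fasttext_300d.model")

# A vector is produced even for an out-of-vocabulary word form.
print(model.wv["najkociejszy"][:5])
print(model.wv.most_similar("kot", topn=5))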

Machine translation models

Polish–English and English–Polish machine translation models based on convolutional neural networks and compatible with the Fairseq library. The models were trained on data available on the OPUS website, comprising a set of 40 million pairs of source and target language sentences.

Download: Polish–English model, English–Polish model
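
A minimal sketch of running the Polish–English model through the Fairseq hub interface is shown below. The directory and checkpoint file names are placeholders, and additional tokenizer or BPE arguments may be needed depending on how the downloaded checkpoint was prepared.

from fairseq.models.fconv import FConvModel

# Placeholder directory containing the checkpoint and dictionary files.
pl_en = FConvModel.from_pretrained(
    "path/to/pl-en",
    checkpoint_file="model.pt",  # assumed checkpoint file name
)
pl_en.eval()

# Translate a Polish sentence into English.
print(pl_en.translate("Modele językowe są bardzo przydatne."))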

Models for detecting signs of depression

The models are part of the winning solution in the Shared Task on Detecting Signs of Depression from Social Media Text, organized at the LT-EDI-ACL2022 conference. The task was to create a system that, given social media posts in English, detects the user's level of depression as not depression, moderate depression, or severe depression.
The solution consists of three models: two classification models and the DepRoBERTa (RoBERTa for Depression Detection) language model. The DepRoBERTa model was trained on a corpus of about 400,000 Reddit posts, mainly concerning depression, anxiety, and suicidal thoughts. The models have been made available in a way that allows them to be loaded in the popular machine learning library Hugging Face Transformers. More information about the competition and our solution can be found in the publication OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text using RoBERTa Pre-trained Language Models.

Models: DepRoBERTa, roberta-depression-detection, deproberta-depression-detection
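
The snippet below is a minimal sketch of classifying a post with one of the depression-detection models using the Hugging Face Transformers pipeline API. The model path is a placeholder for the downloaded model directory (or its Hugging Face Hub identifier), and the example post is illustrative only.

from transformers import pipeline

# Placeholder path or hub identifier of the classification model.
classifier = pipeline(
    "text-classification",
    model="path/to/deproberta-depression-detection",
)

post = "I can't sleep and nothing brings me joy anymore."
print(classifier(post))  # predicted label: not depression, moderate, or severe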