We share our scientific achievements

To ensure that the public has easy access to OPI’s scientific resources, the institute has implemented an open access policy for publications and research data that enables all users to explore OPI’s discoveries.

OPI supports the development of a society that benefits from scientific findings and the latest technological advancements. We share the results of our work for free.

We believe that this approach fosters innovation in Poland.

Key principles of the open access policy

Publications

We want our publications to be accessible to the public. The majority of our publications are available free of charge in the ‘Publications’ tab.

Publishing house

OPI publishes monographs and other hard-copy and electronic publications, which are available on our ‘Publishing house’ page.

Research data 

We guarantee open access to research data and other related metadata by: 

  • defining data usage rules
  • storing data in an electronic research repository
  • ensuring public access to data in line with the FAIR principles
  • making agreements with research team members and other data creators.

Scientific resources 

OPI’s scientific resources are also available in peer-reviewed journals, scholarly books and open repositories.

Where to find our tools and data

Machine learning models

Our experts have developed neural language models that are now available to all software developers. Neural language models power everyday online services, including machine translation and chatbots.

Although most neural language models are designed specifically for the English language, experts at OPI have also made Polish models available free of charge to the public. Models such as Qra are adapted to understand the Polish language and to generate text in Polish.

Download them now and use them to your advantage.

The models are available on the OPI PIB GitHub page.

Neural language models

Qra

AI Lab and the Gdańsk University of Technology have developed Polish generative neural language models that are based on the Llama 2 model and have been trained on one terabyte of exclusively Polish text data. Qra is the first modern generative model to be pretrained on such a large Polish text corpus. There are three Qra models of varying sizes: Qra 1B, Qra 7B and Qra 13B. Qra 7B and Qra 13B achieve significantly better perplexity than the original Llama 2 models, demonstrating superior modelling of the comprehension, lexis and grammar of the Polish language.

Download: Qra
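
As a minimal sketch, a Qra checkpoint released in the Hugging Face Transformers format could be loaded as follows. The identifier "OPI-PG/Qra-1b" is an assumption; substitute the name given on the official download page.

    # Minimal sketch: generating Polish text with a Qra checkpoint.
    # "OPI-PG/Qra-1b" is an assumed identifier; check the release page.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "OPI-PG/Qra-1b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "Najważniejszym celem polskiej nauki jest"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))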

RoBERTa

A set of Polish neural language models that rely on the Transformer architecture and were trained using masked language modelling (MLM) and the techniques described in RoBERTa: A Robustly Optimized BERT Pretraining Approach. Two model sizes are available: base and large. The base models are neural networks with approximately 100 million parameters; the large models contain 350 million. The large models offer higher prediction quality in practical use, but require more computational resources. Large Polish text corpora (20–200 GB) were used to train the models. Each model comes in two variants: one compatible with the Fairseq library and one with the Hugging Face Transformers library.

Fairseq models: base (version 1), base (version 2), large (version 1), large (version 2)

Hugging Face Transformers models: base (version 1), base (version 2), large (version 1), large (version 2)
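
For illustration, the Hugging Face Transformers variant of a base model could be used for masked-word prediction as sketched below; "path/to/polish-roberta-base" is a placeholder for the unpacked download.

    # Minimal sketch: masked-word prediction with a Polish RoBERTa model.
    # "path/to/polish-roberta-base" is a placeholder for the local checkpoint.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="path/to/polish-roberta-base")
    for prediction in fill("Warszawa to <mask> Polski."):
        print(prediction["token_str"], round(prediction["score"], 3))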

BART

A Transformer neural language model that utilises an encoder-decoder architecture. BART was trained on a corpus of Polish documents exceeding 200 GB using the method described in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. The model can be adapted to predictive tasks, but is designed primarily for sequence-to-sequence tasks, such as machine translation and chatbot responses, in which documents serve as both the input and the output. The model comes in two variants: one compatible with the Fairseq library and one with the Hugging Face Transformers library.

Download: Fairseq model, Hugging Face Transformers model
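
A minimal sketch of loading the Hugging Face Transformers variant is shown below; "path/to/polish-bart" is a placeholder for the unpacked download, and the model would normally be fine-tuned before use in a concrete sequence-to-sequence task.

    # Minimal sketch: loading the Polish BART checkpoint for fine-tuning.
    # "path/to/polish-bart" is a placeholder for the local checkpoint.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    path = "path/to/polish-bart"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSeq2SeqLM.from_pretrained(path)
    print(model.num_parameters())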

GPT-2

A neural language model that is based on the Transformer architecture and trained using the autoregressive language modelling method. The neural network architecture follows that of the English GPT-2 models described in Language Models are Unsupervised Multitask Learners. OPI offers the model in two sizes: medium, which contains approximately 350 million parameters, and large, which contains approximately 700 million. The files are compatible with the Fairseq library.

Download: medium model, large model
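
As a rough sketch, the downloaded checkpoint could be sampled from with the Fairseq hub interface; the directory and file names below are placeholders for the contents of the archive.

    # Minimal sketch: sampling from the Fairseq GPT-2 checkpoint.
    # Directory and checkpoint names are placeholders.
    from fairseq.models.transformer_lm import TransformerLanguageModel

    lm = TransformerLanguageModel.from_pretrained(
        "polish-gpt2-medium",        # placeholder: unpacked model directory
        checkpoint_file="model.pt",  # placeholder: checkpoint file name
    )
    lm.eval()
    print(lm.sample("Polska nauka", sampling=True, temperature=0.8))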

ELMo

A language model that is based on the long short-term memory (LSTM) recurrent neural networks presented in Deep contextualized word representations. The Polish language model is compatible with the AllenNLP library.

Download: model
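
A minimal sketch of computing contextual embeddings with AllenNLP is shown below; the options and weights file names are placeholders for the files in the downloaded archive.

    # Minimal sketch: contextual embeddings with the AllenNLP ElmoEmbedder.
    # File names are placeholders for the downloaded model files.
    from allennlp.commands.elmo import ElmoEmbedder

    elmo = ElmoEmbedder(options_file="options.json", weight_file="weights.hdf5")
    layers = elmo.embed_sentence(["Ala", "ma", "kota"])
    print(layers.shape)  # (layers, tokens, embedding dimensions)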

Static representations of words

Word2Vec

Classic vector representations of words for the Polish language, trained using the method described in Distributed Representations of Words and Phrases and their Compositionality. A large corpus of Polish-language documents was used to train the vectors. The set contains approximately 2 million entries: all words that appear at least three times in the corpus, as well as defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are compatible with the Gensim library. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d
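
For illustration, the vectors can be queried with Gensim as sketched below; the file name is a placeholder, and depending on the distributed format, KeyedVectors.load_word2vec_format may be needed instead.

    # Minimal sketch: nearest-neighbour queries on the Word2Vec vectors.
    # The file name is a placeholder for the downloaded model.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load("word2vec_300d.model")
    print(vectors.most_similar("nauka", topn=5))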

GloVe

Vector representations of words for the Polish language, trained using the GloVe method developed at Stanford University. A large corpus of Polish-language documents was used to train the vectors. The set contains approximately 2 million entries: all words that appear at least three times in the corpus, as well as defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are saved in a text format that is compatible with various libraries designed for this type of model. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d
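
Because the vectors are distributed as plain text, they can be loaded with Gensim as sketched below; the file name is a placeholder, and no_header=True is assumed because classic GloVe text files carry no header line.

    # Minimal sketch: loading GloVe vectors from their text format with Gensim.
    # The file name is a placeholder for the downloaded model.
    from gensim.models import KeyedVectors

    glove = KeyedVectors.load_word2vec_format(
        "glove_300d.txt", binary=False, no_header=True
    )
    print(glove.most_similar("uniwersytet", topn=5))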

FastText

A model that contains vector representations of words and word parts in the Polish language. Unlike traditional static word representations, the model can generate vectors for words that do not appear in its dictionary by summing the representations of their subword parts. The model was trained on a large corpus of Polish-language documents using the method described in Enriching Word Vectors with Subword Information. The set contains approximately 2 million entries: all words that appear at least three times in the corpus, as well as defined symbol categories such as punctuation marks, numbers from 0 to 10,000, and Polish forenames and surnames. The vectors are compatible with the Gensim library. The vectors offered by OPI PIB range from 100-dimensional to 800-dimensional.

Download: 100d, 300d, 500d, 800d (part 1), 800d (part 2)
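
The subword mechanism can be seen in the sketch below, which queries a word that need not appear in the dictionary; the file name is a placeholder for the downloaded model.

    # Minimal sketch: out-of-vocabulary lookup with the Gensim FastText model.
    # The file name is a placeholder for the downloaded model.
    from gensim.models import FastText

    model = FastText.load("fasttext_300d.model")
    # Subword information lets the model build a vector even for unseen words:
    print(model.wv["najinnowacyjniejszy"][:5])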

Machine translation models

Polish-English and English-Polish machine translation models that are based on convolutional neural networks and compatible with the Fairseq library. The models were trained on data available on the OPUS website: a set of 40 million pairs of source and target language sentences.

Download: Polish-English model, English-Polish model
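
A minimal sketch of translating with the Fairseq hub interface is shown below; the directory and checkpoint names are placeholders, and tokenizer or BPE settings may need to match the files shipped with the download.

    # Minimal sketch: Polish-English translation with the Fairseq model.
    # Directory and checkpoint names are placeholders.
    from fairseq.models.fconv import FConvModel

    pl2en = FConvModel.from_pretrained(
        "polish-english",            # placeholder: unpacked model directory
        checkpoint_file="model.pt",  # placeholder: checkpoint file name
    )
    print(pl2en.translate("Nauka nie ma granic."))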

Models for detecting signs of depression

The models are part of the winning solution in the Shared Task on Detecting Signs of Depression from Social Media Text competition, which was organised at the LT-EDI-ACL2022 conference. Competitors were tasked with creating a system capable of determining three levels of user depression (no depression, moderate depression and severe depression) based on their social media posts in English. OPI’s solution consisted of three models: two classification models and the DepRoBERTa (RoBERTa for Depression Detection) language model. DepRoBERTa was prepared using a corpus of approximately 400,000 Reddit posts, mainly concerning depression, anxiety and suicidal thoughts. The models are compatible with the popular Hugging Face Transformers machine learning library. For more information on the competition and on OPI’s solution, see OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text using RoBERTa Pre-trained Language Models.

Models: DepRoBERTa, roberta-depression-detection, deproberta-depression-detection
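
For illustration, one of the classification models could be applied as sketched below; the model path is a placeholder for a downloaded checkpoint, and the example text is invented.

    # Minimal sketch: scoring a post with a depression-detection classifier.
    # The model path is a placeholder for the local checkpoint.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="path/to/deproberta-depression-detection",
    )
    print(classifier("I haven't slept properly in weeks and nothing helps."))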

Natural language processing toolkit 

We invite programmers to use our natural language processing toolkit. Discover the OPI Toolkit for NLP.

The toolkit exposes a REST API and integrates four language models. OPI’s API enables its users to train and test their own programmes based on natural language processing (NLP) solutions.

The toolkit is simple, compact and ready to use. It saves users the time that would otherwise be required to configure multiple language models. Users can combine preset components to create their own, more advanced solutions and applications quickly and seamlessly.

The OPI Toolkit for NLP is:

  • multilingual—users can analyse documents written in Polish, English, German or French
  • ready to use—users can prototype and develop their own solutions
  • compact—users can spend more time solving real problems instead of configuring and implementing basic NLP functionalities.

The OPI Toolkit for NLP is available at the Inventorum website.
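
The sketch below is purely illustrative of how a REST-based toolkit might be called; the URL, route and payload shape are assumptions, not the documented API.

    # Purely illustrative sketch of a REST call; the endpoint and payload are
    # assumptions, not the toolkit's documented API.
    import requests

    response = requests.post(
        "https://toolkit.example/api/analyze",  # placeholder endpoint
        json={"text": "Maria Skłodowska-Curie urodziła się w Warszawie.",
              "lang": "pl"},
    )
    print(response.json())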

Scientific datasets

OPI believes that the development of Polish science is paramount. That is why the institute has made its scientific datasets available to all researchers. Scholars publish and provide access to raw source data and partially processed data that lays the groundwork for future research. The data pertains to various OPI PIB research projects.

Downloadable data

We have made the following data available: 

  • data on information extraction from HTML documents Download [6.9 MB]
  • data on information extraction from emergency and firefighting reports Download [298.7 kB]
  • data on the classification of commercial websites using various machine learning methods to identify innovative firms Download [983.1 kB]
  • data from publications on the classification of text documents Download [187.7 kB]
  • a database of mpMRI scans for prostate cancer diagnosis Download [74.1 GB]

Home after War in VR

Home after War is a free VR application that is available on the Oculus Store. In it, you are introduced to Ahmaid, who has experienced violence in the Middle East at the hands of ISIS.

Ahmaid shows you around his home, whose details have been meticulously recreated from scans of the real building. He tells his story and explains the consequences of his return after the defeat of ISIS.

The Polish language version of Home after War was prepared by OPI with the intention of immersing Polish-speaking users in this unique experience. The release also helps experts at OPI to conduct research on the impact of VR on empathy.

The NAVOICA education platform source code

OPI supports lifelong learning. The institute has made the NAVOICA platform source code available to the public. NAVOICA is a learning management system that is used to develop scalable MOOC e-learning websites; it enables courses to be created and delivered asynchronously to any number of participants.

NAVOICA is a modified version of the Open edX platform. The tool is popular both in Poland and abroad. We hope that making the source code available to the public will help in the creation of new professional educational platforms.


Download: GitHub – OPI-PIB/navoica-platform