The launch of the first Polish open large language model: PLLuM

Six leading scientific units with expertise in artificial intelligence (AI), natural language processing and corpus linguistics are collaborating to develop the Polish Large Language Model (PLLuM) and an associated smart assistant, both of which are trained predominantly on Polish text.

Large language models (LLMs) have already captivated the world for more than a year with their unparalleled artificial intelligence capabilities. While undeniably impressive, the most powerful models like ChatGPT and Google Bard come with constraints, including payment requirements, restricted access and a limited corpus of Polish training text.

What would happen if an open, free model trained primarily on Polish text were developed? Such a feat requires expertise and capabilities that surpass the confines of individual scientific institutions, as well as massive computational power and extensive datasets of suitable diversity and quality.

To overcome these difficulties, on 29 November 2023, the eve of the first anniversary of ChatGPT, six leading institutions in Poland that operate in AI and linguistics – Wrocław University of Science and Technology (the consortium leader), the National Information Processing Institute (OPI), the NASK National Research Institute, the Institute of Computer Science of the Polish Academy of Sciences, the University of Lodz, and the Institute of Slavic Studies of the Polish Academy of Sciences – joined forces to establish the Polish Large Language Universal Model consortium.

The consortium’s goal is to develop the first open Polish LLM and an associated smart assistant. The project will adhere to ethical and responsible best practices in AI, incorporating data representativeness, transparency and fairness.

‘Developed by leading research institutions in collaboration with the public sector, our transparent and fully accessible open model will be a global innovation. The project combines access to data, expertise, technical resources and know-how from scientific and government institutions that stand united in the common goal of supporting science, the economy and the competitiveness of Polish enterprise,’ says Wojciech Pawlak, head of the NASK National Research Institute.

Although open LLMs exist, none of them are trained on datasets that are representative of the Polish language. Models that rely on an insignificant share of Polish text in the training process and merely adjust themselves to the needs of the Polish language have proved unsuitable for most commercial applications. PLLuM aims to support Polish enterprise by facilitating access to extended Polish-language models under free, open-source licenses that are designed to meet this demand.

‘OPI is thrilled to join the PLLuM consortium, bringing its years of expertise in the advancement of natural language processing tools to the collaborative effort. Our mutual interest is the development of the IT sector and the scientific community in Poland. It is vital that new IT tools be designed and made available to the public free of charge. OPI developed the Polish RoBERTa large model, which is the top Polish-language representative model according to the KLEJ Benchmark. We are happy to share our knowledge and experience in the design of the PLLuM model. We need models that are trained on Polish language text. They are indispensable in the analysis of the Polish internet,’ says Dr Jarosław Protasiewicz, head of OPI.

‘LLMs have become universal, essential engines for natural language processing, but their creation and training remain beyond the capabilities of Polish enterprise. It is vital to develop an open Polish LLM that relies on existing AI infrastructure, such as that available at the Wrocław University of Science and Technology. That model must support the science sector, as well as small and medium IT and AI enterprises, which play pivotal roles in driving the Polish economy,’ explains Prof Maciej Piasecki, project manager at the Wrocław University of Science and Technology, the consortium leader.

An open model not only ensures access to the research object, but also creates an opportunity to develop and test the model’s explainability methods and to explore the depths of the black box. What factors contribute to the models generating responses in such a persuasive manner? Why do they ‘hallucinate’ by delivering untrue replies – even when the facts and names in their training sets are correct? How are the models affected by increased numbers of sets and parameters, and by training with humans? How can prompts be formulated to achieve the intended results?

‘The PLLuM model will stimulate science development in Poland not only in AI, but also in explainable artificial intelligence. That aspect should be our focus, considering that critical analysis is as essential as the ever-growing capabilities of AI. Poland has a chance to emerge as a global leader in the field,’ says Dr Inez Okulska, head of the Department of Language Engineering and Text Analysis at NASK.

A considerably larger share of text that is originally written in Polish and contains information on Polish science, art, history, law and economics will promote our language and culture, which are marginalised in the existing models.

PLLuM is designed to address the needs not only of scientists and enterprises, but also Polish citizens. The model aims to deliver innovative solutions that benefit Polish society directly. One of them is a Polish-language smart assistant that will facilitate access to public services, both electronically and in person. By enabling the formulation of queries in natural language, the assistant caters to the needs of those considered digitally excluded. And that is merely the beginning. The collaborative efforts of the Polish research, business and public sectors have the potential to unlock endless possibilities.