We harness mathematics and statistics to extract knowledge from text. An interview with Marek Kozłowski, PhD, the head of the Natural Language Processing Laboratory at the National Information Processing Institute, conducted by Michał Rolecki.
Marek Kozłowski holds a PhD in information technology and is the head of the Natural Language Processing Laboratory at the National Information Processing Institute, where he leads a team of 40 researchers and programmers who develop software, such as the Uniform Anti-plagiarism System and semantic search engines, using intelligent methods of data processing.
His passions include natural language processing, data exploration and machine learning. He is the author of over 30 scientific publications on semantic text processing and machine learning.
Michał Rolecki: A layman would say that engineering is mainly about constructing roads and bridges. What is linguistic engineering?
Marek Kozłowski: Linguistic engineering is a field of study focused on automating the processing of natural language data or, put simply, text data.
I admit the name may be a bit confusing for people who are not familiar with that area of study. You could say it is the closest and most succinct translation of the English names given to the field, which was created and first developed by Anglo-Saxons. In English, two names are used: natural language processing and computational linguistics. We wanted the Polish name to be short and to reflect a programming and engineering approach. Hence “linguistic engineering”.
In that way, we can dissociate ourselves from the linguistically oriented current in the field, i.e. formal language analysis (in-depth morphological and syntactic analysis of speech), and concentrate all our efforts on using mathematics or, to put it more simply, statistics to extract knowledge from texts.
What exactly do you do?
Our main tasks are to search text repositories, to group texts in order to build cohesive thematic representations, to assign texts to predetermined categories (e.g. to classify e-mails as spam), to extract information from texts (e.g. to detect and identify proper names, or to identify specific facts and arguments), to translate texts with machine translation (MT) tools, and to prepare summaries and abstracts.
You must remember that, in a broader sense, there are two main research currents in computational linguistics. On the one hand, there is the formal description of a language and its rules; on the other hand, there are statistical methods that set aside the problem of understanding a text in favor of analyzing the frequency of specific correlations. We have decided to adopt the latter approach, and that is why the term “engineering” has been used in the Polish name.
Summing up, we operate in an interdisciplinary field dealing with statistical or rule-based natural language modeling from a computational perspective, and with the analysis of computational approaches to the broad task of extracting knowledge and information from texts.
And what are you doing at the moment?
Our laboratory has become a highly interdisciplinary entity.
We have a research unit that is responsible for strictly scientific matters. These include smart methods for detecting plagiarism in university theses, chatbots, prediction of election polling numbers based on text messages on public forums, methods for detecting unauthorized practices based on traffic logs, and automatic assessment methods (e.g. of shoes based on pictures only).
We also have operational units that handle development and maintenance. They comprise programming teams responsible for developing and maintaining existing systems, including JSA (Uniform Anti-plagiarism System), PBN (Polish Scholarly Bibliography), SEDN (System for Evaluation of Scientific Achievements) and ELA (Polish Graduate Tracking System). On top of that, we have a team of testers, a team of analysts, and a team of administrators for application support and laboratory services (DevOps).
Do you employ programmers only?
Programmers constitute the majority of our employees, but there are also testers, analysts and administrators. The most important thing is that, apart from strictly operational activities, we also have time to conduct research in areas such as text processing and image processing, the latter of which we have recently become interested in.
What does one have to do to work for you?
Follow the recruitment process. We have a few vacancies, including posts for a back-end developer (mainly a Java-oriented stack), a front-end developer (mainly using Angular or Vue), and an administrator in the DevOps team.
What can you do with language analysis?
Surprisingly, you can do a lot of things.
First, you can detect plagiarism. The Uniform Anti-plagiarism System designed by our specialists was launched at the beginning of 2019. It helps to determine to what extent a thesis is the independent work of its author. Our algorithm is immune to changes in word order within sentences, letter case, and punctuation. It cleans the analyzed text, builds unambiguous sets of words (single tokens) from it, and then works on those sets. Details of how it works in practice were presented in an interview for the portal sztucznainteligencja.org.pl, published by the National Information Processing Institute.
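The idea of comparing cleaned sets of tokens can be illustrated with a minimal Python sketch. This is not the JSA algorithm, whose details are not given here; the function names `token_set` and `jaccard` are hypothetical, and Jaccard overlap is simply one common way to compare two sets. It only shows why a set-based comparison ignores word order, letter case, and punctuation.

```python
import string


def token_set(text):
    """Lowercase the text, drop ASCII punctuation, and return its set of tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return frozenset(cleaned.split())


def jaccard(a, b):
    """Share of tokens that two texts have in common (0.0 to 1.0)."""
    set_a, set_b = token_set(a), token_set(b)
    if not set_a and not set_b:
        return 0.0  # two empty texts: treat as no evidence of overlap
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "The cat sat, quietly."
    reordered = "quietly SAT the cat"
    print(jaccard(original, reordered))
```

Because both sentences reduce to the same four tokens, the score is the same as for two identical texts. A real system would also have to handle non-ASCII punctuation, inflection, and synonyms, which this sketch does not attempt.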
With machine language analysis methods you can predict people’s behavioral profiles, which means you can treat language as a kind of fingerprint (which I explained in the article “What is gonna betray you online”). Over the course of our lives, each of us develops a characteristic style of writing. People can tell whether two texts were written by the same person or by different people based on intuition. A machine, however, can compare certain characteristics of a text over time, provided it obtains some input data. Knowing those characteristics, it creates vectors, which may also be described as ordered sets of features. They describe a person’s profile, and subsequent texts are compared with that profile. We call this stylometric behavioral profiling, on the basis of which a profile of a specific person can be created. To be more precise, we can identify how a person uses written language. Of course, there is one condition: we need a sufficient number of texts written by that person.
Together with Antoni Sobkowicz, we have worked in our laboratory for several years on predictive methods capable of estimating final political preferences based on the emotional language used in online political debates. Our algorithms proved more precise in predicting the outcome of the parliamentary elections than the last pre-election public opinion poll.
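A rough Python sketch of what such a feature vector and profile comparison might look like follows. The features chosen here (relative frequencies of a few English function words plus a scaled average word length), the `FUNCTION_WORDS` list, and the use of cosine similarity are all assumptions made for illustration; the interview does not say which features or similarity measure the laboratory uses.

```python
import math
import re

# Hypothetical feature list; real stylometric systems use far richer features.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but", "which")


def style_vector(text):
    """Turn a text into an ordered list of simple style features."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero for empty texts
    vector = [words.count(w) / total for w in FUNCTION_WORDS]
    # Average word length, scaled down so it does not dominate the vector.
    vector.append(sum(len(w) for w in words) / total / 10)
    return vector


def build_profile(texts):
    """Average the style vectors of several texts by the same author."""
    vectors = [style_vector(t) for t in texts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    known = ["The report that we wrote is in the archive.",
             "But the results of the study were sent to the board."]
    profile = build_profile(known)
    print(cosine(profile, style_vector("The summary of the talk is in the notes.")))
```

A new text whose vector lies close to the stored profile is more likely to come from the same author, although with so few features and so little text the score says very little; this is why a sufficient number of texts per person is needed.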
How is that possible?
Mostly thanks to the large amount of data that the algorithms had at their disposal. Pre-election polls are based on the answers of thousands of respondents, or even tens of thousands in the case of polls published on election day. The methods we used analyze several million comments left below political articles on internet portals. The comments express the views of several hundred thousand to as many as one million users. None of the traditional survey modes, whether PAPI, CAWI or CATI, can reach that number of people.
Today, a considerable share of comments on political issues is posted on public online forums. We used the comments available on popular online news websites containing a large number of political articles. At first, we needed a set of strong emotional expressions supporting one of two dominant political wings. We applied machine learning to precisely ascribe the comments to specific political images. The last thing we had to do was to calculate how many people out of all commenting users were in favor of each political wing, similarly to how it is done in polls. The sample size is measured in millions, which compensates for its potentially inferior quality.
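The three steps described above (seed expressions, labeling comments, counting users) can be sketched in Python as follows. This is a simplified illustration, not the laboratory's method: the machine learning step is replaced here by plain matching against a seed lexicon, and the lexicon entries, the wing names, and the function names `label_comment` and `user_shares` are hypothetical placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical seed expressions for two wings (placeholders, not real data).
SEED = {
    "wing_a": {"bravo_a", "go_a"},
    "wing_b": {"bravo_b", "go_b"},
}


def label_comment(text):
    """Return the wing whose seed expressions dominate, or None if unclear."""
    tokens = set(text.lower().split())
    hits = {wing: len(tokens & expressions) for wing, expressions in SEED.items()}
    best = max(hits, key=hits.get)
    # No match at all, or a tie between wings: leave the comment unlabeled.
    if hits[best] == 0 or list(hits.values()).count(hits[best]) > 1:
        return None
    return best


def user_shares(comments):
    """Count each user once, by majority label, and return the share per wing."""
    votes = defaultdict(Counter)
    for user, text in comments:
        label = label_comment(text)
        if label:
            votes[user][label] += 1
    per_user = Counter()
    for counts in votes.values():
        (top, n), *rest = counts.most_common()
        if not rest or rest[0][1] < n:  # skip users with tied labels
            per_user[top] += 1
    total = sum(per_user.values())
    return {wing: per_user[wing] / total for wing in SEED} if total else {}


if __name__ == "__main__":
    sample = [("u1", "go_a now"), ("u1", "bravo_a"), ("u2", "go_b"),
              ("u3", "no opinion"), ("u4", "go_a go_b")]
    print(user_shares(sample))
```

Counting users rather than comments mirrors the way a poll counts respondents: a single very active commenter contributes one "vote", not hundreds.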
How long did you work on that?
That research method was tested for several years. In the months directly preceding election day (i.e. from July 2019), we ran it regularly every ten days. The predictive results of our algorithms were compared with public polling results based on standard surveying methods and published by different public opinion polling institutes. The two were very similar. However, our algorithm proved more precise in predicting the results of last year’s elections than traditional polling methods.
But that is nothing new. Political sentiments inferred from emotions in publicly available comments have been analyzed around the world for many years. Yet, in my opinion, the National Information Processing Institute is the only Polish research center that regularly analyzes Poles’ political preferences on broadly understood internet portals using artificial intelligence methods.
You operate on a large scale.
Yes, we do research in various areas. Language analysis makes it possible to create search engines based on word sense induction and statistical machine translation focused on a certain field, to classify documents within a certain field, or to discover arguments in law-related texts. It also allows for sentiment analysis, i.e. detecting emotions hidden in a text.
Finally, the tools for statistical machine language analysis help to create recommendation systems, from music to fields of study to universities, as well as tools to summarize texts. We deal with all of those areas in our laboratory.
Do you think that one day it will be impossible to tell the difference between what is spoken or written by a machine and by a human?
We have already reached that stage. It is said that there are chatbots that are very close to passing the Turing test (proposed to differentiate a machine from a human). Some even claim that they have passed it under very specific conditions. I think that in about five years most people will not be able to tell whether they are talking to a human or to a chatbot, provided they talk about specific subjects such as placing orders, handling tax issues or doing shopping.