Categorization of Multilingual Scientific Documents by a Compound Classification System

Jarosław Protasiewicz, Marcin Mirończuk, Sławomir Dadas

2017. In: Artificial Intelligence and Soft Computing 2017, Part II / Leszek Rutkowski, Marcin Korytkowski, Rafał Scherer, Ryszard Tadeusiewicz, Lotfi A. Zadeh, Jacek M. Żurada; Cham: Springer, pp. 563-573

The 16th International Conference on Artificial Intelligence and Soft Computing. Zakopane, 2017-06-11 - 2017-06-15

The aim of this study was to propose a classification method for documents that simultaneously contain text parts in several languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers assigns to each document, or to its parts, the probabilities of belonging to each possible category. These models are trained using the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier, which receives as its inputs the probabilities produced by the middle-tier classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at the same time. The system can therefore be useful for automatically organizing multilingual publications or other documents.
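
The three tiers described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a bag-of-words representation, uses only Multinomial Naïve Bayes in the middle tier (the Long Short-Term Memory models are omitted), restricts the decision module to two categories, and trains the decision module on the same toy documents as the base classifiers, which a real system would avoid by using held-out predictions. All class names, function names, and training texts below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document has an English and a Polish part.
TRAIN = [
    ({"en": "neural network training algorithm",
      "pl": "model neuronowy algorytm uczenie"}, "cs"),
    ({"en": "software compiler algorithm",
      "pl": "oprogramowanie kompilator algorytm"}, "cs"),
    ({"en": "cell protein gene", "pl": "gen enzym organizm"}, "bio"),
    ({"en": "gene expression tissue", "pl": "ekspresja gen tkanka"}, "bio"),
]
LANGS = ("en", "pl")


class TinyMultinomialNB:
    """Middle tier: multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d.split()}
        self.prior = Counter(labels)
        self.n_docs = len(docs)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.split())
        return self

    def predict_proba(self, doc):
        log_probs = []
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / self.n_docs)
            for w in doc.split():
                if w in self.vocab:  # out-of-vocabulary words are ignored
                    lp += math.log((self.counts[c][w] + 1) / total)
            log_probs.append(lp)
        # Normalize in a numerically stable way.
        m = max(log_probs)
        exps = [math.exp(lp - m) for lp in log_probs]
        s = sum(exps)
        return [e / s for e in exps]


class TinyLogisticRegression:
    """Last tier: binary logistic regression trained by plain SGD."""

    def fit(self, X, y, lr=0.5, epochs=500):
        self.w = [0.0] * len(X[0])
        self.b = 0.0
        for _ in range(epochs):
            for x, t in zip(X, y):
                g = self._prob(x) - t  # gradient of the log-loss w.r.t. z
                self.w = [wi - lr * g * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * g
        return self

    def _prob(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self._prob(x) >= 0.5 else 0


# Middle tier: one monolingual classifier per language.
labels = [y for _, y in TRAIN]
base = {
    lang: TinyMultinomialNB().fit([d.get(lang, "") for d, _ in TRAIN], labels)
    for lang in LANGS
}
CLASSES = base["en"].classes


def features(doc):
    """Concatenate the class probabilities produced for each language part."""
    feats = []
    for lang in LANGS:
        feats.extend(base[lang].predict_proba(doc.get(lang, "")))
    return feats


# Last tier: the decision module sees only the probabilities, not the text.
meta = TinyLogisticRegression().fit(
    [features(d) for d, _ in TRAIN], [CLASSES.index(y) for y in labels]
)


def classify(doc):
    return CLASSES[meta.predict(features(doc))]


print(classify({"en": "gene tissue analysis", "pl": "enzym organizm"}))
```

In this sketch, a document that lacks one language part still yields a full feature vector, because the corresponding classifier falls back to its class priors for an empty text. How the original system treats missing parts is not stated in the abstract.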