Numbers Normalisation in the Inflected Languages: a Case Study of Polish

Rafał Poświata, Michał Perełkiewicz

2019. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 23–28, Florence, Italy, 2 August 2019 / Tomaž Erjavec, Michał Marcinczuk, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber; Association for Computational Linguistics

7th Workshop on Balto-Slavic Natural Language Processing. Florence, 2019-08-02

Text normalisation in Text-to-Speech systems is the process of converting written expressions to their spoken forms. This task is complicated because in many cases the normalised form depends on the context. Furthermore, for languages such as Croatian, Lithuanian, Polish, Russian or Slovak, there is an additional difficulty related to their inflected nature. In this paper we show how to deal with this problem for one of these languages, Polish, without a large dedicated data set and using solutions prepared for other NLP tasks. We limit our study to number expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of a morphological tagger and a transducer, supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on part of the 1-million-word subset of the National Corpus of Polish. The accuracy of the described approach is compared with a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.
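
To make the role of inflection concrete, the sketch below shows one way a dictionary of spoken forms could be consulted once a grammatical case is known for a number token. It is an illustration only, not the authors' implementation: the `SPOKEN_FORMS` table, the `normalise` function, and the case labels `nom`, `gen` and `inst` are all assumptions introduced here, and the case is supplied by hand where a real system would obtain it from a morphological tagger.

```python
# Hypothetical dictionary of spoken forms, keyed by (digits, grammatical case).
# Only a handful of entries are listed; a real dictionary would also need
# gender and many more numbers.
SPOKEN_FORMS = {
    ("2", "nom"): "dwa",
    ("2", "gen"): "dwóch",
    ("2", "inst"): "dwoma",
    ("5", "nom"): "pięć",
    ("5", "gen"): "pięciu",
    ("5", "inst"): "pięcioma",
}


def normalise(tagged_tokens):
    """Replace digit tokens with a spoken form chosen by their case tag.

    `tagged_tokens` is a list of (token, case) pairs. The case is assumed
    to come from an upstream tagger; here it is passed in directly.
    Digit tokens with no dictionary entry are left unchanged.
    """
    out = []
    for token, case in tagged_tokens:
        if token.isdigit():
            out.append(SPOKEN_FORMS.get((token, case), token))
        else:
            out.append(token)
    return " ".join(out)


# "bez 5 złotych" (without five zlotys): the numeral is in the genitive.
print(normalise([("bez", None), ("5", "gen"), ("złotych", "gen")]))
# "z 2 kotami" (with two cats): the numeral is in the instrumental.
print(normalise([("z", None), ("2", "inst"), ("kotami", "inst")]))
```

The same written token "5" is read as "pięć", "pięciu" or "pięcioma" depending on its case, which is why a context-free lookup is not sufficient for an inflected language.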