High-Quality LLM Pre-Training Texts from Dictionary Data
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in proceedings |
| Conference | Recent Advances in Slavonic Natural Language Processing, RASLAN 2025 |
| Faculty / MU workplace | |
| Citation | |
| www | Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2025. |
| Keywords | large language models; LLMs; pre-training; high-quality data; dictionaries; dictionary entries; Slama models; Czech |
| Attached files | |
| Description | The quality of pre-training texts is an important factor in the development of a Large Language Model (LLM). High-quality data, such as collections of textbooks, academic papers, and educational forums, has been shown to improve model performance and generalization and to reduce bias. However, obtaining such data at scale can be challenging, especially for non-mainstream languages such as Czech. In this paper, we introduce a method for generating high-quality Czech pre-training data from structured dictionary resources. By employing retrieval-augmented prompting and open-source LLMs, we transform XML-encoded lexicographic dictionary entries into fluent, semantically rich text (a sketch of this step is given below the table). The resulting dataset demonstrates that dictionary-grounded generation can effectively enhance data quality. We present the results of experiments with several LLMs and describe the process of creating a new Czech pre-training dataset, SlamaHQTrain, obtained by processing eight Czech dictionaries containing more than 500,000 entries and 18 million words. |
| Related projects | |
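To make the entry-to-text step concrete, here is a minimal sketch of the kind of dictionary-grounded generation the abstract describes: one XML-encoded entry is flattened into a prompt whose retrieved context is the entry itself, which is then handed to an open-source LLM. The XML schema, field names, prompt wording, and model choice (`mistralai/Mistral-7B-Instruct-v0.2`) are illustrative assumptions, not the paper's published pipeline.

```python
# Minimal sketch of dictionary-grounded generation, assuming a
# hypothetical XML schema and prompt; the paper's actual pipeline,
# prompts, and models are not reproduced here.
import xml.etree.ElementTree as ET
from transformers import pipeline

ENTRY_XML = """
<entry>
  <headword>jazykověda</headword>
  <pos>noun</pos>
  <sense>the scientific study of language; linguistics</sense>
  <example>Jazykověda zkoumá strukturu a vývoj jazyků.</example>
</entry>
"""

def entry_to_prompt(entry_xml: str) -> str:
    """Flatten one XML-encoded dictionary entry into a generation prompt."""
    root = ET.fromstring(entry_xml)
    headword = root.findtext("headword", default="")
    pos = root.findtext("pos", default="")
    sense = root.findtext("sense", default="")
    example = root.findtext("example", default="")
    # Retrieval-augmented prompting in the simplest sense: the dictionary
    # entry itself is the retrieved context that grounds the output.
    return (
        "Using only the dictionary information below, write one fluent, "
        "encyclopedic paragraph in Czech about the word.\n"
        f"Word: {headword} ({pos})\n"
        f"Meaning: {sense}\n"
        f"Example: {example}\n"
        "Paragraph:"
    )

if __name__ == "__main__":
    # Any open-source instruction-tuned model could stand in here (assumption).
    generator = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",
    )
    prompt = entry_to_prompt(ENTRY_XML)
    result = generator(prompt, max_new_tokens=200, do_sample=False)
    print(result[0]["generated_text"])
```

At dataset scale, paragraphs generated this way from each entry would be concatenated into the pre-training corpus; per the abstract, the published SlamaHQTrain dataset was built by processing more than 500,000 entries from eight Czech dictionaries.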