Technical Report: Development and Training of the First Armenian Large Language Model HyGPT
Date: May 9, 2025
Authors: Gen2B, with contributions from Artak Hovsepian (NCCAIT)
Document Version: 1.0
Download: HuggingFace
Test: https://t.me/HyGPT_Gen2B_Bot
Abstract
This technical report details the development and training process of the first Armenian Large Language Model, HyGPT. The report covers the stages of base model selection, data preparation and processing for pre-training, key architectural modifications, the pre-training process, collection and preparation of the instruction dataset, as well as the fine-tuning process and evaluation of the resulting models. The HyGPT-10b-it model shows significant improvements in understanding and generating Armenian text, as well as in instruction following, compared to existing models.
Table of Contents
1. Introduction
1.1. Context and Motivation
1.2. Project Objectives
2. Methodology
2.1. Base Model Selection
2.2. Data Preparation for Pre-training
2.3. Architectural Modification
2.4. Pre-training Process
2.5. Data Preparation for Instruct Fine-tuning
2.6. Fine-tuning Process
3. Evaluation and Results
4. Conclusion
Acknowledgements
1. Introduction
1.1. Context and Motivation
Large Language Models (LLMs) based on the Transformer architecture, such as Llama, Mistral, and Gemma, have demonstrated outstanding performance, primarily in English. However, their effectiveness for lower-resource languages, including Armenian, is often limited. The Gen2B team's experience training the Kazakh model Irbis-7B (detailed in the Russian-language article "Irbis-7B, or how we taught an LLM the Kazakh language") showed that one of the key problems is inefficient tokenization, which leads to longer token sequences, reduced generation speed, and rapid filling of the context window.
While some modern models, such as Gemma, have tokenizers that are better adapted to multilingual text, their out-of-the-box performance on languages underrepresented in the training data remains low. At the start of the project, despite progress in the field of LLMs, the Armenian language remained largely uncovered: existing open-source models (up to 14 billion parameters, and sometimes more) demonstrated extremely low or zero ability to generate meaningful text in Armenian. This creates a significant gap in the availability of modern AI technologies for the Armenian-speaking community.
1.2. Project Objectives
The primary objective of this project was to develop and train the first high-quality large language model for Eastern Armenian, named HyGPT. The project set out to accomplish the following tasks:
2. Methodology
2.1. Base Model Selection
A key factor in model selection was the efficiency of the candidate's tokenizer for the Armenian language. The Gemma 2 tokenizer splits Armenian words into an average of 2-3 tokens, a good result for a morphologically rich language. This favorably distinguished Gemma 2 from other models of similar size, whose tokenizers either produced significantly longer token sequences for Armenian text or were not adapted to the language at all, hindering meaningful generation. Although the base Gemma2-9B generated very low-quality Armenian text, its efficient tokenizer made it a promising starting point, and this model was ultimately chosen as the foundation for HyGPT.
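As a rough illustration of this kind of tokenizer fertility check, the snippet below counts tokens per word with the publicly available google/gemma-2-9b tokenizer; the sample sentence is an arbitrary illustration, not taken from the HyGPT corpus.

```python
# Minimal sketch: estimating how many tokens the Gemma 2 tokenizer spends per Armenian word.
# The example sentence is illustrative only and is not part of the HyGPT training data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

text = "Բարև ձեզ, ինչպե՞ս եք այսօր։"  # "Hello, how are you today?" in Armenian
words = text.split()
tokens = tokenizer.tokenize(text)

print(f"words: {len(words)}, tokens: {len(tokens)}")
print(f"average tokens per word: {len(tokens) / len(words):.2f}")
```

A ratio close to 2-3 tokens per word, as reported above, keeps Armenian sequences reasonably short relative to the context window.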
Unlike the Irbis project, where expanding the vocabulary and retraining the Mistral tokenizer was necessary, the decision here was made to use the existing Gemma2 tokenizer without modifications. This allowed for a focus on other aspects of model improvement.
2.2. Data Preparation for Pre-training
For the pre-training phase, we collected a large corpus of Eastern Armenian texts containing about 10 billion tokens (approximately 100 GB of raw text). Data collection was significantly assisted by Artak Hovsepian (NCCAIT).
The corpus included:
The collected raw text corpus underwent a thorough cleaning and deduplication stage using the datatrove library from Hugging Face. The cleaning process included:
The cleaning process resulted in the removal of less than 0.2 billion tokens, representing less than 2% of the total volume. This indicates the high initial quality of the collected corpus.
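For illustration only, the sketch below shows the general idea behind such cleaning and exact deduplication in plain Python; it is not the actual datatrove pipeline used for HyGPT, and the length threshold is an arbitrary placeholder.

```python
# Illustrative sketch of corpus cleaning and exact deduplication; NOT the datatrove
# pipeline used for HyGPT, only a demonstration of the underlying idea.
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize and collapse whitespace before filtering and hashing."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_and_dedup(documents):
    seen_hashes = set()
    for doc in documents:
        doc = normalize(doc)
        # Drop very short fragments that carry little signal for pre-training
        # (200 characters is a placeholder threshold).
        if len(doc) < 200:
            continue
        # Exact deduplication via content hashing.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield doc
```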
2.3. Architectural Modification
The Gemma 2 model uses weight tying between the embedding layer (embed) and the output layer (lm_head). With this approach, the embedding layer's weight matrix is reused (typically in transposed form) to project the model's hidden states into logits for each token in the vocabulary. The main benefits of this approach are:
Despite the advantages of layer tying, the decision was made to explore a variant where embed and lm_head are separated for the HyGPT model. This decision was motivated by an analysis of theoretical and practical considerations, supported by several studies and community discussions:
These theoretical arguments for layer separation had to be validated empirically through a series of short training runs, each lasting approximately 3 days, followed by evaluation on three Armenian benchmarks that assess the model's reading comprehension:
The following configurations were compared:
The evaluation results (accuracy metric) are presented in Table 1:
The results clearly demonstrated that training the embed and lm_head layers separately (train emb / train lm) yielded the best average accuracy. This modification increased the model's parameter count from 9 billion to 10 billion, which led us to name our model HyGPT-10b.
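As a hypothetical sketch, untying the embedding and output layers of a Gemma 2 checkpoint in the Hugging Face transformers API could look roughly like the following; the exact procedure used for HyGPT is not described in this report.

```python
# Hypothetical sketch of untying Gemma 2's input embeddings from its output head so the
# two matrices can be trained independently (not the verified HyGPT procedure).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype=torch.bfloat16)

embed = model.get_input_embeddings()      # nn.Embedding: vocab_size x hidden_size
lm_head = model.get_output_embeddings()   # output projection, weight-tied to `embed`

# Give the output head its own copy of the weights so gradients update it separately.
lm_head.weight = nn.Parameter(embed.weight.detach().clone())

# Record in the config that the layers are no longer tied, so both copies are saved.
model.config.tie_word_embeddings = False
```

With this change, the roughly 1 billion additional parameters come from the second vocabulary-sized matrix, which matches the growth from 9B to 10B described above.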
2.4. Pre-training Process
The final pre-training of the model (with separate embed and lm_head layers) was performed on two Nvidia H100 GPUs. Computational resources were provided through the Nvidia Inception Program. The training itself was conducted using the transformers, peft, trl, and datasets libraries; deepspeed, bitsandbytes, and flash-attention were used for more efficient GPU memory usage. Pre-training lasted two weeks, totaling approximately 646 GPU hours (about 1,000 GPU hours including all experiments), during which the model was trained with the AdamW optimizer, a learning rate of 2e-4, and a cosine scheduler. The training progress is illustrated in Figure 1:
Figure 1: Pre-training progress metrics from TensorBoard
During training, the evaluation loss (eval/loss) was monitored on a dedicated validation subset of the Armenian corpus. The loss consistently decreased from an initial value of 0.88 to a final value of 0.68, indicating successful adaptation of the model to the Armenian language material.
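For reference, the reported optimizer and scheduler settings can be expressed as Hugging Face TrainingArguments roughly as follows; the output path, batch sizes, and logging intervals are placeholders rather than the actual HyGPT configuration.

```python
# Sketch of the reported pre-training hyperparameters as TrainingArguments.
# Only learning rate, scheduler, and optimizer are taken from the report;
# everything else is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hygpt-10b-pretrain",    # placeholder path
    per_device_train_batch_size=1,      # placeholder; actual value not reported
    gradient_accumulation_steps=16,     # placeholder; actual value not reported
    learning_rate=2e-4,                 # as reported
    lr_scheduler_type="cosine",         # cosine scheduler, as reported
    optim="adamw_torch",                # AdamW optimizer, as reported
    bf16=True,                          # H100 GPUs support bfloat16
    logging_steps=50,
    save_steps=500,
    report_to="tensorboard",
)
```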
2.5. Data Preparation for Instruct Fine-tuning
In parallel with pre-training, a dataset was prepared for instruction fine-tuning. The goal of this stage was to teach the model to follow instructions and engage in meaningful dialogue. The dataset was structured as chats of 2-3 turns (user-assistant). The main focus was on the following tasks:
The primary challenge in creating the instruction dataset was the near-total absence of ready-made, high-quality data in Eastern Armenian for the tasks listed above. To address this problem, the following approach was used:
In the end, an instruction dataset containing approximately 40,000 samples was compiled.
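A single sample in such a chat format might look like the following hypothetical example; the actual schema and contents of the HyGPT instruction dataset are not shown in this report.

```python
# Hypothetical example of one instruction sample in a 2-turn chat format.
sample = {
    "messages": [
        {"role": "user", "content": "Թարգմանիր հայերեն. The weather is nice today."},
        {"role": "assistant", "content": "Այսօր եղանակը հաճելի է։"},
        {"role": "user", "content": "Իսկ ինչպե՞ս կասես դա ավելի պաշտոնական ոճով։"},
        {"role": "assistant", "content": "Այսօր եղանակային պայմանները բարենպաստ են։"},
    ]
}
```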
2.6. Fine-tuning Process
Fine-tuning of the pre-trained HyGPT-10b model on the collected instruction dataset was also performed on two Nvidia H100 GPUs and took slightly over 3 days (approximately 150 GPU hours). The effective_llm_alignment library was chosen as the primary toolkit; in addition to the Supervised Fine-Tuning (SFT) method used here, it also offers more specialized alignment methods such as DPO, ORPO, SimPO, and SMPO. The changes in the loss function on the training (train/loss) and validation (eval/loss) sets during fine-tuning are shown in Figure 2:
Figure 2: Loss function dynamics during instruction fine-tuning, from Weights & Biases (WandB)
The presented plots show that the training loss (train/loss) steadily decreased throughout the entire fine-tuning process (3 epochs). In contrast, the evaluation loss (eval/loss) exhibited a different dynamic: after an initial decrease, it demonstrated a tendency to increase, particularly noticeable at the beginning of each new training epoch. This behavior suggests that the model began to over-adapt to the specifics of the training dataset, potentially reducing its ability to generalize to previously unseen examples from the validation set.
Nevertheless, it is noteworthy that the final model checkpoint, corresponding to the completion of the 3 epochs of fine-tuning, showed the best performance on the target benchmarks (see below), as well as in subjective assessments of the quality of generated responses. This phenomenon is common in the practice of training large language models and may have the following explanations:
Therefore, despite the formal signs of overfitting reflected in the slight increase in eval/loss, the final checkpoint was selected because of its superiority on more complex and practically meaningful quality metrics. The resulting model was named HyGPT-10b-it (instruction-tuned).
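For readers unfamiliar with the SFT setup, the sketch below outlines what supervised fine-tuning on such a chat-format dataset can look like using trl's SFTTrainer; the actual run used the effective_llm_alignment toolkit, and apart from the 3 epochs every value below (dataset path, model id, learning rate, scheduler) is a placeholder.

```python
# Rough sketch of supervised fine-tuning (SFT) on a chat-format instruction dataset.
# The actual HyGPT run used the effective_llm_alignment toolkit; this is illustrative only.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset path; assumes a JSONL file whose rows carry a "messages" field
# in the chat format shown earlier, which recent trl versions can format automatically.
dataset = load_dataset("json", data_files="hy_instructions.jsonl", split="train")

config = SFTConfig(
    output_dir="hygpt-10b-it",
    num_train_epochs=3,            # 3 epochs, as reported
    learning_rate=1e-5,            # placeholder; actual value not reported
    lr_scheduler_type="cosine",    # placeholder; actual scheduler not reported
    bf16=True,
)

trainer = SFTTrainer(
    model="Gen2B/HyGPT-10b",       # hypothetical id for the pre-trained checkpoint
    args=config,
    train_dataset=dataset,
)
trainer.train()
```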
3. Evaluation and Results
The quality of HyGPT-10b-it was evaluated on a set of benchmarks translated into Armenian in advance using Gemini 2.0 Flash:
HyGPT-10b-it's results were compared with several other state-of-the-art open-source models. The data is presented in Table 2:
The results presented in Table 2 demonstrate the strong competitiveness of the HyGPT-10b-it model.
Qualitative analysis, including evaluations of HyGPT-10b-it's generations by native Armenian speakers, corroborated the quantitative results. HyGPT-10b-it demonstrated significantly improved abilities in understanding and executing complex instructions in Armenian, maintaining context in multi-turn dialogues, generating more natural and coherent responses in Armenian, and performing specific tasks such as translation, summarization, and question answering using provided context.
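As an illustration of how accuracy can be computed on such translated multiple-choice benchmarks, the sketch below scores each answer option by the model's log-likelihood and picks the highest-scoring one. This is an assumed approach rather than the actual evaluation harness used, and the model id is hypothetical.

```python
# Illustrative multiple-choice scoring by option log-likelihood (not the actual harness).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Gen2B/HyGPT-10b-it"   # hypothetical Hugging Face id for the released model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    # Logits at position t predict token t+1, so shift by one when scoring option tokens.
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, -option_len - 1:-1], dim=-1)
    option_tokens = full_ids[0, -option_len:]
    return log_probs.gather(1, option_tokens.unsqueeze(1)).sum().item()

def predict(question: str, options: list[str]) -> int:
    """Return the index of the option the model considers most likely."""
    return max(range(len(options)), key=lambda i: option_logprob(question, options[i]))
```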
4. Conclusion
Within the scope of this project, HyGPT, the first large language model for Eastern Armenian, has been successfully developed and trained. Key milestones included the meticulous collection and preparation of an extensive text corpus, adaptation of the Gemma2 architecture with separate and independently trained embedding and output layers, and a two-stage training process: pre-training (HyGPT-10b) and instruction fine-tuning (HyGPT-10b-it).
The resulting HyGPT-10b-it model demonstrates high performance on a range of Armenian language benchmarks, surpassing many existing models, including some larger ones. The model shows significant potential for use in conversational applications, educational tools, multilingual support systems, and other domains requiring high-quality Armenian language processing.
The developed models have been released publicly and are available for download and use on the Hugging Face platform (see the Download link above).
Future work will focus on expanding the instruction dataset, exploring more advanced model alignment techniques, and developing specialized versions of the model for specific applied tasks in the Armenian-speaking space.
Acknowledgements
The Gen2B team expresses its gratitude to: