HyGPT 1.0 Technical Report

Technical Report: Development and Training of the First Armenian Large Language Model HyGPT

Date: May 9, 2025

Organizations: Gen2B, NCCAIT

Authors: Gen2B, with contributions from Artak Hovsepian (NCCAIT)

Document Version: 1.0

Download: Hugging Face

Test: https://t.me/HyGPT_Gen2B_Bot

Abstract

This technical report details the development and training process of the first Armenian Large Language Model, HyGPT. The report covers the stages of base model selection, data preparation and processing for pre-training, key architectural modifications, the pre-training process, collection and preparation of the instruction dataset, as well as the fine-tuning process and evaluation of the resulting models. The HyGPT-10b-it model shows significant improvements in understanding and generating Armenian text, as well as in instruction following, compared to existing models.

Table of Contents

  1. Introduction
     1.1. Context and Motivation
     1.2. Project Objectives
  2. Methodology
     2.1. Base Model Selection
     2.2. Data Preparation for Pre-training
     2.3. Architectural Modification
     2.4. Pre-training Process
     2.5. Data Preparation for Instruct Fine-tuning
     2.6. Fine-tuning Process
  3. Evaluation and Results
  4. Conclusion

1. Introduction

1.1. Context and Motivation 

Large Language Models (LLMs) based on the Transformer architecture, such as Llama, Mistral, and Gemma, have demonstrated outstanding performance, primarily in English. However, their effectiveness for languages with less available data, including Armenian, is often limited. The experience of the Gen2B team in training the Kazakh model Irbis-7B (detailed in the Russian-language article "Irbis-7B или как мы учили ЛЛМку казахскому языку", i.e. "Irbis-7B, or how we taught an LLM the Kazakh language") showed that one of the key problems is inefficient tokenization, which leads to longer token sequences, reduced generation speed, and rapid filling of the context window.

While some modern models, such as Gemma, ship with tokenizers better adapted to multilingual text, their out-of-the-box performance on languages underrepresented in the training data remains low. At the start of the project, despite progress in the field of LLMs, the Armenian language remained largely "uncovered": existing open-source models (up to 14 billion parameters, and sometimes more) demonstrated extremely low or zero ability to generate meaningful text in Armenian. This creates a significant gap in the availability of modern AI technologies for the Armenian-speaking community.

1.2. Project Objectives 

The primary objective of this project was to develop and train the first high-quality large language model for Eastern Armenian, named HyGPT. The project encompassed the following tasks:

  • Selecting a suitable base model.
  • Collecting and preparing a representative corpus of texts in Eastern Armenian.
  • Pre-training to adapt the model to the target language.
  • Investigating and implementing architectural modifications to improve model quality.
  • Fine-tuning to enhance the model's ability to follow instructions and engage in dialogue.
  • Comprehensive evaluation of the resulting models on relevant benchmarks.

2. Methodology

2.1. Base Model Selection 

A key factor in model selection was the efficiency of its tokenizer for the Armenian language. The Gemma 2 tokenizer splits Armenian words into an average of 2-3 tokens, a good result for a morphologically rich language. This favorably distinguished Gemma 2 from other similarly sized models, whose tokenizers either produced significantly longer token sequences for Armenian text or were not adapted for the language at all, hindering meaningful generation. Although the base Gemma2-9B generated very low-quality Armenian text, its efficient tokenizer made it a promising starting point, and this model was ultimately chosen as the foundation for HyGPT.

Unlike the Irbis project, where expanding the vocabulary and retraining the Mistral tokenizer was necessary, the decision here was made to use the existing Gemma2 tokenizer without modifications. This allowed for a focus on other aspects of model improvement.

2.2. Data Preparation for Pre-training

For the pre-training phase, we collected a large corpus of Eastern Armenian texts totaling about 10 billion tokens (approximately 100 GB of raw text). Data collection was significantly assisted by Artak Hovsepian (NCCAIT).
The corpus included:

  • Armenian news articles.
  • Web content (cleaned web pages).
  • Texts from the Armenian Wikipedia.
  • Literary works.
  • Scientific publications.
  • Other publicly available Armenian text sources.

The collected raw text corpus underwent a thorough cleaning and deduplication stage using the datatrove library from Hugging Face. The cleaning process included:

  • Removal of texts in languages other than Armenian.
  • Removal of personal and contact information (phone numbers, website URLs, email addresses).
  • Document-level and paragraph-level deduplication to reduce data redundancy.

The cleaning process resulted in the removal of less than 0.2 billion tokens, representing less than 2% of the total volume. This indicates the high initial quality of the collected corpus.
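To make the cleaning steps concrete, below is a simplified, standalone Python sketch of language filtering, PII scrubbing, and exact paragraph-level deduplication. It is illustrative only: the production pipeline was built with the datatrove library, and the thresholds and regular expressions here are assumptions rather than the actual configuration.

import hashlib
import re

# Illustrative patterns for the PII removal step; not the production regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{7,}\d")

def scrub_pii(text: str) -> str:
    """Remove e-mail addresses, URLs, and phone-like numbers."""
    for pattern in (EMAIL_RE, URL_RE, PHONE_RE):
        text = pattern.sub("", text)
    return text

def armenian_ratio(text: str) -> float:
    """Share of alphabetic characters in the Armenian Unicode block (U+0530-U+058F)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0530" <= c <= "\u058f" for c in letters) / len(letters)

def clean_corpus(documents):
    """Yield cleaned documents, dropping non-Armenian text and exact paragraph duplicates."""
    seen_paragraphs = set()
    for doc in documents:
        if armenian_ratio(doc) < 0.5:          # language filter (threshold assumed)
            continue
        kept = []
        for paragraph in doc.split("\n"):
            key = hashlib.sha1(paragraph.strip().lower().encode()).hexdigest()
            if paragraph.strip() and key not in seen_paragraphs:
                seen_paragraphs.add(key)
                kept.append(scrub_pii(paragraph))
        if kept:
            yield "\n".join(kept)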

2.3. Architectural Modification

The Gemma 2 model uses weight tying between the embedding layer (embed) and the output layer (lm_head).  With this approach, the embedding layer's weight matrix is reused (typically in transposed form) to project the model's hidden states into logits for each token in the vocabulary. The main benefits of this approach are:

  • Parameter Efficiency: A significant reduction in the overall number of model parameters, as a separate weight matrix for the lm_head (of size hidden_size * vocab_size) is not required.
  • Regularization: Weight tying can act as a form of regularization, potentially improving the model's generalization ability.

Despite the advantages of layer tying, the decision was made to explore a variant where embed and lm_head are separated for the HyGPT model. This decision was motivated by an analysis of theoretical and practical considerations, supported by several studies and community discussions:

  1. Separation of Semantic and Predictive Spaces: Independent layers allow embeddings to specialize in encoding syntactic and semantic relationships between tokens, while the output layer can better model conditional probabilities of tokens within a given context. This can improve accuracy on tasks requiring fine-grained distinctions of context-dependent meanings (according to Masked Language Models).
  2. Multilingualism: When dozens of language corpora share a common vocabulary, rare tokens compete for a single matrix (embed), which degrades quality for low-resource languages. The DEPT paper shows that, due to model capacity limitations, adding multilingual data can worsen a model's perplexity.
  3. Eliminating Optimization Conflicts: Weight tying introduces a conflict: embeddings strive to bring semantically similar tokens closer together, while the output layer aims to differentiate them for accurate prediction. Separating the layers eliminates this conflict, which can lead to more stable training (according to By Tying Embeddings You Are Assuming the Distributional Hypothesis).
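To illustrate the modification itself, the sketch below shows one way to separate the tied layers in a Hugging Face transformers Gemma 2 checkpoint. The model name, dtype, and exact procedure are assumptions for illustration; the report does not publish its modification script.

import torch
from transformers import AutoModelForCausalLM

# Load the base checkpoint (model name and dtype assumed for illustration).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.bfloat16
)

# In the stock Gemma 2 configuration, lm_head.weight is the same tensor as
# embed_tokens.weight. Cloning it gives the output layer its own parameters.
model.config.tie_word_embeddings = False
untied_weight = model.get_input_embeddings().weight.detach().clone()
model.get_output_embeddings().weight = torch.nn.Parameter(untied_weight)

# The two matrices now occupy separate storage.
assert model.get_output_embeddings().weight.data_ptr() != \
       model.get_input_embeddings().weight.data_ptr()

After this change the embedding and output layers can be trained or frozen independently, which is exactly what the experiments described below compare.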

These theoretical arguments for layer separation were validated by a series of short training runs, each lasting approximately 3 days, followed by evaluation on three Armenian benchmarks that assess the model's reading comprehension:

  • The Belebele Benchmark - This benchmark contains general knowledge questions, context for them, and multiple answer options (the hye_Armn dataset was used).
  • SynDARin - This benchmark emphasizes the model's understanding of the Armenian language and also contains context and multiple answer options (the Armenian QA dataset was used).
  • INCLUDE - This benchmark evaluates the understanding of specific information related to cultural norms and practices relevant to the local environment (in our case, Armenia). The dataset does not contain contexts, being limited to questions and answers for them (the Armenian dataset was used).
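All three benchmarks are multiple-choice. One common way to score such items with a causal LM is to pick the option to which the model assigns the highest log-likelihood given the context and question; the sketch below illustrates this approach and is not necessarily the exact evaluation harness used. The model identifier is hypothetical.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model identifier, used only for illustration.
tok = AutoTokenizer.from_pretrained("Gen2B/HyGPT-10b")
model = AutoModelForCausalLM.from_pretrained("Gen2B/HyGPT-10b").eval()

@torch.no_grad()
def option_logprob(prompt: str, option: str) -> float:
    # Assumes the tokenization of `prompt` is a prefix of `prompt + option`.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    picked = logprobs[-option_len:].gather(1, targets[-option_len:, None])
    return picked.sum().item()

def predict(context: str, question: str, options: list[str]) -> int:
    """Return the index of the highest-likelihood answer option."""
    prompt = f"{context}\n{question}\n"
    scores = [option_logprob(prompt, " " + opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)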

The following configurations were compared:

  • train emb / train lm: Training separated layers (input and output independently).
  • train sync(emb/lm): Training with tied weights (standard Gemma2 approach).
  • train emb / freeze lm: Training only the input layer.
  • freeze emb / train lm: Training only the output layer.
  • freeze emb / freeze lm: Both layers are frozen (everything is trained except the input and output layers).
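In practice, each freeze/train configuration reduces to toggling gradient computation on the corresponding parameters. A minimal sketch (illustrative only; the report does not publish its experiment scripts):

def configure_embeddings(model, train_embed: bool, train_lm_head: bool):
    """Enable or disable training of the input embedding and output (lm_head) layers."""
    for p in model.get_input_embeddings().parameters():
        p.requires_grad = train_embed
    for p in model.get_output_embeddings().parameters():
        p.requires_grad = train_lm_head

# Example: the "train emb / train lm" configuration on the untied model above.
# configure_embeddings(model, train_embed=True, train_lm_head=True)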

The evaluation results (accuracy metric) are presented in Table 1:

The results clearly demonstrated that training the embed and lm_head layers separately (train emb / train lm) yielded the best average accuracy. This modification increased the model's parameter count from 9 billion to 10 billion, which led us to name the model HyGPT-10b.

2.4. Pre-training Process 

The final pre-training of the model (with separate embed and lm_head layers) was performed on two Nvidia H100 GPUs. Computational resources were provided through the Nvidia Inception Program. The training itself was conducted using the following libraries: transformers, peft, trl, and datasets. For more efficient GPU memory use, deepspeed, bitsandbytes, and flash-attention were utilized. The pre-training lasted for two weeks, totaling approximately 646 GPU hours (with all experiments totaling around 1000 GPU hours), during which the model was trained using the AdamW optimizer, a learning rate of 2e-4, and a cosine scheduler. The training progress is illustrated in Figure 1:

Figure 1: Pre-training progress metrics from Tensorboard

During training, the evaluation loss (eval/loss) was monitored on a dedicated validation subset of the Armenian corpus. The loss consistently decreased from an initial value of 0.88 to a final value of 0.68, indicating successful adaptation of the model to the Armenian language material.
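The report lists the libraries and key hyperparameters but not the full training script. The sketch below shows a compatible continued pre-training setup with transformers and datasets under the stated settings (AdamW, learning rate 2e-4, cosine schedule, bf16); the dataset path, sequence length, batch sizes, and deepspeed configuration are assumptions.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.bfloat16
)  # in practice, the untied variant from Section 2.3

# Hypothetical corpus file; tokenize the raw Armenian text for causal LM training.
corpus = load_dataset("json", data_files="hy_corpus.jsonl", split="train")
corpus = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=4096),
    batched=True, remove_columns=corpus.column_names,
)

args = TrainingArguments(
    output_dir="hygpt-10b-pretrain",
    per_device_train_batch_size=2,    # batch sizes assumed
    gradient_accumulation_steps=16,
    learning_rate=2e-4,               # as stated in the report
    lr_scheduler_type="cosine",       # cosine scheduler
    optim="adamw_torch",              # AdamW optimizer
    bf16=True,
    num_train_epochs=1,
    logging_steps=50,
    save_steps=1000,
    # deepspeed="ds_zero2.json",      # hypothetical config for the two H100s
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()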

2.5. Data Preparation for Instruct Fine-tuning

In parallel with pre-training, a dataset for instruction fine-tuning was prepared. The goal of this stage was to teach the model to follow instructions and engage in meaningful dialogue. The dataset was structured in a chat format of 2-3 user-assistant turns. The main focus was on the following tasks:

  • Multi-turn conversations in Armenian.
  • Text paraphrasing.
  • Question answering in Armenian.
  • Text summarization.
  • Translation between Armenian, Russian, and English.
  • Solving school-level math problems.
  • Answering general knowledge questions.
  • Assistance with educational content.

The primary challenge in creating the instruction dataset was the near-total absence of ready-made, high-quality data in Eastern Armenian for the tasks listed above. To address this problem, the following approach was used:

  • Collection of existing, publicly available English and Russian instruction datasets.
  • Translation of these datasets into Eastern Armenian. Among the translation tools tested (including proprietary LLMs), the Gemini 2.0 Flash model delivered the best translation quality.
  • Manual creation of a portion of the samples (~50% of the total) using various texts in Eastern Armenian as context. For these contexts, question-answer pairs and example dialogues were generated using Gemini 2.0 Flash.

In the end, an instruction dataset containing approximately 40,000 samples was compiled.
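The report does not publish the dataset schema. For illustration, one plausible representation of a 2-3 turn sample in a JSONL chat format (field names and contents are placeholders, not the actual data):

import json

sample = {
    "messages": [
        {"role": "user", "content": "<question in Eastern Armenian>"},
        {"role": "assistant", "content": "<answer in Eastern Armenian>"},
        {"role": "user", "content": "<follow-up question>"},
        {"role": "assistant", "content": "<follow-up answer>"},
    ],
    "source": "translated",  # e.g. translated vs. manually created sample
}

with open("hygpt_instruct.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")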

2.6. Fine-tuning Process

Fine-tuning of the pre-trained HyGPT-10b model on the collected instruction dataset was also performed on two Nvidia H100 GPUs and took slightly over 3 days (approximately 150 GPU-hours). The effective_llm_alignment library was chosen as the primary toolkit; in addition to the Supervised Fine-Tuning (SFT) method used here, it also supports alignment methods such as DPO, ORPO, SimPO, and SMPO. The loss curves on the training (train/loss) and validation (eval/loss) sets during fine-tuning are shown in Figure 2:

Figure 2: Loss function dynamics during instruction fine-tuning, from Weights & Biases

The presented plots show that the training loss (train/loss) steadily decreased throughout the entire fine-tuning process (3 epochs). In contrast, the evaluation loss (eval/loss) exhibited a different dynamic: after an initial decrease, it demonstrated a tendency to increase, particularly noticeable at the beginning of each new training epoch. This behavior suggests that the model began to over-adapt to the specifics of the training dataset, potentially reducing its ability to generalize to previously unseen examples from the validation set.

Nevertheless, it is noteworthy that the final model checkpoint, corresponding to the completion of the 3 epochs of fine-tuning, showed the best performance on the target benchmarks (see below), as well as in subjective assessments of the quality of generated responses. This phenomenon is common in the practice of training large language models and may have the following explanations:

  • Mismatch between proxy metric (loss) and target high-level metrics: The loss function used (cross-entropy) optimizes the accuracy of predicting the next token. While this is a useful and widely used proxy metric for training, it does not always directly correlate with complex aspects of generation quality, such as coherence, consistency, factual accuracy, and adherence to complex instructions.
  • Acquisition of specific instruction patterns: With longer training, even as eval/loss increased, the model may have more deeply assimilated the structure, style, and nuances of instruction following present in the training dataset, internalizing the expected response format and style more thoroughly than strictly necessary.
  • Quality of subjective evaluation: The subjective evaluation, confirming the high quality of the final checkpoint (including answers to questions not present in the dataset), further indicates that the model acquired useful generalization abilities that are not fully reflected by the eval/loss metric.

Therefore, despite the formal signs of overfitting, captured by the slight increase in eval/loss, the selection of the final checkpoint was dictated by its superiority on more complex and practically significant quality evaluation metrics. The resulting model was named HyGPT-10b-it (instruct-tuned).

3. Evaluation and Results

The quality of HyGPT-10b-it was evaluated on a set of benchmarks, pre-translated into Armenian using Gemini 2.0 Flash:

  • Flores-200 (hy-ru-en subset): Evaluates the model's ability to translate between Armenian, Russian, and English. The BLEU metric was used. The table shows the average score across all 6 directions (hy->ru, ru->hy, hy->en, en->hy, ru->en, en->ru). Link to the original dataset.
  • ARC (AI2 Reasoning Challenge, Challenge set): A multiple-choice question benchmark evaluating reasoning ability. Metric – accuracy. Link to the original dataset.
  • TruthfulQA (Multiple-choice): A multiple-choice benchmark evaluating the model's ability to provide truthful answers and avoid generating misinformation. Metric – % of correct answers. Link to the original dataset.
  • GSM8K (Grade School Math 8K): Evaluates the model's ability in mathematical reasoning on grade school-level problems. Metric – accuracy. Link to the original dataset.
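For reference, the Flores-style scoring described above reduces to computing corpus-level BLEU per translation direction and averaging over the six directions. A minimal sketch using sacrebleu (the generation step and the exact BLEU settings used in the report are not specified, so this is illustrative):

import sacrebleu

def average_bleu(translations: dict[str, tuple[list[str], list[str]]]) -> float:
    """`translations` maps each direction (e.g. "hy-ru") to
    (model outputs, reference texts); returns the mean BLEU across directions."""
    scores = []
    for direction, (hypotheses, references) in translations.items():
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        print(f"{direction}: BLEU = {bleu.score:.2f}")
        scores.append(bleu.score)
    return sum(scores) / len(scores)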

HyGPT-10b-it's results were compared with several other state-of-the-art open-source models. The data is presented in Table 2:

The results presented in Table 2 demonstrate the strong competitiveness of the HyGPT-10b-it model.

  • Average Performance: HyGPT-10b-it exhibits the best average performance among all compared models, even outperforming larger models on some tasks.
  • TruthfulQA and GSM8K: HyGPT-10b-it demonstrates particularly outstanding results on the TruthfulQA and GSM8K benchmarks, significantly surpassing other models. This indicates an improved ability to follow instructions requiring factual accuracy and mathematical reasoning in Armenian.
  • Flores and ARC: On translation and general reasoning tasks, HyGPT-10b-it shows results comparable to or slightly inferior to larger models, such as gemma-3-12b-it (12B) and Mistral-Small (24B). This is expected, as these benchmarks may require a greater breadth of general knowledge, which larger models inherently possess.
  • Comparison with the Gemma2 Instruct Model: HyGPT-10b-it significantly outperforms gemma-2-9b-it, confirming the effectiveness of the additional pre-training on the Armenian corpus, architectural modifications, and instruction fine-tuning on the target data.

Qualitative analysis, including evaluations of HyGPT-10b-it's generations by native Armenian speakers, corroborated the quantitative results. HyGPT-10b-it demonstrated significantly improved abilities in understanding and executing complex instructions in Armenian, maintaining context in multi-turn dialogues, generating more natural and coherent responses in Armenian, and performing specific tasks such as translation, summarization, and question answering using provided context.

4. Conclusion 

Within the scope of this project, HyGPT – the first large language model for Eastern Armenian – has been successfully developed and trained. Key milestones included the meticulous collection and preparation of an extensive text corpus, adaptation of the Gemma2 architecture, including modifications with separate and independently trained embedding and output layers, and a two-stage training process: pre-training (HyGPT-10b) and instruction fine-tuning (HyGPT-10b-it).

The resulting HyGPT-10b-it model demonstrates high performance on a range of Armenian language benchmarks, surpassing many existing models, including some larger ones. The model shows significant potential for use in conversational applications, educational tools, multilingual support systems, and other domains requiring high-quality Armenian language processing.

The developed models have been released to the public domain and are available for download and use on the Hugging Face platform at the following link.

Future work will focus on expanding the instruction data set, exploring more advanced model alignment techniques, and developing specialized versions of the model for specific application tasks in the Armenian-speaking space.

Acknowledgements

The Gen2B team expresses its gratitude to:

  • NCCAIT and personally to Artak Hovsepian for their invaluable assistance in collecting data for pre-training.
  • Nvidia for providing the computational resources that made training the models possible.
