HYBRID DISTANCE-STATISTICAL-BASED PHRASE ALIGNMENT FOR ANALYZING PARALLEL TEXTS IN STANDARD MALAY AND MALAY DIALECTS

Jasmina Khaw Yen Min
Tien Ping Tan
Bali Ranaivo-Malancon

Abstract

Parallel text corpora are essential resources in linguistics and natural language processing, especially in translation and multilingual information retrieval. Publicly available parallel text corpora are limited to certain genres, types and domains. Furthermore, parallel dialect text is scarce, even though it is important for the analysis and study of a dialect. Collecting parallel dialect text is challenging because dialects typically appear in the form of speech and very few dialectal texts exist. Moreover, most dialects have no standard orthography. The contributions of this paper are threefold. First, the paper describes a methodology for acquiring a parallel text corpus of Standard Malay and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid distance-based and statistical-based alignment algorithm to align words and phrases in the parallel text. The results show that the precision and recall of the proposed alignment algorithm exceed 95% and are better than those of the state-of-the-art GIZA++. Third, the alignments obtained were compared to identify the lexical similarities and differences between Standard Malay and the two studied Malay dialects, contributing valuable insights into the linguistic variations within the Malay language family.
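
The abstract does not spell out how the distance-based and statistical components are combined, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a normalized Levenshtein similarity as the distance component, a Dice co-occurrence coefficient as the statistical component, and a simple weighted sum (the hypothetical `weight` parameter) to merge them. The sentence pairs are toy data with illustrative dialect spellings and are not taken from the paper's corpus.

```python
from collections import Counter
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Distance component: 1.0 for identical words, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dice_scores(pairs):
    """Statistical component: Dice coefficient from sentence-level co-occurrence."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_set, t_set = set(src), set(tgt)
        src_count.update(s_set)
        tgt_count.update(t_set)
        joint.update(product(s_set, t_set))
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in joint.items()}


def align(src, tgt, dice, weight=0.5):
    """Link each source word to the target word with the best combined score.

    `weight` is an assumed mixing parameter, not a value from the paper.
    """
    def score(s, t):
        return weight * string_similarity(s, t) + (1 - weight) * dice.get((s, t), 0.0)
    return [(s, max(tgt, key=lambda t: score(s, t))) for s in src]


# Toy parallel sentences (standard form, dialect form); illustrative only.
pairs = [
    (["saya", "makan", "nasi"], ["kawe", "make", "nasi"]),
    (["saya", "makan", "ikan"], ["kawe", "make", "ike"]),
    (["dia", "makan", "nasi"], ["dio", "make", "nasi"]),
]
dice = dice_scores(pairs)
result = align(pairs[0][0], pairs[0][1], dice)
print(result)  # [('saya', 'kawe'), ('makan', 'make'), ('nasi', 'nasi')]
```

In this sketch the string similarity favors cognates such as "makan" and "make", while the co-occurrence score recovers pairs with little surface overlap such as "saya" and "kawe". A real system would also need to handle phrases, unaligned words and many-to-one links, which this example omits.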

Article Details

How to Cite
Yen Min, J. K., Tan, T. P., & Ranaivo-Malancon, B. (2024). HYBRID DISTANCE-STATISTICAL-BASED PHRASE ALIGNMENT FOR ANALYZING PARALLEL TEXTS IN STANDARD MALAY AND MALAY DIALECTS. Malaysian Journal of Computer Science, 37(1), 1–25. https://doi.org/10.22452/mjcs.vol37no1.5
Section
Articles