Bibliography
Author
Erica Harlin
Barcode
Work Notes
Accession No.
Supervisor
Ika Alfina
Keywords
informal language, lemmatization, noisy text, text normalization, POS tagging
Supervisor 3
Supervisor 2
Publication Year
2023
New RFID Barcode
11725671
Cohort Year
2019
Study Program
Computer Science
Location
FASILKOM-UI
Date Received
30/11/2023
Indonesian Abstract
ABSTRACT

Name: Erica Harlin
Study Program: Computer Science
Title: Handling Noisy Text to Improve Lemmatization and POS Tagging Accuracy for Informal Indonesian Text
Supervisors: Dr. Ika Alfina, S.Kom., M.Kom.; Arlisa Yuliawati, S.Kom., M.Kom.

Aksara is an NLP tool that conforms to Universal Dependencies (UD) v2. The most recent work on informal language processing in Aksara is v1.2 (the old Aksara), which focused on Aksara's ability to process informal root words and words with informal affixation. This research aims to develop Aksara's ability to process noisy text. Five methods were considered for normalizing noisy text: (1) Levenshtein distance, (2) Damerau-Levenshtein distance, (3) subsequence comparison, (4) Longest Common Subsequence (LCS), and (5) SymSpell. To determine which method fits best, we built a synthetic dataset of 20,000 words and compared the methods' performance in normalizing it. The resulting (method; accuracy) pairs are: (Levenshtein distance; 61.21), (Damerau-Levenshtein distance; 61.15), (subsequence comparison; 40.17), (LCS; 67.35), and (SymSpell; 68.5). SymSpell was ultimately chosen because it yielded the highest accuracy. The version of Aksara produced by this research is Aksara v1.4 (the new Aksara). To evaluate the new Aksara, a gold standard consisting of 152 sentences and 1786 tokens was used. The evaluation shows that the new Aksara's lemmatizer has an accuracy of 90.99% and 91.66% for the case-sensitive and case-insensitive settings, improvements of 5.67% and 5.60% respectively over the old Aksara. For the POS tagger, the new Aksara has an accuracy of 83%, a recall of 83%, and an F1 score of 83%, improvements of 7%, 7%, and 2% respectively over the old Aksara.
Keywords: informal language, lemmatization, noisy text, text normalization, POS tagging

Table of Contents
TABLE OF CONTENTS
TITLE PAGE
APPROVAL PAGE
STATEMENT OF ORIGINALITY
CERTIFICATION OF APPROVAL
ACKNOWLEDGMENTS
STATEMENT OF CONSENT OF ACADEMIC PUBLICATION
ABSTRACT
TABLE OF CONTENTS
List of Figures
List of Tables
List of Code
1 INTRODUCTION
  1.1 Background
  1.2 Problem
    1.2.1 Problem Definition
    1.2.2 Problem Scope
  1.3 Objectives
  1.4 Research Position
  1.5 Research Methodology
  1.6 Document Structure
2 LITERATURE REVIEW
  2.1 Noisy Text
  2.2 Types of Indonesian Noisy Text
    2.2.1 Elongation
    2.2.2 Vocal Removal
    2.2.3 Vocal Change
    2.2.4 Clipping
    2.2.5 Consonant Change
    2.2.6 Consonant Addition
    2.2.7 Contraction
    2.2.8 Monophthongization
    2.2.9 Repetition Removal
  2.3 Indonesian Noisy Text Dataset
  2.4 String Similarity
    2.4.1 Minimum Edit Distance
      2.4.1.1 Levenshtein Distance
      2.4.1.2 Damerau-Levenshtein Distance
    2.4.2 Subsequence
      2.4.2.1 Longest Common Subsequence
      2.4.2.2 Is Subsequence Problem
    2.4.3 SymSpell
  2.5 Context Dictionary
  2.6 NLP Stages and Evaluation
    2.6.1 Tokenization
    2.6.2 Lemmatization
    2.6.3 Part-of-Speech Tagging
    2.6.4 Morphological Analyzer
    2.6.5 Evaluation
      2.6.5.1 Confusion Matrix
      2.6.5.2 Accuracy
      2.6.5.3 Precision
      2.6.5.4 Recall
      2.6.5.5 F1 Score
  2.7 Universal Dependencies
    2.7.1 CoNNL-U Format
    2.7.2 UPOS Tagset
  2.8 Aksara
3 METHODOLOGY AND DESIGN
  3.1 Cases and Variations Identification
  3.2 Synthetic Data Generation
    3.2.1 Case Ranking
    3.2.2 Obtaining the List of Most Common Words
    3.2.3 Corruption Process to Obtain Synthetic Data
  3.3 Normalization Using the 5 Methods
  3.4 5 Methods’ Normalization Evaluation
  3.5 Integrating Noisy Text Normalization to Aksara
    3.5.1 Context Dictionary
    3.5.2 Contextual Normalization
  3.6 Aksara Evaluation
    3.6.1 Non-Contextual Text Normalizer without Lexicon Addition
    3.6.2 Contextual Text Normalizer without Lexicon Addition
    3.6.3 Contextual Text Normalizer with Lexicon Addition
    3.6.4 Aksara’s Overall Performance Evaluation
4 IMPLEMENTATION
  4.1 Cases and Variations Identification
  4.2 Synthetic Data Generation
    4.2.1 Colloquial Indonesian Lexicon Case Relabeling and Ranking
    4.2.2 Most Common Words
    4.2.3 Synthetic Word Corrupting Process
      4.2.3.1 Elongation
      4.2.3.2 Vocal Removal
      4.2.3.3 Vocal Change
      4.2.3.4 Clipping
      4.2.3.5 Consonant Change
      4.2.3.6 Consonant Addition
      4.2.3.7 Contraction
      4.2.3.8 Monophthongization
      4.2.3.9 Repetition Removal
  4.3 Normalization Using the 5 Methods
    4.3.1 Levenshtein Distance
    4.3.2 Damerau-Levenshtein Distance
    4.3.3 Simple Subsequence Comparison
    4.3.4 Longest Common Subsequence (LCS)
    4.3.5 SymSpell
  4.4 Integration
5 EVALUATION AND ANALYSIS
  5.1 Synthetic Data Generated
  5.2 5 Methods’ Normalization Evaluation
    5.2.1 Insertion Needed Cases Analysis
    5.2.2 Deletion Needed Case Analysis
    5.2.3 Substitution Needed Cases Analysis
    5.2.4 Substitution and Deletion Needed Case Analysis
  5.3 Aksara’s Overall Performance Evaluation
    5.3.1 Lemmatizer
    5.3.2 Part-of-Speech Tagger
6 CONCLUSION AND FUTURE RESEARCH
  6.1 Conclusion
  6.2 Future Research
Bibliography
APPENDIX
  Appendix 1
  Appendix 2
  Appendix 3
General Notes
Title
Handling Noisy Text to Improve Lemmatization and POS Tagging Accuracy for Informal Indonesian Text
Origin
Corporate Body
Student ID (NPM)
1906351013
English Abstract
ABSTRACT

Name: Erica Harlin
Study Program: Computer Science
Title: Handling Noisy Text to Improve Lemmatization and POS Tagging Accuracy for Informal Indonesian Text
Supervisors: Dr. Ika Alfina, S.Kom., M.Kom.; Arlisa Yuliawati, S.Kom., M.Kom.

Aksara is an Indonesian NLP tool that conforms to Universal Dependencies (UD) v2. The latest work on Aksara's informal language processing is Aksara v1.2, which focused on Aksara's ability to process informal root words and words with informal affixation. This work aims to enable Aksara to process noisy text. Five methods were considered for normalizing noisy text: (1) Levenshtein distance, (2) Damerau-Levenshtein distance, (3) subsequence comparison, (4) Longest Common Subsequence (LCS), and (5) SymSpell. To determine which method is best suited for this purpose, we built a synthetic dataset of 20,000 words, then measured and compared each method's performance in normalizing the synthetic data. The (method; accuracy) pairs obtained are as follows: (Levenshtein distance; 61.21), (Damerau-Levenshtein distance; 61.15), (subsequence comparison; 40.17), (LCS; 67.35), and (SymSpell; 68.5). SymSpell was chosen because it yields the highest accuracy. The chosen method, along with a context dictionary, is integrated into Aksara as a text normalizer. The version of Aksara produced by this research is Aksara v1.4 (the new Aksara). To evaluate the new Aksara's performance, a gold standard consisting of 152 sentences and 1786 tokens is used. The evaluation shows that the new Aksara's lemmatizer has an accuracy of 90.99% and 91.61% for the case-sensitive and case-insensitive settings, margins of 5.67% and 5.60% respectively over the previous Aksara. For the POS tagger, the new Aksara has an accuracy of 83%, a recall of 83%, and an F1 score of 83%, with margins of 7%, 7%, and 2% respectively over the previous Aksara.
Key words: informal language, lemmatization, noisy text, POS tagging, text normalization
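
The abstract compares normalization methods by their accuracy on pairs of noisy and original words. As a rough illustration only, the sketch below shows what a Levenshtein-based normalizer and its accuracy measurement could look like. The function names, the `max_distance` threshold, the tie-breaking rule, and the toy lexicon are all assumptions made for this example; they are not taken from the thesis or from Aksara.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(word: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Return the closest lexicon entry within max_distance, else the word itself.

    Hypothetical helper: ties go to the earlier lexicon entry.
    """
    best, best_d = word, max_distance + 1
    for candidate in lexicon:
        d = levenshtein(word, candidate)
        if d < best_d:
            best, best_d = candidate, d
    return best


def accuracy(pairs: list[tuple[str, str]], lexicon: list[str]) -> float:
    """Percentage of (noisy, gold) pairs whose noisy form normalizes to gold."""
    correct = sum(normalize(noisy, lexicon) == gold for noisy, gold in pairs)
    return 100.0 * correct / len(pairs)


# Toy data, invented for illustration (not the thesis's synthetic dataset).
toy_lexicon = ["tidak", "terima", "banget"]
toy_pairs = [("tdk", "tidak"), ("bangeeet", "banget"), ("xyzxyz", "terima")]
toy_accuracy = accuracy(toy_pairs, toy_lexicon)
```

The same `accuracy` loop could wrap any of the five methods by swapping the distance function, which is one way a like-for-like comparison of this kind can be arranged.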

Author 2
Subject
Examiner 2
Siti Aminah
Examiner 3
Supervisor 1
Arlisa Yuliawati
Physical Description
xvii, 88 pp., 30 cm
Language
eng
Graduation Semester
Odd Semester 2023
Publication
Depok: Faculty of Computer Science, Universitas Indonesia, 2023
Call No.
SK-2242 (Softcopy SK-1724)
Examiner 1
Alfan Farizki Wicaksono
Graduation Semester (SI)