Diacritical and Orthographic Modifications in Written Punjabi as Adversarial Attacks on NLP Systems: Challenges and Implications
Keywords:
Adversarial Attacks, Punjabi Language Processing, NLP Vulnerabilities, Script Normalization, Diacritical Manipulation, Dual-Script Punjabi (Gurmukhi and Shahmukhi)
Abstract
Adversarial attacks in Natural Language Processing (NLP) have predominantly targeted high-resource languages, leaving vulnerabilities in less-resourced languages relatively unexplored. This research investigates adversarial vulnerabilities in Punjabi, a digraphic language written in two distinct scripts: Gurmukhi and Shahmukhi. The study examines how intentional diacritical insertions, deletions, and orthographic modifications in Punjabi text can significantly impair NLP systems, degrading crucial tasks such as part-of-speech (POS) tagging, text classification, and machine translation. A structured methodology is employed to systematically generate adversarial examples in both scripts and to assess their impact on NLP model accuracy and key performance metrics such as F1 scores. The results demonstrate substantial performance degradation caused by minimal linguistic manipulations, highlighting critical vulnerabilities in current NLP systems. Consequently, this study underscores the importance of robust defensive measures, including script normalization, targeted adversarial training, and language-specific resilience strategies. Beyond its contribution to Punjabi NLP, this research offers insights applicable to adversarial machine learning more broadly, particularly for languages with complex orthographic and diacritical features.
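To illustrate the general class of perturbation discussed above, the sketch below shows a minimal character-level attack that randomly deletes existing diacritics and inserts spurious ones into Gurmukhi or Shahmukhi text. The diacritic inventories, probabilities, and function names are illustrative assumptions and do not reflect the paper's actual perturbation pipeline.

```python
import random
import unicodedata

# Hypothetical diacritic inventories; the paper's exact perturbation set is not given here.
GURMUKHI_DIACRITICS = {
    "\u0A3E", "\u0A3F", "\u0A40", "\u0A41", "\u0A42",  # vowel signs (matras)
    "\u0A47", "\u0A48", "\u0A4B", "\u0A4C",
    "\u0A02", "\u0A70", "\u0A71", "\u0A3C",            # bindi, tippi, addak, nukta
}
SHAHMUKHI_DIACRITICS = {
    "\u064E", "\u0650", "\u064F", "\u0651", "\u0652",  # zabar, zer, pesh, shadd, jazm
}

def perturb_diacritics(text, diacritics, p_delete=0.3, p_insert=0.1, seed=None):
    """Return an adversarial variant of `text` by deleting existing diacritics
    and inserting random ones after base letters."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in diacritics:
            # Deletion attack: drop the mark with probability p_delete.
            if rng.random() < p_delete:
                continue
            out.append(ch)
        else:
            out.append(ch)
            # Insertion attack: attach a random mark after a base letter.
            if ch.isalpha() and rng.random() < p_insert:
                out.append(rng.choice(sorted(diacritics)))
    return "".join(out)

# Example: perturb a Gurmukhi sentence before feeding it to a POS tagger or classifier.
original = "ਪੰਜਾਬੀ ਇੱਕ ਸੁੰਦਰ ਭਾਸ਼ਾ ਹੈ"
adversarial = perturb_diacritics(original, GURMUKHI_DIACRITICS, seed=42)
print(unicodedata.normalize("NFC", adversarial))
```

Such perturbations preserve much of the text's readability for human readers while shifting token boundaries and subword segmentations, which is why script normalization and adversarial training are proposed as defenses.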