Abstract: |
The growing number of online discourses enables the development of emotion mining models using natural language processing techniques. However, language diversity and cultural disparity alter the sentiment orientation of words depending on the community and context. Therefore, this study investigates the impact of lexical and syntactic linguistic features on predicting the presence of two emotions, anger and anticipation, among Malaysian YouTube users. Term Frequency–Inverse Document Frequency (TF-IDF), unigrams, bigrams and part-of-speech tags were used as features to compare classification performance. The dataset contains 2500 YouTube comments by Malaysian users on 46 Covid-19-related videos. Comments were extracted from three prominent Malaysian-centric English news channels, Channel News Asia (CNA), The Star News, and New Straits Times, covering the period from 16 March 2020 to 30 April 2020 (i.e., the first lockdown phase). Six classification algorithms were tested: Random Forest, Support Vector Machine, Logistic Regression, Decision Tree, K-Nearest Neighbour and Multinomial Naïve Bayes. Support Vector Machine with TF-IDF performed best, achieving accuracies of 76% and 73% for anger and anticipation, respectively. |