Natural Language Processing and Representation Learning Lab

Our Members

Faculty Member

  • Assoc. Prof. Dr. Sarana Nutanong (รศ.ดร.สรณะ นุชอนงค์)

Secretary

  • Onarin Lapvetee (อรอลิน ลาภเวที)

Financial and Budgeting

  • Kwanchat Potcharawongsakul (ขวัญฉัตร พชรวงศ์สกุล)

Postdoctoral Researchers

  • Dr. Raheem Sarwar
  • Dr. Nat Dilokthanakul (ดร. ณัฏฐ์ ดิลกธนากุล)
  • Dr. Naravut Suvannang

Software Engineers

  • Benjapol Worakan (เบญจพล วรกัลป์)
  • Kasidit Phoncharoen (กษิดิศ ผลเจริญ)
  • Treephop Saeteng (นายตรีภพ แซ่เต็ง)

Data Engineer

  • Rattasat Laotaew (รัฐศาสตร์ เหลาแตว)

Research Assistants

  • Jilamika Wongpithayadisai (จิลมิกา วงศ์พิทยาดิศัย)
  • Songpon Srisawai (นายทรงพล ศรีไสว)

PhD Students

  • Bundit Boonyarit (นายบัณฑิต บุญยฤทธิ์)
  • Chaniakarn Nikunram (นางสาวชนิกานต์ นิกูลรัมย์)
  • Kanatip Chitavisutthivong (นายคณาธิป จิตตวิสุทธิวงศ์)
  • Krissanee Kamthawee (นางสาวกฤษณี คำทวี)
  • Norawit Urailertprasert (นายนรวิชญ์ อุไรเลิศประเสริฐ)
  • Pattaramanee Arsomngern (นางสาวภัทรมณี อาศรมเงิน)
  • Peerat Limkonchotiwat (นายพีรัชต์ ลิ้มกรโชติวัฒน์)
  • Nattapol Trijakwanich (นาย ณัฐพล ไตรจักร์วนิช)
  • Sukrit Sriratanawilai (นาย สุกฤษฏิ์ ศรีรัตนะวิไล)

Research Topics

Data Delivery in Unmanned Aerial Vehicles
This project develops a stochastic optimization model for collecting aerial data and transmitting it to ground infrastructure for immediate analysis. The model plans drones and networks jointly to minimize costs such as operating cost, energy consumption, and time.

Data-driven Drug Discovery
This project aims to develop a drug discovery platform for finding lead compounds, with a focus on GPCR and kinase drug targets. We apply machine learning and structural bioinformatics approaches to biomolecular structure data. These approaches can increase drug screening accuracy, which in turn can reduce the time and cost of the pre-clinical and clinical stages of drug discovery.

Sentiment Analysis
Sentiment analysis on social media is challenging because of noisy terms in the form of slang, abbreviations, acronyms, emoticons, and spelling errors, coupled with issues of data availability. Several studies have nonetheless reported good accuracy for resource-rich languages. The task is more challenging for low-resource languages such as Thai, which lack reliable natural language processing tools.

Machine Translation
Sentence matching is widely used in natural language tasks such as natural language inference, paraphrase identification, and question answering. These tasks require understanding the logical and semantic relationship between two sentences, which remains challenging. Estimating distances between sentences is at the core of these tasks. A distance metric for text data, known as Word Mover’s Distance (WMD), was recently proposed. WMD is derived directly from optimal transport (OT) theory and is, in fact, an implementation of the Wasserstein distance (also known as Earth Mover’s Distance) for textual data.
For WMD, a source and a target text span are expressed as high-dimensional probability densities through the bag-of-words representation. Given the two densities, OT finds the map (or transport plan) that minimizes the total cost of transporting the first density to the second under a given ground metric. For text data, the ground metric can be estimated using word embeddings.
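The idea can be illustrated with a minimal sketch. Solving the full transport problem needs a linear programming solver, so the code below computes the relaxed WMD instead, a known lower bound on WMD in which every word simply sends all of its mass to its nearest word in the other text. The two-dimensional embeddings are hypothetical values invented for illustration; a real system would use trained word embeddings.

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings, invented for illustration only.
EMBEDDINGS = {
    "president": (1.0, 0.0),
    "leader": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "talks": (0.1, 0.9),
    "banana": (-1.0, -1.0),
}


def bag_of_words(tokens):
    """Normalized bag-of-words: maps each word to its probability mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def ground_cost(w1, w2):
    """Ground metric: Euclidean distance between the two word embeddings."""
    return math.dist(EMBEDDINGS[w1], EMBEDDINGS[w2])


def one_sided_cost(source, target):
    """Cost when each source word moves all its mass to its nearest target word."""
    return sum(
        mass * min(ground_cost(word, other) for other in target)
        for word, mass in source.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD: the larger of the two one-sided costs (a lower bound on WMD)."""
    a, b = bag_of_words(tokens_a), bag_of_words(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


close = relaxed_wmd(["president", "speaks"], ["leader", "talks"])
far = relaxed_wmd(["president", "speaks"], ["banana"])
print(f"paraphrase-like pair: {close:.3f}, unrelated pair: {far:.3f}")
```

The two texts in the first pair share no words, yet their distance is small because the embeddings of their words are close. This is the property that makes transport-based distances attractive for sentence matching.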

Digital Text Forensics
Digital text forensics aims to examine the originality and credibility of information in electronic documents and, to that end, to extract and analyze information about the authors of these documents. Authorship identification is one of the important problems in digital text forensics and can be defined as follows: “Given an anonymous text x, a set of candidate authors Y, and their writing samples X, identify the most likely author of x in Y by analyzing the writing samples in X and comparing them with x.” Authorship identification problems fall into two main types: (i) closed-set authorship identification and (ii) open-set authorship identification. The closed-set problem assumes that the true author of an anonymous document is included in the candidate author set.
Open-set authorship identification is the more realistic case. It considers the possibility that none of the candidate authors is the true author of the anonymous document. When the true author is not in the candidate author set, an accurate solution should not attribute the document to any of the candidate authors. Authorship identification solutions naturally serve different subareas of digital text forensics, such as criminal law, e.g., identifying the writers of harassing letters or ransom notes; intelligence work, e.g., linking intercepted messages to known terrorists or enemies; and civil law, e.g., resolving estate disputes or copyright issues.
Another important variation is cross-lingual authorship identification, where the language of the anonymous document differs from that of the candidate authors’ writing samples. Nowadays, users may participate in several platforms regardless of language. For example, an Italian user may have a blog in Italian, post primarily in English on Facebook, and publish articles in both languages. In addition, people are becoming increasingly proficient in more than one language; it has been shown that more than half of the world population is bilingual. Consequently, there is a substantial need for cross-lingual authorship identification solutions.
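The difference between the closed-set and open-set settings can be sketched with a deliberately simple baseline. The code below is not the lab's method: it compares character trigram profiles by cosine similarity, and the function names, sample texts, and the threshold value of 0.2 are all assumptions made for illustration. In closed-set mode it always returns the best-scoring candidate; in open-set mode it returns None when even the best score falls below the threshold.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts, a common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify_author(x, samples, threshold=None):
    """Return the most likely author of x among the keys of `samples`.

    samples maps each candidate author to a list of writing samples.
    threshold=None gives closed-set behaviour (always pick a candidate).
    A numeric threshold gives open-set behaviour (None means "no candidate").
    """
    query = char_ngrams(x)
    scores = {
        author: cosine(query, char_ngrams(" ".join(texts)))
        for author, texts in samples.items()
    }
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best


# Hypothetical writing samples, invented for illustration only.
samples = {
    "alice": ["the cat sat on the mat", "the cat ate the rat"],
    "bob": ["quarterly revenue exceeded projections",
            "revenue projections were revised"],
}
print(identify_author("the cat sat on the rat", samples, threshold=0.2))
print(identify_author("zzzz qqqq xxxx", samples))                 # closed-set
print(identify_author("zzzz qqqq xxxx", samples, threshold=0.2))  # open-set
```

The last two calls show why the open-set case matters: the closed-set call still names a candidate for a text that resembles none of them, while the open-set call declines to attribute it.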

Text-to-SQL or Natural Language-to-SQL (NL2SQL)
Learning to generate queries in languages such as SQL from natural language utterances, known as the Text-to-SQL task, is an important sub-task of semantic parsing in natural language processing (NLP). It translates natural language questions into the corresponding SQL queries, helping users who are unfamiliar with SQL to query and analyze data in a database easily.
In recent years, most existing approaches have achieved high accuracy on simple text-to-SQL benchmarks such as WikiSQL. However, complex and cross-domain text-to-SQL tasks remain challenging, especially for a low-resource language such as Thai.
This work aims to develop a model that can synthesize correct and complex SQL queries from Thai questions, including queries with multiple clauses or sub-queries, and that generalizes to unseen databases.
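To make the target of the task concrete, the sketch below shows the kind of input and output pair involved, using Python's built-in sqlite3 module. The schema, the rows, the question (written in English here as a stand-in for a Thai question), and the query are all hypothetical. The query is written by hand to show what a model would be expected to produce; no model is involved.

```python
import sqlite3

# Hypothetical database, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    salary REAL,
    dept_id INTEGER REFERENCES department(id)
);
INSERT INTO department VALUES (1, 'Research'), (2, 'Finance');
INSERT INTO employee VALUES
    (1, 'Ann', 90.0, 1), (2, 'Ben', 60.0, 1),
    (3, 'Cho', 70.0, 2), (4, 'Dan', 50.0, 2);
""")

# Input: a natural language question.
question = "Which departments have an average salary above the company-wide average?"

# Expected output: a complex query with a join, grouping, a HAVING clause,
# and a sub-query. Hand-written here, not produced by a model.
target_sql = """
SELECT d.name
FROM department AS d
JOIN employee AS e ON e.dept_id = d.id
GROUP BY d.id
HAVING AVG(e.salary) > (SELECT AVG(salary) FROM employee)
ORDER BY d.name
"""

rows = conn.execute(target_sql).fetchall()
print(question)
print(rows)  # [('Research',)]
```

Simple benchmarks mostly need a single SELECT with a WHERE clause over one table. A query like this one combines several clauses and depends on the schema of the database, which is what makes the complex, cross-domain setting hard.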

Similarity Search
Set representation learning for document retrieval task
A set is a type of data that forms a collection of items, such as the set of words appearing in a document, which is used in document retrieval. However, computing the similarity between the raw word sets of documents is computationally expensive. Instead, we address this problem with a deep learning approach. Applying deep learning to sets is difficult because a set is permutation invariant: its meaning does not depend on the order of its elements, whereas a traditional neural network treats each position of its input vector differently (i.e., it is order-sensitive). For this reason, we try to design an end-to-end neural network that represents sets correctly without using a representation-based objective function, with the goal of maximizing accuracy on retrieval tasks.
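The permutation invariance property can be shown with a small sketch. One well-known way to obtain it is to transform each element independently and then sum the results, as in Deep Sets-style pooling. This is given only as an illustration of the property and is not the architecture the lab is designing. The embeddings and both encoder functions are hypothetical.

```python
import math

# Hypothetical 2-D word embeddings, invented for illustration only.
EMB = {"deep": (0.2, 0.7), "set": (0.9, 0.1), "learning": (0.4, 0.4)}


def phi(vector):
    """Per-element transform, applied to each item independently."""
    return tuple(math.tanh(x) for x in vector)


def set_encoder(words):
    """Sum-pool the transformed elements; a sum ignores element order.

    math.fsum returns a correctly rounded sum, so the result is identical
    for every ordering of the inputs, even in floating point.
    """
    transformed = [phi(EMB[w]) for w in words]
    return tuple(math.fsum(column) for column in zip(*transformed))


def positional_encoder(words):
    """Order-sensitive baseline: each element is weighted by its position."""
    out = [0.0, 0.0]
    for position, word in enumerate(words, start=1):
        for k, x in enumerate(EMB[word]):
            out[k] += position * x
    return tuple(out)


doc = ["deep", "set", "learning"]
shuffled = ["learning", "set", "deep"]
print(set_encoder(doc) == set_encoder(shuffled))                # True
print(positional_encoder(doc) == positional_encoder(shuffled))  # False
```

The order-sensitive encoder assigns two different representations to the same set of words, which is the behaviour a set representation must avoid.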

Speech Emotion Recognition
We develop machine learning models that recognize emotions from subtle nuances in speech utterances, such as tone and pitch. Speech emotion recognition has a variety of applications, such as human-computer interaction and emotional monitoring.


Retrieving real video from DeepFake
DeepFake can be used in harmful ways, for example to deliver misleading political speeches or for cyberbullying. What is a DeepFake? Basically, it is a video generated by an “AI” (e.g., Face2Face, FaceSwap, DeepFake). Generating a new DeepFake video requires an original video as the source for the AI. By retrieving the original video from a DeepFake, we can show that the video has been manipulated and help combat DeepFake videos.

Tuneable Earth Mover’s Distance
Earth Mover’s Distance (EMD), also known as the Wasserstein metric in statistics, is an effective similarity metric between two distributions. EMD perceives the difference between two distributions in much the same way as humans do. However, its constraints prescribe a perfect matching between the distributions, which may include outliers. We therefore propose a new distance whose properties lie between those of EMD and the Hausdorff distance (HD). In addition, the ratio between EMD and HD properties can be tuned. We can guarantee that our distance performs no worse than either of the two distances in both normal and noisy environments.
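For reference, the two distances that the proposed distance sits between can be computed in a few lines in a simplified setting. The sketch below assumes one-dimensional point sets of equal size with uniform weights, where EMD reduces to matching the sorted points. It does not implement the proposed tuneable distance, whose definition is not given here, and the function names and data are hypothetical.

```python
def emd_1d(a, b):
    """EMD between two equal-size 1-D point sets with uniform weights.

    In one dimension the optimal perfect matching pairs the sorted points,
    so the distance is the mean absolute difference after sorting.
    """
    if len(a) != len(b):
        raise ValueError("this simplified version needs equal-size sets")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def hausdorff_1d(a, b):
    """Hausdorff distance: the largest nearest-neighbour gap in either direction."""
    def directed(p, q):
        return max(min(abs(x - y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


clean = [0.0, 1.0, 2.0, 3.0]
noisy = [0.0, 1.0, 2.0, 13.0]  # same set with one point moved far away

print(emd_1d(clean, clean), hausdorff_1d(clean, clean))  # 0.0 0.0
print(emd_1d(clean, noisy), hausdorff_1d(clean, noisy))  # 2.5 10.0
```

In this example a single outlier is enough to change both distances, because EMD must match every point, including the outlier, and the Hausdorff distance is determined entirely by the worst nearest-neighbour gap.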

Approximate Kernel Density Estimation using Locality Sensitive Hashing Technique
Kernel density estimation (KDE) is a cornerstone of many machine learning algorithms. However, in large-scale settings, computing the KDE exactly is very expensive. To deal with this challenge, approximation approaches based on Locality Sensitive Hashing (LSH) are used to solve the KDE in sub-linear time. In this work, we propose a novel method for selecting LSH parameters that guarantees the approximation error.
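The cost being addressed is visible in a minimal sketch of the exact computation. The code below assumes a one-dimensional Gaussian kernel and uses hypothetical data and a hypothetical bandwidth. It shows only the exact baseline that LSH-based methods approximate, not the proposed parameter selection method.

```python
import math


def exact_gaussian_kde(query, data, bandwidth):
    """Exact 1-D Gaussian KDE at a single query point.

    Every data point contributes one kernel evaluation, so each query
    costs time linear in len(data). This is the cost that sub-linear
    approximations aim to avoid.
    """
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((query - x) / bandwidth) ** 2) for x in data
    )


# Hypothetical data: a cluster near 0 and a single point near 5.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 5.0]
dense = exact_gaussian_kde(0.0, data, bandwidth=0.5)
sparse = exact_gaussian_kde(5.0, data, bandwidth=0.5)
print(f"density near the cluster: {dense:.3f}, near the lone point: {sparse:.3f}")
```

With n data points and m queries the exact computation needs n times m kernel evaluations, which is why approximate methods with guaranteed error are of interest at large scale.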

Publications

Ngoenriang, N., Sawadsitang, S., Leangsuksun, C., Niyato, D., & Tan, P. S., “Joint Vehicle Routing and Loading in Delivery Planning: A Stochastic Programming Approach.” In 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring) (pp. 1-5). IEEE.

Ngoenriang, N., Nutanong, S., & Niyato, D., “Joint Task Allocation and Data Delivery Framework for Unmanned Aerial Vehicles in Aerial Plant Inspection.” In 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall) (pp. 1-5). IEEE.

Sarwar, Raheem, Thanasarn Porthaveepong, Attapol Rutherford, Thanawin Rakthanmanon, and Sarana Nutanong. “StyloThai: A Scalable Framework for Stylometric Authorship Identification of Thai Documents.” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19, no. 3 (2020): 1-15.

Sarwar, Raheem, Norawit Urailertprasert, Nattapol Vannaboot, Chenyun Yu, Thanawin Rakthanmanon, Ekapol Chuangsuwanich, and Sarana Nutanong. “CAG: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph.” IEEE Access 8 (2020): 18374-18393.

Sarwar, Raheem, Attapol Rutherford, Saeed-Ul Hassan, Thanawin Rakthanmanon, and Sarana Nutanong. “Native Language Identification of Fluent and Advanced Non-native Writers.” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) [Accepted]

Sarwar, Raheem, Chenyun Yu, Ninad Tungare, Kanatip Chitavisutthivong, Sukrit Sriratanawilai, Yaohai Xu, Dickson Chow, Thanawin Rakthanmanon, and Sarana Nutanong. “An effective and scalable framework for authorship attribution query processing.” IEEE Access 6 (2018): 50030-50048.