In the IndoNLU project, we introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites.

Our goal in this project is to enable Indonesian NLP researchers and enthusiasts to access the latest deep learning technology in NLP, with a large pre-training corpus and large pre-trained models. Furthermore, we envision that our work can enable future collaboration among Indonesian NLP researchers and resonate even further by inviting more and more people to collaborate in the advancement of Indonesian NLP research. By doing so, we believe it will bring Indonesian NLP research to the next level.

We built the IndoNLU framework along with the benchmark, a large-scale pre-training dataset, and large pre-trained models. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, enabling everyone to benchmark their system's performance. We built the framework from scratch using PyTorch and HuggingFace. We collected the twelve benchmark tasks from multiple published sources. For the pre-training dataset, we collected data from 15 publicly available sources. For the large pre-trained models, we trained BERT and ALBERT models with the official code, converted the weights into the PyTorch model format, and hosted the models on the HuggingFace platform.

We faced many challenges in the process of making this project. First, in terms of models, we lacked the computational resources to build large pre-trained models, and we managed to solve this through collaboration with many parties. Second, in terms of the benchmark tasks and pre-training corpus, we had issues collecting tasks and a pre-training corpus for Bahasa Indonesia, because the data is scattered and some sources are hard to access.

What I learned

This IndoNLU benchmark has helped, and will continue to help, many Indonesian researchers do NLP research in Bahasa Indonesia. The resources provided, the models and the datasets, have also inspired others to build better models and assemble more Bahasa Indonesia datasets. Moreover, the documentation of this research was accepted at AACL-IJCNLP 2020, published as the one and only Indonesian research paper at that top conference. In short, we are proud to contribute to Indonesian researchers, as we are also proud to represent Indonesia in presenting the research paper "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding" at AACL-IJCNLP 2020.
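The single-sentence classification tasks in the benchmark follow the usual pattern of a pre-trained encoder plus a task-specific head, fine-tuned in PyTorch. Here is a minimal sketch of that setup, with a toy mean-pooling encoder standing in for IndoBERT; all class names, sizes, and the training step are illustrative assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real IndoBERT models are far larger.
VOCAB_SIZE, HIDDEN, NUM_CLASSES = 1000, 32, 3

class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained encoder such as IndoBERT."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)

    def forward(self, token_ids):
        # Mean-pool token embeddings into a single sentence vector.
        return self.embed(token_ids).mean(dim=1)

class SentenceClassifier(nn.Module):
    """Encoder + linear head, the shape of a single-sentence task model."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))

model = SentenceClassifier(ToyEncoder())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One fine-tuning step on a dummy batch of 4 "sentences" of 8 token ids.
tokens = torch.randint(0, VOCAB_SIZE, (4, 8))
labels = torch.tensor([0, 1, 2, 1])
logits = model(tokens)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([4, 3])
```

In practice, the toy encoder would be replaced with one of the released IndoBERT checkpoints loaded through HuggingFace's `transformers` library (e.g. via `AutoModel.from_pretrained` with the `indobenchmark` model IDs published on the HuggingFace hub).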