NusaBERT Base



NusaBERT Base is a multilingual encoder-based language model built on the BERT architecture. We conducted continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:

  • eval_accuracy: 0.6866
  • eval_loss: 1.4876
  • perplexity: 4.4266

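The reported perplexity is consistent with the evaluation loss, since perplexity is the exponential of the cross-entropy loss; a quick check:

python
import math

# Perplexity is exp(cross-entropy loss) for a language model.
eval_loss = 1.4876
print(f"{math.exp(eval_loss):.4f}")  # prints 4.4266, matching the value above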

This model was trained using the 🤗Transformers PyTorch framework. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-base is released under the Apache 2.0 license.

Model Details



  • Developed by: LazarusNLP
  • Finetuned from: IndoBERT base p1
  • Model type: Encoder-based BERT language model
  • Language(s): Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
  • License: Apache 2.0
  • Contact: LazarusNLP


Use in 🤗Transformers



python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

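As a usage illustration, the checkpoint also works with the fill-mask pipeline; the Indonesian example sentence below ("The capital of Indonesia is [MASK].") is an illustrative assumption, not taken from the card:

python
from transformers import pipeline

# Load the checkpoint into a fill-mask pipeline and predict the
# BERT-style [MASK] token in an illustrative Indonesian sentence.
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))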

Training Datasets

Around 16B tokens from the following corpora were used during pre-training (a loading sketch follows the list):

  • Indonesian Wikipedia Data Repository
  • KoPI-NLLB (Korpus Perayapan Indonesia)
  • Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages

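All three corpora are hosted on the Hugging Face Hub under the IDs given earlier, so they can be streamed with 🤗Datasets. This is a minimal sketch; the split and config names (e.g. the "id" language config for CulturaX) are assumptions that may need adjusting:

python
from datasets import load_dataset

# Stream each corpus from the Hub instead of downloading it in full.
# Config and split names are assumptions, not taken from the card.
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)
kopi_nllb = load_dataset("acul3/KoPI-NLLB", split="train", streaming=True)
culturax_id = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)

print(next(iter(culturax_id)))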

Training Hyperparameters

The following hyperparameters were used during training (see the TrainingArguments sketch after the list):

  • learning_rate: 0.0003
  • train_batch_size: 256
  • eval_batch_size: 256
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 24000
  • training_steps: 500000

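A hedged sketch of how these values map onto 🤗Transformers TrainingArguments; the output directory is an illustrative placeholder, since the card does not publish the full training script:

python
from transformers import TrainingArguments

# Mirror the hyperparameters listed above. The Adam betas and epsilon
# map onto the adam_* arguments; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="nusabert-base",
    learning_rate=3e-4,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)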

Framework versions



  • Transformers 4.37.2
  • PyTorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.1

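To verify that a local environment matches these pins:

python
# Print installed versions to compare against the list above.
import datasets
import tokenizers
import torch
import transformers

for module in (transformers, torch, datasets, tokenizers):
    print(module.__name__, module.__version__)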

Credits



NusaBERT Base is developed with love by Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo.



Citation



bib
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

    Files & Weights

    FilenameSizeAction
    model.safetensors 0.41 GB
    training_args.bin 0.00 GB