NusaBERT Base
NusaBERT Base is a multilingual encoder-only language model based on the BERT architecture. We conducted continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:

- eval_accuracy: 0.6866
- eval_loss: 1.4876
- perplexity: 4.4266

This model was trained using the 🤗Transformers PyTorch framework. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-base is released under the Apache 2.0 license.
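The reported perplexity follows directly from the evaluation loss, since perplexity is the exponential of the mean cross-entropy loss. A minimal check:

```python
import math

# Perplexity = exp(eval_loss) for a mean cross-entropy loss.
eval_loss = 1.4876
perplexity = math.exp(eval_loss)
print(perplexity)  # matches the reported 4.4266 up to rounding
```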
Model Details
Use in 🤗Transformers
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
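Since this is a masked language model, the quickest way to try it is the `fill-mask` pipeline. The example sentence below is our own (it means "The capital of Indonesia is [MASK]." in Indonesian), not from the model card:

```python
from transformers import pipeline

# Hypothetical example sentence; the pipeline returns the top candidate
# tokens for the [MASK] position with their scores.
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
predictions = fill_mask("Ibu kota Indonesia adalah [MASK].")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```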
Training Datasets
Around 16B tokens from the corpora listed above (sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX) were used during pre-training.
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 24000
- training_steps: 500000
Credits
NusaBERT Base is developed with love by:

- Wilson Wongso
- David Samuel Setiawan
- Steven Limcorn
- Ananto Joyoadikusumo
Citation
```bibtex
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Files & Weights
| Filename | Size |
|---|---|
| model.safetensors | 0.41 GB |
| training_args.bin | 0.00 GB |