NusaBERT Base



NusaBERT Base is a multilingual encoder-based language model built on the BERT architecture. We conducted continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:

  • eval_accuracy: 0.6866
  • eval_loss: 1.4876
  • perplexity: 4.4266

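The reported perplexity is consistent with the evaluation loss, since perplexity is the exponential of the cross-entropy loss; a quick check:

python
import math

# Perplexity is exp(cross-entropy loss) for a language model.
eval_loss = 1.4876
print(f"{math.exp(eval_loss):.4f}")  # prints 4.4266, matching the value above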

This model was trained using the 🤗Transformers PyTorch framework. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-base is released under the Apache 2.0 license.

Model Details



  • Developed by: LazarusNLP
  • Finetuned from: IndoBERT base p1
  • Model type: Encoder-based BERT language model
  • Language(s): Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
  • License: Apache 2.0
  • Contact: LazarusNLP


Use in 🤗Transformers



python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

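As a usage illustration, the checkpoint also works with the fill-mask pipeline; the Indonesian example sentence below ("The capital of Indonesia is [MASK].") is an illustrative assumption, not taken from the card:

python
from transformers import pipeline

# Load the checkpoint into a fill-mask pipeline and predict the
# BERT-style [MASK] token in an illustrative Indonesian sentence.
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))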

Training Datasets

Around 16B tokens from the following corpora were used during pre-training (a loading sketch follows the list):

  • Indonesian Wikipedia Data Repository
  • KoPI-NLLB (Korpus Perayapan Indonesia)
  • Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages

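All three corpora are hosted on the Hugging Face Hub under the IDs given earlier, so they can be streamed with 🤗Datasets. This is a minimal sketch; the split and config names (e.g. the "id" language config for CulturaX) are assumptions that may need adjusting:

python
from datasets import load_dataset

# Stream each corpus from the Hub instead of downloading it in full.
# Config and split names are assumptions, not taken from the card.
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)
kopi_nllb = load_dataset("acul3/KoPI-NLLB", split="train", streaming=True)
culturax_id = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)

print(next(iter(culturax_id)))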

Training Hyperparameters

The following hyperparameters were used during training (see the TrainingArguments sketch after the list):

  • learning_rate: 0.0003
  • train_batch_size: 256
  • eval_batch_size: 256
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 24000
  • training_steps: 500000

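A hedged sketch of how these values map onto 🤗Transformers TrainingArguments; the output directory is an illustrative placeholder, since the card does not publish the full training script:

python
from transformers import TrainingArguments

# Mirror the hyperparameters listed above. The Adam betas and epsilon
# map onto the adam_* arguments; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="nusabert-base",
    learning_rate=3e-4,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)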

Framework versions



  • Transformers 4.37.2
  • PyTorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.1

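To verify that a local environment matches these pins:

python
# Print installed versions to compare against the list above.
import datasets
import tokenizers
import torch
import transformers

for module in (transformers, torch, datasets, tokenizers):
    print(module.__name__, module.__version__)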

Credits



NusaBERT Base is developed with love by Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo.



Citation



bib
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

    Files & Weights

    FilenameSizeAction
    model.safetensors 0.41 GB
    training_args.bin 0.00 GB