
DistilCamemBERT-NLI
===================

We present DistilCamemBERT-NLI, a DistilCamemBERT model fine-tuned for the Natural Language Inference (NLI) task in French, also known as recognizing textual entailment (RTE). The model is trained on the XNLI dataset, where the goal is to determine whether a premise entails, contradicts, or neither entails nor contradicts a hypothesis.

This model is close to BaptisteDoyen/camembert-base-xnli, which is based on CamemBERT. The drawback of CamemBERT-based models appears at scale, for example in production: inference cost can become a real technological issue, especially for cross-encoding tasks such as this one. Thanks to DistilCamemBERT, this model cuts the inference time in half for the same power consumption.

Dataset
-------

The XNLI dataset from FLUE comprises 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A imply, contradict, or neither imply nor contradict sentence B?), a classification task over three labels. Sentence A is called the *premise* and sentence B the *hypothesis*; the model then estimates: $$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$
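For illustration, the XNLI data can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the `"xnli"`/`"fr"` configuration name and the label mapping come from the hub version of the dataset and may differ from the exact FLUE packaging.

```python
from datasets import load_dataset

# French portion of XNLI: fields are "premise", "hypothesis" and "label"
# (on the hub version: 0 = entailment, 1 = neutral, 2 = contradiction).
xnli_fr = load_dataset("xnli", "fr")

print(xnli_fr["train"].num_rows, xnli_fr["test"].num_rows)  # 392702, 5010
print(xnli_fr["test"][0])
```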

Evaluation results
------------------

| class         | precision (%) | f1-score (%) | support |
| :-----------: | :-----------: | :----------: | :-----: |
| global        | 77.70         | 77.45        | 5,010   |
| contradiction | 78.00         | 79.54        | 1,670   |
| entailment    | 82.90         | 78.87        | 1,670   |
| neutral       | 72.18         | 74.04        | 1,670   |

Benchmark
---------

We compare DistilCamemBERT-NLI to two other models for the French language. The first, BaptisteDoyen/camembert-base-xnli, is based on the aptly named CamemBERT, the French RoBERTa model; the second, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, is based on mDeBERTa-v3, a multilingual model. Performance is compared with accuracy and MCC (Matthews Correlation Coefficient). Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores.

| model                                   | time (ms) | accuracy (%) | MCC (x100) |
| :-------------------------------------: | :-------: | :----------: | :--------: |
| cmarkea/distilcamembert-base-nli        | 51.35     | 77.45        | 66.24      |
| BaptisteDoyen/camembert-base-xnli       | 105.0     | 81.72        | 72.67      |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 299.18    | 83.43        | 75.15      |
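For reference, both metrics can be computed with scikit-learn; the sketch below uses hypothetical gold and predicted label ids, since the exact evaluation script is not part of this card.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical placeholders: replace with the gold and predicted class ids
# obtained on the XNLI test set.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 2]

print(f"accuracy (%): {100 * accuracy_score(y_true, y_pred):.2f}")
print(f"MCC (x100):   {100 * matthews_corrcoef(y_true, y_pred):.2f}")
```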

Zero-shot classification
------------------------

The main advantage of such a model is that it can serve as a zero-shot classifier, allowing text classification without task-specific training. The task can be summarized by: $$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$
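As an illustration, the formula above can be reproduced by hand with the raw NLI model: each (premise, hypothesis) pair is scored by the cross-encoder, the entailment score is kept, and a softmax is taken over the candidate labels. The sketch below uses an invented movie-review premise; taking the raw entailment logit as the score and reading the entailment index from the model configuration are assumptions that mirror the behaviour of the zero-shot pipeline.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cmarkea/distilcamembert-base-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Index of the entailment class, read from the model configuration.
entail_id = next(i for i, name in model.config.id2label.items()
                 if "entail" in name.lower())

premise = "Le film était magnifique et les acteurs excellents."  # invented example
candidate_labels = ["positif", "négatif"]
hypotheses = [f"Ce commentaire est {label}." for label in candidate_labels]

# Score each (premise, hypothesis) pair with the cross-encoder.
inputs = tokenizer([premise] * len(hypotheses), hypotheses,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax of the entailment scores over the candidate labels, as in the formula above.
probs = torch.softmax(logits[:, entail_id], dim=0)
print(dict(zip(candidate_labels, probs.tolist())))
```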

For this part, we use two datasets. The first is allocine, which is used to train sentiment analysis models. It comprises two classes, "positif" and "négatif", describing movie reviews. Here we use "Ce commentaire est {}." as the hypothesis template and "positif" and "négatif" as candidate labels.

| model                                   | time (ms) | accuracy (%) | MCC (x100) |
| :-------------------------------------: | :-------: | :----------: | :--------: |
| cmarkea/distilcamembert-base-nli        | 195.54    | 80.59        | 63.71      |
| BaptisteDoyen/camembert-base-xnli       | 378.39    | 86.37        | 73.74      |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 520.58    | 84.97        | 70.05      |

The second is mlsum, which is used to train summarization models. For this purpose, we aggregate its sub-topics and select a few of them, then use the summary part of the articles to predict their topics. In this case, the hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport", and "science".

| model                                   | time (ms) | accuracy (%) | MCC (x100) |
| :-------------------------------------: | :-------: | :----------: | :--------: |
| cmarkea/distilcamembert-base-nli        | 217.77    | 79.30        | 70.55      |
| BaptisteDoyen/camembert-base-xnli       | 448.27    | 70.7         | 64.10      |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 591.34    | 64.45        | 58.67      |

How to use DistilCamemBERT-NLI
------------------------------
```python
from transformers import pipeline

classifier = pipeline(
    task='zero-shot-classification',
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
              "se reconnaît entre autres par sa narration postmoderne "
              "et non linéaire, ses dialogues travaillés souvent "
              "émaillés de références à la culture populaire, et ses "
              "scènes hautement esthétiques mais d'une violence "
              "extrême, inspirées de films d'exploitation, d'arts "
              "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma", "littérature", "technologie", "politique"],
 "scores": [0.7164115309715271, 0.12878799438476562, 0.1092301607131958, 0.0455702543258667]}
```


Optimum + ONNX
--------------



```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-nli"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```
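
The quantized model can then be dropped into the same zero-shot pipeline as above, for example (the variable name is arbitrary):

```python
onnx_quantized_classifier = pipeline(
    "zero-shot-classification", model=quantized_model, tokenizer=tokenizer
)
```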


Citation
--------

```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```
