DistilCamemBERT-NLI
===================
We present DistilCamemBERT-NLI, a version of DistilCamemBERT fine-tuned for Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), in French. The model is trained on the XNLI dataset, in which the task is to determine whether a premise entails, contradicts, or neither entails nor contradicts a hypothesis.

This model is similar to BaptisteDoyen/camembert-base-xnli, which is based on CamemBERT. The drawback of CamemBERT-based models appears when scaling, for example in production: inference cost can become a technical bottleneck, especially for cross-encoding tasks such as this one. To mitigate this, we propose this model, which, thanks to DistilCamemBERT, halves the inference time for the same power consumption.

Dataset
-------

The XNLI dataset from FLUE comprises 392,702 premise-hypothesis pairs for training and 5,010 pairs for testing. The goal is to predict textual entailment (does sentence A entail, contradict, or neither entail nor contradict sentence B?), which is a classification task (given two sentences, predict one of three labels). Sentence A is called the *premise* and sentence B the *hypothesis*, so the modeling objective is:

$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$

Evaluation results
------------------

| class | precision (%) | f1-score (%) | support |
| :----------------: | :---------------: | :--------------: | :---------: |
| global | 77.70 | 77.45 | 5,010 |
| contradiction | 78.00 | 79.54 | 1,670 |
| entailment | 82.90 | 78.87 | 1,670 |
| neutral | 72.18 | 74.04 | 1,670 |
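
The table reports precision and F1 but not recall. Since F1 is the harmonic mean of precision and recall, recall can be recovered from the two reported columns. The sketch below does this for the three classes; the recovered values are approximate because the inputs are rounded to two decimals. With equal support per class (1,670), the unweighted mean of the per-class recalls should equal the overall accuracy, and it does come out at about 77.45%, in line with the accuracy reported for this model in the benchmark.

```python
# Per-class precision and F1 (in %) copied from the evaluation table.
REPORTED = {
    "contradiction": (78.00, 79.54),
    "entailment": (82.90, 78.87),
    "neutral": (72.18, 74.04),
}


def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to obtain R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)


recalls = {
    label: recall_from_precision_f1(p, f1) for label, (p, f1) in REPORTED.items()
}

# All three classes have the same support, so the unweighted mean of the
# per-class recalls is the overall accuracy (up to rounding of the inputs).
mean_recall = sum(recalls.values()) / len(recalls)

for label, recall in recalls.items():
    print(f"{label:>13}: recall ~ {recall:.2f}%")
print(f"  mean recall: ~ {mean_recall:.2f}%")
```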

Benchmark
---------

We compare DistilCamemBERT-NLI with two other models that handle French. The first, BaptisteDoyen/camembert-base-xnli, is based on the aptly named CamemBERT, the French RoBERTa model. The second, MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, is based on mDeBERTa-v3, a multilingual model. Performance is compared using accuracy and MCC (Matthews Correlation Coefficient). Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores.

| model | time (ms) | accuracy (%) | MCC (x100) |
| :--------------: | :-----------: | :--------------: | :------------: |
| cmarkea/distilcamembert-base-nli | 51.35 | 77.45 | 66.24 |
| BaptisteDoyen/camembert-base-xnli | 105.0 | 81.72 | 72.67 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 299.18 | 83.43 | 75.15 |
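
For readers unfamiliar with the two metrics, here is a minimal sketch of how accuracy and the multiclass MCC can be computed from a confusion matrix. The confusion matrix below is a made-up toy example, not the one behind the benchmark, and the function names are ours.

```python
import math


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def multiclass_mcc(confusion):
    """Multiclass MCC; confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[i][k] for i in range(n)) for k in range(n)]
    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    # Convention: return 0 when the score is undefined (zero denominator).
    return numerator / denominator if denominator else 0.0


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [1, 2, 7],
]
toy_accuracy = accuracy(toy_confusion)
toy_mcc = multiclass_mcc(toy_confusion)
print(f"accuracy (%): {100 * toy_accuracy:.2f}, MCC (x100): {100 * toy_mcc:.2f}")
```

Unlike accuracy, MCC is 0 for a classifier that predicts independently of the true label, which is why it is reported alongside accuracy.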

Zero-shot classification
------------------------

The main advantage of such a model is that it can serve as a zero-shot classifier, allowing text classification without any task-specific training. The task can be summarized by:

$$P(hypothesis=i\in\mathcal{C}\vert premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$
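
The formula is a softmax over the entailment scores obtained for each candidate label. A minimal sketch follows; the function name and the input scores are hypothetical. To our knowledge, the `transformers` zero-shot pipeline in single-label mode applies this softmax to the entailment logits rather than to probabilities, but we have not verified this for every version, so treat it as an assumption.

```python
import math


def zero_shot_scores(entailment_scores):
    """Softmax over per-label entailment scores, following the formula."""
    # Subtracting the maximum before exponentiating improves numerical
    # stability and does not change the result.
    shift = max(entailment_scores.values())
    exps = {
        label: math.exp(score - shift)
        for label, score in entailment_scores.items()
    }
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}


# Hypothetical entailment scores for one premise and three candidate labels.
scores = zero_shot_scores({"sport": 0.90, "politique": 0.20, "science": 0.10})
best = max(scores, key=scores.get)
print(best, scores)
```

The normalized scores sum to one and keep the ranking of the raw entailment scores.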

For this part, we use two datasets. The first is allocine, normally used to train sentiment analysis models. It comprises movie reviews in two classes: "positif" and "négatif". We use "Ce commentaire est {}." as the hypothesis template and "positif" and "négatif" as candidate labels.

| model | time (ms) | accuracy (%) | MCC (x100) |
| :--------------: | :-----------: | :--------------: | :------------: |
| cmarkea/distilcamembert-base-nli | 195.54 | 80.59 | 63.71 |
| BaptisteDoyen/camembert-base-xnli | 378.39 | 86.37 | 73.74 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 520.58 | 84.97 | 70.05 |
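
To make the role of the hypothesis template concrete, the sketch below builds the premise-hypothesis pairs that an NLI model would score for one review. The helper name and the review text are hypothetical; only the template and the labels come from the setup described here.

```python
def build_nli_pairs(premise, candidate_labels, hypothesis_template):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [
        (premise, hypothesis_template.format(label))
        for label in candidate_labels
    ]


# Hypothetical review text, for illustration only.
review = "Un film magnifique, porté par des acteurs remarquables."
pairs = build_nli_pairs(review, ["positif", "négatif"], "Ce commentaire est {}.")
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each candidate label yields one pair to score, so the cost of zero-shot classification grows with the number of labels.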

The second is mlsum, normally used to train summarization models. For this purpose, we aggregate sub-topics and select a few of them. We use the summary of each article to predict its topic. Here, the hypothesis template is "C'est un article traitant de {}." and the candidate labels are "économie", "politique", "sport" and "science".

| model | time (ms) | accuracy (%) | MCC (x100) |
| :--------------: | :-----------: | :--------------: | :------------: |
| cmarkea/distilcamembert-base-nli | 217.77 | 79.30 | 70.55 |
| BaptisteDoyen/camembert-base-xnli | 448.27 | 70.7 | 64.10 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 591.34 | 64.45 | 58.67 |
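
A topic-prediction experiment of this kind can be sketched as follows. This is not the evaluation script behind the table: the helper names, the stand-in classifier and the placeholder summaries are all hypothetical. The stand-in mimics the call signature and output format of a `transformers` zero-shot pipeline so that the sketch runs without downloading a model.

```python
MLSUM_TEMPLATE = "C'est un article traitant de {}."
MLSUM_LABELS = ["économie", "politique", "sport", "science"]


def predict_topic(classifier, summary):
    """Return the top-ranked label for one article summary."""
    result = classifier(
        sequences=summary,
        candidate_labels=MLSUM_LABELS,
        hypothesis_template=MLSUM_TEMPLATE,
    )
    # The pipeline returns labels sorted by decreasing score.
    return result["labels"][0]


def topic_accuracy(classifier, summaries, gold_topics):
    """Fraction of summaries whose predicted topic matches the gold topic."""
    predictions = [predict_topic(classifier, s) for s in summaries]
    hits = sum(p == g for p, g in zip(predictions, gold_topics))
    return hits / len(gold_topics)


def fake_classifier(sequences, candidate_labels, hypothesis_template):
    """Stand-in for the real pipeline: always ranks "sport" first."""
    ranked = sorted(candidate_labels, key=lambda label: label != "sport")
    return {"sequence": sequences, "labels": ranked, "scores": [1.0, 0.0, 0.0, 0.0]}


# Placeholder summaries and gold topics, for illustration only.
demo_accuracy = topic_accuracy(
    fake_classifier,
    ["Résumé A (hypothétique)", "Résumé B (hypothétique)"],
    ["sport", "politique"],
)
print(demo_accuracy)
```

To run it against the real model, pass a `pipeline("zero-shot-classification", model="cmarkea/distilcamembert-base-nli")` object in place of `fake_classifier`.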

How to use DistilCamemBERT-NLI
------------------------------

```python
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-classification",
    model="cmarkea/distilcamembert-base-nli",
    tokenizer="cmarkea/distilcamembert-base-nli"
)
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
              "se reconnaît entre autres par sa narration postmoderne "
              "et non linéaire, ses dialogues travaillés souvent "
              "émaillés de références à la culture populaire, et ses "
              "scènes hautement esthétiques mais d'une violence "
              "extrême, inspirées de films d'exploitation, d'arts "
              "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.7164115309715271,
            0.12878799438476562,
            0.1092301607131958,
            0.0455702543258667]}
```
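
The returned dictionary lists the labels in decreasing order of score, and the scores sum to one. A small sketch of how such a result might be consumed is given below. The dictionary is copied from the output above; the helper name and the 0.5 threshold are arbitrary choices of ours.

```python
# Output dictionary copied from the example above.
result = {
    "labels": ["cinéma", "littérature", "technologie", "politique"],
    "scores": [
        0.7164115309715271,
        0.12878799438476562,
        0.1092301607131958,
        0.0455702543258667,
    ],
}


def labels_above(result, threshold):
    """Return (label, score) pairs whose score is at least `threshold`."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


top_label, top_score = result["labels"][0], result["scores"][0]
confident = labels_above(result, threshold=0.5)  # arbitrary threshold
total = sum(result["scores"])
print(top_label, round(top_score, 3), confident)
```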

### Optimum + ONNX

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-nli"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```

Citation
--------

```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```

Files & Weights
---------------

| Filename | Size |
|---|---|
| model.onnx | 0.25 GB |
| model.safetensors | 0.25 GB |
| model_quantized.onnx | 0.13 GB |
| pytorch_model.bin | 0.25 GB |