Results for "image-to-text"

100 matches found.

Salesforce

Salesforce/blip-image-captioning-base

No description available.

🖼️ image-to-text 3,111,099
zai-org

zai-org/GLM-OCR

No description available.

🖼️ image-to-text 2,515,619
Salesforce

Salesforce/blip-image-captioning-large

No description available.

🖼️ image-to-text 906,541
microsoft

microsoft/trocr-large-printed

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 762,173
PaddlePaddle

PaddlePaddle/UVDoc

No description available.

🖼️ image-to-text 639,443
microsoft

microsoft/trocr-base-printed

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 609,043
PaddlePaddle

PaddlePaddle/PP-LCNet_x1_0_doc_ori

No description available.

🖼️ image-to-text 514,816
xtuner

xtuner/llava-llama-3-8b-v1_1-gguf

No description available.

🖼️ image-to-text 469,006
naver-clova-ix

naver-clova-ix/donut-base

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a ...

🖼️ image-to-text 464,616
PaddlePaddle

PaddlePaddle/PP-OCRv5_server_det

No description available.

🖼️ image-to-text 452,998
Salesforce

Salesforce/blip2-opt-2.7b-coco

BLIP-2 consists of 3 models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model. The authors initialize...

🖼️ image-to-text 409,659
nlpconnect

nlpconnect/vit-gpt2-image-captioning

No description available.

🖼️ image-to-text 326,819
PaddlePaddle

PaddlePaddle/en_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 297,397
kha-white

kha-white/manga-ocr-base

No description available.

🖼️ image-to-text 252,329
microsoft

microsoft/trocr-base-handwritten

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 227,354
PaddlePaddle

PaddlePaddle/PP-LCNet_x1_0_textline_ori

No description available.

🖼️ image-to-text 222,836
facebook

facebook/nougat-base

Nougat is a Donut model trained to transcribe scientific PDFs into an easy-to-use markdown format. The model consists of a Swin Transformer ...

🖼️ image-to-text 220,358
microsoft

microsoft/kosmos-2-patch14-224

No description available.

🖼️ image-to-text 177,502
microsoft

microsoft/trocr-large-handwritten

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 165,839
lightonai

lightonai/LightOnOCR-1B-1025

LightOnOCR combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality ope...

🖼️ image-to-text 154,269
breezedeus

breezedeus/pix2text-mfr

This MFR model utilizes the TrOCR architecture developed by Microsoft, starting with its initial values and retrained using a dataset of mat...

🖼️ image-to-text 124,565
alibaba-damo

alibaba-damo/mgp-str-base

MGP-STR is a pure vision STR model, consisting of ViT and specially designed A^3 modules. The ViT module was initialized from the weights of D...

🖼️ image-to-text 108,960
rtr46

rtr46/meiki.txt.recognition.v0

No description available.

🖼️ image-to-text 90,746
rtr46

rtr46/meiki.text.detect.v0

No description available.

🖼️ image-to-text 89,364
numind

numind/NuMarkdown-8B-Thinking

No description available.

🖼️ image-to-text 79,290
PaddlePaddle

PaddlePaddle/PP-OCRv5_server_rec

No description available.

🖼️ image-to-text 69,351
PaddlePaddle

PaddlePaddle/latin_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 61,402
naver-clova-ix

naver-clova-ix/donut-base-finetuned-cord-v2

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a ...

🖼️ image-to-text 59,081
ibm-granite

ibm-granite/granite-vision-3.3-2b

No description available.

🖼️ image-to-text 56,468
PaddlePaddle

PaddlePaddle/PP-OCRv5_mobile_det

No description available.

🖼️ image-to-text 53,940
OleehyO

OleehyO/TexTeller

No description available.

🖼️ image-to-text 49,158
hezarai

hezarai/crnn-base-fa-v2

A CRNN model for Persian OCR. This model is based on a simple CNN + LSTM architecture inspired by this paper....

🖼️ image-to-text 32,961
optimum-intel-internal-testing

optimum-intel-internal-testing/pix2struct-tiny-random

No description available.

🖼️ image-to-text 31,611
microsoft

microsoft/trocr-small-printed

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 29,667
laion

laion/mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k

No description available.

🖼️ image-to-text 22,089
Riksarkivet

Riksarkivet/trocr-base-handwritten-hist-swe-2

No description available.

🖼️ image-to-text 17,063
microsoft

microsoft/trocr-base-stage1

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 15,995
PaddlePaddle

PaddlePaddle/RT-DETR-L_wired_table_cell_det

No description available.

🖼️ image-to-text 14,820
microsoft

microsoft/trocr-small-handwritten

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 14,054
PaddlePaddle

PaddlePaddle/PP-DocLayout_plus-L

No description available.

🖼️ image-to-text 13,910
PaddlePaddle

PaddlePaddle/RT-DETR-L_wireless_table_cell_det

No description available.

🖼️ image-to-text 13,543
optimum-intel-internal-testing

optimum-intel-internal-testing/trocr-small-handwritten

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 12,264
zhiyuanyou

zhiyuanyou/DeQA-Score-Mix3

No description available.

🖼️ image-to-text 12,105
PaddlePaddle

PaddlePaddle/PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 11,546
microsoft

microsoft/git-base

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a lot of...

🖼️ image-to-text 10,910
PaddlePaddle

PaddlePaddle/PP-LCNet_x1_0_table_cls

No description available.

🖼️ image-to-text 10,193
microsoft

microsoft/trocr-small-stage1

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image enc...

🖼️ image-to-text 10,045
PaddlePaddle

PaddlePaddle/PP-DocBlockLayout

No description available.

🖼️ image-to-text 9,442
google

google/pix2struct-textcaps-base

No description available.

🖼️ image-to-text 8,410
IAMJB

IAMJB/chexpert-mimic-cxr-findings-baseline

Evaluation on chexpert-plus...

🖼️ image-to-text 8,227
PaddlePaddle

PaddlePaddle/SLANeXt_wired

No description available.

🖼️ image-to-text 8,012
IAMJB

IAMJB/chexpert-mimic-cxr-impression-baseline

Evaluation on chexpert-plus...

🖼️ image-to-text 7,765
PaddlePaddle

PaddlePaddle/SLANet_plus

No description available.

🖼️ image-to-text 7,550
PaddlePaddle

PaddlePaddle/PP-FormulaNet_plus-L

No description available.

🖼️ image-to-text 7,362
fxmarty

fxmarty/pix2struct-tiny-random

No description available.

🖼️ image-to-text 7,030
PaddlePaddle

PaddlePaddle/PP-Chart2Table

No description available.

🖼️ image-to-text 6,143
Xenova

Xenova/vit-gpt2-image-captioning

No description available.

🖼️ image-to-text 5,849
microsoft

microsoft/git-large-coco

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a lot of...

🖼️ image-to-text 5,835
mradermacher

mradermacher/Qwen2.5-VL-7B-Abliterated-Caption-it-GGUF

No description available.

🖼️ image-to-text 5,249
PaddlePaddle

PaddlePaddle/korean_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 5,199
nyu-visionx

nyu-visionx/Cambrian-S-7B

Architecture: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter; Parameters: 7B; Vision Encoder: SigLIP-384 (SiGLI...

🖼️ image-to-text 5,157
noctrex

noctrex/LightOnOCR-2-1B-GGUF

These are the quantizations of the model LightOnOCR-2-1B...

🖼️ image-to-text 5,043
PaddlePaddle

PaddlePaddle/PP-OCRv4_mobile_det

No description available.

🖼️ image-to-text 4,956
mradermacher

mradermacher/Hulu-Med-30A3-i1-GGUF

No description available.

🖼️ image-to-text 4,588
mrrtmob

mrrtmob/kiri-ocr

No description available.

🖼️ image-to-text 4,527
PaddlePaddle

PaddlePaddle/eslav_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 4,488
facebook

facebook/nougat-small

Nougat is a Donut model trained to transcribe scientific PDFs into an easy-to-use markdown format. The model consists of a Swin Transformer ...

🖼️ image-to-text 4,394
ydshieh

ydshieh/vit-gpt2-coco-en

No description available.

🖼️ image-to-text 4,304
FireRedTeam

FireRedTeam/FireRed-OCR

No description available.

🖼️ image-to-text 4,178
naver-clova-ix

naver-clova-ix/donut-base-finetuned-rvlcdip

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a ...

🖼️ image-to-text 4,101
PaddlePaddle

PaddlePaddle/PP-LCNet_x0_25_textline_ori

No description available.

🖼️ image-to-text 3,993
mlx-community

mlx-community/GLM-OCR-bf16

No description available.

🖼️ image-to-text 3,947
microsoft

microsoft/git-base-coco

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a lot of...

🖼️ image-to-text 3,752
PaddlePaddle

PaddlePaddle/PP-OCRv3_mobile_det

No description available.

🖼️ image-to-text 3,678
PaddlePaddle

PaddlePaddle/en_PP-OCRv4_mobile_rec

No description available.

🖼️ image-to-text 3,490
breezedeus

breezedeus/pix2text-mfd

No description available.

🖼️ image-to-text 3,453
noctrex

noctrex/Chandra-OCR-GGUF

These are quantizations of the model Chandra-OCR...

🖼️ image-to-text 3,392
xtuner

xtuner/llava-phi-3-mini-gguf

No description available.

🖼️ image-to-text 3,325
mradermacher

mradermacher/Qwen3-VL-8B-Abliterated-Caption-it-i1-GGUF

No description available.

🖼️ image-to-text 3,039
google

google/pix2struct-base

No description available.

🖼️ image-to-text 2,992
Norm

Norm/nougat-latex-base

No description available.

🖼️ image-to-text 2,983
noctrex

noctrex/PaddleOCR-VL-1.5-GGUF

These are quantizations of the model PaddleOCR-VL-1.5...

🖼️ image-to-text 2,552
sbintuitions

sbintuitions/sarashina2.2-vision-3b

No description available.

🖼️ image-to-text 2,451
unsloth

unsloth/GLM-OCR

No description available.

🖼️ image-to-text 2,412
kazars24

kazars24/trocr-base-handwritten-ru

No description available.

🖼️ image-to-text 2,362
PaddlePaddle

PaddlePaddle/PP-OCRv4_mobile_rec

No description available.

🖼️ image-to-text 2,326
PaddlePaddle

PaddlePaddle/arabic_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 2,316
PaddlePaddle

PaddlePaddle/en_PP-OCRv3_mobile_rec

No description available.

🖼️ image-to-text 2,164
openthaigpt

openthaigpt/thai-trocr

No description available.

🖼️ image-to-text 2,122
thwri

thwri/CogFlorence-2.2-Large

No description available.

🖼️ image-to-text 2,085
to-be

to-be/donut-base-finetuned-invoices

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a ...

🖼️ image-to-text 1,932
PaddlePaddle

PaddlePaddle/PP-DocLayout-L

No description available.

🖼️ image-to-text 1,862
unography

unography/blip-large-long-cap

No description available.

🖼️ image-to-text 1,859
PaddlePaddle

PaddlePaddle/ta_PP-OCRv5_mobile_rec

No description available.

🖼️ image-to-text 1,809
noctrex

noctrex/ZwZ-8B-GGUF

These are quantizations of the model ZwZ-8B, using an imatrix created from text\en\medium...

🖼️ image-to-text 1,769
logasanjeev

logasanjeev/indian-id-validator

Below is a detailed breakdown of each model, including the classes they detect and their evaluation metrics on a custom Indian ID dataset...

🖼️ image-to-text 1,746
zhangzicheng

zhangzicheng/q-sit-mini

No description available.

🖼️ image-to-text 1,556
mradermacher

mradermacher/Qwen3-VL-8B-Abliterated-Caption-it-GGUF

No description available.

🖼️ image-to-text 1,556
mradermacher

mradermacher/QwenStoryteller-i1-GGUF

No description available.

🖼️ image-to-text 1,552
breezedeus

breezedeus/pix2text-mfr-1.5

This MFR model utilizes the TrOCR architecture developed by Microsoft, starting with its initial values and retrained using a dataset of mat...

🖼️ image-to-text 1,523