gte-large-zh
General Text Embeddings (GTE) model, introduced in the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".
The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently offered in several sizes for both Chinese and English. They are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, which makes them applicable to downstream text embedding tasks such as information retrieval, semantic textual similarity, and text reranking.
Model List
Models | Language | Max Sequence Length | Dimension | Model Size |
---|---|---|---|---|
GTE-large-zh | Chinese | 512 | 1024 | 0.67GB |
GTE-base-zh | Chinese | 512 | 768 | 0.21GB |
GTE-small-zh | Chinese | 512 | 512 | 0.10GB |
GTE-large | English | 512 | 1024 | 0.67GB |
GTE-base | English | 512 | 768 | 0.21GB |
GTE-small | English | 512 | 384 | 0.10GB |
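The Dimension column also determines how much space stored embeddings take. The following is a rough sketch of that arithmetic, assuming uncompressed float32 values (4 bytes each), 1 GB = 10^9 bytes, and no index overhead; the helper name `index_size_gb` is hypothetical, and the dimensions are the ones listed in the CMTEB results table of this card.

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size in GB (10**9 bytes) of num_vectors dense vectors, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

# Embedding dimensions as listed in the CMTEB results table of this card.
DIMS = {"gte-large-zh": 1024, "gte-base-zh": 768, "gte-small-zh": 512}

for name, dim in DIMS.items():
    size = index_size_gb(1_000_000, dim)
    print(f"{name}: {size:.3f} GB per million float32 vectors")
```

Under these assumptions, one million vectors from the large model need roughly twice the raw storage of one million vectors from the small model.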
Metrics
We compared the performance of the GTE models with other popular text embedding models on the MTEB (CMTEB for Chinese language) benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.
- Evaluation results on CMTEB
Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
---|---|---|---|---|---|---|---|---|---|---|
gte-large-zh | 0.65 | 1024 | 512 | 66.72 | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
text-embedding-ada-002(openai) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |
Usage
Code example
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

input_texts = [
    "中国的首都是哪里",
    "你喜欢去哪里旅游",
    "北京",
    "今天中午吃什么"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Run the encoder without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the hidden state of the first ([CLS]) token as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Score the first text (the query) against the remaining texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
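The score computation in the example above works because the dot product of two L2-normalized vectors equals their cosine similarity; the factor of 100 only rescales the result. The dependency-free sketch below illustrates this with toy 3-dimensional vectors. The vectors and the helper names (`l2_normalize`, `dot`, `cosine`) are made up for illustration and are not model outputs or part of any library.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for real embeddings (hypothetical values).
query = [1.0, 2.0, 2.0]
candidates = [[2.0, 4.0, 4.0], [2.0, -1.0, 0.0], [0.0, 3.0, 4.0]]

# Same recipe as the example above: normalize, take dot products, scale by 100.
q = l2_normalize(query)
scores = [dot(q, l2_normalize(c)) * 100 for c in candidates]

for c, s in zip(candidates, scores):
    # Both columns agree up to floating-point error.
    print(f"normalized dot: {s:.2f}   cosine * 100: {cosine(query, c) * 100:.2f}")
```

A candidate pointing in the same direction as the query scores close to 100, and an orthogonal one scores close to 0.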
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['中国的首都是哪里', '北京']
model = SentenceTransformer('thenlper/gte-large-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
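For retrieval-style use, similarity scores are typically turned into a ranked list of the best-matching documents. The following is a minimal sketch of that step using only the standard library; the function name `top_k` and the score values are hypothetical, and in practice the scores would come from `cos_sim` between a query embedding and each document embedding.

```python
import heapq

def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical similarity scores of one query against five documents.
scores = [0.31, 0.87, 0.12, 0.64, 0.79]

print(top_k(scores, 3))  # [(1, 0.87), (4, 0.79), (3, 0.64)]
```

`heapq.nlargest` avoids sorting the full score list, which matters little here but helps when there are many documents and `k` is small.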
Limitation
This model supports Chinese text only, and inputs longer than 512 tokens are truncated.
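One common workaround for the length limit, which is not something this card prescribes, is to split a long text into overlapping windows, embed each window separately, and average the resulting vectors. The sketch below shows only the windowing and averaging logic on plain Python lists. The names `chunk_token_ids` and `mean_pool` are hypothetical, and the default of 510 tokens per window assumes the tokenizer adds two special tokens ([CLS] and [SEP]) to each window.

```python
def chunk_token_ids(token_ids, max_len=510, overlap=64):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    window boundary still appears with some context in the next window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks

def mean_pool(vectors):
    """Average equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

ids = list(range(1200))  # stand-in for real token ids
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [510, 510, 308]

# Toy per-chunk embeddings (hypothetical values) averaged into one vector.
print(mean_pool([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

Whether averaged chunk embeddings are good enough depends on the task; for retrieval it is also common to index each chunk separately.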
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{li2023towards,
title={Towards general text embeddings with multi-stage contrastive learning},
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
journal={arXiv preprint arXiv:2308.03281},
year={2023}
}
Evaluation results
- cos_sim_pearson on MTEB AFQMC (validation set, self-reported): 48.941
- cos_sim_spearman on MTEB AFQMC (validation set, self-reported): 54.583
- euclidean_pearson on MTEB AFQMC (validation set, self-reported): 52.739
- euclidean_spearman on MTEB AFQMC (validation set, self-reported): 54.583
- manhattan_pearson on MTEB AFQMC (validation set, self-reported): 52.731
- manhattan_spearman on MTEB AFQMC (validation set, self-reported): 54.573
- cos_sim_pearson on MTEB ATEC (test set, self-reported): 47.293
- cos_sim_spearman on MTEB ATEC (test set, self-reported): 54.601
- euclidean_pearson on MTEB ATEC (test set, self-reported): 54.614
- euclidean_spearman on MTEB ATEC (test set, self-reported): 54.601
- manhattan_pearson on MTEB ATEC (test set, self-reported): 54.594
- manhattan_spearman on MTEB ATEC (test set, self-reported): 54.601
- accuracy on MTEB AmazonReviewsClassification (zh) (test set, self-reported): 47.234
- f1 on MTEB AmazonReviewsClassification (zh) (test set, self-reported): 45.690
- cos_sim_pearson on MTEB BQ (test set, self-reported): 62.550
- cos_sim_spearman on MTEB BQ (test set, self-reported): 64.406
- euclidean_pearson on MTEB BQ (test set, self-reported): 62.935
- euclidean_spearman on MTEB BQ (test set, self-reported): 64.406
- manhattan_pearson on MTEB BQ (test set, self-reported): 62.840
- manhattan_spearman on MTEB BQ (test set, self-reported): 64.308
- v_measure on MTEB CLSClusteringP2P (test set, self-reported): 42.098
- v_measure on MTEB CLSClusteringS2S (test set, self-reported): 38.907
- map on MTEB CMedQAv1 (test set, self-reported): 86.092
- mrr on MTEB CMedQAv1 (test set, self-reported): 88.675
- map on MTEB CMedQAv2 (test set, self-reported): 86.458
- mrr on MTEB CMedQAv2 (test set, self-reported): 89.016
- map_at_1 on MTEB CmedqaRetrieval (self-reported): 24.215
- map_at_10 on MTEB CmedqaRetrieval (self-reported): 36.498
- map_at_100 on MTEB CmedqaRetrieval (self-reported): 38.409
- map_at_1000 on MTEB CmedqaRetrieval (self-reported): 38.524
- map_at_3 on MTEB CmedqaRetrieval (self-reported): 32.428
- map_at_5 on MTEB CmedqaRetrieval (self-reported): 34.664
- mrr_at_1 on MTEB CmedqaRetrieval (self-reported): 36.834
- mrr_at_10 on MTEB CmedqaRetrieval (self-reported): 45.196
- mrr_at_100 on MTEB CmedqaRetrieval (self-reported): 46.214
- mrr_at_1000 on MTEB CmedqaRetrieval (self-reported): 46.259
- mrr_at_3 on MTEB CmedqaRetrieval (self-reported): 42.631
- mrr_at_5 on MTEB CmedqaRetrieval (self-reported): 44.044
- ndcg_at_1 on MTEB CmedqaRetrieval (self-reported): 36.834
- ndcg_at_10 on MTEB CmedqaRetrieval (self-reported): 43.146
- ndcg_at_100 on MTEB CmedqaRetrieval (self-reported): 50.633
- ndcg_at_1000 on MTEB CmedqaRetrieval (self-reported): 52.609
- ndcg_at_3 on MTEB CmedqaRetrieval (self-reported): 37.851
- ndcg_at_5 on MTEB CmedqaRetrieval (self-reported): 40.005
- precision_at_1 on MTEB CmedqaRetrieval (self-reported): 36.834
- precision_at_10 on MTEB CmedqaRetrieval (self-reported): 9.647
- precision_at_100 on MTEB CmedqaRetrieval (self-reported): 1.574
- precision_at_1000 on MTEB CmedqaRetrieval (self-reported): 0.183
- precision_at_3 on MTEB CmedqaRetrieval (self-reported): 21.480
- precision_at_5 on MTEB CmedqaRetrieval (self-reported): 15.649
- recall_at_1 on MTEB CmedqaRetrieval (self-reported): 24.215
- recall_at_10 on MTEB CmedqaRetrieval (self-reported): 54.079
- recall_at_100 on MTEB CmedqaRetrieval (self-reported): 84.943
- recall_at_1000 on MTEB CmedqaRetrieval (self-reported): 98.098
- recall_at_3 on MTEB CmedqaRetrieval (self-reported): 38.117
- recall_at_5 on MTEB CmedqaRetrieval (self-reported): 44.776
- cos_sim_accuracy on MTEB Cmnli (validation set, self-reported): 82.514
- cos_sim_ap on MTEB Cmnli (validation set, self-reported): 89.499
- cos_sim_f1 on MTEB Cmnli (validation set, self-reported): 83.893
- cos_sim_precision on MTEB Cmnli (validation set, self-reported): 78.198
- cos_sim_recall on MTEB Cmnli (validation set, self-reported): 90.484
- dot_accuracy on MTEB Cmnli (validation set, self-reported): 82.514
- dot_ap on MTEB Cmnli (validation set, self-reported): 89.491
- dot_f1 on MTEB Cmnli (validation set, self-reported): 83.893
- dot_precision on MTEB Cmnli (validation set, self-reported): 78.198
- dot_recall on MTEB Cmnli (validation set, self-reported): 90.484
- euclidean_accuracy on MTEB Cmnli (validation set, self-reported): 82.514
- euclidean_ap on MTEB Cmnli (validation set, self-reported): 89.499
- euclidean_f1 on MTEB Cmnli (validation set, self-reported): 83.893
- euclidean_precision on MTEB Cmnli (validation set, self-reported): 78.198
- euclidean_recall on MTEB Cmnli (validation set, self-reported): 90.484
- manhattan_accuracy on MTEB Cmnli (validation set, self-reported): 82.489
- manhattan_ap on MTEB Cmnli (validation set, self-reported): 89.492
- manhattan_f1 on MTEB Cmnli (validation set, self-reported): 83.847
- manhattan_precision on MTEB Cmnli (validation set, self-reported): 77.283
- manhattan_recall on MTEB Cmnli (validation set, self-reported): 91.630
- max_accuracy on MTEB Cmnli (validation set, self-reported): 82.514
- max_ap on MTEB Cmnli (validation set, self-reported): 89.499
- max_f1 on MTEB Cmnli (validation set, self-reported): 83.893