SentenceTransformer based on nomic-ai/nomic-embed-text-v1
This is a sentence-transformers model finetuned from nomic-ai/nomic-embed-text-v1. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. In particular, this model was trained on documents that describe frameworks for building ethical AI systems, so it performs well at matching questions to context in RAG applications.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: nomic-ai/nomic-embed-text-v1
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
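Because pooling is a mean over token embeddings and the output is L2-normalized, cosine similarity and dot product give identical scores (which is why the cosine_* and dot_* metrics below match). A minimal sketch of what the Pooling and Normalize modules do, assuming token embeddings of shape [batch, seq_len, 768] and an attention mask:

import torch
import torch.nn.functional as F

def mean_pool_and_normalize(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1], zeros at padding positions
    summed = (token_embeddings * mask).sum(dim=1)    # sum of real-token embeddings per sequence
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sequence
    return F.normalize(summed / counts, p=2, dim=1)  # unit-length 768-dim sentence vectors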
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub (NomicBERT ships custom modeling code, so trust_remote_code is required)
model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)
# Run inference
sentences = [
'What mental health issues are associated with the increased use of technologies in schools and workplaces?',
'technologies has increased in schools and workplaces, and, when coupled with consequential management and \nevaluation decisions, it is leading to mental health harms such as lowered self-confidence, anxiety, depression, and \na reduced ability to use analytical reasoning.61 Documented patterns show that personal data is being aggregated by \ndata brokers to profile communities in harmful ways.62 The impact of all this data harvesting is corrosive,',
'but this approach may still produce harmful recommendations in response to other less-explicit, novel \nprompts (also relevant to CBRN Information or Capabilities, Data Privacy, Information Security, and \nObscene, Degrading and/or Abusive Content). Crafting such prompts deliberately is known as \n“jailbreaking,” or, manipulating prompts to circumvent output controls. Limitations of GAI systems can be',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
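Since the model targets question-to-context matching for RAG, a common pattern is to encode one query against a set of candidate passages and keep the highest-scoring one. A sketch using the same API as above (the passages are placeholders for real retrieved chunks):

query = "What mental health issues are associated with the increased use of technologies?"
passages = [
    "technologies has increased in schools and workplaces, and, when coupled with consequential management ...",
    "Crafting such prompts deliberately is known as \u201cjailbreaking,\u201d or, manipulating prompts to circumvent output controls.",
]
query_embedding = model.encode([query])
passage_embeddings = model.encode(passages)
scores = model.similarity(query_embedding, passage_embeddings)  # shape [1, len(passages)]
print(passages[scores.argmax().item()])  # best-matching context passage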
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.8584 |
cosine_accuracy@3 | 0.9838 |
cosine_accuracy@5 | 0.9951 |
cosine_accuracy@10 | 0.9992 |
cosine_precision@1 | 0.8584 |
cosine_precision@3 | 0.3279 |
cosine_precision@5 | 0.199 |
cosine_precision@10 | 0.0999 |
cosine_recall@1 | 0.8584 |
cosine_recall@3 | 0.9838 |
cosine_recall@5 | 0.9951 |
cosine_recall@10 | 0.9992 |
cosine_ndcg@10 | 0.9418 |
cosine_mrr@10 | 0.922 |
cosine_map@100 | 0.9221 |
dot_accuracy@1 | 0.8584 |
dot_accuracy@3 | 0.9838 |
dot_accuracy@5 | 0.9951 |
dot_accuracy@10 | 0.9992 |
dot_precision@1 | 0.8584 |
dot_precision@3 | 0.3279 |
dot_precision@5 | 0.199 |
dot_precision@10 | 0.0999 |
dot_recall@1 | 0.8584 |
dot_recall@3 | 0.9838 |
dot_recall@5 | 0.9951 |
dot_recall@10 | 0.9992 |
dot_ndcg@10 | 0.9418 |
dot_mrr@10 | 0.922 |
dot_map@100 | 0.9221 |
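A sketch of how such an evaluation can be run with the InformationRetrievalEvaluator; the queries, corpus, and relevant_docs mappings below are placeholders standing in for the actual held-out split:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)

# Placeholder data: query id -> text, corpus id -> text, query id -> relevant corpus ids
queries = {"q1": "What should entities responsible for automated systems establish before deploying the system?"}
corpus = {"d1": "Clear organizational oversight. Entities responsible for the development or use of automated systems should lay out clear governance structures and procedures."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
print(evaluator(model))  # dict of accuracy@k, precision@k, recall@k, NDCG@10, MRR@10, MAP@100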
Training Details
Training Dataset
Unnamed Dataset
- Size: 2,459 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:
 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 2 tokens, mean: 18.7 tokens, max: 35 tokens | min: 22 tokens, mean: 93.19 tokens, max: 337 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What should organizations include in contracts to evaluate third-party GAI processes and standards? | services acquisition and value chain risk management; and legal compliance. Data Privacy; Information Integrity; Information Security; Intellectual Property; Value Chain and Component Integration GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party GAI processes and standards. Information Integrity GV-6.1-007 Inventory all third-party entities with access to organizational content and establish approved GAI technology and service provider lists. |
What steps should be taken to manage third-party entities with access to organizational content? | services acquisition and value chain risk management; and legal compliance. Data Privacy; Information Integrity; Information Security; Intellectual Property; Value Chain and Component Integration GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party GAI processes and standards. Information Integrity GV-6.1-007 Inventory all third-party entities with access to organizational content and establish approved GAI technology and service provider lists. |
What should entities responsible for automated systems establish before deploying the system? | Clear organizational oversight. Entities responsible for the development or use of automated systems should lay out clear governance structures and procedures. This includes clearly-stated governance procedures before deploying the system, as well as responsibility of specific individuals or entities to oversee ongoing assessment and mitigation. Organizational stakeholders including those with oversight of the business process |
- Loss: MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
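In sentence-transformers, these parameters correspond to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss, roughly as follows (a sketch of the loss construction only):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),        # in-batch negatives over (sentence_0, sentence_1) pairs
    matryoshka_dims=[768, 512, 256, 128, 64],   # also supervise truncated prefixes of the embedding
    matryoshka_weights=[1, 1, 1, 1, 1],
)

The Matryoshka objective trains the leading dimensions to be useful on their own, so embeddings can later be truncated (for example to 256 dimensions) with only a modest quality drop.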
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 20
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 20
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
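These hyperparameters map onto the sentence-transformers 3.x Trainer API roughly as follows; a sketch with placeholder training pairs (the output_dir name is hypothetical):

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768, 512, 256, 128, 64])

# Placeholder pairs standing in for the 2,459 question/context samples described above
train_dataset = Dataset.from_dict({
    "sentence_0": ["What should entities responsible for automated systems establish before deploying the system?"],
    "sentence_1": ["Clear organizational oversight. Entities responsible for the development or use of automated systems should lay out clear governance structures and procedures."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-embed-finetune",  # hypothetical output path
    num_train_epochs=20,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    multi_dataset_batch_sampler="round_robin",
    # the original run also set eval_strategy="steps", which additionally
    # requires an eval_dataset or evaluator (e.g. the IR evaluator above)
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()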
Training Logs
Epoch | Step | Training Loss | cosine_map@100 |
---|---|---|---|
0.6494 | 50 | - | 0.8493 |
1.0 | 77 | - | 0.8737 |
1.2987 | 100 | - | 0.8677 |
1.9481 | 150 | - | 0.8859 |
2.0 | 154 | - | 0.8886 |
2.5974 | 200 | - | 0.8913 |
3.0 | 231 | - | 0.9058 |
3.2468 | 250 | - | 0.8993 |
3.8961 | 300 | - | 0.9077 |
4.0 | 308 | - | 0.9097 |
4.5455 | 350 | - | 0.9086 |
5.0 | 385 | - | 0.9165 |
5.1948 | 400 | - | 0.9141 |
5.8442 | 450 | - | 0.9132 |
6.0 | 462 | - | 0.9138 |
6.4935 | 500 | 0.3094 | 0.9137 |
7.0 | 539 | - | 0.9166 |
7.1429 | 550 | - | 0.9172 |
7.7922 | 600 | - | 0.9160 |
8.0 | 616 | - | 0.9169 |
8.4416 | 650 | - | 0.9177 |
9.0 | 693 | - | 0.9169 |
9.0909 | 700 | - | 0.9177 |
9.7403 | 750 | - | 0.9178 |
10.0 | 770 | - | 0.9178 |
10.3896 | 800 | - | 0.9189 |
11.0 | 847 | - | 0.9180 |
11.0390 | 850 | - | 0.9180 |
11.6883 | 900 | - | 0.9188 |
12.0 | 924 | - | 0.9192 |
12.3377 | 950 | - | 0.9204 |
12.9870 | 1000 | 0.0571 | 0.9202 |
13.0 | 1001 | - | 0.9201 |
13.6364 | 1050 | - | 0.9212 |
14.0 | 1078 | - | 0.9203 |
14.2857 | 1100 | - | 0.9219 |
14.9351 | 1150 | - | 0.9207 |
15.0 | 1155 | - | 0.9207 |
15.5844 | 1200 | - | 0.9210 |
16.0 | 1232 | - | 0.9208 |
16.2338 | 1250 | - | 0.9216 |
16.8831 | 1300 | - | 0.9209 |
17.0 | 1309 | - | 0.9209 |
17.5325 | 1350 | - | 0.9216 |
18.0 | 1386 | - | 0.9213 |
18.1818 | 1400 | - | 0.9221 |
18.8312 | 1450 | - | 0.9217 |
19.0 | 1463 | - | 0.9217 |
19.4805 | 1500 | 0.0574 | 0.9225 |
20.0 | 1540 | - | 0.9221 |
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.1.1
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Accelerate: 0.34.2
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}