
SentenceTransformer based on nomic-ai/nomic-embed-text-v1

This is a sentence-transformers model finetuned from nomic-ai/nomic-embed-text-v1. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. In particular, this model was trained on various documents that describe frameworks for building ethical AI systems. As such, it performs well at matching questions to context in retrieval-augmented generation (RAG) applications.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/nomic-embed-text-v1
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Model Size: ~137M parameters (F32, Safetensors)

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
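
Equivalently, outside Sentence Transformers, the stack above amounts to: run the transformer, mean-pool the token embeddings over non-padding positions, then L2-normalize. Below is a minimal sketch against plain transformers, assuming the repository ships its tokenizer files and that the custom NomicBert code is loaded with trust_remote_code:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deman539/nomic-embed-text-v1")
model = AutoModel.from_pretrained("deman539/nomic-embed-text-v1", trust_remote_code=True)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**batch)[0]  # (batch, seq_len, 768)
    # Mean pooling: average token embeddings, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # The Normalize() module: scale each vector to unit length.
    return F.normalize(pooled, p=2, dim=1)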

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library (the custom NomicBert modeling code also requires einops):

pip install -U sentence-transformers einops

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (NomicBert uses custom modeling code, so trust_remote_code is required)
model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)
# Run inference
sentences = [
    'What mental health issues are associated with the increased use of technologies in schools and workplaces?',
    'technologies has increased in schools and workplaces, and, when coupled with consequential management and \nevaluation decisions, it is leading to mental health harms such as lowered self-confidence, anxiety, depression, and \na reduced ability to use analytical reasoning.61 Documented patterns show that personal data is being aggregated by \ndata brokers to profile communities in harmful ways.62 The impact of all this data harvesting is corrosive,',
    'but this approach may still produce harmful recommendations in response to other less-explicit, novel \nprompts (also relevant to CBRN Information or Capabilities, Data Privacy, Information Security, and \nObscene, Degrading and/or Abusive Content). Crafting such prompts deliberately is known as \n“jailbreaking,” or, manipulating prompts to circumvent output controls. Limitations of GAI systems can be',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
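
Because the model was tuned to match questions to context passages, a RAG-style retrieval loop is the natural usage pattern. The sketch below uses util.semantic_search over an illustrative two-document corpus; the query and document strings are made up for the example, not taken from the training set.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)

# Illustrative context chunks; in a real RAG pipeline these come from your document store.
corpus = [
    "Entities should establish clear governance procedures before deploying automated systems.",
    "Personal data is being aggregated by data brokers to profile communities in harmful ways.",
]
query = "What should be established before deploying an automated system?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank contexts by cosine similarity and keep the best match.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")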

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.8584
cosine_accuracy@3 0.9838
cosine_accuracy@5 0.9951
cosine_accuracy@10 0.9992
cosine_precision@1 0.8584
cosine_precision@3 0.3279
cosine_precision@5 0.199
cosine_precision@10 0.0999
cosine_recall@1 0.8584
cosine_recall@3 0.9838
cosine_recall@5 0.9951
cosine_recall@10 0.9992
cosine_ndcg@10 0.9418
cosine_mrr@10 0.922
cosine_map@100 0.9221
dot_accuracy@1 0.8584
dot_accuracy@3 0.9838
dot_accuracy@5 0.9951
dot_accuracy@10 0.9992
dot_precision@1 0.8584
dot_precision@3 0.3279
dot_precision@5 0.199
dot_precision@10 0.0999
dot_recall@1 0.8584
dot_recall@3 0.9838
dot_recall@5 0.9951
dot_recall@10 0.9992
dot_ndcg@10 0.9418
dot_mrr@10 0.922
dot_map@100 0.9221
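
The card does not state how these figures were computed, but the metric names correspond to Sentence Transformers' InformationRetrievalEvaluator. A sketch of such an evaluation with placeholder data (the actual held-out queries and corpus are not published here):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True)

# Placeholder evaluation data: id -> text mappings plus relevance judgments.
queries = {"q1": "What should be established before deploying an automated system?"}
corpus = {
    "d1": "Entities should establish clear governance procedures before deployment.",
    "d2": "Data brokers aggregate personal data to profile communities.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="ir-eval")
results = evaluator(model)
print(results)  # includes keys such as ir-eval_cosine_map@100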

Training Details

Training Dataset

Unnamed Dataset

  • Size: 2,459 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    sentence_0: string; min: 2 tokens, mean: 18.7 tokens, max: 35 tokens
    sentence_1: string; min: 22 tokens, mean: 93.19 tokens, max: 337 tokens
  • Samples:
    sentence_0: What should organizations include in contracts to evaluate third-party GAI processes and standards?
    sentence_1: services acquisition and value chain risk management; and legal compliance. Data Privacy; Information Integrity; Information Security; Intellectual Property; Value Chain and Component Integration GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party GAI processes and standards. Information Integrity GV-6.1-007 Inventory all third-party entities with access to organizational content and establish approved GAI technology and service provider lists.

    sentence_0: What steps should be taken to manage third-party entities with access to organizational content?
    sentence_1: services acquisition and value chain risk management; and legal compliance. Data Privacy; Information Integrity; Information Security; Intellectual Property; Value Chain and Component Integration GV-6.1-006 Include clauses in contracts which allow an organization to evaluate third-party GAI processes and standards. Information Integrity GV-6.1-007 Inventory all third-party entities with access to organizational content and establish approved GAI technology and service provider lists.

    sentence_0: What should entities responsible for automated systems establish before deploying the system?
    sentence_1: Clear organizational oversight. Entities responsible for the development or use of automated systems should lay out clear governance structures and procedures. This includes clearly-stated governance procedures before deploying the system, as well as responsibility of specific individuals or entities to oversee ongoing assessment and mitigation. Organizational stakeholders including those with oversight of the business process
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
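
One practical consequence of MatryoshkaLoss training: embeddings can be truncated to any of the listed dimensions (512, 256, 128, or 64) for cheaper storage and search, typically at a modest quality cost. Sentence Transformers exposes this through the truncate_dim argument, as in the sketch below:

from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of each embedding (a Matryoshka prefix).
model = SentenceTransformer("deman539/nomic-embed-text-v1", trust_remote_code=True, truncate_dim=256)
embeddings = model.encode(["example sentence"])
print(embeddings.shape)  # (1, 256)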
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 20
  • multi_dataset_batch_sampler: round_robin
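
Putting the dataset, loss, and the non-default hyperparameters above together, the training run would look roughly like the sketch below. The single (sentence_0, sentence_1) pair and the output directory are placeholders, not the actual 2,459-sample training data.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Hypothetical (question, context) pair standing in for the real training set.
train_dataset = Dataset.from_dict({
    "sentence_0": ["What should be established before deploying an automated system?"],
    "sentence_1": ["Entities should establish clear governance procedures before deployment."],
})

# MultipleNegativesRankingLoss wrapped in MatryoshkaLoss, as listed above.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-embed-finetune",  # placeholder
    num_train_epochs=20,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # The original run also set eval_strategy="steps", which additionally
    # requires an eval_dataset or evaluator (see Evaluation above).
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()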

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 20
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss cosine_map@100
0.6494 50 - 0.8493
1.0 77 - 0.8737
1.2987 100 - 0.8677
1.9481 150 - 0.8859
2.0 154 - 0.8886
2.5974 200 - 0.8913
3.0 231 - 0.9058
3.2468 250 - 0.8993
3.8961 300 - 0.9077
4.0 308 - 0.9097
4.5455 350 - 0.9086
5.0 385 - 0.9165
5.1948 400 - 0.9141
5.8442 450 - 0.9132
6.0 462 - 0.9138
6.4935 500 0.3094 0.9137
7.0 539 - 0.9166
7.1429 550 - 0.9172
7.7922 600 - 0.9160
8.0 616 - 0.9169
8.4416 650 - 0.9177
9.0 693 - 0.9169
9.0909 700 - 0.9177
9.7403 750 - 0.9178
10.0 770 - 0.9178
10.3896 800 - 0.9189
11.0 847 - 0.9180
11.0390 850 - 0.9180
11.6883 900 - 0.9188
12.0 924 - 0.9192
12.3377 950 - 0.9204
12.9870 1000 0.0571 0.9202
13.0 1001 - 0.9201
13.6364 1050 - 0.9212
14.0 1078 - 0.9203
14.2857 1100 - 0.9219
14.9351 1150 - 0.9207
15.0 1155 - 0.9207
15.5844 1200 - 0.9210
16.0 1232 - 0.9208
16.2338 1250 - 0.9216
16.8831 1300 - 0.9209
17.0 1309 - 0.9209
17.5325 1350 - 0.9216
18.0 1386 - 0.9213
18.1818 1400 - 0.9221
18.8312 1450 - 0.9217
19.0 1463 - 0.9217
19.4805 1500 0.0574 0.9225
20.0 1540 - 0.9221

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}