A fine-tune of CLIP-L. Original model: openai/clip-vit-large-patch14

  • ❤️ this CLIP? Help feed it if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
  • Want to feed it yourself? All code for fine-tuning and much more is on my GitHub.

Update 23/SEP/2024:

  • Hugging Face Transformers / Diffusers pipeline now implemented.
  • See here for an example script: Integrating my CLIP-L with Flux.1
  • Otherwise, use it like any other HF model:
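from transformers import CLIPModel, CLIPProcessor, CLIPConfig
model_id = "zer0int/CLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
# Load the fine-tuned weights and the matching pre-processor
model = CLIPModel.from_pretrained(model_id, config=config)
processor = CLIPProcessor.from_pretrained(model_id)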

Update 03/SEP/2024 / edit 05/AUG:

👋 Looking for a Text Encoder for Flux.1 (or SD3, SDXL, SD, ...) to replace CLIP-L? 👀

You'll generally want the "TE-only" .safetensors:

  • 👉 The "TEXT" model has superior prompt following, especially for text, but also for other details. DOWNLOAD
  • 👉 The "SMOOTH" model can sometimes** have better details (when there's no text in the image). DOWNLOAD
  • The "GmP" initial fine-tune is deprecated / inferior to the above models. Still, you can DOWNLOAD it.

**: The "TEXT" model is the best for text. Full stop. But whether the "SMOOTH" model is better for your (text-free) scenario than the "TEXT" model really depends on the specific prompt. It might also be the case that the "TEXT" model leads to images that you prefer over "SMOOTH"; the only way to know is to experiment with both.

🤓👨‍💻 In general (because we're not limited to text-to-image generative AI), I provide four versions / downloads:

  • Text encoder only .safetensors.
  • Full model .safetensors.
  • State_dict pickle.
  • Full model pickle (can be used as-is with "import clip" -> clip.load() after bypassing SHA checksum verification).

The TEXT model has a modality gap of 0.80 (OpenAI pre-trained: 0.82).

  • Trained with high temperature of 0.1 + tinkering.
  • ImageNet/ObjectNet accuracy ~0.91 for both "SMOOTH" and "TEXT" models (pre-trained: ~0.84).
  • The models (this plot = "TEXT" model on MSCOCO) are also golden retrievers: 🥰🐕

[Plot: "TEXT" model on MSCOCO]
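
How the modality gap quoted above is measured is not stated here. A minimal, dependency-free sketch, assuming the common definition (Euclidean distance between the centroids of the L2-normalized image and text embeddings); the function names are illustrative, and the evaluation in the GitHub repo may differ:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embeds, text_embeds):
    """Distance between the mean image embedding and the mean text embedding.

    ASSUMPTION: this is one common definition, not necessarily the one
    used to obtain the numbers on this page.
    """
    img_center = centroid([l2_normalize(v) for v in image_embeds])
    txt_center = centroid([l2_normalize(v) for v in text_embeds])
    return math.dist(img_center, txt_center)


# Toy 2-D embeddings standing in for real CLIP outputs.
toy_images = [[1.0, 0.0], [2.0, 0.0]]
toy_texts = [[0.0, 1.0], [0.0, 2.0]]
gap = modality_gap(toy_images, toy_texts)
```

With real data, `image_embeds` and `text_embeds` would be the projected embeddings of paired images and captions; a smaller value means the two modalities sit closer together in the shared space.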


Update 11/AUG/2024:

New Best-Performing CLIP ViT-L/14 'GmP-smooth' model added (simply download the files named BEST!).

Or just create a fine-tune yourself: https://github.com/zer0int/CLIP-fine-tune

How?

  • Geometric Parametrization (GmP) (same as before)
  • Activation Value manipulation for 'adverb neuron' (same as before)
  • NEW: Custom loss function with label smoothing!
  • For in-depth details, see my GitHub. 🤗
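
The actual custom loss lives in the GitHub repo. As a rough illustration only, here is a generic CLIP-style symmetric contrastive loss with label smoothing, written without dependencies. The function names, the `smoothing=0.1` default and the way the smoothing mass is spread over the wrong pairs are assumptions, not the author's implementation. Similarities are divided by a temperature; OpenAI's pre-trained CLIP uses a logit scale of about 100 (a temperature near 0.01), so a value like 0.1 counts as high.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def smoothed_cross_entropy(logits, smoothing):
    """Mean cross-entropy where row i's correct column is i.

    The correct pair gets target 1 - smoothing; the remaining mass is
    split evenly over the n - 1 wrong pairs (needs a batch of n >= 2).
    """
    n = len(logits)
    off_target = smoothing / (n - 1)
    total = 0.0
    for i, row in enumerate(logits):
        for j, log_p in enumerate(log_softmax(row)):
            target = 1.0 - smoothing if i == j else off_target
            total -= target * log_p
    return total / n


def contrastive_loss(image_embeds, text_embeds, temperature=0.1, smoothing=0.1):
    """Symmetric image->text and text->image loss.

    Embeddings are assumed to be L2-normalized and paired by index.
    """
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embeds]
        for img in image_embeds
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text->image direction
    return 0.5 * (
        smoothed_cross_entropy(logits, smoothing)
        + smoothed_cross_entropy(logits_t, smoothing)
    )


# Two perfectly matched toy pairs.
pairs = [[1.0, 0.0], [0.0, 1.0]]
plain = contrastive_loss(pairs, pairs, temperature=1.0, smoothing=0.0)
smooth = contrastive_loss(pairs, pairs, temperature=0.1, smoothing=0.1)
```

Label smoothing keeps the loss from reaching zero even on perfectly matched pairs, which discourages over-confident similarity scores.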

A fine-tune of OpenAI / CLIP ViT-L/14 that has an unprecedented ImageNet/ObjectNet accuracy of ~0.90 (original pre-trained model / OpenAI's CLIP: ~0.85)**.

Made possible with Geometric Parametrization (GmP):


"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
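
The decomposition above can be sketched with plain lists. This is a minimal illustration, assuming 'r' holds one norm per output row and 'theta' the matching unit direction; the helper names are hypothetical and the real GeometricLinear in the GitHub repo operates on tensors:

```python
import math


def decompose(weight):
    """Split each weight row into magnitude r and unit direction theta.

    Assumes no row is all zeros.
    """
    r, theta = [], []
    for row in weight:
        norm = math.sqrt(sum(w * w for w in row))
        r.append(norm)
        theta.append([w / norm for w in row])
    return r, theta


def recompose(r, theta):
    """Rebuild weight rows as r * theta / ||theta||.

    theta is re-normalized because training may change its length.
    """
    weight = []
    for r_i, row in zip(r, theta):
        norm = math.sqrt(sum(t * t for t in row))
        weight.append([r_i * t / norm for t in row])
    return weight


def geometric_linear(x, r, theta, bias):
    """Forward pass of a linear layer parametrized by (r, theta, bias)."""
    weight = recompose(r, theta)
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy "pre-trained" weights: 2 output features, 2 input features.
pretrained = [[3.0, 4.0], [0.0, 2.0]]
r, theta = decompose(pretrained)
output = geometric_linear([1.0, 1.0], r, theta, bias=[0.5, -1.0])
```

Right after decomposition the layer computes exactly what the original Linear did; fine-tuning then updates magnitude and direction as separate parameters.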

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder! 🤗
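
The shared files are already converted, so nothing needs to be done to use them. For anyone fine-tuning their own GmP model, the conversion back to .weight could look roughly like this. It is a sketch under assumptions: nested lists stand in for tensors, `merge_gmp_keys` is a hypothetical helper, and the repo's own conversion script is the reference:

```python
import math


def merge_gmp_keys(state_dict):
    """Return a new dict where each '<prefix>.r' / '<prefix>.theta' pair
    is replaced by a single '<prefix>.weight' = r * theta / ||theta||.

    All other entries (e.g. biases) are copied unchanged.
    """
    merged = {}
    for key, value in state_dict.items():
        if key.endswith(".theta"):
            prefix = key[: -len(".theta")]
            radii = state_dict[prefix + ".r"]
            rows = []
            for r_i, row in zip(radii, value):
                norm = math.sqrt(sum(t * t for t in row))
                rows.append([r_i * t / norm for t in row])
            merged[prefix + ".weight"] = rows
        elif key.endswith(".r") and key[: -len(".r")] + ".theta" in state_dict:
            continue  # consumed together with its theta partner
        else:
            merged[key] = value
    return merged


# Toy GmP state_dict using the key layout shown above.
layer = "visual.transformer.resblocks.0.mlp.c_fc"
gmp_state = {
    layer + ".r": [5.0],
    layer + ".theta": [[0.6, 0.8]],
    layer + ".bias": [0.1],
}
plain_state = merge_gmp_keys(gmp_state)
```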

  • ** For details on training and those numbers / the eval, please see https://github.com/zer0int/CLIP-fine-tune
  • -> You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate my exact model fine-tune.

Pre-trained CLIP model by OpenAI, License: MIT License

Safetensors · Model size: 428M params · Tensor type: F32