TeLVE: Turkish efficient Language Vision Engine 🧿

First Turkish VLM ever!

TeLVE is the first Visual Language Model designed specifically for Turkish language understanding and image description generation. Built on pre-trained Vision Transformer (ViT) and BERT encoders, it fills a gap in Turkish visual-linguistic processing.

Model Description

TeLVE combines:

🖼️ Vision Transformer (ViT-base-patch16-224)
📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
🔄 Cross-attention mechanism for vision-language fusion
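
To make the fusion step concrete, here is a minimal sketch of single-head scaled dot-product cross-attention, in which text token vectors act as queries over image patch vectors. It is an illustration only, not TeLVE's actual implementation: the function names, the toy vectors, and the omission of learned projection matrices are all assumptions made to keep the example short and dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, image_keys, image_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Each text query vector attends over all image patch vectors.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(image_keys[0])
    fused = []
    for query in text_queries:
        # Similarity between this text token and each image patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_keys
        ]
        weights = softmax(scores)
        # The weighted sum of patch values gives an image-aware token vector.
        fused.append([
            sum(w * value[i] for w, value in zip(weights, image_values))
            for i in range(len(image_values[0]))
        ])
    return fused

# Toy data (hypothetical): 2 text tokens, 3 image patches, 4-dimensional vectors.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_patches = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
fused_tokens = cross_attention(text_tokens, image_patches, image_patches)
```

Each entry of fused_tokens is a mixture of patch vectors, weighted toward the patches most similar to the corresponding text token. In a real model the queries, keys, and values would first pass through learned linear projections and usually several attention heads.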

Version Logs

TeLVE v1.0: Trained on the Unsplash Lite dataset.
TeLVE v1.0dep: Dataset extended with selected images from Pexels; the encoder problem with the letter "ü" was fixed. Deprecated: performance decreased because of a dataset addressing problem. Not recommended for use.

Usage

The model can be used in two ways:

Inference (imagine.py)

# Generate captions for images
python imagine.py

This script:

Loads a trained TeLVE model
Takes images from the images directory
Generates a Turkish caption for each image
Prints the results to the console
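
The steps above can be sketched as a simple directory loop. This is not the contents of imagine.py: the helper names, the set of accepted file extensions, and the generate_caption callable (a stand-in for the loaded TeLVE model) are all hypothetical and are used only to show the overall flow.

```python
from pathlib import Path

# Assumed set of supported image formats; imagine.py may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def find_images(directory):
    """Return the image files in `directory`, sorted by path."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )

def caption_directory(directory, generate_caption):
    """Call `generate_caption(path)` for every image and print the result.

    `generate_caption` is a hypothetical callable standing in for the
    loaded TeLVE model; it is not part of imagine.py.
    """
    results = {}
    for image_path in find_images(directory):
        caption = generate_caption(image_path)
        print(f"{image_path.name}: {caption}")
        results[image_path.name] = caption
    return results
```

With a real model, generate_caption would open the image, run it through the vision encoder, and decode a Turkish caption; here any function that maps a path to a string can be passed in.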

Training (main.py)

Users can train their own models with ViT and BERT encoders.

# Train a new model
python main.py

This script:

Loads and preprocesses image-caption pairs
Initializes ViT and BERT encoders
Trains the combined model
Saves the model and tokenizer

Performance

Performance scores are not yet available; they will be added once the model has been evaluated.

Citation

@software{telve2024,
    author = {Öğüt Su Karagün},
    title = {TeLVE: Turkish efficient Language Vision Engine},
    year = {2024},
    url = {https://huggingface.co/outsu/TeLVE}
}
