C4AI-Community (C4AI Community)

posted an update 2 days ago

Post

1898

Exciting News in AI: JinaAI Releases JINA-CLIP-v2!

The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:

🚀 Technical Highlights:
- Dual encoder architecture combining a 561M parameter Jina XLM-RoBERTa text encoder and a 304M parameter EVA02-L14 vision encoder
- Supports 89 languages with 8,192 token context length
- Processes images up to 512×512 pixels with 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage

⚡️ Under The Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using InfoNCE loss in both directions
- Trained on massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample

📊 Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs 1024D)

🎯 Key Innovation:
The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!

Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!

singhsidhukuldeep

posted an update 3 days ago

Post

1167

Fascinating insights from @Pinterest 's latest research on improving feature interactions in recommendation systems!

Pinterest's engineering team has tackled a critical challenge in their Homefeed ranking system that serves 500M+ monthly active users. Here's what makes their approach remarkable:

>> Technical Deep Dive

Architecture Overview
• The ranking model combines dense features, sparse features, and embedding features to represent users, Pins, and context
• Sparse features are processed using learnable embeddings with size based on feature cardinality
• User sequence embeddings are generated using a transformer architecture processing past engagements

Feature Processing Pipeline
• Dense features undergo normalization for numerical stability
• Sparse and embedding features receive L2 normalization
• All features are concatenated into a single feature embedding

Key Innovations
• Implemented parallel MaskNet layers with 3 blocks
• Used projection ratio of 2.0 and output dimension of 512
• Stacked 4 DCNv2 layers on top for higher-order interactions

Performance Improvements
• Achieved +1.42% increase in Homefeed Save Volume
• Boosted Overall Time Spent by +0.39%
• Maintained memory consumption increase to just 5%

>> Industry Constraints Addressed

Memory Management
• Optimized for 60% GPU memory utilization
• Prevented OOM errors while maintaining batch size efficiency

Latency Optimization
• Removed input-output concatenation before MLP
• Reduced hidden layer sizes in MLP
• Achieved zero latency increase while improving performance

System Stability
• Ensured reproducible results across retraining
• Maintained model stability across different data distributions
• Successfully deployed in production environment

This work brilliantly demonstrates how to balance academic innovations with real-world industrial constraints. Kudos to the Pinterest team!

prithivMLmods

posted an update 4 days ago

Post

4917

Sketchify 😉🎨

+ strangerzonehf/Flux-Sketch-Smudge-LoRA
+ strangerzonehf/Flux-Sketch-Sized-LoRA
+ strangerzonehf/Sketch-Paint

- strangerzonehf/sketch-fav-675ba869c7ceaec7e652ee1c

singhsidhukuldeep

posted an update 5 days ago

Post

3532

Exciting breakthrough in AI: @Meta 's new Byte Latent Transformer (BLT) revolutionizes language models by eliminating tokenization!

The BLT architecture introduces a groundbreaking approach that processes raw bytes instead of tokens, achieving state-of-the-art performance while being more efficient and robust. Here's what makes it special:

>> Key Innovations
Dynamic Patching: BLT groups bytes into variable-sized patches based on entropy, allocating more compute power where the data is more complex. This results in up to 50% fewer FLOPs during inference compared to traditional token-based models.

Three-Component Architecture:
• Lightweight Local Encoder that converts bytes to patch representations
• Powerful Global Latent Transformer that processes patches
• Local Decoder that converts patches back to bytes

>> Technical Advantages
• Matches performance of Llama 3 at 8B parameters while being more efficient
• Superior handling of non-English languages and rare character sequences
• Remarkable 99.9% accuracy on spelling tasks
• Better scaling properties than token-based models

>> Under the Hood
The system uses an entropy model to determine patch boundaries, cross-attention mechanisms for information flow, and hash n-gram embeddings for improved representation. The architecture allows simultaneous scaling of both patch and model size while maintaining fixed inference costs.

This is a game-changer for multilingual AI and could reshape how we build future language models. Excited to see how this technology evolves!

2 replies

·

jjzha

authored a paper 6 days ago

SnakModel: Lessons Learned from Training an Open Danish Large Language Model

Paper • 2412.12956 • Published 9 days ago • 1

prithivMLmods

posted an update 7 days ago

Post

2085

Qwen2VL Models: Vision and Language Processing 🍉

📍FT; [ Latex OCR, Math Parsing, Text Analogy OCRTest ]

❄️Demo : prithivMLmods/Qwen2-VL-2B . The demo includes the Qwen2VL 2B Base Model.

🎯The space handles documenting content from the input image along with standardized plain text. It includes adjustment tools with over 30 font styles, file formatting support for PDF and DOCX, textual alignments, font size adjustments, and line spacing modifications.

📄PDFs are rendered using the ReportLab software library toolkit.

🧵Models :
+ prithivMLmods/Qwen2-VL-OCR-2B-Instruct
+ prithivMLmods/Qwen2-VL-Ocrtest-2B-Instruct
+ prithivMLmods/Qwen2-VL-Math-Prase-2B-Instruct

🚀Sample Document :
+ https://drive.google.com/file/d/1Hfqqzq4Xc-3eTjbz-jcQY84V5E1YM71E/view?usp=sharing

📦Collection :
+ prithivMLmods/vision-language-models-67639f790e806e1f9799979f

.
.
.
@prithivMLmods 🤗

1 reply

·

takarajordan

posted an update 7 days ago

Post

1054

I made an RSS feed for HuggingFace Daily Papers!! 🤗

Just Subscribe here: https://papers.takara.ai/api/feed

It updates every 24 hours, completely written as a serverless go script with a Redis cache (to avoid hitting HF all the time).

I'm open sourcing the code, you can check out my repo and deploy it on Vercel extremely easily!
https://github.com/404missinglink/HF-Daily-Papers-Feeds

thanks to @John6666 @p3nGu1nZz for your early support

prithivMLmods

posted an update 8 days ago

Post

3172

🎄 Here Before - Xmas🎅✨

🧑🏻‍🎄Models
+ [ Xmas 2D Illustration ] : strangerzonehf/Flux-Xmas-Illustration-LoRA
+ [ Xmas 3D Art ] : strangerzonehf/Flux-Xmas-3D-LoRA
+ [ Xmas Chocolate ] : strangerzonehf/Flux-Xmas-Chocolate-LoRA
+ [ Xmas Isometric Kit ] : strangerzonehf/Flux-Xmas-Isometric-Kit-LoRA
+ [ Xmas Realpix ] : strangerzonehf/Flux-Xmas-Realpix-LoRA
+ [ Xmas Anime ] : strangerzonehf/Flux-Anime-Xmas-LoRA

❄️Collections
+ [ Xmas Art ] : strangerzonehf/christmas-pack-6758b199487adafaddb68f82
+ [ Stranger Zone Collection ] : prithivMLmods/stranger-zone-collections-org-6737118adcf2cb40d66d0c7e

🥶Page
+ [ Stranger Zone ] : https://huggingface.co./strangerzonehf

.
.
.
@prithivMLmods 🤗

peaceAsh

authored a paper 9 days ago

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published 16 days ago • 25

asusevski

authored a paper 9 days ago

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published 16 days ago • 25

jjzha

authored a paper 10 days ago

Leveraging Large Language Models for Actionable Course Evaluation Student Feedback to Lecturers

Paper • 2407.01274 • Published Jul 1 • 1

prithivMLmods

posted an update 12 days ago

Post

2659

strangerzonehf/Flux-Sketch-Flat-LoRA

singhsidhukuldeep

posted an update 12 days ago

Post

1228

Groundbreaking Research Alert: The 'H' in HNSW Stands for "Hubs", Not "Hierarchy"!

Fascinating new research reveals that the hierarchical structure in the popular HNSW (Hierarchical Navigable Small World) algorithm - widely used for vector similarity search - may be unnecessary for high-dimensional data.

🔬 Key Technical Findings:

• The hierarchical layers in HNSW can be completely removed for vectors with dimensionality > 32, with no performance loss

• Memory savings of up to 38% achieved by removing the hierarchy

• Performance remains identical in both median and tail latency cases across 13 benchmark datasets

🛠️ Under The Hood:
The researchers discovered that "hub highways" naturally form in high-dimensional spaces. These hubs are well-connected nodes that are frequently traversed during searches, effectively replacing the need for explicit hierarchical layers.

The hub structure works because:
• A small subset of nodes appear disproportionately in nearest neighbor lists
• These hub nodes form highly connected subgraphs
• Queries naturally traverse through these hubs early in the search process
• The hubs efficiently connect distant regions of the graph

💡 Industry Impact:
This finding has major implications for vector databases and similarity search systems. Companies can significantly reduce memory usage while maintaining performance by implementing flat navigable small world graphs instead of hierarchical ones.

🚀 What's Next:
The researchers have released FlatNav, an open-source implementation of their flat navigable small world approach, enabling immediate practical applications of these findings.

shayekh

authored 2 papers 13 days ago

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

Paper • 2308.10647 • Published Aug 21, 2023

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published 16 days ago • 25

mridul3301

authored a paper 14 days ago

Development of Pre-Trained Transformer-based Models for the Nepali Language

Paper • 2411.15734 • Published Nov 24

singhsidhukuldeep

posted an update 14 days ago

Post

455

Fascinating new research alert! Just read a groundbreaking paper on understanding Retrieval-Augmented Generation (RAG) systems and their performance factors.

Key insights from this comprehensive study:

>> Architecture Deep Dive
The researchers analyzed RAG systems across 6 datasets (3 code-related, 3 QA-focused) using multiple LLMs. Their investigation revealed critical insights into four key design factors:

Document Types Impact:
• Oracle documents (ground truth) aren't always optimal
• Distracting documents significantly degrade performance
• Surprisingly, irrelevant documents boost code generation by up to 15.6%

Retrieval Precision:
• Performance varies dramatically by task
• QA tasks need 20-100% retrieval recall
• Perfect retrieval still fails up to 12% of the time on previously correct instances

Document Selection:
• More documents ≠ better results
• Adding documents can cause errors on previously correct samples
• Performance degradation increases ~1% per 5 additional documents in code tasks

Prompt Engineering:
• Most advanced prompting techniques underperform simple zero-shot prompts
• Technique effectiveness varies significantly across models and tasks
• Complex prompts excel at difficult problems but struggle with simple ones

>> Technical Implementation
The study utilized:
• Multiple retrievers including BM25, dense retrievers, and specialized models
• Comprehensive corpus of 70,956 unique API documents
• Over 200,000 API calls and 1,000+ GPU hours of computation
• Sophisticated evaluation metrics tracking both correctness and system confidence

💡 Key takeaway: RAG system optimization requires careful balancing of multiple factors - there's no one-size-fits-all solution.

1 reply

·

takarajordan

posted an update 15 days ago

Post

2201

I'm super excited to release my first open-source text dataset:

WorldScenario 20K is a novel dataset of 20,000 synthetically generated multi-stakeholder scenarios designed to simulate real-world decision-making processes. Each scenario explores a unique environmental, societal, or economic issue.

I used the brand new meta-llama/Llama-3.3-70B-Instruct model to generate this dataset and I put the dataset through some post processing to clean and evaluate the dataset for diversity.

I'd appreciate some feedback and thoughts on my new release! Thanks!

takarajordan/WorldScenario_20K