
Prithiv Sakthi

prithivMLmods

AI & ML interests

Computer vision, NLP, multimodality, adapters @strangerzonehf @strangerguardhf

Recent Activity

updated a Space 11 minutes ago
prithivMLmods/FLUX-LoRA-DLC
updated a Space about 6 hours ago
prithivMLmods/Callisto-OCR3-2B
updated a Space about 6 hours ago
prithivMLmods/FLUX-LoRA-DLC2

Organizations

Stanford AI, DataScienceEngineering, AI FILMS, Samsung Electronics, MISATO-dataset, Masakhane NLP, GEM benchmark, OpenGVLab, MusicAI, BigScience Biomedical Datasets, OpenVINO Toolkit, LLMs, ONNXConfig for all, Gradio-Themes-Party, Georgia Tech (Georgia Institute of Technology), scikit-learn, lora concepts library, DeepGHS, Open-Source AI Meetup, Literally Me FRFR Research Society, East China Normal University, Kornia AI, Université Dauphine-PSL, Platzi Community, Tune a video concepts library, Keras Dreambooth Event, Stable Diffusion Dreambooth Concepts Library, The Waifu Research Department, Musika, Blog-explorers, OpenSky, AI Tamil Nadu, OpenLLM France, huggingPartyParis, Team Tonic, Johns Hopkins University, That Time I got Reincarnated as a Hugging Face Organization, LocalLLaMA, Major TOM, MLX Community, Cohere Labs Community, M4-ai, Chinese LLMs on Hugging Face, ONNX Community, Dataset Tools, Nerdy Face, Stranger Zone, open/ acc, Data Is Better Together Contributor, None yet, Taiwan Llama, Doge Face, Stranger Guard, Text Analysis, Understanding, and Reasoning Development, Twinkle AI

prithivMLmods's activity

posted an update about 13 hours ago
Bringing out style-intermixing adapters for Flux.Dev, including Aura Glow, Fallen Ink Art, Cardboard Paper Arts, Black & White Expressions, and Glitter Gem Touch. For more details, visit each LoRA's model card. 🥳

╰┈➤ Adapters :
+ Aura Glow : strangerzonehf/2DAura-Flux
+ Fallen Ink Art : strangerzonehf/FallenArt-Flux
+ Black & White Expressions : strangerzonehf/BnW-Expressions-Flux
+ Glitter Gem Touch : strangerzonehf/Gem-Touch-LoRA-Flux
+ Cardboard Paper Arts v1 : strangerzonehf/Flux-Cardboard-Art-LoRA
+ Cardboard Paper Arts v2 : strangerzonehf/Cardboard-v2-Flux

╰┈➤ Pages :
- Repository Page : strangerzonehf
- Collection : strangerzonehf/mixer-adp-042025-68095c365d9d1072c8d860be
- Flux Ultimate LoRA Collection : strangerzonehf/Flux-Ultimate-LoRA-Collection
- Demo : prithivMLmods/FLUX-LoRA-DLC2 & prithivMLmods/FLUX-LoRA-DLC
- By prithivMLmods : @prithivMLmods

Recommended dimensions and inference settings: a resolution of 1280 x 832 (3:2 aspect ratio) gives the best quality, while 1024 x 1024 (1:1) is the default; 30 to 35 inference steps work best.
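To reproduce those settings locally, here is a minimal diffusers sketch (assuming the adapters load like standard Flux LoRAs on top of FLUX.1-dev; the prompt is a placeholder, so check each model card for trigger words):

```python
# Minimal sketch, not the exact demo pipeline: one of the listed LoRAs applied to
# FLUX.1-dev with the recommended 1280 x 832 resolution and 30-35 steps.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("strangerzonehf/2DAura-Flux")  # Aura Glow adapter

image = pipe(
    "Aura Glow, portrait of a dancer under neon light",  # placeholder prompt
    width=1280, height=832,         # recommended 3:2 resolution
    num_inference_steps=32,         # within the suggested 30-35 range
    guidance_scale=3.5,
).images[0]
image.save("aura-glow.png")
```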
reacted to merve's post with 🔥 about 17 hours ago
Don't sleep on Meta's new vision-language release! 🔥

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped Swiss Army knives for vision under an Apache 2.0 license 👍
> image/video encoders for vision-language modelling and spatial understanding (object detection, etc.) 👍
> The vision LM outperforms InternVL3 and Qwen2.5VL 👍
> They also release gigantic video and image datasets

The authors set out to build a single, versatile vision encoder that can be aligned to a diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest SOTA, SigLIP2 👍



> Among the fine-tuned ones, the first is PE-Spatial. It's a model for bounding-box detection, segmentation, and depth estimation, and it outperforms all other models 😮



> The second is PLM, Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)

The authors release the following checkpoints in sizes base, large and giant:

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets



Authors release the following datasets 📑
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
posted an update 2 days ago
Dropping domain-specific downstream image-classification content-moderation models, including anime image type classification, GeoSceneNet, indoor-outdoor scene classification, and black-and-white vs. colored image classification models, along with the datasets. 🔥

╰┈➤ Models :
+ GeoSceneNet : prithivMLmods/Multilabel-GeoSceneNet
+ IndoorOutdoorNet : prithivMLmods/IndoorOutdoorNet
+ B&W vs Colored : prithivMLmods/BnW-vs-Colored-Detection
+ Anime Image Type : prithivMLmods/Anime-Classification-v1.0
+ Multilabel Portrait : prithivMLmods/Multilabel-Portrait-SigLIP2

╰┈➤ Datasets :
- GeoSceneNet : prithivMLmods/Multilabel-GeoSceneNet-16K
- IndoorOutdoorNet : prithivMLmods/IndoorOutdoorNet-20K
- BnW vs Colored : prithivMLmods/BnW-vs-Colored-10K
- Multilabel Portrait : prithivMLmods/Multilabel-Portrait-18K

╰┈➤ Collections :
> Multilabel Image Classification Datasets : prithivMLmods/multilabel-image-classification-datasets-6809aa64637f45d4c47fa6ca
> Model Collection : prithivMLmods/siglip2-content-filters-models-v2-68053a958c42ef17a3a3f4d1

Note: The anime scene type dataset is not mentioned in the list because it is private and only accessible to members of the DeepGHS organization.

For raw ZIP files or more information about the datasets, visit: https://www.kaggle.com/prithivsakthiur/datasets
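For a quick look at one of the released datasets, a minimal `datasets` sketch (the split name and column layout are assumptions; check the dataset card):

```python
# Minimal sketch: load one of the listed datasets and inspect a sample.
# The "train" split and the column names are assumptions -- see the dataset card.
from datasets import load_dataset

ds = load_dataset("prithivMLmods/IndoorOutdoorNet-20K", split="train")
print(ds)            # features and number of rows
print(ds[0].keys())  # e.g. an image column and a label column
```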
reacted to fdaudens's post with 🔥 2 days ago
reacted to linoyts's post with 👍 4 days ago
reacted to davidberenstein1957's post with 🧠🧠 8 days ago
reacted to philschmid's post with 🔥 8 days ago
Gemini 2.5 Flash is here! We are excited to launch our first hybrid reasoning Gemini model. In 2.5 Flash, developers can turn thinking off.

**TL;DR:**
- 🧠 Controllable "Thinking" with a thinking budget of up to 24k tokens
- 🌌 1 million token multimodal input context for text, image, video, audio, and PDF
- 🛠️ Function calling, structured output, Google Search & code execution
- 🐦 $0.15 per 1M input tokens; $0.60 or $3.50 (thinking on) per 1M output tokens (thinking tokens are billed as output tokens)
- 💡 Knowledge cutoff of January 2025
- 🚀 Rate limits: free tier 10 RPM, 500 requests/day
- 🏅 Outperforms 2.0 Flash on every benchmark

Try it ⬇️
https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17
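A minimal sketch of controlling the thinking budget from Python with the google-genai SDK (the budget value and prompt are arbitrary; setting thinking_budget=0 turns thinking off):

```python
# Minimal sketch, assuming the google-genai SDK (pip install google-genai).
# thinking_budget=0 disables thinking; larger values (up to ~24k) enable it.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Explain hybrid reasoning in two sentences.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)
```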
posted an update 8 days ago
Dropping an entire collection of Style Intermixing Adapters on StrangerZone HF, including Realism, Anime, Sketch, Texture-Rich 3D Experimentals, Automotive Concept Images, and LoRA models based on Flux.1, SD 3.5 Turbo/Large, and Stable Diffusion XL 🎨

╰┈➤ Collection :
➜ sketch : strangerzonehf/sketch-fav-675ba869c7ceaec7e652ee1c
➜ sketch2 : strangerzonehf/q-series-sketch-678e3503bf3a661758429717
➜ automotive : strangerzonehf/automotive-3d-675bb31a491d8c264d45d843
➜ texture 3d : strangerzonehf/flux-3dxl-engine-674833c14a001d5b1fdb5139
➜ super 3d : strangerzonehf/super-3d-engine-6743231d69f496df97addd2b
➜ style mix : strangerzonehf/mixer-engine-673582c9c5939d8aa5bf9533
➜ realism : strangerzonehf/realism-engine-67343495b6daf0fbdb904cc1

╰┈➤ The Entire Collection :
➜ flux.1 : prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be
➜ flux-ultimate-lora-collection : strangerzonehf/Flux-Ultimate-LoRA-Collection
➜ sd 3.5 large / turbo : prithivMLmods/sd-35-large-lora-671b39d7bc2e7f71a446b163
➜ sdxl : prithivMLmods/sdxl-dev-models-667803a6d5ac75b59110e527

╰┈➤ Pages :
➜ page 1 : strangerzonehf
➜ page 2 : @prithivMLmods
➜ demo : prithivMLmods/FLUX-LoRA-DLC

🤗
posted an update 10 days ago
Try out the demo for Multimodal OCR, featuring implementations of models including RolmOCR and Qwen2VL OCR. The use case showcases image-text-to-text conversion, with video-understanding support for the RolmOCR model! 🚀

🤗 Multimodal OCR Space : prithivMLmods/Multimodal-OCR

📦 The models implemented in this Space are:
+ Qwen2VL OCR : prithivMLmods/Qwen2-VL-OCR-2B-Instruct [ or ]
+ Qwen2VL OCR2 : prithivMLmods/Qwen2-VL-OCR2-2B-Instruct
+ RolmOCR : reducto/RolmOCR

Qwen2VL OCR supports only image-text-to-text in the space.
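To run the Qwen2VL OCR model outside the Space, a minimal transformers sketch (assuming the checkpoint follows the standard Qwen2-VL loading path; the image URL and prompt are placeholders):

```python
# Minimal sketch, assuming prithivMLmods/Qwen2-VL-OCR-2B-Instruct loads like a
# standard Qwen2-VL checkpoint; the image URL and prompt are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "prithivMLmods/Qwen2-VL-OCR-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/receipt.jpg"},
        {"type": "text", "text": "Extract all the text in this image."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```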
reacted to danielhanchen's post with 🔥 18 days ago
reacted to onekq's post with 🚀 19 days ago
posted an update 19 days ago
Loaded some domain-specific downstream image-classification models for content moderation (essentially the practice of monitoring and filtering user-generated content on platforms), based on SigLIP-2 Base Patch16 with newly initialized trainable parameters. 🥠

+ Age-Classification-SigLIP2 : prithivMLmods/Age-Classification-SigLIP2
[ Age range classification from 0 to 65+ years ]
+ Facial-Emotion-Detection-SigLIP2 : prithivMLmods/Facial-Emotion-Detection-SigLIP2
[ Designed to classify different facial emotions ]
+ Hand-Gesture-2-Robot : prithivMLmods/Hand-Gesture-2-Robot
[ Human Hand Gesture Classification for Robot Control ]
+ Mature-Content-Detection : prithivMLmods/Mature-Content-Detection
[ Mature [adult] or neutral content categories ]
+ Vit-Mature-Content-Detection : prithivMLmods/Vit-Mature-Content-Detection
[ Mature [adult] or neutral content categories ft. ViT]
+ Human-Action-Recognition : prithivMLmods/Human-Action-Recognition
[ Human actions including clapping, sitting, running, and more ]
+ Mirage-Photo-Classifier : prithivMLmods/Mirage-Photo-Classifier
[ Whether an image is real or AI-generated (fake) ]
+ Food-101-93M : prithivMLmods/Food-101-93M
[ Classify food images into one of 101 popular dishes ]
+ Hand-Gesture-19 : prithivMLmods/Hand-Gesture-19
[ Classify hand gesture images into different categories ]
+ Trash-Net : prithivMLmods/Trash-Net
[ Classification of trash into six distinct categories ]
+ Gender-Classifier-Mini : prithivMLmods/Gender-Classifier-Mini
[ Classify images based on gender [Male / Female] ]

🎡 Collections :

+ SigLIP2 Content Filters : https://huggingface.co./collections/prithivMLmods/siglip2-content-filters-models-67f001055ec2bed56ca41f6d
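A minimal sketch for trying one of these classifiers (the image path is a placeholder; the label set for each model is listed on its model card):

```python
# Minimal sketch: the SigLIP2-based classifiers above should work with the
# standard image-classification pipeline. The image path is a placeholder.
from transformers import pipeline

classifier = pipeline(
    "image-classification",
    model="prithivMLmods/Facial-Emotion-Detection-SigLIP2",
)
print(classifier("face.jpg"))  # list of {label, score} dicts, highest score first
```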
reacted to hesamation's post with ❤️ 20 days ago
The best researchers from Yale, Stanford, Google DeepMind, and Microsoft laid out all we know about agents in a 264-page paper [book].

Here are some of their key findings:

They build a mapping of different agent components, such as perception, memory, and world modelling, to different regions of the human brain and compare them:

- the brain is much more energy-efficient
- agents lack genuine experience
- the brain learns continuously, while agents are static

An agent is broken down into:
- Perception: the agent's input mechanism. Can be improved with multi-modality, feedback mechanisms (e.g., human corrections), etc.
- Cognition: learning, reasoning, planning, memory. LLMs are key in this part.
- Action: the agent's output and tool use.

Agentic memory is represented as:
- Sensory memory: short-term holding of inputs, not emphasized much in agents.
- Short-term memory: the LLM context window.
- Long-term memory: external storage such as RAG or knowledge graphs.

The memory in agents can be improved and researched in terms of:
- increasing the amount of stored information
- how to retrieve the most relevant info
- combining context-window memory with external memory
- deciding what to forget or update in memory
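As a rough illustration of those three memory tiers (purely a toy sketch of the taxonomy above, not code from the paper; all names here are invented):

```python
# Toy sketch of the sensory / short-term / long-term split described above.
# The classes, fields, and retrieval heuristic are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    sensory_buffer: list = field(default_factory=list)    # raw, short-lived inputs
    context_window: list = field(default_factory=list)    # short-term: what the LLM sees
    long_term_store: dict = field(default_factory=dict)   # e.g. a RAG index or knowledge graph

    def remember(self, key, fact):
        """Persist a fact beyond the context window; what to keep or forget is an open question."""
        self.long_term_store[key] = fact

    def build_context(self, query, k=3):
        """Combine context-window memory with retrieved long-term facts (naive keyword match)."""
        retrieved = [v for kk, v in self.long_term_store.items() if kk in query][:k]
        return self.context_window + retrieved
```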

The agent must simulate or predict the future states of the environment for planning and decision-making.

AI world models are much simpler than humans', which rely on causal reasoning (cause and effect) and physical intuition.

LLM world models are mostly implicit and embedded.

EMOTIONS are a deep aspect of humans, helping them with social interactions, decision-making, or learning.

Agents must understand emotions to better interact with us.

But rather than encoding the feeling of emotions, agents only have a surface-level model of emotions.

Perception is the process by which an agent receives and interprets raw data from its surroundings.

READ PAPER: Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems (2504.01990)
posted an update 20 days ago
ChatGPT-4o's image generation goes wild for a week, featuring everything from Studio Ghibli-style art and image colorization to style intermixing. Here are some examples showcasing the generation of highly detailed images from freestyle design templates. Want to know more? Check out the blog 🚀

🔗 Blog : https://huggingface.co./blog/prithivMLmods/chatgpt-4o-image-gen
replied to their post 23 days ago

There is nothing intended for commercial use or profit; this is purely for experimental purposes with models based on voice essences. I have adhered strictly to the base model I used, specifically the 0th version. Even the 'Orpheus' models, which are licensed under Apache-2.0, follow their own policies and alignment. I will ensure compliance with the model I have post-trained and its licenses, specifically Llama 3.2. I am not claiming ownership of the model; everything in it is calibrated within the framework of Llama 3.2.

So, I will continue following the work I have done. The point is, if someone intends to use the model, they must also adhere to the license of the original material that I have. @JLouisBiz

reacted to hesamation's post with ❤️ 24 days ago
What, How, Where, and How Well? This paper reviews test-time scaling methods and all you need to know about them:
> parallel, sequential, hybrid, internal scaling
> how to scale (SFT, RL, search, verification)
> metrics and evals of test-time scaling

🔗 paper: What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models (2503.24235)

If you want to learn what inference-time compute scaling is, @rasbt has a great blog post on that:
https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling
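As a toy illustration of the parallel scaling + verification combination the survey covers (the generator and scorer below are stand-ins invented for illustration; in practice both would be LLM or reward-model calls):

```python
# Toy sketch of parallel test-time scaling: best-of-N sampling with a verifier.
import random

def generate_candidate(prompt: str) -> str:
    """Stand-in for sampling one answer from an LLM."""
    return f"{prompt} -> candidate #{random.randint(0, 999)}"

def verify(answer: str) -> float:
    """Stand-in for a verifier / reward model that scores an answer."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample N candidates in parallel and keep the one the verifier prefers."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

print(best_of_n("Solve: 17 * 24"))
```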
replied to their post 27 days ago

@JLouisBiz

But the model is licensed under Llama 3.2, on which the base model is also built. The License Rights and Redistribution section states that the grant of rights allows using the content for derivative works and modifications of the Llama materials, provided that 'Built with Llama' is properly mentioned and prominently displayed wherever the materials are used. I believe I have properly mentioned that and have not overruled anything from the license.

I provided a copy of the license, included 'Llama' at the beginning of the model's name, and mentioned in the model's 'About' section that it is built on Llama.

" If you use the Llama Materials or any outputs or results of the Llama Materials to ๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜๐—ฒ, ๐˜๐—ฟ๐—ฎ๐—ถ๐—ป, ๐—ณ๐—ถ๐—ป๐—ฒ ๐˜๐˜‚๐—ป๐—ฒ, ๐—ผ๐—ฟ
๐—ผ๐˜๐—ต๐—ฒ๐—ฟ๐˜„๐—ถ๐˜€๐—ฒ ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ ๐—ฎ๐—ป ๐—”๐—œ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น, ๐˜„๐—ต๐—ถ๐—ฐ๐—ต ๐—ถ๐˜€ ๐—ฑ๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ฒ๐—ฑ ๐—ผ๐—ฟ ๐—บ๐—ฎ๐—ฑ๐—ฒ ๐—ฎ๐˜ƒ๐—ฎ๐—ถ๐—น๐—ฎ๐—ฏ๐—น๐—ฒ, ๐˜†๐—ผ๐˜‚ ๐˜€๐—ต๐—ฎ๐—น๐—น ๐—ฎ๐—น๐˜€๐—ผ ๐—ถ๐—ป๐—ฐ๐—น๐˜‚๐—ฑ๐—ฒ โ€œ๐—Ÿ๐—น๐—ฎ๐—บ๐—ฎโ€
at the beginning of any such AI model name. "

Please refer to the Llama 3.2 License [ https://huggingface.co./meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt ], specifically the License Rights and Redistribution section, clauses (a) and (b).

posted an update 27 days ago
Luna, a single-speaker text-to-speech model, features a radio/ATCOSIM-style sound with a female voice. It offers authentic radio-podcast noise and empathetic speech generation, fine-tuned from Orpheus's Llama-based, state-of-the-art speech generation model. 🎙️

+ Model : prithivMLmods/Llama-3B-Mono-Luna
+ Collection : prithivMLmods/clean-radio-mono-voice-67e76fe1b3a87cc3bccef803
+ Reference ft : https://github.com/canopyai/Orpheus-TTS
+ Base Model : canopylabs/orpheus-3b-0.1-ft

I also tried some other clean-voice single-speaker models based on Orpheus. If you're interested, check out the collection.

🔉 Try the Mono Luna demo here: http://colab.research.google.com/drive/1K0AAIOKDE5XE0znxXaiiUJvPSpFveteK
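A rough sketch of the token-generation half only (the checkpoint is Llama-based, so it loads as a causal LM; the prompt format is a placeholder, and turning the generated audio tokens into a waveform requires the SNAC decoding described in the Orpheus-TTS reference repo):

```python
# Rough sketch, assumptions flagged: the model loads as a plain Llama causal LM,
# but the generated ids are audio tokens that still need SNAC decoding (see the
# Orpheus-TTS repo); the prompt format below is a placeholder, not the official one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prithivMLmods/Llama-3B-Mono-Luna"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Luna: Welcome back to the late-night radio show."  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
audio_token_ids = model.generate(**inputs, max_new_tokens=1200, do_sample=True, temperature=0.7)
# Decode audio_token_ids to a waveform with SNAC as in the Orpheus-TTS repo (not shown).
```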
reacted to AdinaY's post with 🔥 about 1 month ago
A new OPEN omni model just dropped by @Alibaba_Qwen on the Hub 🔥🤯

Qwen2.5-Omni: a 7B end-to-end multimodal model
Qwen/Qwen2.5-Omni-7B

✨ Thinker-Talker architecture
✨ Real-time voice & video chat
✨ Natural speech generation
✨ Handles text, image, audio & video