Lighthouz AI

Recent Activity (lighthouzai's activity)

clefourrier posted an update about 2 hours ago
The Gemma3 family is out! I was reading the tech report, and this section was really interesting to me from a methods/scientific-fairness point of view.

Instead of making over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say so.)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but potentially suboptimal for a given model family (just as some users interact sub-optimally with models).

It also contains a cool section (6) on training-data memorization rates! It's important to check whether your model will output the training data it has seen verbatim: always an issue for privacy/copyright/..., but very much for evaluation too!

Because if your model knows its evals by heart, you're not testing for generalization.
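If you want a feel for what a memorization check looks like in practice, here is a minimal sketch, assuming a Hugging Face causal LM. This is a simplified probe of the general idea, not the report's exact protocol: prompt with a prefix taken from the training data and see whether greedy decoding regurgitates the true continuation.

```python
# Minimal verbatim-memorization probe (illustrative, NOT the report's protocol).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; swap in the one you want to audit
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def is_memorized(text: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    """Return True if the model greedily completes a training-data prefix
    with the exact continuation that follows it in `text`."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < prefix_len + suffix_len:
        return False
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len:prefix_len + suffix_len]
    out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
    continuation = out[0][prefix_len:]
    if continuation.shape[0] < suffix_len:
        return False  # model stopped early, so no verbatim match
    return bool((continuation[:suffix_len] == target).all())
```

Run this over a sample of training documents and the fraction flagged gives you a rough memorization rate.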
clefourrier posted an update 11 months ago
In basic chatbots, errors are annoyances. In medical LLMs, they can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard: to get reproducible, comparable results across LLMs and let everyone follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co./blog/leaderboard-medicalllm
clefourrier posted an update 11 months ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard which contains:
- complete code evaluations (code generation, self-repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means you can average model scores only over problems published after a model's training-data cutoff. This means... contamination-free code evals! 🚀
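The core idea is simple enough to sketch in a few lines. The field names (`release_date`, `score`) and the cutoff date below are illustrative assumptions for the example, not necessarily LiveCodeBench's actual schema:

```python
# Sketch of date-filtered scoring: only problems released after the model's
# training cutoff count, so none of them can be in its training set.
from datetime import date

problems = [
    {"id": "p1", "release_date": date(2023, 6, 1), "score": 1.0},
    {"id": "p2", "release_date": date(2024, 2, 15), "score": 0.0},
    {"id": "p3", "release_date": date(2024, 3, 1), "score": 1.0},
]

def contamination_free_score(problems, training_cutoff: date) -> float:
    """Average scores only over problems released after the cutoff."""
    fresh = [p["score"] for p in problems if p["release_date"] > training_cutoff]
    return sum(fresh) / len(fresh) if fresh else float("nan")

# A model trained on data up to Sept 2023 is scored only on p2 and p3:
print(contamination_free_score(problems, date(2023, 9, 1)))  # 0.5
```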

Check it out!

Blog: https://huggingface.co./blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
clefourrier posted an update 11 months ago
🆕 Evaluate your RL agents - who's best at Atari? 🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations 🚶 and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard
clefourrier posted an update 11 months ago
Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompt formats (all present in the literature, from a bare `question?` to `Question: question?\nChoices: <enumeration of all choices>\nAnswer:`), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

(Plot: prompt format on the x axis. All these evals look at the logprob of either "choice A"/"choice B"/... or "A"/"B"/....)

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
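To make "looking at the logprob of the choices" concrete, here is a minimal sketch with two prompt formats, assuming a Hugging Face causal LM. The model, template names, and toy question are all illustrative assumptions, not the leaderboard's actual harness:

```python
# Sketch: multiple-choice scoring via choice logprobs under different
# prompt formats. Model, templates, and question are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

TEMPLATES = {
    "bare": "{q}",
    "qa": "Question: {q}\nChoices: {choices}\nAnswer:",
}

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum log P(choice tokens | prompt) under the model.
    (Simplified: assumes tokenizing prompt and prompt+choice agree on the
    prompt's token boundary, which holds in most BPE cases.)"""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(-1)
    total = 0.0
    for i in range(n_prompt, full.shape[1]):
        # token i is predicted from position i-1
        total += logprobs[0, i - 1, full[0, i]].item()
    return total

q, choices = "What is 2 + 2?", ["3", "4", "5"]
for name, tmpl in TEMPLATES.items():
    prompt = tmpl.format(q=q, choices=" / ".join(choices))
    pred = max(choices, key=lambda c: choice_logprob(prompt, c))
    print(f"{name}: model picks {pred}")
```

Swapping the template while keeping everything else fixed is exactly the kind of change that moves scores by several points, and can reorder models.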