
de Rodrigo

de-Rodrigo

AI & ML interests

Synthetic Datasets, Multimodal LLMs, Computer Vision

Recent Activity

updated a dataset about 11 hours ago
de-Rodrigo/merit
updated a dataset about 12 hours ago
de-Rodrigo/dummy
updated a dataset about 12 hours ago
de-Rodrigo/dummy

Organizations

The Hidden Gallery, CICLAB Comillas ICAI

Posts (2)

MERIT Dataset 🎒📃🏆 Updates: The Token Classification Version is Now Live on the Hub!

This new version extends the previous dataset by providing richer labels that include word bounding boxes alongside the already available images. 🚀

We can't wait to see how you use this update! Give it a try, and let us know your thoughts, questions, or any cool projects you build with it. 💡

Resources:

- Dataset: de-Rodrigo/merit
- Code and generation pipeline: https://github.com/nachoDRT/MERIT-Dataset
- Paper: The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts (2409.00447)
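As a rough illustration of what token-classification labels with word-level bounding boxes look like, here is a minimal sketch. The field names, label set, and coordinate convention below are illustrative assumptions, not MERIT's actual schema; the 0-1000 normalization shown is the convention used by common layout-aware models.

```python
# Hypothetical token-classification record with word bounding boxes,
# similar in spirit to what the post describes. Field names and labels
# are illustrative assumptions, not the dataset's actual schema.
record = {
    "words": ["Mathematics", "A+"],
    "bboxes": [[82, 120, 310, 150], [340, 120, 395, 150]],  # x0, y0, x1, y1 in pixels
    "labels": ["SUBJECT", "GRADE"],
}

def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates to the 0-1000 range used by layout-aware models."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# Normalize against the page size (example dimensions, e.g. an A4-like render).
normalized = [normalize_bbox(b, width=1000, height=1414) for b in record["bboxes"]]
```

Once normalized, word/box/label triples in this shape can be fed to token-classification pipelines for visually rich documents.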
A few weeks ago, we uploaded the MERIT Dataset 🎒📃🏆 to Hugging Face 🤗!

Now, we are excited to share the MERIT Dataset paper via arXiv! 📃💫
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts (2409.00447)

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. 🔧🔨

MERIT contains synthetically rendered students' transcripts of records from different schools in English and Spanish. We plan to expand the dataset into different contexts (synthetic medical/insurance documents, synthetic IDs, etc.). Want to collaborate? Do you have any feedback? 🧐

Resources:

- Dataset: de-Rodrigo/merit
- Code and generation pipeline: https://github.com/nachoDRT/MERIT-Dataset

PS: We are grateful to Hugging Face 🤗 for the fantastic tools and resources available on the platform and, more specifically, to @nielsr for sharing the fine-tuning/inference scripts we used in our benchmark.