Open source CC-BY dataset and classifier?
Hey ML6
Great work on this blog post. Is the dataset and/or classifier available? It would be great to use it on community projects.
Hi,
Yes, the dataset is available here: https://huggingface.co/datasets/fondant-ai/fondant-cc-25m
@nielsr Is the algorithm somewhere that can be shared? This blog post sort of describes it (https://blog.ml6.eu/ai-image-generation-without-copyright-infringement-a9901b64541c) but seeing an implementation could be helpful.
The code is here: https://github.com/ml6team/fondant-usecase-filter-creative-commons
@nielsr Unless I am missing it, that's just the download code, not the processing code that was used to identify copyrighted material.
@BramVanroy is @burtenshaw part of your team? Can you send me your email addresses through LinkedIn? We might be able to deliver the code to you. (My name is Robin Van Craenenbroek.)
@RobinVC I think @burtenshaw works at Argilla!
@burtenshaw and @BramVanroy we will refactor our code and make the source code publicly available on the ML6team GitHub page. This will take a bit of time (probably 1-2 weeks max), but I will keep you posted on the release.
Awesome, thanks!
@BramVanroy, @burtenshaw We published a branch containing the dataset extraction logic on our GitHub page: https://github.com/ml6team/fondant-usecase-filter-creative-commons/tree/add-fondant-usecase-cc-image-extraction The code has not been fully cleaned or documented yet, but you can already use it as inspiration. The dataset extraction logic is in the image_extraction folder. This branch will be merged to main once it's presentable enough. Hope this helps!