Why does it generate "arafed" so much?
I have the same problem, and the grammar is wrong too. An image of chunks of ice floating in the sea got:
"arafed ice floess are floating in the water near the shore"
I have no clue what 'arafed' or 'floess' are.
Googling 'arafed' led me to this dataset: https://huggingface.co/datasets/multimodalart/facesyntheticsspigacaptioned
I'm assuming they might have trained on it for some weird reason?
Yeah, I believe it's primarily because of that dataset, which has a lot of "arafed" in it for whatever reason. I just made a little function to remove that word. "Arafed" is basically just prepended to the caption, so it can normally be removed and the caption is still grammatically correct. You could also fine-tune the model, which would probably help.
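In case it helps anyone, here is a minimal sketch of what such a function could look like. It is not the exact code I used: the name `strip_arafed` is made up for illustration, and it assumes "arafed" only ever shows up as the first word of the caption, so it will not touch other odd words like "floess".

```python
import re

# Hypothetical helper, not part of BLIP or any library.
# Matches "arafed" (any casing) only when it is the first word of the caption.
_ARAFED_PREFIX = re.compile(r"^\s*arafed\b\s*", re.IGNORECASE)


def strip_arafed(caption: str) -> str:
    """Return the caption with a leading "arafed" removed, if present."""
    return _ARAFED_PREFIX.sub("", caption, count=1)


caption = "arafed ice floess are floating in the water near the shore"
print(strip_arafed(caption))
# ice floess are floating in the water near the shore
```

Captions that do not start with "arafed" come back unchanged, so it should be safe to run this over every caption the model produces.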
The linked dataset uses BLIP-generated captions, so I doubt it was used to train BLIP.