I think you forgot to add the actual audio processor.
Right now it's incomplete, bro! Basically it is a LLaVA model.
For it to be truly anymodal you need to add the input processor for audio as well.
You're on the right track, beginning with the input methods first.
Get over this first hurdle: add the audio processor. There are two methods here: speech input (basic) and audio (Stable Audio), which identifies sounds.
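To make the "audio processor" idea concrete, here is a rough sketch of what I mean: frame the waveform, turn each frame into a feature vector, and project it to the LLM embedding size so it can sit next to the text and image tokens. Everything here is an assumption for illustration: the names `frame_signal`, `frame_features`, `AudioProjector` and `audio_to_tokens` are made up, the two features (RMS energy and zero-crossing rate) are toy stand-ins for a real pretrained audio encoder, and the projection weights are random rather than trained.

```python
import math
import random


def frame_signal(samples, frame_len=400, hop=160):
    """Split a mono waveform into overlapping frames.

    The defaults correspond to 25 ms windows with a 10 ms hop at 16 kHz
    (an assumed sample rate, not something fixed by the model).
    """
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def frame_features(frame):
    """Toy per-frame features: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(frame) - 1)]


class AudioProjector:
    """Hypothetical linear layer mapping frame features to the LLM embedding size."""

    def __init__(self, feat_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        # Random (untrained) weights; in a real model these are what you train.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(feat_dim)]
            for _ in range(embed_dim)
        ]

    def __call__(self, features):
        return [
            [sum(w * f for w, f in zip(row, feat)) for row in self.weight]
            for feat in features
        ]


def audio_to_tokens(samples, projector):
    """Waveform in, one embedding-sized vector per frame out."""
    feats = [frame_features(frame) for frame in frame_signal(samples)]
    return projector(feats)


# One second of a 440 Hz tone at 16 kHz, projected to 8-dim "audio tokens".
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
tokens = audio_to_tokens(tone, AudioProjector(feat_dim=2, embed_dim=8))
print(len(tokens), len(tokens[0]))
```

The speech path and the sound path can share this shape; only the encoder in front of the projector would differ.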
Then we can design the correct outputs.
First we will need to train the model on these inputs. Just being able to return text is fine for now, since we are dealing with an LLM first and need to get these input processors embedded into the model space.
So:
given an image and text input
or
given a speech and image input
or
given a text and sound input
All output to text only!! (Very important stage.) We need to get the model super fit on this task before moving to generation.
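A small sketch of how the stage-one training samples could be laid out, so that only the three input pairings above are accepted and every target is plain text. The `Sample` class, the `validate` helper and the file names are hypothetical, just one way to enforce the rule.

```python
from dataclasses import dataclass
from typing import Optional

# The three input pairings listed above; anything else is rejected.
ALLOWED_COMBOS = {
    frozenset({"image", "text"}),
    frozenset({"speech", "image"}),
    frozenset({"text", "sound"}),
}


@dataclass
class Sample:
    """One stage-one training example: two input modalities, text-only target."""

    target_text: str
    text: Optional[str] = None
    image: Optional[str] = None   # path to an image file (hypothetical)
    speech: Optional[str] = None  # path to a speech clip (hypothetical)
    sound: Optional[str] = None   # path to a non-speech audio clip (hypothetical)

    def modalities(self):
        names = ("text", "image", "speech", "sound")
        return frozenset(n for n in names if getattr(self, n) is not None)


def validate(sample):
    """Raise ValueError unless the inputs are an allowed pair with a text target."""
    if sample.modalities() not in ALLOWED_COMBOS:
        raise ValueError(f"unsupported input combination: {sorted(sample.modalities())}")
    if not sample.target_text.strip():
        raise ValueError("stage one needs a non-empty text target")
    return sample


ok = validate(Sample(target_text="A dog barking.", text="What is this sound?", sound="bark.wav"))
print(sorted(ok.modalities()))
```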
For the opposite modality we use the exact same training set, but this time we provide the media output as well as the text, embedding both tasks.
You can even create clones and merge these models, then begin the process again on the merged model. This provides a mass scattering of tensors and their activations, so that you can do fine-tuning instead of full model training! You fine-tune each task, hence the requirement to overfit the task (it's only going to overfit on those select parameters, i.e. 13,7878908 params). On the next pass a new set of parameters will be randomly chosen, hence seeds!
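Roughly what I mean by "seeds": each fine-tuning pass unfreezes a different random subset of the parameters, and the seed decides which subset. This is only a sketch of the selection step. The `pick_trainable` and `training_passes` helpers, the parameter names and the 25% fraction are all assumptions, and the actual freezing and training loop is left out.

```python
import random


def pick_trainable(param_names, fraction, seed):
    """Choose a reproducible random subset of parameter names to fine-tune."""
    names = sorted(param_names)  # stable order, so the seed alone decides the subset
    k = max(1, round(len(names) * fraction))
    return set(random.Random(seed).sample(names, k))


def training_passes(param_names, fraction, seeds):
    """Yield (seed, trainable subset) for each pass; the rest stays frozen."""
    for seed in seeds:
        yield seed, pick_trainable(param_names, fraction, seed)


# Hypothetical parameter names for a 32-layer model.
names = [f"layers.{i}.weight" for i in range(32)]
for seed, subset in training_passes(names, fraction=0.25, seeds=[0, 1, 2]):
    print(seed, len(subset))
```

Because the subset depends only on the seed, a pass can be repeated exactly, and a new seed on the next pass moves the training to a fresh set of parameters.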
You have a small journey ahead. You will find this model very heavy, so keep your model as small as possible.