Decoding GPT-4o: An In-Depth Exploration of Its Mechanisms and How to Create Similar AI
OpenAI has launched GPT-4o, a groundbreaking AI that fuses several models into one. In this blog post, we will discuss how GPT-4o works and how to create this kind of model.
0. GPT-4o Capabilities
- Video chat (introduced for the first time).
- Faster, more human-like voice chat (it even shows emotion and changes tone).
- Text generation, image generation, image QnA, document QnA, video QnA, sequential image generation, and image-to-3D. Best of all, everything is packed into one model.
- Supports 50+ languages.
1. How GPT-4o Works
GPT-4o's working is mainly divided into three parts.
1. SuperChat
GPT-4 had already achieved sequential image generation and image QnA, so OpenAI only had to add document QnA, video QnA, and 3D generation. For a tech giant like OpenAI, that is a piece of cake. It can be done with the methods we discuss at the end.
2. Voice Chat
OpenAI has integrated TTS (Text-to-Speech) and STT (Speech-to-Text) into a single module, removing the separate text generation step they previously used. This means that when you speak, the AI analyzes your tone and words and produces an audio response in real time, similar to how streaming works in text generation. In my opinion, OpenAI made this module comparatively less powerful because it is primarily designed for conversational human interaction, and the AI is trained accordingly.
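We can't see OpenAI's internals, but the streaming idea itself is easy to sketch. Below is a minimal, assumed example using Hugging Face's TextIteratorStreamer: each finished sentence is handed to a TTS stage immediately instead of waiting for the full reply. The speak() function is a hypothetical stand-in for that audio step.

```python
# Streaming text-to-speech idea: speak each sentence as soon as it is
# complete, instead of waiting for the whole response to be generated.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # any chat model works here
chat_tok = AutoTokenizer.from_pretrained(model_id)
chat_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def speak(sentence: str) -> None:
    """Hypothetical TTS stand-in; a real system would synthesize audio here."""
    print("TTS>", sentence)

streamer = TextIteratorStreamer(chat_tok, skip_prompt=True, skip_special_tokens=True)
inputs = chat_tok("How is the weather today?", return_tensors="pt").to(chat_model.device)

# Generate in a background thread so we can consume tokens as they arrive.
Thread(target=chat_model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()

buffer = ""
for piece in streamer:
    buffer += piece
    if buffer.rstrip().endswith((".", "!", "?")):  # a sentence just ended
        speak(buffer.strip())
        buffer = ""
if buffer.strip():
    speak(buffer.strip())
```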
3. Video Chat
Video chat is not actually live video interaction. The AI captures an image at the start of the conversation and takes additional images as needed or when instructed. It then employs zero-shot image classification to respond to user queries. This module uses a more powerful model than voice chat, because with visual information the AI can address a far wider range of requests: it can identify people and places, solve complex mathematical problems, detect coding errors, and much more. A minimal sketch of this snapshot approach appears after the image below.
Image: what people think of how GPT-4o works vs. the reality.
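Here is the snapshot sketch, assuming a local webcam (via OpenCV) and the SigLIP checkpoint named below; the candidate labels are just placeholders.

```python
# Snapshot approach: grab a single frame on demand, then answer with
# zero-shot image classification instead of processing live video.
import cv2
from PIL import Image
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="google/siglip-base-patch16-224")

def capture_frame() -> Image.Image:
    """Take one picture from the default webcam, like GPT-4o's snapshots."""
    cam = cv2.VideoCapture(0)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("could not read from webcam")
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Capture only when the user's request actually needs vision.
image = capture_frame()
labels = ["a person", "a whiteboard with math", "source code on a screen"]
print(classifier(image, candidate_labels=labels))
```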
2. Creating an AI Like GPT-4o
We will also build three models, as OpenAI did. But before that, there are two methods for creating any such model, and it's important to understand them first.
1. MultiModalification or Mixture of Models Method
This method combines two or more models according to their functionality to create a new, powerful, multifunctional model. It also requires further training.
2. Duct Tape Method
In this method, you simply use different models or APIs for different tasks, without ANY training.
Making of SuperChat Model
MultiModalification or Mixture of Models Method: To create the SuperChat model, we need to combine text generation, image generation, image classification, document classification, and video classification models, using the same process as Idefics 2. Idefics 2 combines a zero-shot image classification model with a text generation model, so it can chat with you and answer questions based on images.
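Since Idefics 2 is on the Hugging Face Hub, a minimal image QnA call looks roughly like this (the image URL is a placeholder):

```python
# Minimal image QnA with Idefics 2, the kind of fused model described above.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16, device_map="auto")

image = load_image("https://example.com/cat.jpg")  # placeholder URL
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```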
Duct Tape Method: There are two variants.
Method without API: One base model is PROMPTED to identify what type of task the query is; the user's prompt is then sent to the model for that specific task, and its output is returned to the user. Optional: use a text generation model at the end to polish the wording and make the answer feel more natural.
Method with API: One base model is prompted to call an API for specific types of queries. This is the method Copilot uses: when asked to create images, compose songs, search the web, or answer questions about images, it calls the API for that task. A minimal sketch of the no-API variant is shown below.
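In this sketch the handler functions are simple stubs standing in for the task-specific models, the routing prompt is just one possible formulation, and any instruct-tuned model can play the base/router role:

```python
# Duct Tape routing: the base model only decides WHICH task a query is;
# each task is then handled by a separate, off-the-shelf model, no training.
from transformers import pipeline

# Any instruct-tuned model can serve as the base/router model.
router = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

ROUTER_PROMPT = (
    "Classify the user's request as exactly one of: chat, image_generation, "
    "image_qna, doc_qna, video_qna, 3d_generation.\n"
    "Request: {query}\nTask:"
)

def route(query: str) -> str:
    """Ask the base model which task type the query belongs to."""
    out = router(ROUTER_PROMPT.format(query=query),
                 max_new_tokens=5, return_full_text=False)
    return out[0]["generated_text"].strip().split()[0]

# Stub handlers -- in a real system each calls its own model or API.
def answer_with_llm(q):
    return router(q, max_new_tokens=100, return_full_text=False)[0]["generated_text"]

def generate_image(q):
    return "[an image-generation pipeline, e.g. PixArt Sigma, runs here]"

def ask_idefics2(q):
    return "[the Idefics 2 image-QnA snippet above runs here]"

HANDLERS = {
    "chat": answer_with_llm,
    "image_generation": generate_image,
    "image_qna": ask_idefics2,
}

def superchat(query: str) -> str:
    task = route(query)
    return HANDLERS.get(task, answer_with_llm)(query)

print(superchat("Draw a cat astronaut floating in space."))
```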
Recommended models from which you can build a SuperChat model as powerful as GPT-4o:
- Base Model: Llama 3 70B
- Image Generation: PixArt Sigma or RealVisXL
- Zero-Shot Image Classification: SigLIP
- Zero-Shot Video Classification: X-CLIP
- Sequential Image Generation: ControlNet SDXL
- Zero-Shot Document Classification: idf
- 3D Generation: InstantMesh
- Other Models: AnimateDiff Lightning
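To give a taste of one of these components, here is a minimal text-to-image sketch with diffusers; the PixArt Sigma checkpoint name below is the one published by the PixArt team, but treat it as an assumption:

```python
# Minimal text-to-image with PixArt Sigma via the diffusers library.
import torch
from diffusers import PixArtSigmaPipeline

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",  # assumed checkpoint name
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a robot reading a book").images[0]
image.save("robot.png")
```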
Making of VoiceChat Model
MultiModalification or Mixture of Models Method: To develop a human-like speaking AI that also expresses emotion, high-quality training data is essential. You additionally need an emotion identification model to recognize the user's emotion, and a text generation model that understands it.
Duct Tape Method: An STT model transcribes the user's speech, an emotion identifier tags it, a text generation model produces a reply with the emotion encoded, and a TTS model such as Parler-TTS Expresso infuses that emotion into the spoken output. A sketch wiring these pieces together follows the model list below.
Suggested Models
- Speech to Text: Whisper
- Chat Model: Llama 3 8B
- Text to Speech: Parler-TTS Expresso
- Emotion Identifier: Speech Emotion Recognition
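Here is a minimal sketch wiring the models above into one voice-chat turn. The emotion-recognition checkpoint is an assumption (any audio emotion classifier works), and any instruct chat model can stand in for Llama 3:

```python
# Duct-tape voice chat: STT -> emotion tag -> chat model -> expressive TTS.
import soundfile as sf
from transformers import AutoTokenizer, pipeline
from parler_tts import ParlerTTSForConditionalGeneration

stt = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
emotion = pipeline(  # assumed community checkpoint for speech emotion recognition
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
chat = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

tts = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-expresso")
tts_tok = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-expresso")

def voice_chat(audio_path: str) -> str:
    # 1. Transcribe the user's speech and estimate their emotion.
    text = stt(audio_path)["text"]
    mood = emotion(audio_path)[0]["label"]

    # 2. Let the chat model see both the words and the detected emotion.
    prompt = (f"The user sounds {mood}. Reply briefly and empathetically.\n"
              f"User: {text}\nAssistant:")
    reply = chat(prompt, max_new_tokens=100,
                 return_full_text=False)[0]["generated_text"].strip()

    # 3. Speak the reply in a matching emotional style (Expresso voices).
    description = f"Thomas speaks in a {mood} tone with clear audio quality."
    desc_ids = tts_tok(description, return_tensors="pt").input_ids
    reply_ids = tts_tok(reply, return_tensors="pt").input_ids
    audio = tts.generate(input_ids=desc_ids, prompt_input_ids=reply_ids)
    sf.write("reply.wav", audio.cpu().numpy().squeeze(),
             tts.config.sampling_rate)
    return reply
```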
Making of VideoChat Model
As previously mentioned, GPT-4o only captures images, so a zero-shot image classification model is needed, while the rest remains the same as the voice chat model. However, it also requires a more intelligent chat model, because vision greatly widens the range of use cases.
Suggested Models
- Zero-Shot Image Classification: SigLIP
- Speech to Text: Whisper
- Chat Model: Llama 3 8B
- Text to Speech: Parler-TTS Expresso
- Optional: Speech Emotion Recognition
Alternatively
- Image QnA Model: Idefics 2
- VoiceChat Model (as built above)
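Putting the pieces together, one video-chat turn can be sketched as below. This reuses capture_frame() from the snapshot sketch, the Whisper stt pipeline and the speak() TTS stand-in from the voice chat sketches, and the Idefics 2 processor and model from the SuperChat example:

```python
# One video-chat turn: hear the user, snap a frame, answer from the image, speak.
def video_chat_turn(audio_path: str) -> str:
    text = stt(audio_path)["text"]   # Whisper pipeline from the voice chat sketch
    image = capture_frame()          # webcam helper from the snapshot sketch

    # Ask Idefics 2 about the captured frame (objects from the SuperChat example).
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": text}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    reply = processor.batch_decode(out, skip_special_tokens=True)[0]

    speak(reply)                     # TTS stand-in from the streaming sketch
    return reply
```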
Making of Similar AI
Covered in the next blog: https://huggingface.co./blog/KingNish/opengpt-4o-working