Lost in Translation: The Current Reality of ChatGPT-4o's Multimodal Capabilities
- Noby Fujioka
- 26th May, 2024 | Updated 26th May, 2024
Introduction
In OpenAI's spring demo, ChatGPT-4o was showcased as a groundbreaking multimodal AI, capable of understanding and responding to inputs with expressive nuance. The presentation suggested the future had arrived: AI that could seamlessly integrate and interpret multiple forms of data, including text, images, and speech, to provide more human-like interactions.
A Closer Look at the Practical Use
However, in practical use, particularly through the Android mobile app I tested, these capabilities are not fully realised. For instance, if you ask the app to read a story with exaggerated emotion, or to read it slowly as in the demo, it struggles to capture and convey the intended expression. The reason is that the app converts spoken words to text, losing the nuances of tone, rhythm, and emotion in the process. As a result, the output lacks the expressive qualities that were promised.
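To make the loss concrete, here is a minimal sketch of a conventional text-mediated voice pipeline built with the OpenAI Python SDK: speech is transcribed to text, the text goes to the language model, and the reply is synthesised back into speech. The internals of the ChatGPT app are not public, so this is only an assumption about the kind of pipeline that would produce the behaviour described above, not its actual implementation, and the file names are placeholders. The point is simply that nothing but plain text crosses each boundary, so tone, pacing, and emphasis never reach the model or the voice.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech -> text: only the words survive this step; tone, rhythm,
#    and emphasis in the recording are discarded.
with open("request.m4a", "rb") as audio_file:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. The language model sees plain text, with no acoustic information.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text -> speech: the voice is synthesised from text alone, so it
#    cannot reproduce expressiveness the model never received.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)

with open("reply.mp3", "wb") as out:
    out.write(speech.read())
```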
Challenges with Languages and Names
Another example is the handling of names in different languages. When I asked the app to speak with me in Japanese and gave it my name, Fujioka Nobuyuki, it converted it into the kanji 藤岡信之 based on an assumption, but my name is actually written 藤岡伸行. Because Japanese names and words are often based on meanings, not just phonetics, the same pronunciation can be written with different kanji, and the same kanji can have several pronunciations. In this case, the app not only converted my name to the wrong kanji, but when I later asked it to pronounce my name, it also read it incorrectly ("Fujioka Shinko"). This is a significant limitation: the app simply stores names as text rather than capturing the sound and its context, which invites misinterpretation.
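The ambiguity runs in both directions, and neither direction is unique. The toy mapping below illustrates this; the readings listed are common examples rather than an authoritative dictionary.

```python
# Toy illustration of why storing a Japanese name as text alone is lossy.
# Readings here are illustrative examples, not a complete dictionary.

# One pronunciation can be written with several different kanji...
kanji_for_reading = {
    "nobuyuki": ["信之", "伸行", "信行", "延行"],
}

# ...and one written form can be read in several different ways.
readings_for_kanji = {
    "信之": ["nobuyuki", "shinji"],
    "伸行": ["nobuyuki", "nobuyoshi"],
}

# A speech-to-text step has to pick one spelling without asking,
# and a later text-to-speech step has to pick one reading back.
guessed_spelling = kanji_for_reading["nobuyuki"][0]          # may be the wrong kanji
guessed_reading = readings_for_kanji[guessed_spelling][-1]   # may be the wrong reading
print(guessed_spelling, guessed_reading)
```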
Understanding Emotions from Text
ChatGPT-4o attempts to infer emotions from conversational context, relying solely on textual data. Without access to vocal tone, facial expressions, or other non-verbal cues, its ability to accurately interpret and respond to emotional nuance is limited. This is a significant departure from how the human brain works: we remember concepts together with their context, including emotional tone and meaning.
The Role of Conceptual Memory in Humans
In the human brain, memories are stored based on concepts rather than just raw data. This allows for a richer recall of information, including context, emotion, and nuances. When we remember a conversation, we recall not just the words spoken but also the tone, the emotions involved, and the broader multimodal context. ChatGPT, on the other hand, stores interactions as text, lacking the depth of conceptual memory that humans or even animals possess. This difference highlights the gap between human-like understanding and the current capabilities of AI.
"Friendship needs no translation": Future Improvements and OpenAI's Efforts
OpenAI is very likely working on enhancing these capabilities, and future updates may address these limitations, allowing better handling of multimodal inputs and more accurate interpretation and expression of emotion and nuance, as the demo suggested. If OpenAI or others achieve true multimodal capability, where the AI understands and retains information in a more human-like manner, capturing not just the text but the concepts, the full context, tone, and emotional content of interactions, it would mark a significant milestone. Achieving this would be a step closer to Artificial General Intelligence (AGI), or at least to emulating the human brain's conceptual memory and understanding. I'm not sure what happens then...