<aside> <img src="/icons/info-alternate_gray.svg" alt="/icons/info-alternate_gray.svg" width="40px" />
This summary was generated by Claude.
</aside>
The past two years have seen revolutionary changes in how computers can interact with humans through multiple modalities like voice, text, and images. This transformation has been driven by combining traditional recognition technologies with Large Language Models (LLMs), creating systems that can understand and generate content across different formats naturally.
- **Native Multimodality:** The latest LLMs can directly process multiple input types (such as images and voice) without first converting them, with GPT-4o the first to achieve "omni" modality capabilities
- **Advanced Voice Capabilities:** Models can now engage in natural voice conversations with features like real-time interruption, accent switching, and multilingual communication
- **Speech Recognition Evolution:** Modern systems powered by LLMs handle natural conversations, multiple languages, and context-aware transcription far better than traditional speech-to-text systems
- **Voice Generation:** The latest synthetic voices are nearly indistinguishable from human speech, with the ability to convey emotion and speak naturally across multiple languages
- **Voice Cloning:** New technology can create highly convincing copies of a person's voice from just minutes of sample audio, raising both opportunities and ethical concerns
- **Impact on Academia:** These advances are making content more accessible but also challenging traditional notions of authorship and academic practice
- **Future Implications:** While raising concerns about fraud and verification, the technology also enables positive innovations like enhanced language learning, accessible healthcare communication, and assistive technologies. As with previous technological shifts, success will depend on developing frameworks that maximize benefits while protecting against misuse
<aside> <img src="/icons/chart-donut_gray.svg" alt="/icons/chart-donut_gray.svg" width="40px" />
Note: All graphics in this text were generated by Claude using the Artifacts feature.
</aside>
The possibility of interacting with computers via voice, and of computers reacting to visual inputs, has been an ever-elusive goal of computer science since its very beginning. In the span of two years, the landscape of what is possible has completely changed.
Natural interaction with computers using speech is now technically possible, and while image interpretation is not perfectly solved, its quality is extremely high.
The biggest advance has been combining voice and image recognition and generation with the semantic potential of Large Language Models. This was done in two different ways: by connecting separate recognition and generation models to an LLM, converting everything to text along the way, or by building models that handle other modalities natively.
Both approaches have produced great results and have their place in the development of multimodality, but it is the second approach - native multimodality - which is the subject of the most excitement in the field.
Native multimodality refers to Large Language Models that can directly take other modalities such as image, voice or video as input without first having to convert them. End-to-end multimodality refers to models that can both take multimodal input and also output multiple modalities directly.
At present, there are no fully end-to-end multimodal models available that can input and output all modalities, but all frontier models can take static images as direct input and two can take audio. Only one model, GPT-4o, can output audio directly. OpenAI have also demonstrated GPT-4o generating images natively as part of the initial announcement, but this functionality has not been released.
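As an illustration of what "direct input" means in practice, a natively multimodal model accepts an image alongside text within a single request, rather than receiving a caption produced by a separate model. The sketch below builds a message in the content-parts format used by chat-style multimodal APIs (the prompt and image URL are placeholders; no request is actually sent):

```python
# Sketch: constructing a multimodal chat message. A natively multimodal
# model ingests the image itself; no intermediate captioning model is
# needed to translate the picture into text first.

def build_multimodal_message(prompt: str, image_url: str) -> dict:
    """Return a single user message combining text and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image is passed directly; the model sees the pixels,
            # not a text description produced by another system.
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_multimodal_message(
    "What is shown in this picture?",
    "https://example.com/photo.jpg",  # placeholder URL
)
print(message["content"][1]["type"])  # → image_url
```

A text-only model, by contrast, could only receive the second part after some external vision system had already reduced the image to words, losing detail in the process.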
Here's a graphic illustrating the natively multimodal capabilities of the current frontier models.
Note: Products built on language models that are not multimodal may use other models to provide multimodal capabilities. For instance, both Gemini and ChatGPT (the products built on models with similar names) can call external image generation models and Claude mobile apps use separate speech models to understand spoken prompts.
One of the largest improvements in the capability of Large Language Models last year was the introduction of multimodality. However, a year ago that meant only vision. Vision models can take images as input and produce a description both of what is in the image and of any text it contains. This capability was first announced with GPT-4 in the spring of 2023, and all three major frontier model families from OpenAI (GPT series), Anthropic (Claude series) and Google (Gemini series) are now fully vision-capable. There are now also capable open-source vision-enabled models.