Dr. Yellow: Exploring the Multimodal Future

ドクターイエロー：マルチモーダルな将来を探る

Jan 07, 2025

I recently spent a weekend visiting my family and a friend in Nagoya. The city was full of energy, and I could feel the vibrant economic recovery, with Christmas and year-end shoppers everywhere. As I waited on the crowded platform for my Shinkansen, Japan’s high-speed bullet train, checking messages on my phone, I heard the station staff announcing loudly, “Please do not use the flash on your camera! Please do not hold selfie sticks over the guard fence! Please do not lift your children over the guard fence!” I could feel the tension in their voices, and something in my brain told me to look up and see what was about to happen. So I looked down the track, and there it was: THE famous Shinkansen, Dr. Yellow.

Dr. Yellow is a bullet train that inspects and tests the rail systems, and it will be retired in a couple of years. Its history goes back to 1964, when Japan launched its bullet train system. With Japanese precision and a vision of perfection, Dr. Yellow has been servicing Japan’s amazing bullet train infrastructure for over half a century. The train embodies the Japanese commitment to excellence, running tirelessly behind the scenes to ensure the safety and precision that Japanese railways are famous for. No official schedule is published, so legend has it that seeing Dr. Yellow in service is a sign of good luck, and there I was. I felt very lucky to stumble upon this scene, and I found myself feeling nostalgic, knowing I was witnessing one of these legendary trains during its final years of service.

12月の週末、家族と友人を訪ね名古屋に行きました。名古屋の街は活気に満ち、クリスマスと年末を迎え、まさに経済回復が至るところで感じられました。新幹線を待つ混雑したプラットホームで携帯を見ながらメッセージを確認していると、ホームの駅員さんが大きな声で「カメラのフラッシュは使わないでください！セルフィースティックをホームドアの外に出さないでください！お子様をフェンスの外に出さないでください！」と叫んでいます。かなり切迫したその声を聞いて、何かが起きているのではと線路の方に目を向けると、何と、まさにドクターイエローがホームに入るところではありませんか！

ドクターイエローは、新幹線の路線や架線の状況を点検・検査するための専用車両で、あと2年ほどで引退することが決まっています。ドクターイエローは日本の新幹線が開業した1964年に始まりました。日本の精密さと完璧さを求めたドクターイエローは、新幹線のインフラを半世紀以上にわたり守ってきました。ドクターイエローはまさに、日本の鉄道が誇る安全性と正確さを実現するために、裏舞台でたゆまず努力する日本の美学の象徴と言えるでしょう。このドクターイエローには公開された時刻表はなく、遭遇することは幸運なことと言われています。そして、まさにそこに私がいたのです。この驚きの場面に出くわし私は幸運に感じながらも、あと数年でリタイヤするというこの伝説の列車を目にし少々ノスタルジックな気分になりました。

Our brains are interesting, though: I wasn’t really paying attention to the station staff’s announcements, yet I sensed that the atmosphere was unusual. From the voices, the people, and the tension, I somehow knew something significant was happening. We are all equipped with various senses, and they are all connected through our brain.

This interplay of senses and intuition—how our brain processes multiple signals simultaneously to create meaning—made me think about how AI’s multimodal systems attempt to replicate such human capabilities.

私たちの脳はとても不思議なもので、私自身は駅員さんのアナウンスに特に注意を払っていたわけではないのに、普通ではない空気を察しました。声やその場の人々、緊迫感といったものを直感的に感じ取り、私は何か重要なことが起こりつつあることを察したのです。私たちはたくさんの感覚を持ち合わせ、脳の中を通してつなぎ合わせています。

このような感覚と直感の相互作用—私たちの脳が複数の合図を同時に処理し意味を見出す作用—は、AIのマルチモーダルシステムが、こうした人間の能力を模倣しようとする試みについて考える機会を与えてくれました。

AI Multimodal
AI マルチモーダル

Humans are highly multimodal, not only with our five basic senses (sight, smell, taste, touch, and hearing) but also with our brain’s ability to connect the present with our past and beyond. We sense our surroundings through various channels, whether it is the tone of a voice, the interplay of smells and tastes, a change in temperature, or the warmth of the touch of someone next to us. Just as I sensed the excitement at Nagoya Station, we are equipped to process multiple inputs as signals.

Just as humans process multiple inputs simultaneously, AI has evolved to handle various types of data. I explored Large Language Models (LLMs)—AI systems trained on vast amounts of text to understand and generate human-like language—in my previous post, “Reading Daniel Kahneman (No.3): Navigating the Unknowns.” In that piece, I examined how both humans and AI navigate uncertainty—humans through judgment, and AI through probability patterns in vast text data. These LLMs represent a significant advancement in AI’s ability to process and generate human-like text, much like how we process and produce language naturally.

人間は、基本となる五感（視覚、聴覚、嗅覚、味覚、触覚）を持つだけでなく、現在を過去や未来と結びつけて考える能力を備えた、非常にマルチモーダル（複数の形式や手段を組み合わせること）な存在です。私たちは、声の調子、匂いと味の組み合わせ、温度の変化、あるいは隣りの人の触れ合いから感じる温もりなど、さまざまな経路を通じ周囲のことを感じ取ります。私が名古屋駅で熱狂的な雰囲気を感じ取ったときのように、私たちは複数の信号を同時に処理する能力を持っています。

人間が複数の信号を同時に処理するように、AIもさまざまなデータ形式を処理できるよう進化してきました。前回の記事「ダニエル・カーネマンを読む (No.3) 不確実性下における判断」では、膨大なテキストデータから学習し、人間の言語のような出力を可能にするAIシステム、大規模言語モデル（Large Language Model: LLM）を紹介しました。この中で、人間とAIがどのように不確実な状況下で判断するかを考察しました。人間は直感や経験に基づき判断を下し、AIは膨大なテキストデータから学習したパターンをもとに判断します。これらのLLMは、私たちが自然な形で言語を処理し生成するように、AIが人間のように文章を処理・生成する能力における重要な進化を示しています。

Beyond text processing, AI has developed remarkable capabilities in visual understanding. Diffusion models, trained on large image datasets, can now generate sophisticated images by understanding and recreating visual patterns. These developments, combined with innovative architectural approaches, have enabled AI systems to process multiple types of data—text, images, and videos—through unified frameworks. These multimodal models represent the latest frontier in AI, offering the ability to understand and generate content across different forms of media.

Yet despite these advances, AI multimodal systems, while excelling at synthesizing data from various sources, may lack the human ability to intuitively interpret cultural nuances, emotional undertones, and the ‘unspoken’ signals that often shape our decisions and creativity. The vast efforts taking place in multimodal AI development demonstrate how complex it is to replicate human-like thinking. Although there are impressive developments in AI models that can answer questions, generate images from text, or predict weather patterns, the uniquely human aspects of multimodal processing remain distinct and valuable.

テキスト処理以外にも、AIは視覚的な理解においても大きな進歩を遂げています。現在、拡散モデルは大量の画像データを学習し、視覚的なパターンを理解・再現することで、洗練された画像を生成することができるようになりました。このようなAIの進展と革新的なアーキテクチャ設計が組み合わさることで、AIは統合されたフレームワークを通じて、テキスト、画像、動画といった多様なデータ形式を処理できるようになりました。これらのマルチモーダルモデルは、AIの最先端を率い、異なる形式のメディアをまたいでコンテンツを理解し生成する能力を提供しています。

しかし、このような進化にもかかわらず、AIマルチモーダルシステムは多岐にわたる情報源からのデータを取りまとめることに卓越していますが、私たちが決断したり創造したりする際に使う文化的なニュアンス、感情的な響き、あるいは言葉では表現できないサインを直感的に汲み取るという人間特有の能力は備えていません。AIのマルチモーダル開発における膨大な努力は、人間のように考えることを再現することの複雑さを示しています。質問に答えたり、テキストから画像を生成したり、天気を予測したりするAIモデルの開発は目覚ましいものがありますが、人間特有のマルチモーダルな処理能力は依然として特別でかけがえのない価値を持っています。

The Power of Human Multimodality
人間のマルチモーダル能力

While AI systems continue to advance and evolve in processing and analyzing data through multimodal models, human beings possess a unique advantage: our ability to integrate sensory experiences, emotions, and consciousness into meaningful insights. Our strength lies not just in processing information, but in our capacity to weave together diverse experiences and personal insights to create something entirely new. Our cultural heritage, our personal experience, and our ability to hold those experiences in our consciousness lead to new creations, even when things are complex and ambiguous. These qualities are unique to each of us, and together they form our own multimodal model.

To cultivate our multimodal capabilities, we must actively engage in diverse forms of learning and experience. This means immersing ourselves in rich and varied contexts. Whether we are engaging with arts and literature, developing our emotional intelligence through interactions with others, or sharpening our ability to perceive subtle environmental signals, each experience contributes to our multimodal development.

AIシステムはマルチモーダルモデルを通じてデータを処理・解析し、さらに進化を続けていますが、人間は感覚、感情、そして意識を組み合わせ意味のある洞察を創り出すという独自の力を持っています。私たち人間の素晴らしさは、単に情報を処理するだけでなく、個々それぞれの豊かな経験と洞察を織り交ぜて、全く新しいものを創造する力にあります。複雑で曖昧な状況下でも、私たちの受け継いだ文化や、個々の経験、そしてそれらを意識の中で保ち続ける能力によって、新たな創造を生み出すことができます。これらの特質は人間ならではのものであり、私たち一人ひとりが独自のマルチモーダルモデルを形成しているのです。

自身のマルチモーダル能力をさらに向上させるためには、積極的に様々な形で学び、経験を重ねていく必要があります。これは、豊かで多彩な環境に身を置くということを意味します。美術や文学への触れ合い、交流を通じて感情知性を育むこと、周囲のちょっとした雰囲気を感じ取る能力を磨くことなど、どの経験も私たちのマルチモーダル能力を育むために貢献します。

As we integrate AI into our lives and work, we must consciously develop these uniquely human capabilities. Rather than competing with AI’s processing power, we should focus on strengthening our ability to think deeply, feel genuinely, and create meaningfully. This means deliberately slowing down to process our experiences through multiple channels—emotional, intellectual, and sensory—to gain richer insights and generate novel ideas.

AIが私たちの生活や仕事に組み込まれていく中で、人間特有の能力を意識的に伸ばしていく必要があります。AIの処理能力に対抗するのではなく、深く考え、誠実に感じ取り、意味のある創造を行う能力を意識して強化することに集中すべきです。つまり、感情、知性、そして感覚といった多様な感性を通じて経験を処理するために、意図的にペースを落として考えることで、より深い洞察や独創的なアイデアを生み出すことが可能になるのです。

Dr. Yellow has helped maintain the excellence of Japan’s Shinkansen through constant, precise monitoring, and its striking yellow color makes it both functional and memorable. This iconic train serves as a powerful reminder that in an age of rapid technological advancement, our ability to think holistically, create innovatively, and connect deeply with our experiences becomes increasingly valuable, just as the train combines utility with visual creativity.

The future belongs not to those who simply process information faster, but to those who can integrate diverse inputs into meaningful insights and creative solutions. By embracing and developing our multimodal nature—through learning, experiencing, thinking, and creating—we ensure that technology enhances rather than diminishes our uniquely human capabilities. Each of us can create our own special Dr. Yellow.

ドクターイエローは鮮やかな黄色い姿で、機能性と印象深さを兼ね備えながら、常に徹底した監視を行い、日本が誇る新幹線を支え続けてきました。この象徴的な列車が実用性と視覚的な創造性を兼ね備えているように、急速な技術革新の時代に、私たちの総合的に考える力、革新的に創造する力、そして経験と深く結びつける力がますます重要になっていくことを再認識させてくれます。

私たちの未来は、単に情報を速く処理する人たちではなく、様々な情報をまとめ有意義な知見と独創的な解決策を生み出すことができる人たちに委ねられています。学習や経験、思考、創造を通じ私たちのマルチモーダルな能力を受け入れ、育むことで、技術が人間特有の本質を損なうことなく、さらなる進化を遂げることができるのです。こうした取り組みによって、私たち一人ひとりが、自分だけの特別なドクターイエローを創り出せるのではないでしょうか。

Notes on Transformation
