FrontierNews.ai

Three Ways Multimodal AI Is Breaking Into Business Websites, Wearables, and AR Glasses

Multimodal AI, which processes text, audio, and video together, is moving from research labs into real-world business applications. Rather than building separate tools for each type of communication, companies are now embedding AI agents that switch between chat, voice calls, and video conversations. Three announcements this week show how this technology is expanding beyond consumer chatbots into websites, mobile devices, and augmented reality glasses.

What Is Multimodal AI, and Why Does It Matter Now?

Multimodal AI systems process multiple types of input, such as text, speech, and images, within a single conversation. Unlike traditional chatbots that only read text, or voice assistants that only listen, multimodal systems can see what a user is looking at, hear what they are saying, and respond with text, voice, or video. This matters because it mirrors how humans naturally communicate, making interactions feel less robotic and more useful for complex tasks like troubleshooting, training, or sales conversations.

The technology has been advancing rapidly in research settings, but deploying it at scale has been expensive and technically complex. Recent launches suggest that barrier is lowering, opening the door for smaller businesses and new use cases.

How Are Businesses Using Multimodal AI on Their Websites?

SalesCloser, a Vancouver-based AI software company, launched a self-serve multimodal website agent on June 16, 2026, designed to help any business engage visitors without requiring a dedicated sales team. The agent operates in two modes that hand off to each other within a single conversation: text-based chat and live audio-visual conversation with a digital avatar, voice, and shared-screen capability.

A visitor can start with a simple chat question and move to a face-to-face video conversation, or begin with video and drop back to text, with the conversation context preserved across the transition. The agent can present its own screen to walk a visitor through a product tour or documentation, and a visitor can share their screen or camera to receive hands-on help. The agent operates around the clock, across desktop and mobile, in more than 30 languages.
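One way to picture the handoff described above is a single session whose transcript outlives any change of mode. The Python sketch below is illustrative only: `Session`, `Mode`, and `switch_mode` are hypothetical names, not SalesCloser's API, and it assumes that "context" is simply a shared message history.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Mode(Enum):
    CHAT = "chat"
    VIDEO = "video"


@dataclass
class Session:
    """Hypothetical session: one transcript shared by every mode."""
    mode: Mode = Mode.CHAT
    transcript: List[Tuple[Mode, str, str]] = field(default_factory=list)

    def say(self, speaker: str, text: str) -> None:
        # Record each turn together with the mode it happened in.
        self.transcript.append((self.mode, speaker, text))

    def switch_mode(self, new_mode: Mode) -> None:
        # Only the channel changes; the transcript is left untouched.
        self.mode = new_mode

    def context(self) -> List[str]:
        # The history an agent would see, regardless of the current mode.
        return [f"{speaker}: {text}" for _, speaker, text in self.transcript]


session = Session()
session.say("visitor", "Does the plan include screen sharing?")
session.switch_mode(Mode.VIDEO)
session.say("agent", "Yes. Let me show you on screen.")
print(session.context())
```

Because the mode is stored next to each turn rather than in separate per-channel logs, moving from chat to video (or back) never requires copying or rebuilding the conversation.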

"Most websites lose the large majority of their visitors without those visitors completing a desired action, such as making a purchase or submitting a form, and we believe a key reason is the lack of a scalable way to greet and engage everyone who views a website," said Ali Tajskandar, Chief Executive Officer of SalesCloser.

What sets this product apart is its pricing model. Unlike SalesCloser's existing products, which require a sales call to set up, the website agent is free to start and includes a monthly allotment of usage credit. After that, pricing is usage-based, so businesses pay only for the conversations their agent actually has. This approach is designed to lower the barrier to entry for smaller companies that cannot afford enterprise software.

The conversational AI market is projected to reach approximately $41.39 billion by 2030, growing at a compound annual growth rate of approximately 23.7% from 2025 to 2030, according to Grand View Research. SalesCloser's website agent is positioned to capture a portion of this growth by targeting the much larger pool of businesses that operate a website but have no dedicated sales team.
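The projection can be sanity-checked with the compound-growth formula, end = start × (1 + rate)^years. The Python sketch below backs out the starting market size implied by the quoted figures. It assumes five compounding periods (2025 to 2030); the article does not state a base-year value, so the result is arithmetic on the quoted numbers, not a reported figure.

```python
def implied_start_value(end_value: float, cagr: float, years: int) -> float:
    """Invert end = start * (1 + cagr) ** years to recover start."""
    return end_value / (1.0 + cagr) ** years


def project(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value forward by `years` periods at rate `cagr`."""
    return start_value * (1.0 + cagr) ** years


END_2030_BILLIONS = 41.39  # quoted in the article
CAGR = 0.237               # quoted in the article
YEARS = 5                  # assumption: 2025 -> 2030 counts as five periods

start = implied_start_value(END_2030_BILLIONS, CAGR, YEARS)
print(f"Implied 2025 market size: ${start:.2f}B")
```

Under that five-period assumption the implied starting point is roughly $14.3 billion. If the forecast instead compounds from a 2024 base (six periods), the implied starting value would be lower.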

What Multimodal Features Are Coming to Android and Pixel Devices?

Google released Android 17 and Wear OS 7 on June 16, 2026, bringing new multimodal AI capabilities to its Pixel devices. The latest Pixel Drop update includes support for Gemini Omni, a multimodal AI model that can now edit videos within a conversation, and AudioLM, a tool for the Pixel 10a that improves speech-to-speech translation.

Beyond video editing, Android 17 introduces several features that leverage multimodal AI:

  • Lyria 3 Music Generation: Users can create music tracks with text prompts and images in the Gemini app, combining language and visual input to generate audio output.
  • Personalized Audio Messages: Users can record personalized outgoing audio messages for callers when they cannot answer, adding a voice-based touch to call screening.
  • Screen Reaction Videos: A new feature lets users record the selfie camera and the phone screen at the same time, producing reaction videos that can be shared on social media platforms like TikTok, YouTube, and Instagram.
  • Gemini Intelligence on Wearables: Live updates from phone apps will be mirrored to the Pixel Watch, and Wear OS will introduce tools for making personalized widgets by describing them in natural language.

These features underscore Google's strategy of using its Android and Pixel devices to showcase its latest AI technology. While Apple is focused on catching up in AI with September's public launch of AI upgrades to Siri and iOS 27, Google's Android 17 is focused on Gemini's role in creation, communication, and other device experiences.

How Are Developers Building Multimodal AI for AR Glasses and XR Devices?

On June 16, 2026, NVIDIA released NVIDIA XR AI in public beta. It is an open-source library that helps developers build intelligent agents for augmented reality glasses, extended reality headsets, and similar wearable devices. The platform addresses a critical infrastructure gap: while AR and XR hardware is ready, creating AI experiences that understand what users see and hear in real time has been technically complex and expensive.

NVIDIA XR AI connects live camera and microphone streams from XR devices to GPU-accelerated AI services running in the cloud, in a data center, on a workstation, or at the edge. The architecture is modular, separating components like media transport, model services, enterprise connectivity, and agent orchestration. This design allows developers to swap clients, models, and deployment environments without rebuilding the entire agent.
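The value of that separation is easiest to see in miniature. The Python sketch below is a generic illustration of swappable components behind small interfaces; every class and method name here is hypothetical, and none of it is the NVIDIA XR AI API. The stand-in services do trivial string work so the example runs without any models.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class EchoSTT:
    """Stand-in speech service: treats the audio bytes as UTF-8 text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    """Stand-in language model: returns the prompt in upper case."""
    def reply(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    """A second stand-in model, used to show a swap."""
    def reply(self, prompt: str) -> str:
        return prompt[::-1]


class Agent:
    """Depends only on the two interfaces, not on concrete services."""
    def __init__(self, stt: SpeechToText, llm: LanguageModel) -> None:
        self.stt = stt
        self.llm = llm

    def handle(self, audio: bytes) -> str:
        # Speech in, text out: transcribe first, then ask the model.
        return self.llm.reply(self.stt.transcribe(audio))


agent = Agent(EchoSTT(), UppercaseLLM())
print(agent.handle(b"show torque spec"))

# Swap the model without touching the agent or the speech service.
agent.llm = ReverseLLM()
print(agent.handle(b"show torque spec"))
```

The agent code is identical before and after the swap, which is the property the modular design is meant to provide at a much larger scale.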

Steps to Build an Intelligent XR Agent with NVIDIA XR AI

  • Clone the Repository: Developers start by cloning the public beta repository from GitHub and accessing sample agents, model-server launchers, and web clients that demonstrate how the system works.
  • Start AI Services: Launch shared AI services independently, including speech-to-text using NVIDIA Parakeet, vision-language reasoning with NVIDIA Cosmos, and language responses with NVIDIA Nemotron models for fast, latency-sensitive interactions.
  • Connect Enterprise Data: Integrate enterprise tools and data sources through Model Context Protocol servers, allowing agents to access company information and call business tools in real time.
  • Add Agent Orchestration: Use frameworks such as NVIDIA NeMo Agent Toolkit to orchestrate workflows across models and tools, enabling agents to reason about user intent and take appropriate actions.
  • Deploy Spatial Experiences: Optionally incorporate NVIDIA CloudXR for rendered spatial content when applications need rich 3D interaction within the XR session.

The core AI services powering visual understanding, speech recognition, language reasoning, and voice responses are already in place. Developers can quickly prototype XR agents by running sample agents, integrating enterprise data, and adding agent orchestration. The modular design also supports multi-user and multi-agent scenarios, where multiple clients can connect to the same hub and multiple agents can observe the same streams.
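The multi-client, multi-agent pattern mentioned above resembles a publish/subscribe fan-out. The Python sketch below is a minimal in-memory illustration under that assumption; `Hub`, `subscribe`, and `publish` are hypothetical names and do not describe how NVIDIA XR AI implements its hub.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Frame = Dict[str, Any]  # hypothetical frame: client id, stream kind, payload


class Hub:
    """Hypothetical hub that fans each frame out to every observer."""

    def __init__(self) -> None:
        self._observers = defaultdict(list)  # stream kind -> callbacks

    def subscribe(self, kind: str, callback: Callable[[Frame], None]) -> None:
        # An agent registers interest in one kind of stream.
        self._observers[kind].append(callback)

    def publish(self, client_id: str, kind: str, data: Any) -> None:
        # A client pushes a frame; every subscribed agent sees the same one.
        frame = {"client": client_id, "kind": kind, "data": data}
        for callback in self._observers[kind]:
            callback(frame)


hub = Hub()
inspector_log: List[str] = []
recorder_log: List[str] = []

# Two agents observe the same video stream.
hub.subscribe("video", lambda f: inspector_log.append(f"{f['client']}:{f['data']}"))
hub.subscribe("video", lambda f: recorder_log.append(f"{f['client']}:{f['data']}"))

# Two clients connect to the same hub.
hub.publish("glasses-1", "video", "frame-a")
hub.publish("headset-2", "video", "frame-b")
hub.publish("glasses-1", "audio", "chunk-a")  # no audio observers registered

print(inspector_log)
print(recorder_log)
```

Each agent receives frames from both clients without the clients knowing which agents exist, so adding another observer does not require changes on the device side.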

Real-world applications are already emerging. Teams at Stanford School of Medicine and Princeton University are exploring XR and AI workflows for stem cell therapy research that help researchers access contextual information and interact with laboratory systems while remaining focused on complex procedures. In manufacturing, Siemens is exploring how NVIDIA XR AI can help factory engineers find maintenance information, troubleshoot issues, verify work, and capture what happened on the shop floor.

What Does This Mean for Businesses and Developers?

The convergence of these three announcements signals a shift in how multimodal AI is being deployed. Rather than waiting for perfect, general-purpose AI systems, companies are building specialized agents for specific environments: websites for customer engagement, mobile devices for personal productivity, and AR glasses for hands-busy work environments like manufacturing and healthcare.

For businesses, this means more affordable options for deploying AI agents without upfront capital investment or technical expertise. SalesCloser's free-to-start model removes friction for small companies testing the technology. For developers, NVIDIA's open-source library and modular architecture reduce the complexity of building XR agents from scratch. For consumers, Google's integration of multimodal AI into Android and Wear OS brings these capabilities to billions of devices.

The common thread is that multimodal AI is becoming infrastructure, not a novelty. When a business can deploy an AI agent on its website in minutes, when a developer can build an XR agent without months of engineering, and when a smartphone can edit videos and translate speech in real time, the technology moves from research into everyday use. The announcements this week suggest that transition is accelerating.