The Data Factory Race: How India Is Becoming the Training Ground for Physical AI Robots
India is emerging as a critical infrastructure hub for Physical AI development, with a new startup launching the country's first dedicated robotics data factory to collect and provide large-scale human action datasets for training embodied AI systems. Founded by entrepreneur Abhinav Kukreja, Neocambrian AI announced its launch with a focus on building the foundational datasets that robotics companies need to train next-generation robots capable of performing real-world tasks.
Why Does Physical AI Need So Much Human Action Data?
The challenge facing robotics today mirrors the problem that large language models (LLMs) solved a few years ago. Just as LLMs required massive text datasets from the internet to learn language patterns, Physical AI systems need enormous collections of human movement and action data to learn how to manipulate objects, navigate spaces, and interact with their environment. Kukreja described this gap in a public note, explaining that robotics currently lacks "internet-scale datasets" comparable to the text datasets that enabled breakthroughs in AI chatbots and language understanding.
Neocambrian AI is building what it calls a high-fidelity, pre-training-scale database using specialized equipment, including egocentric video capture systems, motion tracking hardware, stereo capture rigs, and upgraded UMI (Universal Manipulation Interface) devices designed specifically for robotics training. The startup plans to provide thousands of hours of collected data free of charge to Indian researchers working on vision-language-action (VLA) models, which map camera input and language instructions to robot actions, and on world models, which learn to predict how the physical world behaves.
How Is India Positioned to Become a Physical AI Data Hub?
Kukreja framed India as a potential global hub for Physical AI datasets, citing three key advantages. The country has a large workforce available for data collection work, diverse real-world environments that provide varied training scenarios, and established operational experience in running distributed services at scale. This combination of factors makes India an attractive location for collecting the diverse, high-quality human action data that robotics companies need.
The timing of Neocambrian AI's launch reflects broader momentum in the Physical AI sector. The announcement comes days after other companies began experimenting with similar data collection initiatives. Another startup, Snabbit, confirmed to industry sources that US-based startup Human Archive had approached it about Physical AI data collection work, though Snabbit ultimately decided not to proceed.
Key Elements of the Physical AI Data Collection Landscape
- Data Collection Methods: Companies are using egocentric video systems that capture footage from a human's perspective, motion tracking hardware that records precise body movements, and stereo capture rigs that create 3D records of human actions and object interactions.
- Geographic Strategy: India's large workforce, diverse environments, and experience with distributed operations make it attractive for collecting varied training data at scale, compared to smaller or more homogeneous markets.
- Open Access Model: Neocambrian AI plans to provide collected data free to Indian researchers, which could accelerate local robotics development and create a competitive advantage for India in the Physical AI ecosystem.
- Industry Momentum: Multiple startups are entering the Physical AI data collection space, signaling that this infrastructure layer is becoming essential to the robotics industry's growth.
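To make the capture methods listed above more concrete, here is a minimal sketch of how one recorded demonstration might be described and how a stereo rig turns two camera views into depth. The `ActionEpisode` schema, its field names, and the helper functions are hypothetical illustrations, not Neocambrian AI's actual format. The depth formula is the standard relation for a rectified stereo pair (depth = focal length × baseline / disparity).

```python
from dataclasses import dataclass, field


@dataclass
class ActionEpisode:
    """One recorded human demonstration (hypothetical schema for illustration)."""
    task: str            # what the person was doing, e.g. "fold towel"
    environment: str     # where it was recorded, e.g. "home kitchen"
    duration_s: float    # length of the recording in seconds
    modalities: list[str] = field(default_factory=list)  # capture streams present


def stereo_depth_m(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def total_hours(episodes: list[ActionEpisode]) -> float:
    """Total recorded time across episodes, in hours."""
    return sum(e.duration_s for e in episodes) / 3600.0


if __name__ == "__main__":
    episodes = [
        ActionEpisode("fold towel", "home kitchen", 1800.0, ["egocentric_video", "stereo"]),
        ActionEpisode("stack boxes", "warehouse", 1800.0, ["egocentric_video", "motion_tracking"]),
    ]
    print(f"hours collected: {total_hours(episodes)}")
    # Assumed example values: 800 px focal length, 12.5 cm baseline, 50 px disparity.
    print(f"depth (m): {stereo_depth_m(800.0, 0.125, 50.0)}")
```

A dataset at "pre-training scale" would aggregate many thousands of such episodes; the sketch only shows the bookkeeping for two.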
Meanwhile, the Physical AI sector is seeing parallel developments in deployment and commercialization. IntBot, a San Jose-based robotics company, announced a strategic partnership with Certis Group, a Singapore-based intelligent operations company, to develop and deploy socially intelligent humanoid robots in real-world enterprise environments.
"With multimodal models maturing, the decisive bottleneck for embodied AI shifts from task manipulation to human interaction," said Lei Yang, CEO and co-founder of IntBot. "A robot's success in public spaces is increasingly measured by its ability to engage people, and Singapore's smart-infrastructure leadership makes it the ideal launchpad for Physical AI."
The IntBot-Certis partnership signals that Physical AI is transitioning from research and pilot projects toward operationally viable deployments in high-traffic public environments. The companies plan to develop humanoid concierge and service-assistance applications for use cases including wayfinding, visitor assistance, multilingual engagement, customer service support, and frontline operational support across transit, hospitality, healthcare, retail, and public venues.
Certis brings operational expertise in designing and running complex, mission-critical environments. The company was recently named by Singapore's Infocomm Media Development Authority (IMDA) as a design partner for Singapore's first large-scale, multi-operator robotics testbed in a live mixed-use public environment at Punggol Digital District.
"The next phase of enterprise robotics will not be defined by autonomy alone, but by how well robots can work with people in real operations," said Raahul Kumar, Chief Executive, International and Robotics and Chief Strategy Officer at Certis. "By combining IntBot's social intelligence technology with our experience in operational design and deployment, we can create humanoid robot applications that support frontline teams in demanding roles and make everyday public interactions simpler and more intuitive."
The emergence of data infrastructure companies like Neocambrian AI and deployment partnerships like IntBot-Certis reflects a broader maturation of the Physical AI industry. Rather than focusing solely on building better robot hardware or algorithms, companies are now addressing the foundational infrastructure challenges that will determine which regions and organizations can scale robotics deployment most effectively.
Industry observers note that the future of Physical AI depends on solving two interconnected problems: building the datasets needed to train capable robots, and creating operational frameworks that allow those robots to work safely and effectively alongside humans in real-world environments. India's focus on data infrastructure and Singapore's focus on operational deployment represent complementary approaches to scaling Physical AI globally.
The privacy and ethical concerns surrounding human action data collection remain important considerations as the industry grows. The emergence of startups focused on this work has intensified conversations around worker consent, data protection, and responsible data collection practices, even as companies race to build the datasets that Physical AI systems require.