FrontierNews.ai

How AI Researchers Are Solving the Privacy Problem in Collaborative Machine Learning

A new research framework called U-SplitDoRA allows multiple organizations to collaboratively fine-tune large language models without exposing their raw data or labels to the server or to other participants. The approach, developed by researchers at Vellore Institute of Technology and SASTRA Deemed University in India, combines three techniques to address a growing challenge in AI development: how to harness the power of collaborative training without exposing confidential information.

Why Does Privacy Matter in AI Training?

Large language models, or LLMs, are becoming increasingly powerful and useful across industries like healthcare, software development, and computer vision. However, training these massive AI systems requires enormous amounts of data. The problem is that most organizations with valuable data, such as hospitals with patient records or companies with proprietary business information, cannot share that data openly due to privacy regulations and competitive concerns.

Federated learning emerged as a solution, allowing multiple organizations to train a shared AI model without sending raw data to a central server. Instead, each organization trains the model locally on its own data, then sends only its model updates (such as gradients or updated weights) to a central server for aggregation. However, this approach still faces a major bottleneck: large language models contain billions of parameters, which makes them computationally expensive to train on client devices, especially those with limited computing resources.
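The aggregation step described above can be sketched in a few lines. This is a minimal illustration that assumes FedAvg-style weighted averaging, a common choice in federated learning; the article does not say which aggregation rule U-SplitDoRA uses, and the function name, client vectors, and dataset sizes below are hypothetical.

```python
from typing import List


def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Assumption: FedAvg-style aggregation. The server only ever receives
    parameter vectors, never the data they were trained on.
    """
    total = sum(client_sizes)
    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Three hypothetical clients, each holding a two-parameter model.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]  # number of local training examples per client
global_weights = federated_average(clients, sizes)
```

Clients with more local data pull the shared model further toward their own parameters, which is why the third client dominates the result here.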

How Does U-SplitDoRA Improve on Existing Methods?

U-SplitDoRA addresses these challenges through a three-part strategy. First, it uses split learning, which partitions the AI model into three sections: a head and tail that remain on the client's local device, and a body that runs on the central server. This design ensures that raw data and labels never leave the client side, providing strong privacy guarantees.
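To make the U-shaped split concrete, here is a toy forward pass, offered as a sketch rather than the researchers' implementation. The `Client` and `Server` classes, the tiny weight matrices, and the squared-error loss are all assumptions chosen for illustration; a real deployment would split a transformer and also send gradients back along the same path in reverse.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix) -> Vector:
    """Multiply vector x by matrix w (one row of w per output unit)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class Server:
    """Holds only the body of the model (hypothetical toy class)."""

    def __init__(self, body_w: Matrix) -> None:
        self.body_w = body_w
        self.seen: List[Vector] = []  # log of everything the server receives

    def run_body(self, activations: Vector) -> Vector:
        self.seen.append(activations)
        return relu(linear(activations, self.body_w))


class Client:
    """Holds the head, the tail, the raw data, and the labels."""

    def __init__(self, head_w: Matrix, tail_w: Matrix) -> None:
        self.head_w = head_w
        self.tail_w = tail_w

    def forward(self, raw_input: Vector, label: float, server: Server) -> float:
        hidden = relu(linear(raw_input, self.head_w))   # head: runs on the client
        body_out = server.run_body(hidden)              # body: runs on the server
        prediction = linear(body_out, self.tail_w)[0]   # tail: back on the client
        return (prediction - label) ** 2                # loss never leaves the client


server = Server(body_w=[[1.0, 0.0], [0.0, 1.0]])
client = Client(head_w=[[0.5, 0.5], [1.0, -1.0]], tail_w=[[1.0, 1.0]])
loss = client.forward(raw_input=[2.0, 4.0], label=3.0, server=server)
```

In this sketch the server's log contains only the head's output activations; the raw input and the label are used exclusively inside `Client.forward`.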

Second, the framework replaces LoRA, a standard parameter-efficient fine-tuning technique, with an improved variant called DoRA, which stands for weight-decomposed low-rank adaptation. DoRA decomposes each pretrained weight matrix into a magnitude component and a direction component and trains both, applying the low-rank update to the direction. This makes the adapted model more expressive and narrows the gap between parameter-efficient fine-tuning and full-parameter fine-tuning.
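The magnitude-and-direction idea can be written out as a short sketch. It follows the general DoRA formulation, in which the adapted weight is a learned magnitude vector multiplied by the column-normalized sum of the frozen weight and a low-rank update B·A. The helper names, the 2x2 example matrix, and the rank of 1 are assumptions for illustration, and the code assumes no column of the updated matrix is all zeros.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (d x r) and b (r x k)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_norms(w: Matrix) -> List[float]:
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0: Matrix, b: Matrix, a: Matrix, magnitude: List[float]) -> Matrix:
    """Return magnitude * (w0 + b @ a) / column_norm(w0 + b @ a)."""
    delta = matmul(b, a)  # low-rank update, as in LoRA
    directed = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
                for i in range(len(w0))]
    norms = column_norms(directed)  # assumed non-zero for every column
    return [[magnitude[j] * directed[i][j] / norms[j] for j in range(len(norms))]
            for i in range(len(directed))]


w0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen pretrained weight (toy values)
a = [[1.0, 1.0]]                # trainable low-rank factor, shape r x k with r = 1
b_init = [[0.0], [0.0]]         # trainable low-rank factor, zero at initialization
m = column_norms(w0)            # magnitude starts as the column norms of w0

w_start = dora_weight(w0, b_init, a, m)        # equals w0 before any training
b_trained = [[1.0], [0.0]]                     # hypothetical value after some training
w_adapted = dora_weight(w0, b_trained, a, m)   # direction changed, magnitude unchanged
```

Because the columns are renormalized before being scaled, changing `b` and `a` alters only the direction of each column, while the separate vector `m` controls its length.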

Third, U-SplitDoRA harnesses the parallelization power of federated learning, allowing multiple organizations to train simultaneously while maintaining privacy. The U-shaped architecture is the key innovation here, ensuring that neither the server nor other participants ever see sensitive information.

What Results Did the Researchers Achieve?

The research team tested U-SplitDoRA using GPT-2-S and GPT-2-M models trained on the E2E dataset, a widely used benchmark for natural language generation. The simulation results showed that U-SplitDoRA achieved better accuracy and faster convergence than other state-of-the-art LLM fine-tuning frameworks.

This is significant because it suggests that privacy-preserving training does not require sacrificing model quality. Previous frameworks like SplitLoRA and HSplitLoRA enabled collaborative fine-tuning through model partitioning, but they left open questions about privacy preservation and adaptation quality. U-SplitDoRA aims to close those gaps.

How to Implement Privacy-Preserving AI Training

  • Model Partitioning: Split the AI model into three sections, keeping the sections that touch raw data and labels on client devices and offloading the bulk of the computation to the server. This reduces both privacy exposure and the client's computational burden.
  • Weight Decomposition: Use DoRA instead of standard LoRA techniques to update both the magnitude and direction of model weights, improving the model's expressiveness and reducing performance gaps compared to full fine-tuning.
  • Federated Aggregation: Implement federated learning protocols where multiple organizations train locally and send only model updates to a central server, ensuring raw data never leaves organizational boundaries.

What Does This Mean for AI Development?

The implications of U-SplitDoRA extend beyond academic research. As data privacy regulations like GDPR and HIPAA become stricter, and as organizations increasingly recognize the competitive value of their data, privacy-preserving training methods become essential infrastructure for AI development. Healthcare providers, financial institutions, and government agencies all have incentives to collaborate on AI models without exposing confidential information.

The framework also addresses the data scarcity problem facing the AI industry. High-quality training data is becoming harder to find, and many organizations have limited datasets individually. By enabling collaborative training, U-SplitDoRA allows smaller organizations to pool their data resources without actually sharing the data itself, democratizing access to advanced AI capabilities.

The research suggests that the future of AI development need not be centralized in a few large companies that train massive models on proprietary datasets. Instead, distributed, privacy-preserving training frameworks could enable a more collaborative ecosystem where organizations of all sizes contribute to and benefit from shared AI advancement while maintaining strict control over their sensitive information.