Data-Centric AI with Foundation Models: A Practical Guide

JD Prater

August 8, 2023

Imagine a world where your AI model's performance is determined not just by the sophistication of your algorithms and data science expertise, but also by the quality of your data. Welcome to the era of data-centric AI, a paradigm shift that is fueling the rapid evolution in the field of Artificial Intelligence (AI) and Machine Learning (ML).

Gone are the days when the majority of resources and focus in AI development were directed towards refining model architectures, perfecting algorithm designs, and the intricate art of feature engineering. 

Today, the tide is turning. Visionaries in the field are recognizing that data, the lifeblood powering these AI models, deserves equal if not greater attention. 

What is data-centric AI?
“Data-centric AI is the discipline of systematically engineering the data needed to successfully build an AI system.” Andrew Ng, Adjunct Professor at Stanford University’s Computer Science Department, IEEE Spectrum

This shift towards data-centric AI promotes iterative improvements in data quality and quantity, breeding a more dynamic and agile model development process.

So, how does this shift impact today's AI practitioners and businesses? How does it redefine the way we approach AI and ML development?

In this guide, I'll explore the impact of this shift to data-centric AI powered by foundation models. I'll unpack its core principles, use cases, and tangible benefits like faster deployment, improved accuracy, standardized workflow, and lower costs.

Let's dive in to understand how focusing on your data can accelerate your Modern AI success and catalyze your digital transformation.

The Evolution from Model-Centric to Data-Centric AI

In the early days of machine learning, much of the development process was model-centric. This approach viewed the dataset as something "outside" or that came "before" the actual AI development process. 

Data scientists typically viewed their training datasets as collections of ground-truth labels, with the machine learning model adjusted to fit that labeled training data. They might download a benchmark dataset, such as ImageNet, as a static archive and treat the training data as an unchanging resource.

The majority of innovation would then occur within the model, with changes resulting from alterations in feature engineering, algorithm design, or custom architecture design.

However, this model-centric approach often overlooked a crucial factor: the data. As AI and machine learning models grew more complex and opaque, they required much larger volumes of training data. This need for more data sparked a realization within the AI community - that the data, not just the model, should be the focus of iteration and improvement.

Enter data-centric AI, an approach that shifts the emphasis from the model to the data. Instead of treating data as a static artifact, this approach views data as a dynamic component of the AI system, ripe for iteration and improvement. It involves spending more time on labeling, managing, slicing, augmenting, and curating the data efficiently, with the model itself relatively more fixed.

The transition from a model-centric to a data-centric approach represents a significant shift in the ML community, but it is not a binary choice. Successful AI still requires well-conceived models. But by placing greater emphasis on data quality and iterability, data-centric AI brings new opportunities and challenges in AI development, opening the door to the power and potential of foundation models.

In the next section, we’ll explore the core principles of data-centric AI that transformed the process of building reliable, effective AI systems - and the role of foundation models in this paradigm.

Core Principles of Data-Centric AI 

When considering a data-centric AI approach, three main principles emerge:

  1. Emphasis on data over models: As AI models become more user-friendly and commoditized, progress in AI development increasingly hinges on the quality and quantity of training data rather than on feature engineering, model architecture, or algorithm design.
  2. Programmatic approach: With the voluminous training data that today’s deep-learning models require, a programmatic process for labeling and iterating the data is vital. Manually labeling millions of data points is simply not practical.
  3. Inclusion of Subject Matter Experts (SMEs): SMEs, with their in-depth understanding of how to label and curate data, play a crucial role in a data-centric approach. Their expertise injects domain-specific knowledge directly into the model, paving the way for programmatic supervision.
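
The programmatic approach in principle 2, informed by the SME knowledge in principle 3, can be illustrated with a minimal sketch. It assumes a hypothetical spam-detection task. The labeling functions, the label constants, and the majority-vote rule are all illustrative assumptions rather than any particular library's API; in practice an SME would supply the heuristics.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_contains_offer(text):
    """Heuristic an SME might supply: promotional wording suggests spam."""
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    """Heuristic: a personal greeting suggests a legitimate message."""
    return NOT_SPAM if text.lower().startswith(("hi ", "hello ")) else ABSTAIN

def lf_many_exclamations(text):
    """Heuristic: three or more exclamation marks suggest spam."""
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_offer, lf_has_greeting, lf_many_exclamations]

def weak_label(text):
    """Combine votes by simple majority; ties and all-abstain cases abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting heuristics: leave for human review
    return counts[0][0]

def label_dataset(texts):
    """Apply the weak labeler to every text in a collection."""
    return [weak_label(t) for t in texts]
```

Items that come back as `ABSTAIN` are natural candidates for manual review, which keeps human labeling effort focused on the cases the heuristics cannot settle.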

The Growing Impact of Foundation Models

With the growing realization that building advanced machine learning models from scratch is not feasible for most organizations due to high costs and computational challenges, the role of foundation models has become even more pronounced. Training these models requires massive resources, running into millions of dollars, extensive memory, and even hundreds of GPUs working for months.

Foundation models are shifting the paradigm because they can be applied "out-of-the-box" to a wide range of tasks without task-specific training. This shift not only reduces the emphasis on curating a task-specific training dataset but also increases the focus on experimenting, envisioning, and rapidly iterating ideas. Their accessibility through natural language interfaces opens doors for domain experts to employ them quickly and effectively across varied settings.

Because training a model from scratch is so costly, many companies now opt to fine-tune existing foundation models for their specific needs. Fine-tuning makes data quality central to a data-centric AI approach in several ways:

  • Efficiency and Cost-Effectiveness: Leveraging prebuilt foundation models saves time and money. The efficiency of these models depends on high-quality data, ensuring adaptability and performance without extra costs.
  • Scalability: The precision and relevance of data are paramount with the significant computational requirements of foundation models. Low-quality data can waste resources and hinder scaling.
  • Adaptability: Fine-tuning foundation models requires accurate and relevant data. Poor data quality limits their applicability and effectiveness.
  • Integration: Aligning data quality with the specific requirements of the model is essential for integration, avoiding challenges and reduced performance.

Thus, as the trend shifts towards utilizing existing foundation models, the quality of data takes center stage. Ensuring data quality is vital for achieving desired outcomes and optimizing costs, scalability, and adaptability. It's not a mere option but a fundamental necessity in the era of foundation models and data-centric AI.
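
To make the data-quality point concrete, here is a minimal sketch of an audit you might run before fine-tuning. It assumes a hypothetical dataset of `(text, label)` pairs; the function name `audit_dataset` and the four checks are illustrative choices, not a standard or a complete quality process.

```python
from collections import defaultdict

def audit_dataset(records):
    """Flag common problems in a list of (text, label) pairs.

    Returns row indices for empty texts, missing labels, and exact
    duplicates, plus the normalized texts that carry conflicting labels.
    """
    seen = set()
    labels_by_text = defaultdict(set)
    report = {
        "empty_text": [],
        "missing_label": [],
        "duplicates": [],
        "conflicting_labels": [],
    }
    for i, (text, label) in enumerate(records):
        # Normalize case and whitespace so near-identical rows compare equal.
        normalized = " ".join(text.lower().split())
        if not normalized:
            report["empty_text"].append(i)
            continue
        if label is None:
            report["missing_label"].append(i)
            continue
        if (normalized, label) in seen:
            report["duplicates"].append(i)
        else:
            seen.add((normalized, label))
        labels_by_text[normalized].add(label)
    report["conflicting_labels"] = sorted(
        text for text, labels in labels_by_text.items() if len(labels) > 1
    )
    return report
```

Conflicting labels are usually the most valuable finding, since they point at ambiguous guidelines or annotator disagreement that a subject matter expert can resolve.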

We now focus more on understanding the emergent properties of the data and the real-world problems they can address, rather than the models themselves. This shift invites us to explore new opportunities, innovate, and drive tangible results through AI, emphasizing the need for meticulous attention to data quality.

Improving Performance and Accuracy at Scale with Active Learning

One effective strategy for iterating on data is active learning, an approach often described as a form of semi-supervised machine learning, in which a model learns iteratively. It starts with a small set of labeled data to train an initial model. Then, it intelligently selects the most informative samples from the unlabeled pool of data to be labeled for the next iteration of training.

Active Learning in Graft's Modern AI Platform

Here's how active learning fortifies a data-centric AI approach:

  1. Efficient Use of Resources: Active learning ensures that resources for labeling are allocated to the most informative examples, thereby reducing the overall amount of labeling required.
  2. Improving Data Quality: By concentrating on uncertain or challenging instances, active learning can help detect and correct labeling errors or biases in the initial dataset, thereby augmenting data quality.
  3. Model Performance: Active learning can boost model performance with less data compared to traditional methods, thanks to its iterative learning from the most informative data.
  4. Adaptability: As the model updates with each iteration, active learning enables the system to adapt to changes in the data distribution over time.
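
The loop described above can be sketched in a few lines. This is a minimal illustration of pool-based active learning with uncertainty sampling, not a description of how Graft implements it. It assumes one-dimensional inputs, a tiny logistic regression trained by gradient descent, and an `oracle` function standing in for a human annotator; all names and the decision boundary at 0.3 are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, epochs=500, lr=0.5):
    """Fit a 1-D logistic regression (weight, bias) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(points, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def most_uncertain(model, pool):
    """Uncertainty sampling: index of the pool item whose predicted
    probability is closest to 0.5."""
    w, b = model
    return min(range(len(pool)), key=lambda i: abs(sigmoid(w * pool[i] + b) - 0.5))

def oracle(x):
    """Stand-in for a human annotator (assumed true boundary at x = 0.3)."""
    return 1 if x > 0.3 else 0

def active_learning(pool, seed_points, rounds=10):
    """Train on a small seed set, then repeatedly label the most uncertain item."""
    labeled_x = list(seed_points)
    labeled_y = [oracle(x) for x in labeled_x]
    pool = list(pool)
    model = train_logistic(labeled_x, labeled_y)
    for _ in range(rounds):
        if not pool:
            break
        x = pool.pop(most_uncertain(model, pool))
        labeled_x.append(x)
        labeled_y.append(oracle(x))  # only the selected item gets labeled
        model = train_logistic(labeled_x, labeled_y)
    return model, labeled_x, labeled_y

if __name__ == "__main__":
    pool = [i / 50 - 1 for i in range(101)]  # unlabeled points in [-1, 1]
    model, xs, ys = active_learning(pool, seed_points=[-0.9, 0.9], rounds=10)
    print(f"labeled {len(xs)} points; queried: {xs[2:]}")
```

Only the two seed points and the ten queried points are ever sent to the annotator, which is the resource saving described in item 1.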

In the next section, we'll discuss the benefits of adopting a data-centric AI approach and how foundation models can amplify these benefits.

What are the Benefits of a Data-Centric Approach?

Adopting a data-centric AI approach powered by foundation models yields benefits to organizations across industries.

Here are the key ways this approach creates value:

  1. Speed and Efficiency: By emphasizing data and utilizing foundation models as a robust starting point, you can expedite the development process. Fine-tuning these models with your specific dataset often yields results faster than creating a custom model from scratch.
  2. Improved Accuracy: A data-centric approach's emphasis on data quality can greatly enhance AI model accuracy. Foundation models, pre-trained on vast datasets, provide a strong base for fine-tuning, further boosting accuracy levels.
  3. Cost-effectiveness: Shifting the focus from creating bespoke models to curating data can lead to substantial cost savings. Foundation models, pre-trained on extensive data, let you avoid most of the high upfront costs of training a model from scratch.
  4. Agility: This approach is particularly suited for dynamic environments, where business objectives and data frequently evolve. By focusing on iterating your data, you can swiftly adapt your AI systems to changing needs.
  5. Data Privacy: For industries dealing with sensitive data, a data-centric approach offers enhanced privacy. By focusing on internal data and iterating on it, you can minimize the risk of data privacy breaches that could occur when outsourcing data handling or using external datasets for training.
  6. Democratization of AI: Foundation models reduce the technical skill requirements for AI development, allowing a wider range of organizations to adopt AI. They provide a versatile starting point, adaptable across a wide range of applications.

These benefits underscore the growing shift towards data-centric AI strategies across sectors. Foundation models amplify these advantages, transforming the landscape of AI development and deployment. In the next section, we'll look at practical use cases.

Practical Use Cases Across Industries

Let's highlight some practical applications of data-centric AI across a few different industries.

Marketplaces: In bustling online marketplaces, consumer behavior can pivot rapidly. By iterating on fresh data, you can keep recommendations relevant, suggesting products based on a user's past purchases or browsing history.

Real Estate: In the real estate sector, making sense of vast and varied data can be a daunting task. Foundation models can aid in this data interpretation, providing insights into market trends and price fluctuations. Whether it's predicting property prices or analyzing neighborhood features, data-centric AI can provide a competitive edge to real estate firms.

Retail: In the retail industry, customer behavior patterns can change rapidly. Data-centric AI, with its focus on data iteration, allows businesses to adapt quickly to these changes. Now you can analyze customer purchase histories and browsing behavior to help prevent churn. Then you can tailor promotions and offers to individual customers to encourage repeat purchases and loyalty.

Logistics: The logistics industry thrives on precision and predictability. Foundation models fine-tuned on your company’s logistics data can help predict delivery times, optimize routes, and manage inventory more efficiently. This approach can lead to reduced operational costs and improved customer satisfaction.

Legal and Compliance: With a Q&A app, you can process lengthy legal documents, such as contracts and case law, so that lawyers can quickly identify essential information and better prepare for negotiations or court proceedings.

Gaming: By using modern AI for content moderation, you can detect and prevent cheating, monitor in-game chat for inappropriate behavior, and prevent the sharing of illegal content.

Travel and Hospitality: By using similarity search to group customers with similar travel preferences or booking behavior, you can offer personalized travel packages, promotions, or destination recommendations, enhancing the booking experience and increasing customer satisfaction.
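
As an illustration of the similarity search mentioned in the travel example, here is a minimal sketch using cosine similarity over embedding vectors. The customer names and three-dimensional vectors are made up for the example; in practice the embeddings would come from a foundation model and would have many more dimensions, and a dedicated vector index would likely replace the linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, k=2):
    """Return the k other entries closest to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings of customer booking behavior.
customer_embeddings = {
    "alice": [0.9, 0.1, 0.0],  # mostly beach trips
    "bob": [0.8, 0.2, 0.1],
    "carol": [0.0, 0.1, 0.9],  # mostly ski trips
    "dave": [0.1, 0.0, 0.8],
}

if __name__ == "__main__":
    for name, score in most_similar("alice", customer_embeddings):
        print(f"{name}: {score:.3f}")
```

Customers who land close together in this space can then be offered the same packages or promotions.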

These real-world applications underscore the adaptability and versatility of data-centric AI powered by foundation models. The approach's success across diverse sectors testifies to its efficacy, strengthening the case for its adoption as a vital component of your organization's AI strategy.

In the next section, we will provide some practical steps to embracing a data-centric AI strategy leveraging foundation models in your business.

Embrace a Data-Centric Approach Leveraging Foundation Models 

Whether you're embarking on your first AI project or already have several in progress, transitioning to a data-centric approach can be achieved with the right strategies in place. Outlined below are some comprehensive steps to assist your journey:

1) Define Your Business Objectives

Kick-start your AI project by pinpointing the business challenges you aim to solve with AI. Are you looking to forecast customer behavior, streamline your supply chain, or elevate your product's customer experience? These defined objectives should be the compass for your AI project's course.

2) Know Your Data Landscape

Carry out an assessment of the data at your disposal. It's not just about accessing structured and unstructured data but also unlocking its potential to train your foundation models. Moreover, understanding the nature of your data - be it text, images, numerical, or time-series - is crucial in guiding the selection of the suitable foundation model.

3) Select Your Foundation Model

Your data type and business needs together should inform the choice of foundation model. For instance, if your project involves natural language processing, OpenAI’s GPT-4 or one of Cohere's models is a suitable choice, whereas image-related tasks could benefit from a pretrained vision model such as ResNet18.

4) Customize the Foundation Model to Your Data

Fine-tune the chosen foundation model on your data to align it with your specific task. In the data-centric AI paradigm, the emphasis is on iteratively improving your data to raise model performance over time, a stage where active learning can prove pivotal.

5) Engage Subject Matter Experts

Involve experts in the data labeling and curation process. Their domain-specific knowledge can dramatically improve the model's performance by providing invaluable insights during data annotation.

6) Test, Deploy, and Monitor

Once fine-tuning and training are complete, evaluate your model for performance and accuracy before deploying it in a real-world environment. Continued monitoring and periodic updates are crucial to ensure sustained model efficacy.

7) Evaluate Success

Define key performance indicators (KPIs) to measure the success of your AI initiatives. These could be business metrics like cost savings, improved precision, or time efficiency.
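
Steps 6 and 7 can be sketched as a small evaluation helper. This is a minimal example assuming a binary classification task with held-out labels; the function names and the target thresholds are hypothetical, and your own KPIs may be business metrics rather than model metrics.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall on a held-out set."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def meets_targets(metrics, targets):
    """Check every metric against its minimum acceptable value."""
    return all(metrics[name] >= minimum for name, minimum in targets.items())

if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    metrics = evaluate(y_true, y_pred)
    # Hypothetical release criteria agreed with the business.
    print(metrics, meets_targets(metrics, {"precision": 0.7, "recall": 0.7}))
```

Running the same check on fresh production samples at regular intervals is one simple way to implement the continued monitoring described in step 6.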

Embracing this modern, data-centric approach to AI can significantly boost the effectiveness of your AI projects, propelling superior business outcomes. By centering your strategies around data and harnessing the power of foundation models, you're poised for sustainable and high-impact AI deployments.

The Simplest Path to Data-Centric AI Implementation

Venturing into the era of data-centric AI is no longer a daunting task. It's time to reap the benefits of AI, and our Modern AI platform can make it happen.

“Modern AI is a data-centric approach that leverages foundation models to achieve better business results.”

Many organizations feel challenged by the lack of Modern AI infrastructure and the prospect of investing heavily in a specialized ML engineering team.

Graft offers a full-production AI system for such organizations, providing everything you need to deploy AI-powered solutions just like the industry leaders, without needing ML expertise.

Whether you have existing AI infrastructure or not, Graft is designed for you. It seamlessly integrates with your current systems, shifting your resources to where they truly matter - your data and use cases, which directly drive business outcomes. By eliminating the need for substantial resource allocation towards building and maintaining ML infrastructure, Graft enables you to focus on creating distinguishing business value with the power of AI.

For those without an existing AI infrastructure or resources to build one, Graft's no-code AI platform opens a new world of possibilities. Now, business users can build intelligent solutions without needing data scientists or MLEs. By enabling faster, higher-quality business insights, you can save hundreds of hours that would otherwise be spent on manual analysis and tinkering.

Discover the power of extracting valuable information from unstructured data sources including images, videos, audio, and text. With Graft, creating AI-powered solutions is now within everyone's reach.


Last Updated

February 1, 2024

JD Prater

Head of Marketing

JD writes about his experience using and building AI solutions. Outside of work, you'll find him spending time with his family, cycling the backroads of the Santa Cruz mountains, and surfing the local sandbars. Say hi on LinkedIn.
