The AI Wake-Up Call for Data Engineers: Why LLMs + MCP Matter Now

AI Wake-Up Call for Data Engineers | Image By ChatGPT

I wrote a blog post a few months ago, How to Build an AI Agent for Data Analytics Without Writing SQL. One comment I received was, “This is mostly a toy example; my production jobs are way more complex than this; it won’t be applicable to my job.” That comment misses a bigger point: AI is evolving fast, and it is reshaping how we build pipelines, write SQL, and even debug. In many companies, AI skills are no longer optional for data engineers; proficiency is becoming mandatory.

Let us look at how modern AI, particularly LLMs, RAG, and MCP, is becoming not only relevant, but essential for production-grade data engineering.

Can AI resolve the intricate SQL and data pipeline issues that data engineers face? Set your skepticism aside for a moment, and I will show you what AI can do and how it can enhance data engineering.

Let’s examine why there is widespread skepticism about AI’s capacity to handle complex data engineering logic and what has changed.

LLMs: Power and Pitfalls

LLMs come with exceptional language comprehension. Give them a high-level overview of the required code structure along with a sample, and they can handle program writing smoothly. Tools like Cursor and Cline demonstrate how straightforward code generation with LLMs has become.

The developer serves as both a product manager and a QA specialist. As the product manager, you write a prompt that states the problem and clearly describes the requirements to the AI. As QA, you know how to run the code, test for the desired outcome, and feed the results back to the AI for debugging.

Everything seems simple. However, reality isn’t straightforward. 

LLMs are trained on past data; they perform impressively on data similar to what they have seen, but they hallucinate on data they have never seen before.

If the foundation model is not properly post-trained, the LLM may guess incorrectly. ChatGPT may generate confident-sounding code that refers to nonexistent libraries, such as a fictional Spark T-digest implementation, which can mislead engineers unfamiliar with the underlying tools.

Since an LLM can understand a wide range of languages, all we need to do is provide it with the most recent data.

 

Introducing RAG: Smarter AI with Context

Think of an LLM as a smart but humble student who absorbs any new information presented to it.

If a library we use is out of date, we give the LLM a link to the updated documentation so it can learn from it.

If some facts are out of date, we give the LLM the relevant Wikipedia link so it can learn from those.

This is the fundamental idea behind RAG — Retrieval-Augmented Generation, a technique that improves the accuracy and relevance of Large Language Models (LLMs) by combining them with external data sources.

RAG is a technique in which a retrieval system searches external data sources for information pertinent to a user’s query. The LLM then uses the retrieved information to produce a more precise and current response.

A good example for a data engineer would be to write an SQL query to retrieve the best-selling product over the last 10 days.

Because the LLM does not know the table schema or column names, asking it directly may result in an unusable SQL query.

To improve this, we can implement RAG to fetch the schema and column names so the LLM can generate the query more accurately.

AI Powered SQL Query | Image By Author
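Here is a minimal sketch of that retrieval-augmented flow in Python with DuckDB. The sales table, its columns, and the call_llm helper are all invented for illustration; swap the helper for whichever LLM SDK you actually use.

import duckdb

con = duckdb.connect()  # in-memory database, just for the sketch

# Toy table standing in for a real warehouse table (hypothetical schema).
con.execute("""
    CREATE TABLE sales (
        product  VARCHAR,
        quantity INTEGER,
        sold_at  DATE
    )
""")

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (OpenAI, DeepSeek, etc.).
    # Returns a canned answer here so the sketch runs end to end.
    return """
        SELECT product, SUM(quantity) AS total_sold
        FROM sales
        WHERE sold_at >= CURRENT_DATE - INTERVAL 10 DAY
        GROUP BY product
        ORDER BY total_sold DESC
        LIMIT 1
    """

# Retrieval: pull the live schema from DuckDB's catalog.
schema_rows = con.execute(
    "SELECT column_name, data_type FROM information_schema.columns "
    "WHERE table_name = 'sales'"
).fetchall()
schema_text = "\n".join(f"{name}: {dtype}" for name, dtype in schema_rows)

# Augmentation: ground the question in the retrieved schema.
prompt = (
    "You are a SQL assistant for DuckDB.\n"
    f"Table sales has these columns:\n{schema_text}\n\n"
    "Write a query that returns the best-selling product over the last 10 days."
)

# Generation: the LLM writes SQL against real column names; we run it.
generated_sql = call_llm(prompt)
print(con.execute(generated_sql).fetchdf())

The retrieval step is the only new moving part: everything the LLM needs to know about your warehouse travels inside the prompt.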

With RAG, you are basically building your own customized functions to call various tools. If that still feels too hacky, there are newer options.

The Rise of MCP: Standardized Tool Access

While RAG enables LLMs to respond with accurate context, MCP addresses a different issue: how to standardize tool integration.

MCP (Model Context Protocol) is an open standard for connecting AI to databases and tools—like an API layer purposely built for LLMs. Tool providers define a shared interface instead of having every team build one-off integrations with DuckDB, Snowflake, or Spark. LLMs then plug into it seamlessly. 

The provider of a tool offers its interface to the LLM. We can then subscribe to those tools and have the LLM plug them in for us. Since the tool provider built the tools against MCP, they are officially supported, and everyone adheres to the same standard.

LLM + MCP | Image By Author
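To make the provider side concrete, here is a hedged sketch of a tiny MCP server using the FastMCP helper from the MCP Python SDK (the mcp package). The query_duckdb tool and the analytics.duckdb path are made up for illustration; they are not part of any official server.

import duckdb
from mcp.server.fastmcp import FastMCP

# A tiny MCP server exposing one tool. Any MCP-aware client
# (Cline, Claude Desktop, ...) can discover and call it over stdio.
mcp = FastMCP("duckdb-demo")

@mcp.tool()
def query_duckdb(sql: str) -> str:
    """Run a read-only SQL query against a local DuckDB file and
    return the result as CSV text for the LLM to reason over."""
    con = duckdb.connect("analytics.duckdb", read_only=True)
    try:
        return con.execute(sql).fetchdf().to_csv(index=False)
    finally:
        con.close()

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport

Once a server like this is registered in the client's MCP settings, the LLM sees query_duckdb in its tool list and decides on its own when to call it.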

Real-World Example: DuckDB + MCP + LLM

We will use the DuckDB MCP server with Cline in VS Code.

Cline MCP with DuckDB | Image By Author

We will use DeepSeek V3 as the LLM here. To configure DuckDB, you can follow the instructions on MCP for DuckDB.

For local testing, all you need to do is set the db path to ":memory:":

				
{
  "mcpServers": {
    "mcp-server-motherduck": {
      "command": "uvx",
      "args": [
        "mcp-server-motherduck",
        "--db-path",
        ":memory:"
      ]
    }
  }
}

We will build another example by loading a dataset from Kaggle: the Netflix Movies and TV Shows dataset (CC0: Public Domain).

In Cline, you can test a prompt:

load file to DuckDB "~/Downloads/netflix_titles.csv", 
then Check the netflix table in DuckDB,
get total number of shows by each director

Cline will automatically call the DuckDB MCP server and pick the proper tool for our goal; it then loads the data into DuckDB and generates a query to fetch the result.

MCP DuckDB Example | Image By Author
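For reference, the steps Cline performs amount to something like the following DuckDB calls. This is a sketch of equivalent Python, not the exact SQL the agent produced; the file path and table name follow the prompt above.

import duckdb

con = duckdb.connect()  # in-memory, matching the :memory: config above

# Load the CSV into a table; read_csv_auto infers the column types.
# The path mirrors the prompt; use an absolute path if ~ is not expanded.
con.execute("""
    CREATE TABLE netflix AS
    SELECT * FROM read_csv_auto('~/Downloads/netflix_titles.csv')
""")

# Total number of shows by each director.
result = con.execute("""
    SELECT director, COUNT(*) AS total_shows
    FROM netflix
    WHERE director IS NOT NULL
    GROUP BY director
    ORDER BY total_shows DESC
""").fetchdf()
print(result.head(10))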

Let’s try something more complex. For this one we will borrow from a Kaggle notebook, https://www.kaggle.com/code/shivamb/netflix-shows-and-movies-exploratory-analysis, which has done great EDA on top of this dataset.

I picked a random query — show me the Distribution of Movie Duration

Cline will first check the schema of the DB

DuckDB MCP Check Schema | Image By Author

It generated the following query. The initial attempt failed because additional data transformations were needed, so it took the error as feedback and tried twice until the issues were resolved.

				
SELECT
  CASE
    WHEN duration_minutes BETWEEN 0 AND 60 THEN '0-60 min'
    WHEN duration_minutes BETWEEN 61 AND 90 THEN '61-90 min'
    WHEN duration_minutes BETWEEN 91 AND 120 THEN '91-120 min'
    WHEN duration_minutes BETWEEN 121 AND 150 THEN '121-150 min'
    ELSE '150+ min'
  END AS duration_range,
  COUNT(*) AS movie_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM (
  SELECT TRY_CAST(TRIM(REPLACE(duration, 'min', '')) AS INTEGER) AS duration_minutes
  FROM netflix
  WHERE type = 'Movie'
    AND duration IS NOT NULL
    AND duration LIKE '%min%'
)
WHERE duration_minutes IS NOT NULL
GROUP BY duration_range
ORDER BY duration_range

Output

DuckDB MCP Complex Example | Image By Author

Final Thoughts

We have seen that MCP + LLM unlocks tool integration and simplifies the code development cycle. AI can already perform data transformations in DuckDB, and I am confident it can integrate with dbt, Snowflake, and Spark to ease the workload of data engineers.

With AI, you can build more tools and productionize your ideas far more easily.

AI is becoming a must-have for data engineers, no longer a nice-to-have. Start small. Build an AI-powered tool. You’ll be surprised how much friction you can remove.

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to all stories on Medium.

In case of questions or comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.

More Articles

How to Engage with Users By Storytelling: Show Data Analytics in R and Shiny

How I Built a Tool to Visualize Expense In Sankey Diagram

5 Lessons I Learned From a Totaled Car Accident