Institute for Artificial Intelligence & Data Science

There's Gold in Them There CSV Files

What is the future of AI? More importantly, where—and on what—will the software engineers and data scientists we mint today actually be working over the next two to four years?

It won't be training the next massive large language model in pursuit of artificial general intelligence. Hardly.

Instead, the next several years will bring the opposite trend: a proliferation of smaller, targeted projects focused on giving LLMs real predictive power. My Fall 2025 Data Intensive Computing course offered a glimpse of that future. My students built exactly what it will look like: LLMs enhanced through a technology called the Model Context Protocol (MCP) to deliver custom, domain-specific insights.

The Background

I try to stay close to my industry colleagues. Academia and industry often view AI technologies through very different lenses, and those differences matter. A good friend and industry practitioner, Andrew Siradas, recently pointed me toward two ideas that reframed my thinking: agentic AI and Model Context Protocol.

A quick aside for fellow programmers: if you haven't downloaded Claude Code and experimented with it yet, you should. It requires a fundamentally different mindset than browser-based LLM tools, and that difference is precisely the point.

So what is MCP? At its core, it's remarkably simple—and incredibly powerful. MCP provides a standardized way to connect external data sources and services to an LLM. I often describe it as USB-C for AI: a universal connector that allows AI systems to interact with tools, datasets, and models without bespoke integrations. For those interested in a deeper dive, see the references at the end of this post.
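To make the "USB-C" analogy concrete, here is the shape of the standardized exchange underneath: MCP clients and servers speak JSON-RPC 2.0, and a tool invocation travels as a `tools/call` request. The tool name and arguments below are hypothetical, purely to illustrate the message format.

```python
import json

# Hypothetical example: the JSON-RPC 2.0 message an MCP client (such as
# Claude Desktop) sends when the LLM decides to invoke a tool named
# "predict_delay". Tool name and arguments are illustrative only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "predict_delay",
        "arguments": {"origin": "MDW", "dest": "LAS"},
    },
}

# A conforming server replies with content the LLM can read and contextualize.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "Delay probability: 0.31"}],
        "isError": False,
    },
}

print(json.dumps(request, indent=2))
```

Because every server speaks this same envelope, any client that understands MCP can invoke any tool without a bespoke integration, which is the whole point of the standard.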

How the Students Used It

The semester-long course project unfolded in three phases.

In the first phase, students sourced and cleaned real-world datasets—handling missing values, normalizing schemas, and engineering features from raw CSV files and APIs. In the second phase, they built and validated machine-learning pipelines using Apache Spark MLlib, training classifiers and regressors, tuning hyperparameters, and evaluating performance through cross-validation.
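The flavor of that first phase can be sketched in a few lines. The students worked at scale in Apache Spark; this standard-library-only miniature (with hypothetical column names and toy data) just shows the pattern: drop unlabeled rows, impute missing values, engineer a feature.

```python
import csv
import io
from statistics import mean

# Minimal sketch of phase-one cleaning using only the standard library
# (the course used Apache Spark; columns and values here are hypothetical).
raw = io.StringIO(
    "dep_hour,distance,delay_minutes\n"
    "9,1250,12\n"
    "17,,45\n"    # missing distance -> impute
    "22,980,\n"   # missing label -> drop the row
)

rows = [r for r in csv.DictReader(raw) if r["delay_minutes"]]

# Impute missing distances with the column mean, then engineer a
# simple feature: did the flight depart during the evening peak?
known = [float(r["distance"]) for r in rows if r["distance"]]
fill = mean(known)
features = []
for r in rows:
    distance = float(r["distance"]) if r["distance"] else fill
    features.append({
        "dep_hour": int(r["dep_hour"]),
        "distance": distance,
        "evening_peak": 16 <= int(r["dep_hour"]) <= 19,
        "label": float(r["delay_minutes"]),
    })

print(features)
```

In Spark the same steps become `DataFrame` transformations feeding an MLlib `Pipeline`, but the reasoning is identical.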

The final phase was the integration point. Students wrapped their trained models inside MCP servers, exposing prediction endpoints that Claude Desktop could invoke directly. The result was striking: domain expertise became conversational.
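What "wrapping a model inside an MCP server" amounts to is small: a prediction function plus a schema the server advertises when a client lists its tools. The students used the MCP Python SDK for the actual wiring; this sketch uses hypothetical names and a stand-in for a trained classifier, with no SDK required.

```python
# Sketch of a prediction endpoint as an MCP server would expose it.
# The function is a stand-in for invoking a trained Spark classifier;
# all names and numbers here are hypothetical.

def predict_delay(origin: str, dest: str, dep_hour: int) -> dict:
    """Stand-in for the trained model: returns a delay probability."""
    base = 0.15
    if dep_hour >= 16:  # evening departures delay more often in our toy rule
        base += 0.10
    return {"route": f"{origin}->{dest}", "delay_probability": round(base, 2)}

# The schema a client sees when it lists this server's tools (tools/list).
TOOL_SCHEMA = {
    "name": "predict_delay",
    "description": "Probability that a U.S. domestic flight departs late.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "dest": {"type": "string"},
            "dep_hour": {"type": "integer"},
        },
        "required": ["origin", "dest", "dep_hour"],
    },
}

print(predict_delay("MDW", "LAS", 18))
```

The `description` and `inputSchema` are what let the LLM decide, unprompted, that a question about a Midway departure should route to this tool.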

Here is a cross-section of what they built:

| Project | Description | User Question Asked to LLM | Dataset |
| --- | --- | --- | --- |
| Flight Delay Prediction | Processes 5M+ U.S. domestic flights with weather integration. Gradient Boosted Trees classifier achieves 87% accuracy and 0.91 AUC. | "I'm at Midway—what are the chances my flight to Las Vegas is delayed?" | Bureau of Transportation Statistics |
| Hospital Readmission Risk | Predicts 30-day readmission for diabetic patients using 68,629 patient records. Random Forest model achieves 0.94 AUC and 95% accuracy. | "I'm preparing to discharge this patient. What's the likelihood of readmission within two weeks?" | UCI Diabetes 130-US Hospitals |
| EV Energy Optimization | Analyzes 118K driving records with OpenStreetMap road network data. Random Forest and Linear Regression predict energy consumption and speeding risk. | "I have this much charge—what route will conserve the most battery?" | Vehicle Energy Dataset (VED) |
| Workplace Sentiment Analysis | Integrates 85K+ Glassdoor reviews with ESG sustainability data. Logistic Regression and ensemble methods classify sentiment and predict workplace recommendations. | "What complaints come up most often for this company?" | Glassdoor Job Reviews (Kaggle) |
| Network Intrusion Detection | Processes IoT-23 network traffic dataset for malware classification. Multi-stage pipeline with ML/DL models detects anomalous behavior across IP addresses. | "This subnet just triggered an anomaly alert—can you explain what activity looks suspicious and how severe it is?" | IoT-23 Dataset |

What's happening here is subtle but profound. In every case, a human first decided, "This is the kind of insight I need." They then trained, tested, and validated a model designed for that specific purpose. The LLM never ingested the data itself. Instead, it called these domain-specific models through MCP.

The Paradigm Shift

Here's what this architecture looks like in practice. A user asks a question in natural language. The LLM interprets the request, recognizes the need for a specialized prediction, and invokes the appropriate model through MCP. That model returns its result, and the LLM contextualizes the answer for the user.

The key insight? The LLM never retrains. New capabilities scale through composition, not retraining. Models simply plug in.
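The "composition, not retraining" idea can be reduced to a registry: each new capability registers itself, and the dispatch layer never changes. The decorator and tool names below are hypothetical, and the model bodies are stand-ins for real trained models.

```python
# Sketch of composition: new capabilities register as tools, and the
# dispatch layer routes by name. Names and values here are hypothetical
# stand-ins for the students' trained models.

TOOLS = {}

def tool(fn):
    """Register a prediction service under its function name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def readmission_risk(patient_age: int) -> float:
    return 0.30 if patient_age > 65 else 0.12  # stand-in model

@tool
def flight_delay(origin: str, dest: str) -> float:
    return 0.31  # stand-in model

def invoke(name: str, **kwargs):
    """What the MCP layer does conceptually: route a request to a model."""
    return TOOLS[name](**kwargs)

# Adding a capability is one more decorated function; neither invoke()
# nor the LLM needs to change.
print(invoke("readmission_risk", patient_age=70))
```

This is why the architecture scales: each project in the table above is just another entry in the registry.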

[Figure: MCP architecture diagram]
The user interacts with the LLM, which invokes domain-specific models through the MCP layer—no retraining required.

The Future Is Micro

Now zoom out.

Think about the sheer volume of organizations sitting on nothing more than CSV files. Admissions histories at small schools. Sensor readings from hospital controllers. Sales records at an HVAC company. Purchase histories at a local restaurant. Each dataset represents a specific human need and a concrete question waiting to be answered.

That's a lot of unmet demand.

Once the LLM receives structured output from these custom models, it can do what it does best: contextualize, synthesize, and connect insights across domains. It can even orchestrate multiple predictive services at once.

The future is micro.
The future is custom.

Foundation models will continue to improve, and they should. But the real growth over the next decade will come from humans attaching intelligence to the data they already possess—plugging domain-specific prediction into a general-purpose intelligence layer.

There's gold in them there CSVs. And my students just showed me how to mine it.

References

  1. Model Context Protocol Documentation — Anthropic's official MCP guide
    https://docs.anthropic.com/en/docs/agents-and-tools/mcp
  2. Introducing the Model Context Protocol — Anthropic's announcement post
    https://www.anthropic.com/news/model-context-protocol
  3. MCP GitHub Repository — Open-source protocol and SDKs
    https://github.com/modelcontextprotocol
  4. Introduction to Model Context Protocol — Free course from Anthropic
    https://anthropic.skilljar.com/introduction-to-model-context-protocol
  5. Getting Started with MCP on Claude Desktop — Setup guide
    https://support.anthropic.com/en/articles/10949351-getting-started-with-model-context-protocol-mcp-on-claude-for-desktop
  6. Download Claude Desktop
    https://claude.com/download
  7. AI Agents with Claude — Overview of agentic AI capabilities
    https://claude.com/solutions/agents
  8. Building Agents with the Claude Agent SDK — Engineering deep-dive
    https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk