How to Deconstruct Your Moodle Soup

A client once described their data situation to me as a "Data Soup"—and it was the perfect description. It's a jumble of valuable feedback from course participants, scattered across hundreds of individual files, mixed with inconsistent questions, and buried so deep in the system that it's almost impossible to use. It’s a common problem for organisations that rely on Moodle’s powerful course-delivery features but find its data-export capabilities lacking.

In this article, I'll walk you through how we deconstructed that soup. We'll cover how we extracted the good stuff, cleaned it up, and turned it into a practical AI analytics tool that can answer complex questions in plain English.

The Core Challenge: A Soup of Valuable, Inaccessible Data

The client knew the insights were there, but they were facing a series of practical, frustrating roadblocks that made analysis a non-starter. These weren't just minor inconveniences; they represented a significant opportunity cost.

  1. Crippling Data Fragmentation: The primary issue was access. Feedback could only be reached course by course, so gathering it all meant manually navigating to each of their ~150 courses, finding the right module, and downloading a separate CSV file. Nobody was ever going to do that by hand.

  2. Hard System Limitations: The only alternative was Moodle's built-in reporting tool, but it would often time out or crash, and even when it held up they hit a hard limit of 135,000 rows. This made it impossible to ask fundamental questions like, "What are the most common challenges our participants faced across all courses last year?" They were stuck looking at individual trees, with no way to see the forest.

  3. The Chaos of Unstructured Data: The data itself varied from course to course. Questions and questionnaires weren't standardised, so "Participant Strengths" in one course might be "Areas of Confidence" in another. This inconsistency made simple comparisons impossible. Worse, the most valuable, nuanced feedback was locked away in free-text fields. Their attempts to make sense of this by manually copying comments into ChatGPT were clunky, time-consuming, and produced generic summaries that lacked the specific context of their business.

  4. An Inability to Demonstrate ROI: Perhaps the biggest business challenge was the inability to connect the dots. Because feedback couldn't be tracked across courses, it was impossible to see what participants were struggling with before a course and what they were able to do after it.

They were data-rich but insight-poor. The "soup" wasn't just messy; it was holding the business back. My challenge was to design a system that could not only clean the data but also make it actively useful for people who weren't data scientists.

The Solution: A Step-by-Step Breakdown

The goal was to create a single, reliable system that could handle the entire data lifecycle: from messy extraction to a clean, conversational interface. We broke the solution into three foundational layers.

Step 1: Building a Coherent Data Foundation

Any culinary delight has to start with great ingredients. Before any advanced analysis or AI could be effective, we had to scrub the data until it was spotless. This meant building an automated, repeatable data pipeline. Sketches of each stage follow the list below.

  • Automated Extraction: The first step was to eliminate the soul-crushing manual work. We developed a script that securely connects to the Moodle system and programmatically downloads all the individual feedback files. This single step turned a month-long manual task into a 3-hour automated job.

  • Intelligent Standardisation: This was the most crucial part of the cleanup. To handle inconsistent question phrasing, we used fuzzy matching algorithms. Think of it like a sophisticated spell-checker for concepts. The system reads a new question, compares it to a master list of "canonical" questions, and if it's similar enough (e.g., "What is your main challenge?" vs. "Describe your biggest professional hurdle?"), it maps it to the standard version. This ensures that we're always comparing apples to apples.

  • Enrichment and Structuring: Once standardised, the data underwent several transformations. We converted vague formats like "Yes/No" answers into clean boolean (true/false) values. For multi-select questions, we used one-hot encoding to turn a single column with many comma-separated values into multiple, easy-to-analyse columns. For the free-text fields, an LLM was used to intelligently classify and tag each response into consistent themes like "Time Management" or "Team Delegation," turning qualitative noise into quantitative signals.

  • Efficient Consolidation: Finally, instead of dumping everything into a giant, slow CSV file or a costly traditional database, we stored the consolidated data in Parquet files. This is a columnar storage format, which is incredibly efficient for analytical queries. Think of it this way: when you want to calculate the average of a single column in a massive spreadsheet, a normal system has to read the entire sheet. A columnar format lets you read only that single column, making queries much faster. This choice was key to ensuring the final application would be fast and cost-effective.
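
To make the extraction step concrete, here is a minimal sketch of how a script can pull feedback out of Moodle. It assumes the site exposes Moodle's standard web-services REST API and that an admin has issued a token with access to the course and feedback functions; the endpoint and function names follow Moodle's documented conventions, though the actual production script may differ.

```python
# Minimal sketch of automated extraction via Moodle's web-services REST API.
# Assumes a token with access to the core_course and mod_feedback functions.
import requests

MOODLE_URL = "https://moodle.example.org/webservice/rest/server.php"  # hypothetical site
TOKEN = "your-service-token"  # issued by the Moodle administrator

def call(wsfunction: str, **params) -> dict:
    """Call a Moodle web-service function and return the parsed JSON response."""
    payload = {
        "wstoken": TOKEN,
        "wsfunction": wsfunction,
        "moodlewsrestformat": "json",
        **params,
    }
    response = requests.post(MOODLE_URL, data=payload, timeout=60)
    response.raise_for_status()
    return response.json()

# Enumerate every course, then pull the feedback analyses for each one.
courses = call("core_course_get_courses")
for course in courses:
    feedbacks = call("mod_feedback_get_feedbacks_by_courses",
                     **{"courseids[0]": course["id"]})
    for feedback in feedbacks.get("feedbacks", []):
        analysis = call("mod_feedback_get_responses_analysis",
                        feedbackid=feedback["id"])
        # ... persist `analysis` to disk for the standardisation stage
```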
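For the standardisation step, here is a sketch of fuzzy question matching, using the rapidfuzz library with an illustrative threshold and canonical list (the source doesn't name the exact library or settings). Purely lexical scoring handles rephrasings and typos well; semantically distant wordings like the "professional hurdle" example may additionally lean on an embedding-based comparison.

```python
# Sketch of question standardisation with fuzzy matching. The canonical list
# and threshold are illustrative, not the production values.
from rapidfuzz import process, fuzz

CANONICAL_QUESTIONS = [
    "What is your main challenge?",
    "What are your participant strengths?",
    "How would you rate the impact on yourself?",
]

def standardise(raw_question: str, threshold: int = 85) -> str:
    """Map a raw question to its canonical form if a close-enough match exists."""
    match = process.extractOne(raw_question, CANONICAL_QUESTIONS,
                               scorer=fuzz.token_set_ratio)
    if match and match[1] >= threshold:
        return match[0]   # close enough: adopt the canonical phrasing
    return raw_question   # no confident match: flag for manual review

print(standardise("what is your main challenge"))
# -> "What is your main challenge?"
```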
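The enrichment transformations map neatly onto pandas. Here is a sketch with illustrative column names; the LLM theme-tagging of free text ran as a separate pass and isn't shown.

```python
# Sketch of the enrichment stage: boolean conversion and one-hot encoding.
import pandas as pd

df = pd.DataFrame({
    "completed_pre_work": ["Yes", "No", "Yes"],
    "challenges": ["Time Management, Delegation",
                   "Time Management",
                   "Giving Feedback, Delegation"],
})

# "Yes"/"No" strings become clean booleans.
df["completed_pre_work"] = df["completed_pre_work"].str.strip().eq("Yes")

# One comma-separated multi-select column becomes one boolean column per option.
dummies = df["challenges"].str.get_dummies(sep=", ").astype(bool)
df = pd.concat([df.drop(columns="challenges"), dummies], axis=1)
print(df)
```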
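And a sketch of the consolidation pay-off: once the cleaned table is written to Parquet, DuckDB can aggregate over it while reading only the columns the query touches. File and column names here are illustrative.

```python
# Sketch of the consolidation stage: write to Parquet, then aggregate with DuckDB.
import duckdb
import pandas as pd

# A tiny stand-in for the cleaned, standardised feedback table.
df = pd.DataFrame({
    "course_id": [158, 158, 201],
    "impact_on_self": [4, 5, 3],
})
df.to_parquet("feedback.parquet", index=False)

# DuckDB scans only the columns the query references, so this stays fast even
# when the file holds hundreds of thousands of rows.
result = duckdb.sql("""
    SELECT course_id, AVG(impact_on_self) AS avg_impact
    FROM 'feedback.parquet'
    GROUP BY course_id
    ORDER BY avg_impact DESC
""").df()
print(result)
```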

Step 2: Designing the Analysis Engine

With a clean, fast, and reliable data foundation, we could build the "brain." The problem with traditional dashboards is that they are static. They answer the questions they were designed to answer, but they can't answer the follow-up questions. We needed something more dynamic. The solution was a hybrid AI system.

  1. The Quantitative Engine (Text-to-SQL): To answer questions about "how many," "what is the average," or "show me the count," we used an LLM as a "translator." It takes a user's plain-English question and converts it into a precise SQL query. To make this reliable, we used a library called instructor, which acts as a set of guardrails for the LLM, forcing it to produce its output in a strict, predictable format. This helps prevent the AI from "hallucinating" or going off-script. The generated SQL query is then executed on DuckDB, an in-process database that reads our Parquet files directly. The result is the ability to perform complex aggregations on the fly, in seconds.

  2. The Qualitative Engine (RAG): Numbers tell you what happened across large sections of the database; think of it like summarising a book or a chapter. But what if the question is specific to an individual's response to a particular question? SQL doesn't really help here, so to unlock this we used a technique called Retrieval-Augmented Generation (RAG). All the free-text responses were indexed in a vector database, which understands semantic meaning, not just keywords. When a user asks a question like, "What are some concerns about leadership?" the quantitative engine might find that the 'Leadership' skill rating is low. Simultaneously, the RAG engine searches the vector database and retrieves the most relevant comments, like "I struggle to delegate effectively to my team" or "I need to learn how to give constructive feedback."

An LLM agent orchestrates the generation of these responses. It receives the user's question, sends it to both engines, and then a final "synthesiser" LLM weaves the quantitative data and the qualitative examples into a single, comprehensive, and easy-to-understand answer. Sketches of both engines and the synthesis step follow.
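
Here is a minimal sketch of the quantitative engine's guardrail pattern: instructor binds the LLM's output to a typed Pydantic model, so the application only ever receives a parseable SQL string. The model name, prompt, and table schema are illustrative, not the production values.

```python
# Sketch of text-to-SQL with instructor: the LLM must return a validated
# SQLQuery object, which is then executed against the Parquet data in DuckDB.
import duckdb
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class SQLQuery(BaseModel):
    """Structured output the LLM is forced to conform to."""
    sql: str = Field(description="A single read-only DuckDB SELECT statement.")
    explanation: str = Field(description="One-sentence summary of the query.")

client = instructor.from_openai(OpenAI())

def answer_quantitative(question: str):
    # instructor validates the response against SQLQuery and retries on failure.
    query = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        response_model=SQLQuery,
        messages=[
            {"role": "system",
             "content": "Translate the user's question into DuckDB SQL over the "
                        "table 'feedback.parquet'. Columns (illustrative): "
                        "course_id, department, question, answer, theme, year."},
            {"role": "user", "content": question},
        ],
    )
    return duckdb.sql(query.sql).df(), query

result, query = answer_quantitative(
    "How many participants cited time management as a challenge?")
```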
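A comparable sketch of the qualitative side: free-text responses are indexed in a vector store and retrieved by semantic similarity. The source doesn't name the vector database that was used, so chromadb stands in here.

```python
# Sketch of the RAG retrieval step: index comments, then search by meaning.
import chromadb

client = chromadb.Client()
comments = client.create_collection("feedback_comments")

# Index each free-text response (ids and metadata are illustrative). chromadb
# embeds the documents with its default embedding model.
comments.add(
    ids=["r1", "r2", "r3"],
    documents=[
        "I struggle to delegate effectively to my team",
        "I need to learn how to give constructive feedback",
        "The venue parking was difficult to find",
    ],
    metadatas=[{"course_id": 158}, {"course_id": 158}, {"course_id": 201}],
)

# Semantic search matches on meaning, not keywords, so "concerns about
# leadership" surfaces the delegation and feedback comments first.
hits = comments.query(query_texts=["What are some concerns about leadership?"],
                      n_results=2)
print(hits["documents"][0])
```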
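Finally, a sketch of the synthesiser step, where one LLM call weaves the SQL result and the retrieved comments into a single plain-English answer. The prompt wording and model name are again illustrative.

```python
# Sketch of the synthesis step: combine quantitative and qualitative results.
from openai import OpenAI

def synthesise(question: str, sql_result: str, comments: list[str]) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Quantitative result:\n{sql_result}\n"
        f"Representative comments:\n- " + "\n- ".join(comments) + "\n"
        "Write a concise answer that combines the numbers with the comments."
    )
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```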

Step 3: Creating a Unified, Conversational Interface

The final piece was to make this power accessible. An executive doesn't have time to learn a complex Business Intelligence tool. I initially considered building a separate chatbot and a dashboard, but this felt disjointed and clunky.

The better design was a single, conversational canvas. We built a clean, minimalist web application where a user can simply ask questions. The system is designed to understand the intent of the question and respond with the most appropriate format: a request for a list gets a table, while a question about a distribution gets a bar chart. We used modern web technologies (Next.js and D3.js) to make this interface fast, interactive, and intuitive.

The Result: Answering Complex Business Questions in Seconds

The "before and after" workflow for the client is night and day.

  • Before: An executive asks, "How many people in the new sales cohort are worried about time management?" This triggers a multi-day process. Someone wades through Moodle, downloads dozens of CSVs, tries to consolidate them in Excel, manually counts rows, and eventually pastes a number and a poorly formatted chart into an email.

  • After: The same executive types the question into the web app during a meeting. Twenty seconds later, the answer appears: a clear text summary, a bar chart showing the breakdown by department, and extracts from individual responses.

They can now explore their data with a natural curiosity, asking questions like:

  • "Show me the trend of 'optimising time' challenges year on year from pre-programme questions."

  • "What is the average 'Impact on self' rating, and how does it differ by department?"

  • "How do the pre-course challenges of participants from 'Company A' compare to everyone else in course 158?"

  • "List the top 3 challenges cited in course 'XYZ123' and give me a count for each."

  • "What are some of the actual comments people made regarding 'leadership' in our end-of-program surveys?"

  • "Show me the average rating for 'Gaining a deeper understanding of my strengths' by course, for all courses run after 2022."

For every answer, the system provides the data and, for full transparency, the exact SQL query it generated. This builds trust and allows for verification.

The Broader Impact: From a Specific Fix to a Reusable Framework

While this project solved a specific problem with Moodle, the approach is a blueprint for how small and medium-sized businesses can handle their data challenges. The practical benefits are clear:

  1. Drastic Time Savings: It eliminates error-prone, manual data work, freeing up staff to focus on higher-value activities like program design and client engagement.

  2. Empowering Non-Technical Teams: You no longer need a data specialist as a gatekeeper to information. This allows leadership, HR, or marketing teams to find their own answers, when they need them.

  3. Cost-Effective Power: The entire architecture is built on efficient, often open-source, technologies. It provides enterprise-level analytical power without the associated high costs of per-seat software licenses or large server infrastructure.

  4. Keeping Data Secure: The process is designed to be deployed on a secure Google Cloud platform.

Is Your Data Working For You?

So that's how you deconstruct a Moodle Soup. The process revealed a fundamental truth: most companies are collecting more data than ever before, but few are getting its full value. If this story of fragmented files, manual processes, and missed insights sounds familiar, it might be time for a new approach.

By combining a smart data pipeline with a modern, conversational AI front-end, it's possible to turn your most challenging data sets into a practical, valuable, and actively used tool for your business.

If you'd like to discuss how this approach could be applied to your own data challenges, we'd love to hear from you. Set up a call.

Gareth Davies

Gareth is an AI researcher and technology consultant specialising in time series analysis, forecasting and deep learning for commercial applications.

https://www.neuralaspect.com