An open approach to data & insights
Testing the hypothesis of open-source data infrastructure for MATs and school groups from the OEAI community.
In every organisation I’ve worked with, an early aspiration when beginning conversations about data strategy has been to tame an unruly tangle of different systems, each with large areas of overlapping functionality, and most with a specific use that the other systems cannot quite cover. When I ask what’s driving that aspiration, the first response is almost always “I just want to see everything in one place…”
Part of this challenge can indeed be addressed by pruning systems that aren’t earning their keep, but the trade-offs of consolidation tend to show up sooner than expected. Even with the most capable tools, the dream of a single system that does it all is, in reality, confined to the realm of the ambitious sales representative.
This is where data infrastructure projects are born. What if we could automatically pull data from our key systems, combine it all, and then present it in a suite of reports? This is something data teams have been doing for decades but, in the context of multi-academy trusts, there’s a specific constraint that we run into time and time again: we simply can’t get the data out of the source system.
Anecdotally, I think there are two central problems that make this so difficult.
Schools and trusts lack the budget and resources to take on large, complex data engineering programmes.
EdTech companies have focused on integration within their own products, or with a small number of partner products.
These reinforce one another. The more that EdTech systems focus on integrations within the same family of products, the greater the difficulty of the data engineering task for trusts. Similarly, why would EdTech companies focus on API and data connectivity for customers who lack the internal resource to get any value from them?
In the last ten years, I think the calculation has shifted. Many trusts are now larger and more mature in their approach to data and systems and, at the same time, the cost and complexity of data infrastructure built on modern cloud offerings have dropped significantly.
This has flipped the market pressures for EdTech companies. Not all of them can see it yet, but I think the importance of APIs, programmatic data access, and integrations is well understood by a significant percentage of trusts, and this is likely to factor in procurement decisions.
This now presents a third problem: each multi-academy trust still has to adopt the data infrastructure, build the pipelines to pull the data, prepare it, and then build the reporting on top. That’s a lot of different trusts, potentially all building the same thing…
It’s this third problem that the charitable not-for-profit Open Education AI is trying to solve. The website is worth a read for the background, but they’re building a common approach to data infrastructure for trusts and school groups, with the code published on GitHub.
This sounds wonderful in theory, but does it work? Well, of course – there are over 50 trusts now using the infrastructure. Open Education AI is aiming a little higher than that though. The OEAI public repo is open source, published under the Apache 2.0 licence, so the principle is that any trust should be able to clone the repository and adopt the OEAI model. This sounds very much like a testable hypothesis…
Deploying the OEAI Bromcom module
This post isn’t an implementation guide, but this next section does get a little technical. If that’s not your jam, skip to the end!
Infrastructure
OEAI uses PySpark notebooks, split into three tiers. The bronze layer pulls the raw data from the source system and drops it into a data lake. A silver layer then transforms that data into a standard structure, and the gold layer adds calculated measures on top, ready for use in reporting and analysis. Spinning up the infrastructure in Microsoft Fabric requires a few things:
A Fabric workspace (a trial one is fine)
An Azure subscription with an Azure Key Vault
An empty lakehouse
The repo contains a cloud-infrastructure/OEASetup.ps1 script that looks to configure the infrastructure needed in Azure and Fabric, but the last update was a little while ago, so I opted to use it as a guide and provision things manually – setting up a new Azure subscription, resource group, key vault and Fabric workspace is a quick enough task.
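To make the bronze, silver and gold tiers described above a little more concrete, here is a minimal sketch of the idea. It uses plain Python rather than PySpark so it runs anywhere, and the field names, attendance marks and the attendance measure are all invented for illustration; none of this is OEAI's actual schema or code.

```python
from collections import defaultdict

# Bronze: raw records, kept exactly as a source system might return them.
# Field names and values here are hypothetical.
bronze = [
    {"StudentId": "1001", "AttendanceDate": "2024-09-02", "Mark": "/"},
    {"StudentId": "1001", "AttendanceDate": "2024-09-03", "Mark": "N"},
    {"StudentId": "1002", "AttendanceDate": "2024-09-02", "Mark": "/"},
    {"StudentId": "1002", "AttendanceDate": "2024-09-03", "Mark": "/"},
]

PRESENT_MARKS = {"/"}  # assumption: a single code means "present"


def to_silver(raw_rows):
    """Silver: rename fields to a standard structure and convert types."""
    return [
        {
            "student_id": int(row["StudentId"]),
            "date": row["AttendanceDate"],
            "present": row["Mark"] in PRESENT_MARKS,
        }
        for row in raw_rows
    ]


def to_gold(silver_rows):
    """Gold: add a calculated measure (attendance percentage per student)."""
    sessions = defaultdict(int)
    present = defaultdict(int)
    for row in silver_rows:
        sessions[row["student_id"]] += 1
        present[row["student_id"]] += int(row["present"])
    return {
        student: 100 * present[student] / sessions[student]
        for student in sessions
    }


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {1001: 50.0, 1002: 100.0}
```

The point of the split is that each tier only depends on the one before it, so a change in the source system should only need a fix in the bronze and silver steps, not in the reports.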
Next, there’s a bit of uploading to do [1]. I’m specifically testing the bromcom module, so I need everything from the /modules/bromcom/ directory, and oeai.py from the root of the repo. I also uploaded oeai_logger.py (more on that in a moment).
Inside the bromcom module, there’s an oeai_mod_bromcom_env_var.py file, which needs the details of the Azure Key Vault containing the secrets for Bromcom’s OData connection, and a variable that tells the bromcom module to use an OData connection rather than the default SQL connection.
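As a rough illustration of what this kind of configuration involves, the sketch below holds a Key Vault URL, the names of the secrets, and a connection type, then resolves the secrets through a lookup function. Every name here (BromcomSettings, load_connection, the secret names, the example URLs) is hypothetical and is not taken from oeai_mod_bromcom_env_var.py; the vault is replaced by an in-memory dictionary so the example runs outside Fabric.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class BromcomSettings:
    """Hypothetical settings; the real variable names in the repo may differ."""
    key_vault_url: str            # e.g. "https://<vault-name>.vault.azure.net/"
    endpoint_secret: str          # name of the secret holding the OData URL
    username_secret: str          # name of the secret holding the username
    password_secret: str          # name of the secret holding the password
    connection_type: str = "sql"  # the post describes SQL as the default


def load_connection(settings: BromcomSettings,
                    get_secret: Callable[[str, str], str]) -> Dict[str, str]:
    """Validate the settings, then resolve each secret name to its value."""
    if settings.connection_type not in {"sql", "odata"}:
        raise ValueError(f"Unknown connection type: {settings.connection_type!r}")
    if not settings.key_vault_url.startswith("https://"):
        raise ValueError("Key Vault URL should start with https://")
    vault = settings.key_vault_url
    return {
        "type": settings.connection_type,
        "endpoint": get_secret(vault, settings.endpoint_secret),
        "username": get_secret(vault, settings.username_secret),
        "password": get_secret(vault, settings.password_secret),
    }


# An in-memory stand-in for the vault, so the sketch runs anywhere.
fake_vault = {
    "bromcom-odata-url": "https://example.invalid/odata",
    "bromcom-odata-user": "demo-user",
    "bromcom-odata-password": "demo-password",
}

settings = BromcomSettings(
    key_vault_url="https://example-vault.vault.azure.net/",
    endpoint_secret="bromcom-odata-url",
    username_secret="bromcom-odata-user",
    password_secret="bromcom-odata-password",
    connection_type="odata",
)
connection = load_connection(settings, lambda vault_url, name: fake_vault[name])
print(connection["type"], connection["endpoint"])
```

Keeping only secret names in the settings, and the values in the vault, means the configuration file itself can be shared or committed without exposing credentials.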
There’s one section in this notebook that needs to be commented out to get the notebooks to run: the OEAI team operate a logging service for the OEAI installs that they maintain but, as I’m setting this up in a test environment, I don’t want that. Disabling it is as simple as commenting out a few lines of code.
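Commenting lines out works, but an alternative pattern is to put the logging behind a flag and swap in a logger that does nothing. The sketch below shows that pattern only; the flag, the classes and the log method are hypothetical stand-ins and do not reflect the actual interface of oeai_logger.py.

```python
ENABLE_REMOTE_LOGGING = False  # hypothetical flag, off for a test environment


class NullLogger:
    """Accepts log calls and silently discards them."""

    def log(self, message):
        return None


class RecordingLogger:
    """Stand-in for a real logging client; keeps messages in memory."""

    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append(message)


def make_logger(enabled, factory):
    """Build the real logger only when logging is enabled."""
    return factory() if enabled else NullLogger()


logger = make_logger(ENABLE_REMOTE_LOGGING, RecordingLogger)
logger.log("bronze notebook started")  # discarded, because the flag is off
print(type(logger).__name__)  # NullLogger
```

The calling code stays the same either way, which avoids having to remember which lines were commented out when pulling a newer version of the repo.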
The lakehouse also needs a few CSV files uploading. These can be found in reference/reference_common.7z, and should be uploaded to Files/reference in the lakehouse.
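Before running anything, it can be worth checking that the reference files actually landed where the notebooks expect them. The sketch below compares a list of expected file names against a directory. The file names are invented, and a temporary directory stands in for the lakehouse; in a Fabric notebook the default lakehouse's Files area is typically reachable through a local mount, but the exact path is something to confirm in your own workspace.

```python
import tempfile
from pathlib import Path


def missing_reference_files(reference_dir, expected_names):
    """Return the expected CSV names that are not present in reference_dir."""
    present = {path.name for path in Path(reference_dir).glob("*.csv")}
    return sorted(set(expected_names) - present)


# Hypothetical file names; use the actual contents of the reference archive.
expected = ["attendance_codes.csv", "ethnicity_codes.csv", "year_groups.csv"]

with tempfile.TemporaryDirectory() as tmp:
    # Simulate an upload where one file was forgotten.
    for name in ["attendance_codes.csv", "ethnicity_codes.csv"]:
        (Path(tmp) / name).write_text("code,description\n")
    missing = missing_reference_files(tmp, expected)

print(missing)  # ['year_groups.csv']
```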
Less than 60 minutes from downloading the repo, I’m ready to run the bronze notebook and see how far I’ve got. On the first run, I hit a few errors that turn out to be incorrect paths set in the key vault, and on the second attempt the bronze notebook runs through to completion. The silver and gold notebooks are even simpler, running first time, once the bronze notebook has completed.
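The ordering matters here: silver depends on bronze, and gold on silver. A minimal sketch of running stages in sequence and stopping at the first failure is shown below. The stage functions are placeholders that simulate my first and second attempts; in practice each one would trigger the corresponding notebook, and the function names are my own rather than anything from the OEAI repo.

```python
def run_in_order(stages):
    """Run (name, callable) pairs in order, stopping at the first failure.

    Returns the names that completed and an error message (or None).
    """
    completed = []
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


def bronze_with_bad_path():
    # Simulates the first attempt, where a path stored in the vault was wrong.
    raise FileNotFoundError("endpoint path not found")


def succeed():
    # Placeholder for a stage that runs without error.
    return None


first_attempt = run_in_order([
    ("bronze", bronze_with_bad_path),
    ("silver", succeed),
    ("gold", succeed),
])
second_attempt = run_in_order([
    ("bronze", succeed),
    ("silver", succeed),
    ("gold", succeed),
])

print(first_attempt)   # ([], 'bronze failed: endpoint path not found')
print(second_attempt)  # (['bronze', 'silver', 'gold'], None)
```

Stopping early avoids running silver and gold against stale or missing bronze data, which would otherwise produce errors that point away from the real cause.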
Reporting
The OEAI repo contains an “MVP” report and semantic model. I upload both to the workspace and point them at the lakehouse, but this part turns out to be a little trickier. It looks like the model is expecting a couple of tables that don’t exist in the lakehouse and, from the code, this looks to be a difference between the Bromcom, Arbor and Wonde modules. Rather than unpicking this, I created a new semantic model and report, using the ones in the repo as guides.
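One quick way to find this kind of mismatch is to compare the tables a model expects with the tables that actually exist. The sketch below does that with two plain lists. The table names are made up for illustration; in a PySpark notebook the available list could be built from the names returned by spark.catalog.listTables(), and the expected list would need to be read from the semantic model itself.

```python
def compare_tables(expected, available):
    """Report tables the model expects but cannot find, and tables it ignores."""
    expected_set, available_set = set(expected), set(available)
    return {
        "missing": sorted(expected_set - available_set),
        "unused": sorted(available_set - expected_set),
    }


# Hypothetical table names, not the actual OEAI model or lakehouse tables.
model_tables = ["dim_student", "dim_school", "fact_attendance", "fact_behaviour"]
lakehouse_tables = ["dim_student", "dim_school", "fact_attendance", "dim_date"]

result = compare_tables(model_tables, lakehouse_tables)
print(result)  # {'missing': ['fact_behaviour'], 'unused': ['dim_date']}
```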
I created a couple of demo reports based on the model, and set up a schedule to run the notebooks and refresh the semantic model nightly.
Closing thoughts
As it stands, it’s easily possible to use the OEAI repo in a test environment, and I’ve found nothing in doing so that would prevent production use purely from the public repo. In practice, whether you should or not depends on your internal data capabilities, your appetite for owning the risk and security responsibilities (open-source code does not come with an inherent warranty or support model), and your judgement on the value of managing the data infrastructure in-house. For most trusts, I’d recommend using OEAI’s delivery partner and then focusing your internal data resources on the reporting and insight layer that leads to impact.
For me, the importance of the open-source nature of the project is difficult to overstate. Having the data infrastructure deployed in your own systems, and the freedom to use the code and the assets yourself should you choose, is a welcome departure from what we’ve been accustomed to. The OEAI effort is still relatively young and the future development of this project is one to watch. If your trust is scoping a data project, I’d definitely say it’s worth a look.
https://www.openeducationai.org/
[1] Git integration and deployment pipelines would be a much better option than manual uploads if this were to be used in production.

