Analyze, Query, and View Data using Apache Airflow, Flask, and React: a Proof of Concept to Help You Get Started

Robert Gutierrez
6 min read · Jan 12, 2024
Photo by ThisisEngineering RAEng on Unsplash

A common use case found in nearly any industry is the need to set up some sort of automated data processing pipeline and a web app to query and display the data to a user in the form of a report or dashboard. Often, that data is internal, hosted on the company’s data warehouse or other database solution, but sometimes that data needs to be retrieved externally, either from a client or partner’s database or even from a hosted data exchange like data.world.

Normally, the data processing solution and your web app are two separate pieces of your technology ecosystem. They use very different libraries and have different hosting needs. A popular hosting solution for Airflow on Google Cloud is Cloud Composer, a fully managed deployment of Airflow on Kubernetes. It works well for a production environment that needs to be “always on”, regularly running workflows, but because of its beefy resource usage and constant uptime, it can be expensive. It is also possible to deploy Airflow on a Google Compute Engine instance. As for React and Flask, numerous hosting options exist: React can range from lightweight static hosting in something like Google Cloud Storage or AWS S3 to a React/Node deployment on Google App Engine and beyond. Likewise for Flask, you can host simple scripts as Google Cloud Functions or AWS Lambdas, or go further with App Engine and more.

Hosting, resource usage, and your specific use case are all things to keep in mind when figuring out how to deploy and integrate your data processing pipeline with your web apps. However, it would be useful to have a proof of concept to see what a possible solution could look like, something you can demo to your team leads or managers to get buy-in. If the proof of concept is accepted, you can build out each piece into a production-ready solution and keep utilizing the proof of concept to test new ideas.

Fortunately, I have a proof of concept ready for you to use! It’s a Docker Compose setup I put together for a related use case: use Airflow to run computations on externally hosted data, and provide a full-stack application that can query and display the resulting data, as well as interact with Airflow directly. It is very much a proof of concept: it’s designed to run locally, it uses simple credentials, and the React and Flask apps are missing large chunks of what a real implementation needs (authentication, models, etc.). But it’s a starting point, and something you can demo to folks considering a similar solution.

I will be walking you through the GitHub repo I created. You can find that repo here: rgutierrez1014/sensor-data-airflow-react-flask.

Overview

Utilizing data hosted externally seemed like a good use case for this. Instead of creating some bogus data and, say, uploading it to Google Cloud Storage or Google BigQuery, I decided to use a public dataset of sensor readings of air pollutant levels. I came across data.world, which seemed like the perfect candidate for this project. I took a file from this dataset, did some reformatting and simplifying, and re-uploaded it to a project I made public. The DAG included in Airflow queries this dataset, but you can modify it to query another dataset.
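
To give a sense of what a DAG along these lines can look like, here is a minimal sketch. It is not the DAG from the repo: the dataset key, table and column names, and the Postgres connection id are all placeholders, and error handling is glossed over.

# dags/sensor_stats.py: a minimal sketch, not the repo's actual DAG.
# Dataset key, table, columns, and connection id below are placeholders.
from datetime import datetime

import datadotworld as dw
import pandas as pd
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def sensor_stats():

    @task
    def fetch_sensor_data() -> list:
        # The datadotworld SDK picks up the auth token from the config
        # file baked into the Airflow image.
        results = dw.query(
            "your-user/your-sensor-project",             # hypothetical dataset key
            "SELECT ufp, bc, no2 FROM sensor_readings",  # hypothetical table/columns
        )
        # Return plain records so they serialize cleanly through XCom.
        return results.dataframe.to_dict("records")

    @task
    def compute_stats(records: list) -> dict:
        # Mean, median, and standard deviation for each pollutant.
        return pd.DataFrame(records).agg(["mean", "median", "std"]).to_dict()

    @task
    def store_stats(stats: dict) -> None:
        # Write the summary rows to Postgres so the web app can read them.
        from airflow.providers.postgres.hooks.postgres import PostgresHook

        hook = PostgresHook(postgres_conn_id="results_db")  # hypothetical connection id
        for pollutant, values in stats.items():
            hook.run(
                "INSERT INTO sensor_stats (pollutant, mean, median, std_dev) "
                "VALUES (%s, %s, %s, %s)",
                parameters=(pollutant, values["mean"], values["median"], values["std"]),
            )

    store_stats(compute_stats(fetch_sensor_data()))


sensor_stats()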

I am using Docker Compose for easy local container orchestration. All the Apache Airflow setup comes from their own documentation, which I recommend reading through as well; they go into more detail on the specifics of running Airflow in Docker. For the React piece, I’m using a Node 18 image along with an entrypoint script that installs dependencies and starts the React app; for Flask, a Python 3.10 image with a similar entrypoint script. The image versions can be tweaked to your liking, but some adjustment of dependencies may be needed.

Regarding project structure, I’m using a structure I’ve adopted for multiple projects which has suited me well. All application code and files go inside the apps folder, with each app having its own folder. All Docker image stuff goes in the images folder, with each image having its own folder. And any helper scripts go into the scripts folder.
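
Roughly, the layout looks like this (the individual app and image folder names, and the compose file name, are illustrative; see the repo for the exact contents):

.
├── apps/
│   ├── airflow/        # DAGs, plus the logs/ and plugins/ folders created during setup
│   ├── flask/          # the Flask API
│   └── react/          # the React frontend
├── images/
│   ├── airflow/        # Airflow image files, including the data.world config
│   ├── flask/          # Python 3.10 image and entrypoint script
│   └── react/          # Node 18 image and entrypoint script
├── scripts/            # helper scripts
├── .env                # created during setup
└── docker-compose.yaml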

Further information about each piece of the project can be found in the repo’s readme under “Overview”.

Setup

For the full, detailed setup, refer to the repo’s readme under “Setup”. I will provide an abridged version here so you have an idea of the setup steps before proceeding.

Docker + Docker Compose

You’ll need Docker and Docker Compose installed on your machine. I also recommend using Docker Desktop as it provides an easy-to-use GUI for interacting with Docker stuff. Your Docker Compose version should be v2.14.0 or newer, and you should allocate at least 4 GB of memory to Docker for Airflow to run.

data.world

To use the data.world API, you’ll need to create an account and create an API token for yourself. After creating your account, go to the integrations page, search for the Python integration and enable it.

The Python integration page on data.world

Then, click your avatar at the top right and select Settings. Click Advanced, and copy the API token for the “Read/Write” scope. Feel free to copy one with a larger scope if you are adding additional functionality.

The API Token page for your profile on data.world

Create necessary files

Create the logs and plugins folders under apps/airflow; Airflow needs these to exist.

Create a .env file in the project base directory with the string “AIRFLOW_UID=5000”.

Create a file called config within images/airflow with the following content:

[DEFAULT]
auth_token = <your token here>

[r]
auth_token = <your token here>

Replace “<your token here>” with the data.world API token you retrieved above.

Initialize Airflow

Next, on all operating systems, you need to run database migrations and create the first user account. To do this, run:

docker compose up airflow-init

Wait for this to complete. When it’s done, your Airflow username will be airflow and your password will be airflow.

Initialize all services

Now we can run Airflow and initialize the rest of the services.

docker compose up

Add -d to run in detached mode.

You can view the status of all containers by pulling up Docker Desktop or by running docker ps.

The Airflow webserver may not be available immediately after all services have returned a Started state. I've found it is usually available after waiting a minute or two from this point.
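
If you’d rather not keep refreshing the page, you can poll the webserver’s /health endpoint, which reports the status of the metadatabase and scheduler. Here’s a small convenience sketch (not part of the repo), assuming the default localhost:8080 port mapping:

# wait_for_airflow.py: poll the Airflow webserver until it reports healthy.
import time

import requests

HEALTH_URL = "http://localhost:8080/health"

for attempt in range(1, 61):
    try:
        health = requests.get(HEALTH_URL, timeout=5).json()
        db_ok = health.get("metadatabase", {}).get("status") == "healthy"
        scheduler_ok = health.get("scheduler", {}).get("status") == "healthy"
        print(f"Attempt {attempt}: metadatabase ok={db_ok}, scheduler ok={scheduler_ok}")
        if db_ok and scheduler_ok:
            print("Airflow is ready.")
            break
    except (requests.RequestException, ValueError):
        print(f"Attempt {attempt}: webserver not reachable yet")
    time.sleep(5)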

Test it!

Open Airflow in your browser; this should be at http://localhost:8080. Ensure the webserver is active; this may take a minute or two if you’ve just started up everything. Once it’s active, log in using the credentials listed above. Ensure there are one or more DAGs available to run. Now visit the React frontend at http://localhost:3000. Click "Trigger DAG" which will utilize the Airflow REST API to trigger a DAG run; confirm the progress of the run over in Airflow. Once the DAG has completed running, go back to the React app and click "Refresh Data". You should now see a table of data.
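
For reference, here is a rough sketch of what the Flask side of those two buttons can look like. This is not the repo’s code: the Airflow hostname, DAG id, database URL, and table name are all placeholders; the airflow/airflow credentials are the ones created by airflow-init, and within the Compose network the webserver is reached by its service name rather than localhost.

# A minimal sketch of Flask endpoints for triggering the DAG and reading
# results; hostnames, the DAG id, and the table name are placeholders.
import os

import psycopg2
import psycopg2.extras
import requests
from flask import Flask, jsonify

app = Flask(__name__)

AIRFLOW_API = os.getenv("AIRFLOW_API", "http://airflow-webserver:8080/api/v1")
AIRFLOW_AUTH = ("airflow", "airflow")  # credentials created by airflow-init
DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://user:pass@postgres:5432/app")


@app.post("/api/trigger-dag")
def trigger_dag():
    # Kick off a run of the sensor DAG via Airflow's stable REST API.
    resp = requests.post(
        f"{AIRFLOW_API}/dags/sensor_stats/dagRuns",  # "sensor_stats" is illustrative
        json={"conf": {}},
        auth=AIRFLOW_AUTH,
        timeout=10,
    )
    return jsonify(resp.json()), resp.status_code


@app.get("/api/stats")
def get_stats():
    # Read the computed stats for the React table.
    with psycopg2.connect(DATABASE_URL) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute("SELECT pollutant, mean, median, std_dev FROM sensor_stats")
            return jsonify(cur.fetchall())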

A data dashboard with a table showing UFP, BC, and NO2 means, medians, and standard deviations.

Success! Above are the computed mean, median, and standard deviation of UFP (Ultrafine Particles), BC (Black Carbon), and NO2 (Nitrogen Dioxide) levels. The values are the same on every run because it’s just recalculating from the same data. Again, this is a proof of concept; in the real world, we would be processing new data as it comes in and seeing different results.

Summary

We’ve demonstrated how Apache Airflow, working together with Flask and React, can be used for a modern data processing and review use case. DAGs can be set up and scheduled, and they can query internal or external data. We’ve utilized the wonderful data.world data catalog. We’ve also used the Airflow REST API to trigger a DAG run, queried our data from a Postgres database, and displayed it in our React app.

A good next step would be to explore deploying each piece to a cloud provider, or building out the Flask and React apps for a truer user experience.
