Thoughts on Implementing an Analytics-Friendly Data Solution for a Web Application

Robert Gutierrez
7 min read · Mar 3, 2021
Data center; your data is stored here!
Photo by İsmail Enes Ayhan on Unsplash

Many software development companies, as well as companies that conduct business or learning online, perform some kind of analysis on the data generated by their web applications. Often, this data comes from multiple applications and other sources, internal and external, and before it can be analyzed it must be “normalized”: converted into some common format shared by all sources. PII (personally identifiable information) must be stripped and identities anonymized; data must be joined; dirty data must be cleaned.

During my career as a software engineer, I have found that this normalization work is usually an afterthought, something left to think about after the application has launched, when analysts are already staring at a brick wall of mismatched data. With enough resources and time, normalization can certainly be tackled after launch. But I’m here to advocate for an alternative: tackle it while you’re developing the app. Construct your data models and tables with your analysts in mind. If you do it right, the pre-work for analysis could be as simple as uploading your app’s data to your company’s data lake/data warehouse.

I’m going to address this using a particular project as an example: an app that was later rebuilt, in part to incorporate some of my thinking on this subject (and in part to make it better and faster on several other fronts).

The project — the best argument in support of my thesis

In terms of normalization, this project was the stuff of nightmares. A few years ago, my company contracted a development team to build a web application that allowed users to engage in improvement science digitally. Until then, key concepts of improvement science, such as Driver Diagrams, Fishbone Diagrams, and PDSAs (Plan-Do-Study-Act cycles), were sketched out and fleshed out on paper. This led to a lot of disorganization and lost learning, and it made that learning just plain hard to share. Iterating on previous PDSAs meant flipping through the “PDSA” pile at the corner of your desk, scanning the data gathered, and trying to determine whether you were making progress.

Needless to say, we advocated heavily for a digital solution, and thus the app was born.

The contractor, keen to use familiar technologies and frameworks, produced a product consisting of a WordPress backend and a WordPress/Angular 1 frontend. The application functioned adequately, but one very large problem presented itself when our analytics team began their effort to conduct a “network health analysis”, which involved collecting measures from within each network (usually a school district) on the app: interactions between users and the platform, interactions between users themselves, and the outcomes of those interactions. Each interaction was also weighted based on its potential for producing an actionable outcome.

(This is probably a gross oversimplification of network health analysis. Sorry, analytics team!)

While this sounds simple in theory (the result would likely be a dense, weighted, directed graph), numerous obstacles stood between the data and that graph:

Overuse of the wp_posts table. Many WordPress plugins use the wp_posts table to store their data. Since the table contains lots of metadata fields, and perhaps because it seems to be the “meat” of any WordPress site, new objects introduced into the application often end up as posts with a different post_type. Listing off some of the custom post types from the table, we have: blog_post, topic, reply, pdsa, forum, invites_log, and several others. What is a reply? Do topics and replies belong to forums? And what is an invites_log? Without doing due diligence on what each post type was for, it was nearly impossible to determine which ones were important.

Arbitrary linking of things to other things. On its own, this ability is powerful when implemented correctly. Here, it was not. The application used an outdated plugin called Posts 2 Posts, which relied on a wp_p2p table with four columns: p2p_id, p2p_to, p2p_from, and p2p_type. The first three are obvious: a primary key, then ids pointing to rows in other tables. But how do we know which tables p2p_to and p2p_from point into? Supposedly via p2p_type, a string whose meaning was defined somewhere in the WordPress code. So the analyst would need access to the codebase and would then have to go spelunking for that definition. Not ideal. (I sketch this ambiguity concretely just after this list.)

The entire Driver Diagram, and all its related objects, was stored in one giant JSON blob. But drivers were also stored in the terms tables. The enormous JSON blob made it extremely difficult to debug and enhance the Driver Diagram. But dev gripes aside, if an analyst wanted to look at specific drivers, how did they know where to look? Was the JSON more recently updated? Were there things within the JSON that were not stored in the other tables?
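To make the first two problems concrete, here is roughly what this data looks like to an analyst who has only a database dump. The rows and p2p_type values below are invented for illustration; only the table and column names come from the actual schema.

```python
# Hypothetical rows; the values are made up, the column names are real.

# In wp_posts, every kind of object is just a row with a different post_type:
wp_posts_rows = [
    {"ID": 101, "post_type": "pdsa", "post_title": "Cycle 3: attendance calls"},
    {"ID": 102, "post_type": "reply", "post_title": ""},
    {"ID": 103, "post_type": "invites_log", "post_title": ""},
]

# In wp_p2p, a connection says nothing about which tables its ids point to:
wp_p2p_rows = [
    {"p2p_id": 1, "p2p_from": 101, "p2p_to": 55, "p2p_type": "pdsa_to_driver"},
    {"p2p_id": 2, "p2p_from": 7, "p2p_to": 101, "p2p_type": "user_to_pdsa"},
]

# Is p2p_from a wp_posts ID or a wp_users ID? It depends entirely on p2p_type,
# and the meaning of each p2p_type string lives in the plugin/theme code, not
# in the database. With only a dump of the tables, you cannot tell.
```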

Photo by Sebastian Herrmann on Unsplash

I’m imagining that’s what our analysts looked like when trying to sift through all this data. WordPress works great as blogging software, but for a web application that allows networks of people to engage socially around improvement science… maybe not the best choice.

Out of the weeds — the rebuild

Fortunately, we’ve been working on a rebuild of this application, and I’m happy to say that our schema is much friendlier. Many lessons were learned from the original app, and I was able to start from the ground up with something that works both for my developers and for our analytics team. In the past, I’ve had to devise a workflow for exporting app data to our analysts and then ingesting that data back into the application for use in generating reports and dashboards. Here are a few things I’ve implemented, or am still thinking about, with regard to analytics-friendly data:

Store clean data

This is a bit of a no-brainer but bears mentioning. Sanitize form inputs. Make sure things like names, email addresses, and titles are stripped of unnecessary whitespace. Lowercase email addresses. Store datetimes in UTC and convert to specific timezones only when needed. Watch out for special characters in strings. Use clear and concise column names. Use proper data types in columns; if your archived column is a boolean, keep it as a boolean type instead of something like a TINYINT that is either 1 or 0. It can easily be converted to something else later if needed.
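As a rough sketch of what this looks like in practice, here are a few cleaning helpers in Python. The function names and the example record are hypothetical rather than lifted from our codebase; the point is simply that normalization should happen at write time, not at analysis time.

```python
from datetime import datetime, timezone


def clean_email(raw: str) -> str:
    """Strip whitespace and lowercase so the same address always compares equal."""
    return raw.strip().lower()


def clean_title(raw: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(raw.split())


def to_utc(dt: datetime) -> datetime:
    """Store datetimes in UTC; convert to a display timezone only when rendering."""
    if dt.tzinfo is None:
        # This sketch assumes naive datetimes are already UTC; a real app should
        # attach the user's timezone before converting.
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# archived stays a real boolean, not a TINYINT flag, so analysts never have to
# guess what 1 and 0 mean.
record = {
    "email": clean_email("  Jane.Doe@Example.COM "),
    "title": clean_title("  Weekly   progress  check-in "),
    "created_at": to_utc(datetime(2021, 3, 3, 9, 30)),
    "archived": False,
}
```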

Don’t shoehorn functionality

This is the infamous “fitting a square peg into a round hole” problem, one that WordPress plugins often run into when they try to build new functionality on top of a limited set of database tables. The better plugins create their own tables and write their own logic, but numerous others just utilize wp_posts, for better or worse.

Knowing the specs of something beforehand and designing a schema that is easily extendable are two key things here that will keep you (mostly) on the right track. This very concept led me to one of my first orders of business with the rebuild, which was…

Implementing “arbitrary association of things to other things” the right way

An article by Luke Plant on the disadvantages of Django’s GenericForeignKey got me thinking about the optimal way of accomplishing this. I ultimately went with alternative two from his article and implemented my own version of it as an interface in Django called ContextualModel. On the code side, this gives us a nifty context field, where the queries to retrieve the actual object have already been done and the result cached. If I have a Post object and I want the title of its context, I can just refer to post.context.title. I will also likely know what type of object the context is, based on where I am in the application, but if I don’t, I can always refer to post.context_type.
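For the curious, here is one simplified way such an interface could look. This is a sketch using nullable foreign keys and made-up context models (Activity, Network), not our actual implementation, but it shows the context and context_type conveniences described above.

```python
from django.db import models
from django.utils.functional import cached_property


class ContextualModel(models.Model):
    """Abstract base for objects that belong to exactly one 'context' object."""

    # Hypothetical context types; one nullable foreign key per possible context.
    context_activity = models.ForeignKey(
        "Activity", null=True, blank=True, on_delete=models.CASCADE, related_name="+"
    )
    context_network = models.ForeignKey(
        "Network", null=True, blank=True, on_delete=models.CASCADE, related_name="+"
    )

    class Meta:
        abstract = True

    @cached_property
    def context(self):
        """Return whichever context is set; cached so repeated access is free."""
        return self.context_activity or self.context_network

    @property
    def context_type(self) -> str:
        """A readable label such as 'activity' or 'network'."""
        return type(self.context).__name__.lower() if self.context else ""


class Post(ContextualModel):
    title = models.CharField(max_length=255)
    body = models.TextField()


# Usage, as described above: post.context.title, post.context_type
```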

What remains to be done is to determine whether additional work is needed to resolve context into the actual context object before we hand off the data to our analysts. Also, would it be valuable if we could start with a context object, like Activity, and get all “contextual” child objects that it is a context for? If we know what we’re querying for beforehand, like Posts, we can do the queries easily, but what about a query that gets every kind of contextual child object? Right now, for Activities, it’s just Posts, but this could be useful.
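One rough way to support that, building on the sketch above (and again using hypothetical names), would be a small registry of contextual models that we walk for a given context object:

```python
# Hypothetical registry of every model that inherits from ContextualModel.
# Today that is just Post; later it could include more.
CONTEXTUAL_MODELS = [Post]


def contextual_children(context_obj):
    """Collect every contextual object whose context is `context_obj`.

    Assumes the nullable-FK sketch above, where the foreign key for an
    Activity context is named context_activity, and so on.
    """
    field_name = f"context_{type(context_obj).__name__.lower()}"
    children = []
    for model in CONTEXTUAL_MODELS:
        children.extend(model.objects.filter(**{field_name: context_obj}))
    return children


# e.g. contextual_children(some_activity) returns all Posts (and, later, any
# other contextual objects) attached to that Activity.
```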

Rolling up Likes, Comments, and Tags into a SocialActions table

Currently we only have Likes and Comments, but this idea came to me the other day. If we’re weighting our social interactions in the app as part of some measure for network health analysis, then calculating that measure would involve knowing which tables to look through, adding up the relevant records, and multiplying them by the weight for each table. That’s still possible, but what if we went a step further? What if we maintained a record in a SocialActions table for each Like, Comment, and Tag in a network? In addition to the common fields these objects all share, we’d have an action_type column (like, comment, or tag) and a weight column that would differ for each type of interaction. Then we could query the SocialActions table by network (or perhaps by context) and simply take the sum of weight across all rows.
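Here is a sketch of what that could look like in Django. The model and field names mirror the idea above, while the Network model, the user reference, and the per-row weight are placeholders for whatever the analytics team actually needs.

```python
from django.db import models
from django.db.models import Sum


class SocialAction(models.Model):
    """One row per Like, Comment, or Tag, denormalized for analysis."""

    ACTION_TYPES = [("like", "Like"), ("comment", "Comment"), ("tag", "Tag")]

    network = models.ForeignKey("Network", on_delete=models.CASCADE)
    actor = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    action_type = models.CharField(max_length=20, choices=ACTION_TYPES)
    # Placeholder: the real weights would come from the analytics team.
    weight = models.FloatField()
    created_at = models.DateTimeField(auto_now_add=True)


def network_social_score(network) -> float:
    """The social portion of a network's health measure: one aggregate query."""
    result = SocialAction.objects.filter(network=network).aggregate(total=Sum("weight"))
    return result["total"] or 0.0
```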

That’s all I have for now! I will likely update this post as more implementations and ideas manifest themselves. I’d also love to hear others’ thoughts, since this is something I’m constantly thinking about.

