If you’re responsible for implementing and sustaining data-heavy projects, you already know that as your implementation grows, sustainability and reliability become harder to maintain. Here are four data management best practices to help you make better data-driven decisions.
As the volume and complexity of your datasets, ETLs, and use cases increase, you inevitably find yourself facing challenges such as:
- Data updates take too long.
- It becomes harder to track all the data dependencies, so you’re never certain about the impact of your changes; you can break something without realizing it.
- Data validation becomes harder.
- You can’t monitor your data reliably.
We know that pain. We’ve been there. And after working through hundreds of these types of projects, we’ve learned that there are four components of a well-managed data ecosystem.
- Structured data integration and ETL processes
- Documentation
- Ongoing monitoring
- Buy-in and shared ownership
We’ll admit, these aren’t “simplify your life in just 4 easy steps!” types of fixes; they’re a new way of thinking about your data and the processes that surround that data. But they work, and they’ll help make your data more manageable and reliable.
Data Management Best Practice #1: Establish structured data integration and ETL processes
There are two essential practices for properly structuring your data. The first is to standardize everything. Naming conventions are critical, from keeping field names clear and consistent across datasets to standardizing the values within fields, especially where you derive data based on a particular format. Standardizing your ETL process is equally important: explicitly setting your data types in every transformation step, for example, can save you a lot of headaches later when bad data gets pushed through.
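To make that concrete, here’s a minimal Python/pandas sketch of what enforcing naming and type standards in a transformation step might look like. The column names and dtype map are invented for illustration, not taken from any real project:

```python
import pandas as pd

# Hypothetical naming standard: snake_case field names, one canonical
# name per concept, applied identically in every dataset.
COLUMN_RENAMES = {"CustID": "customer_id", "ORDER DT": "order_date", "Spend$": "spend_usd"}
COLUMN_DTYPES = {"customer_id": "string", "spend_usd": "float64"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns=COLUMN_RENAMES)
    df["order_date"] = pd.to_datetime(df["order_date"])  # one canonical date format
    # astype() fails loudly here if malformed data is pushed through,
    # instead of silently corrupting downstream datasets.
    return df.astype(COLUMN_DTYPES)
```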
The second practice is to differentiate between standard and customized data. End users will inevitably request custom data transformations, but if you keep modifying your data in response to all of those requests, you’ll end up with datasets that no longer resemble the raw data and become impossible to map back to their source, which (going back to those initial challenges) makes validation and ongoing updates very complex. It’s therefore critical to differentiate between universal and specific processes.
- Universal processes apply to every use case of the data; changes to this data are equally relevant to all of its outputs. Examples include lookups or groupings that are standard across the company, or a recursive process that maintains historical data.
- Specific processes are only needed for particular use cases or sets of use cases. Data may need to be modified to fit those needs, but the changes are only relevant to those use cases, so there’s no need to apply them to all of the data. Instead, build offshoot datasets that can be modified to fit the specific need; changes made there won’t impact the upstream data. This shortens the time for those changes to take effect and load regularly, and it keeps the source data as simple as it needs to be. (A sketch of this split follows the list.)
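Here’s one hypothetical way to express that split in Python/pandas: a single universal pipeline produces the shared source dataset, and offshoot builders derive use-case-specific versions without touching it. All names and transformations are illustrative:

```python
import pandas as pd

# Standard company lookup (illustrative values).
REGION_LOOKUP = {"NA": "North America", "EU": "Europe"}

def build_universal(raw: pd.DataFrame) -> pd.DataFrame:
    """Universal process: transformations every consumer of this data needs."""
    df = raw.copy()
    df["region"] = df["region_code"].map(REGION_LOOKUP)  # company-standard grouping
    return df

def build_finance_offshoot(universal: pd.DataFrame) -> pd.DataFrame:
    """Specific process: finance-only changes live here, never upstream."""
    df = universal.copy()
    df["spend_usd"] = df["spend_usd"].round(2)  # this team's custom requirement
    return df
```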
This is often harder than it sounds, because ad hoc fixes to meet specific requests are usually quicker and easier than implementing a system like this. But the shortcut isn’t worth it. Really. It’s not worth the heartache you’ll experience after making ten or twenty ad hoc fixes and then realizing your dataset relationships now look like your spaghetti dinner.
Once you’ve implemented a system like this, you end up with an architecture in which universal processes feed a single standardized source dataset, and each specific use case branches into its own offshoot. Apply that same pattern across your whole ecosystem and you get a small set of clean source datasets, each with clearly mapped offshoots. Cleaner, simpler, and so much easier to maintain!
Data Management Best Practice #2: Document early and often
In most cases, no one can be expected to keep all the relevant information in their heads. But even if they could, we’ve seen far too many cases where clients tell us, “The person who set this up is gone now,” or, “This was created so long ago that nobody’s really clear about how it was set up or what rules it follows.” Don’t do that to yourself; establish clear documentation from the beginning.
Documentation can be a significant undertaking, but the principles that underlie it are not. Essentially, they boil down to a few key points:
- Document naming conventions and ETL standards and guidelines to ensure stability.
- Create a data dictionary. There’s a tendency to make these extremely complex, which usually means they aren’t maintained or referenced. A data dictionary should include all the key information about the data and the logic behind its ETL processes, but it should stay “client-facing”: friendly enough for a new user to review and understand. Ultimately, it should replace the tribal knowledge that lives across a bunch of IT people’s heads and become the single source of truth for all data in the ecosystem, while remaining straightforward enough to create a common understanding and enable shared dialogue. (One lightweight way to structure an entry is sketched below.)
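One lightweight way to keep a data dictionary maintained is to store it in machine-readable form alongside the ETL code itself. A minimal sketch in Python, with an invented field as the example:

```python
# A minimal, client-facing data dictionary entry: plain language first,
# with just enough ETL logic to act as the single source of truth.
DATA_DICTIONARY = {
    "spend_usd": {
        "description": "Total media spend in US dollars, per campaign per day.",
        "source": "ad_platform_export.cost",
        "logic": "Converted to USD using the daily exchange-rate table.",
        "owner": "marketing data team",
    },
}
```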
Data Management Best Practice #3: Set up monitoring and alerts
Just like any system, your data ecosystem needs to be monitored to keep you informed, help you maintain your data’s health, and alert you to problems as they arise. You’ll want to keep an eye out for missing data, duplicate data, mismatching data, and a host of other issues.
But as your data grows, you’ll quickly hit the point at which you can’t monitor everything. So it’s critical to map out what to monitor and why; once you do that, you can plan how to monitor.
A quick framework to plan this out is to think through these four points:
- Determine what should be monitored and why
- Determine who needs to be alerted when there are issues
- Determine when they need to know about those issues – Is it daily? Weekly? As soon as an issue arises?
- Determine what action needs to be taken if an issue comes up
Once you have these answers, you can build your monitoring plan, which may consist of a checklist of critical areas to check on regularly, a dashboard indicating your ecosystem’s health, or some other methodology.
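As one illustration, a scheduled script covering the issues mentioned above (missing, duplicate, and mismatching data) might look like this minimal pandas sketch; the dataset, key columns, and source-of-truth comparison are all hypothetical:

```python
import pandas as pd

def run_health_checks(df: pd.DataFrame, source_totals: pd.Series) -> list[str]:
    """Return a list of human-readable issues; an empty list means healthy."""
    issues = []
    if df["customer_id"].isna().any():
        issues.append("missing data: null customer_id values found")
    if df.duplicated(subset=["customer_id", "order_date"]).any():
        issues.append("duplicate data: repeated customer_id/order_date rows")
    # Compare against source-of-truth totals to catch mismatches.
    daily = df.groupby("order_date")["spend_usd"].sum()
    if not daily.round(2).equals(source_totals.round(2)):
        issues.append("mismatching data: daily spend totals differ from source")
    return issues
```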
Beyond monitoring, you may also consider setting up alerts, which let you keep track of far more than you could manually: you can “manage by exception,” focusing only on the areas where there are issues. Here, too, it’s important to think through the four points above. You want alerts to be timely and relevant, because too few alerts mean you may miss critical issues, while too many disintegrate into noise that recipients eventually start ignoring.
You can use those points to create a framework for which alerts to set, who should receive them, and what they should do once alerted. Here is a simple example for a client who wanted to make sure marketing spend didn’t spike unexpectedly.
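We can’t reproduce that client’s exact alert mapping here, but a sketch of such a rule might compare each day’s spend to a trailing average and email the owner on a spike. The threshold, addresses, and mail relay below are all assumptions for illustration:

```python
import smtplib
from email.message import EmailMessage

import pandas as pd

SPIKE_THRESHOLD = 1.5  # flag days 50% above the trailing average (made-up threshold)

def check_spend_spike(daily_spend: pd.Series) -> str | None:
    """daily_spend: spend indexed by date, most recent day last."""
    baseline = daily_spend.iloc[-8:-1].mean()  # trailing 7-day average
    today = daily_spend.iloc[-1]
    if today > SPIKE_THRESHOLD * baseline:
        return f"Marketing spend spiked: {today:,.2f} vs 7-day average {baseline:,.2f}"
    return None

def send_alert(body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "ALERT: unexpected marketing spend spike"
    msg["From"] = "data-monitoring@example.com"  # hypothetical addresses
    msg["To"] = "marketing-lead@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)
```

Note how the four framework questions map onto the sketch: the threshold is the what and why, the recipient is the who, the daily run is the when, and the email body tells them what action to take.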
Here is another example. The purpose of these alerts was to ensure that campaigns and conversions followed a consistent naming pattern, which helped join data consistently across multiple sources. When those structures weren’t followed, we set up alerts to the marketing agency so they could update the names. In this case, though, the client also wanted a backup system to ensure these were addressed in a timely manner, so we set up a second set of alerts: if a naming issue wasn’t addressed within five days, the client received an email prompting them to check in with the agency and address it.
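A sketch of that two-tier logic might look like the following; the five-day window mirrors the description above, but the naming convention and all names are illustrative:

```python
import re
from datetime import date, timedelta

# Hypothetical convention: channel_campaign_YYYYMM, e.g. "search_springsale_202405"
CAMPAIGN_PATTERN = re.compile(r"^[a-z]+_[a-z0-9]+_\d{6}$")
ESCALATION_DAYS = 5

def audit_names(first_seen_violations: dict[str, date]) -> tuple[list[str], list[str]]:
    """Takes campaign name -> date the violation was first seen.
    Returns (names to alert the agency about, names to escalate to the client)."""
    agency_alerts, client_escalations = [], []
    for name, first_seen in first_seen_violations.items():
        if CAMPAIGN_PATTERN.match(name):
            continue  # name is compliant
        agency_alerts.append(name)  # first line of defense: the agency renames it
        if date.today() - first_seen >= timedelta(days=ESCALATION_DAYS):
            client_escalations.append(name)  # backup: client checks in with the agency
    return agency_alerts, client_escalations
```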
Data Management Best Practice #4: Create buy-in and shared ownership that’ll make your future self grateful
No matter how good a system is, it won’t be effective unless the people responsible all share an understanding of that system and what each person needs to own to maintain that system’s success. This is so often overlooked, with the assumption that an email announcement, or a lunch-and-learn, is sufficient to get everyone on board. It’s not.
It’s worth investing the time to connect directly with data owners and data consumers, usually multiple times. Don’t overwhelm them with details, but do communicate what they need to know about the data, and what their role will be in maintaining that system, whether it’s ensuring IT folks understand and comply with naming conventions, or that data consumers know why they may receive alerts and how they should respond.
Then, check in over time to ensure they’re able to use the systems, are actually implementing them correctly, and understand how these systems are benefiting them. As we mentioned earlier, these systems take more upfront effort to implement, but your future self will be eternally grateful if you start early and stay consistent.
Takeaways and next steps
There’s a lot of information in these short paragraphs. We’ve outlined a system that can save you significant time and effort in the long run, and help you maintain your data’s consistency and integrity.
If you’re just beginning to build your data ecosystem, build with these principles in mind. If you have already built your system, or—as is often the case—you’ve inherited an already established ecosystem that doesn’t follow these principles and is overwhelming you, don’t despair!
Just pick a discrete section of the data and start implementing these ideas there. As you expand into other areas of your ecosystem, you’ll see the rigor and structure that you’re implementing begin to pay off—you’ll find it easier to validate data, update data, and monitor what’s going on.
And of course, in either case, we’re here to help. We’ve helped hundreds of clients build (or re-build) better data management systems, and we’d be happy to help you get started.