10 Best Practices for Data Transformation in Domo
With the growth of user-friendly BI tools like Domo, using your data is simpler than ever! At the same time, though, managing and maintaining your transformation processes gets harder as your data grows.
In this article, we’ll look at 10 best practices that reduce long-term challenges and set you up for efficient and flexible data transformations within Domo.
As a Domo technical consultant for nearly 3 years, I’ve noticed some common challenges that arise over time with Domo transformations:
- Iteration Efficiency: It takes a lot of time to understand a dataflow that someone else created (or even a dataflow that you created a long time ago!).
- Scalability: A dataflow can start small but gradually expand to include more logic than anyone can comfortably manage.
- Repeatability: Logic sometimes gets rewritten across dataflows to meet multiple needs.
- Speed: A dataflow that runs quickly at first will gradually slow down as the data size grows.
Thankfully, there are surprisingly simple methods for mitigating each of these challenges. By implementing them during your initial dataflow builds, you'll save yourself and your coworkers a lot of headaches down the line.
Let’s start with Iteration Efficiency. How can we ensure that dataflows we create are as easy to understand as possible and set up for rapid iterations?
Tip #1 – Self-document your work and edits as you go

The most useful way to get up to speed on a dataflow is by referencing documentation. But no need to panic: I'm not asking you to create a 50-page PDF describing every detail of your dataflow! Instead, it's incredibly simple to leverage Domo's built-in tools for this.

Have you ever noticed that you can rename the Magic ETL action boxes? Instead of keeping default names such as "Select Columns" or "Join," change the name to something more descriptive. For example, compare this: Versus the exact same dataflow, but with renamed action boxes:

Similarly, in MySQL or Redshift, you can write quick comments at the top of your SQL code. And after saving your dataflow changes, make sure to add a quick "commit message." It saves automatically to your dataflow's version history and provides a super useful log of every change that you or others have made. Commit messages also help if you ever need to revert to a previous version of the dataflow. Here's an example of how I worked with a collaborator in a dataflow over the course of 5 months:
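For the SQL side, here's a minimal sketch of what those quick header comments might look like, using SQLite as a stand-in for a Domo MySQL transform (the table and column names are hypothetical):

```python
import sqlite3

# A hypothetical transform with the kind of header comments you might
# write at the top of a MySQL dataflow transform in Domo.
sql = """
-- Transform: regional_sales_rollup
-- Purpose:   aggregate raw orders to one row per region
-- Note:      region added for the regional dashboard
SELECT region, SUM(amount) AS total_amount
FROM orders
GROUP BY region
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("NA", 75.0)])

# Comments are ignored by the engine but invaluable to the next reader.
rows = conn.execute(sql).fetchall()
print(sorted(rows))
```

The engine skips right past the comments, but the next person who opens the transform gets the context for free.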
Tip #2 – Turn on trigger settings right away when you first create the dataflow
This might seem obvious, but I’ve seen people spend hours trying to figure out why their data output isn’t as expected, only to discover that the dataflow just hadn’t run in about a month.
It's easy to forget about the trigger settings after spending a lot of time building the logic. Do yourself a favor and get in the habit of turning the trigger on as soon as you create the dataflow.
Tip #3 – Give your output dataset the same name as your dataflow
This makes it easier to find the dataflow or dataset you’re looking for. Domo’s data center has two distinct areas: “Dataflows” and “Datasets,” and you’ll often find yourself switching back and forth between those areas. Trying to remember different names for the same dataflow can often make things more confusing than necessary.
If your dataflow has more than one output, we recommend using dataset names that are similar to the name of the dataflow.
Tip #4 – Avoid confusing transformation setups
There are two simple configurations that I've seen cause confusion.
In Magic ETL, don’t rename your columns within a GROUP BY action box. Although that makes a lot of sense in a SQL dataflow, Magic ETL users often forget to look inside of the GROUP BY box when they’re looking for specific columns or calculations.
To keep the transformations as simple as possible, put a SELECT action box right in front of the GROUP BY, and save the GROUP BY for aggregating only.
In this example, a coworker looking for "Sales" won't find it, since it's now titled "Total Revenue".
Instead, rename “Sales” to “Total Revenue” right before you aggregate:
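Magic ETL is drag-and-drop, but the same pattern can be sketched in SQL terms (SQLite stand-in; table and column names are hypothetical): do the renaming in a dedicated step, then let the aggregation step aggregate and nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (campaign TEXT, sales REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("A", 10.0), ("A", 20.0), ("B", 5.0)])

# Step 1 (the "SELECT" box): rename columns up front, where they're easy to find.
conn.execute("""
CREATE VIEW renamed AS
SELECT campaign, sales AS total_revenue
FROM raw_sales
""")

# Step 2 (the "GROUP BY" box): aggregation only -- no renaming hidden in here.
rows = conn.execute("""
SELECT campaign, SUM(total_revenue)
FROM renamed
GROUP BY campaign
""").fetchall()
print(sorted(rows))
```

Anyone scanning the steps can see where "Sales" became "Total Revenue" without opening the aggregation itself.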
There’s a similar idea in MySQL or Redshift dataflows. Don’t put transformation logic into the output transform (unless your data size is extremely large and you’re trying to reduce data movement).
In this screenshot, you can see that the Output dataset has a simple “SELECT *” and the actual SQL logic happens within the “Transform” section. That’s the best practice way to set up your dataflow because it keeps the logic organized and easy to find.
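Here's a minimal sketch of that structure, with SQLite standing in for a Domo SQL dataflow (names are hypothetical): all of the real logic lives in the transform, and the output is nothing but `SELECT *`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# "Transform" section: the business logic belongs here.
conn.execute("""
CREATE TABLE transform_large_orders AS
SELECT id, amount
FROM orders
WHERE amount >= 10
""")

# "Output dataset": nothing but SELECT *, so readers know where to look.
rows = conn.execute("SELECT * FROM transform_large_orders").fetchall()
print(rows)
```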
Alright! Now that you know how to set up your dataflows for easy iterations, let’s look at ways to make your dataflows more scalable for the future.
Tip #5 – Pick the right tool for the job
Domo provides three transformation methods: MySQL, Magic ETL, and Redshift (available upon request). Before beginning your transformation development, think carefully about which tool will serve you best in the long run.
There are a lot of considerations that matter here – we have a separate blog post coming soon specifically for this topic! One main thing to keep in mind is how easy it’ll be to manage the business logic as your data and logic continue to expand.
The latest version of Magic ETL runs very fast. It’s also great for less technical analysts since it’s drag & drop. However, a single SQL command can often take multiple steps within Magic ETL. If your Magic ETL dataflow has more than 30 action boxes, consider switching to SQL.
You’ve picked the right tool, and now it’s time to consider how to structure your dataflow(s). How can you avoid repeating logic in different places across Domo?
Tip #6 – Build for parallel consumption
When adding logic into a dataflow, always assume that you’ll want to re-use it for different reasons in the future. Make each dataflow a “micro-process” where it accomplishes a specific result rather than doing everything at once. Split up your logic into multiple dataflows when necessary. This keeps each segment of logic isolated and allows for re-using a set of prepared data, without re-writing the logic.
Tip #7 – Don’t go overboard with your aggregations
You can aggregate your data to improve run-time speed, but don't over-aggregate by removing too many dimensions. It's better to leave dimensions in your dataset even if you don't need them at the moment. Domo's dashboarding capabilities are fast and can pull out specific columns as needed. Leaving extra fields in your dataset will save time later if you want to expand the slicers & dicers available on your dashboard.
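As a sketch (SQLite stand-in, hypothetical names): keeping `region` in the GROUP BY costs a few extra rows but preserves a dimension a future dashboard slicer might need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-01", "EMEA", 10.0),
    ("2024-01-01", "NA", 20.0),
    ("2024-01-02", "EMEA", 5.0),
])

# Over-aggregated: GROUP BY sale_date alone would drop the region dimension,
# making a future "region" slicer impossible without rebuilding the dataflow.
# Better: keep region in the GROUP BY even if today's cards don't use it.
rows = conn.execute("""
SELECT sale_date, region, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date, region
""").fetchall()
print(sorted(rows))
```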
Excellent – Your dataflows are easy to understand and are set up for expansion in the future! Now let’s make sure they’ll continue to run at a fast speed, even once a lot more data is flowing through.
Tip #8 – Append instead of joining, where possible
At Search Discovery, we call this the “spine method.” Instead of joining data horizontally (left join, inner join, etc.), stack the datasets on top of each other, using the shared dimensions as the backbone.
This method works best for situations where you’re running a calculation that combines different sets of data. In this example, we want to compare Cost vs Clicks. By having them stacked on top of each other, the Domo charts will be able to calculate it on the fly for each campaign.
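Here's a minimal sketch of the spine method, using SQLite as a stand-in (dataset and column names are hypothetical): the shared `campaign` dimension is the spine, and each source contributes rows labeled with its metric instead of being joined in as a new column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_cost (campaign TEXT, cost REAL)")
conn.execute("CREATE TABLE ad_clicks (campaign TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO ad_cost VALUES (?, ?)", [("A", 100.0), ("B", 40.0)])
conn.executemany("INSERT INTO ad_clicks VALUES (?, ?)", [("A", 250), ("B", 80)])

# Spine method: instead of LEFT JOINing cost to clicks, stack both tables
# on the shared "campaign" dimension, with a label column naming the metric.
rows = conn.execute("""
SELECT campaign, 'Cost'   AS metric, cost   AS value FROM ad_cost
UNION ALL
SELECT campaign, 'Clicks' AS metric, clicks AS value FROM ad_clicks
""").fetchall()
print(sorted(rows))
```

With the metrics stacked this way, a Domo card can compare Cost vs Clicks per campaign on the fly, without the dataflow pre-computing every combination.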
Tip #9 – Use UNION ALL instead of UNION where possible

When using MySQL or Redshift, UNION removes duplicates while appending. UNION ALL runs faster because it simply executes the append without deduplicating.
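The difference is easy to see in a small sketch (SQLite stand-in, hypothetical tables): UNION collapses the duplicate row, while UNION ALL keeps it and skips the deduplication work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q1 (campaign TEXT)")
conn.execute("CREATE TABLE q2 (campaign TEXT)")
conn.executemany("INSERT INTO q1 VALUES (?)", [("A",), ("B",)])
conn.executemany("INSERT INTO q2 VALUES (?)", [("B",), ("C",)])

# UNION deduplicates, which forces extra sort/hash work over the combined rows.
deduped = conn.execute(
    "SELECT campaign FROM q1 UNION SELECT campaign FROM q2").fetchall()

# UNION ALL just appends -- faster, and the duplicate "B" row survives.
appended = conn.execute(
    "SELECT campaign FROM q1 UNION ALL SELECT campaign FROM q2").fetchall()

print(len(deduped), len(appended))  # 3 4
```

Only reach for plain UNION when you actually need the deduplication.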
Tip #10 – Leverage native database methods for improving speed in SQL Dataflows
Inside of Domo, you can treat your SQL dataflows as if they were standard database engines, which means you can use native SQL optimization methods.
In MySQL, make sure to use indexing! Domo even has the option built in, but you have to enable it by selecting which field to index.
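The same idea, sketched with SQLite (standard engines, including MySQL, behave similarly for simple equality lookups; the table and index names are hypothetical): index the column you filter or join on, and the planner can seek instead of scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(1000)])

# Equivalent of picking the index field in a Domo MySQL transform:
# index the column your WHERE clause or join uses.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# The query planner can now seek on customer_id instead of scanning all rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM orders WHERE customer_id = 7"
).fetchall()
print(plan)  # typically a SEARCH ... USING INDEX idx_orders_customer
```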
For Redshift dataflows, use Redshift distribution keys and sorting keys as part of a CREATE TABLE statement. This will improve how Redshift handles the clustering across nodes as it runs your transformations.