How analytics engineers can unlock practical ML to drive business value
Analytics engineers formally emerged in 2018 and quickly established themselves as key members of the modern data team. With a broad skillset that bridges the gap between data engineers and data analysts, they can own the entire data stack, making them attractive candidates to jump-start new data teams.
Analytics engineers can help their companies begin generating value from data in record time. With their hybrid skillset, they can quickly set up pipelines for data ingestion, do modeling for analytics, and even enable some data activation use cases. However, one thing is still seemingly out of reach for the analytics engineer – machine learning.
Traditionally, ML has required dedicated tooling and expertise, but that’s rapidly changing. If you're a data leader or an analytics engineer who sees an opportunity but is struggling to take advantage of ML, the good news is it’s within reach.
In this post, we’ll explore two types of ML problems, noting which one an analytics engineer is well equipped to solve. Then, we’ll provide a clear roadmap to help analytics engineers get started implementing ML to drive business value.
Facilitating practical ML
In the past, ML was about developing predictive algorithms made of complicated mathematical models. This required data scientists with extensive training. To design algorithms, they needed a deep understanding of complex math. To make their algorithms usable, they needed training in computer science and programming. This expertise is still invaluable today, but technological advancements have shifted the focus for some ML work.
Many business use cases are now well-defined, and we’ve learned which algorithms consistently return the most reliable results. For these use cases, the focus becomes feeding these algorithms with clean, high-quality, structured data and building processes for real-world activation – the type of work that analytics engineers live and breathe. We’ll call this practical ML.
Taking advantage of ML used to require a painful tradeoff. Either use a black-box SaaS tool and lose transparency and customizability, or invest in a dedicated team and dedicated infrastructure. This tradeoff made it difficult for most companies to fully leverage ML to create better business outcomes.
With tooling advancements and the rise of broad knowledge data roles, this is no longer the case. Today, it’s possible for analytics engineers to implement practical ML solutions for many of the most critical business use cases.
The two types of ML problems
To understand how analytics engineers can unlock ML for your business, it’s helpful to frame ML problems as solved vs. unsolved.
- Solved problems are more defined and predictable, and their solutions don’t widely vary between businesses and industries. The models for solving them can be fairly standardized. The difference between businesses for these problems comes down to the data input. Examples of these types of problems include churn, LTV, lead score, customer segmentation, and demand forecast. Solutions for these problems are rapidly becoming table stakes for companies that want to remain competitive in today’s environment.
- Unsolved problems are exploratory in nature and are unique to a business or industry. You can think of them like R&D. They are highly complex and require custom solutions and tools. Their success isn’t guaranteed, but the potential upside is huge. They can generate significant competitive advantages because they’re difficult to produce and hard to copy. Examples include video and image analysis (computer vision), supply chain optimization, predictive maintenance in manufacturing, and fraud detection.
With modern tooling, analytics engineers are well suited to facilitate practical ML solutions to handle solved problems. When they do, they expand their impact on the business and free up data science teams to focus on high-potential unsolved problems.
Unsolved problems are where data scientists shine. When analytics engineers handle the solved problems, data scientists can focus on these high-potential initiatives.
Next, we’ll look at the model development process for solved problems, giving a practical roadmap to help analytics engineers get started. I’ll highlight how the analytics engineers’ existing skillset applies to each step, call out skill gaps where they exist, and offer guidance on how they can be filled.
The four stages of practical ML model development
To implement practical ML solutions and help your company leverage ML to drive better business outcomes, you’ll follow a straightforward process. Broadly, there are four stages of model development for solved ML problems:
- Define: Define the problem, solution, and key metrics
- Prepare: Build the pipeline to prepare data for model training and scoring
- Model: Train several models and choose the best one
- Deploy/monitor: Make the model available to the business and monitor its effectiveness
The define and prepare stages involve extensions of two core analytics engineering skill sets: business knowledge and data engineering. I’ll point out some pitfalls and offer tips for these stages but won’t go into detail. The model and deploy/monitor stages involve new skills, so I’ll provide more depth here.
Define
This is the first stage and possibly the most important. Get things right here, and you’ll prevent dozens of potential problems down the road. At this step, ensure you have solid answers to these foundational questions:
- What problem are you trying to solve?
- How will your solution solve the problem?
- How will the business measure success?
- How will the business use the solution?
- How and where will you deliver the solution?
- How is the metric you want to predict defined internally?
- What is the time frame of the predictions (how far into the future are you predicting and when)?
- Do we have the data to solve this problem?
These questions will ensure your ML solution aligns with real business needs and force you to consider how to implement the end-to-end process. While these questions might seem obvious, it’s easy to skip this step and jump directly into model building only to discover a misalignment at the end of the project or realize you have no way to deploy the final model to the business. The last thing you need is an ML solution that doesn’t solve an actual business problem or can’t be used by your business.
Answering these questions will also help you make technical decisions like:
- How you’ll build training examples
- What the data prep pipeline will need to look like for production
- How you’ll deliver production predictions to the business
Finally, you want to collect hypotheses of what features drive the target metric. These will primarily come from the business. However, they can also come from research on the industry or type of problem. This will give you a starting point for defining features to train and help keep you from adding meaningless features that randomly correlate with the target. It’s not your only chance to define features, but it will give you a solid starting point.
Prepare
At this stage, you begin to take your work from the previous step and turn it into action. First, you’ll create the training data. This data needs to take a few things into account:
- The point in time for predictions (for a churn model, for example, use only data that is available as of 30 days before the renewal date)
- The features, transformed into a form the ML algorithms can ingest
- The calculations, which must align with the definitions you agreed to in the define stage
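The point-in-time rule is the easiest of these to get wrong, so here is a minimal sketch of it in Python. It assumes a churn model with a 30-day lead time. The customer names, dates, event log, and feature names are all hypothetical and only illustrate the cutoff logic.

```python
from datetime import date, timedelta

# Hypothetical raw data: one renewal per customer plus a log of product events.
renewals = {
    "cust_a": {"renewal_date": date(2023, 6, 30), "churned": True},
    "cust_b": {"renewal_date": date(2023, 7, 15), "churned": False},
}
events = [
    {"customer": "cust_a", "date": date(2023, 5, 1)},
    {"customer": "cust_a", "date": date(2023, 6, 20)},  # after the cutoff, excluded
    {"customer": "cust_b", "date": date(2023, 5, 10)},
    {"customer": "cust_b", "date": date(2023, 6, 1)},
    {"customer": "cust_b", "date": date(2023, 7, 1)},   # after the cutoff, excluded
]

# Assumed lead time: predictions are made 30 days before renewal.
PREDICTION_LEAD = timedelta(days=30)


def build_training_rows(renewals, events, lead=PREDICTION_LEAD):
    """Build one training row per customer using only data known at the cutoff."""
    rows = []
    for customer, info in renewals.items():
        cutoff = info["renewal_date"] - lead
        usable = [
            e for e in events
            if e["customer"] == customer and e["date"] <= cutoff
        ]
        last_event = max((e["date"] for e in usable), default=None)
        rows.append({
            "customer": customer,
            "cutoff": cutoff,
            "event_count": len(usable),
            "days_since_last_event": (cutoff - last_event).days if last_event else None,
            "churned": info["churned"],  # the target, known only after renewal
        })
    return rows


training_rows = build_training_rows(renewals, events)
for row in training_rows:
    print(row)
```

Events dated after the cutoff never reach the features, so the training rows only contain what the model will actually know at scoring time.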
At this stage, you have an advantage because of your data engineering skills. If done well, you can use the pipelines you build during this stage to also prep production data, with minimal rework, when the final model goes into production.
Once you prep your data, you’ll randomly split it into training and validation subsets. Usually, the split is 60/40 or 70/30, depending on the total amount of data you have. If your volume is low, opt for the 70/30 split to ensure you have enough training data. Once this is complete, you’re ready for modeling.
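As a rough illustration, a random 70/30 split can be done with the Python standard library alone. The function name, the fixed seed, and the placeholder rows are assumptions made for this sketch. Most ML libraries and in-warehouse tools offer their own split utilities.

```python
import random


def split_train_validate(rows, train_fraction=0.7, seed=42):
    """Randomly split rows into training and validation subsets."""
    shuffled = list(rows)                   # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical prepared dataset: 100 rows identified by customer_id.
rows = [{"customer_id": i} for i in range(100)]
train, validate = split_train_validate(rows, train_fraction=0.7)
print(len(train), len(validate))
```

Fixing the seed means a rerun of the pipeline produces the same subsets, which makes model comparisons easier to trust.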
Model
Modeling was, traditionally, the main gap between data scientists and the rest of the data team. But for solved problems, that gap is now more perception than reality. That’s because, over the past decade, we answered the harder technical questions for these problems. Today your focus is on applying those answers to your dataset. This stage has two primary components: algorithms and implementation tools.
While algorithms can be intimidating, for practical ML, you don’t have to be an expert to get started. You can focus on a small set of known, high-performing models and begin driving results with a conceptual understanding of a few fundamentals:
- Algorithm type: Is it tree-based, regression, or something else?
- Algorithm high-level assumptions: Is it linear or nonlinear? What kind of features can it handle?
- Algorithm hyperparameters: Learning rate, max depth, number of estimators, etc.
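To make the hyperparameter idea concrete, the sketch below expands a small grid of candidate settings into individual configurations you could train and compare. The parameter names mirror the examples in the list above. The values are placeholders chosen for illustration, not recommendations, and real libraries may name these parameters differently.

```python
from itertools import product

# Illustrative grid only: names follow the list above, values are placeholders.
grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [100, 300],
}


def expand_grid(grid):
    """Return every combination of the grid as a list of config dicts."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


candidates = expand_grid(grid)
print(len(candidates))
print(candidates[0])
```

Each configuration would then be trained on the training subset and compared on the validation subset.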
You don’t need to get into the implementation details or the high-level math to start. As you continue to build models, though, you’ll better understand the nuances and idiosyncrasies of the algorithms and implementations if you take time to learn about the details. This GitHub repo from Alastair Rushworth provides a comprehensive list of free online resources you can use to build your ML knowledge.
There is a massive and growing ecosystem of tools for implementing ML algorithms. Ultimately, the correct choice depends on your data environment, deployment plan, and personal preference. Here are a few important considerations to help you choose the right tools for your needs:
- Look for tools that will handle much of the heavy lifting (implementing training, scoring, metric calculation and tracking, etc.) while providing transparency and giving you control to tweak and tune the algorithm's hyperparameters.
- Select tools that will fit your existing workflows and tech stack without much extra work.
When it comes to environments, you have many options. Running Python code locally on your laptop makes it easy to get started but can make it difficult to move the final model into your production environment.
For cloud development environments, there are two choices: virtual machines that simulate your laptop in the cloud, and data-science-specific platforms like Amazon SageMaker and Google Vertex AI. Both VMs and data science platforms will enable you to integrate with some of the cloud tools you already have set up. The data science platforms will also give you some out-of-the-box options for training and deploying models.
Finally, there’s a newer option designed specifically for folks without deep data science training: in-database ML development tools like Redshift ML and BigQuery ML. These tools train ML models directly within your warehouse, keeping modeling in the same environment as data prep and using a SQL-like syntax. The ease of training and deployment they provide makes them an attractive option for any analytics engineer looking to get started with ML.
Here are a few more tips to help as you begin to train and compare models:
- Focus on the model’s performance on unseen data (data not used for training). If you optimize for training subset accuracy, you’ll overfit the model. Validation performance will give you a better sense of how the model will generalize.
- Pay attention to internal fit metrics (ROC, AUC, AIC), but remember that ultimately the model needs to perform well on your business success metrics. For instance, in a case where the cost of customer churn is disproportionately high, it’s okay to pick a model that has a higher number of false positives and lower overall accuracy if it also captures more true positives.
- Be suspicious of the data if your results are at the extremes (very good or very bad). Very bad results might mean the data did not join properly or is missing an important feature. Very good results are a sign of possible leakage, for example, the inclusion of a feature that is only populated if the customer churns.
- Understand your base numbers. If an existing or easy-to-implement process can produce 70% accuracy, an ML solution that produces 75% accuracy may not be worth the effort.
- Use common sense and keep your models straightforward. Don’t fall into the trap of including arbitrary features because they slightly increase your accuracy. Increases from features that seem arbitrary are often due to random correlation with your target and will not translate to production.
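Two of the tips above, judging models on business metrics and knowing your base numbers, can be sketched on a tiny validation set. The labels, the two candidate models, and the per-error costs below are all made up for illustration. In practice the costs would come from the success metrics agreed in the define stage.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = churned)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a and p),
        "fp": sum(1 for a, p in pairs if not a and p),
        "fn": sum(1 for a, p in pairs if a and not p),
        "tn": sum(1 for a, p in pairs if not a and not p),
    }


def summarize(actual, predicted, cost_fn, cost_fp):
    """Report accuracy, recall, and a simple business cost for one model."""
    c = confusion_counts(actual, predicted)
    positives = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / len(actual),
        "recall": c["tp"] / positives if positives else 0.0,
        "business_cost": c["fn"] * cost_fn + c["fp"] * cost_fp,
    }


# Hypothetical validation labels and predictions.
actual   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
baseline = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # "nobody churns" base number
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # flags few customers
eager    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # catches every churner, more false alarms

# Assumed costs: a missed churner costs 1000, an unnecessary outreach costs 50.
results = {
    name: summarize(actual, preds, cost_fn=1000, cost_fp=50)
    for name, preds in [("baseline", baseline), ("cautious", cautious), ("eager", eager)]
}
for name, metrics in results.items():
    print(name, metrics)
```

Under these assumed costs, the eager model has the lowest accuracy of the two candidates but by far the lowest business cost, while the cautious model barely improves on the baseline.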
Deploy/monitor
Once you have your model performing well, you’re ready to deploy it. As long as you started the project with deployment in mind, this won’t be a heavy lift.
The deployment workflow (Prepare → Score → Save & Deliver) is similar to your model development workflow.
For preparation, you should be able to reuse your data prep pipelines with minimal adjustments/rework.
How you perform scoring will depend on where you developed your model. The in-warehouse tools deliver a big advantage here because they are already set up to score new data. If you didn’t develop your model in the warehouse, you’ll need to set up a new scoring pipeline to pass the data to the model and generate your predictive metrics (score). You’ll also need to consider the schedule for scoring. While it’s tempting to want data in real time, be realistic. In most situations, batch scoring on a set schedule will be enough. Match the schedule to the cadence you already determined in the define stage.
In-warehouse tools also make it easy to save the model output (predictions) in your warehouse. If you’re not using an in-warehouse tool, you’ll need to add a step to your scoring pipeline that writes your predictions to your warehouse.
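If you do need your own scoring pipeline, the batch flow of reading prepared features, scoring them, and writing predictions back could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the warehouse. The table names, column names, and the `score` function are hypothetical. A real pipeline would load a trained model and use your warehouse's own client.

```python
import sqlite3
from datetime import date


def score(features):
    """Stand-in for a trained model: returns a churn score between 0 and 1.

    This is a made-up rule for illustration, not a real model.
    """
    return min(1.0, 0.1 + 0.03 * features["days_since_last_login"])


def run_batch(conn, scored_on):
    """Read prepared features, score each row, and save the predictions."""
    rows = conn.execute(
        "SELECT customer_id, days_since_last_login FROM customer_features"
    ).fetchall()
    predictions = [
        (customer_id, score({"days_since_last_login": days}), scored_on.isoformat())
        for customer_id, days in rows
    ]
    conn.executemany(
        "INSERT INTO churn_predictions (customer_id, churn_score, scored_on) "
        "VALUES (?, ?, ?)",
        predictions,
    )
    conn.commit()
    return len(predictions)


# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_features (customer_id TEXT, days_since_last_login INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_features VALUES (?, ?)", [("cust_a", 2), ("cust_b", 40)]
)
conn.execute(
    "CREATE TABLE churn_predictions (customer_id TEXT, churn_score REAL, scored_on TEXT)"
)

written = run_batch(conn, scored_on=date(2023, 6, 1))
print(written, "predictions written")
```

Storing the scoring date with each prediction makes it possible to join predictions to actual outcomes later.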
With the data scored and saved, you’re ready to deliver it to the business for activation. Delivery can be as simple as downloading the data to an Excel workbook and emailing it to business users or as complicated as making it available via API. Practically, most delivery will be in the middle of those two extremes. At this stage, your existing analytics engineering skillset will help you determine the best delivery method for your use case. Beware of manual data dumps, especially when under pressure to “just get something out this one time.” We all know it’s almost never one time.
All predictive models will eventually need retraining. The values of input features can drift over time, or relationships can change (remember, you are changing the environment by deploying predictive data). So, once you begin producing predictive data, it’s important to monitor how well it performs. Build an ongoing report or dashboard to stay on top of model performance. It can take some work to ensure the actual results are captured and stored in your warehouse, but it’s worth it to catch performance drops before they reach an unacceptable level for your business users. Plus, you can use this performance data as training data when you rebuild the model.
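A monitoring report can start very simply: join saved predictions to actual outcomes, compute a metric per period, and flag periods that fall below an agreed level. The sketch below uses accuracy, monthly periods, and a 0.7 threshold. All three are assumptions for illustration, and the records are invented.

```python
def accuracy_by_period(records):
    """Compute accuracy per period from (period, predicted, actual) tuples."""
    totals = {}
    for period, predicted, actual in records:
        hits, count = totals.get(period, (0, 0))
        totals[period] = (hits + int(predicted == actual), count + 1)
    return {period: hits / count for period, (hits, count) in totals.items()}


def periods_needing_review(accuracy, threshold):
    """Return the periods whose accuracy fell below the threshold."""
    return sorted(period for period, acc in accuracy.items() if acc < threshold)


# Hypothetical predictions joined to actual outcomes (1 = churned).
records = [
    ("2023-01", 1, 1), ("2023-01", 0, 0), ("2023-01", 0, 0), ("2023-01", 1, 0),
    ("2023-02", 1, 1), ("2023-02", 0, 1), ("2023-02", 1, 0), ("2023-02", 0, 0),
]

accuracy = accuracy_by_period(records)
flagged = periods_needing_review(accuracy, threshold=0.7)
print(accuracy)
print("Review:", flagged)
```

The same joined records can feed the dashboard and, later, the retraining dataset.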
Start leveraging practical ML today
Using machine learning to drive better business outcomes has never been more important. While unsolved problems still demand the skill and expertise of data scientists, analytics engineers are perfectly positioned to begin unlocking practical ML for solved problems. The overview and the resources referenced here will help you get started. If you’re interested in tooling to support your efforts, RudderStack could be a good fit.
At RudderStack, we’re focused on enabling data teams to focus on their strengths so they can do their best work and create more business value. Our Predictions product makes it easier to deliver solutions for solved ML problems from your warehouse. It’s a great way for analytics engineers to get started working on ML projects. If you’re interested, reach out to our team today to see it in action.