Let’s build a Salesforce Deal Predictor (Part 1)
Introduction
Salesforce has been the darling of the CRM world for as long as I can remember. Businesses of all shapes and sizes use it to manage their sales processes. For some time their Einstein Opportunity Scoring product has been predicting outcomes of deals, but anecdotally the clients I have spoken to have been underwhelmed by its performance. In this article we explore an alternative way of generating predictions on Salesforce opportunities. The point is to give you a sense of how I like to tackle this type of challenge, demonstrating that the solution is a combination of data and AI.
The Task
What we’re going to do is build an AI system that will give us a score out of a hundred for any open opportunity that we choose. The score will indicate how likely the AI thinks it is that the opportunity will be closed and won. The higher the score, the more likely it is that the opportunity will be won.
The Approach
To create our AI solution we need to create a “model”, which is something that I have discussed many times before. Models come in all sorts of shapes and sizes, but our model today is called a Classifier (we classify whether the opportunity will be “Won” or “Lost”). Unlike the Salesforce Einstein Opportunity Scoring solution, we are going to build a neural network based model. Neural networks today underpin the overwhelming majority of AI systems, from ChatGPT-like LLMs to radiography image detectors. Neural networks understand numbers and nothing else, so we will need a processing pipeline that will transform our text-heavy Opportunity into numbers that the model can understand and produce predictions for. AI models need to be trained, which we do by feeding them lots of data and then marking their predictions. For us that means that we will provide our AI model with examples of past opportunities that we know the outcome of and then compare its predictions with the true outcome of each deal.
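As a rough illustration of what “transforming an opportunity into numbers” can mean, here is a minimal sketch in plain Python. The field names, the stage and lead source lists and the log scaling of the amount are all assumptions made up for the example, not the pipeline this series goes on to build.

```python
import math

# Hypothetical vocabularies; in practice these would come from the training data.
STAGES = ["Prospecting", "Qualification", "Proposal", "Negotiation"]
LEAD_SOURCES = ["Web", "Referral", "Partner"]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode(opportunity):
    """Turn one opportunity (a dict) into a flat list of floats."""
    features = [math.log1p(opportunity["amount"])]  # compress very large amounts
    features += one_hot(opportunity["stage"], STAGES)
    features += one_hot(opportunity["lead_source"], LEAD_SOURCES)
    return features


example = {"amount": 50000.0, "stage": "Proposal", "lead_source": "Web"}
vector = encode(example)
```

Each text field becomes a block of zeros with a single one in it, so the model only ever sees numbers.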
The Naive Approach
It might be natural to think the best way of achieving this is to take an extract of all the closed deals in the CRM system and use these closed deals to train our model (since we know what the actual outcomes were, this would be simple).
There are however many reasons why this isn’t such a great idea, and to understand why we need to dive into the brain of our AI system and consider what task we are asking it to perform. Deals have a lifecycle to them. They are born full of hope and promise, with expectation of what they might go on to achieve. They grow up as they become better qualified: amounts are budgeted, needs are qualified, decision makers are brought on board and timing is established. In B2B businesses there can be a lot of ebb and flow. It’s not necessarily a straight linear path through the sales process and there can be hiccups along the way. However, as with death and taxes, the fate of every opportunity is guaranteed: it is either “won” or “lost”. The problem is that we want to make predictions about our opportunities while they are in the prime of their life, and a closed record only shows us the deal as it looked at the very end. So if digging up past deals from the graveyard and using them to train our AI system feels not quite right, you’d be right to think that way.
Lifecycle
Let’s think about a different scenario that we may be more familiar with. The Tour de France starts this weekend and one of the iconic images that I can remember from watching it is the peloton riding on roads flanked with sunflowers on either side. We like to grow them here in our garden and, being an annual plant, they have a lifecycle that lasts less than a year. They start out as seedlings, become young plants, and come into flower for a few weeks in late summer before eventually dying off when the days become shorter and colder. Now, if I were to take a photo of a plant in the autumn, could I then use it the following spring to guess how big next year’s seedlings would eventually become as mature plants? No. What you really need is a photo of last year’s plants as seedlings, along with the details of how large they became. Take enough photos of different plants and then the following year you compare each seedling to the photo from last year that best matches it, look up the details of its eventual size and, hey presto, you’ve got a prediction.
Recreating the life of a deal with Opportunity History
Enough with the horticultural references; let’s think about what this means for Opportunities in Salesforce. We need snapshots of opportunities as they were when they were young saplings, and here we hit a snag because Salesforce doesn’t automatically create a nice family album that we can refer to. Records get updated, and when they do, the details of what they were (their previous state) are overwritten and lost. There is however a native object called Opportunity History that preserves some of the key pieces of information, like Amount, Close Date, Stage and Probability, giving us the ability to track changes to these specific fields. If we combine this with the information from the opportunity that doesn’t change, like created date, lead source and account industry, we should be able to reconstruct lots of records of an opportunity as it looked throughout its life: our very own Opportunity photo album. Is it complete? No, but if we’re lucky then it might just give us enough information to create an AI system that can tell us what fate will come of our young sapling Opportunities.
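To make the photo album idea concrete, here is a minimal sketch in plain Python. It assumes the Opportunity and its Opportunity History rows have already been exported into dictionaries; the key names, the sample values and the “Closed Won” / “Closed Lost” stage names are illustrative assumptions rather than a fixed schema.

```python
from datetime import date

# Static fields from the Opportunity record (assumed not to change over its life).
opportunity = {
    "id": "006A",
    "created_date": date(2024, 1, 10),
    "lead_source": "Web",
    "industry": "Retail",
    "outcome": "Won",  # known because the deal is already closed
}

# One row per tracked change from Opportunity History (illustrative values).
history = [
    {"changed_on": date(2024, 1, 10), "stage": "Prospecting",
     "amount": 10000.0, "close_date": date(2024, 3, 31), "probability": 10},
    {"changed_on": date(2024, 2, 10), "stage": "Proposal",
     "amount": 8000.0, "close_date": date(2024, 4, 30), "probability": 50},
    {"changed_on": date(2024, 4, 20), "stage": "Closed Won",
     "amount": 8000.0, "close_date": date(2024, 4, 20), "probability": 100},
]

CLOSED_STAGES = ("Closed Won", "Closed Lost")  # assumed stage names


def build_snapshots(opportunity, history):
    """Return one training row per point in time at which the deal was still open."""
    static = {k: v for k, v in opportunity.items() if k not in ("id", "outcome")}
    label = 1 if opportunity["outcome"] == "Won" else 0
    snapshots = []
    for row in sorted(history, key=lambda r: r["changed_on"]):
        if row["stage"] in CLOSED_STAGES:
            continue  # the closing row is the answer, not a photo of an open deal
        snapshots.append({**static, **row, "label": label})
    return snapshots


snapshots = build_snapshots(opportunity, history)
```

Every snapshot carries the same final outcome as its label, which is what lets the model learn from deals as they looked while they were still open.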
Adding more information
Can we add more information from the opportunities? Well, the answer is maybe, and at this point the decision about what more information we can use becomes specific to each case. We have to keep in mind that we are trying to recreate opportunities as they appeared when they were open. If your business assigns deals to one owner and it’s highly unusual to transfer deals to another rep, then add in Opportunity Owner. What about Next Steps? Well, no, definitely not. That’s just going to give us the next steps as they were when the deal was closed, so not a good choice. Same with Last Modified Date: knowing when a closed deal was last modified is not going to help us make predictions for open deals.
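One simple way to enforce this is an allow-list of fields. The sketch below is only an illustration: the field names and the contents of the list are assumptions, and the right list depends on how your own business uses each field.

```python
# Fields we trust to look the same on an open deal as on a closed one (assumed list).
SAFE_FIELDS = {"created_date", "lead_source", "industry", "owner"}


def keep_safe_fields(record, safe_fields=SAFE_FIELDS):
    """Drop every field that is not explicitly on the allow-list."""
    return {name: value for name, value in record.items() if name in safe_fields}


closed_deal = {
    "lead_source": "Web",
    "industry": "Retail",
    "owner": "Sam",
    "next_step": "Send contract",        # reflects the deal at close, so excluded
    "last_modified_date": "2024-03-02",  # same problem, so excluded
}
cleaned = keep_safe_fields(closed_deal)
```

An allow-list fails safe: a new field added to the export is ignored until someone decides it belongs.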
Enriching the data
If you’re feeling a little underwhelmed because all of a sudden you can’t use the 500 fields that you track on your Opportunity, then fear not: we have another trick up our sleeve that’s going to give our data a little boost.
We can squeeze more information out of our Opportunity History and for any opportunity we can calculate things like:
- Stage Duration (by tracking how much time has elapsed between changes to the stage)
- Slippage (by tracking increases to the close date)
- Bring-forwards (by tracking reductions in the close date)
- Discounts (by tracking reductions in the amount)
- Price Increases (by tracking increases in the amount)
- Backwards moves (by tracking changes of the stage to an earlier stage in the sales process)
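Here is a minimal sketch of how these could be counted from the history rows of a single opportunity, using plain Python. The row layout, the stage order and the choice to measure stage duration as days spent in the current stage are all assumptions for the example.

```python
from datetime import date

# Assumed stage order; replace with your own sales process.
STAGE_ORDER = {"Prospecting": 0, "Qualification": 1, "Proposal": 2, "Negotiation": 3}


def history_features(history):
    """Summarise changes across rows sorted oldest first; the last row is the snapshot."""
    counts = {"slippages": 0, "bring_forwards": 0, "discounts": 0,
              "price_increases": 0, "backward_moves": 0}
    stage_entered = history[0]["changed_on"]
    for previous, current in zip(history, history[1:]):
        if current["close_date"] > previous["close_date"]:
            counts["slippages"] += 1
        elif current["close_date"] < previous["close_date"]:
            counts["bring_forwards"] += 1
        if current["amount"] < previous["amount"]:
            counts["discounts"] += 1
        elif current["amount"] > previous["amount"]:
            counts["price_increases"] += 1
        if current["stage"] != previous["stage"]:
            stage_entered = current["changed_on"]
            if STAGE_ORDER[current["stage"]] < STAGE_ORDER[previous["stage"]]:
                counts["backward_moves"] += 1
    counts["days_in_current_stage"] = (history[-1]["changed_on"] - stage_entered).days
    return counts


history = [
    {"changed_on": date(2024, 1, 10), "stage": "Prospecting",
     "amount": 10000.0, "close_date": date(2024, 3, 31)},
    {"changed_on": date(2024, 1, 25), "stage": "Qualification",
     "amount": 10000.0, "close_date": date(2024, 3, 31)},
    {"changed_on": date(2024, 2, 10), "stage": "Qualification",
     "amount": 8000.0, "close_date": date(2024, 4, 30)},   # discount and slippage
    {"changed_on": date(2024, 2, 20), "stage": "Prospecting",
     "amount": 8000.0, "close_date": date(2024, 4, 15)},   # backward move, brought forward
    {"changed_on": date(2024, 3, 1), "stage": "Prospecting",
     "amount": 9000.0, "close_date": date(2024, 4, 15)},   # price increase
]
features = history_features(history)
```

To build features for an earlier snapshot, pass in only the history rows up to that point so that nothing from the deal’s future leaks in.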
Then we can look at the dates we have. At a minimum we’ll have the Opportunity Created Date and the Close Date. From here we can calculate additional date features like:
- Month 
- Quarter 
- Week 
- Is Month End 
- Is Quarter End 
- Day of Week 
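These can all be derived with Python’s standard library. The sketch below assumes calendar quarters, ISO week numbers and “month end” meaning the last calendar day of the month; a business with a different fiscal year would need to adjust it, and the `prefix` naming is just a convention chosen for the example.

```python
import calendar
from datetime import date


def date_features(d, prefix):
    """Derive simple calendar features from a single date."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    is_month_end = d.day == last_day
    return {
        f"{prefix}_month": d.month,
        f"{prefix}_quarter": (d.month - 1) // 3 + 1,
        f"{prefix}_week": d.isocalendar()[1],
        f"{prefix}_is_month_end": is_month_end,
        f"{prefix}_is_quarter_end": is_month_end and d.month % 3 == 0,
        f"{prefix}_day_of_week": d.weekday(),  # Monday is 0, Sunday is 6
    }


close_features = date_features(date(2024, 3, 31), "close_date")
```

Calling the same function on the Created Date with a different prefix doubles the number of features at no extra cost.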
We can even squeeze more out by comparing different date fields to give us:
- Estimated Final Deal Age (Close Date - Created Date)
- Current Deal Age (Opportunity History Created Date - Created Date)
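Both are simple date subtractions. A minimal sketch, measuring age in days and using hypothetical argument names, where `snapshot_date` stands for the Opportunity History Created Date of the snapshot in question:

```python
from datetime import date


def age_features(created_date, close_date, snapshot_date):
    """Return both deal ages in days for one snapshot of an opportunity."""
    return {
        "estimated_final_deal_age": (close_date - created_date).days,
        "current_deal_age": (snapshot_date - created_date).days,
    }


ages = age_features(
    created_date=date(2024, 1, 10),
    close_date=date(2024, 3, 31),
    snapshot_date=date(2024, 2, 10),
)
```

Here the Close Date is the one recorded on the snapshot, so the estimated final age changes whenever the rep moves the close date.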
We can easily add an additional 20 features or more and this is going to provide our AI model with valuable signals that are going to help it make accurate predictions - this is the secret sauce that’s going to transform our model from ok to “next-level” or “game-changing” or whatever term the marketing folks like to use these days.
Imputation
One final step before we’re done with the data. In the real world data is never perfect and one problem that we have to address is missing data. Our AI model just will not work if we include missing data, so we have to fill it in with best guesses and the word that we give that process is Imputation. I’m sure that there are plenty of heuristics that could be employed to come up with ever more accurate missing values and I’ve experimented a lot over the years, but I keep coming back to the same formula.
- Median values for numerical fields like Amount (the median is the middle value when the numbers are sorted)
- Mode values for anything else like Lead Source (the mode is the most frequently occurring value in a sequence)
This means that we calculate these values for each field on the data that we are going to use for training and store them somewhere, so that we can then use those pre-calculated values when it comes to making predictions, either during testing or in a live environment.
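A minimal sketch of that fit-once, reuse-later pattern in plain Python. Missing values are assumed to be represented as `None`, and the function and field names are made up for the example.

```python
from statistics import median, mode


def fit_imputer(rows, numeric_fields, categorical_fields):
    """Learn one fill value per field from the training rows only."""
    fill = {}
    for field in numeric_fields:
        fill[field] = median(r[field] for r in rows if r[field] is not None)
    for field in categorical_fields:
        fill[field] = mode(r[field] for r in rows if r[field] is not None)
    return fill


def apply_imputer(row, fill):
    """Replace missing values in one row with the stored fill values."""
    return {k: (fill[k] if v is None and k in fill else v) for k, v in row.items()}


training_rows = [
    {"amount": 1000.0, "lead_source": "Web"},
    {"amount": 3000.0, "lead_source": None},
    {"amount": None, "lead_source": "Web"},
    {"amount": 8000.0, "lead_source": "Referral"},
]
fill_values = fit_imputer(training_rows, ["amount"], ["lead_source"])

# Later, at test time or in production, reuse the stored values as they are.
live_row = apply_imputer({"amount": None, "lead_source": None}, fill_values)
```

The important part is that `fill_values` is computed from the training data alone and then saved, so a live opportunity is filled in exactly the same way as the rows the model learned from.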
Next Time
Once we’ve completed these steps we’ll have a champion dataset that we can use to train our AI system and so next time we’ll dive into the details of how we define, train and serve a Deep Learning model that will produce scores for live opportunities.
Struggling to predict your deals and to know which ones to pursue and which ones to dump? Let us know, we’d love to help.


