Are you curious about the application of Machine Learning to an Ad Tech ecosystem? Do you want to learn about the application of Machine Learning to prioritize fanning-out requests to downstream partners? In this blog we describe our prototype for assessing the usefulness of an impression using Neural Networks with Entity Embeddings
As an avid consumer of news, you’ve just opened an article on a publisher’s site (say CNN), Sharethrough gets a request (called an impression) indicating that there is an opportunity to show an Ad (because we have integrations with publishers). We then fan-out this incoming request from your browser to our downstream partners called DSPs (Demand Side Platforms: entities which have Ads to show, and they maintain integration/relations with brands and agencies). Some might respond back with an Ad and price they are willing to pay for the Ad to be shown to you on CNN. We then conduct a second price auction and return the winning Ad to you.
Sharethrough is successful in showing an Ad about 40% of the time and a given partner DSP bids on a request less than 5% of the time. From  and similar discussions with DSPs, our partners would prefer not to receive requests that will eventually be discarded (ie resulting in a no-bid). This presents an opportunity to predict if a given request is worth it for each of our partner DSPs.
Due to a widely adopted recent industry trend called header bidding, DSPs get inundated with duplicate requests for the same impression from a few players similar to Sharethrough.
The volume of undesirable Bid Requests is a waste of computational resources and has a similar effect to a Distributed Denial-of-Service attack (DDoS) . DSPs monitor the utilization of their partners (like Sharethrough) and one metric of particular interest, is the percentage of requests they decide to bid on. If we send only send requests to DSPs when they are likely to bid, then we can forge better partnerships in addition to saving trees.
This is a classic Machine Learning Classification problem. For each incoming impression, Sharethrough has to determine if each of our partner DSPs are likely to bid on the bid request we send them. Addressing any Machine Learning problem involves understanding and working through three main components:
- Loss Function and Metrics
We work with the data available at the time of decision making, i.e the data should only include features that are available in the bid request we send to DSPs. Some example features are DSP name, Geocode, Timestamp and country. On average, we receive a bid about 5% of the time, with some variability by DSP. This leads to a class imbalance problem in our data which can be mitigated by some techniques. After empirically determining that we needed to address the class imbalance problem (based on results), we decided to try a technique called downsampling. By reducing the dataset with downsampling, it made our Machine Learning jobs more manageable. For context, we receive more than 1 billion impression requests per day, which generates a lot of data, and have found that this technique worked well for us.
To get baseline performance, we first used a Linear Model. After getting our baseline, we created tree-based models to better understand our data by determining feature importance. This allowed us to discard redundant features. For example, country and geocode are two similar features, but we decided to keep geocode. The reasoning was:
- Ranked higher in feature importance
- Empirically showed better results
- Logically captured more granular information
Most of our input features were categorical variables (like DSP name and geocode, both of which have a fixed number of known values). The most common representation of categorical variables is One Hot Encoding. One Hot Encoding has a few shortcomings:
- If the number of possible values (cardinality) of categorical variables is high, it’s computationally expensive
- When we have multiple categorical variables (e.g geocode and DSP name in our case), these are treated independently and thus ignoring relationships between them.
It was shown in a Kaggle competition  that by use of a technique called Entity Embeddings to represent categorical variables, the model learns intrinsic properties of categorical variables, uses less memory and speeds up the training compared to one hot encoding. They won third place in the contest with some relatively simple features and documented their findings in a paper 
Entity Embeddings essentially helps us represent a categorical variable in a higher dimension. Let’s say we have a categorical variable called DayOfWeek (Sunday through Saturday). This can be represented in a higher dimension by a matrix of size 7 x
number_dimension, 7 is the number of possible values (n) of the variable DayOfWeek and
number_dimension is the number of dimensions in which we want to represent this categorical variable. In theory the dimension size can be between 1 and n - 1 but in practice, one can run experiments to get this hyper parameter (
(n+1)/2 is a good starting point).
Note that these Entity Embeddings are learnt. They are initially randomly set (org using optimized techniques like He Weight Initialization) and learnt during the course of training.
Neural Network Setup
Our neural network setup has two hidden layers that come after the input layer, but before the output layer. We use entity embeddings to represent the categorical variables. These layers are of size 1000 and 500 respectively which were chosen empirically (also used in ). Each hidden layer has an associated ReLU (which is essentially
max(0, input)) and Dropout operations. The output layer is log_softmax which outputs the logarithm of probability (of a partner bidding on this request). Logarithm is used for numerical stability reasons as recommended by ML community.
Using Entity Embeddings
Let’s zoom into one row or one observation of data. For simplicity, assume that one row of data contains three features, two of which are straight up numbers (continuous variables) and one is a categorical variable DayOfWeek (represented by a matrix of size say
7x4). This is how input for this sample row of data is constructed:
- Continuous variables go as-is (normalization is a usual practice but it doesn’t change dimensionality)
- Let’s say for this row of data, DayOfWeek is Monday, then we would lookup second row from our
7x4. Since the matrix has four columns, we would get four numbers corresponding to Monday. We would then add these numbers to input. Hence our input for this sample row will have 2 (from continuous) + 4 (from Embedding representation of categorical variable) = 6 numbers.
And we do this for each row of input data. Using frameworks like PyTorch, this operation can be done efficiently for a number of rows (batch) at once on a GPU.
The loss function used is Negative Log Likelihood which is a common loss function for binary classification problem. Note the last layer of the Neural Network was
Log_Softmax and hence the use of
Negative Log Likelihood. If one has a
Softmax as an output layer then the loss function would have been
CrossEntropyLoss instead. While mathematically
Log_Softmax equivalent to
log(softmax(x)), doing these two operations separately is slower, and numerically unstable. A
Log_Softmax function will use an alternative formulation to compute the output and gradient correctly .
Lets understand the four scenarios of prediction before introducing our metric of choice:
- True Positive (TP).
- True Negative (TN).
- False Positive (FP).
- False Negative (FN).
These can be represented in a confusion matrix. Read this to better understand these metrics.
There are few important metrics we can derive from this:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
If we were to predict perfectly, then we would send bid requests only around 5% of the time to each DSP when they do want to bid. But, we are happy to send out way more bid requests to DSPs to avoid false negatives i.e cases when we incorrectly predict that a given DSP would not bid on an impression. So, we want to optimize for reducing False Negatives (FN) which leads our metric of choice as Recall while we keep % of futile bid requests (predicted) to less than certain X% (X can be set based on our business considerations)
Implementation details and Train / Test methodology
PyTorch is used to build the model, and we use a library called Fastai which uses PyTorch to greatly enhance its ease of use. It provides convenient utilities to: read data, create neural network architecture with well researched defaults and to run a training loop. We used an AWS p2.xLarge GPU instance for training. The training was done on one week of data with a few hours of data kept aside for the validation set and tested on data sampled from the following week. We achieved a recall of 95-97% when we predicted 10-15% of requests to a DSP were not worthy. We found that our model needed to be updated at least every three days.
With the advent of user friendly libraries like PyTorch and Fastai, and with the ML community generously sharing winning solutions on Kaggle, it’s becoming increasingly accessible to build state of the art models.