Many complex, almost instantaneous ad processes take place whenever we visit a typical website. The ads we see are selected in less than 100 milliseconds through specialized auctions in which advertisers bid on ad impressions and publishers sell them.

An important part of an optimal bidding strategy in these ad auctions is the calculation of bid prices based on the estimated click-through rate (CTR) of a given ad impression. CTR prediction is thus one of the key methods in the real-time bidding (RTB) ecosystem.

In this article, we will build two machine learning models for predicting whether an ad will be clicked or not. We will use a data set from a Kaggle competition by Avazu – “Click-Through Rate Prediction”, available at https://www.kaggle.com/c/avazu-ctr-prediction. It contains 11 days’ worth of ad click data.

Our data set contains the following features:

– id: ad identifier

– hour: in the format YYMMDDHH, so 14091123 means 23:00 UTC on September 11, 2014

– C1: anonymized categorical variable

– banner_pos

– site_id

– site_domain

– site_category

– app_id

– app_domain

– app_category

– device_id

– device_ip

– device_model

– device_type

– device_conn_type

– C14–C21: anonymized categorical variables

– click: 0 for non-click, 1 for click (our target variable)

Before using the data set for training, we shuffle its rows with the bash command “shuf -o train_shuffled.csv < train.csv”.

We start by importing the libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sstats
from datetime import datetime
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.feature_extraction import FeatureHasher
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from fastFM.mcmc import FMClassification

To better understand our data set, we first load 100,000 rows for initial exploratory data analysis:

dataset = r'train_shuffled.csv'
df = pd.read_csv(dataset, nrows=100000)
df = df.drop('id', axis=1)  # we do not need column id

Let us first examine the imbalance in our data set:

print("Imbalance ratio: {}".format(float(len(df[df['click']==0]))/len(df[df['click']==1])))

Imbalance ratio: 4.87

Our data set displays some imbalance, with non-clicked ads representing the majority of data instances.

Next, we calculate the CTR in our ads sample:

print("CTR is {}%".format(100.0*df['click'].sum()/len(df)))

CTR is 17.144%

Prior to training, we perform feature engineering, restricted to the datetime column:

# parse YYMMDDHH values into datetime objects
df['hour'] = df['hour'].map(lambda x: datetime.strptime(str(x), "%y%m%d%H"))
# extract the day of week before overwriting 'hour' with the hour of day
df['day_of_week'] = df['hour'].map(lambda x: x.weekday())
df['hour'] = df['hour'].map(lambda x: x.hour)

Next, we want to qualitatively examine the CTR for different features, but we will limit visualizations to categorical features that do not have too many distinct values and are not anonymized.

df.nunique()

Based on the number of unique values, we visually examine the CTR of the following features:

highlighted_features = ['banner_pos', 'device_type', 'device_conn_type', 'day_of_week', 'hour']

sns.set_context("paper", font_scale=1.3)
sns.set_style("darkgrid")
width = 0.35
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 15))

# plot the first four features on a 2x2 grid; 'hour' is analysed separately below
for index, feature in enumerate(highlighted_features[:-1]):
    x = index // 2
    y = index % 2
    # counts of clicked / not clicked ads per feature value
    df1 = df.groupby([feature, 'click'], sort=True).size().unstack(fill_value=0)
    df1.columns = ['not clicked', 'clicked']
    indices = range(len(df1.index.values))
    ctr = [float(value) / (df1['not clicked'].values[i] + value) for i, value in enumerate(df1['clicked'].values)]
    axes[x, y].bar(indices, df1['not clicked'].values, width, label='not clicked')
    axes[x, y].bar(indices, df1['clicked'].values, width, bottom=df1['not clicked'].values, label='clicked')
    axes[x, y].set_xticks(indices)
    axes[x, y].set_xticklabels(tuple(df1.index.values))
    axes[x, y].set_ylabel('Number of ads / CTRs')
    axes[x, y].set_title('CTR for ' + feature)
    axes[x, y].legend()
    # annotate each stacked bar with its CTR
    for p in axes[x, y].patches:
        if p.get_height() in df1['not clicked'].values:
            axes[x, y].annotate('CTR=' + format(100 * ctr[np.where(df1['not clicked'].values == p.get_height())[0][0]], '.1f') + '%',
                                (p.get_x() + p.get_width() / 2., p.get_height() * 1.2),
                                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.show()

Figure 1: CTR for different feature values

The charts show that there is a considerable difference in click-through rates depending on device type, device connection type and banner position. CTR is also not uniform with respect to the day of the week on which the ad is displayed.

Next, we analyse the CTR for different hours of the day.
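One possible minimal sketch for producing this chart, computing the CTR per hour of day from the sample loaded above (a plain bar chart rather than the stacked layout of Figure 1):

# CTR per hour of day, using the engineered 'hour' column (values 0-23)
ctr_by_hour = df.groupby('hour')['click'].mean()

plt.figure(figsize=(12, 5))
plt.bar(ctr_by_hour.index, 100 * ctr_by_hour.values)
plt.xlabel('hour of day')
plt.ylabel('CTR (%)')
plt.title('CTR for different hours of the day')
plt.xticks(list(ctr_by_hour.index))
plt.show()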

Figure 2: CTR for different hours of the day

Certain hours of the day, such as 1, 7 and 17, exhibit a higher than average CTR, though they are not clustered together. A lower than average CTR is, on the other hand, registered in the hours between 17 and 19.

Next, we examine the correlation between our categorical features, using the Cramer’s V measure for this purpose.
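Specifically, for a contingency table with r rows, k columns and n observations, we compute the bias-corrected variant (Bergsma’s correction), which is what the helper function below implements:

$$\tilde{\varphi}^2 = \max\!\left(0,\; \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad \tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad \tilde{k} = k - \frac{(k-1)^2}{n-1}$$

$$\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}$$

In code: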

categorical_features = highlighted_features

def cramers_v_with_bias_correction(x, y):
    """Bias-corrected Cramer's V between two categorical series."""
    cm = pd.crosstab(x, y).values         # contingency table
    chi2 = sstats.chi2_contingency(cm)[0]
    n = cm.sum()                          # total number of observations
    phi2 = chi2 / n
    r, k = cm.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

cramers_v = np.zeros((len(categorical_features), len(categorical_features)))
for i, feature1 in enumerate(categorical_features):
    for j, feature2 in enumerate(categorical_features):
        cramers_v[j, i] = cramers_v_with_bias_correction(df[feature1], df[feature2])
cramers_v = pd.DataFrame(cramers_v, index=categorical_features, columns=categorical_features)

plt.figure(figsize=(14, 12))
plt.title("Cramer's V of Categorical Features", y=1.04, size=20)
sns.heatmap(cramers_v, linewidths=0.15, square=True, vmax=1.0, cmap=plt.cm.BuGn, linecolor='white', annot=True);

Figure 3: Cramer’s V of categorical features

When looking at non-anonymized features, correlations are rather low, except between device type and banner position.

2. CTR prediction using SGD and the hashing trick

After this introduction to our data set, we will build our first model to predict whether a given ad will be clicked by the user or not. The Avazu data set consists of around 40 million rows; the corresponding .csv file has a size of 6 GB.

We will treat our problem as an out-of-core learning task, where we do not load the entire training data set into memory. Instead, we read and process the data in chunks, updating the weights of our ML model with each chunk.

Stochastic Gradient Descent

To achieve this online learning, we will use the Stochastic Gradient Descent (SGD) method, with updates to the model made for a small portion of the data at a time. Scikit-learn has several excellent online learning algorithms available; we will use SGDClassifier.

SGDClassifier has a training complexity of O(knp), where k is the number of epochs, n is the number of data instances and p is the average number of non-zero features per sample. Its training time thus increases linearly with the size of the training data. Another advantage is its fixed memory footprint, which makes it highly scalable.

In order to use SGDClassifier as an out-of-core learner, we must call the method partial_fit instead of fit. partial_fit requires that all classes we want to predict are known in advance (which may require a pass through the data set) and passed via the classes parameter. In our case, there are only two of them – 0 and 1.
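A toy illustration of this pattern (with synthetic data, not the Avazu set):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')          # named 'log_loss' in scikit-learn >= 1.3
all_classes = np.array([0, 1])           # all classes must be known up front

for _ in range(5):                       # pretend these are successive data chunks
    X_chunk = np.random.rand(1000, 3)
    y_chunk = np.random.randint(0, 2, size=1000)
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)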

Hashing trick or feature hashing

Categorical features in our data set have millions of unique values. To manage this high cardinality, we will use feature hashing, also known as the hashing trick. Feature hashing maps sparse, high-dimensional features to a smaller feature vector of fixed size.

It is very useful in out-of-core learning because it does not require a pass over the complete data set: feature indices are computed directly from feature values using a hash function. Different hash functions can be used; a good hash function has the following properties:

  • the hash value is completely determined by the input value – the same input always produces the same hash value (it is deterministic)
  • the resulting hash values are spread as uniformly as possible across the hash value space
  • the hash function uses all of the input data

When the dimension of the hashed vector is smaller than the initial feature space, the hash function can map different input values to the same hash value, which is known as a “collision”. Studies show, however, that the impact of collisions on the predictive performance of models is limited, see e.g. https://arxiv.org/pdf/0902.2206.pdf.

Although feature hashing has several advantages, it also has some drawbacks. One is that we need to set the size of the hashing space explicitly. Another is the reduced interpretability of models that use feature hashing, a consequence of the information lost when the hash function is applied: although the same input value always leads to the same hash value, we cannot perform a reverse lookup and recover the exact input value from the hash value.

This means that when we determine the most important features of our model using e.g. XAI (Explainable AI) tools such as SHAP or LIME, we will have difficulty mapping them back to the initial feature set. Applying feature hashing can thus reduce the interpretability of our model.

Feature hashing is implemented in several open source machine learning tools, such as Vowpal Wabbit and Scikit-learn. In our training, we will use FeatureHasher from scikit-learn.
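To make the mechanics concrete, here is a small standalone sketch (with made-up feature values) showing how FeatureHasher turns rows of categorical strings into a fixed-size sparse vector; with a deliberately tiny hashing space, collisions become likely:

from sklearn.feature_extraction import FeatureHasher

# deliberately tiny hashing space (16 buckets) to illustrate the idea;
# the training pipeline below uses 2**26 buckets
hasher = FeatureHasher(n_features=2**4, input_type='string')

rows = [
    ['site_a', 'app_x', 'device_1'],
    ['site_b', 'app_x', 'device_2'],
]
hashed = hasher.transform(rows)   # scipy.sparse matrix of shape (2, 16)
print(hashed.shape)
print(hashed.toarray())           # mostly zeros; hashed buckets get +/-1 (colliding values add up)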

Let us start by defining several variables and the SGD model:

train_file = dataset
chunk_size = 100000
train_size = 0.8
total_rows = 40428967
train_rows = int(train_size * total_rows)
test_rows = total_rows - train_rows
nr_iterations_train = round(train_rows / chunk_size)
nr_iterations_test = round(test_rows / chunk_size)

# note: in scikit-learn >= 1.3 the logistic loss is named 'log_loss' instead of 'log'
SGDmodel = SGDClassifier(loss='log', alpha=.000001, penalty='l2', eta0=2.0, shuffle=True,
                         n_jobs=-1, learning_rate='invscaling', random_state=10)

partial_fit_classes = np.array([0, 1])  # to be used in the partial_fit method

Defining an instance of FeatureHasher:

h = FeatureHasher(n_features=2**26, input_type='string')
preprocessor = Pipeline([('feature_hashing', h)])

 

Setting up the data readers:

columns = ['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category',
           'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

df_train = pd.read_csv(train_file, sep=',', chunksize=chunk_size, names=columns, header=0, nrows=train_rows)
df_test = pd.read_csv(train_file, sep=',', chunksize=chunk_size, names=columns, header=0, skiprows=train_rows, nrows=test_rows)

 

Defining a function for pre-processing the data chunks:

def preprocess_data(df):
    # target vector
    y_train = np.asarray(df['click'].values).ravel()
    # datetime feature engineering, as in the EDA above
    df['hour'] = df['hour'].map(lambda x: datetime.strptime(str(x), "%y%m%d%H"))
    df['day_of_week'] = df['hour'].map(lambda x: x.weekday())
    df['hour'] = df['hour'].map(lambda x: x.hour)
    # hash all remaining (categorical) features into a fixed-size sparse matrix
    X = df.drop(['id', 'click'], axis=1)
    X_train = preprocessor.fit_transform(np.asarray(X.astype(str)))
    return y_train, X_train

 

Starting the training:

print("Training...")

count = 0

logloss_sum = 0

for df in df_train:

    count += 1

    y_train, X_train = preprocess_data(df)

    SGDmodel.partial_fit(X_train, y_train, classes=partial_fit_classes)

 

Next, we compute the log loss on the test set:

print("Log loss on test set ...")

logloss_sum = 0

count = 0

for df in df_test:

    count += 1

    y_test, X_test = preprocess_data(df)

    y_pred = SGDmodel.predict_proba(X_test)

    log_loss_temp = log_loss(y_test, y_pred)

    logloss_sum += log_loss_temp

    if count % 10 == 0:

        print("Iteration {}/{}, test log loss: {}".format(count,nr_iterations_test,logloss_sum/count))

print("Final log loss: ", logloss_sum/count)

We obtain a final log loss of 0.3995.

3. Factorization Machines

Among the most popular machine learning models for CTR prediction are Factorization Machines (FM) and their later variants, including Field-aware Factorization Machines (FFM). Factorization Machines are a supervised machine learning algorithm, capable of both regression and classification, introduced by Steffen Rendle in https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf.

They are a generalization of linear models and can capture higher-order interactions between features. One approach to modelling higher-order interactions is to introduce mixed polynomial terms, as in the Poly2 model, but that requires O(n^2) parameters in the number of features n, which makes it prohibitively expensive in terms of both memory and time when applied to typical CTR data sets.

Rendle’s Factorization Machines reduce the computation of the polynomial terms to linear complexity. The interaction between features i and j is modelled with terms of the form w_ij * x_i * x_j, where w_ij is the inner product of the k-dimensional latent vectors associated with the two features. The FM approach learns such a k-dimensional latent vector for each feature, where k is a hyperparameter that must be specified by the user before training starts.
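Written out, the degree-2 FM model from Rendle’s paper takes the form

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j, \qquad \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f}\, v_{j,f}$$

where w_0 is the global bias, the w_i are the linear weights and v_i is the k-dimensional latent vector of feature i.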

We have thus replaced the learning of n×n interaction parameters with the learning of n×k parameters, making the number of parameters linear in n. In matrix terms, instead of a term x^T A x with an unconstrained matrix A (with n×n entries), we constrain A to the form V V^T with V of dimension n×k.

Factorization Machines are especially suitable for CTR prediction, where they can, for example, learn users’ click behaviour when ads from a specific category are shown on webpages of a specific category (a feature interaction).

There are several libraries that implement Factorization Machines, including libFM (Rendle’s reference implementation) and fastFM, which we use below.

Factorization Machines are usually trained using one of three main solvers – Stochastic Gradient Descent (SGD), Alternating Least Squares (ALS) or Markov Chain Monte Carlo (MCMC).

We trained our Factorization Machine model with the fastFM library, using the Markov Chain Monte Carlo (MCMC) solver, on a sample of 1 million rows from the Avazu data set:

rank = 9     # the number of hidden variables per feature, selected with hyperparameter tuning
n_iter = 12  # selected with hyperparameter tuning

df = pd.read_csv(dataset, nrows=1000000)
train_size = 0.8
train_size = int(train_size * len(df))

def fast_fm(trainX, testX, trainY, testY):
    # one-hot encode the categorical features
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    # note: depending on the fastFM version, class labels may need to be encoded as {-1, 1}
    clf = FMClassification(rank=rank, n_iter=n_iter)
    # the MCMC solver fits and predicts in a single call
    predictions = clf.fit_predict_proba(trainX, trainY, testX)
    print("fast fm log loss: ", log_loss(testY, predictions))
    return True

trainX = df.iloc[:train_size, ~df.columns.isin(['click'])]
trainY = df['click'].values[:train_size]
testX = df.iloc[train_size:, ~df.columns.isin(['click'])]
testY = df['click'].values[train_size:]

fast_fm(trainX, testX, trainY, testY)

 

The log loss of the resulting Factorization Machines model is 0.4022.

Conclusion

Real-time bidding (RTB) is an automated digital auction process that allows advertisers to bid on ad impressions and publishers to sell them. An important part of these ad auctions is the calculation of bid prices, which often relies on the estimated click-through rate (CTR) of a given ad impression. CTR prediction is thus one of the key methods in the RTB ecosystem.

In this article, we trained two machine learning models to predict clicks on ads, using a data set from Avazu consisting of 11 days’ worth of ad click data. The first was an out-of-core, online learner based on Stochastic Gradient Descent with feature hashing. The second was based on Factorization Machines; this class of models and its later variants, such as Field-aware Factorization Machines, are among the most popular models for CTR prediction.
