Factorization Machines from scratch

Real time bidding (RTB) has become a leading way of how publishers are selling ad impressions to the advertisers. Most of the RTB is automated, with programmatic auctions in the background determining who wins an impression on specific page.

Typical mechanism behind the ads that we see everyday on the websites follows roughly the following steps:

1) when a visitor visits a typical publisher’s site, the publisher sends a bid request with additional information (about visiting user, ad, etc.) to hundreds of thousands of advertisers than can bid on this impression using ad exchange,

2) each advertiser evaluates ad opportunity based on received information and usually using complex algorithms,

3) it then submits bid price to the auction and the ad exchange determines the selected advertiser with winner of auction usually paying the second highest bid.

All steps of RTB bidding are usually finished in less than 100 ms. The winning advertiser ad is then displayed on the publishers website, resulting in potential user response e.g. click on ad.

So-called Demand Side Platforms are most often bidding on behalf of advertisers on ad exchanges. Typical paying models for publishers are cost per impressions (CPM), cost-per-click (CPC) and cost-per-action (CPA) for conversions. To set proper bids for advertisers, an important part of process is estimating click-through-rate (CTR) of ad impressions.

To predict CTRs, RTB industry employs many different models. One of more advanced models are so-called Factorization Machines (FM) and we will explore them in more detail in this article.

Although efficient libraries are available for factorization machines such as libFM, we will start from scratch to gain a better understanding of FMs and try to implement FM by following the paper:

https://www.csie.ntu.edu.tw/~b97053/paper/Factorization%20Machines%20with%20libFM.pdf

Advantages of Factorization Machines

To better understand the advantages of the Factorization machines let us examine the model:

\( \begin{align*} \hat { y } ( \mathbf { x } ) = w _ { 0 } + \sum _ { j = 1 } ^ { p } w _ { j } x _ { j } + \sum _ { j = 1 } ^ { p } \sum _ { j = j + 1 } ^ { p } x _ { j } x _ { j ^ { \prime } } \sum _ { f = 1 } ^ { k } v _ { j , f } v _ { j ^ { \prime } , f } \end{align*} (1)\)

and consider a hypothetical cases of two publishers: Sueddeutsche, Sports Illustrated and two advertisers: Chicago Bulls and FC Bayern.

If our model had just the first two terms (i.e. linear regression):

\( \begin{align*} w _ { 0 }+w _ { Sueddeutsche }+w _ { Sports Illustrated }+w _ { Chicago Bulls}+w _ { FC Bayern } \end{align*} (2) \)

it would be unable to learn that FC Bayern fans are more likely to be among the readers of Sueddeutsche and Chicago Bulls fans likewise for Sports Illustrated. To remedy this problem we could introduce mixed polynomial terms in a model known as Poly2 but that model suffers from two problems. It is a \( O \left( n ^ { 2 } \right)\) problem which makes it prohibitively expensive both in terms of memory and time when dealing with typical CTR data sets.

And second, when applied on sparse matrix situations as are usually our CTR datasets, there are a lot of combinations for which there is no data to learn about and for those it then makes simple, mostly wrong predictions.

Let us rearrange the last part of (1) as follows:

\(\begin{align*} \frac { 1 } { 2 } \sum _ { f = 1 } ^ { k } \left[ \left( \sum _ { j = 1 } ^ { p } v _ { j f } x _ { j } \right) ^ { 2 } – \sum _ { j = 1 } ^ { p } v _ { j f } ^ { 2 } x _ { j } ^ { 2 } \right] \end{align*} (3) \)

which makes it an \( O (n*k) \) problem, linear in n. Or using matrix structure, instead of having a term \( x ^ { T}Ax\) with \(A\) unconstrained with \(n^ { 2 }\) possible values, we constrain the matrix \(A\) to form \( V^ { T}V \) with \( V \) of \( kn \) dimension.

If we now again consider the FM model and closely inspect the second part of (1) you can observe that even for pairs that may not have data available such as the Sueddeutsche – Chicago Bulls combination we can make some educated estimate about the weight of this pair from the weights of other pairs which have at least one member of the pair. This leads to improved results in comparison to simpler models.

Let us now implement FM from scratch. We will make some modifications of the algorithm from the paper to implement the hashing trick.

import numpy as np

sigma=0.002
learning_rate=0.02
iterations=2000
k=5
etha=0.02
lambda0=0.01
np.random.seed(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def logloss(p,y, eps=1e-15):
    pred = np.clip(p, eps, 1 - eps)
    if y == 1:
        return -np.log(pred)
    else:
        return -np.log(1 - pred)
    
def predict(x):
        s=w_0
        s+=np.dot(w, x)
        sums = 0.0
        for f in range(k):
            term1=0
            term2=0
            for j in range(p):
                prod=v[j,f]*x[j]
                term2+=prod**2
                term1+=prod
            sums+=0.5*(term1**2 - term2)
        s += sums
        return s

The main training part is:

X=load_data()
Y=load_clicks()
p=X.shape[1]
w = np.zeros(p)    
w_0=0
v = sigma * np.array(np.random.randn(p,k))
for i in range(iterations):
    for x,y in zip(X,Y):
        y_predict=predict(x)
      #  print(y_predict)
        partial_derivative=(sigmoid(y_predict*y)-1)*y
        w_0-=etha*(partial_derivative + 2*lambda0*w_0)
        for i in range(p):
            w[i]-=etha*(partial_derivative*x[i]+2*lambda0*w[i])            
            for f in range(k):
                partial_derivative_v=x[i]*(np.dot(v[:,f],x) - x[i]*v[i,f])*partial_derivative
                v[i,f]-=etha*(partial_derivative_v +2*lambda0*v[i,f])

For training and assesing the performance of the model we recommend Avazu dataset from Kaggle challenge and using grid search:

https://www.kaggle.com/c/avazu-ctr-prediction

Improvements

If you inspect the second part of equation (1) you can observe that Sueddeutsche interacts with FC Bayern with the same vector (\(v_{Sueddeutsche}\)) as when interacting with Chicago Bulls. An improvement is to have different vectors but to keep the size of the problem under control, the vectors differ with respect to only the family features. Features are thus grouped e.g. in categories of publisher, advertiser, user and timeline. The new terms are now of the form e.g. \(v_{Sueddeutsch,Advertiser}*v_{FC Bayern, Publisher}\). This is an approach known as Field-aware Factorization Machines.

While this approach leads to improved results in accuracy over simpler methods its main problem is that it has high memory usage and increased latency making it difficult to use for real-time bidding. Interesting approach to explore is the Field-weighted Factorization Machines which have only 4% number of parameters of Field-aware FM with competitive performance and could thus be more suitable for use in real-time production system. More about the Field-weighted Factorization Machines:

https://arxiv.org/pdf/1806.03514.pdf