CTR prediction using hashing trick, logistic regression, SGD – from scratch

The leading way for publishers to sell their ad inventory to advertisers is real-time bidding (RTB).

Most often it is so-called Demand Side Platforms (DSPs) that do the bidding on behalf of advertisers at ad exchanges. Payment models for publishers are usually cost per mille (CPM, i.e. per thousand impressions), cost per click (CPC) or cost per action (CPA). To set proper bid prices for advertisers, algorithms are employed whose main objective is to estimate the click-through rate (CTR) of ad impressions.
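To see why the CTR estimate matters for bidding, here is a purely illustrative back-of-the-envelope calculation (the numbers are made up, not taken from any dataset): under a CPC deal, the expected revenue per thousand impressions scales directly with the predicted CTR.

# Illustrative only: expected revenue per 1000 impressions under a CPC deal,
# assuming a hypothetical predicted CTR and a hypothetical price per click.
predicted_ctr = 0.012   # hypothetical 1.2% CTR estimate
cpc = 0.50              # hypothetical price per click
ecpm = predicted_ctr * cpc * 1000
print(ecpm)             # 6.0 per thousand impressions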

In this article we will explore a model for predicting CTR as part of an RTB system.

When training models for this kind of prediction we usually deal with data that has either integer or categorical features. As we want to build an online model with a minimal memory footprint, one-hot encoding is not an option; the technique we will rely on instead is called the hashing trick, or sometimes feature hashing.

One implementation is given below; we chose a particular hash function (feel free to experiment with your own):

import csv
import numpy as np

N = 2 ** 18            # size of the hash table (number of weights)
learning_rate = 0.02
weights = [0.0] * N

def fnv32a_hash(a):
    # FNV-1a 32-bit hash of a string
    offset = 0x811c9dc5
    fnv_32_prime = 0x01000193
    size = 2 ** 32
    for b in a:
        offset = offset ^ ord(b)
        offset = (offset * fnv_32_prime) % size
    return offset

def hashing(features, N):
    # map each feature value to an index into the weight vector
    # (note: only the field value is hashed, not the field name,
    # so identical values in different columns share a weight)
    x = []
    for index, f in features.items():
        h = fnv32a_hash(f)
        x.append(h % N)
    return x
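As a quick illustration (with a made-up feature dictionary, not a row from the actual dataset), hashing() simply maps every feature value to an index between 0 and N-1:

sample = {'banner_pos': '0', 'site_id': 'abc123', 'device_type': '1'}
print(hashing(sample, N))   # three indices in the range [0, N)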

N is a parameter denoting the size of the hash table; here we use 2**18, and values around 2**20 are a good size for typical problems (a larger table means fewer collisions at the cost of more memory).

The next step is to implement SGD minimisation of the log loss function and to carry out the training iterations:

def prediction(x, weights):
    # sparse dot product: every hashed feature has implicit value 1,
    # so w.x is just the sum of the weights at the hashed indices
    w_x_product = 0
    for i in x:
        w_x_product += weights[i]
    return 1 / (1 + np.exp(-w_x_product))

def logloss(p, y, eps=1e-15):
    # clip the prediction to avoid log(0)
    pred = np.clip(p, eps, 1 - eps)
    if y == 1:
        return -np.log(pred)
    else:
        return -np.log(1 - pred)
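For reference, the weight update in the training loop below is the standard SGD step for logistic regression with log loss: if p = 1/(1+exp(-w.x)) is the current prediction, the gradient of the loss with respect to w_i is (p - y)*x_i, and since each hashed feature is binary (x_i = 1 for the indices returned by hashing()), the per-example update reduces to w_i <- w_i - learning_rate*(p - y).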

#training
train_file = csv.DictReader(open("train.csv"))
counter = 0
current_progress = 0
for features in train_file:
    counter += 1
    del features['id']
    y = float(features['click'])
    del features['click']
    x = hashing(features, N)

    # compute the prediction once per example, then update every weight
    # touched by this example with the same gradient (p - y)
    p = prediction(x, weights)
    current_progress += logloss(p, y)
    for i in x:
        weights[i] -= (p - y) * learning_rate
    if counter % 10000 == 0:
        print("training, loss: ", round(current_progress / counter, 4))

The dataset examined in this article is from the Avazu Kaggle CTR competition (link).

After training, we checked our model against the test set:

#testing
test_file = csv.DictReader(open("test.csv"))
y_test = []
y_test_pred = []
logloss_sum = 0
counter = 0
for features in test_file:
    counter += 1
    y = float(features['click'])
    del features['click']
    del features['id']
    x = hashing(features, N)
    # reuse a single prediction per example for both the stored output and the loss
    p = prediction(x, weights)
    y_test.append(y)
    y_test_pred.append(p)
    logloss_sum += logloss(p, y)
    if counter % 10000 == 0:
        print("testing, logloss: ", round(logloss_sum / counter, 4))
print("testing, final logloss: ", round(logloss_sum / counter, 4))

The log loss result is 0.4129.

For comparison with the respective Kaggle competition results, please visit the link.