# CTR prediction using the hashing trick, logistic regression and SGD – from scratch

Most often it is the so-called Demand Side Platforms (DSPs) that bid on behalf of advertisers at ad exchanges. Payment models for publishers are usually cost per mille (CPM), cost per click (CPC) or cost per action (CPA). To set proper bid prices for advertisers, algorithms are employed whose main objective is to estimate the click-through rate (CTR) of ad impressions.

In this article we will explore a model for predicting CTR as part of a real-time bidding (RTB) system.

When training models for this kind of prediction we usually deal with data that has either integer or categorical features. Since we want an online model with a minimal memory footprint, one-hot encoding is not an option; instead we rely on a technique called the hashing trick, also known as feature hashing.

One implementation is given below, using the FNV-1a hash function (feel free to experiment with your own):

```python
import numpy as np

N = 2 ** 18          # size of the hashing table
learning_rate = 0.02
weights = [0] * N

def fnv32a(a):
    # FNV-1a 32-bit hash
    offset = 0x811c9dc5
    fnv_32_prime = 0x01000193
    size = 2 ** 32
    for b in a:
        offset = offset ^ ord(b)
        offset = (offset * fnv_32_prime) % size
    return offset

def hashing(features, N):
    # map each feature value to a slot index in the weight table;
    # note that only the value is hashed, so identical values in
    # different fields collide
    x = []
    for field, value in features.items():
        x.append(fnv32a(value) % N)
    return x
```

N is a parameter denoting the size of the hashing table; we use 2**18 here, while 2**20 is a good size for typical problems.
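Collisions are the price of the hashing trick: distinct feature values that hash to the same slot share a weight. As a rough way to gauge how N trades memory for collision rate, we can compute the expected fraction of values landing in an already-occupied bucket under uniform hashing (an illustrative calculation, not part of the model; the figure of one million distinct values is an assumption):

```python
def expected_collision_fraction(d, n):
    """Expected fraction of d distinct values that land in an
    already-occupied bucket of an n-slot table (uniform hashing)."""
    # expected number of occupied buckets after d insertions
    occupied = n * (1 - (1 - 1 / n) ** d)
    return 1 - occupied / d

# one million distinct values into 2**18 vs 2**20 buckets
for n in (2 ** 18, 2 ** 20):
    print(n, round(expected_collision_fraction(1_000_000, n), 3))
```

Quadrupling the table from 2**18 to 2**20 slots roughly halves the collision fraction at this scale, which is the memory/accuracy trade-off the parameter controls.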

The next step is to implement SGD to minimise the log loss function and carry out the training iterations:

```python
def prediction(x, weights):
    # sigmoid of the dot product; the hashed features are binary,
    # so the dot product reduces to a sum over the active indices
    w_x_product = 0
    for i in x:
        w_x_product += weights[i]
    return 1 / (1 + np.exp(-w_x_product))

def logloss(p, y, eps=1e-15):
    # clip the prediction to avoid log(0)
    pred = np.clip(p, eps, 1 - eps)
    if y == 1:
        return -np.log(pred)
    else:
        return -np.log(1 - pred)
```
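The SGD update in the training loop below follows directly from the gradient of the log loss. With binary hashed features, the prediction is the sigmoid of a sum over active indices, and differentiating the loss with respect to an active weight gives:

$$
\ell(p, y) = -y \log p - (1 - y)\log(1 - p), \qquad p = \sigma\Big(\sum_{i \in x} w_i\Big)
$$

$$
\frac{\partial \ell}{\partial w_i} = p - y \quad \text{for each active index } i \in x,
$$

which is why each active weight is decreased by `(p - y) * learning_rate`.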

```python
# training; train_file yields one dict per impression
counter = 0
current_progress = 0
for features in train_file:
    counter += 1
    del features['id']
    y = float(features['click'])
    del features['click']
    x = hashing(features, N)

    # SGD step: the gradient w.r.t. each active weight is (p - y)
    p = prediction(x, weights)
    for i in x:
        weights[i] -= (p - y) * learning_rate
    current_progress += logloss(prediction(x, weights), y)
    if counter % 10000 == 0:
        print("training, loss:", round(current_progress / counter, 4))
```
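The construction of `train_file` is not shown above; since the loop expects a stream of dicts with `id` and `click` keys, one way to produce it is `csv.DictReader` over a CSV file with a matching header (the file layout here is an assumption, not from the original):

```python
import csv

def rows(path):
    # stream a CSV file as one dict per row, keyed by the header;
    # reading lazily keeps memory usage constant, matching the
    # online-learning goal of the model
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            yield row
```

Each yielded row maps column names to strings, which matches the `del features['id']` and `float(features['click'])` accesses in the training loop.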

After training, we check the model against the test set:

```python
# testing
y_test = []
y_test_pred = []
logloss_sum = 0
counter = 0
for features in test_file:
    counter += 1
    y = float(features['click'])
    del features['click']
    del features['id']
    x = hashing(features, N)
    p = prediction(x, weights)
    y_test.append(y)
    y_test_pred.append(p)
    logloss_sum += logloss(p, y)
    if counter % 10000 == 0:
        print("testing, logloss:", round(logloss_sum / counter, 4))
print("testing, final logloss:", round(logloss_sum / counter, 4))
```

The resulting log loss on the test set is 0.4129.
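To put that number in context, a useful sanity check (not part of the original article) is the log loss of a constant predictor that always outputs the base click rate. If the dataset's average CTR were, say, around 17% (an assumed figure for illustration), such a predictor would score:

```python
import math

def baseline_logloss(p0):
    # log loss of always predicting the base rate p0, when the
    # true click rate is also p0
    return -(p0 * math.log(p0) + (1 - p0) * math.log(1 - p0))

print(round(baseline_logloss(0.17), 4))
```

A trained model should beat whatever this baseline is for its dataset's actual base rate; otherwise the features carry no usable signal.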