### Clinton Boys

I am an Australian data scientist and mathematician, living in Tel Aviv.

# Repetition index

This post is based on a question I was asked in a job interview. I had trouble thinking properly in the interview, so I wrote up this code to answer the question afterwards.

Suppose we run a website (maybe an online newspaper) with a number of different articles, which we call sources. We have developed an in-house advertising system which places an ad, called the target, at the bottom of each source page. We recieve a table containing the following columns:

date | source | target | count

That is, for a given day, the number of people who browsed each source and which target they saw. For example, maybe on April 9, there were fifteen users who browsed source 1 and saw target a; this would be represented as the row

9/4 | 1 | a | 15

We are tasked with developing a “repetition index” for a given day, that is, some quantifiable and subjective measure of a user’s feeling that they are repeatedly seeing the same targets.

Let’s model user behaviour on the website as follows:

• the probability of a user visiting a given source on a site can be estimated using the count data
• a user is smart enough not to visit the same source twice on the same site on the same day

(An estimate of) the probability that a user navigates to source s which recommends target t is easily computable using the count data. So is the probability that a user then navigates to another source that is not s and sees target u. We can then look at all possible “paths” that a user could take, first visiting one site, and then a different one, and compute the probability that they see the same reccommendation on both sources.

Using Python:

def ProbOfPath(table,s,t):

## Estimates the probability of a user browsing to
## source s recommending target t, using
## the count data.

sum = 0
total = 0
for i in range(0,len(table)):
total = total + int(table['count'][i])
if table['source'][i] == s and table['target'][i] == t:
sum = sum + int(table['count'][i])

return float(sum)/float(total)

def CondProbOfPath(table,r,s,t):

## Estimates the probability of a user browsing to
## source s recommending target t, having
## already browsed source r, using the count data.

sum = 0
total = 0
for i in range(0,len(table)):
if table['source'][i] != r:
total = total + int(table['count'][i])
if table['source'][i] == s and table['target'][i] ==t:
sum = sum + int(table['count'][i])

return float(sum)/float(total)

def RepetitionIndex(table):

sources = list(set(table['source']))
targets = list(set(table['target']))

total_count = 0
for i in range(0,len(table)):
total_count = total_count + int(table['count'][i])

combinations = {}

for i in range(0,len(table)):
if int(table['count'][i]) != 0:
comb = repr([int(table['source'][i]), int(table['target'][i])])
combinations[comb] = float(int(table['count'][i]))/float(total_count)

paths = {}

for first_combination in combinations.keys():
for second_combination in combinations.keys():
if first_combination[1] != second_combination[1]:
comb = repr([first_combination, second_combination])
paths[comb] = ProbOfPath(table,int(first_combination[1]), int(first_combination[4]))*CondProbOfPath(table,int(first_combination[1]), int(second_combination[1]),int(second_combination[4]))

count_list = []

prob_of_same = 0

for path in paths.keys():
if int(path[6]) == int(path[16]):
prob_of_same = prob_of_same + paths[path]
return np.round(prob_of_same,3)