Clinton Boys bio photo

Clinton Boys

I am an Australian data scientist and mathematician, living in Tel Aviv.

Email Twitter Facebook LinkedIn Instagram Github Stackoverflow

This post is based on a question I was asked in a job interview. I had trouble thinking properly in the interview, so I wrote up this code to answer the question afterwards.

Suppose we run a website (maybe an online newspaper) with a number of different articles, which we call sources. We have developed an in-house advertising system which places an ad, called the target, at the bottom of each source page. We recieve a table containing the following columns:

date | source | target | count

That is, for a given day, the number of people who browsed each source and which target they saw. For example, maybe on April 9, there were fifteen users who browsed source 1 and saw target a; this would be represented as the row

9/4 | 1 | a | 15

We are tasked with developing a “repetition index” for a given day, that is, some quantifiable and subjective measure of a user’s feeling that they are repeatedly seeing the same targets.

Let’s model user behaviour on the website as follows:

  • the probability of a user visiting a given source on a site can be estimated using the count data
  • a user is smart enough not to visit the same source twice on the same site on the same day

(An estimate of) the probability that a user navigates to source s which recommends target t is easily computable using the count data. So is the probability that a user then navigates to another source that is not s and sees target u. We can then look at all possible “paths” that a user could take, first visiting one site, and then a different one, and compute the probability that they see the same reccommendation on both sources.

Using Python:

def ProbOfPath(table,s,t):

## Estimates the probability of a user browsing to 
## source s recommending target t, using 
## the count data. 

    sum = 0
    total = 0
    for i in range(0,len(table)):
        total = total + int(table['count'][i])
        if table['source'][i] == s and table['target'][i] == t:
            sum = sum + int(table['count'][i])

    return float(sum)/float(total)

def CondProbOfPath(table,r,s,t):

## Estimates the probability of a user browsing to
## source s recommending target t, having
## already browsed source r, using the count data. 

    sum = 0
    total = 0
    for i in range(0,len(table)):
        if table['source'][i] != r:
            total = total + int(table['count'][i])
            if table['source'][i] == s and table['target'][i] ==t:
                sum = sum + int(table['count'][i])

    return float(sum)/float(total)

def RepetitionIndex(table):

    sources = list(set(table['source']))
    targets = list(set(table['target']))

    total_count = 0
    for i in range(0,len(table)):
        total_count = total_count + int(table['count'][i])

    combinations = {}

    for i in range(0,len(table)):
        if int(table['count'][i]) != 0:
            comb = repr([int(table['source'][i]), int(table['target'][i])])
            combinations[comb] = float(int(table['count'][i]))/float(total_count)

    paths = {}

    for first_combination in combinations.keys():
        for second_combination in combinations.keys():
            if first_combination[1] != second_combination[1]:
                comb = repr([first_combination, second_combination])
                paths[comb] = ProbOfPath(table,int(first_combination[1]), int(first_combination[4]))*CondProbOfPath(table,int(first_combination[1]), int(second_combination[1]),int(second_combination[4]))

    count_list = []

    prob_of_same = 0

    for path in paths.keys():
        if int(path[6]) == int(path[16]):
            prob_of_same = prob_of_same + paths[path]
    return np.round(prob_of_same,3)

A few comments:

  • This model is maybe not the most perfect measure of repetition; in particular it is not invariant under one particular source having its traffic increased. Notice that this does not violate the assumptions of the model, since if traffic to source 1 increases, and source 1 recommends target 2, then the total number of users seeing target 2 will inevitably increase.
  • The code above is really hacky and dumb; in particular it’s going to take a very long time to compute things for real data sets. It should be redesigned in a smarter, object-oriented way.
  • This model gives an estimate of a fairly precise thing: the probability that a user sees the same target recommended twice. This measure is presumably highly correlated with a user’s sense of “repetitiveness” but is not the only possible measure.
  • A better model would compute longer “paths” through the site and look at the probability that a single target was seen an “unusually high” number of times.