Types of Data

To simplify the code and speed up the algorithms, choix assumes that items are identified by consecutive integers ranging from 0 to n_items - 1.
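When the raw data identify items by arbitrary labels, the labels first have to be mapped to consecutive integers. The following is a minimal sketch using only the standard library. The names in `raw` and the `label_to_id` dictionary are hypothetical and are not part of choix.

```python
# Hypothetical raw observations: (winner, loser) pairs identified by name.
raw = [("alice", "bob"), ("carol", "alice"), ("bob", "carol"), ("alice", "carol")]

# Assign each distinct label an integer in 0, ..., n_items - 1.
labels = sorted({name for pair in raw for name in pair})
label_to_id = {name: idx for idx, name in enumerate(labels)}
n_items = len(labels)

# Re-encode every observation with integer identifiers (winner first).
data = [[label_to_id[winner], label_to_id[loser]] for winner, loser in raw]
print(n_items, data)
```

Keeping `labels` around makes it possible to translate integer identifiers back to the original names later, since `labels[idx]` is the name of item `idx`.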

Data processed by the inference algorithms in the library consist of outcomes of comparisons between subsets of items. Specifically, four types of observations are supported.

Pairwise comparisons

In the simplest (and perhaps the most widely used) case, the data consist of outcomes of comparisons between two items. Mathematically, we represent the event “item \(i\) wins over item \(j\)” as

\[i \succ j.\]

In Python, we simply represent this event using a list with two integers:

[i, j]

By convention, the first element of the list represents the item which wins, and the second element the item which loses.

The statistical model that choix postulates for pairwise-comparison data is usually known as the Bradley–Terry model. Given parameters \(\theta_1, \ldots, \theta_n\), and two items \(i\) and \(j\), the probability of the outcome \(i \succ j\) is

\[p(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.\]
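To make the formula concrete, here is a small sketch that evaluates it directly in plain Python. The function name `bradley_terry_prob` and the parameter values are illustrative assumptions, not part of the choix API. Parameters are stored in a list indexed from 0, matching the integer identifiers of the items.

```python
import math

def bradley_terry_prob(params, i, j):
    """Probability that item i wins over item j under the Bradley-Terry model."""
    # Subtracting the larger parameter leaves the ratio unchanged
    # and avoids overflow in exp() for large parameter values.
    m = max(params[i], params[j])
    num = math.exp(params[i] - m)
    return num / (num + math.exp(params[j] - m))

# Hypothetical parameters for three items.
params = [0.0, 1.0, -0.5]
p01 = bradley_terry_prob(params, 0, 1)  # item 0 wins over item 1
p10 = bradley_terry_prob(params, 1, 0)  # item 1 wins over item 0
print(p01, p10)
```

The two probabilities sum to one, and the outcome depends only on the difference \(\theta_i - \theta_j\): the expression is the logistic function evaluated at that difference.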

Top-1 lists

Another case arises when the data consist of choices of one item out of a set containing several other items. We call these top-1 lists. Compared to pairwise comparisons, this type of data is no longer restricted to comparing only two items: comparisons can involve sets of alternatives of any size between 2 and n_items. We denote the outcome “item \(i\) is chosen over items \(j, \ldots, k\)” as

\[i \succ \{j, \ldots, k\}.\]

In Python, we represent this event using a list with two elements:

[i, {j, ..., k}]

The first element of the list is an integer that represents the “winning” item, whereas the second element is a set containing the “losing” items. Note that this set does not include the winning item.

The statistical model that choix uses for these data is a straightforward extension of the Bradley–Terry model (see, e.g., Luce 1959). Given parameters \(\theta_1, \ldots, \theta_n\), winning item \(i\) and losing alternatives \(j, \ldots, k\), the probability of the corresponding outcome is

\[p(i \succ \{j, \ldots, k\}) = \frac{e^{\theta_i}}{ e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}.\]
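As an illustration, the probability of a top-1 observation can be computed directly from its Python representation. This is a sketch in plain Python. The helper `top1_prob` and the parameter values are hypothetical, not functions or data provided by choix.

```python
import math

def top1_prob(params, winner, losers):
    """Probability that `winner` is chosen over the items in the set `losers`."""
    alternatives = [winner, *losers]
    # Shift by the maximum parameter for numerical stability.
    m = max(params[a] for a in alternatives)
    weights = [math.exp(params[a] - m) for a in alternatives]
    # The winner is the first entry of `alternatives`.
    return weights[0] / sum(weights)

# Hypothetical parameters for four items.
params = [0.0, 1.0, -0.5, 2.0]

# Observation "item 1 is chosen over items 0 and 2".
obs = [1, {0, 2}]
p = top1_prob(params, *obs)
print(p)
```

When the set of losing items contains a single element, the expression reduces to the Bradley–Terry probability for a pairwise comparison.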

Rankings

Instead of observing a single choice, we might have observations that consist of a ranking over a set of alternatives. This leads to a third type of data. We denote the event “item \(i\) wins over item \(j\) … wins over item \(k\)” as

\[i \succ j \succ \ldots \succ k.\]

In Python, we represent this as a list:

[i, j, ..., k]

The list contains the subset of items in decreasing order of preference. For example, the list [2, 0, 4] corresponds to a ranking where 2 is first, 0 is second, and 4 is third.

In this case, the statistical model that choix uses is usually referred to as the Plackett–Luce model. Given parameters \(\theta_1, \ldots, \theta_n\) and items \(i, j, \ldots, k\), the probability of a given ranking is

\[p(i \succ j \succ \ldots \succ k) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}} \cdot \frac{e^{\theta_j}}{e^{\theta_j} + \cdots + e^{\theta_k}} \cdots.\]

The attentive reader will notice that this probability is the product of the probabilities of a sequence of independent top-1 choices, each made among the alternatives that have not yet been ranked.
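The decomposition of a ranking into top-1 lists can be sketched as follows. The helpers `ranking_to_top1`, `top1_prob` and `ranking_prob`, as well as the parameter values, are assumptions made for this example and are not part of the choix API.

```python
import math

def ranking_to_top1(ranking):
    """Split a ranking into the sequence of top-1 lists it implies."""
    # At each position, the item wins over everything ranked after it.
    # The last item has no remaining alternatives, so it is skipped.
    return [[ranking[pos], set(ranking[pos + 1:])]
            for pos in range(len(ranking) - 1)]

def top1_prob(params, winner, losers):
    """Probability that `winner` is chosen over the items in `losers`."""
    total = sum(math.exp(params[a]) for a in [winner, *losers])
    return math.exp(params[winner]) / total

def ranking_prob(params, ranking):
    """Probability of a ranking as a product of top-1 probabilities."""
    prob = 1.0
    for winner, losers in ranking_to_top1(ranking):
        prob *= top1_prob(params, winner, losers)
    return prob

# Hypothetical parameters for five items.
params = [0.0, 1.0, -0.5, 2.0, 0.3]
ranking = [2, 0, 4]  # 2 is first, 0 is second, 4 is third
print(ranking_to_top1(ranking))
print(ranking_prob(params, ranking))
```

For the ranking [2, 0, 4], the first top-1 list states that 2 wins over 0 and 4, and the second that 0 wins over 4. The probabilities of all possible orderings of the same items sum to one.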

Choices in a network

The fourth type of data is slightly more involved. It enables the processing of choices on networks based on marginal observations at the nodes of the network. The easiest way to get started is to follow this notebook.

We defer to [MG17] for a thorough presentation of the observed data and of the statistical model.