Types of Data
In order to simplify the code and speed up the implementation of algorithms, choix assumes that items are identified by consecutive integers ranging from 0 to n_items - 1.
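Real datasets rarely come with items already numbered this way. The following is a minimal sketch of one way to build such an indexing in plain Python; the helper `index_items` and the sample labels are hypothetical and not part of choix.

```python
def index_items(labels):
    """Map arbitrary item labels to consecutive integers 0..n_items-1."""
    # Sorting gives a deterministic assignment; any fixed order would do.
    unique = sorted(set(labels))
    return {label: idx for idx, label in enumerate(unique)}


# Hypothetical raw observations: (winner, loser) pairs with string labels.
raw = [("pear", "apple"), ("apple", "kiwi"), ("pear", "kiwi")]

item_to_id = index_items(label for pair in raw for label in pair)
n_items = len(item_to_id)

# Re-express every observation with integer identifiers.
data = [[item_to_id[winner], item_to_id[loser]] for winner, loser in raw]
```

Keeping the `item_to_id` dictionary around makes it possible to translate estimated parameters back to the original labels afterwards.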
Data processed by the inference algorithms in the library consist of outcomes of comparisons between subsets of items. Specifically, four types of observations are supported.
Pairwise comparisons
In the simplest (and perhaps the most widely used) case, the data consist of outcomes of comparisons between two items. Mathematically, we represent the event “item \(i\) wins over item \(j\)” as
\[i \succ j.\]
In Python, we simply represent this event using a list with two integers:
[i, j]
By convention, the first element of the list represents the item which wins, and the second element the item which loses.
The statistical model that choix postulates for pairwise-comparison data is usually known as the Bradley–Terry model. Given parameters \(\theta_1, \ldots, \theta_n\), and two items \(i\) and \(j\), the probability of the outcome \(i \succ j\) is
\[p(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.\]
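As an illustration, the sketch below evaluates this probability in plain Python, assuming the usual exponential parametrization of the Bradley–Terry model, \(e^{\theta_i} / (e^{\theta_i} + e^{\theta_j})\). The function `pairwise_probability` and the parameter values are hypothetical and are not taken from the choix API.

```python
import math


def pairwise_probability(params, winner, loser):
    """Probability that `winner` beats `loser` under a Bradley-Terry model."""
    # e^a / (e^a + e^b) equals the logistic function of (a - b).
    diff = params[winner] - params[loser]
    # Branch on the sign so that exp() never receives a large positive value.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical parameters for n_items = 3.
params = [0.0, 1.0, -0.5]

# Observation [1, 0]: item 1 wins over item 0.
p = pairwise_probability(params, 1, 0)
```

Only the difference between the two parameters matters, so adding a constant to every entry of `params` leaves the probability unchanged.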
Top-1 lists
Another case arises when the data consist of choices of one item out of a set containing several other items. We call these top-1 lists. Compared to pairwise comparisons, this type of data is no longer restricted to comparing only two items: comparisons can involve sets of alternatives of any size between 2 and n_items. We denote the outcome “item \(i\) is chosen over items \(j, \ldots, k\)” as
\[i \succ \{j, \ldots, k\}.\]
In Python, we represent this event using a list with two elements:
[i, {j, ..., k}]
The first element of the list is an integer that represents the “winning” item, whereas the second element is a set containing the “losing” items. Note that this set does not include the winning item.
The statistical model that choix uses for these data is a straightforward extension of the Bradley–Terry model (see, e.g., Luce 1959). Given parameters \(\theta_1, \ldots, \theta_n\), winning item \(i\) and losing alternatives \(j, \ldots, k\), the probability of the corresponding outcome is
\[p(i \succ \{j, \ldots, k\}) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}.\]
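The sketch below shows how such a probability could be computed in plain Python. It assumes the choice probability is the winner's exponentiated parameter divided by the sum of exponentiated parameters over the winner and all losing alternatives. The function `top1_probability` and the parameter values are hypothetical and are not part of choix.

```python
import math


def top1_probability(params, winner, losers):
    """Probability that `winner` is chosen among itself and `losers`."""
    if winner in losers:
        raise ValueError("the set of losers must not contain the winner")
    alternatives = [winner, *losers]
    # Subtract the largest parameter before exponentiating to avoid overflow.
    shift = max(params[i] for i in alternatives)
    weights = [math.exp(params[i] - shift) for i in alternatives]
    # The winner's weight is first because the winner is listed first.
    return weights[0] / sum(weights)


# Hypothetical parameters for n_items = 4.
params = [0.0, 1.0, -0.5, 2.0]

# Observation [1, {0, 2}]: item 1 is chosen over items 0 and 2.
observation = [1, {0, 2}]
p = top1_probability(params, *observation)
```

With a single losing alternative, the expression reduces to the pairwise-comparison probability.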
Rankings
Instead of observing a single choice, we might have observations that consist of a ranking over a set of alternatives. This leads to a third type of data. We denote the event “item \(i\) wins over item \(j\) … wins over item \(k\)” as
\[i \succ j \succ \ldots \succ k.\]
In Python, we represent this as a list:
[i, j, ..., k]
The list contains the subset of items in decreasing order of preference. For example, the list [2, 0, 4] corresponds to a ranking where 2 is first, 0 is second, and 4 is third.
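In this case, the statistical model that choix uses is usually referred to as the Plackett–Luce model. Given parameters \(\theta_1, \ldots, \theta_n\) and items \(i, j, \ldots, k\), the probability of a given ranking is
\[p(i \succ j \succ \ldots \succ k) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}} \cdot \frac{e^{\theta_j}}{e^{\theta_j} + \cdots + e^{\theta_k}} \cdots\]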
The attentive reader will notice that this probability corresponds to that of an independent sequence of top-1 lists over the remaining alternatives.
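This decomposition can be made concrete with a short sketch: a ranking is split into successive top-1 observations, and the ranking probability is the product of the corresponding choice probabilities. The helpers `ranking_to_top1` and `ranking_probability` are hypothetical names used for illustration only and are not part of choix. Each stage is assumed to follow the exponential choice model described for top-1 lists.

```python
import math


def ranking_to_top1(ranking):
    """Split a ranking into successive top-1 observations [winner, losers]."""
    # At each position, the current item wins over everything ranked after it.
    # The last item has no remaining alternatives, so it yields no observation.
    return [
        [ranking[pos], set(ranking[pos + 1:])]
        for pos in range(len(ranking) - 1)
    ]


def ranking_probability(params, ranking):
    """Probability of `ranking` as a product of top-1 choice probabilities."""
    prob = 1.0
    for winner, losers in ranking_to_top1(ranking):
        weights = [math.exp(params[i]) for i in [winner, *losers]]
        prob *= weights[0] / sum(weights)
    return prob


# The ranking [2, 0, 4]: item 2 first, item 0 second, item 4 third.
stages = ranking_to_top1([2, 0, 4])
```

Here `stages` contains two observations: item 2 chosen over items 0 and 4, then item 0 chosen over item 4.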
Choices in a network
The fourth type of data is slightly more involved. It enables the processing of choices on networks based on marginal observations at the nodes of the network. The easiest way to get started is to follow this notebook.
We defer to [MG17] for a thorough presentation of the observed data and of the statistical model.