The logarithmic scoring rule was suggested by Good in the 1950s [Good, 1952]. It can be defined as follows: if there are n mutually exclusive possible outcomes and fi (i=1,...,n) is the predicted probability of the ith outcome occurring, and the jth outcome is the one which actually occurs, then the score for this particular forecast-realization pair is given by

IGN = -log2 fj
As defined above, with the negative sign, the logarithmic score cannot be negative, and smaller values of the score are better. The minimum value of the score (zero) is obtained if a probability of 100% is assigned to the actual outcome. If a probability of zero is assigned to the actual outcome the logarithmic score is infinite. We will examine the meaning of this below.
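A minimal sketch in Python of the score as defined above (the function name and the example probabilities are illustrative, not from the original text):

```python
import math

def ignorance_score(forecast, outcome):
    """Logarithmic (ignorance) score: -log2 of the probability
    the forecast assigned to the outcome that occurred."""
    return -math.log2(forecast[outcome])

# Three mutually exclusive outcomes; the second (index 1) occurs.
forecast = [0.2, 0.5, 0.3]
print(ignorance_score(forecast, 1))  # -log2(0.5) = 1.0 bit
```

A forecast that assigns probability 1 to the actual outcome scores zero, while a forecast that assigns it probability 0 would raise a math domain error, the numerical counterpart of the infinite score.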
The logarithmic scoring rule is strictly proper, which means that if a forecaster believes the probabilities of each outcome occurring are gi (i=1,...,n) then that forecaster will minimize their expected logarithmic score by issuing the forecast fi = gi. The Brier score is also strictly proper. Unlike the Brier score, however, the logarithmic score is local, in that it depends only upon the probability assigned to the outcome which occurs and not on the probabilities assigned to the other outcomes.
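Strict properness can be checked numerically. In this sketch (the belief and the alternative forecasts are made-up illustrations), the expected score under belief gi is the cross-entropy -Σ gi log2 fi, and it is lowest when the issued forecast equals the belief:

```python
import math

def expected_ignorance(belief, forecast):
    """Expected logarithmic score (in bits) of `forecast` when
    outcomes actually occur with the probabilities in `belief`."""
    return -sum(g * math.log2(f) for g, f in zip(belief, forecast))

belief = [0.6, 0.3, 0.1]           # illustrative belief g_i
honest = expected_ignorance(belief, belief)
hedged = expected_ignorance(belief, [0.5, 0.3, 0.2])
flat   = expected_ignorance(belief, [1/3, 1/3, 1/3])
assert honest < hedged < flat      # honesty minimizes the expected score
```

The honest forecast's expected score equals the entropy of the belief, which no other forecast can beat (Gibbs' inequality).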
The use of the term “ignorance” to describe the logarithmic score follows from an information theoretic interpretation of what the score means [Roulston and Smith, 2002]. Consider Alice, who is in a closed room, and Bob, who is outside observing the weather. If there are n possible weather outcomes then, in the absence of any other information, Bob will have to send Alice about log2 n bits of information to describe which outcome he observes. There is a result from information theory (Shannon's source coding theorem), however, which says that a random symbol sequence containing n different types of symbol occurring with frequencies pi (i=1,...,n) can be compressed, and that the maximum level of compression is achieved by assigning, on average, -log2 pi bits to the ith symbol. Note that this compression works by assigning fewer bits to commonly occurring symbols and more bits to the rarer ones. So if Alice and Bob both believe that the probabilities of the different outcomes are fi (i=1,...,n), they can in theory agree on a compression scheme which assigns, on average, -log2 fi bits to the ith outcome. If the jth outcome then occurs, Bob will have to send Alice IGN = -log2 fj bits to tell her which outcome occurred. In this scenario IGN is the extra amount of information Alice needs to determine which outcome occurred given that she had the forecast fi (i=1,...,n); that is, it represents the information deficit – or ignorance – of Alice when she had the forecast. If the forecast assigned a 100% chance to the outcome which occurs then Alice's ignorance was zero: she didn't need Bob to tell her what happened. If the forecast assigned zero probability to the actual outcome then Alice is in trouble, since an optimal data compression scheme assigns no codeword to outcomes deemed impossible – allotting bits to them would be suboptimal.
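The compression argument becomes concrete when the forecast probabilities are powers of 1/2, so that -log2 fi is a whole number of bits. The outcome names and codewords below are illustrative; any prefix-free code with these lengths would serve:

```python
import math

# A forecast whose probabilities are powers of 1/2 admits an exact
# prefix code: the codeword for outcome i has length -log2(f_i).
forecast = {"rain": 0.5, "snow": 0.25, "sun": 0.25}
code = {"rain": "0", "snow": "10", "sun": "11"}

for outcome, f in forecast.items():
    assert len(code[outcome]) == -math.log2(f)

# If "snow" occurs, Bob sends Alice the 2-bit codeword "10";
# Alice's ignorance was IGN = -log2(0.25) = 2 bits.
```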
The idea of evaluating the goodness of scientific predictions in terms of how much they allow us to compress our observations of the real world has been discussed by P.C.W. Davies [1991]. Davies argues that rather than concerning ourselves with the question of whether scientific theories are “true” or not, we can gauge their effectiveness as data compression schemes. For example, Newton's law of gravitation allows a huge number of observations of planetary orbits to be compressed down to some initial conditions and some equations. The ignorance score is a practical way to apply this philosophy to weather predictions.
In addition to interpreting the logarithmic score as an information deficit, we can also interpret the score in terms of the wealth doubling rate of a gambler betting on the forecast. Suppose that a casino is offering to multiply any money placed on the ith outcome by a factor oi if that outcome occurs. For example, if oi = 3 then this is equivalent to saying that the odds being offered on the ith outcome are 2:1 against (you would win $2 for a stake of $1 and have the stake returned). If wi is the fraction of the total amount gambled that is placed on the ith outcome, then if the jth outcome occurs the total amount will be multiplied by a factor wj oj. The logarithm log2 wj oj can be interpreted as the wealth doubling rate – the number of doublings of the gambler's wealth per bet (its reciprocal being the number of such bets required to double the initial wealth). It can be shown that if a gambler believes the probabilities of the different outcomes occurring are fi (i=1,...,n) and they wish to maximize their expected wealth doubling rate, then they should distribute their stake money such that wi = fi. This strategy is known as “Kelly betting” [Kelly, 1956], and its optimality follows from the same fact that makes the logarithmic score a proper score. If the casino believes the probabilities of the different outcomes are gi (i=1,...,n) and the casino sets fair odds (Σ 1/oi = 1) then the casino will set its odds such that oi = 1/gi, and the wealth doubling rate of a Kelly betting gambler will be given by

D = log2 (fj / gj) = (-log2 gj) - (-log2 fj)
For the gambler to make money this wealth doubling rate must be positive. This will only be true if the ignorance score of the casino's forecast is larger than the ignorance score of the gambler’s forecast. In this interpretation, if a probability of zero is assigned to the outcome which occurs then the gambler would place none of their money on that outcome and would lose everything.
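The link between the doubling rate and the two ignorance scores can be checked by simulation. The sketch below assumes, purely for illustration, that outcomes really do occur with the gambler's probabilities fi; the long-run doubling rate then approaches the expected difference between the casino's and the gambler's ignorance (the Kullback-Leibler divergence between the two forecasts, in bits):

```python
import math
import random

random.seed(1)

gambler = [0.5, 0.3, 0.2]        # gambler's forecast f_i (Kelly stakes w_i = f_i)
casino  = [0.4, 0.4, 0.2]        # casino's belief g_i; fair odds o_i = 1/g_i
odds = [1 / g for g in casino]

log_wealth = 0.0
n_bets = 100_000
for _ in range(n_bets):
    # Outcomes drawn from the gambler's forecast (an assumption for this sketch).
    j = random.choices(range(3), weights=gambler)[0]
    log_wealth += math.log2(gambler[j] * odds[j])

empirical_rate = log_wealth / n_bets
# Theoretical rate: E[log2(f_j / g_j)] = E[IGN_casino] - E[IGN_gambler].
theoretical = sum(f * math.log2(f / g) for f, g in zip(gambler, casino))
print(empirical_rate, theoretical)
```

The rate is positive exactly when the casino's forecast carries a higher expected ignorance than the gambler's, as stated above.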
The idea of representing our uncertainty about the Universe through gambling was suggested by Immanuel Kant in his Critique of Pure Reason in 1781. Kant equated betting with “pragmatic belief” [Menand, 2001]. The logarithmic scoring rule is, in a sense, a practical implementation of this philosophical suggestion.
References
Davies, P.C.W., 1991: Why is the physical world so comprehensible? Complexity, Entropy and the Physics of Information, W.H. Zurek, Ed., Addison-Wesley, 61-70.
Good, I.J., 1952: Rational decisions, Journal of the Royal Statistical Society, Series B, 14, 107-114.
Kelly, J., 1956: A new interpretation of information rate, Bell System Technical Journal, 35, 916-926.
Menand, L., 2001: The Metaphysical Club, HarperCollins.
Roulston, M.S. and Smith, L.A., 2002: Evaluating probabilistic forecasts using information theory, Monthly Weather Review, 130, 1653-1660.