In information theory, entropy is the average amount of information contained in each message received. Here, "message" stands for an event, sample, or character drawn from a distribution or data stream. Entropy thus characterizes our uncertainty about the source of information. (Entropy is best understood as a measure of uncertainty rather than certainty, since entropy is larger for more random sources.) The source is also characterized by the probability distribution of the samples drawn from it. The key idea is that the less likely an event is, the more information it provides when it occurs. For this and other reasons (explained below), it makes sense to define the information of an event as the negative of the logarithm of its probability. The probability distribution of the events, together with the information content of each event, defines a random variable whose expected (average) value is the average amount of information, that is, the entropy, generated by the distribution. The units of entropy depend on the base of the logarithm used to define it: shannons (commonly called bits) for base 2, nats for base e, and hartleys for base 10.

The logarithm of the probability is useful as a measure of information because it is additive for independent events. For instance, flipping a fair coin provides 1 shannon (1 bit) of information, and m independent tosses provide m shannons. Generally, you need log2(n) bits to represent a variable that can take one of n values. Since exactly one of n outcomes occurs when you apply a scale graduated with n marks, you receive log2(n) bits of information with every such measurement. The log2(n) rule holds only while all outcomes are equally probable. If one event occurs more often than the others, observing that event is less informative. Conversely, rarer events provide more information when they do occur. Since less probable events are observed more rarely, the net effect is that the entropy (thought of as the average information) received from non-uniformly distributed data is less than log2(n). Entropy is zero when only one outcome is possible, that is, when the outcome is certain. Shannon entropy quantifies all of these considerations exactly when the probability distribution of the source is provided.

It is important to note that the meaning of the events observed (the meaning of the messages) does not matter in the definition of entropy. Entropy takes into account only the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.

Generally, "entropy" stands for "disorder" or uncertainty. The entropy discussed here was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication". It is also called Shannon entropy to distinguish it from other uses of the term, which appears in various parts of physics in different forms. Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of any communication, assuming that the communication may be represented as a sequence of independent and identically distributed random variables.
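To make the arithmetic above concrete, the following is a minimal sketch in Python (the helper name shannon_entropy is purely illustrative, not from the source) that evaluates the expected negative log-probability for a few of the distributions mentioned: a fair coin, a biased coin, a uniform n-way choice, and a certain outcome.

```python
import math

def shannon_entropy(probabilities, base=2):
    """Average negative log-probability of a distribution.

    Units follow the log base: base 2 gives shannons (bits),
    base e gives nats, base 10 gives hartleys.
    Terms with p == 0 or p == 1 contribute nothing and are skipped.
    """
    return sum(-p * math.log(p, base) for p in probabilities if 0 < p < 1)

# A fair coin toss yields exactly 1 shannon (1 bit).
print(shannon_entropy([0.5, 0.5]))      # 1.0

# A biased coin is less informative on average than a fair one.
print(shannon_entropy([0.9, 0.1]))      # about 0.47 bits, less than 1

# A uniform distribution over n outcomes reaches the maximum log2(n).
print(shannon_entropy([0.25] * 4))      # 2.0, i.e. log2(4)

# A certain outcome carries no information.
print(shannon_entropy([1.0]))           # 0
```

The biased-coin case illustrates the point about non-uniform data: the rare outcome carries more information when it appears, but it appears so seldom that the average information stays below the 1 bit obtained from a fair coin.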