In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

If X is a discrete random variable with distribution given by Pr(X = xk) = pk for k = 1, 2, ..., then the entropy of X is defined as H(X) = −∑k pk log pk. If X is a continuous random variable with probability density p(x), then the differential entropy of X is defined as H(X) = −∫ p(x) log p(x) dx, where p(x) log p(x) is understood to be zero whenever p(x) = 0. This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and Differential entropy. In connection with maximum entropy distributions, this is the only form needed, because maximizing H(X) will also maximize the more general forms.

The base of the logarithm is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nats for the entropy.
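As a small illustration of this rescaling (an assumed example, not from the article), the following Python snippet computes the entropy of one discrete distribution in both units and checks that they differ exactly by the factor ln 2:

```python
import math

# Entropy of a discrete distribution, with a configurable logarithm base.
def entropy(probs, base=math.e):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.5, 0.25, 0.25]          # an arbitrary example distribution
h_bits = entropy(p, base=2)    # entropy in bits
h_nats = entropy(p)            # entropy in nats

# Changing the base only rescales the entropy: H_bits = H_nats / ln 2.
assert abs(h_bits - h_nats / math.log(2)) < 1e-12
print(h_bits, h_nats)          # 1.5 bits ≈ 1.0397 nats
```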

The choice of the measure dx is, however, crucial in determining the entropy and the resulting maximum entropy distribution, even though the usual recourse to the Lebesgue measure is often defended as "natural".

Many statistical distributions of practical interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Suppose S is a closed subset of the real numbers R and we choose to specify n measurable functions f1, ..., fn and n numbers a1, ..., an. We consider the class C of all real-valued random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n moment conditions:

E[fj(X)] ≥ aj,  j = 1, ..., n.

If there is a member of C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the form

p(x) = exp(∑j λj fj(x)) for all x in S,

where the sum runs over j = 0, 1, ..., n, f0(x) := 1, λ0 is fixed by normalization, and the remaining λj are determined by the moment conditions.

Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization program has a unique solution, because the objective function in the optimization is concave in λ.
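A minimal numerical sketch of this optimization (the grid, moment functions, and target values below are assumptions for illustration): by concavity, the problem can be solved through its convex dual, the log partition function minus λ·a.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Hypothetical example: maximum entropy on a grid standing in for the real
# line, under two equality moment constraints E[X] = 0 and E[X^2] = 1.
xs = np.linspace(-10.0, 10.0, 2001)
F = np.stack([xs, xs**2])          # moment functions f_1(x) = x, f_2(x) = x^2
a = np.array([0.0, 1.0])           # target moment values a_1, a_2

def dual(lam):
    # Log partition function minus lam . a; convex in lam, so the
    # minimizer (and hence the maximum entropy distribution) is unique.
    return logsumexp(lam @ F) - lam @ a

res = minimize(dual, x0=np.zeros(2))
p = np.exp(res.x @ F - logsumexp(res.x @ F))   # p(x) ∝ exp(lam . f(x))
print(F @ p)                                    # ≈ [0, 1]: constraints hold
```

The recovered weights trace a discretized Gaussian, consistent with the normal distribution being the maximum entropy distribution for fixed mean and variance.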

Note that if the moment conditions are equalities (instead of inequalities), that is,

E[fj(X)] = aj,  j = 1, ..., n,

then the non-negativity condition on the multipliers is dropped and the optimization over the Lagrange multipliers is unconstrained.

Suppose S = {x1, x2, ...} is a (finite or countably infinite) discrete subset of the reals and we choose to specify n functions f1, ..., fn and n numbers a1, ..., an. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n moment conditions

E[fj(X)] ≥ aj,  j = 1, ..., n.

If there exists a member of C which assigns positive probability to all members of S, and if there exists a maximum entropy distribution for C, then this distribution has the shape

Pr(X = xk) = exp(∑j λj fj(xk)),  k = 1, 2, ...,

with the same conventions for λ0 and f0 as in the continuous case.

In the case of equality constraints, the theorem is proved with the calculus of variations and Lagrange multipliers. Consider the functional

J(p) = −∫ p(x) ln p(x) dx + η0 (∫ p(x) dx − 1) + ∑j λj (∫ p(x) fj(x) dx − aj),

where η0 and λj, j ≥ 1, are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability (total probability one). The other constraints require that the expectations of the functions fj equal the prescribed constants aj, up to order n. The entropy attains an extremum when the functional derivative is equal to zero:

δJ/δp = −ln p(x) − 1 + η0 + ∑j λj fj(x) = 0.

It is an exercise for the reader to verify that this extremum is indeed a maximum. Therefore, the maximum entropy probability distribution in this case must be of the form (with λ0 := η0 − 1)

p(x) = exp(λ0 + ∑j λj fj(x)).
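As a standard worked instance of this form (a textbook example added for illustration, taking S = [0, ∞) and the single constraint E[X] = μ), the two conditions pin down λ0 and λ1:

p(x) = exp(λ0 + λ1x), x ≥ 0.

Normalization gives ∫0∞ exp(λ0 + λ1x) dx = −e^λ0/λ1 = 1, so e^λ0 = −λ1, and the mean condition ∫0∞ x p(x) dx = μ then forces λ1 = −1/μ. Hence

p(x) = (1/μ) e^(−x/μ),

the exponential distribution with mean μ.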

It follows that a distribution satisfying the expectation constraints and maximising entropy must necessarily have full support, i.e. the distribution is almost everywhere strictly positive. It follows that the maximising distribution must be an interior point in the space of distributions satisfying the expectation constraints, that is, it must be a local extremum. Thus, to show that the entropy-maximising distribution is unique, it suffices to show that the local extremum is unique (and this also shows that the local extremum is the global maximum).

where q is similar to the distribution above, only parameterised by γ. Assuming that no non-trivial linear combination of the observables is almost everywhere constant (which holds, for example, if the observables are independent and not almost everywhere constant), the inner product ⟨u, f(X)⟩ has non-zero variance unless u = 0. By the above equation it is thus clear that the latter must be the case. Hence λ′ − λ = u = 0, so the parameters characterising the local extrema p, p′ are identical, which means that the distributions themselves are identical. Thus the local extremum is unique and, by the above discussion, the maximum is unique, provided a local extremum actually exists.

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all continuous distributions X on R with E(X) = 0 and E(X²) = E(X³) = 1; see Cover & Thomas (2006: chapter 12)). It is also possible that the expected value restrictions for the class C force the probability distribution to be zero on certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.
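The first failure mode can be made concrete (a standard fact, added here for illustration): within the class of continuous distributions on R with mean 0, the normal family alone already shows the entropy is unbounded, since

H(N(0, σ²)) = ½ ln(2πeσ²) → ∞ as σ → ∞.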

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution have its own entropy. To see this, rewrite the density as p(x) = exp(ln p(x)) and compare to the expression of the theorem above. By choosing ln p(x) → f(x) to be the measurable function and

∫ exp(f(x)) f(x) dx = −H

to be the constant, p(x) is the maximum entropy probability distribution under the constraint

∫ p(x) f(x) dx = −H.
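As a quick sanity check (an illustrative instance, not part of the original text), take the unit-mean exponential density p(x) = e^(−x) on [0, ∞). Then f(x) = ln p(x) = −x, the entropy is H = −∫ p ln p dx = E[X] = 1, and indeed

∫ p(x) f(x) dx = −E[X] = −1 = −H.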

Nontrivial examples are distributions that are subject to multiple constraints different from the assignment of the entropy. These are often found by starting with the same procedure ln p(x) → f(x) and finding that f(x) can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972)[6] and Park & Bera (2009).[7]

The uniform distribution on the interval [a, b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a = a0 < a1 < ... < ak = b of the interval [a, b] and probabilities p1, ..., pk which add up to one, then we can consider the class of all continuous distributions such that

Pr(aj−1 ≤ X < aj) = pj,  j = 1, ..., k.

The density of the maximum entropy distribution for this class is constant on each of the intervals [aj−1, aj); it equals pj/(aj − aj−1) there. The uniform distribution on the finite set {x1, ..., xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
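A minimal sketch of this piecewise-constant construction (the subdivision and bin probabilities below are assumed for the example):

```python
import numpy as np

# Piecewise-constant maximum entropy density for a subdivision of [a, b]
# with prescribed bin probabilities (illustrative values).
edges = np.array([0.0, 1.0, 3.0, 4.0])   # a = a0 < a1 < a2 < a3 = b
probs = np.array([0.2, 0.5, 0.3])        # p_j, summing to one

heights = probs / np.diff(edges)         # density p_j / (a_j - a_{j-1}) on [a_{j-1}, a_j)
H = -np.sum(probs * np.log(heights))     # differential entropy of the staircase density
print(heights, H)                        # [0.2, 0.25, 0.3], H ≈ 1.376 nats
```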

The normal distribution N(μ, σ²) has maximum entropy among all real-valued distributions supported on (−∞, ∞) with a specified variance σ² (a particular moment). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond this moment. (See the differential entropy article for a derivation.)
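As a numerical illustration of this maximality (an assumed comparison, not from the article), the following compares the differential entropies, in nats, of three distributions sharing the variance σ² = 2; the normal comes out largest:

```python
import math

sigma2 = 2.0   # common variance (illustrative)

# Differential entropies (nats) of distributions with variance sigma2:
h_normal  = 0.5 * math.log(2 * math.pi * math.e * sigma2)
h_uniform = 0.5 * math.log(12 * sigma2)      # uniform: H = ln(width), width^2 / 12 = sigma2
b = math.sqrt(sigma2 / 2)                    # Laplace scale: variance = 2 b^2
h_laplace = 1 + math.log(2 * b)

assert h_normal > h_laplace > h_uniform      # the normal attains the maximum
print(h_normal, h_laplace, h_uniform)        # ≈ 1.766, 1.693, 1.589
```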

In the case of distributions supported on [0, ∞), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, it may be another distribution, or it may be undefinable.[8]

Among all the discrete distributions supported on the set {x1, ..., xn} with a specified mean μ, the maximum entropy distribution has the shape

Pr(X = xk) = C r^(xk),  k = 1, ..., n,

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. (This is the exponential form above, with C = e^λ0 and r = e^λ1.)

For example, suppose a large number N of dice are thrown and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1, ..., x6} = {1, ..., 6} and μ = S/N.
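A minimal numerical sketch of this dice example (the target mean below is assumed; the form P(X = k) = C·r^k is the one stated above, and the mean equation is solved for r by bracketing):

```python
import numpy as np
from scipy.optimize import brentq

ks = np.arange(1, 7)               # faces of the die

def mean_given_r(r):
    w = r ** ks                    # unnormalized weights C * r**k (C cancels)
    return (ks * w).sum() / w.sum()

mu = 4.5                           # e.g. mu = S / N from the observed total
# mean_given_r increases from 1 (r -> 0) to 6 (r -> infinity); bracket and solve:
r = brentq(lambda r: mean_given_r(r) - mu, 1e-9, 1e9)
p = r ** ks
p /= p.sum()                       # normalization fixes C
print(p, (ks * p).sum())           # maximum entropy pmf; mean ≈ 4.5
```

For μ = 3.5 the solution is r = 1, recovering the uniform distribution on the six faces.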

Finally, among all the discrete distributions supported on the infinite set {x1, x2, ...} with mean μ, the maximum entropy distribution has the shape

Pr(X = xk) = C r^(xk),  k = 1, 2, ...,

where again the constants C and r are determined by normalization and the mean condition. For example, in the case that xk = k, this is a geometric distribution.

There exists an upper bound on the entropy of continuous random variables on R with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because p(x) = c exp(λ1x + λ2x² + λ3x³) is unbounded except when λ3 = 0 (see Cover & Thomas (2006: chapter 12)).

However, the maximum entropy is ε-achievable: a distribution's entropy can come arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many σ larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower-order moments.
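The moment arithmetic behind this perturbation argument can be checked directly (the mixture weight and spike location below are assumed illustrative values):

```python
# Mix a standard normal with a tiny point mass far in the right tail:
# (1 - eps) * N(0, 1) + eps * (point mass at x0).
eps, x0 = 1e-9, 1e3   # illustrative values

mean   = eps * x0                        # raw first moment:  0 -> 1e-6
second = (1 - eps) * 1.0 + eps * x0**2   # raw second moment: 1 -> ~1.001
third  = eps * x0**3                     # raw third moment:  0 -> 1.0

print(mean, second, third)
# The skew-determining third moment jumps to order one while the mean and
# variance barely move, so the entropy stays near that of the normal.
```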

Among distributions with a given mean μ and given standard lower semi-deviation D = √(E[([X − μ]−)²]), where [x]− := max{0, −x} denotes the negative part, the maximum entropy distribution has density of the form f(x) = c exp(ax + b([x − μ]−)²), with constants a, b, c determined by the constraints.[10]

In the table below, each listed distribution maximizes the entropy for the particular set of functional constraints listed in the third column, together with the constraint that x be included in the support of the probability density, which is listed in the fourth column.[6][7] Several of the listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity. For reference, Γ(x) = ∫0∞ e^(−t) t^(x−1) dt is the gamma function, ψ(x) = (d/dx) ln Γ(x) = Γ′(x)/Γ(x) is the digamma function, B(p, q) = Γ(p)Γ(q)/Γ(p + q) is the beta function, and γE is the Euler–Mascheroni constant.
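One table entry can be spot-checked numerically (an illustrative comparison; the gamma shape k = 2 is an arbitrary choice): on [0, ∞) with a fixed mean, the exponential distribution should have larger entropy than any other candidate, such as a gamma distribution with the same mean:

```python
import math
from scipy.special import gammaln, digamma

mu = 2.0                    # common mean (illustrative)

# Differential entropy of Exponential(mean mu): H = 1 + ln(mu)
h_exp = 1.0 + math.log(mu)

# Differential entropy of Gamma(shape k, scale theta) with k * theta = mu:
# H = k + ln(theta) + ln(Gamma(k)) + (1 - k) * psi(k)
k = 2.0
theta = mu / k
h_gamma = k + math.log(theta) + gammaln(k) + (1.0 - k) * digamma(k)

assert h_exp > h_gamma      # exponential maximizes entropy given the mean
print(h_exp, h_gamma)       # ≈ 1.693 vs ≈ 1.577
```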
