Modeling of the sample mean by a process engineer at a Victorian beer factory

The following example is taken from Dalgaard (2008, p. 97), who, in turn, got his data from Altman (1991, p. 183).

In 1986, Manocha and his colleagues studied eleven healthy female subjects (aged 22 to 30) for ten days, taking careful measurements of their daily energy intakes, \(\varepsilon_{k,j}\), where \(k\) is the subject index and \(j\) is the day index.

Here, \(\varepsilon\) denotes energy intake data, but it could be any numeric quantity, depending on your use case. In the table, we see that Manocha and colleagues condensed 110 data points (\(\varepsilon_{k,j}\)) into 11 data points (\(\varepsilon_k\)).


  1. See Peter Dalgaard (2008) Introductory Statistics with R, 2nd edition, Springer Science + Business Media, New York. See also Douglas G. Altman (1991) Practical Statistics for Medical Research, Chapman & Hall/CRC, Boca Raton.

  2. S. Manocha, G. Choudhuri, B. N. Tandon (1986) A study of dietary intake in pre- and post-menstrual period. Human Nutrition: Applied Nutrition 40A, pp. 213-216.
| Female subject identifier, \(k\) | Day 1 | Day 2 | \(\cdots\) | Day 10 | 10-day average (kilojoules) |
| --- | --- | --- | --- | --- | --- |
| \(1\) | \(\varepsilon_{1, 1}\) | \(\varepsilon_{1, 2}\) | \(\cdots\) | \(\varepsilon_{1, 10}\) | \(\varepsilon_1 = 5260\) |
| \(2\) | \(\varepsilon_{2, 1}\) | \(\varepsilon_{2, 2}\) | \(\cdots\) | \(\varepsilon_{2, 10}\) | \(\varepsilon_2 = 5470\) |
| \(3\) | \(\varepsilon_{3, 1}\) | \(\varepsilon_{3, 2}\) | \(\cdots\) | \(\varepsilon_{3, 10}\) | \(\varepsilon_3 = 5640\) |
| \(4\) | \(\varepsilon_{4, 1}\) | \(\varepsilon_{4, 2}\) | \(\cdots\) | \(\varepsilon_{4, 10}\) | \(\varepsilon_4 = 6180\) |
| \(5\) | \(\varepsilon_{5, 1}\) | \(\varepsilon_{5, 2}\) | \(\cdots\) | \(\varepsilon_{5, 10}\) | \(\varepsilon_5 = 6390\) |
| \(6\) | \(\varepsilon_{6, 1}\) | \(\varepsilon_{6, 2}\) | \(\cdots\) | \(\varepsilon_{6, 10}\) | \(\varepsilon_6 = 6515\) |
| \(7\) | \(\varepsilon_{7, 1}\) | \(\varepsilon_{7, 2}\) | \(\cdots\) | \(\varepsilon_{7, 10}\) | \(\varepsilon_7 = 6805\) |
| \(8\) | \(\varepsilon_{8, 1}\) | \(\varepsilon_{8, 2}\) | \(\cdots\) | \(\varepsilon_{8, 10}\) | \(\varepsilon_8 = 7515\) |
| \(9\) | \(\varepsilon_{9, 1}\) | \(\varepsilon_{9, 2}\) | \(\cdots\) | \(\varepsilon_{9, 10}\) | \(\varepsilon_9 = 7515\) |
| \(10\) | \(\varepsilon_{10, 1}\) | \(\varepsilon_{10, 2}\) | \(\cdots\) | \(\varepsilon_{10, 10}\) | \(\varepsilon_{10} = 8230\) |
| \(11\) | \(\varepsilon_{11, 1}\) | \(\varepsilon_{11, 2}\) | \(\cdots\) | \(\varepsilon_{11, 10}\) | \(\varepsilon_{11} = 8770\) |

$$\underbrace{\varepsilon_{k,j}=\begin{pmatrix}\varepsilon_{1,1} & \cdots & \varepsilon_{1,10}\\ \varepsilon_{2,1} & \cdots & \varepsilon_{2,10}\\ \vdots & \ddots & \vdots\\ \varepsilon_{11,1} & \cdots & \varepsilon_{11,10} \end{pmatrix}}_{\mathbf{E}, \,\textrm{110 data points}} \underset{\textrm{condensed to}}{\longrightarrow} \underbrace{\varepsilon_k = \begin{pmatrix}\varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_{11}\end{pmatrix}}_{\mathbf{e}, \, \textrm{11 data points}}$$

The very act of summing the 10 daily intake numbers and dividing the sum by the total count is known in statistics as computing the first Pearsonian moment (around zero).

The term ‘moment' was borrowed from classical mechanics and introduced into statistics by Karl Pearson in 1893. In general, given \(n\) sampled values of some random variable \(x\), the moment of \(r\)-th order around an arbitrary point \(x = a\) is given by

$$\mu_r' = \frac{1}{n}\sum_{k=1}^n (x_k - a)^r$$

For instance, for the first female subject, the first-order moment (around zero) is calculated by:

$$\begin{align} \mu_1' &= \frac{\varepsilon_{1,1} + \varepsilon_{1,2} + \ldots + \varepsilon_{1,10} }{10}\\ &= \frac{1}{10}\sum_{j=1}^{10} (\varepsilon_{1,j} - 0)^{1} \\ &= \varepsilon_1 = 5260 \end{align} $$
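In code, the definition of \(\mu_r'\) above can be sketched directly. The following is a minimal Python version; the function name `moment` is my own, not from any library:

```python
def moment(xs, r, a=0):
    """Compute the r-th order moment of the sample xs around the point a."""
    return sum((x - a) ** r for x in xs) / len(xs)

print(moment([1, 2, 3], 1, 0))   # → 2.0, the ordinary mean
```

With \(r = 1\) and \(a = 0\) this reduces to the ordinary sample mean; with \(a\) equal to the mean and \(r = 2\) it gives the second-order moment used later.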

This is not the end of the story, of course, for the eleven data points can be condensed further.

$$ \underbrace{\mathbf{E}}_{\textrm{110 data points}} {\longrightarrow} \underbrace{ \mathbf{e}}_{\textrm{11 data points}} {\longrightarrow} \underbrace{\bar{\varepsilon}}_{\textrm{1 data point}} $$

If we repeat the procedure and take another first-order Pearsonian moment, we obtain \(\bar{\varepsilon} = 6754\):

$$ \begin{align} \mu_1' &= \frac{\varepsilon_{1} + \varepsilon_{2} + \ldots + \varepsilon_{11} }{11}\\ &= \frac{1}{11}\sum_{k=1}^{11} (\varepsilon_{k} - 0)^{1}\\ &= \bar{\varepsilon} = 6754 \end{align} $$

This is our educated guess for the true mean intake, based on the measurements.
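Using the 10-day averages read off the table, this condensation takes only a couple of lines of Python (a sketch of the computation above):

```python
# The eleven 10-day averages, e, from the table.
e = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]

mean_intake = sum(e) / len(e)   # first Pearsonian moment about zero
print(round(mean_intake))       # → 6754
```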

What can we say about the energy intake of this particular group of Indian women in relation to a recommended daily intake of 7725 kilojoules?

Very crudely, with the first-order moment, we can say that, on average, this group of Indian women had a daily energy intake below the recommended value, since 6754 falls 971 kilojoules short of 7725 kilojoules.

How confident are we in this crude deduction, you may ask?

In other words: how accurate is the sample mean as an estimate of the true mean daily energy intake of this particular group of Indian women? To answer this question, we need to introduce the second-order Pearsonian moment into our analysis.

$$ \begin{align} \mu_2 &= \frac{(\varepsilon_{1} - \bar{\varepsilon})^2 + \ldots + (\varepsilon_{11}-\bar{\varepsilon})^2 }{11} \\ &= \frac{1}{11}\sum_{k=1}^{11} (\varepsilon_{k} - \bar{\varepsilon})^{2} \\ &= 1185860 \end{align} $$

The second-order moment, \(\mu_2\), quantifies the spread of the data, and it is related to a statistic known as the sample standard deviation, \(s\).

$$s^2 = \left(\frac{n}{n-1}\right)\mu_2$$

The variance of the distribution of the sample mean is then given by \(\frac{s^2}{n}\), and the square root of this expression is called the standard error.

$${\rm SE(\textrm{mean intake})} = \frac{s}{\sqrt{n}}$$
  1. Pearson was appointed to the Goldsmid Chair of Applied Mathematics and Mechanics at UCL in 1884, a position previously occupied by W. K. Clifford (b. 1845, d. 1879), and he had to teach statics, dynamics, and mechanics to engineering students.

  2. Karl Pearson (1893) Asymmetrical frequency curves. Nature 48, pp. 615-616. See also Karl Pearson (1894) Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society A 185, pp. 71-110.

  3. In general, we use the Greek letter mu to denote a moment around the mean. When the reference point is not the mean value, \(\mu\) is marked with a prime, \(\mu'\). See, for example, Section 4 in Chapter 1 of Charles Weatherburn (1946) A First Course in Mathematical Statistics, Cambridge University Press, Cambridge, p. 8. In Pearson's 1893 paper, he used \(M_r\) instead of \(\mu_r\). Weatherburn noted that the Greek letter nu is also sometimes used to denote \(\mu_r'\), that is, \(\nu_r = \mu_r'\).



  4. Gosset took a sabbatical leave from Guinness to work in Pearson's laboratory between 1906 and 1907, where he developed his mathematical model of sample mean. In March 1908, he published his model under the pseudonym ‘Student'. Student (1908) The probable error of a mean, Biometrika 6(1), pp. 1-25.

  5. Standard error measures the variability of a sample statistic (e.g. \(\bar{\varepsilon}\)) across repeated samples, whereas standard deviation measures the variability within a dataset (\(\varepsilon_k\)).
| Female subject identifier, \(k\) | \((\varepsilon_k-0)^1\) | \((\varepsilon_k - \bar{\varepsilon})^2\) |
| --- | --- | --- |
| \(1\) | \(\varepsilon_1 = 5260\) | \((\varepsilon_1 - \bar{\varepsilon})^2 = 2230949.6\) |
| \(2\) | \(\varepsilon_2 = 5470\) | \((\varepsilon_2 - \bar{\varepsilon})^2 = 1647722.3\) |
| \(3\) | \(\varepsilon_3 = 5640\) | \((\varepsilon_3 - \bar{\varepsilon})^2 = 1240186\) |
| \(4\) | \(\varepsilon_4 = 6180\) | \((\varepsilon_4 - \bar{\varepsilon})^2 = 329058.7\) |
| \(5\) | \(\varepsilon_5 = 6390\) | \((\varepsilon_5 - \bar{\varepsilon})^2 = 132231.4\) |
| \(6\) | \(\varepsilon_6 = 6515\) | \((\varepsilon_6 - \bar{\varepsilon})^2 = 56947.3\) |
| \(7\) | \(\varepsilon_7 = 6805\) | \((\varepsilon_7 - \bar{\varepsilon})^2 = 2638.2\) |
| \(8\) | \(\varepsilon_8 = 7515\) | \((\varepsilon_8 - \bar{\varepsilon})^2 = 579674.6\) |
| \(9\) | \(\varepsilon_9 = 7515\) | \((\varepsilon_9 - \bar{\varepsilon})^2 = 579674.6\) |
| \(10\) | \(\varepsilon_{10} = 8230\) | \((\varepsilon_{10} - \bar{\varepsilon})^2 = 2179649.6\) |
| \(11\) | \(\varepsilon_{11} = 8770\) | \((\varepsilon_{11} - \bar{\varepsilon})^2 = 4065722.3\) |
| | \(\mu_1' = 6754\) | \(\mu_2 = 1185860\) |
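The whole chain from the eleven averages to the standard error can be sketched in plain Python (using the 10-day averages tabulated above):

```python
# The eleven 10-day averages from the table.
e = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
n = len(e)

mean = sum(e) / n                               # first moment about zero
mu2 = sum((x - mean) ** 2 for x in e) / n       # second moment about the mean
s2 = (n / (n - 1)) * mu2                        # sample variance
se = (s2 / n) ** 0.5                            # standard error of the mean

print(round(mean), round(mu2), round(se))   # → 6754 1185860 344
```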

In 1908, a Guinness process engineer named W. S. Gosset (b. 1876, d. 1937) figured out geometrically that the frequency curve of the sample mean is proportional to \(\left(1+\frac{t^2}{n-1}\right)^{-\frac{1}{2}n}\), that is,

$$f(t) \propto \left(1 + \tfrac{t^2}{n-1}\right)^{-\frac{1}{2}n}$$

since the following asymptotic behavior is well-known (put \(\square = t^2\) )

$$e^{-\square/2} = \frac{1}{\left(e^{\square}\right)^{1/2}} \sim \underbrace{\frac{1}{\left\{\left(1+\frac{\square}{n}\right)^n\right\}^{1/2}}}_{\textrm{when $n$ is big}} \sim \left( 1 + \tfrac{\square}{n}\right)^{-\frac{1}{2}n}$$
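This limit is easy to verify numerically. A quick sketch (my own check, not part of Gosset's argument):

```python
import math

t = 1.5                             # any fixed value of t will do
exact = math.exp(-t ** 2 / 2)
for n in (10, 100, 10000):
    approx = (1 + t ** 2 / n) ** (-n / 2)
    print(n, abs(approx - exact))   # the gap shrinks as n grows
```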

In these expressions, the \(t\)-statistic is defined by

$$t = \frac{ \bar{\varepsilon} - \textrm{mean intake}}{\textrm{SE}(\textrm{mean intake})} = \frac{ \bar{\varepsilon} - \textrm{mean intake}}{\sqrt{\frac{s^2}{n}}}$$

That is, Gosset figured out that, in general, the reference distribution associated with the \(t\)-statistic in an \(n\)-sample situation is:

$$ t = \frac{\displaystyle \sum_{k=1}^n \tfrac{\varepsilon_k}{n} - \textrm{(mean intake)}}{\displaystyle \sqrt{\frac{\displaystyle \sum_{k=1}^n \left(\tfrac{1}{n-1}\right)\left(\varepsilon_k - \sum_{m=1}^n \tfrac{\varepsilon_m}{n}\right)^2}{n}}} \sim \left(1+ \tfrac{t^2}{n-1}\right)^{-\frac{1}{2}n} $$

Equipped with the sample mean frequency curve, we can then compute a certain numerical range

$$\textrm{lower bound of $\bar{\varepsilon}$} < \textrm{mean intake} < \textrm{upper bound of $\bar{\varepsilon}$}$$

which satisfies the following condition:

$$\frac{\displaystyle \int_{\textrm{lower bound of $t(\bar{\varepsilon}, s^2, n)$}}^{\textrm{upper bound of $t(\bar{\varepsilon}, s^2, n)$}}\left(1 + \tfrac{t^2}{n-1}\right)^{-\frac{1}{2}n}\,dt}{\displaystyle \int_{-\infty}^\infty\left(1 + \tfrac{t^2}{n-1}\right)^{-\frac{1}{2}n}\,dt} = \frac{95}{100}$$

Since the frequency curve is symmetrical about the origin, we can write \(b = \) upper bound of \(t(\bar{\varepsilon}, s^2, n)\) and \(-b = \) lower bound of \(t(\bar{\varepsilon}, s^2, n)\); thus, with \(n = 11\), the integrals can be evaluated and simplified to

$$\frac{\frac{32}{63}\sqrt{\frac{2}{5}}\frac{b(196875+52500b^2+6300b^4+360b^6+8b^8)}{(10+b^2)^{9/2}}}{\frac{256}{63}\sqrt{\frac{2}{5}}} = \frac{95}{100} $$

And in this particular case, the value of \(b\) is

$$b = 2.2281388519862747484\ldots$$

The chore of going through the integrals and the resulting equation is usually not required in practice, for the value of \(b\) can be found in statistical tables (or by invoking a suitable function in a statistical software package). The 95%-confidence interval of the mean energy intake is therefore given by:
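For the curious, the closed-form equation above can also be solved numerically in a few lines of Python. This is a self-contained sketch using bisection; in practice one would simply look up the 97.5% point of Student's curve with 10 degrees of freedom (e.g. `qt(0.975, 10)` in R):

```python
def coverage(b):
    """Fraction of the area under (1 + t^2/10)^(-11/2) lying within (-b, b)."""
    b2 = b * b
    poly = 196875 + 52500 * b2 + 6300 * b2**2 + 360 * b2**3 + 8 * b2**4
    return b * poly / (8 * (10 + b2) ** 4.5)

lo, hi = 0.0, 10.0
for _ in range(200):                 # bisection: coverage(b) is increasing in b
    mid = (lo + hi) / 2
    if coverage(mid) < 0.95:
        lo = mid
    else:
        hi = mid

b = (lo + hi) / 2
print(b)   # ≈ 2.22814, matching the value quoted above
```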

$$\bar{\varepsilon} - b \times \sqrt{\tfrac{s^2}{n}} <\textrm{mean intake}< \bar{\varepsilon} + b\times \sqrt{\tfrac{s^2}{n}}$$

or numerically

$$5986 < \textrm{mean intake} < 7521$$
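The interval can be reproduced end-to-end in Python (a sketch; `b` is the critical value quoted earlier):

```python
# 10-day averages from the table, then mean, spread, standard error, interval.
e = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
n = len(e)
mean = sum(e) / n
mu2 = sum((x - mean) ** 2 for x in e) / n
s2 = (n / (n - 1)) * mu2
se = (s2 / n) ** 0.5
b = 2.2281388519862747              # 95% critical value, 10 degrees of freedom

lower, upper = mean - b * se, mean + b * se
print(round(lower), round(upper))   # → 5986 7521
```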

This essentially means that we are pretty sure (95% sure, to be exact) that the mean daily intake of this group of Indian women lies between roughly 6000 and 7500 kilojoules.

Also, since the recommended value of 7725 kilojoules lies outside the 95% confidence interval,

$$(5986 < \textrm{mean intake} < 7521) < 7725$$

the hypothesis that the true mean intake equals 7725 kilojoules cannot possibly have a \(p\)-value of more than 5%.
