ELEMENTARY CONCEPTS IN STATISTICS
INTRODUCTION
Many phenomena in life can be measured, either directly (height, weight,
prices, population…) or through indicators (health, satisfaction,
intelligence…).
Measured entities will be called variables here.
Measuring phenomena is often convenient or even necessary in order to explore
reality more accurately and reliably than through direct experience and/or in
order to take technical, political, administrative or other measures to improve
life. Examples include measuring the effect of a tax policy, of a therapeutic
treatment, or of a change in the proportion of ingredients in a product.
While small sets of data can often be analyzed and used directly, larger sets
of data are difficult for the human brain to process fully, due to cognitive
and other limitations.
Statistics offers tools for the use of quantitative data.
These tools are very useful in many walks of life where the characteristics of
large numbers of people, of products, of processes, of animals, or of other
objects need to be studied. Such cases abound in economics, in politics, in all
fields of science, in technology and in medicine, and statistics has become a
major tool in modern life.
DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS
Statistics
can be broadly classified into descriptive
statistics and inferential statistics.
Descriptive statistics offers tools to extract and present
features of a set of data which have been obtained by ‘observation’ and measurement.
In
particular, it focuses on central
tendencies of variables as well as on their dispersion (how much variability is there in the set of data?).
The mean (known to you as the average) is the most frequently used measure of
central tendency, but is not necessarily the best, as it is sensitive to
outliers, that is, extreme values (especially in small data sets). The median
is a value which divides the population into two equal sub-populations, one
with values smaller than the median, and the other with values larger than the
median. It is less sensitive to such outliers. For variables which take on
non-quantitative values, for instance choices in the menu at a restaurant or
colors of dresses that women prefer to wear in a certain population, there is
another measure of central tendency, the mode.
The mode is defined as the value of the variable which occurs most frequently
in the population.
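The three measures of central tendency described above can be tried out with a short Python sketch using the standard `statistics` module. The heights and menu choices below are invented for illustration only; they are not data from this text.

```python
import statistics

# Hypothetical heights in centimetres (illustrative values only).
heights = [160, 165, 170, 172, 175]
# The same data with one deliberate outlier added.
heights_with_outlier = heights + [250]

mean_before = statistics.mean(heights)
mean_after = statistics.mean(heights_with_outlier)
median_before = statistics.median(heights)
median_after = statistics.median(heights_with_outlier)

# The mode also works for non-quantitative values.
menu_choices = ["fish", "pasta", "fish", "salad", "fish", "pasta"]
most_common_choice = statistics.mode(menu_choices)

print("mean:  ", mean_before, "->", mean_after)      # 168.4 -> 182
print("median:", median_before, "->", median_after)  # 170 -> 171
print("mode:  ", most_common_choice)                 # fish
```

In this made-up data set a single outlier shifts the mean by more than 13 cm, while the median moves by only 1 cm.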
The most commonly used measure of dispersion is the standard deviation. It is
the square root of the mean squared difference between individual values of the
relevant variable and their mean value, which gives a value close to (but not
exactly equal to) the average absolute difference between individual values and
their mean.
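The difference between the standard deviation and the average absolute difference from the mean can be seen in a small Python sketch. The values are invented for illustration. The sketch uses `statistics.pstdev`, which treats the data as a whole population (dividing by the number of values); treating the data as a sample would call for `statistics.stdev` instead and give a slightly larger result.

```python
import statistics

# Hypothetical values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)  # 5

# Average absolute difference between each value and the mean.
mean_abs_dev = sum(abs(v - mean) for v in values) / len(values)

# Standard deviation: square root of the mean squared difference.
std_dev = statistics.pstdev(values)

print("mean absolute deviation:", mean_abs_dev)  # 1.5
print("standard deviation:     ", std_dev)       # 2.0
```

Because differences are squared before being averaged, large deviations weigh more heavily in the standard deviation, which is why it is never smaller than the average absolute difference.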
Descriptive statistics also supplies tools to measure correlation, the degree
to which values of one variable and values of another (for instance weight vs.
height) are related. Correlation is measured by correlation coefficients (there
are several popular correlation coefficients), which take values from 0 to 1
for positive correlation (as values of one variable go up, values of the other
go up as well), and from 0 to -1 for negative correlation (as values of one
variable go up, values of the other variable go down). 0 means no correlation,
and values close to 1 or -1 mean very strong correlation, in which you can
predict the value of one variable very well from the value of the other
variable.
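As an illustration, the sketch below computes one widely used correlation coefficient, Pearson's r, which measures linear association. The text does not say which coefficient it has in mind, so the choice of Pearson's r is an assumption, as are the function name `pearson_r` and the invented height and weight figures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance, unscaled).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, unscaled).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg), illustrative only.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 69, 77, 90]

r_height_weight = pearson_r(heights, weights)
r_perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])

print(round(r_height_weight, 3))     # close to +1: strong positive correlation
print(round(r_perfect_negative, 3))  # -1.0: perfect negative correlation
```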
Correlation
only says to what extent the values of two variables can be predicted from each
other. It does not say what the nature of the link between the two variables
is. In particular, it does not say anything about a potential causal relationship between them (i.e. the
idea that one causes the other, or conversely that one is the result of the
other): finding that people who live in country X tend to be richer than people
who live in country Y does not mean that living in country X is the cause of their financial status, though
living in country X may help. Science uses correlation, but seeks causal links
to construct theories with explanatory and predictive power. In other walks of
life, including technology, administration and politics, causal relations are
also sought in order to be able to produce desired outcomes through specific
actions.
Inferential statistics offers tools which help make inferences about a
population (a set of individual people, objects, actions, phenomena etc. in
which one is interested) from a sample (a subset of the population which is
selected for the purpose of learning something about the population) and thus
differs fundamentally from descriptive statistics, which focuses only on a
given set of data and does not seek to generalize beyond it. You are familiar
with inferential statistics at least because of its central role in elections:
inferences are made about the whole population as soon as samples are
available, long before the final count has been completed.
STATISTICAL TESTS AND SIGNIFICANCE
Inferential statistics is necessary because of variability:
in a given population (of people, of actions, of light bulbs, of apples…),
generally, one cannot count on all individuals or individual units being
identical. If they were, a sample of one would be enough to learn all there is
to learn about the population. By studying a sufficiently large sample from a
population, it is possible to draw useful inferences about it while limiting
uncertainty.
For
instance, Scandinavians are generally tall, at least when compared to Mediterraneans. However, some Scandinavians are short (and
some Mediterraneans are tall). Because of this
variability, measuring the height of one Scandinavian is not a very reliable
way of getting a good idea of how tall they tend to be (i.e. a central
tendency). Taking a sample of 100 Scandinavians (n=100) and judging on that
basis is far more reliable.
The height of people is a quantitative variable. As mentioned earlier,
variables can also take on qualitative values – this is the case of marital
status (married/single/divorced/widowed), of a person’s religion, etc.
Variables have a distribution, that is, a pattern in which their values are
found in the population. For
instance, in a class, the pass-fail distribution for an exam could be 50% pass
and 50% fail, or 20% pass and 80% fail, or 60% pass and 40% fail, etc. The
distribution of income in a national population could be 50% of low income, 45%
of medium income and 5% of high income. The distribution of heights in the male
population of Mr. Smith’s mathematics class could be:
In
statistics, the distribution of a particular variable in a population, for
instance the distribution of height, will be taken as characterizing this population. Statisticians know that the actual
nature of the population may have little to do with this variable, but they
consider this variable as a relevant indicator for whatever phenomenon they are
studying. In other words, the population studied is abstracted into a
distribution, regardless of the fact that it is composed of people who live in
certain districts in town, who like certain types of music, who dress in a
number of ways, etc. Statistically speaking, the population will be defined by
the distribution of its heights.
Inferential
statistics is often used to compare ‘populations’ and help decide whether they
are the same or not. Operationally, it will use tests to determine whether the distributions of the relevant variable
in two or more samples from these populations are similar enough to say that
they are actually the same population, or that they are too different and
therefore make it likely that the samples are taken from different populations.
For instance, if a sample of 10 students (n=10) is taken from a class in the
school where Mr. Smith teaches and it is found that the distribution of heights
in the sample is as follows:
Student A:
Student B:
Student C:
Student D:
Student E:
Student F:
Student G:
Student H:
Student I:
Student J:
How likely
is it that this sample is taken from the population of Mr. Smith’s mathematics
class?
In this
case, the answer is quite easy, because several students in this sample are
less than
Student A:
Student B:
Student C:
Student D:
Student E:
It is not
impossible for these students to come from Mr. Smith’s class, but note that the
mean height of the students in this sample is
Actually,
the answer will depend on the characteristics of the distribution of heights in
Mr. Smith’s class and on the distribution of heights in the sample. More
specifically, it will depend not only on the mean, but also on the dispersion
of each (generally measured by the standard
deviation). Statistical tests, based on probability theory,
which is part of mathematics, can tell us with a certain degree of confidence
whether it is likely that the sample is significantly
different from the population of students in Mr. Smith’s class, or whether
the differences observed between the mean height in the sample and the mean
height in Mr. Smith’s class are only due to random variation (a significant difference is, by
definition, a difference unlikely to occur due to random variation only).
In other
cases, experimenters draw a sample from one population and a sample from
another population, and then use inferential tests to help them determine
whether the two samples come from two populations in which the relevant
variable has the same distribution, in other words, from ‘the same population’
(remember that the ‘population’ was defined by the distribution of this
relevant variable).
In all these cases, inferential statistics will tell us whether a difference
can be considered significant, that is, unlikely to be due to random variation
alone, at a stated risk of being wrong in concluding that it is significant. In
other words, significance tests, as they are called, will not give us a certain
answer. They only quantify the risk of error attached to a given inference.
Some
comparisons are made between real
populations (for instance, when measuring the life span of a population of
light bulbs made in one plant as opposed to the life span of a population of
light bulbs made in another plant). Other comparisons are made between fictitious populations, that is populations that would exist or
will exist if/when the relevant experimental condition (treating patients with
a particular medicine, teaching a subject matter with a particular method)
were/will be implemented.
For instance, when initially testing the effect of a new drug on patients, one
might give 20 patients the drug and 20 other patients a placebo, and then
measure and compare the outcomes in the two samples. Obviously, there is no
‘real’ population of patients who are given the new drug, since this is the
first time it is tested and it is only tested on samples. Neither is there a
real population of patients who receive the placebo. The samples only represent
a fictitious population of patients who would receive the new drug (and who
will receive it if the results of the test are satisfactory), and a fictitious
population of patients who would receive the placebo. If the tests say the
outcome is significantly different in patients who receive the drug, this can
be taken to mean that these two fictitious populations are really different,
or, in other words, that the drug really makes a difference.
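One simple way to see what such a test does is a permutation test, sketched below in Python. This is only one of many possible significance tests and is not necessarily the one the text has in mind. The outcome scores, the group size of 10 and the function name `permutation_p_value` are all invented for illustration. The idea: if the drug made no difference, the group labels would be arbitrary, so the scores can be reshuffled between the two groups many times to see how often chance alone produces a difference in means at least as large as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often random relabelling yields a difference in means
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical outcome scores (higher = better), illustrative only.
drug = [12, 14, 15, 13, 16, 15, 14, 17, 13, 15]
placebo = [10, 11, 9, 12, 10, 11, 10, 9, 12, 11]

p_value = permutation_p_value(drug, placebo)
print("estimated p-value:", p_value)
```

A small estimated p-value means that reshuffling rarely reproduces a gap as large as the observed one, so random variation alone is an unlikely explanation for it.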
Inferential tests may indicate that there is a significant difference between
populations at a level of p < .05 or at a level of p < .01. This indicates
respectively that, if the populations were in fact the same, a difference at
least as large as the one observed would arise through random variation alone
less than 1 time out of 20 (5%) or less than 1 time out of 100 (1%). The risk
of wrongly concluding that the populations differ is therefore low. One problem, however,
is that the tests do not tell you by how
much the populations differ. If the difference is very small, albeit
significant, the difference may not have practical implications. If for example
tests show that gas consumption between two models of cars is significantly
different, but the difference is very small, it is easily offset by how the
cars are driven, by how much weight they carry, by traffic conditions etc.
By the way,
if a significance test does not yield ‘significant’ results, this does not mean
that the two populations are not different. It only means that the differences
found in the characteristics of the sample are not large enough to conclude
that they are not due to random phenomena.
Another
problem is that these values of significance have been chosen arbitrarily. A
difference not significant at .01 can be significant at .02 or .03 . In other
words, an investigator’s decision as to whether s/he can consider a difference
significant or not is to a considerable extent arbitrary. There are other
reasons for which some scientists are not happy with inferential tests as the
sole or main criterion for decisions.
SAMPLING
In order to be able to infer something about the population from the sample,
the sample must be representative of the population.
In statistical terms, this means that with respect to the relevant variable, it
should have features similar to those of the population, and in particular that
it should be free of bias.
Bias is a
systematic error. For instance, if, in order to study the economic behavior of
the population of a city, you sample only people from areas where rich people
live, you will probably have a bias. You will probably have a bias in the
opposite direction if you only sample people from poor areas. But you may also
have a bias if you sit next to a shop on a weekday in the morning and sample
clients coming into the shop, because your sample will probably have more
unemployed and retired people than working people, and their economic behavior
may well be different from that of working people.
The only
way to make sure there is no bias is to do some kind of random sampling. Random sampling refers to a procedure where each
individual in the population (or, in some cases, in a subgroup of a population)
has the same chance of being drawn in the sample.
However, random sampling only eliminates bias, that is, systematic error. It
cannot eliminate random error, that is, some chance difference between the
characteristics of the sample and the characteristics of the population. Other
sampling methods are also used. For instance, if you want to make sure that
particular sub-groups of the population (for instance an ethnic minority, an
occupational minority, or an economic minority) are properly represented in the
sample, a Simple Random Sample, that is, a sample drawn directly from a list of
all individuals in a population, may not be the best. If a city with a
population of 100 000 has such a minority of 10 000 and you draw a random
sample of 1000 people, chance alone may leave this minority over-represented or
under-represented in the sample. In that case, a better method would be to draw
900 people at random from the population other than this group, and 100 people
at random from this group. In this way, you are sure that you will have the
same proportion of individuals from that group in the sample as in the
population and that its characteristics are taken into account. Such a
procedure is called Stratified Sampling, in which you divide the population
into strata and sample individuals at random from each stratum.
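The city example above can be sketched in Python with the standard `random` module. The population is represented by invented ID numbers (IDs below 10 000 stand for the minority group); these identifiers and the fixed seed are assumptions made only so that the sketch is runnable and reproducible.

```python
import random

rng = random.Random(42)  # fixed seed, chosen arbitrarily

# Hypothetical city of 100 000 people, identified by number.
# IDs 0..9999 stand for the minority group, the rest for everyone else.
minority = list(range(10_000))
majority = list(range(10_000, 100_000))
population = minority + majority

# Simple random sample: every individual has the same chance of being drawn,
# so the number of minority members in the sample varies from draw to draw.
simple_sample = rng.sample(population, 1000)
minority_in_simple = sum(1 for person in simple_sample if person < 10_000)

# Stratified sample: draw at random within each stratum, in proportion to
# the stratum's share of the population (10% and 90%).
stratified_sample = rng.sample(minority, 100) + rng.sample(majority, 900)
minority_in_stratified = sum(1 for person in stratified_sample if person < 10_000)

print("minority members, simple random sample:", minority_in_simple)
print("minority members, stratified sample:   ", minority_in_stratified)
```

The stratified sample always contains exactly 100 members of the minority, whereas the count in the simple random sample fluctuates around 100.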
It should be noted that as sample size (the number of units in the sample)
increases, the uncertainty associated with variability decreases (sampling
error decreases). However, the effect is not proportional; it is much weaker:
sampling error typically shrinks only with the square root of the sample size.
This is why relatively small samples, on the order of a few thousand units, can
yield relatively accurate results for a population of tens of millions, while
multiplying sample size by 10, which will increase sampling costs very
substantially, will reduce sampling error only by a factor of about 3.
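The weak effect of sample size can be made concrete with a small Python sketch. It relies on assumptions not stated in the text: a simple random sample from a very large population, a variable that is a proportion (for instance the share of voters choosing a candidate), and the usual normal-approximation formula for a 95% margin of error, z * sqrt(p(1 - p) / n) with z = 1.96. The function name `margin_of_error` is hypothetical.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for an estimated proportion,
    assuming simple random sampling from a very large population."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: margin of error = {100 * margin_of_error(n):.2f} percentage points")

# Multiplying the sample size by 10 divides the margin of error by sqrt(10).
improvement_factor = margin_of_error(1_000) / margin_of_error(10_000)
print("improvement factor:", round(improvement_factor, 2))
```

Under these assumptions the size of the population does not enter the formula at all, which is why a few thousand units can be enough for a population of tens of millions.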