ELEMENTARY CONCEPTS IN STATISTICS

 

 

INTRODUCTION

 

Many phenomena in life can be measured, either directly (height, weight, prices, population…) or through indicators (health, satisfaction, intelligence…).

Measured entities will be called variables here.

 

Measuring phenomena is often convenient or even necessary in order to explore reality more accurately and reliably than through direct experience, and/or in order to take technical, political, administrative or other measures to improve life (for example, measuring the effect of a tax policy or of a therapeutic treatment, or changing the proportion of ingredients in a product).

 

While small sets of data can often be analyzed and used directly, larger sets of data are difficult for the human brain to process fully, due to cognitive and other limitations.

 

Statistics offers tools for the use of quantitative data.

 

These tools are very useful in many walks of life where the characteristics of large numbers of people, products, processes, animals or other objects need to be studied. Such cases abound in economics, in politics, in all fields of science, in technology and in medicine, and statistics has become a major tool in modern life.

 

DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS

 

Statistics can be broadly classified into descriptive statistics and inferential statistics.

 

Descriptive statistics offers tools to extract and present features of a set of data which have been obtained by ‘observation’ and measurement.

In particular, it focuses on central tendencies of variables as well as on their dispersion (how much variability is there in the set of data?).

 

The mean (known to you as the average) is the most frequently used measure of central tendency, but is not necessarily the best, as it is sensitive to outliers, especially in small data sets. The median is a value which divides the population into two equal sub-populations, one with values smaller than the median, and the other with values larger than the median. It is less sensitive to such outliers. For variables which take on non-quantitative values, for instance choices in the menu at a restaurant or colors of dresses that women prefer to wear in a certain population, there is another measure of central tendency, the mode. The mode is defined as the value of the variable which occurs most frequently in the population.
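The three measures can be compared on a small example. The sketch below uses Python's standard `statistics` module; the income figures and menu choices are hypothetical values invented for illustration, not data from this text.

```python
from statistics import mean, median, mode

# Hypothetical incomes (arbitrary units); the last value is an outlier.
incomes = [20, 22, 25, 27, 30, 500]

income_mean = mean(incomes)      # pulled far upward by the single outlier
income_median = median(incomes)  # stays close to the bulk of the values

# Hypothetical menu choices: a non-quantitative variable, so only the mode applies.
choices = ["fish", "pasta", "fish", "salad", "fish", "pasta"]
favourite = mode(choices)        # the most frequent value

print(income_mean, income_median, favourite)
```

Here five of the six incomes lie between 20 and 30, yet the mean is above 100, while the median remains between 25 and 27.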

 

The most commonly used measure of dispersion is the standard deviation. It is the square root of the average squared difference between individual values of the variable and their mean, which gives a value close to (but not exactly) the average absolute difference between individual values and the mean.
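The difference between the standard deviation and the average absolute difference from the mean can be seen in a short sketch. The data are hypothetical, and the population form of the standard deviation (`statistics.pstdev`) is assumed here; a sample standard deviation would divide by n - 1 instead of n.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

m = mean(values)

# Average absolute difference between each value and the mean.
mean_abs_dev = mean([abs(v - m) for v in values])

# Population standard deviation: square root of the average squared difference.
std_dev = pstdev(values)

print(m, mean_abs_dev, std_dev)
```

For these values the average absolute difference is 1.5 and the standard deviation is 2.0: close, but not equal, because squaring gives more weight to values far from the mean.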

 

Descriptive statistics also supplies tools to measure correlation, the degree to which values of one variable and values of another (for instance weight vs. height) are related. Correlation is measured by correlation coefficients (there are several popular correlation coefficients), which take values from 0 to 1 for positive correlation (as values of one variable go up, values of the other go up as well), and from 0 to -1 for negative correlation (as values of one variable go up, values of the other variable go down). A value of 0 means no correlation, and values close to 1 or -1 mean very strong correlation, in which you can predict the value of one variable very well from the value of the other variable.
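As an illustration, the sketch below computes one of these coefficients, the Pearson correlation coefficient, from its definition. The choice of Pearson's coefficient is an assumption (the text does not single one out), and the height and weight figures are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

heights = [160, 165, 170, 175, 180]  # hypothetical heights in cm
weights = [55, 60, 66, 70, 77]       # hypothetical weights in kg

r_pos = pearson_r(heights, weights)                # close to +1
r_neg = pearson_r(heights, [-w for w in weights])  # same strength, opposite sign

print(round(r_pos, 3), round(r_neg, 3))
```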

 

Correlation only says to what extent the values of two variables can be predicted from each other. It does not say what the nature of the link between the two variables is. In particular, it does not say anything about a potential causal relationship between them (i.e. the idea that one causes the other, or conversely that one is the result of the other): finding that people who live in country X tend to be richer than people who live in country Y does not mean that living in country X is the cause of their financial status, though living in country X may help. Science uses correlation, but seeks causal links to construct theories with explanatory and predictive power. In other walks of life, including technology, administration and politics, causal relations are also sought in order to be able to produce desired outcomes through specific actions.

 

Inferential statistics offers tools which help make inferences about a population (a set of individual people, objects, actions, phenomena etc. in which one is interested) from a sample (a subset of the population which is selected for the purpose of learning something about the population). It thus differs fundamentally from descriptive statistics, which only focuses on a given set of data and does not seek to generalize beyond it. You are familiar with inferential statistics at least because of its central role in elections: inferences are made about the whole population as soon as samples are available, long before the final count has been completed.

 

STATISTICAL TESTS AND SIGNIFICANCE

 

Inferential statistics is necessary because of variability: in a given population (of people, of actions, of light bulbs, of apples…), one generally cannot count on all individuals or individual units being identical. If they were, a sample of one would be enough to learn all there is to learn about the population. By studying a sufficiently large sample from a population, it is possible to draw useful inferences about it while limiting uncertainty.

 

For instance, Scandinavians are generally tall, at least when compared to Mediterraneans. However, some Scandinavians are short (and some Mediterraneans are tall). Because of this variability, measuring the height of one Scandinavian is not a very reliable way of getting a good idea of how tall they tend to be (i.e. a central tendency). Taking a sample of 100 Scandinavians (n=100) and judging on that basis is far more reliable.

 

The height of people is a quantitative variable. As mentioned earlier, variables can also take on qualitative values: this is the case of marital status (married/single/divorced/widowed), of a person’s religion, etc. Variables have a distribution, that is, a pattern in which their values are found in the population. For instance, in a class, the pass-fail distribution for an exam could be 50% pass and 50% fail, or 20% pass and 80% fail, or 60% pass and 40% fail, etc. The distribution of income in a national population could be 50% of low income, 45% of medium income and 5% of high income. The distribution of heights in the male population of Mr. Smith’s mathematics class could be:

 

175 cm    1 student

176 cm    0 students

177 cm    3 students

178 cm    1 student

179 cm    2 students

180 cm    3 students

181 cm    4 students

182 cm    5 students

183 cm    2 students

184 cm    0 students

185 cm    1 student

186 cm    0 students

187 cm    2 students
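The measures of central tendency and dispersion introduced earlier can be computed directly from this table. The sketch below copies the table into a dictionary (the variable names are illustrative) and treats the class as a whole population, which is why the population standard deviation is used.

```python
from statistics import mean, median, pstdev

# Height in cm -> number of students, copied from the table above.
height_counts = {
    175: 1, 176: 0, 177: 3, 178: 1, 179: 2, 180: 3, 181: 4,
    182: 5, 183: 2, 184: 0, 185: 1, 186: 0, 187: 2,
}

# Expand the table into one height per student.
heights = [h for h, count in height_counts.items() for _ in range(count)]

n = len(heights)
class_mean = mean(heights)
class_median = median(heights)
class_mode = max(height_counts, key=height_counts.get)  # most frequent height
class_sd = pstdev(heights)

print(n, class_mean, class_median, class_mode, round(class_sd, 2))
```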

 

In statistics, the distribution of a particular variable in a population, for instance the distribution of height, will be taken as characterizing this population. Statisticians know that the actual nature of the population may have little to do with this variable, but they consider this variable as a relevant indicator for whatever phenomenon they are studying. In other words, the population studied is abstracted into a distribution, regardless of the fact that it is composed of people who live in certain districts in town, who like certain types of music, who dress in a number of ways, etc. Statistically speaking, the population will be defined by the distribution of its heights.

 

Inferential statistics is often used to compare ‘populations’ and help decide whether they are the same or not. Operationally, it will use tests to determine whether the distributions of the relevant variable in two or more samples from these populations are similar enough to say that they are actually the same population, or that they are too different and therefore make it likely that the samples are taken from different populations. For instance, if a sample of 10 students (n=10) is taken from a class in the school where Mr. Smith teaches and it is found that the distribution of heights in the sample is as follows:

 

Student A:  166 cm

Student B:  160 cm

Student C:  170 cm

Student D:  176 cm

Student E:   178 cm

Student F:   170 cm

Student G:   155 cm

Student H:   160 cm

Student I:    180 cm

Student J:    165 cm

 

How likely is it that this sample is taken from the population of Mr. Smith’s mathematics class?

 

In this case, the answer is quite easy, because several students in this sample are less than 175 cm tall, and no student in Mr. Smith’s class is as short, so the sample must come from a different population. What about a sample of 5 students from the school (n=5) with the following height distribution?

 

Student A: 185 cm

Student B: 187 cm

Student C: 187 cm

Student D: 182 cm

Student E: 182 cm

 

It is not impossible for these students to come from Mr. Smith’s class, but note that the mean height of the students in this sample is 184.6 cm, whereas the mean height of the students in Mr. Smith’s class, computed from the table above, is about 180.9 cm. Is this difference sufficient to draw the conclusion that the sample is not from Mr. Smith’s class?

 

Actually, the answer will depend on the characteristics of the distribution of heights in Mr. Smith’s class and on the distribution of heights in the sample. More specifically, it will depend not only on the mean, but also on the dispersion of each (generally measured by the standard deviation). Statistical tests, based on probability theory, which is part of mathematics, can tell us with a certain degree of confidence whether it is likely that the sample is significantly different from the population of students in Mr. Smith’s class, or whether the differences observed between the mean height in the sample and the mean height in Mr. Smith’s class are only due to random variation (a significant difference is, by definition, a difference unlikely to occur due to random variation only).
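One way to make "due to random variation" concrete is to ask how often a random group of 5 students from Mr. Smith’s class would be at least as tall, on average, as the sample of 5. The sketch below enumerates every possible group of 5 among the 24 students in the table. This exact enumeration is an illustrative assumption (random drawing without replacement from the class), not the particular statistical test the text has in mind.

```python
from itertools import combinations

# Height in cm -> number of students, copied from the table for Mr. Smith's class.
height_counts = {
    175: 1, 176: 0, 177: 3, 178: 1, 179: 2, 180: 3, 181: 4,
    182: 5, 183: 2, 184: 0, 185: 1, 186: 0, 187: 2,
}
class_heights = [h for h, count in height_counts.items() for _ in range(count)]

sample = [185, 187, 187, 182, 182]  # the sample of 5 students from the school
sample_total = sum(sample)          # comparing totals avoids rounding of means

groups = 0
at_least_as_tall = 0
for group in combinations(class_heights, len(sample)):
    groups += 1
    if sum(group) >= sample_total:
        at_least_as_tall += 1

# Proportion of all possible groups of 5 that are at least as tall as the sample.
proportion = at_least_as_tall / groups
print(at_least_as_tall, groups, proportion)
```

Only a very small proportion of the possible groups reach the sample's mean height, which is the kind of evidence a significance test summarizes.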

 

In other cases, experimenters draw a sample from one population and a sample from another population, and then use inferential tests to help them determine whether the two samples come from two populations in which the relevant variable has the same distribution, in other words, from ‘the same population’ (remember that the ‘population’ was defined by the distribution of this relevant variable).
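A two-sample comparison of this kind can be sketched with a permutation test, one of several possible tests and not one named in the text. As an assumption made purely for illustration, the two samples of heights listed earlier (n=10 and n=5) are treated as the two samples to compare. The idea: if both samples came from the same population, any split of the 15 pooled values into groups of 10 and 5 would be equally likely.

```python
from itertools import combinations

sample_a = [166, 160, 170, 176, 178, 170, 155, 160, 180, 165]  # n=10 sample
sample_b = [185, 187, 187, 182, 182]                           # n=5 sample

pooled = sample_a + sample_b
observed_total_b = sum(sample_b)

splits = 0
as_extreme = 0
# Enumerate every way of choosing which 5 of the 15 pooled values form group B.
for indices in combinations(range(len(pooled)), len(sample_b)):
    splits += 1
    # The pooled total is fixed, so a group-B total at least as large as the
    # observed one means a difference in means at least as large as observed.
    if sum(pooled[i] for i in indices) >= observed_total_b:
        as_extreme += 1

p_value = as_extreme / splits  # one-sided
print(as_extreme, splits, p_value)
```

The observed split is the only one out of all possible splits that produces so large a difference, so random variation alone is an unlikely explanation.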

 

In all these cases, inferential statistics will tell us whether a difference is significant, that is, unlikely to be due to random variation alone, at a stated risk of being wrong in concluding that it is significant. In other words, significance tests, as they are called, will not give us a certain answer. They will only tell us how likely it is that a difference as large as the one observed would arise from random variation alone.

 

Some comparisons are made between real populations (for instance, when measuring the life span of a population of light bulbs made in one plant as opposed to the life span of a population of light bulbs made in another plant). Other comparisons are made between fictitious populations, that is, populations that would exist if the relevant experimental condition (treating patients with a particular medicine, teaching a subject matter with a particular method) were implemented, or that will exist once it is.

 

For instance, when initially testing the effect of a new drug on patients, one might give 20 patients the drug and 20 other patients a placebo, and then measure and compare the outcomes in the two samples. Obviously, there is no ‘real’ population of patients who are given the new drug, since this is the first time it is tested and it is only tested on samples. Neither is there a real population of patients who receive the placebo. The samples only represent a fictitious population of patients who would receive the new drug (and who will receive it if results of the test are satisfactory), and a fictitious population of patients who would receive the placebo. If the tests say the outcome is significantly different in patients who receive the drug, this can be taken to mean that these two fictitious populations are really different, or, in other words, that the drug really makes a difference.

 

Inferential tests may indicate that there is a significant difference between populations at a level of p < .05 or at a level of p < .01. This indicates respectively that, if the populations were in fact the same, a difference at least as large as the one observed would arise from random variation less than 1 time out of 20 (5%) or less than 1 time out of 100 (1%). The risk of wrongly concluding that the populations differ when they do not is therefore low. One problem, however, is that the tests do not tell you by how much the populations differ. If the difference is very small, albeit significant, it may not have practical implications. If for example tests show that gas consumption differs significantly between two models of cars, but the difference is very small, it is easily offset by how the cars are driven, by how much weight they carry, by traffic conditions etc.

 

By the way, if a significance test does not yield ‘significant’ results, this does not mean that the two populations are not different. It only means that the differences found in the characteristics of the sample are not large enough to conclude that they are not due to random phenomena.

 

Another problem is that these values of significance have been chosen arbitrarily. A difference not significant at .01 can be significant at .02 or .03 . In other words, an investigator’s decision as to whether s/he can consider a difference significant or not is to a considerable extent arbitrary. There are other reasons for which some scientists are not happy with inferential tests as the sole or main criterion for decisions.

 

SAMPLING

 

In order to be able to infer something about the population from the sample, the sample must be representative of the population. In statistical terms, this means that in respect of the relevant variable, it should have features similar to those of the population, and in particular that it should be free of bias.

 

Bias is a systematic error. For instance, if, in order to study the economic behavior of the population of a city, you sample only people from areas where rich people live, you will probably have a bias. You will probably have a bias in the opposite direction if you only sample people from poor areas. But you may also have a bias if you sit next to a shop on a weekday in the morning and sample clients coming into the shop, because your sample will probably have more unemployed and retired people than working people, and their economic behavior may well be different from that of working people.

 

The only way to make sure there is no bias is to do some kind of random sampling. Random sampling refers to a procedure where each individual in the population (or, in some cases, in a subgroup of a population) has the same chance of being drawn in the sample.

 

However, random sampling only eliminates bias, that is, systematic error. It cannot eliminate random error, that is, chance differences between the characteristics of the sample and the characteristics of the population.

Other sampling methods are also used. For instance, if you want to make sure that particular sub-groups of the population are adequately represented in the sample (for instance an ethnic minority, or an occupational minority, or an economic minority), a Simple Random Sample, that is, a sample directly drawn from a list of all individuals in a population, may not be the best. If a city with a population of 100 000 has such a minority of 10 000 and you draw a random sample of 1000 people, this sample may, by chance, include noticeably fewer or more than 100 individuals from that minority. In that case, a better method would be to draw 900 people at random from the population other than this group, and 100 people at random from this group. In this way, you are sure that you will have the same proportion of individuals from that group in the sample as in the population and that its characteristics are taken into account. Such a procedure is called Stratified Sampling, in which you divide the population into strata and sample individuals at random from each stratum.
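The city example can be sketched in a few lines. The population below is synthetic (100 000 labelled individuals, of whom 10 000 belong to the minority), and the fixed seed is only there to make the illustration reproducible; both are assumptions, not part of the original example.

```python
import random

rng = random.Random(0)  # fixed seed, only so that the illustration is reproducible

# Synthetic population: individuals 0..9999 are the minority, the rest the majority.
population = [("minority" if i < 10_000 else "majority", i) for i in range(100_000)]

# Simple Random Sample: the minority count varies by chance around 100.
simple_sample = rng.sample(population, 1000)
simple_minority = sum(1 for group, _ in simple_sample if group == "minority")

# Stratified Sampling: draw at random within each stratum, in proportion to its size.
minority = [person for person in population if person[0] == "minority"]
majority = [person for person in population if person[0] == "majority"]
stratified_sample = rng.sample(majority, 900) + rng.sample(minority, 100)
stratified_minority = sum(1 for group, _ in stratified_sample if group == "minority")

print(simple_minority, stratified_minority)
```

The stratified sample always contains exactly 100 individuals from the minority, whereas the count in the simple random sample changes from one draw to the next.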

 

It should be noted that as sample size (the number of units in the sample) increases, the uncertainty associated with variability decreases (sampling error decreases). However, the effect is not proportional; it is much weaker: sampling error typically shrinks in proportion to the square root of the sample size. This is why relatively small samples, in the order of a few thousand units, can yield relatively accurate results for a population of tens of millions, and multiplying sample size by 10, which will increase sampling costs very substantially, will only reduce sampling error by a factor of about 3.
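The weak effect of sample size can be illustrated with the usual approximate margin of error for a proportion estimated from a simple random sample. The formula, the 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5 are standard textbook assumptions introduced here for illustration; the population is assumed to be much larger than the sample.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n units."""
    return z * sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 100_000):
    # Margin of error expressed in percentage points.
    print(n, round(100 * margin_of_error(n), 2))

# Ten times more units divides the margin of error by sqrt(10), about 3.16.
improvement = margin_of_error(1_000) / margin_of_error(10_000)
print(round(improvement, 2))
```

Note that the size of the population does not appear in the formula: under these assumptions, accuracy depends on the number of units sampled, not on the fraction of the population they represent.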