Limited confidence in tests of significance
by Richard Corlett
The idea that the basic methodology of ecological research involves testing the statistical significance of a null hypothesis is so deeply ingrained in postgraduate students that most will be surprised to learn that there is probably no living statistician who supports this as a general approach, and that some believe it is never the best thing to do. The statistical reasons for this are clearly laid out in a recent paper by Douglas Johnson in the Journal of Wildlife Management ("The insignificance of statistical significance testing", JWM 63(3), 763-772), but they have been known and widely reported for decades. I will only consider the most ecologically relevant points here.
The most fundamental objection is that the null hypotheses tested are usually known, a priori, to be untrue. Thus, if they are "rejected", the test only confirms what is already known, while if they cannot be rejected it simply means the sample size was too small. Either way, we gain nothing from the test. Ecological null hypotheses most often state that some parameter equals zero or that two or more parameters are equal. In practice, however, no two things are ever exactly equal in biological systems and no effect worth the effort of testing is likely to be precisely zero. A parameter may be almost zero, and two or more parameters may be almost equal, but a large enough sample size will still show a "statistically significant" difference. Good examples of this misuse of significance tests in local studies are comparisons of ecological parameters (such as the abundance of an organism or the value of some physical variable) between seasons, when nobody seriously believes that any parameter of ecological interest is exactly the same throughout the year; between two or more sites, where, again, a large enough sample size will always show a difference; or between two or more species, which are, by definition, different.
To summarize the story so far: most null hypotheses tested in ecology - particularly non-experimental ecology - really are false, so P can be made as small as you want, and thus as "significant" as you want, simply by increasing the sample size. The value of P, and thus the "significance" of the test, is therefore arbitrary. The use of a standard cutoff value, typically P = 0.05, irrespective of sample size, is even less justifiable, since it does not allow you to use your common sense to decide whether the result has any biological significance.
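To see how arbitrary this is, consider a small simulation. It is not from Johnson's paper, and the effect size, standard deviation and sample sizes below are invented purely for illustration: two "sites" whose true means differ by an amount far too small to matter ecologically, compared with a standard two-sample t-test at ever larger sample sizes.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_difference = 0.05   # a hypothetical effect, far too small to matter ecologically
sd = 1.0

for n in (20, 200, 2000, 20000):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(true_difference, sd, n)
    result = stats.ttest_ind(a, b)
    print(f"n = {n:>6}  P = {result.pvalue:.4f}")

# The effect never gets any bigger, but with a large enough sample the
# comparison is always "statistically significant".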
The second major problem discussed by Johnson is that P does not, in general, represent the probability that the null hypothesis is true, or the probability that the results were obtained by chance, despite the widespread belief that it does - at least not in the situations most relevant to ecologists. That the usual interpretation is not precisely true will come as no surprise to ecologists, who are used to relying on the supposed "robustness" of the statistical tests they apply in situations where the underlying assumptions of the test are not completely met, but the potential magnitude of the errors is much larger than we like to believe.
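One simple way to see the distinction is to simulate a large number of studies in which some of the null hypotheses really are true and some are not, and then ask what fraction of the "significant" results came from true nulls. All the numbers below (the mix of true and false nulls, the effect size and the sample size) are invented for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_group, n_studies = 30, 20000
effect = 0.5                                   # hypothetical effect when the null is false
null_is_true = rng.random(n_studies) < 0.5     # half the nulls really are true

rejected = 0
rejected_with_true_null = 0
for null in null_is_true:
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0 if null else effect, 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        rejected += 1
        rejected_with_true_null += int(null)

print("Fraction of 'significant' results where the null was in fact true:",
      round(rejected_with_true_null / rejected, 2))

# The answer depends on the power of the test and on how many of the tested
# nulls were true; it is not 5%, and it can be very much larger.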
No wonder Clark (1963, in Johnson 1999) stated that statistical hypothesis testing was "no longer a sound or fruitful basis for statistical investigation" and Bakan (1966, again quoted in Johnson's paper) called it "essential mindlessness in the conduct of research". Why, then, do we still do it? Johnson suggests many reasons, but believes the major one is physics envy. We envy the ability of physical scientists to say things which are precisely, universally and objectively true. Yet I find it hard to think of any non-trivial statement in ecology which I truly believe to have these properties. The ecologically-useful statements which appear to most nearly approach universality (e.g. "bulbuls are key seed dispersal agents in degraded Asian landscapes") do so because similar results have been found by different investigators, using different methods, at different sites: not because of high levels of statistical significance obtained in a single study at a single site over a limited period. Replication at different times and places, and by different methods and people, is the key to confidence in ecology.
What are the alternatives to statistical significance testing? The first step is to ask yourself what it is you really want to know. Very rarely is this "Is A different from B?" or "Is C different from zero?". Given that we usually only investigate effects which we think are likely to be non-zero, a more appropriate question is "How big is the effect and how reliable is our estimate of it?". In such cases, parameter estimates with confidence limits are far more useful than tests of significance. And if, for instance, the 95% confidence limits include zero, we know that a significance test, had we done one, would have given a P-value greater than 0.05. Looking through my own papers, and the theses of the postgraduates I have supervised, I find it obvious (in retrospect) that most of the hypothesis tests should have been replaced by confidence intervals.
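As a sketch of what this looks like in practice, here is a made-up seasonal comparison reported as an effect size with 95% confidence limits rather than as a bare P-value. The "wet" and "dry" season counts are simulated, not real data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
wet = rng.normal(12.0, 4.0, 25)   # hypothetical counts in the wet season
dry = rng.normal(10.0, 4.0, 25)   # hypothetical counts in the dry season

diff = wet.mean() - dry.mean()
df = wet.size + dry.size - 2
pooled_var = ((wet.size - 1) * wet.var(ddof=1) +
              (dry.size - 1) * dry.var(ddof=1)) / df
se = np.sqrt(pooled_var * (1.0 / wet.size + 1.0 / dry.size))
t_crit = stats.t.ppf(0.975, df)   # two-sided 95% limits

lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"Estimated difference: {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")

# If the interval excludes zero, the equivalent two-sided test would give
# P < 0.05; either way, the interval also says how large the effect could
# plausibly be, which is usually the ecologically interesting question.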
In applied ecology - including studies related to conservation or impact assessment - the question of interest is usually "Should we do X?" (Should we build the road? Should we burn the shrubland? Should we use this method or that?). In such cases, neither hypothesis testing nor confidence limits are adequate, since they ignore the relative costs of alternative actions. Thus a "non-significant" risk to human health or an endangered species - or an estimated effect which includes zero in its confidence limits - may still be unacceptable, while a small but "statistically significant" risk to a common species may be acceptable. More generally, Type II errors (failing to reject a false null hypothesis) may often be far more expensive than the Type I errors (rejecting a true null hypothesis) which hypothesis testing is intended to guard against. The preferred tool in such cases is decision theory, about which I know little, but confidence limits and common sense are a good, if less formal, basis for making such decisions.
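The flavour of the expected-cost reasoning is, at least, easy to illustrate. In the toy example below every number is invented: the probability of an impact would come from the parameter estimate and its confidence limits, and the costs from whoever has to live with the consequences.

p_impact = 0.15   # estimated probability that the proposal harms the species;
                  # small enough that a conventional test might call it "non-significant"

cost = {          # (action, impact actually occurs) -> cost, in arbitrary units
    ("proceed", True):  1000.0,   # irreversible damage
    ("proceed", False):    0.0,
    ("hold",    True):    50.0,   # delay and mitigation
    ("hold",    False):   50.0,
}

for action in ("proceed", "hold"):
    expected = (p_impact * cost[(action, True)] +
                (1 - p_impact) * cost[(action, False)])
    print(f"{action}: expected cost = {expected:.0f}")

# Here even a 15% risk makes proceeding the worse choice, because the cost of
# missing a real impact (a Type II error) dwarfs the cost of holding back when
# there was in fact no impact (a Type I error).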
Every stats article should have a caveat, and I think I have a spare one here in my pocket, along with several unidentified seeds, a useless key, and a thousand-Forint banknote which must have survived at least two washes. Johnson was writing for the Journal of Wildlife Management, which publishes papers on wolves, moose, oryx and other animals which are not very amenable to well-designed experiments with random assignment of treatments. Significance tests are more easily justified, and more likely to be accurate, for such experiments than for observational data. However, the problems of null hypotheses which are, a priori, false still remain, as, in most cases, do the advantages of presenting parameter estimates with confidence limits.