Is it substantive or just significant?
Over-reliance on statistical significance often misses the point
Significance vs. Substance
Why is it that we often have such a hard time gaining meaningful insights from our data? Sometimes results are presented as significant, only for the effect to fail to replicate. Other times our significant relationships don’t seem to really explain what’s happening. What I intend to present here is the case that we are misinterpreting the meaning of significance and failing to make judgments about the substance of our findings.
Statistical significance tests are something I think many misunderstand. They are only loosely related to the magnitude of a finding; instead they address a somewhat arcane notion: the probability of producing a result like ours if the process in question were run a large number of times and there were really nothing there. That repetition is not something that will ever actually happen, but it is what we test against to get our probability level. A typical significance test is run with a Type 1 error threshold of .05 (alpha = .05). We’re accepting that, by chance alone, we would see a result like the one in question no more than five times out of a hundred if we repeated our process some large number of times. Because that would be rare, we feel our finding is meaningful. I would argue that statistical significance has little to do with meaningfulness.
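If you want to see the mechanics, here is a minimal sketch in Python of a garden-variety significance test. The data are made up for illustration; nothing here comes from a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical samples; any difference between them is small by design.
group_a = rng.normal(loc=50.0, scale=10.0, size=200)
group_b = rng.normal(loc=51.0, scale=10.0, size=200)

# A standard two-sample t-test: the p-value is the probability of seeing a
# difference at least this large by chance if the groups truly did not differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # the Type 1 error threshold discussed above
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("statistically significant" if p_value < alpha else "not significant")
```

Note that nothing in that output says anything about how large or important the difference is; it only says how surprising it would be under chance alone.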
Let’s look at an example. I was recently reading a validation study for a personality measure which reported that the relationship between the measure in question and individual performance was r = .34. The result was statistically significant, and the authors went on to claim great predictive power for the tool. I have a different take. Statistical significance, the probability of the effect occurring by chance, is only part of the picture. We also need to know something about the magnitude, the substance, of the relationship.
One way of approaching this is to determine the proportion of the variance in performance that is explainable from the personality measure. To do this, statisticians square the correlation coefficient (statisticians love to square numbers). The result is something called the coefficient of determination, which in our example is .34 squared, or .1156. Our coefficient of determination tells us that the personality measure explains about 11.56% of the variance in performance. Now we are left to decide how much predictive value the personality measure really has. Forget about whether the result is statistically significant: it leaves 88.44% of the variance in performance unexplained. Does that make us comfortable making decisions that affect another human being’s life?
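The arithmetic is simple enough to sketch, using the r = .34 figure from the study:

```python
r = 0.34  # validity coefficient reported in the study

r_squared = r ** 2            # coefficient of determination
explained = r_squared * 100
unexplained = (1 - r_squared) * 100

print(f"Variance in performance explained:   {explained:.2f}%")    # 11.56%
print(f"Variance in performance unexplained: {unexplained:.2f}%")  # 88.44%
```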
To further tug at our decision, let’s consider another aspect of our correlation between the personality measure and performance. Almost as much as squaring things, statisticians like to talk about error, and rightfully so. Not error in the sense of making a mistake, but error as the component of a model that we cannot explain. In the case of correlation, one can calculate something called the standard error of estimate, which tells us roughly how far our predictions will typically fall from the actual values. It’s a great reminder that we are working with estimates, not unequivocal facts.
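Here is a quick sketch of that idea, using the common formula for the standard error of estimate and a purely illustrative standard deviation of 1 for the performance ratings (the actual spread in the study is not something I am reproducing here):

```python
import math

r = 0.34               # validity coefficient from the study
sd_performance = 1.0   # hypothetical standard deviation of the performance ratings

# Standard error of estimate: the typical distance between predicted and
# actual performance when predicting performance from the personality measure.
see = sd_performance * math.sqrt(1 - r ** 2)
print(f"Standard error of estimate: {see:.3f} (vs. an SD of {sd_performance})")
# With r = .34, predictions are only about 6% more precise than simply
# guessing the mean for everyone.
```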
The same issue of statistical significance versus substance plagues us in all areas of inquiry. Let’s take a further look at differences in mean (average) scores from some survey data. The data show a statistically significant difference between two groups on a particular survey attribute. For the moment, we’ll ignore that there is nothing random about the formation of the two groups. Group A’s mean equals 4.25; Group B’s equals 4.3. Significance again tells us that if we repeated the same exercise a large number of times, a difference like this would occur fewer than five times out of 100 by chance. But is the difference meaningful?
There are a number of ways to approach the issue of meaningful difference. One that we sometimes use is to consider the standard error of measurement, which is a function of the measure’s reliability and group variance. So we must first estimate the reliability of our measures. Most commonly, this is done by calculating a measure called Cronbach’s alpha for each of the components of our survey. This does assume that our measures are more than a single item. More on this in a subsequent piece on developing scales.
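For the curious, here is a rough sketch of both calculations on hypothetical item-level data; the simulated responses simply stand in for whatever your survey actually collected:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores for a single scale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses to a 4-item scale (rows = respondents).
rng = np.random.default_rng(0)
true_score = rng.normal(4.0, 0.6, size=(300, 1))
items = true_score + rng.normal(0, 0.5, size=(300, 4))

alpha = cronbach_alpha(items)
scale_scores = items.mean(axis=1)

# Standard error of measurement: scale SD scaled down by the unreliability.
sem = scale_scores.std(ddof=1) * np.sqrt(1 - alpha)

print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Standard error of measurement: {sem:.2f}")
```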
The standard error of measurement, like the standard error of estimate, lets us establish confidence intervals to use in interpreting the group differences. Considering one unit of the standard error of measurement (plus or minus) gets us to roughly a 68% confidence level. To get to roughly 95% and 99% confidence, we need to consider about two and three units of the standard error of measurement, respectively.
Let’s return to our example with means of 4.25 and 4.3, which tested as statistically significant. Assume we calculated our standard error of measurement to be .22. The difference between the two groups’ means is only .05. Even at the 68% confidence level, plus or minus one unit of the standard error of measurement, our group difference is much smaller than the error of measurement. Using this criterion, we don’t consider the difference meaningful.
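Putting the numbers from the example into a few lines of code makes the comparison concrete; the rule-of-thumb confidence bands are the same ones described above:

```python
mean_a, mean_b = 4.25, 4.30
sem = 0.22  # standard error of measurement from the example

difference = abs(mean_b - mean_a)  # 0.05

# Rule-of-thumb bands: +/-1 SEM ~ 68%, +/-2 SEM ~ 95%, +/-3 SEM ~ 99%.
for units, level in [(1, "68%"), (2, "95%"), (3, "99%")]:
    band = units * sem
    verdict = "meaningful" if difference > band else "within measurement error"
    print(f"{level} band (+/-{band:.2f}): difference of {difference:.2f} is {verdict}")
```

The difference never clears even the narrowest band, which is exactly why the statistically significant result fails to impress.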
Use of the standard error of measurement, of course, assumes we have reliability estimates for our measures — something we know we should have, but oftentimes ignore. Looking at survey results only from the perspective of statistical significance is a mistake. Interpreting results without knowing anything about the survey itself, far from gaining meaningful insights, may be completely misleading.
Bill Nolen, Ph.D. is Chief Scientist at LIS, a full-service human capital assessment firm.