5 Things that Matter More than Statistical Significance
Over the last few weeks, many of us across Canada have been looking at two relatively new sets of survey data. One is the American College Health Association’s National College Health Assessment (ACHA-NCHA), the other the Canadian Graduate and Professional Student Survey. In both cases, we are looking at the data and comparing our results to the last time the survey was done and to other similar institutions. Where are we improving? Where are we lagging? And when these comparisons are made, one question keeps coming up: are the results (differences) significant?
“Significant” is a word used all the time in the discussion of data, and I think it’s often misunderstood. First, it’s important to note that the full phrase should be “statistically significant”. This isn’t just semantics. When we hear the word “significant”, we think important or meaningful, but statistically significant really only means that the difference between two numbers is unlikely to be due to chance alone. If you compare your data from 2016 and 2013 and get a p-value of .05, this is conventionally treated as a significant difference: if nothing had really changed, there would be about a 1 in 20 chance of finding a difference that big or bigger just by chance. A more textbook definition might be “The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect)” (Gelman, 2015).
Confused? I am. And that’s okay because I want to argue there are at least 5, and possibly 40, things more important than statistical significance.
1. How relevant are your questions?
The surveys and assessments we use profess to measure “Health” or “The Graduate Student Experience”. But neither of these is a concrete thing we can measure directly. Height, weight, age, income, distance, and time are easier to define and assess. So in measuring “health” or the “graduate student experience”, we create sets of questions that existing research, experts, and our own experience suggest are parts of that thing.
Developing these questions is not easy.
Think back to grade school when you had tests on fractions. If the test only focused on addition and subtraction of fractions, then it’s not fair to say that the test covered all important aspects of fractions. It would be missing two really important pieces—division and multiplication.
Similarly, consider the ways that students could respond to your questions. Are they measuring frequency (how often/how many times), quality (bad, okay, excellent), or agreement (strongly disagree to strongly agree)? More importantly, do these responses make sense given the questions that were set up? Do we care more about whether a student has used a program or service, or whether they are aware it exists (or would feel comfortable seeking it out)?
2. Significantly different than what or whom?
Maybe our questions and the responses are good approximations of what it is that we are trying to learn more about—but are we making comparisons that make sense? For example, I am looking at the ACHA data right now, and two things are different about our institution’s sample: it is a good size with about 5,000 responses (which makes finding statistically significant results much easier)—and, more specifically, it has a lot more graduate students in 2016 than it did in 2013. These two facts change a lot of things, and although there are approaches that can account for this, it makes strict comparisons of 2013 and 2016 data challenging. Alternatively, consider what it means to compare your data to a national dataset or subset. A quick look at the data and I can see that the ways UofT students identify ethno-culturally are very different from other large research-intensive universities in Canada.
3. As significantly different as we expected?
There are several hundred questions in these surveys. Even with small samples, somewhere in them you will find significant differences by chance alone. But before we pat ourselves on the back, or set off campus-wide panic, it is important to reflect back on what has changed over the past three years. Did your institution make a major push on increasing awareness of available support for mental health? Was there a focus on responsible alcohol consumption? If so, then it is more reasonable to expect a difference. But if you notice a big change in the percentage of students who are eating the recommended number of fruits and vegetables, and there was no major programmatic or policy initiative to educate on these matters, then we have significant results without a good understanding of why.
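To put rough numbers on the "several hundred questions" point, here is a minimal sketch in Python. It assumes a hypothetical 300 questions, the conventional .05 cut-off, and (unrealistically) that every question is tested independently of the others. The function names are mine, for illustration only, and the Bonferroni correction shown at the end is just one common and fairly conservative way to tighten the cut-off, not a recommendation specific to these surveys.

```python
def expected_false_positives(n_questions: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' results if nothing truly changed."""
    return n_questions * alpha


def chance_of_at_least_one(n_questions: int, alpha: float = 0.05) -> float:
    """Probability that at least one test comes up significant by chance.

    Assumes the tests are independent, which survey questions rarely are.
    """
    return 1 - (1 - alpha) ** n_questions


def bonferroni_cutoff(n_questions: int, alpha: float = 0.05) -> float:
    """A stricter per-question cut-off (the Bonferroni correction)."""
    return alpha / n_questions


N_QUESTIONS = 300  # hypothetical; stands in for "several hundred"

print(f"Expected chance findings: {expected_false_positives(N_QUESTIONS):.0f}")
print(f"Chance of at least one:   {chance_of_at_least_one(N_QUESTIONS):.4f}")
print(f"Bonferroni cut-off:       {bonferroni_cutoff(N_QUESTIONS):.5f}")
```

Under these assumptions, about 15 of the 300 questions would cross the .05 line even if nothing at all had changed on campus, which is why an unexpected "significant" result deserves a second look before anyone celebrates or panics.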
4. How big is the difference?
A challenge at UofT is that we have big samples, which makes finding significant differences easier. But the fact that there is a difference should not be the focus; aren’t we more interested in how much something has improved or regressed? Look at the ACHA survey and we see several hundred questions asked in different response formats. To look across a number of questions and see where the largest differences are, we need to calculate something else: an effect size. Effect sizes put differences on a common scale, so we can compare them between questions even when the questions are asked in different ways. They allow us to look through all the questions in a survey and say these are the 10 that moved the most in a positive or negative direction. The good people at the National Survey of Student Engagement have written a lot, and quite thoughtfully, on the topic of effect size.
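Here is a small sketch of how sample size and effect size pull apart. It uses made-up percentages (40% in 2013, 43% in 2016), not actual survey results, and compares them with a standard pooled two-proportion z-test at two hypothetical sample sizes. For the effect size it uses Cohen's h, which is one common choice for proportions; that is my choice for illustration, not necessarily what NSSE or anyone else uses for these surveys.

```python
import math


def two_proportion_p_value(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: an effect size for the gap between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Made-up percentages for illustration only (not actual survey results).
P_2013, P_2016 = 0.40, 0.43

# Same 3-point change, two hypothetical numbers of respondents per year.
p_small = two_proportion_p_value(P_2016, P_2013, 200, 200)
p_large = two_proportion_p_value(P_2016, P_2013, 5000, 5000)
h = cohens_h(P_2016, P_2013)  # does not depend on sample size

print(f"p-value with 200 per year:  {p_small:.3f}")
print(f"p-value with 5000 per year: {p_large:.4f}")
print(f"Cohen's h (either case):    {h:.3f}")
```

The same 3-point change is not statistically significant with 200 respondents per year but is with 5,000, while Cohen's h is identical in both cases and sits well below the 0.2 that Cohen's rule of thumb calls "small". The p-value tells you mostly about the sample; the effect size tells you about the change.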
5. Can we do anything about it?
Imagine our questions are good, our answers are good, our samples are similar, and we made some major programming changes or there were institutional changes that changed our population. The real question then becomes: can the data be used to effect change? In other words, is the data fit for use? When using an external survey like this, we obviously don’t have control over how the questions are asked or the response options that go with them. Even if we find some significant differences, with moderate effect sizes, if the data isn’t actionable then its value is diminished.
Please do not take any of this as me saying that large-scale survey data is not very, very valuable. It is! Like any data set, assessment, or observation, it’s one piece of the puzzle and has certain strengths and limitations. It’s also important, when reviewing these large survey data sets, to complement them with existing assessment data, records of visits in your Health & Wellness Centre, the programming your staff has done, and the daily experiences of your team. Assessment is a cycle, and these surveys can be very useful for (re)starting the conversation about what you want to learn or know more about next.
is the Manager, Assessment & Analysis for Student Life at UofT and is finishing up his PhD in Higher Ed at OISE.
Assessment & You II: Coast to Coast is an ongoing series brought to you by Lesley D’Souza, featuring a number of experts and perspectives on assessment from across Canada and the US. Join us every month as they dive into the depths of assessment knowledge and practice, building a culture of assessment for Student Affairs in Canada.