So far, you have learned about:
Descriptive statistics
How to calculate effect sizes
Confidence intervals
Significance levels
In this chapter, you are going to learn about:
The relationship between power, effect size and probability levels
The factors influencing power
Issues surrounding the use of significance levels
Agenda
7.1 Pitfalls of NHST
7.2 Criterion Significance Levels
7.3 Effect Sizes
7.3.1 Cohen's d
7.3.2 Pearson's r
7.3.3 The odds ratio
7.3.4 Effect Sizes compared to NHST
7.4 Meta-Analysis
7.5 Bayesian Approaches
7.6 Power
7.7 Factors Influencing Power
7.8 GPower: Calculating Power
7.9 Confidence Intervals
Pitfalls of NHST
NHST offers a rule-based framework for deciding whether to believe a hypothesis
It seems to provide an easy way to disentangle the 'correct' conclusion from the 'incorrect' one
Meehl: "The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology."
Misconception #1: A significant result means that the effect is important
Misconception #2: A nonsignificant result means that the null hypothesis is true
Misconception #3: A significant result means that the null hypothesis is false
NHST encourages all-or-nothing thinking (e.g. if p < 0.05 then an effect is significant, but if p > 0.05 it is not)
How different is a p-value of 0.051 from one of 0.75? Should 0.049 and 0.00001 be thought of as equally significant when both are reported as p < 0.05?
Statistical significance should not be equated with practical importance
Statements reflecting different views of the antiSTATic data:
The evidence is equivocal; we need more research.
All the mean differences show a positive effect of antiSTATic; therefore, we have consistent evidence that antiSTATic works.
Four of the studies show a significant result (p < 0.05), but the other six do not. Therefore, the studies are inconclusive: some suggest that antiSTATic is better than placebo, but others suggest there's no difference. The fact that more than half of the studies showed no significant effect means that antiSTATic is not (on balance) more successful in reducing anxiety than the control
Looking at the confidence intervals rather than focusing on significance allows us to see the consistency in the data and not a bunch of apparently conflicting results
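To make the idea concrete, here is a minimal Python sketch (the summary statistics are invented, not the actual antiSTATic data) that computes a Welch-style 95% confidence interval for one study's mean difference; plotting such intervals for all the studies side by side would reveal how much they overlap.

```python
# Sketch: a 95% CI for a mean difference from summary statistics.
# The numbers below are hypothetical, not the actual antiSTATic data.
import math
from scipy import stats

def ci_mean_diff(m1, m2, s1, s2, n1, n2, level=0.95):
    """Welch-style confidence interval for the difference of two means."""
    diff = m1 - m2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
    t_crit = stats.t.ppf((1 + level) / 2, df)
    return diff - t_crit * se, diff + t_crit * se

# One hypothetical study: drug group mean 8.2, placebo group mean 10.0
print(ci_mean_diff(8.2, 10.0, s1=4.0, s2=4.0, n1=25, n2=25))
```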
The conclusions from NHST depend on what the researcher intended to do before collecting data
Significant findings are about seven times more likely to be published than non-significant ones, leading to publication bias
Researcher degrees of freedom - a scientist has many decisions to make when designing and analyzing a study, which could be misused (for example, excluding cases) to make a result significant
p-hacking
Practices that lead to the selective reporting of significant p-values, most commonly trying multiple analyses and reporting only the one that yields significant results
HARKing
Presenting a hypothesis that was made after data collection as though it were made before data collection
Ways to overcome the pitfalls of NHST
Effect Sizes
Meta-Analysis
Bayesian Estimation
Registration
Sense (applying common sense when interpreting findings)
Six Principles for Scientists Using NHST (Wasserstein & Lazar, 2016)
Incompatibility with the null hypothesis: a p-value indicates how incompatible the data are with the null hypothesis
Not a probability of truth: a p-value does not measure the probability that the hypothesis is true
Resist all-or-nothing thinking: conclusions should not be based only on whether p crosses a threshold
Don't p-hack: proper inference requires full reporting and transparency
Don't confuse statistical significance with practical importance: a p-value does not measure the size of an effect
A p-value does not equal evidence: by itself, it is not a good measure of the evidence for a model or hypothesis
Pre-registration - the practice of making all aspects of your research process publicly available before data collection begins
Registered report - a submission to an academic journal that outlines an intended research protocol
Transparency and Openness Promotion (TOP) guidelines
Citation standards
Pre-registration of study protocols
Pre-registration of analysis protocols
Transparency with data
Transparency with analysis scripts
Transparency with design and analysis plans
Transparency with research materials
Replication
Effect size
An objective and (usually) standardized measure of the magnitude of an observed effect
Cohen's d
A standardized effect size measure that expresses the difference between two means in terms of the pooled standard deviation
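As a minimal sketch of that calculation (the means, standard deviations and group sizes are hypothetical, chosen so the result matches the d = −0.667 example used later in this chapter):

```python
# Sketch: Cohen's d with a pooled standard deviation (other variants
# divide by the control group's SD instead). All numbers are hypothetical.
import math

def cohens_d(m1, m2, s1, s2, n1, n2):
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

print(cohens_d(m1=9, m2=11, s1=3, s2=3, n1=20, n2=20))  # -0.667
```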
Pearson's r
A standardized effect size measure that expresses the strength of the linear relationship between two variables
Odds ratio
An effect size measure for categorical variables that expresses the relative odds of an outcome occurring in one group compared to another
Effect sizes are not affected by sample size, but sample size does affect how closely the sample effect size matches that of the population (the precision of the estimate)
Interpretations based on effect sizes are more informative than those based solely on p-values
Two virtually identical means are deemed to be significantly different based on a p-value
Two experiments with identical means and standard deviations (but different sample sizes) yield identical conclusions when using an effect size to interpret them (both studies had d = −0.667)
Two virtually identical means are deemed to be not very different at all based on an effect size (d = −0.003, which is tiny)
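The sketch below illustrates the contrast, again with hypothetical summary statistics (means 9 and 11, SD 3, giving d = −0.667 regardless of sample size): the NHST verdict flips as the groups grow, but the effect size does not move.

```python
# Sketch: same means and SDs, different sample sizes. The p-value flips
# from nonsignificant to highly significant; d = -0.667 in both cases.
from scipy.stats import ttest_ind_from_stats

for n in (10, 100):
    t, p = ttest_ind_from_stats(mean1=9, std1=3, nobs1=n,
                                mean2=11, std2=3, nobs2=n)
    print(f"n = {n:3d} per group: p = {p:.4f}")
```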
Pearson's r
Effect size measure
r = 0.10 (small effect): the effect explains 1% of the total variance
r = 0.30 (medium effect): the effect explains 9% of the total variance
r = 0.50 (large effect): the effect explains 25% of the total variance
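For instance (with made-up data), scipy's pearsonr returns r directly, and squaring it gives the proportion of variance explained that the benchmarks above describe:

```python
# Sketch: Pearson's r and the variance it explains (r squared).
# The data points are invented for illustration.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.0, 3.2, 3.9, 4.1, 5.5, 6.8, 6.9]

r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, variance explained = {r**2:.1%}")
```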
Odds Ratio
Effect size for counts (frequencies) of categorical variables (e.g. yes/no outcomes)
The odds of a 'yes' response were 0.4 times as large for a singer as for someone who started a conversation
The odds of a 'yes' response were 2.5 times as large for a talker as for someone who sang
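The raw frequencies behind these ratios aren't given here, so the counts in this sketch are hypothetical, chosen to reproduce the 0.4 and 2.5 figures; the arithmetic is the point: odds = yes/no within a group, and the odds ratio divides one group's odds by the other's.

```python
# Sketch: odds and odds ratios from hypothetical 'yes'/'no' counts.
def odds(yes, no):
    return yes / no

singer_odds = odds(yes=10, no=25)   # 0.4
talker_odds = odds(yes=20, no=20)   # 1.0

print(singer_odds / talker_odds)    # 0.4 (singer vs conversation-starter)
print(talker_odds / singer_odds)    # 2.5 (conversation-starter vs singer)
```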
Effect sizes overcome many of the problems associated with NHST
Effect sizes are less affected than p-values by things like early or late termination of data collection, or sampling over a time period rather than until a set sample size is reached
There are still some researcher degrees of freedom (not related to sample size) that researchers could use to maximize (or minimize) effect sizes, but there is less incentive to do so because effect sizes are not tied to a decision rule in which effects either side of a certain threshold have qualitatively opposite interpretations