# P VALUE, A TRUE TEST OF STATISTICAL SIGNIFICANCE? A CAUTIONARY NOTE


### Introduction

Medical journals are replete with P values and tests of hypotheses. It is common practice among medical researchers to quote whether the test of hypothesis they carried out is significant or non-significant, and many researchers get very excited when they discover a "statistically significant" finding without really understanding what it means. Additionally, while medical journals are full of statements such as "statistically significant", "unlikely due to chance", "not significant", "due to chance", or notations such as "P > 0.05" and "P < 0.05", the decision on whether to declare a test of hypothesis significant or not on the basis of a P value has generated an intense debate among statisticians. It began among the founders of statistical inference more than 60 years ago 1-3. One contributing factor is that the medical literature shows a strong tendency to accentuate positive findings; many researchers prefer to report positive findings, on the grounds that "non-significant results should not take up" journal space 4-7.

The idea of significance testing was introduced by R.A. Fisher, but over the past six decades its utility, understanding and interpretation have been misunderstood, generating much scholarly writing aimed at remedying the situation 3. Alongside the statistical test of hypothesis is the P value, whose meaning and interpretation have similarly been misused. To delve into the subject matter, a short history of the evolution of the statistical test of hypothesis is warranted to clear up some of the misunderstanding.

### A Brief History of P Value and Significance Testing

Significance testing evolved from the ideas and practice of the eminent statistician R.A. Fisher in the 1930s. His idea is simple: suppose we found an association between poverty level and malnutrition among children under the age of five years. This is a finding, but could it be a chance finding? Or perhaps we want to evaluate whether a new nutrition therapy improves the nutritional status of malnourished children. We study a group of malnourished children treated with the new therapy and a comparable group treated with the old nutritional therapy, and find in the new therapy group an improvement in nutritional status of 2 units over the old therapy group. This finding will obviously be welcomed, but it is also possible that it is purely due to chance. Thus, Fisher saw the P value as an index measuring the strength of evidence against the null hypothesis (in our examples, the hypothesis that there is no association between poverty level and malnutrition, or that the new therapy does not improve nutritional status). To quantify the strength of evidence against the null hypothesis "he advocated P < 0.05 (5% significance) as a standard level for concluding that there is evidence against the hypothesis tested, though not as an absolute rule" 8. Fisher did not stop there but graded the strength of evidence against the null hypothesis. He proposed: "if P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05" 9. Since Fisher made this statement over 60 years ago, the 0.05 cut-off point has been used by medical researchers worldwide, and its use has become ritualistic, as if no other cut-off point could be used.
Through the 1960s it was standard practice in many fields to report P values with a star attached to indicate P < 0.05 and two stars to indicate P < 0.01; occasionally three stars were used to indicate P < 0.001. While Fisher developed this practice of quantifying the strength of evidence against the null hypothesis, some eminent statisticians were not comfortable with the subjective interpretation inherent in the method 7. This led Jerzy Neyman and Egon Pearson to propose a new approach, which they called "hypothesis tests". They argued that there were two types of error that could be made in interpreting the results of an experiment, as shown in Table 1.

Table 1. Errors associated with the results of an experiment.

| Result of experiment | Null hypothesis true | Null hypothesis false |
| --- | --- | --- |
| Reject null hypothesis | Type I error rate (α) | Correct decision (power = 1 − β) |
| Accept null hypothesis | Correct decision | Type II error rate (β) |

The outcome of the hypothesis test is one of two: reject one hypothesis and accept the other. Adopting this practice exposes one to two types of error: rejecting the null hypothesis when it should be accepted (i.e., concluding the two therapies differ when they are actually the same, also known as a false-positive result, a type I error or an alpha error) or accepting the null hypothesis when it should have been rejected (i.e., concluding that they are the same when in fact they differ, also known as a false-negative result, a type II error or a beta error).

### What Does P Value Mean?

The P value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The P stands for probability and measures how likely it is that any observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests no difference between the groups other than that due to chance. Thus, it is common in medical journals to see adjectives such as "highly significant" or "very significant" after a quoted P value, depending on how close to zero the value is.

Before the advent of computers and statistical software, researchers depended on tabulated values of P to make decisions. This practice is now obsolete and the use of the exact P value is much preferred. Statistical software can give the exact P value and allows appreciation of the full range of values that P can take between 0 and 1. Briefly, for example, the weights of 18 subjects were taken from a community to determine whether their body weight was ideal (i.e., 100 kg). Using Student's t test, t turned out to be 3.76 at 17 degrees of freedom. Comparing this with the tabulated values, t = 3.76 is more than the critical value of 2.11 at P = 0.05 and therefore falls in the rejection zone. Thus we reject the null hypothesis that μ = 100 and conclude that the difference is significant. But using SPSS (a statistical software package), the following information came out when the data were entered: t = 3.758, P = 0.0016, mean difference = 12.78, and 95% confidence interval 5.60 to 19.95. Methodologists are now increasingly recommending that researchers report the precise P value, for example P = 0.023 rather than P < 0.05 10. Further, to use P = 0.05 "is an anachronism. It was settled on when P values were hard to compute and so some specific values needed to be provided in tables. Now calculating exact P values is easy (i.e., the computer does it) and so the investigator can report (P = 0.04) and leave it to the reader to (determine its significance)" 11.
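As an illustration, the exact two-sided P value and 95% confidence interval quoted above can be recovered from the reported summary statistics alone. This is a minimal sketch, assuming SciPy is available; it uses only the t statistic, degrees of freedom and mean difference reported in the example.

```python
from scipy import stats

t_stat = 3.758      # reported t statistic
df = 17             # degrees of freedom (n = 18 subjects)
mean_diff = 12.78   # reported mean difference from the ideal 100 kg

# Exact two-sided P value: probability of a |t| this large or larger under H0
p_two_sided = 2 * stats.t.sf(t_stat, df)

# 95% CI: mean difference +/- critical t value times the standard error
se = mean_diff / t_stat                  # back out the standard error
t_crit = stats.t.ppf(0.975, df)          # critical value for a two-sided 5% level
ci_low, ci_high = mean_diff - t_crit * se, mean_diff + t_crit * se

print(round(p_two_sided, 4), round(ci_low, 2), round(ci_high, 2))
```

Run as written, this reproduces approximately the software output quoted above (P ≈ 0.0016, CI ≈ 5.60 to 19.95), illustrating why the exact P value is preferred over "P < 0.05".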

### Hypothesis Tests

A statistical test provides a mechanism for making quantitative decisions about a process or processes. The purpose is to make inferences about a population parameter by analyzing differences between an observed sample statistic and the results one expects to obtain if some underlying assumption is true. This comparison may be a single observed value versus some hypothesized quantity, or it may be between two or more related or unrelated groups. The choice of statistical test depends on the nature of the data and the study design.

Neyman and Pearson proposed this process to circumvent Fisher's subjective practice of assessing the strength of evidence against the null effect. In its usual form, two hypotheses are put forward: a null hypothesis (usually a statement of null effect) and an alternative hypothesis (usually the opposite of the null hypothesis). Based on the outcome of the hypothesis test, one hypothesis is rejected and the other accepted, using a predetermined benchmark significance level against which the P value is compared. However, one runs the risk of making an error: one may reject one hypothesis when in fact it should be accepted, and vice versa. A type I error or α error is made when one concludes there is a difference when really there is none, and a type II error or β error when one concludes there is no difference when actually there is one. In its simple form, testing a hypothesis involves the following steps:

Identify null and alternative hypotheses.

Determine the appropriate test statistic and its distribution under the assumption that the null hypothesis is true.

Specify the significance level and determine the corresponding critical value of the test statistic under the assumption that the null hypothesis is true.

Calculate the test statistic from the data.

Having discussed the P value and hypothesis testing, the fallacies of hypothesis testing and of the P value are now examined.
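Before turning to the fallacies, the four steps above can be sketched in code. This is an illustrative sketch only, assuming SciPy and using an invented data set; the one-sample t test stands in for whichever test the design calls for.

```python
from scipy import stats

# Hypothetical data: weights (kg) of a small sample.
# Step 1: H0: mu = 100 (ideal weight); H1: mu != 100.
weights = [104, 98, 112, 107, 95, 110, 103, 99, 108, 106]

# Step 2: under H0 the one-sample t statistic follows a t distribution
# with n - 1 degrees of freedom.
n = len(weights)
df = n - 1

# Step 3: choose the significance level and find the critical value.
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value

# Step 4: compute the test statistic (and the exact P value) from the data.
t_stat, p_value = stats.ttest_1samp(weights, popmean=100)

reject_h0 = abs(t_stat) > t_crit          # equivalent to p_value < alpha
print(round(t_stat, 3), round(p_value, 3), reject_h0)
```

Note that comparing the statistic with the critical value and comparing the exact P value with α always give the same decision; the exact P value simply carries more information.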

### Fallacies of Hypothesis Testing

In a paper I submitted for publication in one of the widely read medical journals in Nigeria, one of the reviewers commented on the age-sex distribution of the participants: "Is there any difference in sex distribution, subject to chi square statistics?" Statistically, this question does not convey any meaningful query, and this is one of many instances among medical researchers (and postgraduate supervisors alike) in which a test of hypothesis is quickly and spontaneously resorted to without due consideration of its appropriate application. The aim of my research was to determine the prevalence of diabetes mellitus in a rural community; it was not part of my objectives to determine any association between sex and the prevalence of diabetes mellitus. To the inexperienced, this comment will definitely prompt conducting a test of hypothesis simply to satisfy the editor and reviewer so that the article will sail through. However, the results of such statistical tests become difficult to understand and interpret in the light of the data. (The result of the study turned out to be that all those with elevated fasting blood glucose were female.) There are several fallacies associated with hypothesis testing. Below is a short list that will help in avoiding them.

Failure to reject the null hypothesis leads to its acceptance. (No. When you fail to reject the null hypothesis, it means there is insufficient evidence to reject it.)

The use of α = 0.05 is a standard with an objective basis. (No. α = 0.05 is merely a convention that evolved from the practice of R.A. Fisher. There is no sharp distinction between "significant" and "not significant" results, only increasingly strong evidence against the null hypothesis as P becomes smaller; P = 0.02 is stronger evidence than P = 0.04.)

A small P value indicates a large effect. (No. The P value does not tell us anything about the size of an effect.)

Statistical significance implies clinical importance. (No. Statistical significance says very little about the clinical importance of a relation. There is a big gulf between statistical significance and clinical significance. By statistical definition, at α = 0.05 it means that 1 in 20 comparisons in which the null hypothesis is true will result in P < 0.05!)

Finally, with these and many other fallacies of hypothesis testing, it is rather sad to read in journals how significance testing has become insignificance testing.

### Fallacies of P Value

Just as the test of hypothesis is associated with some fallacies, so also is the P value, and with common root causes: "It comes to be seen as natural that any finding worth its salt should have a P value less than 0.05 flashing like a divinely appointed stamp of approval" 12. The inherent subjectivity of Fisher's P value approach, and the subsequent poor understanding of this approach by the medical community, could be the reason why the P value is associated with a myriad of fallacies. The production of P values by researchers as mere "passports to publication" has aggravated the situation 13. We were earlier awakened to the inadequacy of the P value in clinical trials by Feinstein 14:

"The method of making statistical decisions about 'significance' creates one of the most devastating ironies in modern biologic science. To avoid usual categorical data, a critical investigator will usually go to enormous efforts in mensuration. He will get special machines and elaborate technologic devices to supplement his old categorical statement with new measurements of 'continuous' dimensional data. After all this work in getting 'continuous' data, however, and after calculating all the statistical tests of the data, the investigator then makes the final decision about his results on the basis of a completely arbitrary pair of dichotomous categories. These categories, which are called 'significant' and 'nonsignificant', are usually demarcated by a P value of either 0.05 or 0.01, chosen according to the capricious dictates of the statistician, the editor, the reviewer or the granting agency. If the level demanded for 'significant' is 0.05 or lower and the P value that emerges is 0.06, the investigator may be ready to discard a well-designed, excellently conducted, thoughtfully analyzed, and scientifically important experiment because it failed to cross the Procrustean boundary demanded for statistical approbation."

We should try to understand that Fisher wanted an index of measurement that would help him decide the strength of evidence against the null effect. But, as has been said earlier, his idea was poorly understood and criticized, which led Neyman and Pearson to develop hypothesis testing in order to get round the problem. The result of their attempt, however, is to "accept" or "reject" the null hypothesis, or alternatively to declare results "significant" or "non-significant". The inadequacy of the P value in decision making pervades all epidemiological study designs. This heads-or-tails approach to the test of hypothesis has pushed the stakeholders in the field (statisticians, editors, reviewers and granting agencies) into ever-increasing confusion and difficulty. The inadequacy of the P value as a sole standard of judgment in the analysis of clinical trials is an accepted fact among statisticians 15. Just as hypothesis testing is not devoid of caveats, so also P values. Some of these are set out below.

The threshold value P < 0.05 is arbitrary. As has been said earlier, it was the practice of Fisher to assign P the value of 0.05 as a measure of evidence against the null effect. One can make the "significance test" more stringent by moving to 0.01 (1%) or less stringent by moving the borderline to 0.10 (10%). In dichotomizing P values into "significant" and "non-significant" one loses information, in the same way as when demarcating laboratory findings into "normal" and "abnormal"; one may ask, what is the difference between a fasting blood glucose of 25 mmol/L and one of 15 mmol/L?

Statistically significant (P < 0.05) findings are assumed to result from real treatment effects, ignoring the fact that 1 in 20 comparisons of effects in which the null hypothesis is true will result in a significant finding (P < 0.05). This problem is more serious when several tests of hypothesis involving several variables are carried out without using the appropriate statistical test, e.g., ANOVA instead of repeated t tests.

A statistically significant result does not translate into clinical importance. A large study can detect a small, clinically unimportant finding.

Chance is rarely the most important issue. Remember that when conducting research, a questionnaire is usually administered to participants. This questionnaire in most instances collects a large amount of information on the several variables it contains. The manner in which the questions were asked, and the manner in which they were answered, are important sources of error (systematic error) which are difficult to measure.
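The 1-in-20 point above is easy to demonstrate by simulation. The sketch below, assuming NumPy and SciPy are available, repeatedly compares two groups drawn from the same population, so the null hypothesis is true by construction; roughly 5% of the comparisons nevertheless come out "significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed, purely for reproducibility
n_trials, n_per_group = 2000, 30

false_positives = 0
for _ in range(n_trials):
    # Both groups come from the same distribution: H0 is true by construction
    a = rng.normal(loc=0, scale=1, size=n_per_group)
    b = rng.normal(loc=0, scale=1, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives / n_trials)   # close to 0.05, as the theory predicts
```

This is exactly why running many unadjusted tests across the variables of a questionnaire will manufacture "significant" findings out of pure chance.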

### What Influences the P Value?

Generally, the following factors influence the P value.

Effect size. It is a usual research objective to detect a difference between two drugs, procedures or programmes. Several statistics are employed to measure the magnitude of the effect produced by these interventions; they include r², η², ω², R², Q², Cohen's d and Hedges' g. Two problems are encountered: the use of an appropriate index for measuring the effect and, secondly, the size of the effect. A 7 kg or 10 mmHg difference will have a lower P value (and is more likely to be significant) than a 2 kg or 4 mmHg difference.

Size of sample. The larger the sample, the more likely a difference is to be detected. Further, a 7 kg difference in a study with 500 participants in each group will give a lower P value than a 7 kg difference observed in a study involving 250 participants in each group.

Spread of the data. The spread of observations in a data set is commonly measured with the standard deviation. The bigger the standard deviation, the greater the spread of observations and the higher the P value.
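All three influences can be seen in one small sketch. The function below is illustrative only (a normal approximation to the two-sample comparison, assuming SciPy); the numbers for the difference and standard deviation are invented for the demonstration.

```python
from math import sqrt
from scipy import stats

def two_sample_p(diff, sd, n_per_group):
    """Approximate two-sided P value for a mean difference `diff` between
    two groups of size `n_per_group` with common standard deviation `sd`
    (normal approximation, purely for illustration)."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = diff / se
    return 2 * stats.norm.sf(abs(z))

baseline = two_sample_p(diff=7, sd=15, n_per_group=250)
print(two_sample_p(diff=2, sd=15, n_per_group=250) > baseline)  # smaller effect -> larger P
print(two_sample_p(diff=7, sd=15, n_per_group=500) < baseline)  # larger sample -> smaller P
print(two_sample_p(diff=7, sd=25, n_per_group=250) > baseline)  # more spread   -> larger P
```

Each comparison changes exactly one factor, showing that the P value reflects effect size, sample size and spread jointly, not the importance of the finding.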

### P Value and Statistical Significance: An Uncommon Ground

Neither the Fisherian nor the Neyman-Pearson (N-P) school upheld the practice of stating "P values of less than 0.05 were regarded as statistically significant" or "the P value was 0.02 and therefore there was a statistically significant difference." These and many similar statements have criss-crossed medical journals and standard textbooks of statistics, and provide an uncommon ground for marrying the two schools. This marriage of inconvenience has further deepened the confusion and misunderstanding of the Fisherian and Neyman-Pearson schools. The combination of Fisherian and N-P thought (as exemplified in the above statements) has not shed light on the correct interpretation of the statistical test of hypothesis and the P value. The hybrid of the two schools, as often read in medical journals and textbooks of statistics, makes it appear as if the two schools were, and are, compatible as a single coherent method of statistical inference 4, 23, 24. This confusion, perpetuated by medical journals, textbooks of statistics, reviewers and editors, has almost made it impossible for a research report to be published without statements or notations such as "statistically significant", "statistically insignificant" or "P < 0.05". Sterne then asked, "Can we get rid of P values?" His answer was "practical experience says no. Why?" 21

However, the next section, "P value and confidence interval: a common ground", provides one of the possible ways out of this seemingly insoluble problem. Goodman commented on the P value and confidence interval approach in statistical inference and its ability to solve the problem: "The few efforts to eliminate P values from journals in favor of confidence intervals have not generally been successful, indicating that the researchers' need for a measure of evidence remains strong and that they often feel lost without one" 6.

### P Value and Confidence Interval: A Common Ground

So far this paper has examined the historical evolution of 'significance' testing as initially proposed by R.A. Fisher. Neyman and Pearson were uncomfortable with his subjective approach and therefore proposed 'hypothesis testing', involving binary outcomes: "accept" or "reject" the null hypothesis. This, as we saw, did not "solve" the problem completely. Thus, a common ground was needed, and the combination of the P value and confidence intervals provided the much needed common ground.

Before proceeding, we should briefly understand what confidence intervals (CIs) mean, having gone through what P values and hypothesis testing mean. Suppose that we have two diets, A and B, given to two groups of malnourished children. An 8 kg increase in body weight was observed among children on diet A, while a 3 kg increase was observed on diet B. The effect on weight gain is therefore 5 kg on average. But it is obvious that the true effect might be less than 5 kg, or it might be more; a confidence interval represents this range of plausible values and the chance associated with it. Thus, a 95% confidence interval in this example means that if the study were repeated 100 times, 95 out of 100 times the CI would contain the true increase in weight. Formally, the 95% CI is "the interval computed from the sample data which, when the study is repeated multiple times, would contain the true effect 95% of the time."
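The "repeated study" interpretation can be checked directly by simulation. The sketch below assumes NumPy and SciPy and fixes a hypothetical true effect of 5 kg by construction: roughly 95% of the computed intervals should contain that true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, sd, n = 5.0, 4.0, 40   # hypothetical true weight-gain difference
n_studies = 1000

covered = 0
for _ in range(n_studies):
    sample = rng.normal(true_effect, sd, size=n)       # one simulated "study"
    se = sample.std(ddof=1) / np.sqrt(n)               # standard error of the mean
    t_crit = stats.t.ppf(0.975, n - 1)
    low, high = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    if low <= true_effect <= high:
        covered += 1

print(covered / n_studies)   # close to 0.95
```

Note what the 95% refers to: the procedure, not any single interval. Any one computed interval either contains the true effect or it does not.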

In the 1980s, a number of British statisticians tried to promote the use of this common-ground approach in presenting statistical analyses 16, 17, 18. They encouraged the combined presentation of the P value and confidence intervals. The use of confidence intervals in addressing hypothesis testing is one of the four popular methods for which journal editors and eminent statisticians have issued statements of support 19. In line with this, the American Psychological Association's Board of Scientific Affairs commissioned a white paper, the "Task Force on Statistical Inference". The Task Force suggested:

"When reporting inferential statistics (e.g. t tests, F tests, and chi-square) include information about the obtained … value of the test statistic, the degrees of freedom, the probability of obtaining a value as extreme as or more extreme than the one obtained [i.e., the P value] … Be sure to include sufficient descriptive statistics [e.g. per-cell sample size, means, correlations, standard deviations] … The reporting of confidence intervals [for estimates of parameters, for functions of parameters such as differences in means, and for effect sizes] can be an extremely effective way of reporting results … because confidence intervals combine information on location and precision and can often be directly used to infer significance levels" 20.

Jonathan Sterne and Davey Smith came up with suggested guidelines for reporting statistical analyses, as shown in the box 21:

Box 1: Suggested guidance for the reporting of results of statistical analyses in medical journals.

The description of differences as statistically significant is not acceptable.

Confidence intervals for the main results should always be included, but 90% rather than 95% levels should be used. Confidence intervals should not be used as a surrogate means of examining significance at the conventional 5% level. Interpretation of confidence intervals should focus on the implications (clinical importance) of the range of values in the interval.

When there is a meaningful null hypothesis, the strength of evidence against it should be indexed by the P value. The smaller the P value, the stronger the evidence.

While it is impossible to reduce substantially the amount of data dredging that is carried out, authors should take a very skeptical view of subgroup analyses in clinical trials and observational studies. The strength of the evidence for interaction (that effects really differ between subgroups) should always be presented. Claims made on the basis of subgroup findings should be even more tempered than claims made about main effects.

In observational studies it should be remembered that considerations of confounding and bias are at least as important as the issues discussed in this paper.

Since the 1980s, when British statisticians championed the use of confidence intervals, journal after journal has issued statements regarding their use. An editorial in Clinical Chemistry reads as follows:

"There is no question that a confidence interval for the difference between two true (i.e., population) means or proportions, based on the observed difference between sample estimates, provides more useful information than a P value, no matter how exact, for the probability that the true difference is zero. The confidence interval reflects the precision of the sample values in terms of their standard deviation and the sample size …" 22

On a final note, it is important to know why it is statistically superior to use the P value with confidence intervals rather than the P value with hypothesis testing:

Confidence intervals emphasize the importance of estimation over hypothesis testing. It is more informative to quote the magnitude of the effect size rather than adopting the significant/non-significant dichotomy of hypothesis testing.

The width of the CI provides a measure of the reliability or precision of the estimate.

Confidence intervals make it far easier to determine whether a finding has any substantive (e.g. clinical) importance, as opposed to mere statistical significance.

While statistical significance tests are vulnerable to type I error, CIs are not.

Confidence intervals can be used as a significance test. The simple rule is that if the 95% CI does not include the null value (usually zero for a difference in means or proportions; one for a relative risk or odds ratio), the null hypothesis is rejected at the 0.05 level.

Finally, the use of CIs promotes cumulative knowledge development by obligating researchers to think meta-analytically about estimation, replication and comparing intervals across studies 25. For example, a meta-analysis of trials of intravenous nitrates in acute myocardial infarction found a reduction in mortality of somewhere between one quarter and two thirds, whereas the six previous trials 26 had shown conflicting results: some suggested that giving intravenous nitrates was dangerous, while others suggested that it actually reduced mortality. For the six trials, the odds ratios, 95% CIs and P values were: OR = 0.33 (CI 0.09 to 1.13, P = 0.08); OR = 0.24 (CI 0.08 to 0.74, P = 0.01); OR = 0.83 (CI 0.33 to 2.12, P = 0.07); OR = 2.04 (CI 0.39 to 10.71, P = 0.04); OR = 0.58 (CI 0.19 to 1.65, P = 0.29); and OR = 0.48 (CI 0.28 to 0.82, P = 0.007). The confidence intervals of the first, third, fourth and fifth trials include the null value of 1 and so are inconclusive, while those of the second and sixth exclude 1 and indicate a benefit (a reduction in mortality).
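The CI-as-significance-test rule can be applied mechanically to the six trials quoted above: an odds ratio is "significant" at the 5% level exactly when its 95% CI excludes 1. A sketch in plain Python, using the published figures:

```python
# (odds ratio, 95% CI lower limit, 95% CI upper limit) for the six trials
trials = [
    (0.33, 0.09, 1.13),
    (0.24, 0.08, 0.74),
    (0.83, 0.33, 2.12),
    (2.04, 0.39, 10.71),
    (0.58, 0.19, 1.65),
    (0.48, 0.28, 0.82),
]

NULL_VALUE = 1.0   # null value for a ratio measure such as an odds ratio

for i, (odds_ratio, low, high) in enumerate(trials, start=1):
    significant = not (low <= NULL_VALUE <= high)   # CI excludes 1?
    print(f"Trial {i}: OR={odds_ratio}, significant={significant}")
```

Only the second and sixth trials have intervals excluding 1; the others are inconclusive rather than evidence of harm or benefit, which is precisely the extra information the interval carries over a bare P value.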

### What Is To Be Done?

While it is possible to make a change and improve the practice, as Cohen warns, "Don't look for a magic alternative … It does not exist" 27.

The foundation for change in this practice should be laid where the teaching of statistics begins: the classroom. The curriculum and classroom teaching should clearly differentiate between the two schools. Their historical evolution should be clearly explained, as should the meaning of "statistical significance". Classroom teaching of the correct concepts should begin at the undergraduate level and move up to graduate instruction, even if it means this teaching is at an introductory level.

We should promote and encourage the use of confidence intervals around sample statistics and effect sizes. This duty lies in the hands of statistics teachers, medical journal editors, reviewers and any granting agency.

Generally, researchers planning a study are encouraged to consult a statistician at the initial stage to avoid misinterpreting the P value, especially if they are using statistical software for their data analysis.