


Education Review/Reseñas Educativas/Resenhas Educativas

Bracey, Gerald W. (2006). Reading Educational Research: How to Avoid Getting Statistically Snookered. Portsmouth, NH: Heinemann.

188 + 20 pp.
ISBN 0-325-00858-2

Reviewed by Darrell L. Sabers
University of Arizona

May 29, 2006

Reading Educational Research: How to Avoid Getting Statistically Snookered is the latest book by Gerald Bracey. He delivers what is expected by those of us who read his work, with many examples of how others lie with statistics and interpret data to fit their agendas. The book is focused on 32 “Principles of Data Interpretation” that should provide many readers with a set of instructions to examine and understand reports and claims about various educational issues. The 32 principles are as follows:

Principles of Data Interpretation

  1. Do the arithmetic.
  2. Show me the data!
  3. Look for and beware of selectivity in the data.
  4. When comparing groups, make sure the groups are comparable.
  5. Be sure the rhetoric and the numbers match.
  6. Beware of convenient claims that, whatever the calamity, public schools are to blame.
  7. Beware of simple explanations for complex phenomena.
  8. Make certain you know what statistic is being used when someone is talking about the “average.”
  9. Be aware of whether you are dealing with rates or numbers. Similarly, be aware of whether you are dealing with rates or scores.
  10. When comparing either rates or scores over time, make sure the groups remain comparable as the years go by.
  11. Be aware of whether you are dealing with ranks or scores.
  12. Watch out for Simpson’s paradox.
  13. Do not confuse statistical significance and practical significance.
  14. Make no causal inferences from correlation coefficients.
  15. Any two variables can be correlated. The resultant correlation coefficient might or might not be meaningful.
  16. Learn to “see through” graphs to determine what information they actually contain.
  17. Make certain that any test aligned with a standard comprehensively tests the material called for by the standard.
  18. On a norm-referenced test, nationally, 50 percent of students are below average, by definition.
  19. A norm-referenced standardized achievement test must test only material that all children have had an opportunity to learn.
  20. Standardized norm-referenced tests will ignore and obscure anything that is unique about a school.
  21. Scores from standardized tests are meaningful only to the extent that we know that all children have had a chance to learn the material which the test tests.
  22. Any attempt to set a passing score or a cut score on a test will be arbitrary. Ensure that it is arbitrary in the sense of arbitration, not in the sense of being capricious.
  23. If a situation really is as alleged, ask, “So what?”
  24. Achievement and ability tests differ mostly in what we know about how students learned the tested skills.
  25. Rising test scores do not necessarily mean rising achievement.
  26. The law of WYTIWYG applies: What you test is what you get.
  27. Any tests offered by a publisher should present adequate evidence of both reliability and validity.
  28. Make certain that descriptions of data do not include improper statements about the type of scale being used, for example, “The gain in math is twice as large as the gain in reading.”
  29. Do not use a test for a purpose other than the one it was designed for without taking care to ensure it is appropriate for the other purpose.
  30. Do not make important decisions about individuals or groups on the basis of a single test.
  31. In analyzing test results, make certain that no students were improperly excluded from the testing.
  32. In evaluating a testing program, look for negative or positive outcomes that are not part of the program. For example, are subjects not tested being neglected? Are scores on other tests showing gains or losses?

After a 20-page introduction to data-driven decision making and the abuse of data, the principles are incorporated into the text with examples and explanations. Many of these examples are based on reports of educational evaluations, and Bracey does not hesitate to include controversial topics with political agendas. Most of this book is very good reading, and it is not intended only for the reader who has no background in reading research reports.

The principles of data interpretation are, for the most part, very good points to remember when reading reports. I take issue with a few in terms of wording, and I have a problem with the organization. These problems are not important enough to relegate the book to the “wait for the revision” category, but pointing them out might help some readers focus on related issues among the principles. For example, #s 4 and 10 relate to comparing groups, and #s 9 and 11 relate to the types of scores being compared. One looking at these principles might have a difficult time understanding the order. Principle #18 is incorrect, as described below, and several could have been improved by more careful wording (#s 18, 19, 28). For example, with advanced placement tests we do not expect all children to have been exposed to all the desired curricula (#19, which orders words differently from #20), and in #28 the word “interpretations” would be better than “statements.” But this need for minor changes should not change the overall rating of excellent for the set as a whole and as a basis for the book.

The treatment of topics covered in the principles is mostly excellent, and reading the book should better prepare a reader to become a credible consumer of educational research. If the readers of most of what is published about testing used these principles, writers would be encouraged to prepare their reports more carefully.

There are many reasons to support the claim that this is a very good book. However, for this review, the invitation that Bracey presents on page 172 as he closes his book is an opportunity that must be accepted. He states, “I would be happy to hear from any reader suggestions for what a second edition of this book might add, delete, alter, or reorganize.” That is a challenge any reviewer should be eager to answer.

OK, Jerry, here are some comments that your fan Sabers (whoever he is) would like you to consider in your second edition (and hopefully the reader of this review will benefit from these suggestions as well).

When discussing principle #3 Bracey discusses salaries of teachers “on average”; however, principle #8 tells the reader to “make certain you know what statistic is being used when someone is talking about the ‘average’.” Now if a critical researcher like Bracey uses “average” without telling the reader what statistic is meant, why would the reader be expected to practice the necessary diligence to “make certain…?” This particular sloppy use of the term “average” also detracts from a thoughtful explanation of principle #3, “look for and beware of selectivity in the data.”

A list of variables is presented on pages 38-39, but two of those are “student surveys (of former as well as current students)” and “changes over time in all of the above variables.” Now neither of these is a variable, and that is not a good example for an author to use when teaching the reader to be a careful consumer of research.

One would hope that the error in reporting variables will not distract the reader from the sections that follow on dealing with rates, numbers, and scores, and on comparing rates or numbers over time (make sure the groups remain comparable as the years go by). This is an excellent part of the book, but I think it would have been improved if the introduction of Simpson’s Paradox had included real rather than hypothetical data. The real data used in the explanation that follows the introduction provide a very good example of the paradox (that is, subgroup trends differ from the aggregate group trend). It is understandable that the final example in that chapter uses hypothetical medical data, but an introductory example could easily be based on real data. A “principle” I find useful is to “present real data, for if you have to manufacture data to give an example of a problem, the reader may conclude the problem is rare and ignore it.” Students in statistics classes often complain that the manufactured data used in textbook examples are not very realistic.
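For a reader meeting Simpson’s Paradox for the first time, a small sketch may still help, even though it uses exactly the kind of manufactured numbers I just complained about; the counts below are invented purely for illustration and appear nowhere in the book.

    # Invented counts (passed, tested): each subgroup's passing rate rises from
    # Year 1 to Year 2, yet the combined rate falls because the lower-scoring
    # subgroup makes up a much larger share of the Year 2 test takers.
    groups = {
        "Subgroup A": {"year1": (240, 300), "year2": (85, 100)},
        "Subgroup B": {"year1": (40, 100),  "year2": (135, 300)},
    }

    def rate(passed, tested):
        return passed / tested

    for name, years in groups.items():
        print(name, f"{rate(*years['year1']):.0%} -> {rate(*years['year2']):.0%}")

    for year in ("year1", "year2"):
        passed = sum(g[year][0] for g in groups.values())
        tested = sum(g[year][1] for g in groups.values())
        print("Combined,", year, f"{rate(passed, tested):.0%}")
    # Subgroup A: 80% -> 85%, Subgroup B: 40% -> 45%, yet combined: 70% -> 55%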

On page 72 the text under the formula for effect size defines what the symbols for the means are, but ignores the standard deviations in the same formula. This omission might cause confusion for a learner. Also, on page 73 the reader is informed of material on page 159 that will enhance learning. I have found that many students do not take kindly to suggestions to skip ahead to future material, and in this case a better choice would be to explain how this concept differs from what has already been presented on pages 48-49. On those pages the concept of standard deviation was used in conjunction with the normal curve, and the reader might benefit from the recall of previous material.

Given the opportunity to consider the previous interpretation of the standard deviation, the reader could be informed that the effect size is just a measure of distance in standard deviation terms—that is, the standardized distance between the means of the experimental and control groups. The normal curve can then be used to explain the magnitude of that distance. One might call the present writing a missed opportunity for a Vygotsky learning moment (my students use the term scaffolding here). The twins data on page 76 could be used to provide another example of the effect size.
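To spell out that scaffolding in miniature: the effect size is nothing more than the difference between two means divided by a standard deviation, and the normal curve then translates that distance into a statement about overlap. The numbers below are invented for illustration and are not taken from the book.

    from statistics import NormalDist

    # Invented summary statistics for an experimental and a control group.
    mean_experimental = 54.0
    mean_control = 50.0
    sd = 10.0    # the control group's (or a pooled) standard deviation

    effect_size = (mean_experimental - mean_control) / sd    # 0.40

    # The normal curve gives the distance meaning: the average experimental
    # student scores higher than about this share of the control group.
    share_exceeded = NormalDist().cdf(effect_size)
    print(f"effect size = {effect_size:.2f}; exceeds about {share_exceeded:.0%} of controls")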

The Brown University example on page 78 is very misleading. The example starts with “Brown University, which could fill two freshman classes just with applicants with SAT verbal scores between 750 and 800 …appeared to give more weight to other factors. For instance, it admitted only one-third of those who scored between 750 and 800.” I think this is supposed to be an example of a school that gives less weight to the SAT; however, given that there were enough such high-scoring applicants to fill two freshman classes, at least half of them had to be rejected simply for lack of room in the freshman class. In addition, if an initial cutoff of, say, 650 on the verbal score had been applied, that one score might have been heavily weighted for those scoring below 650. Why not explain this case as fully as the rest of page 78 explains a selection process in algebra?
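A back-of-the-envelope calculation makes the point; the head count below is invented, since the book gives none, but any numbers consistent with “two freshman classes’ worth” of 750-800 scorers lead to the same conclusion.

    # If the 750-800 applicants alone could fill two freshman classes, capacity
    # by itself caps their admission rate at about one-half.
    seats_in_class = 1500                          # hypothetical size of one freshman class
    high_scoring_applicants = 2 * seats_in_class

    capacity_ceiling = seats_in_class / high_scoring_applicants   # 0.50
    reported_admit_rate = 1 / 3                                    # figure cited in the book

    print(f"capacity ceiling: {capacity_ceiling:.0%}, reported rate: {reported_admit_rate:.0%}")
    # Admitting one-third where only one-half is even possible says little about
    # how much weight the SAT received.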

On page 78 a reference is made to the “Pearson rank-order coefficient” with a comment that “the rank-order coefficient is an approximation of the product-moment correlation”. Actually, the Spearman rank-order coefficient (as it is more frequently called) is the actual product-moment correlation of the ranks rather than an approximation. One should not confuse the readers who were taught these concepts correctly.
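The equality is easy to verify. A minimal sketch (made-up scores, and assuming SciPy is installed) shows that the Spearman coefficient and the Pearson product-moment correlation computed on the ranks are the same number, not an approximation of one another.

    from scipy.stats import pearsonr, rankdata, spearmanr

    # Made-up paired scores, including one tie, for illustration only.
    x = [10, 20, 20, 35, 50, 61]
    y = [3, 9, 7, 15, 14, 30]

    rho_spearman, _ = spearmanr(x, y)
    rho_pearson_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

    print(rho_spearman, rho_pearson_on_ranks)   # identical apart from rounding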

In the explanation of the point-biserial coefficient on page 83 the reader might learn “to correlate the chances that a person will get the item right with the person’s total test score.” Now that description might be correct for the estimate of the point-biserial given by item response theory applications, but it is not correct for calculating the actual point-biserial coefficient. The correct description would be the correlation between the item score (zero or one) and the person’s total score. Perhaps this situation is an example of where one should be aware of whether one is dealing with estimates or scores, and that could be tied to the earlier caution about comparing ranks or rates with scores. Yes, I am picky, but I have learned from Bracey (and others) that I should be careful what I write. And besides, on page 122 he does get it correct, but the reader may not go back and correct that initial misleading definition.
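To make the correct description concrete, here is a minimal sketch with invented responses: score the item 0 or 1 for each examinee and compute an ordinary Pearson correlation with the total scores.

    from statistics import correlation   # Pearson product-moment (Python 3.10+)

    # Invented data: right (1) or wrong (0) on one item, and total test scores.
    item_score = [1, 0, 1, 1, 0, 1, 0, 1]
    total_score = [38, 22, 41, 35, 19, 44, 27, 30]

    point_biserial = correlation(item_score, total_score)
    print(f"point-biserial = {point_biserial:.3f}")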

I wonder whether the graphs on pages 98-99 are really stem-and-leaf graphs. That name does not seem to fit what I learned from Tukey’s use of the stem-and-leaf method or what I get when I use a computer program to present data with the stem-and-leaf approach. However, I think the presentation is very clear and the caution presented on page 99 is more than enough to excuse the use of the term. Whatever those graphs are, they are effective. A problem might be that these examples are included in a section titled “Other Ways of Graphing Badly” and the reader might assume there is something wrong with their use.

In the section on “The Nature of Standardized Tests” there is a list of what constitutes a standardized test:

  • the questions are the same for everyone
  • the format of the questions is the same for everyone
  • the instructions given to the students are the same for everyone
  • the time allotted to take the test is the same for everyone
  • the tests are usually administered to a group of students
  • the items on the test have known statistical properties, especially in terms of proportion of test takers who get each item right (pp. 119-120).

Now the above might be Bracey’s wish list, but the points do not correspond to what the reader will find in the tests currently described as standardized. With computer-adaptive testing, used for some tests including the GRE, different items are presented to examinees depending on the scoring of items presented previously. As I write this paragraph and come to points 2 and 3, I am looking at an announcement for a conference on Accommodating Students with Disabilities. If accommodations are permissible, are the format, instructions, and time going to be the same for all students? Is the performance of a student who is allowed to type written responses comparable with the performance of one who writes by hand? What if the first student used a word processing program like Word? Are instructions the same for computer-adaptive and group administrations of the same test? For the fifth point, why are individually administered tests not standardized? There is nothing about standardizing tests that differentiates between group and individual administration. The term ‘usually’ may make this point accurate enough, but the idea is misleading. The last point has to do with norms, not standardization. The administration procedure of the test is standardized, but only the data resulting from actual testing can provide information about the statistical properties of the items and the test.

To be fair in commenting on the above list, I will admit that Bracey addresses some of these points on page 120, but that does not make the list worth presenting. He presents a great story on pp. 120-121 and relates that story to face validity well, but that does not justify the list. In a book intended to encourage one to be a critical reader of testing reports, there is no place for such carelessness.

On page 121 is a list of examples of norm-referenced tests (NRTs) that includes the California Achievement Test and the Comprehensive Tests of Basic Skills but not the TERRA NOVA (TN) that one finds when visiting that company’s web site. A reader may wonder whether TN is omitted because its use in ‘non-standard’ tests such as the Arizona Instrument to Measure Standards makes it something other than an NRT. I give credit in this example for referring to the Stanford Achievement Test as SAT10 to avoid confusion with the SAT. Bracey should have noted that although the term NRT is used widely, it is actually the interpretation, not the test, that is norm-referenced.

On page 122 Bracey does describe ‘norming’ but does not differentiate that activity from standardization. His statement on page 123 that “nationally, 50 percent of the students will always be below average” is suspect given the prevalence of dated norms. And if the scores are reported as stanines, only 40% of the norm group’s scores are below average. More on this is presented in the book, but not with this principle. It is true that approximately 50% of the scores in the norming group fell below the midpoint called ‘average’, but that is not true of the reported scores (recall the Lake Wobegon effect). The statement with the ‘always’ is more incorrect than principle 18 itself, which might be corrected by using the word “were” rather than “are” if the statistic used as the average were reported correctly.
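The 40 percent figure follows directly from how stanines are defined: roughly 4, 7, 12, and 17 percent of a normal norm group fall in stanines 1 through 4, which are the bands reported as below average, while the middle 20 percent (stanine 5) is reported simply as average. A quick check:

    # Approximate percentages of a normal norm group in stanines 1 through 9.
    stanine_percents = [4, 7, 12, 17, 20, 17, 12, 7, 4]

    below_average = sum(stanine_percents[:4])    # stanines 1-4 -> 40
    average_band = stanine_percents[4]           # stanine 5    -> 20
    above_average = sum(stanine_percents[5:])    # stanines 6-9 -> 40

    print(below_average, average_band, above_average)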

On page 126 Bracey suggests that local assessments are not as useful as NRTs because they are not backed up by statistics like p values or point-biserials. At this point the reader has been told that p represents the level of significance (back on page 71) but has not yet been told what the p value means in this case; on page 139 it is defined as the probability of getting the item correct rather than the actual observed percentage of a group that did so. Another problem with the suggestion that national statistics are so important is that national statistics rarely correspond with local statistics, which might provide a better basis for comparison, and national statistics are greatly overrated when applied to local curriculum assessments. He might have covered the more important limitations of the national statistics rather than touting their importance. I return to this issue when discussing validity.
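The two uses of “p” are easy to keep apart with a toy example: in item analysis the p value is simply the observed proportion of a group that answered the item correctly, which has nothing to do with a significance level. The responses below are invented.

    # Item p value: the share of an (invented) group of ten examinees who got
    # the item right. This is a descriptive proportion, not a significance level.
    responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]    # 1 = correct, 0 = incorrect
    p_value = sum(responses) / len(responses)
    print(f"item p value = {p_value:.2f}")         # 0.70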

A section on Criterion-Referenced Tests (CRTs) follows the section on Construction of a Standardized Test. How misleading is that organization? His opening, that NRTs are not much in favor these days and have given way to CRTs, is followed by a description of how today’s CRTs are nothing like what the developers of that type of testing had in mind. This good description is followed by an excellent discussion of setting passing scores and of standards-referenced tests. I suggest that his sections on passing scores be disseminated widely. I would hope that in a revision he explains that NRT, CRT, or standards-based tests can all be reported with reference to cut scores. This treatment should not be limited to passing scores, as many programs report several levels of cut scores to designate several levels of performers (e.g., failing, approaching standards, meeting standards, and exceeding standards).

Now before anyone decides to disseminate the material described in the previous paragraph, why not add a note that William Angoff did not create the procedure called the “Angoff” (or, more probably, the “modified Angoff”) procedure? Angoff was careful to give credit where credit is due (Wainer, 2006, p. 81), and the rest of us should follow that example. Angoff credited Ledyard Tucker. I suspect that neither Angoff nor Tucker would approve of many current test misuses where the methods are mentioned. While on the point of names, the coefficient alpha mentioned on page 145 is not “Cronbach’s alpha” (Cronbach, 2004).

On page 145 it should have been mentioned that reliability is a property of the test scores, not of the test (Thompson & Vacha-Haase, 2000). Although obvious to anyone who has considered the effect of a restricted range on reliability coefficients, the misconception that reliability is a property of a test is still prevalent in the literature. This book should not perpetuate common misconceptions. And reliability is not restricted to stability, as the treatment on page 145 suggests. The treatment of validity on that page surprisingly does not mention the face validity that was introduced on pages 120-121. Any treatment of validity should mention that the data presented by the publisher probably do not represent what will be found in local assessments, and thus validity data are often a property of the local situation rather than the national view addressed by the developer. Principle #27 could be examined from this perspective in addition to the fine treatment already included in this edition of the book.

Normal Curve Equivalents (NCEs) are given little praise in Bracey’s treatment (pp. 157-158), but the same limitations are also found in any other standard scores. After all, NCEs are just standard scores with a mean of 50 and a standard deviation of 21.06 (Bracey might ask, “Who determined the .06 part?”). But IQ or other standard scores have no more theoretical basis than NCEs. NCEs are often derived from national percentile ranks, so they are approximately “normalized” rather than linearly derived standard scores, but most tests come without a description of how the standard scores were derived, so this distinction can usually be ignored. Rather than dismissing any type of standard score, Bracey could have suggested that these scores are more nearly interval-level measures than percentile ranks are, and thus should always be preferred over percentiles when an average (mean) is calculated. The transformation from NCEs to percentile ranks is as easy as a transformation from any other standard score. Given that NCEs have more score points than other typically reported standard scores, they might be preferred for use in mathematical calculations because they suffer less from assigning students with different raw scores to the same standard score. This last point may itself be a problem, implying more precision than is warranted; however, I have never read a defense of grouping scores in which less misinterpretation of performance compensated for the loss of precision.
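For the curious reader, the .06 is not arbitrary: NCEs are defined so that NCEs of 1, 50, and 99 coincide with national percentile ranks of 1, 50, and 99, which forces the standard deviation to be 49 divided by the normal deviate at the 99th percentile, about 21.06. A minimal sketch of the conversion, assuming a normal norm group:

    from statistics import NormalDist

    # The NCE standard deviation: 49 / z(0.99), approximately 21.06.
    nce_sd = 49 / NormalDist().inv_cdf(0.99)

    def percentile_rank_to_nce(pr):
        # Convert a national percentile rank (1-99) to a Normal Curve Equivalent.
        return 50 + nce_sd * NormalDist().inv_cdf(pr / 100)

    for pr in (1, 10, 25, 50, 75, 90, 99):
        print(pr, round(percentile_rank_to_nce(pr), 1))
    # Percentile ranks 1, 50, and 99 come back as NCEs 1, 50, and 99.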

On page 160 Bracey reminds the reader that percentages of scores between standard deviation points on the normal curve apply only for normal, bell-shaped curves and not skewed distributions. But this warning confuses the issue of normal curves versus other symmetrical curves that are not skewed. The distinction should be between normal and non-normal, not between normal and skewed. Most observed score distributions that are not skewed are also not normal, frequently being too flat (platykurtic). The reader must know that the percentages between these points do not apply to non-normal distributions even if the distributions are symmetrical.
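A single comparison makes the warning concrete: a uniform distribution is perfectly symmetrical, yet the share of scores falling within one standard deviation of its mean is well below the normal curve’s familiar 68 percent.

    import math
    from statistics import NormalDist

    # Share of scores within one standard deviation of the mean.
    normal_share = NormalDist().cdf(1) - NormalDist().cdf(-1)   # about 68.3%

    # For a uniform distribution on [a, b], sd = (b - a) / sqrt(12), so the
    # interval mean +/- 1 sd covers 2/sqrt(12) of the range.
    uniform_share = 2 / math.sqrt(12)                            # about 57.7%

    print(f"normal: {normal_share:.1%}   uniform (flat, symmetric): {uniform_share:.1%}")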

On page 162 Bracey states that NRTs are designed to be insensitive to instruction. The correct statement is that the tests are designed to be insensitive to the type of instruction, but not to the effectiveness of instruction. The tests could not be used to measure learning (which is affected by effective instruction) if they were insensitive to the effectiveness of instruction. I agree with most of what he says regarding the misuse of tests for the purpose of evaluating teachers, but used properly, tests can be included as one dependent variable in a well-designed experiment to evaluate effective instruction. I will agree that such an experiment is rare, indeed.

The major section of the book is entitled “Testing: A Major Source of Data—and Maybe Child Abuse”. He gives so many examples of improper interpretation of test results that I cannot argue with the title. I just hope that in future editions, and perhaps in his future writing in general, he will focus a bit on these comments rather than perpetuate some of the misunderstandings that are prevalent in the literature today. I did not include every suggestion I have for improving the book; thus I do not want to hear from anyone about what I missed. However, I would respond to comments about my being incorrect on topics where I might have cited authorities rather than just pretending to be an expert.

And Jerry, whether you use any of this or choose not to, I will continue to look forward to reading your views on educational issues of the world.

References

Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391-418.

Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174-195.

Wainer, H. (2006). Book review: Phelps, Richard P. (Ed.), Defending Standardized Testing. Journal of Educational Measurement, 43, 77-84.

About the Reviewer

Darrell Sabers is Professor in the Department of Educational Psychology at the University of Arizona. His research specialty is applied psychometrics, especially focused on educational testing and research. Darrell teaches courses in educational measurement (introduction and advanced theory) and in research methods. His current research projects include test analysis for a test of early mathematics achievement, scale construction and validation in deaf education, and validity studies in intellectual assessment.

Copyright is retained by the first or sole author, who grants right of first publication to the Education Review.
