
Loveless, Tom (Ed.) (2007) Lessons Learned: What International Assessments Tell Us About Math Achievement. Washington, DC: Brookings Institution Press

Pp. 256. ISBN 978-0-8157-5333-9

Reviewed by Jean-Luc Patry
University of Salzburg

November 7, 2008

The results of international comparative studies of student achievement in different domains have been widely discussed in the scientific community as well as among the public interested in education and those responsible for education policy and school administration. However, these discussions were and still are conducted on a rather superficial level. This is particularly the case for the results of the recent TIMSS and PISA studies, which have gained particular visibility and have had significant political impact. The most sophisticated conclusions drawn from such rankings are usually discussions of the type, “What do countries with better rankings do that we do not do?” It is sometimes naively argued that if we did as they do, we would achieve as they achieve. But as Mullis and Martin (Chapter 2, p. 25) correctly note, there “is little evidence that isolated factors consistently lead to improvement and excellence in mathematics achievement.” On the other hand, the different international studies provide plenty of additional data that can lead to insight into factors contributing to achievement.

In November 2006, the Brookings Institution hosted an international conference focusing on such detailed analyses. This book is a collection of eight papers (plus an introduction by the editor) on factors affecting math achievement in different countries as identified through detailed analyses of the results of international comparative studies since 1961.

Loveless’s introductory chapter is a summary of the subsequent chapters from the perspective of the editor. One would wish for some integrative remarks going beyond the statement that “international assessments have much more to contribute than simple rankings of nations on math achievement” and a short list of what can be analyzed (p. 7). One would hope that conclusions of this kind would be presented in a final chapter, but this is not the case. Hence, the eight chapters are a series of stand-alone pieces of research with little more connection than the general topic outlined in the introduction.

Chapter two, by Mullis and Martin, covers the developments and improvements from the first studies in 1961 through the TIMSS Advanced studies planned for 2008. The focus is primarily on improvements in assessments and in comparative validity. The analysis is slightly more sophisticated than pure ranking comparisons: Variance and significance of differences are taken into account, as are benchmarks and improvements and declines over time; some factors influencing the achievement scores are included, such as per capita gross national income. Very few general statements can be found. One of the most important messages is that the success of the top-performing countries demonstrates that mathematics proficiency for a high percentage of students within a country is possible even if the benchmark is set quite high, and that almost all students within a country can reach the lower benchmarks (p. 20 ff.). Some of these general statements (e.g., on the relationship between native language and math achievement) are contradicted by other studies in the volume (e.g., Schütz, Chapter Seven).

While Mullis and Martin put most emphasis on the comparison among countries, Gustafsson, in the third chapter, studies causal influences in educational achievement through analysis of differences over time within countries. He claims that comparative studies do not allow easy causal inferences, and that the phenomena investigated are complex and the different countries are unique. In particular, he writes that “the international studies of educational achievement are not based upon an elaborated theoretical framework, which makes it difficult to apply the analytical methods developed for making causal inference from cross-sectional data” (p. 38). Hence his chapter has two main aims. “First, it discusses those methodological problems and pitfalls in the international studies that are of importance when the purpose is to make causal inferences. Second, it uses two concrete examples to show how a longitudinal approach to analysis of country-level trend data can avoid some of these methodological problems.” (p. 39)

The direction of causality in the type of data gathered in international studies cannot be known (the “endogeneity problem”). This, Gustafsson claims, can be controlled for by using longitudinal designs that make it possible to take into account the differences between students that existed before the treatment was applied. Further, not all variables that might have an influence are assessed (the “omitted variable problem”) – but in such complex contexts it is “virtually impossible to include all the relevant variables, even if strong theory is available to guide the selection” (p. 41), and such theory, as the author has noted earlier, is missing in these studies. Within-country longitudinal studies can control at least partially for these problems: (1) Aggregation may control for the endogeneity problem “because it seems not reasonable to expect that difference in use of homework, for example, between countries can be affected by endogeneity problems” (p. 44; the author also gives some hints at such a theory), and (2) longitudinal studies within countries permit holding constant, at least to some degree, the parameters that might be relevant. The author tries to demonstrate his point with two studies. While the argumentation is not always convincing, it is noteworthy that he makes the effort to argue in a very differentiated way and tries to rule out plausible rival hypotheses (Campbell, 1969).

The approach proposed by Gustafsson seems quite useful, particularly taking into account the author’s claim that it is not a panacea but must be integrated into a systematic framework comprising different approaches. This concept, which can be seen as a kind of “critical multiplism” (Cook, 1985), seems much more appropriate than the traditional “one-shot approach,” and much humbler than the frequent claims to have found the “truth” with one single study. But most importantly, and once again, he emphasizes the value of theory-driven research.

In the fourth chapter, Schmidt and Houang look at the “lack of focus in the mathematics curriculum” and ask whether this is a symptom or a cause. Again, the question of the direction of causal influence is addressed. “Lack of focus” here means a large number of topics in a curriculum (p. 66). The question is whether lack of focus, which according to the authors characterizes the US mathematics curriculum, is “a root cause related to achievement differences” (p. 67).

Using a method similar to Gustafsson’s – aggregation at the level of countries – the authors relate math achievement to what they call “focus” (or the lack thereof: the number of topics intended to be covered by the country; p. 71) and what they call coherence. Coherence is defined as “increase in depth, sophistication, and complexity across the grades” (p. 66) of the topics in the curriculum, presumably in keeping with the abilities of the students, i.e., mathematical topics are neither introduced too soon nor dealt with too late (cf. p. 70). Coherence is operationalized as the correspondence of the topics covered by the curriculum of a given grade in the country under consideration with the topics covered by the curriculum for that grade in three or more of the six “top achieving countries (…) that were the highest performers at eighth grade” (p. 74). This operationalization of “coherence” seems problematic; there is no evidence of its validity, for instance through testing whether other operationalizations of similar plausibility yield (or do not yield) similar results (this would be another application of critical multiplism in the sense of Cook, 1985). Another reason for caution regarding the relationship between coherence and mathematics achievement is that coherence operationalized this way is not independent of achievement, since it is defined as a kind of agreement with the curricula of the best-achieving countries. The authors do not mention this problem.

The reporting of the results is inconsistent. Focus and coherence were highly correlated. On one hand, focus and coherence, taken separately, were not related to country-level achievement (p. 74); taken together, however, the regression analyses reached significance (R² between .18 and .26). Nevertheless, the relationships of focus and coherence with achievement are discussed separately (p. 77). The interplay between these two elements would require much more analysis and discussion than is presented in this chapter; for this reason, as well as because of the problems with coherence discussed above, the authors’ conclusion (“Covering too many topics does have a negative impact on student learning even when controlling for coherence”, p. 79) seems very tentative.
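The statistical pattern at issue here is easy to reproduce. The following sketch (Python, with invented data; the variable names and figures are illustrative assumptions, not the chapter’s) shows how two highly correlated predictors can each be nearly unrelated to an outcome and yet jointly reach an R² in the range reported:

```python
import numpy as np

# A hedged, simulated sketch (invented data, not the chapter's): two highly
# correlated predictors, "focus" and "coherence", each nearly unrelated to the
# outcome on its own, yet jointly accounting for a sizeable share of variance.
rng = np.random.default_rng(0)
n = 30                                               # roughly the number of countries
focus = rng.normal(size=n)
coherence = focus + rng.normal(scale=0.3, size=n)    # correlates ~.95 with focus
# Achievement depends only on the small *difference* between the two:
achievement = (focus - coherence) + rng.normal(scale=0.5, size=n)

for name, x in [("focus alone", focus), ("coherence alone", coherence)]:
    r = np.corrcoef(x, achievement)[0, 1]
    print(f"{name}: r^2 = {r**2:.2f}")               # both near zero

X = np.column_stack([np.ones(n), focus, coherence])  # joint regression
beta, *_ = np.linalg.lstsq(X, achievement, rcond=None)
resid = achievement - X @ beta
r2 = 1 - resid.var() / achievement.var()
print(f"joint model: R^2 = {r2:.2f}")                # clearly larger
```

In such a constellation, interpreting the two predictors separately, as the chapter does, is exactly what cannot be justified without further analysis.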

Kilpatrick, Mesa and Sloane deal with US algebra performance in an international context (Chapter 5). They address a related problem: whether the algebra curriculum in the United States is adequate. Their approach is interesting: First, they classified the mathematics items according to their content (patterns, informal algebra, equations, inequalities, functions, algebraic reasoning, algebraic manipulation), their use of representations (numerical, verbal, graphical, symbolic, and pictorial), both in the presentation of the item and in the required response (e.g., an item contains a verbal presentation and requires a numerical answer), and their cognitive demand (along the two dimensions “rich versus lean content” and “open versus constrained processes”). Second, they looked at the items on which the students performed particularly well, both absolutely (more than 75% success) and relatively (the US among the two best nations of comparably developed countries), and at the items on which the US students performed poorly, again both absolutely (less than 25% success) and relatively (performing worst among these countries). This comparison should reveal in which domains the US algebra curriculum performs well and where there is need for improvement.

The classification of the items is quite sophisticated, but the validity of the classification procedure is questionable. Moreover, the comparison with other countries is problematic: Relative performance is reduced to whether the US is the best or second-best country (high performance) or the last (low performance) among the chosen countries. This entails a considerable loss of information (from ratio to nominal scale level). For instance, on item L13, the US is the lowest achieving country with 93% correct (SE: 0.8); the next lowest country, Hungary, also has a success rate of 93% (SE: 1.3). The difference might be due solely to random error (“US students’ performance is not markedly below that of other nations”, p. 118). Further: Why “first or second” at the top end of the list, but only “last” at the bottom? Hence, the comparison of countries is very crude.
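A back-of-the-envelope calculation makes the point concrete. The sketch below (Python; a simple z-statistic for the difference of two independent estimates, using the figures quoted above) shows that the gap between the two countries is statistically indistinguishable from zero:

```python
from math import sqrt

# A rough check of the point about item L13 (figures from p. 118): the US and
# Hungary both have 93% correct, with standard errors of 0.8 and 1.3
# percentage points, respectively.
p_us, se_us = 93.0, 0.8
p_hu, se_hu = 93.0, 1.3

diff = p_us - p_hu
se_diff = sqrt(se_us**2 + se_hu**2)   # SE of a difference between independent samples
z = diff / se_diff
print(f"difference = {diff:.1f}, SE = {se_diff:.2f}, z = {z:.2f}")
# z = 0.00: calling the US "last" here conveys no statistically reliable
# distinction; the nominal ranking discards the underlying interval information.
```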

The conclusion of Kilpatrick, Mesa and Sloane’s analysis is that US students do well with regard to algebraic manipulations but not with respect to the explicit production, description, or representation of relationships. If the analysis were done with more sophisticated methods, the results might well be different. However, the conclusions make sense insofar as the results are linked with teaching practices reported in other studies. The question remains, however, whether the TIMSS items represent what really should be learned. According to the authors, the students’ ability to learn algebra has been underestimated in the US: The US is a country “in which too many people have assumed for too long that most students cannot learn, use, or value algebra” (p. 123). This statement brings to mind a saying attributed to Abraham Lincoln: “You can fool some of the people all of the time, and all of the people some of the time, but you can't fool all of the people all of the time" (quoted from Hertz, 1941, p. 138). I do not want to imply that Kilpatrick et al. accuse anyone of wanting to fool anybody; however, perhaps TIMSS and PISA do so, inadvertently.

While the previous chapters deal with issues of the curriculum, Chapter 6 (“What can TIMSS surveys tell us about mathematics reforms in the United States during the 1990s”, by Hamilton and Martínez) addresses the question of the appropriate teaching method: traditional vs. constructivist. The focus is on exploring the utility of the TIMSS survey data for understanding teachers’ use of reform-oriented instructional practices and the relationship of teaching method with math achievement. The authors used TIMSS data sets from the US, Japan, the Netherlands, and Singapore; these countries were chosen because of their high achievement levels and to represent a wide variety of classroom contexts in which to examine the use of reform-oriented instructional practices.

The study illustrates well the problems encountered when the items in the surveys change from year to year: Such changes render the interpretation of comparisons across time very difficult if not impossible. Apparently, not all changes in the instrument were improvements. Further influences can stem from effects of the pre-test (cf. Campbell & Stanley, 1963): Differences in teacher responses between 1995 and 2003 might have been caused by discussions of the first TIMSS reports (“math wars”, p. 144; influence of social desirability, Edwards, 1957), not by changes in practice. This issue is discussed (pp. 153 ff.), as are several other sources of artifacts: the authors put much emphasis on trying to control for other systematic errors as well, test many plausible rival hypotheses, and are very careful about causal interpretations. The study also shows the problem of using the same surveys in different countries with different languages and cultures. The sometimes surprising inconsistencies (e.g., in the correlations) may reflect not only different relationships but also differences in the interpretation of the items, as is repeatedly mentioned.

One of the conclusions is: “Combined with the other studies reviewed earlier, the results presented here suggest that investing resources in promoting the kinds of instructional practices examined in this set of studies is unlikely to lead to detectable changes in student achievement” (p. 153). I vigorously disagree with this conclusion. There are too many methodological flaws (as acknowledged by the authors) to make such a strong statement, even when taking into account the “other studies reviewed.” The results do not mean that the analysis of the effects of constructivist teaching should be abandoned. The statement needs to be reframed. In particular, the results of this study need to be compared with other studies using other methods. The study outcomes permit a critique of survey studies, not of reform math! On p. 158, however, there is a much more modest conclusion which can be fully endorsed: “[T]he answer to the question posed in the title of this paper – What can TIMSS tell us about mathematics reforms of the 1990s – could be summed up as ‘not very much’.”

The ambitions of Schütz, in the seventh chapter on school size and student achievement in TIMSS 2003, are much more modest: to analyze the relationship between school size and mathematics achievement in 51 countries or regions through regression analyses while controlling for as many variables as possible; a second set of analyses looks at differences in these relationships between high- and low-SES students and between students who do and do not speak at home the language in which the test is given. While the aim of this chapter seems much simpler than those of the previous chapters, the author still encounters plenty of problems, which she tries to overcome as best she can. In principle, the problems are the same as discussed above and will not be repeated here.

Unlike most studies, this one examines not only linear relationships but also U-shaped and inverted-U-shaped ones. Linear relationships between school size and math achievement are of the type “the larger the school, the better the achievement” or “the smaller, the better”; U-shaped relationships mean “small or large schools yield high achievement, medium-sized schools yield low achievement,” and inverted-U-shaped relationships mean “not too small and not too large is best.” There are several quite convincing theoretical reasons that speak in favor of the latter type of relationship.
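For readers unfamiliar with how an “optimal size” is extracted from such a model, the following sketch (Python, with invented coefficients; not Schütz’s actual estimates) shows the standard vertex calculation for an inverted-U quadratic fit:

```python
# A minimal sketch of how an "optimal school size" falls out of an inverted-U
# (quadratic) regression of achievement on size: fit y = a*size^2 + b*size + c
# and, if a < 0, the maximum lies at size* = -b / (2a). The coefficients below
# are invented for illustration; they are not taken from Schütz's chapter.
a, b, c = -2e-5, 0.026, 480.0    # hypothetical fitted coefficients
optimal_size = -b / (2 * a)
print(f"optimal school size: {optimal_size:.0f} students")   # 650 with these values
```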

The conclusions, once again, are disillusioning for those who hoped for clear-cut results. From the first analysis the author concludes “that the shape of the relationship between student performance and school size differs widely across the countries” (p. 191): All four types of relationship can be found (a U-shaped relationship in only one country, Singapore, with the lowest achievement in schools of about 864 students). In countries with inverted-U-shaped relationships, the optimal school sizes were very different: In Bahrain, students in schools of about 652 students perform best, while in Lebanon schools three times as large are appropriate, since the optimum school size is reported to be 1,970 students (the largest school in Bahrain had only 1,034 students). The results of the analyses by native language and SES are inconsistent as well. However, there are so many methodological problems that I do not want to report the results here. Two problematic issues should be mentioned. First, the general approach has much the character of a “fishing” expedition. From many independent analyses (51 countries or regions), the author takes the significant results and interprets them. The consequence is an "alpha inflation," also known as the “Bonferroni problem.” There are possible corrections for this problem; the Bonferroni correction, for instance, reduces the alpha level of each test to achieve a predetermined significance level for the entire collection of tests; with five studies tested independently, for example, instead of testing at .05 one should test at .01 to achieve an overall level of significance of .05. The author does the opposite: For her, the coefficients in the regression equations are significant at p < .10. If one were to perform the Bonferroni correction, many of the significant results would likely turn out not to be significant. For instance, at least three of the five inverted-U-shaped relationships would then not be significant. Second, instead of arguing at least implicitly with the significance level, it would have been much more appropriate to argue with effect sizes, as in the previous chapter. The result would probably have been that the variance in math achievement accounted for by school size is very small, thus rendering the results of the study even less relevant than they already seem (and are judged to be by the author).
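A short calculation illustrates how severe the problem is at the author’s chosen threshold. The sketch below (Python; treating the analyses as 51 independent tests is a simplifying assumption of mine) contrasts the Bonferroni per-test threshold with the family-wise error rate implied by testing each country at p < .10:

```python
# A minimal illustration of the alpha-inflation ("Bonferroni") problem raised
# above. With m independent tests each run at level alpha, the chance of at
# least one spurious "significant" result grows rapidly; the Bonferroni
# correction tests each at alpha/m to keep the family-wise error rate near alpha.
m = 51           # one analysis per country or region, as in Schütz's chapter
alpha = 0.05     # desired family-wise significance level

print(f"Bonferroni per-test threshold: {alpha / m:.4f}")   # ~.0010, far below p < .10

fwer = 1 - (1 - 0.10) ** m   # family-wise error rate when each test uses p < .10
print(f"P(at least one false positive at p < .10): {fwer:.3f}")   # ~.995
```

Under these simplifying assumptions, spurious “significant” relationships are all but guaranteed somewhere among the 51 analyses, which is precisely the fishing-expedition worry.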

In their chapter “Examining educational technology and achievement through latent variable modeling,” Papanastasiou and Paparistodemou explore the impact of computer use on mathematics achievement, controlling for the students’ educational background, in two countries with high and two with low prevalence of computers in families. The results of the analyses using structural equation modeling indicate good fit of the model in all four countries and fairly similar path coefficients across countries, with few exceptions (e.g., frequent calculator use is positively related to mathematics achievement in the US but negatively in the other countries). The authors conclude that the “results of the study are not especially encouraging with regard to the overall relationship between technology use and achievement” (p. 219); where there is a relationship, it tends to be negative (more computer use in educational contexts is related to lower mathematics scores).

The lack of a relationship between educational technology use and math achievement is not surprising. As the authors write, “to be effective, educational technology has to be used appropriately” (p. 221) and “although technology might work exceptionally well when it is used by trained educators, it might not work as well in situations where teachers do not adopt the requisite didactical ideas for its application” (p. 221). The authors demonstrate this with the results on calculator use. The predominant policy in Cyprus is that calculator use is seen as an impediment to learning mathematics; a negative relationship between frequent calculator use and math achievement may be the consequence. In the US, in contrast, the use of calculators has been urged “to reduce time spent on drill and practice activities and encourage the focus on problems that foster development of underlying mathematics concepts” (p. 222); this may explain the positive yet moderate relationship of calculator use with math achievement scores. This kind of argumentation – formulated preferably not post hoc but as an a priori hypothesis – might be more appropriate for analyses of factors influencing math achievement than the survey results alone. But the problems mentioned with regard to the Hamilton and Martínez chapter, where such an attempt was made, remain important.

The last chapter, by Hutchinson and Schagen, is entitled “Comparisons between PISA and TIMSS – are we the man with two watches?” The title refers to an anonymous saying: “A man with one watch always knows what time it is; a man with two watches is never quite sure.” After a careful comparison (which I will not analyze in detail because it is well done and the conclusions are appropriate), the authors claim that it would have been a better investment of money not to perform a study series like PISA that parallels the TIMSS series. I agree; however, since both series of studies have been performed, and both continue, there is little use in complaining about this fact. Instead, it makes sense to capitalize as much as possible on the data that have been collected.

Hutchinson and Schagen give a tentative list of issues on which the two series are similar and another list of differences. This provides a framework for planning replication studies. For all the studies discussed above, replications would have been necessary. The TIMSS and PISA data can be used for conceptual replication studies. By “conceptual replication” I mean a replication study in which there is an original study (e.g., TIMSS) and a replication study (PISA); in the latter, some variables that are not of theoretical importance differ from the original study or are operationalized differently. For some of the studies presented in this volume – e.g., on curriculum, on school size, or on educational technology – this could readily be done. In other cases (e.g., Hamilton and Martínez) the possibility of such a replication depends on whether corresponding data on teaching practices are available. A replication of the Kilpatrick et al. study with PISA data would be of a different kind: A theoretically important variable (item characteristics) would be altered, and one would check whether the results of the analogous analysis are as theoretically expected based on the item differences. Given the different general item structure in PISA, such a replication would require item analyses similar to those done with the TIMSS items.

Discussion

“The cloud of data generated becomes a canvas on to which the committed can project what they want to see.” (Smither, quoted from Hutchinson and Schagen, p. 255) This is true for those who do not want to do deeper analyses. The studies reported in this volume attempt to penetrate the cloudiness of the data. They take much care in doing so, particularly in methodological regards, less so with respect to theory. What remains after this effort? Not much. This conclusion is valid not only for the study by Hamilton and Martínez (p. 158) but for all the studies.

The problems with international comparative student achievement studies per se are well known and will not be discussed here. The studies presented here have additional methodological problems, such as the following:

  • Since TIMSS is not an experimental study series, it is not possible to make statements about causal effects. There may be relationships among variables, but the assumption that one of them is a cause and the other is an effect needs to be supported by theoretical reasons (see the discussion by Gustafsson).
  • All studies in this volume use the TIMSS data to address questions for which TIMSS was not designed. In consequence, the authors have to make do with the data that are available. The construct validity of the assessments can be questioned in most studies; there are no data that could be used to test criterion validity; the best that could be done would be to assess content validity, but none of the authors does so.
  • In some studies, country variables are constructed from other sources (e.g., curriculum, Schmidt & Houang; teaching methods, Hamilton and Martínez; computer use, Papanastasiou & Paparistodemou), and high- and low-achieving countries are compared (e.g., Kilpatrick et al.). However, these are average values, and the heterogeneity is not taken into account (e.g., intended versus implemented versus attained curriculum; how the technology is used). The variance within countries might be much larger than the variance between countries.
  • The relationships are very complex and require complex analysis techniques taking into account many variables. Not all methods used in the studies are appropriate in this regard.
  • Even if complex methods like structural equation modeling are used (e.g., Papanastasiou and Paparistodemou), the relationships, effect sizes, and variances accounted for are modest, to say the least. This is due, among other things, to the heterogeneity mentioned above. The relationships may become statistically significant because of the huge number of participants, but their relevance is questionable.
  • Comparisons between countries are subject to the hermeneutic problem: The subjects’ interpretation of statements (e.g., from the surveys) will differ from country to country, not only because of translation problems, but also because of cultural differences; if this is not taken into account, the conclusions are not valid. “For educational data to be useful in an international arena, it is not enough for them to be of high quality from a national perspective, the data also must be highly comparable from country to country” (Mullis & Martin, p. 15). TIMSS has emphasized this comparative validity for math achievement, but this is not the case for the other variables that are used in these studies.
  • “[I]f one wants to measure change, then one should not change the measure” (Hutchinson and Schagen, p. 256).

Many of the statements and “lessons learned” seem rather trivial. It is not surprising that success depends on economic resources, on teachers being well prepared, on education of the parents, on the language the students are most fluent in, and the like. However, the obviousness argument is problematic, as Lazarsfeld has shown in “The American Soldier” (1949). Nevertheless, the argument that the results are trivial must be taken seriously by the researchers and the public. There are several methodological approaches that can be used to refute this argument.

In particular, it would be necessary to establish a list of hypotheses (trivial and non-trivial ones) before doing the analysis, and then to test whether these hypotheses are confirmed. This is the strategy Lazarsfeld (1949) followed to demonstrate his point: He presented a set of obvious statements (“trivial” hypotheses) and then showed that all of them are refuted by the data he had gathered. What is missing in the “lessons learned,” as well as in the discussions about the international assessments in general, are statements of hypotheses that could be refuted. Of course, given the huge amount of data, it is tempting to look for positive evidence. The hindsight bias (Hawkins & Hastie, 1990) may play an important role here; according to this bias, lay people as well as scientists, after being told about research results, tend to say that they would have predicted them had they been asked beforehand.

TIMSS and PISA are huge endeavors. This is not the place to question whether they should have been done. They have certainly generated a great deal of knowledge, and they have triggered education debates in many countries that were sorely needed. These are data sets that one can capitalize on. This should not be done, however, in the simplistic way in which many politicians argue in favor of their preferred project or against their opponents’ concepts. The studies in this volume show that even with highly sophisticated methods, conclusions are very difficult to draw from these data sets. I think one can do more. The appropriate approach, I believe, is the one repeatedly mentioned above: critical multiplism (Cook, 1985). What I have said above about replications points in the same direction. We should capitalize on the different watches that are available, to take up the title of Hutchinson and Schagen – not only TIMSS and PISA, but also several cohorts within TIMSS and within PISA, several age levels, etc. We must in particular avoid fishing expeditions (looking for significant effects and trying to explain them post hoc). Critical multiplism also means cross-validation of conclusions from TIMSS and PISA with other studies, including experimental studies.

The modest outcomes of the studies reported in this volume should not lead to the conclusion that all these efforts are useless. But it would be inappropriate to hope for simple statements. Education systems are not simple, and appropriate studies of these systems should not be simple. The studies reported in this volume are a first step in this direction. They were not very successful, but they were not a waste of time. The efforts in this direction should be continued and enhanced.

References

Campbell, D. T. (1969). Prospective: Artifact and control. In R. Rosenthal & R. Rosnow (Eds.), Artifact in behavioral research (pp. 351-382). New York: Academic Press.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171-246). Chicago: Rand McNally.

Cook, T. D. (1985). Postpositivist critical multiplism. In R. L. Shotland & M. M. Mark (Eds.), Social science and social policy (pp. 458-499). Beverly Hills, CA: Sage.

Edwards, A. L. (1957). The social desirability variable in personality assessment and research. New York: Dryden.

Hawkins, S. A., & Hastie, R. (1990). Hindsight: Biased judgments of past events after the outcomes are known. Psychological Bulletin, 107(3), 311-327.

Hertz, E. (1941). Lincoln talks. A biography in anecdote. New York: Halcyon House.

Lazarsfeld, P. (1949). The American soldier - an expository review. The Public Opinion Quarterly, 13, 378-404.

About the Reviewer

Jean-Luc Patry is professor of education at the Department of Educational Research and Sociology of the University of Salzburg (Austria). He received his diploma in natural science (1972) and his doctoral degree (1976) from the Swiss Federal Institute of Technology in Zurich; his dissertation dealt with the visual perception of geometric forms. He completed his habilitation in "research in education" at the University of Fribourg, Switzerland, in 1991.

His main research activities have focused on situation specificity of human actions, on values in education, on methodological questions such as evaluation theory, field research, critical multiplism, on the relationship between theory and practice, on meta-theoretical questions of educational research, on questions of professional responsibility, on constructivism in education, etc. He has conducted several research projects in these fields.

From 1972 through 1975, he worked at the Institute for Behavioral Science at the Swiss Federal Institute of Technology (Zurich, Switzerland), and from 1975 through 1993 at the Pedagogical Institute of the University of Fribourg (Switzerland). From 1982 through 1984, he was a visiting scholar at the universities of Stanford, Lehigh, and Salzburg. He was vice president of the Swiss Educational Research Association and editor of the journal Bildungsforschung und Bildungspraxis/Education et Recherche. Since 1993, he has been a Full Professor of Education at the University of Salzburg.
