Nan Z. Da’s study published in Critical Inquiry participates in an emerging trend across a number of disciplines that falls under the heading of “replication.” In this, her work follows major efforts in other fields, such as the Open Science Collaboration’s “reproducibility project,” which sought to replicate past studies in the field of psychology. As the OSC’s authors write, the value of replication, when done well, is that it can “increase certainty when findings are reproduced and promote innovation when they are not.”
And yet despite arriving at sweeping claims about an entire field, Da’s study fails to follow any of the procedures and practices established by projects like the OSC. While invoking the epistemological framework of replication (that is, the aim of proving or disproving the validity of individual articles as well as of an entire field), her practices instead follow the time-honoured traditions of selective reading from the field of literary criticism. Da’s work is ultimately valuable not because of the computational case it makes (that work still remains to be done), but because of the way it foregrounds so many of the problems that accompany traditional literary critical models when used to make large-scale evidentiary claims. The good news is that this article has made the problem of generalization, of how we combat the problem of selective reading, into a central issue facing the field.
Start with the evidence chosen. When undertaking their replication project, the OSC generated a sample of one hundred studies taken from three separate journals within a single year of publication to approximate a reasonable cross-section of the field. Da, on the other hand, chooses “a handful” of articles (fourteen by my count) from different years and different journals with no clear rationale for how these articles are meant to represent an entire field. The point is not the number chosen but that we have no way of knowing why these articles and not others were chosen, and thus whether her findings extend to any work beyond her sample. Indeed, the only linkage appears to be that these studies all “fail” by her criteria. Imagine if the OSC had found that 100 percent of articles sampled failed to replicate. Would we find their results credible? Da, by contrast, is surprisingly only ever right.
Da’s focus within articles exhibits an even stronger degree of nonrepresentativeness. In their replication project, the OSC establishes clearly defined criteria through which a study can be declared not to replicate, while also acknowledging the difficulty of arriving at this conclusion. Da, by contrast, applies different criteria to every article, making debatable choices, as well as outright errors, that are clearly designed to foreground differences. She misnames authors of articles, miscites editions, attributes arguments to the wrong book, and fails at some basic math. And yet each of these assertions always adds up to the same certain conclusion: failed to replicate. In Da’s hands, part is always a perfect representation of whole.
Perhaps the greatest limitation of Da’s piece is her extremely narrow (that is, nonrepresentative) definition of statistical inference and computational modeling. In Da’s view, the only appropriate way to use data is to perform what is known as significance testing, where we use a statistical model to test whether a given hypothesis is “true.” There is no room for exploratory data analysis, for theory building, or for predictive modeling in her view of the field. This is particularly ironic given that Da herself performs no such tests. She holds others to standards to which she herself is not accountable. Nor does she cite articles where authors explicitly undertake such tests, research that calls into question the value of such tests, or research that explores the relationship between word frequency and human judgments that she finds so problematic. The selectivity of Da’s work is deeply out of touch with the larger research landscape.
All of these practices highlight a more general problem that has for too long gone unexamined in the field of literary study. How are we to move reliably from individual observations to general beliefs about things in the world? Da’s article provides a tour de force of the problems of selective reading when it comes to generalizing about individual studies or entire fields. Addressing the problem of responsible and credible generalization will be one of the central challenges facing the field in the years to come. As with all other disciplines across the university, data and computational modeling will have an integral role to play in that process.
ANDREW PIPER is Professor and William Dawson Scholar in the Department of Languages, Literatures, and Cultures at McGill University. He is the author most recently of Enumerations: Data and Literary Study (2018).
Nan Z. Da, “The Computational Case Against Computational Literary Studies,” Critical Inquiry 45 (Spring 2019): 601-639. For an accessible introduction to what has become known as the replication crisis in the sciences, see Ed Yong, “Psychology’s Replication Crisis Can’t Be Wished Away,” The Atlantic, March 4, 2016.
Compare Da’s sweeping claims with the more modest ones made by the OSC in Science even given their considerably larger sample and far more rigorous effort at replication, reproduced here. For a discussion of the practice of replication, see Brian D. Earp and David Trafimow, “Replication, Falsification, and the Crisis of Confidence in Social Psychology,” Frontiers in Psychology May 19, 2015: doi.org/10.3389/fpsyg.2015.00621.
For a list, see Ben Schmidt, “A computational critique of a computational critique of a computational critique.” I provide more examples in the scholarly response here: Andrew Piper, “Do We Know What We Are Doing?” Journal of Cultural Analytics, April 1, 2019.
She cites Mark Algee-Hewitt as Mark Hewitt; cites G. Casella as the author of Introduction to Statistical Learning when it was Gareth James; cites me and Andrew Goldstone as co-authors in the Appendix when we were not; claims that “the most famous example of CLS forensic stylometry” was Hugh Craig and Arthur F. Kinney’s book, which she says advances a theory of Marlowe’s authorship of Shakespeare’s plays when it does not; and miscalculates the number of people it would take to read fifteen thousand novels in a year. The answer is 1,250, not 1,000 as she asserts. This statistic is also totally meaningless.
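The arithmetic in this note can be checked in a few lines. The figure of 1,250 is consistent with each reader finishing twelve novels a year (one a month); that reading rate is an assumption made here for illustration, since the note itself does not state it, and the variable names are likewise hypothetical.

```python
import math

NOVELS = 15_000
# Assumption (not stated in the note): each reader finishes one novel a month.
NOVELS_PER_READER_PER_YEAR = 12

# Readers needed to cover every novel within a single year.
readers_needed = math.ceil(NOVELS / NOVELS_PER_READER_PER_YEAR)

# How many novels 1,000 readers would get through at the same rate.
covered_by_1000 = 1_000 * NOVELS_PER_READER_PER_YEAR

print(readers_needed)
print(covered_by_1000)
```

Under that assumed rate, a thousand readers would fall short of fifteen thousand novels, which is the discrepancy the note points to.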
Statements like the following also suggest that she is far from a credible guide to even this aspect of statistics: “After all, statistics automatically assumes that 95 percent of the time there is no difference and that only 5 percent of the time there is a difference. That is what it means to look for p-value less than 0.05.” This is not what it means to look for a p-value less than 0.05. A p-value is the estimated probability of obtaining data at least as extreme as what we observed, assuming our null hypothesis is true. The smaller the p-value, the more unlikely it is to observe what we did if the null hypothesis is true. The aforementioned 5 percent threshold says nothing about how often there will be a “difference” (in other words, how often the null hypothesis is false). Instead, it says that if there is in fact no difference, we estimate that our data will mistakenly lead us to conclude that there is one 5 percent of the time. Nor does “statistics” “automatically” assume that .05 is the appropriate cut-off. It depends on the domain, the question, and the aims of modeling. These are gross oversimplifications.
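What the threshold does control can be illustrated with a small simulation. The sketch below is not drawn from any study discussed here. It assumes a two-sample z-test with known standard deviation, thirty observations per group, and an arbitrary effect size of 0.8, all chosen only for illustration. It repeatedly draws two samples and counts how often the test reports p < 0.05, first when there is truly no difference and then when there is one.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used to turn z-scores into p-values


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value of a two-sample z-test with known standard deviation."""
    standard_error = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (fmean(sample_a) - fmean(sample_b)) / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def rejection_rate(true_difference, trials=4000, n=30, alpha=0.05, seed=0):
    """Share of simulated experiments in which the test reports p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group_b = [rng.gauss(true_difference, 1.0) for _ in range(n)]
        if two_sided_p_value(group_a, group_b) < alpha:
            rejections += 1
    return rejections / trials


# No real difference: every rejection is a false positive.
null_rate = rejection_rate(true_difference=0.0)
# A real difference (illustrative effect size): rejections are correct detections.
effect_rate = rejection_rate(true_difference=0.8)

print(f"rejections when the null is true:  {null_rate:.3f}")
print(f"rejections when the null is false: {effect_rate:.3f}")
```

When the null hypothesis is true, the rejection rate should land close to 0.05, which is all the threshold fixes. How often a test detects a real difference depends on the effect size and the sample size, and the threshold alone says nothing about how often differences exist in the first place.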
For reflections on literary modeling, see Andrew Piper, “Think Small: On Literary Modeling,” PMLA 132.3 (2017): 651-658; Richard Jean So, “All Models Are Wrong,” PMLA 132.3 (2017); Ted Underwood, “Algorithmic Modeling: Or, Modeling Data We Do Not Yet Understand,” in The Shape of Data in Digital Humanities: Modeling Texts and Text-based Resources, eds. J. Flanders and F. Jannidis (New York: Routledge, 2018).
See Andrew Piper and Eva Portelance, “How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading,” Post-45 (2016); Eve Kraicer and Andrew Piper, “Social Characters: The Hierarchy of Gender in Contemporary English-Language Fiction,” Journal of Cultural Analytics, January 30, 2019. DOI: 10.31235/osf.io/4kwrg; and Andrew Piper, “Fictionality,” Journal of Cultural Analytics, Dec. 20, 2016. DOI: 10.31235/osf.io/93mdj.
The literature debating the value of significance testing is vast. See Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science 22, no. 11 (November 2011): 1359–66, doi:10.1177/0956797611417632.
See Rens Bod, Jennifer Hay, and Stefanie Jannedy, Probabilistic Linguistics (Cambridge, MA: MIT Press, 2003); Dan Jurafsky and James Martin, “Vector Semantics,” Speech and Language Processing, 3rd Edition (2018): https://web.stanford.edu/~jurafsky/slp3/6.pdf; for the relation of communication to information theory, M. W. Crocker, V. Demberg, and E. Teich, “Information Density and Linguistic Encoding,” Künstliche Intelligenz 30.1 (2016): 77-81, https://doi.org/10.1007/s13218-015-0391-y; and for the relation to language acquisition and learning, L. C. Erickson and E. D. Thiessen, “Statistical Learning of Language: Theory, Validity, and Predictions of a Statistical Learning Account of Language Acquisition,” Developmental Review 37 (2015): 66–108, doi:10.1016/j.dr.2015.05.002.