A large-scale corpus for assessing text readability

Abstract - Reading is an essential skill for academic success. It is important to support students facing literacy challenges by selecting texts whose difficulty is appropriate for their reading abilities. Providing students with texts that are accessible and well matched to their abilities helps ensure that students better understand the text and, over time, can help readers improve their reading skills (Mesmer, 2008; Stanovich, 1985). Readability formulas, which provide an overview of text difficulty, have shown promise in matching students more accurately to texts at appropriate difficulty levels, allowing students to read texts at target readability levels.

Most educational texts are calibrated using traditional readability formulas like Flesch–Kincaid Grade Level (FKGL; Kincaid et al., 1975) or commercially available formulas such as Lexile (Smith et al., 1989) or the Advantage-TASA Open Standard (ATOS; School Renaissance Institute, 2000). However, both types of readability formulas are problematic. Traditional readability formulas lack construct and theoretical validity because they are based on weak proxies of word decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence) and ignore many text features that are important components of reading models, including text cohesion and semantics. Additionally, many traditional readability formulas were normed using readers from specific age groups on small corpora of texts taken from specific domains. Commercially available readability formulas are not publicly available, may not have been tested rigorously for their reliability, and may be cost-prohibitive for many schools and districts, let alone individual teachers.
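To make the reliance on these proxies concrete, the standard published form of the FKGL formula (Kincaid et al., 1975) combines only sentence length and syllables per word:

FKGL = 0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59

Because its only inputs are word, sentence, and syllable counts, such a formula cannot distinguish between texts that differ in vocabulary meaning, cohesion, or discourse structure.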

In this paper, we introduce the open-source CommonLit Ease of Readability (CLEAR) corpus. The corpus is a collaboration between CommonLit, a non-profit education technology organization focused on improving reading, writing, communication, and problem-solving skills, and Georgia State University (GSU), with the end goal of promoting the development of more advanced, open-source readability formulas that government, state, and local agencies can use in testing, materials selection, material creation, and other applications commonly reserved for readability formulas. The formulas derived from the CLEAR corpus will be open-source and, ideally, based on more advanced natural language processing (NLP) features that better reflect our understanding of the reading process.

The accessibility of these formulas and their reliability should lead to greater uptake by students, teachers, parents, researchers, and others, increasing opportunities for meaningful and deliberate reading experiences. We outline the importance of text readability along with concerns about previous readability formulas below. In addition, we present two studies that examine the reliability of the CLEAR corpus by discussing the methods used to develop the corpus, examining how well traditional and newer readability formulas correlate with the reading criteria reported in the CLEAR corpus, and developing a new readability formula to assess how individual features in CLEAR excerpts are predictive of CLEAR reading criteria.

Text readability can be defined as the ease with which a text can be read (i.e., processed) and understood in terms of the linguistic features found in that text (Dale & Chall, 1948; Richards et al., 1992). However, in practice, most research into text readability focuses on measuring text understanding (i.e., comprehension; Kate et al., 2010) rather than the speed at which a text is read (i.e., text processing). Text comprehension is generally associated with the contents of the text, including word sophistication, syntactic complexity, and discourse structures (Just & Carpenter, 1980; Snow, 2002), all of which relate to text complexity. Text comprehension is also a function of a reader’s reading proficiency and background knowledge of the text (McNamara et al., 1996). However, for the purposes of this study, we focus only on text features.