The Era of Big Data: A Call for More Nuanced Datasets
by M. Felicity Rogers-Chapman - April 04, 2014
In this era of big data analysis, there is a push to understand education on a large scale, yet existing datasets provide only a blurry picture of students and schools. This commentary argues that there is a need to develop twenty-first-century tools that can fully capture the nuanced population that constitutes the educational landscape of the United States.
Technology combined with concerted data collection efforts has launched education research into an era of big data analysis. This is an outstanding opportunity for educators, researchers, and policy makers to understand and respond to issues on a large scale. Big data analysis could, for example, link student mobility to achievement and other outcomes. Yet the limitations of existing datasets may lead to oversimplified conclusions. The growing interest in data mining and analytics makes the development of nuanced datasets a higher priority, particularly as they relate to student background. Current large-scale datasets fall short in three areas: accurate reporting, identification of race/ethnicity, and classification of socioeconomic status. This commentary focuses on the Common Core of Data (CCD), but other large-scale datasets have similar limitations.
Current national datasets such as the CCD are limited in capturing important demographic similarities and differences among students. The CCD, a database compiled by the National Center for Education Statistics (NCES), provides school and district data for public elementary and secondary education. It includes information about the racial composition of schools; school enrollment, including enrollment in charter and magnet schools; and the number of students eligible for free- and reduced-priced lunch. The data are limited because they depend on reporting by individual school districts, and data points are often missing for individual schools or whole districts. For example, the CCD data for the Washington, D.C. public schools for 2000-2010 show the district as having zero students enrolled in charter schools, but a quick Google search finds more than 35,000 students enrolled in 100 charter schools in the district (Focus D.C., 2014). Researchers can eliminate incomplete records, but the absence of accurate data limits any study that uses such large datasets.
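The reporting problem described above can be screened for before analysis. The following is a minimal sketch, not a description of how the CCD is actually processed: the records are made-up illustrative values, and the field names (`district`, `total_enrollment`, `charter_enrollment`) are hypothetical rather than actual CCD variable names. It flags districts whose charter enrollment is missing or zero in every reported year, so that a researcher can check them against an outside source instead of treating the zero as a true count.

```python
# Made-up illustrative records; field names are hypothetical, not CCD variables.
records = [
    {"district": "A", "year": 2008, "total_enrollment": 50000, "charter_enrollment": 0},
    {"district": "A", "year": 2009, "total_enrollment": 51000, "charter_enrollment": 0},
    {"district": "B", "year": 2008, "total_enrollment": 20000, "charter_enrollment": 1500},
    {"district": "B", "year": 2009, "total_enrollment": 21000, "charter_enrollment": None},
]


def flag_suspect_districts(rows):
    """Return districts whose charter enrollment is missing or zero in every year."""
    by_district = {}
    for row in rows:
        by_district.setdefault(row["district"], []).append(row["charter_enrollment"])
    # A zero may be genuine, so this only marks districts for manual checking.
    return sorted(
        name
        for name, values in by_district.items()
        if all(value in (None, 0) for value in values)
    )


suspect = flag_suspect_districts(records)
print(suspect)
```

In this sketch only district A is flagged. District B has one missing year but also a reported nonzero count, so it would be handled by ordinary missing-data procedures rather than by this check.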
LIMITS OF RACE/ETHNICITY
In addition to having large portions of missing data, existing large-scale datasets fail to capture the nuances of race and socioeconomic background in the twenty-first century. The ways in which schools identify race are limited to twentieth-century tools. Students who self-identify as Hispanic, African American, or Caucasian by checking a box on a school form are more diverse than these broad racial categories indicate. For example, the Hispanic category includes students whose families come from many different countries, students whose families have been American for multiple generations, and new immigrants. Venezuelan-Americans, who fall within the Hispanic category, have higher levels of education than the Hispanic population overall, while Mexican-Americans tend to have lower levels (Pew Hispanic Center, 2013). These within-category differences are likely to affect not only achievement but also opportunities within classrooms. Asian as a category likewise encompasses individuals from a large group of countries, such as China, Japan, Vietnam, and India. Much of the literature identifies important differences within these broad racial categories (Diamond & Diamond, 2011; Rodriguez, 2011; Warikoo & Carter, 2009). Existing datasets capture none of these nuances.
Nor do existing datasets fully capture the multiracial nature of the United States population. Decades of immigration have increased racial diversity in the United States, and with a more diverse population has come a larger multiracial population. In 2000, the census for the first time allowed respondents to identify with one or more races, and seven million people identified themselves as multiracial; by 2010, that number had grown to nine million. Some researchers suggest that the number is even higher than that reported by the census, and the share of individuals self-identifying as multiracial is estimated to rise to one in five by 2050 (Lee & Bean, 2004). Yet a multiracial population is not accurately captured by twentieth-century data collection tools. The research community, and by extension policy makers, would benefit from tools that capture not only multiracial identification but also more precise information, such as individual countries of origin, rather than relying on broad-stroke labels.
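One way to picture the kind of record such tools might collect is sketched below. This is an illustration under stated assumptions, not a proposed standard: the class `BackgroundRecord` and its fields are hypothetical, the category labels simply follow this commentary's usage, and the three students are invented. The sketch stores one or more self-identified categories plus optional countries of origin, and shows what is lost when the record is collapsed to a single check box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackgroundRecord:
    """Hypothetical student background record; all field names are illustrative."""

    races: frozenset                  # one or more self-identified categories
    origins: frozenset = frozenset()  # optional countries of origin or ancestry

    @property
    def is_multiracial(self):
        return len(self.races) > 1

    def single_box(self):
        """Collapse to the one label that a single check box would record."""
        if len(self.races) == 1:
            return next(iter(self.races))
        return "Two or more races"


# Invented examples, not real students.
a = BackgroundRecord(frozenset({"Hispanic"}), frozenset({"Venezuela"}))
b = BackgroundRecord(frozenset({"Hispanic"}), frozenset({"Mexico"}))
c = BackgroundRecord(frozenset({"Asian", "White"}), frozenset({"Vietnam"}))

print(a.single_box(), b.single_box(), c.single_box())
print(a.origins == b.origins, c.is_multiracial)
```

Under the single-box view, students `a` and `b` are indistinguishable even though their recorded origins differ, and student `c` is reduced to a generic label. The richer record keeps both pieces of information and can still be collapsed when a broad category is all an analysis needs.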
LIMITS IN SOCIOECONOMIC BACKGROUND DATA
Socioeconomic status is often identified by eligibility for free- and reduced-priced lunch. Eligibility for the lunch program is a simple, non-intrusive, dichotomous variable (eligible or not eligible) that is inexpensive to obtain. It is an oversimplified measure of income, however, because it has only two categories while income is typically divided into four or more levels. Further, research shows that participation in the free- and reduced-priced lunch program is not consistent across grades (Harwell & LeBeau, 2010): participation declines as students advance in grade level. In a study of participation in free- and reduced-priced lunch programs, participation was greatest for students aged eight to thirteen and lowest for students aged sixteen to eighteen (Oliveira, 2006). The decline in enrollment was attributed to students failing to return applications for the free lunch program. Thus, many students who are eligible for the program are not counted.
As a measure of class, eligibility for free- and reduced-priced lunch is limited. Eligibility is not equivalent to income. The income quintile into which one is born can significantly affect the opportunity to move to another quintile, and education is one means to facilitate this movement (Chetty, Hendren, Kline, & Saez, 2013). Existing large-scale datasets do not provide sufficient detail to analyze the effects of policy and reforms on students from different socioeconomic quintiles. A more refined and accurate national dataset for schools could provide the opportunity for analysis at this level.
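The gap between a two-category indicator and income quintiles can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the ten incomes are invented, `HYPOTHETICAL_CUTOFF` is not the lunch program's actual eligibility rule, and quintiles are computed by simple rank within this tiny invented population.

```python
# Invented incomes and an invented cutoff; neither reflects actual program rules.
HYPOTHETICAL_CUTOFF = 40_000
incomes = [12_000, 25_000, 38_000, 45_000, 60_000,
           75_000, 90_000, 120_000, 150_000, 250_000]


def eligible(income, cutoff=HYPOTHETICAL_CUTOFF):
    """Two-category indicator: at or below the cutoff, or above it."""
    return income <= cutoff


def quintile(income, population):
    """Rank-based quintile within the population: 1 is the lowest fifth, 5 the highest."""
    ranked = sorted(population)
    position = ranked.index(income)  # 0-based rank; ties take the lowest rank
    return position * 5 // len(ranked) + 1


# Collect the quintiles that each value of the indicator lumps together.
quintiles_by_flag = {True: set(), False: set()}
for income in incomes:
    quintiles_by_flag[eligible(income)].add(quintile(income, incomes))

print(quintiles_by_flag)
```

With these invented values, the "not eligible" group spans the second through fifth quintiles, and the "eligible" group spans the first and second. An analysis that relies on the indicator alone cannot distinguish a household just above the cutoff from one at the top of the distribution.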
Combining datasets can provide opportunities for better analysis, but this approach still relies to some extent on proxies for race and socioeconomic status. With tools for big data analysis in place, the missing piece is datasets that capture a more accurate picture of students, teachers, and schools. In the search for evidence-based and data-driven decision making, the collection of more nuanced data is important.
REFERENCES
Chetty, R., Hendren, N., Kline, P., & Saez, E. (2013). The economic impacts of tax expenditures: Evidence from spatial variation across the U.S. Retrieved from http://www.equality-of-opportunity.org/
Diamond, J. B., & Diamond, J. P. H. (2011). Black-White Disparities in Educational Outcomes: Rethinking Issues of Race, Culture, and Context. African American Children and Mental Health, 1, 63-94.
Focus D.C. (2014). DC's Public Charter Schools. Retrieved February 19, 2014, from http://focusdc.org/charter-facts
Harwell, M., & LeBeau, B. (2010). Student eligibility for a free lunch as an SES measure in education research. Educational Researcher, 39(2), 120-131.
Lee, J., & Bean, F. D. (2004). America's changing color lines: Immigration, race/ethnicity, and multiracial identification. Annual Review of Sociology, 221-242.
Oliveira, V. (2006). Food Assistance Landscape March 2006. USDA-ERS Economic Information Bulletin, 6-2.
Pew Hispanic Center (2013). A Nation of Immigrants. Washington, D.C.:
Rodriguez, N. (2011). Made it to America, now what? Understanding the educational achievement differences among Latino subgroups. Sociological Insight, 3, 20-39.
Warikoo, N., & Carter, P. (2009). Cultural explanations for racial and ethnic stratification in academic achievement: A call for a new and improved theory. Review of Educational Research, 79(1), 366-394.