“It’s not ethical to use a dataset without spending time getting a very good understanding of what the data means.” 

Heather Kraus

When I started teaching an applied statistics class for undergraduates, I resolved to give my students as much practice working with real datasets as possible. A colleague introduced me to the newsletter Data Is Plural, a goldmine of random, relevant, serious, interesting, funny, and obscure datasets. One dataset, from Reddit user swiftdata1989, counts every time a unique color is mentioned in a Taylor Swift song, broken down by song and album. Why? No idea, but I quickly incorporated it into an in-class activity about making pivot tables so that I could blast my favorite song while students worked and rolled their eyes at me (Data Is Plural 2024).

Other datasets are more legit. The same newsletter that included the swifty spread also shared a public dataset of groundwater levels and another with study abroad metrics for US college students (Data Is Plural 2024). When dealing with more serious data, users need to learn the context of their datasets so that they can use them accurately and ethically and understand their limitations. Heather Kraus, who runs the training organization We All Count and advocates a data equity framework for professional settings, writes that “it’s not ethical to use a dataset without spending time getting a very good understanding of what the data means” (Kraus 2019). To that end, Kraus and We All Count provide a tool that guides analysts in asking questions of their datasets. Kraus refers to this tool as a “data biography,” which boils down to asking the who/what/when/where/why of your data.

Asking Questions 

As an example of why asking these questions of our data can be so important, consider another dataset I first learned about from Data Is Plural: the New York Prison Employee Discipline Data. In its July 12, 2023, newsletter, Data Is Plural provides a great description that sets us up to ask and answer questions about the dataset:

NY prison employee misconduct. For a series of articles (co-published with The New York Times) investigating abuse by prison guards in New York State, The Marshall Project obtained and analyzed data representing 12 years of Department of Corrections and Community Supervision employee disciplinary notices. The newsroom is publishing those records, which they’ve converted from two PDFs into tabular data, along with additional context and caveats. The records contain ~6,000 (non-redacted) notices; they indicate the employee name, title, facility, union, type of misconduct, case disposition, description, and penalty, among other details. (Data Is Plural 2023)

Already, the “who” here is complicated. While the records were compiled, cleaned, and made accessible by The Marshall Project (who describe themselves as a “nonpartisan, nonprofit news organization that seeks to create and sustain a sense of national urgency about the US criminal justice system”), the record of misconduct starts with the prison system itself (The Marshall Project n.d.). If the cases had never been documented internally, The Marshall Project would not have had any records to pull together. Both institutions (and the actors within them) influence the final dataset we get. Even if The Marshall Project deems something worth knowing, they are reliant on what the prison system first collected. For example, the original documents provided to The Marshall Project do not report the race of the inmates or the officers (missing variables), and observations corresponding to open cases were redacted (missing observations).

The “why” matters a lot here, too: it intersects with the “who” and has implications for the analysis of the data itself. The Marshall Project wanted to document cases of abuse, and that’s explicitly what they focused on. As they describe in an article detailing their database creation process:

We limited our analysis to cases alleging physical abuse of incarcerated people by front-line security staff—corrections officers, sergeants and lieutenants. We included any case that the department itself categorized as “inmate abuse,” as well as cases where the description indicated that it involved an “excessive,” “unnecessary,” “inappropriate,” “without authorization” or “unjustified” use of physical force on a prisoner. Using natural language processing methods, we clustered similar incident descriptions to identify cases that the agency did not designate as “inmate abuse,” but had the same description as those marked that way. (Meagher 2023)

The NY prison system’s purpose in collecting data is not limited to documenting abuse of incarcerated people (though this is part of the purview of the data collection), but it was the sole purpose for The Marshall Project. Therefore, the latter group took extra care to identify cases of abuse, even if that’s not how they were explicitly labeled by the prison system, and clearly explained how they created those labels. If their investigative focus had instead been on falsified records, that focus would likely have guided their additional analysis of the data. Neither approach is “correct” or “incorrect”; rather, the purpose of the data collection shapes how we can use it and what conclusions we can reliably draw from analyses of the data.
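To make the idea of a purpose-driven label concrete, here is a minimal Python sketch of the keyword portion of the rule quoted above. It is an illustration only, not The Marshall Project's actual code: the field names `category` and `description` and the function name `is_flagged` are hypothetical, the check for the word "force" is a simplifying assumption, and the clustering step they describe is not reproduced.

```python
# Hypothetical sketch: flag a disciplinary notice as alleged abuse if the
# department's own category says so, or if the description pairs one of the
# quoted keywords with a mention of force. Field names are assumptions.
KEYWORDS = (
    "excessive",
    "unnecessary",
    "inappropriate",
    "without authorization",
    "unjustified",
)


def is_flagged(notice: dict) -> bool:
    """Return True if the notice matches the simplified labeling rule."""
    category = notice.get("category", "").lower()
    description = notice.get("description", "").lower()
    if category == "inmate abuse":
        return True
    # Simplification: require the word "force" plus at least one keyword.
    mentions_force = "force" in description
    return mentions_force and any(word in description for word in KEYWORDS)
```

A different investigative purpose would mean a different keyword list and a different set of flagged rows from the same source records, which is exactly why the "why" belongs in a data biography.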

The Activity 

In addition to providing a trove of usable datasets for teaching and research, the Data Is Plural archive is a great way to introduce students to the concept of asking questions of their secondary data sources. While Data Is Plural compiles links and detailed descriptions, it is not the original collector of the data. Students can struggle to understand that who collects, who reports, and who disseminates information can all be different. These are questions students must learn to ask of their data, and answering those questions is a skill they must spend time practicing.

To guide students in this practice in my introductory business statistics class, I combine Data Is Plural’s archive with an adapted version of We All Count’s data biography tool. This exercise, shared below, does not require an extensive background in statistics or data analysis and could easily be adapted to other settings or used as an introductory exercise in a data analysis course.

The activity also serves the purpose of allowing students to see real-world, messy, and even unwieldy datasets without the built-in burden of needing to clean and use them for analysis right away. Students may be surprised at things like how many variables a dataset might include, how much missing data there is, and sometimes how hard it is to find good information about where the data came from and how and why it was collected. As an added bonus, you as the instructor may learn about some pretty great data out there, too. 
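For instructors or students who want a quick first look without Excel, the features mentioned above (how many variables there are and how much data is missing) can be summarized in a few lines. The following is a minimal Python sketch, assuming the dataset is a CSV with a header row; the function name `profile_csv` is hypothetical, and the sample rows are invented for illustration rather than taken from any real dataset.

```python
import csv
import io


def profile_csv(text: str) -> dict:
    """Return the row count, column names, and blank-cell count per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent fields and empty or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": n_rows, "columns": columns, "missing": missing}


# Invented example data; a real dataset would be read from a downloaded file.
sample = "song,album,color\nSong A,Album X,red\nSong B,,blue\nSong C,Album Y,\n"
summary = profile_csv(sample)
print(summary)
```

A summary like this only describes the file. It cannot say who collected the data or why, which is the part the data biography asks students to track down.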


Data Biographies Activity 

Learning Goals 

  • Learn about what secondary datasets exist and are available to researchers
  • Identify strategies for sourcing the information in a data biography 

Directions 

  1. Explore the Data Is Plural archives and identify a dataset highlighted in one of the issues that you find interesting. Choose a dataset that you can actually download and open in Excel (this may take a few tries).
  2. Read An Introduction to the Data Biography (very short) 
  3. Go to the source website for your dataset and look for descriptions of the data collection process, FAQs about the data, information about the organization, etc.
  4. Take notes on the questions below.  
Overview: Short Description of the Dataset 
  • What is the observational unit of the dataset? (i.e., what does each row of data correspond to?)
  • What are some (3-4) of the variables in this dataset? What measurement level are these variables? 
  • Who collected the data? 
  • What are the methods behind the data collection design and process? 
  • In what locations was the data collected? 
  • For what purpose was the data collected? 
  • When was the data collected? 
  • Are there any barriers to accessing the data? 
  • Other notes on the who/what/when/where/why/how of this dataset: 
  • Was the data what you expected it to be? 

References 

Data Is Plural. n.d. Accessed July 11, 2025. https://www.data-is-plural.com/.

Data Is Plural. 2023. “2023.07.12 edition: US military interventions, the latest White House visitors, NY prison employee misconduct, Australian mine production, and ‘the global human day.’” https://www.data-is-plural.com/archive/2023-07-12-edition/.

Data Is Plural. 2024. “2024.01.31 edition: Groundwater levels, military surplus, study abroad, dot-gov metadata, and Taylor’s colors.” https://www.data-is-plural.com/archive/2024-01-31-edition/.

Kraus, Heather. 2019. “An Introduction to the Data Biography.” We All Count. January 21, 2019. https://weallcount.com/2019/01/21/an-introduction-to-the-data-biography/.

Meagher, Tom. 2023. “How We Investigated Abuse by Prison Guards in New York.” The Marshall Project. May 19, 2023. https://www.themarshallproject.org/2023/05/19/new-york-prison-officer-abuse-how-we-investigated.

Meagher, Tom, and The Marshall Project. 2023. “New York Prison Employee Discipline Data.” Hosted on Observable. https://observablehq.com/@themarshallproject/new-york-prison-employee-discipline-data.

Santo, Alysia, Joseph Neff, and Tom Meagher. 2023. “Guards Brutally Beat Prisoners and Lied About It. They Weren’t Fired.” The New York Times. May 19, 2023. https://www.nytimes.com/2023/05/19/nyregion/ny-prison-guards-brutality-fired.html.

swiftdata1989. 2023. “Every time a color is mentioned on a Taylor Swift album.” Accessed July 11, 2025. https://www.reddit.com/r/dataisbeautiful/comments/12u0ncx/comment/jh4zehl/?context=3.

The Marshall Project. n.d. “About Us.” Accessed July 11, 2025. https://www.themarshallproject.org/about?via=navright.

The Marshall Project. n.d. “How New York’s Abusive Guards Keep Their Jobs.” Accessed July 11, 2025. https://www.themarshallproject.org/tag/when-guards-abuse-prisoners.

We All Count. n.d. Accessed July 11, 2025. https://weallcount.com/.

We All Count. n.d. “The Data Equity Framework.” Accessed July 11, 2025. https://weallcount.com/the-data-process/.


About the Author

Cora Wigger is an assistant professor of economics and a 2025–2027 CEL Scholar. Her research focuses on the intersections of education and housing policy, with an emphasis on racial inequality and desegregation. At Elon, she teaches statistics and data-driven courses and contributes to equity-centered initiatives like the “Quant4What? Collective” and the Data Nexus Faculty Advisory Committee.

How to Cite This Post

Wigger, Cora. 2025. “Learning Your Data: Teaching with Data Biographies.” Center for Engaged Learning (blog), Elon University. August 5, 2025. https://www.centerforengagedlearning.org/learning-your-data-teaching-with-data-biographies/.