Get Data Ready to Use! (Data Literacy 101)
Sometimes it can feel like we are swimming in data. This may feel especially acute right now, living in our new normal with COVID-19 and through the recent election cycle. However, just because we have lots of data doesn’t mean it is easy to make sense of it all.
In fact, moving between having data and making meaning from those data can be challenging for many (see Hunter-Thomson 2020 for strategies on making better claims). Several steps must happen for the words and numbers of raw data to come together to help us answer our testable questions. A key initial step, which is often skipped for a variety of reasons, is getting the data ready to analyze; doing so provides lots of benefits down the road for analyzing and interpreting the data. Let's explore how we can help our students strengthen their understanding and skills to explore and prepare their data, part of what John Tukey (1980) coined exploratory data analysis (EDA).
What follows are things that we do when working with data regardless of what software (e.g., Excel, Numbers, Google Sheets) we are using or whether we are having our students work with the data by hand. These steps are part of preparing data in any context, so apply them in whatever way you are having your students work with data.
Also, we should think both about helping students with how to execute a specific task as well as why we are doing what we are doing with the data. Without the latter, our students are more likely to freeze up when the software is updated and/or the data doesn’t look the same in the future. See the Resource section for links to how-to guides for commonly used spreadsheet programs and/or software options designed around students exploring the data as they analyze and interpret it.
One final note regarding software: It can be helpful to remember that spreadsheet programs (e.g., Google Sheets, Excel) are built to organize, manipulate, and derive data in a tabular format. A secondary function, which evolved over time, is to create visualizations (e.g., graphs, charts, plots, maps). Therefore, it is not too surprising that these programs are good at the former and sometimes clunky or limited at the latter. Fortunately, it is the features of the former that we are going to talk about here.
Nathan Yau, who runs FlowingData.com and writes data visualization books, said: “I spend just as much time getting data in the format that I need as I do putting the visual part of a data graphic together” (2011, p. 39). Wow, that is a lot of time spent getting the data ready. Let me state, emphatically, that this is not what middle school students should do every time they work with data. But they absolutely should start to understand and practice some of these steps.
I suggest a good place to start is unpacking how we organize the data in a tabular format. We often record data as we are collecting it in ways that make it easiest to record. This is how it should be. Data collection is time consuming and challenging, so anything we can do to make it easier is preferred. But these ways of recording data are often not the most conducive for exploring, analyzing, and interpreting the data. This is where trouble can creep into our classrooms—when we try to “round peg, square hole” data organized for collection directly into data for analysis and interpretation.
There are two ways that we typically tabulate data: wide and long. Let’s explore each of these ways with an example of data students could collect to explore cyclical patterns in moon phases and tidal heights (a component of MS-ESS1-1). The data have four variables: moon phase, month of observation, mean high tide height, and group recording the data.
A wide layout is typical of how data tables are organized in published papers and/or oriented for data collection by hand. The columns and rows are arranged to communicate organizing frames of information (usually two variables), and the values within the cells are often the observation values for a third variable (Healy 2019).
Table 1 represents a wide organization of our data. Each student group has their own data table to enter their data. Each group needs to enter the tidal height data as they record it on the basis of the moon phase and month of the investigation. This makes data collection for each group relatively straightforward.
However, in this layout it can be tricky to visualize high tide height across the phases, months, or groups (especially if we have more data points). Part of the trouble is that tide height is not its own column that we can direct a software platform to graph, but rather it is within the moon phase and month of observation organization. Additionally, the units for tide height are within the cell as the data value, making it hard to graph because many software packages will read the data as text rather than a number. Finally, it is cognitively hard to look at all of the data across the groups’ tables (collate the data) to try to investigate questions of the data.
So, the upsides of wide formatting are that it saves space when organizing data at the end of an investigation and that it is a good option when originally collecting data by hand. However, the downsides are that it can be harder to collate the data and to visually explore the data for analysis and interpretation.
A long layout is typical of how online data sets and data portals are organized. Long formatting, or tidy data as Wickham and Grolemund (2016) call it, is often used for data collection in technology platforms and EDA. Every variable is assigned its own column with applicable units in the column header, every observation is in its own row, and every cell has the applicable value noted for each column (also known as no cell merging; Healy 2019). Interestingly, long formatting is often how our students look at data in tables in math classes (x and y columns of numbers).
Table 2 represents a long organization of our data. Data from all of the groups are combined (collated) into one table that is organized by the month of observation, then moon phase, then group. The units are included within the column header so that the data within the cell is in the appropriate setup of text or numeric, and the different categorical values within a variable can be easily graphed as such. Additionally, with each variable in its own column, we can more easily direct the software program to calculate statistics for all or parts of the data set.
However, originally collecting data—especially by hand—in this layout can be tricky, as you need to know the number of rows required for each category ahead of time. Additionally, this takes up a lot of space on a page and does not easily communicate summary findings once you have a conclusion from the data.
So, the upsides of long formatting are that you can see all the data and that the data are oriented in a way that software programs anticipate data to be organized for graphing and making calculations from the data. This makes exploring and visualizing data much easier. However, the downsides are that it is not an efficient way to organize findings at the end of an investigation when you want to save space, nor is it an easy orientation for collecting data, especially by hand.
In other words, there is no one way to organize a data table, but it is important to think through how the organization of our data table does or does not help students to accomplish what we are asking them to do with data in any given activity. If we are asking them to collect data by hand or evaluate a provided claim from summarized data, then a wide format is probably better. But if we are asking them to make graphs of the data and/or discover what the data demonstrate about a certain question or proposed hypothesis, then a long format is probably better.
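For teachers (or students) comfortable with a little code, the reshape from wide to long can be sketched in a few lines of Python. The phases, months, and tide values below are purely illustrative, not the article's actual Table 1 or Table 2 data:

```python
# A wide table: one dictionary per moon phase, one key per month of
# observation, with the tide height as the cell value.
# Values are hypothetical, not taken from the article's tables.
wide = {
    "New Moon": {"June": 1.4, "July": 1.5},
    "Full Moon": {"June": 1.3, "July": 1.6},
}

# Reshape to long format: every variable gets its own column (key),
# and every observation gets its own row (dictionary).
long_rows = []
for phase, by_month in wide.items():
    for month, tide_m in by_month.items():
        long_rows.append(
            {"moon_phase": phase, "month": month, "high_tide_m": tide_m}
        )
```

Notice that tide height now lives in its own `high_tide_m` column with the units in the column name, which is exactly what makes the long rows easy to sort, filter, and graph.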
Beyond how we format the data table, a second place to help students gain skills in data organization is in labeling things in data tables. Bowen and Bartley (2014) identified what to look for in an effective data table (p. 118). I suggest a few additional components and articulate how these components, and those identified by Bowen and Bartley, vary if you are using wide or long format.
Organizing data can be peppered throughout our work with data across the year. Students can practice these skills while organizing their own data (or data someone else collected that they are using), as well as while reviewing data tables created by others. The important thing is that we explicitly talk about data organization with our students so that they begin to see it as a necessary and important part of working with data.
Another aspect of EDA that we can bring into our classrooms is the initial processing of data. This is where the power of spreadsheet programs truly helps, but processing can also be done by hand. I would first recommend that you organize the data in a long format so that you can visually explore what is there, either in tabular or graphic formats (see long formatting section above).
Processing the data is not about cherry picking certain data values out of the data. Instead, this involves sorting, filtering, summarizing, calculating, and so forth. Consider having your students, where appropriate, practice the following data processing steps:
Identify variable type: An important aspect of variable type, beyond independent versus dependent, is categorical versus numerical (see Hunter-Thomson 2019, for more on ways we talk about data). We visualize these types of variables differently, and often there can be a disconnect between what our software programs “guess” the variable is and what we know that it is. Taking time to identify which variables are categorical or numerical at the start can save you a lot of time down the road. (Note: you can have numbered categories [i.e., numbered months of the year, not names], so this sometimes involves more than just looking for numbers or text.) Once you identify which are which, then take a few moments to make sure the data in each cell of that column are entered appropriately (e.g., numerical data only have numbers in the cell, no units or text).
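For those who want to automate this last check, a small sketch in Python can scan a column that is supposed to be numeric and flag the cells that won't parse as numbers (the function name and sample values here are hypothetical):

```python
# Flag cells in a "numeric" column that will not parse as numbers,
# such as units or text typed into the value itself.
def non_numeric_cells(values):
    bad = []
    for v in values:
        try:
            float(v)
        except (TypeError, ValueError):
            bad.append(v)
    return bad

# "1.6 m" and "high" would be read as text by graphing software,
# so they get flagged for cleanup.
flagged = non_numeric_cells(["1.4", "1.6 m", "high", "2.0"])
```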
Sort: Sometimes we have lots of data, but our testable question only relates to a particular subset of the data. Rather than looking at all of the data all the time, we can sort the data to group it by different components of the data. Use a spreadsheet platform’s Sort feature to have the platform quickly do this for you (i.e., no need to copy and paste to move the data around yourself). Make sure to select all of your data when sorting so the corresponding data stays together as it is sorted by one variable. Also, sorting can be a great way to make a dataset in long formatting organized in a way that makes the most sense to you (e.g., you could sort the data in Table 2 by moon phase first rather than month if your question was more interested in comparing tide height by moon phase rather than tide height within or by month).
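The same sort can be sketched in Python, where each row is kept together as a unit, mirroring the advice to select the entire table before sorting. The rows and values below are hypothetical (months are numbered, as noted above):

```python
# A small long-format table; each dictionary is one complete row.
# Values are illustrative, not the article's actual data.
rows = [
    {"month": 7, "moon_phase": "Full Moon", "high_tide_m": 1.6},
    {"month": 6, "moon_phase": "New Moon", "high_tide_m": 1.4},
    {"month": 6, "moon_phase": "Full Moon", "high_tide_m": 1.3},
]

# Sorting whole rows keeps each observation's values together,
# the equivalent of highlighting the full table before using Sort.
# Here we sort by moon phase first, then by month within each phase.
by_phase = sorted(rows, key=lambda r: (r["moon_phase"], r["month"]))
</```

Because the sort key is a tuple, changing the question (say, comparing tide height by month instead) is just a matter of reordering the key to `(r["month"], r["moon_phase"])`.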
Filter: We can also filter data to better narrow in on what is relevant to our testable question. We can filter for what we do want so that it remains visible, or filter out what we don't want so that it is "hidden" away. Filtering does not delete the data, but rather removes it from our field of view. This makes it easier to analyze the data we need now (e.g., if you are only interested in comparing tide height at the full and new moon phases, you could filter out the first and third quarter data from Table 2), and we can go back and explore something else later. Again, most spreadsheet programs have a built-in Filter feature.
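In code, a filter is simply a view that keeps some rows visible while leaving the original data untouched. A minimal Python sketch with hypothetical rows:

```python
# Illustrative long-format rows covering four moon phases.
rows = [
    {"moon_phase": "Full Moon", "high_tide_m": 1.6},
    {"moon_phase": "First Quarter", "high_tide_m": 1.1},
    {"moon_phase": "New Moon", "high_tide_m": 1.4},
    {"moon_phase": "Third Quarter", "high_tide_m": 1.0},
]

# Keep only the phases relevant to the question. The original rows
# are untouched, so nothing is deleted, just hidden from view.
full_and_new = [
    r for r in rows if r["moon_phase"] in ("Full Moon", "New Moon")
]
```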
Calculate summary statistics: While it is always best to first look at all of your data, we have summary statistics for a reason. Most spreadsheet programs can calculate our middle school summary statistics: mode, median, mean, maximum/minimum, interquartile range, and so forth. Often you can either click to select a function or type the function’s equation into a cell. Either way works, so I would recommend doing what is easiest for your students. The benefit is that the computer does the work for you so you save computing time that can now go toward analysis and interpretation. But the downside is that if you don’t have a sense of what the summary statistic provides about your data, then the output won’t help much.
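For reference, Python's built-in `statistics` module covers the same middle school summary statistics that spreadsheet functions do. The tide heights below are illustrative values, not the article's data:

```python
import statistics as stats

# Hypothetical high tide heights in meters.
tides = [1.3, 1.4, 1.4, 1.6, 2.0]

mean_tide = stats.mean(tides)      # arithmetic mean
median_tide = stats.median(tides)  # middle value when sorted
mode_tide = stats.mode(tides)      # most common value
tide_range = max(tides) - min(tides)  # maximum minus minimum
```

As in a spreadsheet, the computation is the easy part; knowing what each statistic tells you about the data is where the interpretation work lives.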
Create Pivot Tables: Often we are interested in getting a sense of the data overall (e.g., how many instances of different values are there, what is the average of X by Y category). But if there are lots of data, it can be very time consuming to manually make such calculations. Pivot Tables to the rescue! Simply put, Pivot Tables are summary tables where the spreadsheet program is doing the work for you. If your data are well organized in a long format with no merged cells, then all you need to do is highlight your full data table, insert a Pivot Table, and direct the spreadsheet program as to what variable to put in the rows or columns of the new table and what calculation (e.g., count, average, totals) you want as the cells. Table 3 is an example of one configuration of a Pivot Table you could make from the data we have been looking at so far.
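Under the hood, a Pivot Table is just grouping plus a calculation. A sketch in Python of a pivot like Table 3 (maximum tide height by moon phase and group), using hypothetical rows and values:

```python
from collections import defaultdict

# Illustrative long-format rows for two groups; not the article's data.
rows = [
    {"group": "A", "moon_phase": "Full Moon", "high_tide_m": 1.6},
    {"group": "A", "moon_phase": "New Moon", "high_tide_m": 1.4},
    {"group": "B", "moon_phase": "Full Moon", "high_tide_m": 1.7},
    {"group": "B", "moon_phase": "New Moon", "high_tide_m": 1.3},
]

# Pivot: moon phase down the rows, group across the columns,
# and the maximum tide height in each cell.
pivot = defaultdict(dict)
for r in rows:
    cell = pivot[r["moon_phase"]]
    g = r["group"]
    cell[g] = max(cell.get(g, r["high_tide_m"]), r["high_tide_m"])
```

Swapping `max` for a count or an average gives the other common Pivot Table calculations, which is why a clean long-format table with no merged cells makes this step nearly automatic.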
If we consciously leverage the technology resources that we have and put some time toward preparing our data, we can better empower our students to analyze and interpret the data. We’ve got the technology, especially now more than before, so let’s use it for what it was designed to do and make our lives a bit easier for us and our students.
There are some things we can do as we teach our students how to organize and process data to set us up for more success, including (a) don’t merge cells until putting together a data table to demonstrate a conclusion, as it makes it harder to explore and graph the data; (b) know what you are specifically asking students to do with the data and organize things accordingly, unless you are teaching them organizing; and (c) teach the process for how to do things as well as why, so that students build their data toolkits to use in the future—not just get to the end product of a particular activity/lesson.
As with almost every other data skill, spending the time to teach students some of the steps of organizing and processing data is not something we need to or should do every time they work with data. But if we never expose our students to these steps, provide them opportunities to practice in a supported way, or communicate how this is part of the actual process of working with data, then we set students up with false expectations of what goes into "Analyzing and Interpreting Data" (NGSS Lead States 2013).
What approaches do you use to help your students understand organizing and processing data? What prompts do you have your students consider when reviewing someone else’s data table? Remember, becoming more data literate is a marathon, not a sprint. While our teaching situations look different these days, more than ever let’s remember that together we can do this! •
Table 1. Examples of data students could collect during an investigation looking at mean high tide with different moon phases, organized in two wide-formatted tables.
Table 2. Examples of data students could collect during an investigation looking at mean high tide with different moon phases, organized in one long-formatted table.
Table 3. Example of a Pivot Table to compare the maximum tide heights recorded by each group for each moon phase to determine if there are big differences between the two groups' data.
Resources for organizing and/or processing in spreadsheet programs:
The Essential Google Spreadsheet Tutorial by Smartsheet— https://www.smartsheet.com/essential-google-spreadsheet-tutorial
Basic tasks in Excel by Microsoft Support—https://support.microsoft.com/en-us/office/basic-tasks-in-excel-dc775dd1-fa52-430f-9c3c-d998d1735fca
Programs designed for data exploration:
CODAP by Concord Consortium—https://concord-consortium.github.io/codap-data/ (free tools, datasets, and activities)
Tuva Labs—https://tuvalabs.com/ (free tools; most datasets and activities are part of a Premium subscription)
Fathom Dynamic Data Software—https://fathom.concord.org/ (free activities, $5.25 a year license for tools)
TinkerPlots by Learn Troop—https://www.tinkerplots.com/ (free activities, $7 a year license for tools)
Tableau Academic Programs by Tableau Software, LLC—https://www.tableau.com/academic (free tools and datasets)
Bowen, M., and A. Bartley. 2014. The basics of data literacy: Helping your students (and you!) make sense of data. Arlington, VA: National Science Teachers Association.
Healy, K. 2019. Data visualization: A practical introduction. Princeton, NJ: Princeton University Press.
Hunter-Thomson, K. 2019. Data literacy 101: What do we really mean by “data”? Science Scope 43 (2): 18–22.
Hunter-Thomson, K. 2020. Data literacy 101: What can we actually claim from our data? Science Scope 43 (6): 20–26.
NGSS Lead States. 2013. Next Generation Science Standards: For States, By States. Washington, DC: National Academies Press.
Tukey, J.W. 1980. We need both exploratory and confirmatory. The American Statistician 34 (1): 23–25.
Wickham, H., and G. Grolemund. 2016. R for data science: Import, tidy, transform, visualize, and model data. Sebastopol, CA: O'Reilly Media.
Yau, N. 2011. Visualize this: The FlowingData guide to design, visualization, and statistics. New York, NY: Wiley & Sons.