Day Three: Data Fundamentals

SDS 189: Data and Social Justice

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2023

What is a dataset?

Grolemund, Garrett, and Hadley Wickham. n.d. R for Data Science. Accessed March 31, 2019. https://r4ds.had.co.nz/.

  • a collection of data points organized into a structured format
  • in this course, we will mainly work with datasets that are structured in a two-dimensional format
  • we will refer to these as rectangular datasets
  • rectangular datasets are organized into a series of rows and columns; ideally:
    • we refer to rows as observations
    • we refer to columns as variables

Observations vs. Variables vs. Values

  • Observations refer to individual units or cases of the data being collected.
    • If I were collecting data about each student in this course, one student would be an observation.
    • If I were collecting census data and aggregating it at the county level, one county would be an observation.
  • Variables describe something about an observation.
    • If I were collecting data about each student in this course, ‘major’ might be one variable.
    • If I were collecting county-level census data, ‘population’ might be one variable.
  • Values refer to the actual value associated with a variable for a given observation.
    • If I were collecting data about each student’s major in this course, one value might be SDS.
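To make the three terms concrete, here is a minimal Python sketch of the student example. The student names and two of the majors are invented for illustration; only the value SDS comes from the example above.

```python
# A tiny rectangular dataset: each dictionary is one observation (one student).
# The names and majors below are hypothetical.
students = [
    {"name": "Student A", "major": "SDS"},
    {"name": "Student B", "major": "Sociology"},
    {"name": "Student C", "major": "Political Science"},
]

# Variables are the columns: the keys shared by every observation.
variables = list(students[0].keys())

# One observation is one row.
first_observation = students[0]

# One value is the entry for a given variable in a given observation.
first_major = first_observation["major"]

print(variables)    # ['name', 'major']
print(first_major)  # SDS
```

Storing each row as a dictionary is only one possible representation; it was chosen here because the keys make the variable names visible in every observation.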


Key Considerations for Rectangular Datasets

  • All rows in a rectangular dataset are of equal length.

  • All columns in a rectangular dataset are of equal length.
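The equal-length rule can be checked mechanically. The sketch below assumes a dataset stored as a list of rows; the helper name `is_rectangular` and the sample data are hypothetical. If every row has the same number of cells, every column automatically has the same number of cells too.

```python
def is_rectangular(rows):
    """Return True if every row has the same number of cells."""
    row_lengths = {len(row) for row in rows}
    return len(row_lengths) <= 1


# Hypothetical data: every row has exactly two cells.
rectangular = [
    ["name", "major"],
    ["Student A", "SDS"],
    ["Student B", "Sociology"],
]

# Hypothetical data: the last row has an extra cell.
ragged = [
    ["name", "major"],
    ["Student A", "SDS"],
    ["Student B", "Sociology", "2026"],
]

print(is_rectangular(rectangular))  # True
print(is_rectangular(ragged))       # False
```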

Understanding Check

Let’s say I have a rectangular dataset documenting student names and majors, and I am missing major information for one student. What would this look like in a rectangular dataset?


Is this dataset rectangular?

Is this a rectangular dataset?

How do I find out more information about a dataset?

  • Metadata is often described as “data about data.”
  • Metadata provides important contextual information to help us interpret a dataset.
  • There are two types of metadata associated with datasets:
  • Administrative metadata tells us how a dataset is managed and its provenance, or the history of how it came to be in its current form:
    • Who created it?
    • When was it created?
    • When was it last updated?
    • Who is permitted to use it?
  • Descriptive metadata tells us information about the contents of a dataset:
    • What does each row refer to?
    • What does each column refer to?
    • What values might appear in each cell?

Where do I find metadata for a dataset?

  • Oftentimes metadata is recorded in a dataset codebook or data dictionary.
  • These documents provide definitions for the observations and variables in a dataset and tell you the accepted values for each variable.
  • Let’s say that I have a dataset of student names, majors, and class years. A codebook or data dictionary might tell me that:
    • Each row in the dataset refers to one student.
    • The ‘Class Year’ variable refers to “the year the student is expected to graduate.”
    • Possible values for the ‘Major’ variable are Political Science, SDS, and Sociology.
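A codebook like this can also be written down in a machine-readable form. The following is a minimal sketch, not a standard codebook format: the dictionary layout, the key names, and the `is_accepted` helper are assumptions made for illustration, and "Biology" is an invented value used to show a failed check.

```python
# Hypothetical machine-readable codebook for the student dataset.
codebook = {
    "unit_of_observation": "one student",
    "variables": {
        "Class Year": {
            "definition": "the year the student is expected to graduate",
        },
        "Major": {
            "accepted_values": ["Political Science", "SDS", "Sociology"],
        },
    },
}


def is_accepted(variable, value):
    """Check a value against the codebook's accepted values, if any are listed."""
    accepted = codebook["variables"][variable].get("accepted_values")
    return accepted is None or value in accepted


print(is_accepted("Major", "SDS"))      # True
print(is_accepted("Major", "Biology"))  # False
```

Variables with no listed accepted values, such as ‘Class Year’ here, pass the check by default.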

Types of Variables

  • Categorical Variables (Dimensions)
    • Nominal Variables: Named or classified labels (e.g. names, zip codes, hair color)
    • Ordinal Variables: Ordered labels (e.g. letter grades, pollution levels)
  • Numeric Variables (Measures)
    • Discrete Variables: Countable variables (e.g. number of students in this class)
    • Continuous Variables: Measured variables (e.g. temperature, height)
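The type of a variable shapes what we can sensibly do with it. The sketch below uses only Python's standard library, and all of the sample values are invented for illustration: nominal labels are counted, ordinal labels are compared using an order we state explicitly, and numeric values are counted or averaged.

```python
from collections import Counter
from statistics import mean

# Nominal: labels with no inherent order, so we count them.
hair_colors = ["brown", "black", "brown", "red"]
color_counts = Counter(hair_colors)

# Ordinal: labels with a meaningful order, which we state explicitly.
grade_order = ["F", "D", "C", "B", "A"]  # lowest to highest
grades = ["B", "A", "C", "B"]
highest_grade = max(grades, key=grade_order.index)

# Discrete: countable whole numbers.
number_of_students = len(grades)

# Continuous: measurements that can take any value in a range.
heights_cm = [160.0, 170.0, 180.0]
average_height = mean(heights_cm)

print(color_counts["brown"])  # 2
print(highest_grade)          # A
print(number_of_students)     # 4
print(average_height)         # 170.0
```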