Homework

Data Manipulations using the dplyr package

We are going to be working with the classic iris dataset and using dplyr TO BEND THE DATASET TO OUR WILL AND REFORGE IT IN OUR IMAGE.

## first things first load up the libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# and bring in the dataset

data(iris)

#shorten it to make a copy and also to make it easier to type

ir <- iris

Examine the structure of the iris data set. How many observations and variables are in the data set?

dim(ir)

## [1] 150   5

#looks like 150 observations and 5 variables
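dim() only answers the "how many" part of the question. As a small sketch of examining the structure itself, base R's str() also lists each column's type, which shows that Species is a factor and the four measurements are numeric. The names n_obs and n_vars are just illustrative.

```r
# iris ships with base R, so no extra packages are needed here
data(iris)

# str() prints one line per column: name, type, and the first few values
str(iris)

# nrow() and ncol() give the same two numbers as dim(), one at a time
n_obs <- nrow(iris)
n_vars <- ncol(iris)
```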

Create a new data frame iris1 that contains only the species virginica and versicolor with sepal lengths longer than 6 cm and sepal widths longer than 2.5 cm. How many observations and variables are in the data set?

iris1 <- ir %>% filter(Species %in% c("virginica","versicolor") & Sepal.Length > 6 & Sepal.Width > 2.5)

#check the structure

dim(iris1)

## [1] 56  5

#looks like we got a 56 5, 56 observations and 5 variables pawtner
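A side note on the filter above: filter() combines separate comma-delimited conditions with a logical AND, so the & version and the comma version should return the same rows. A minimal sketch to check that (keep, with_amp, and with_commas are made-up names for this illustration):

```r
library(dplyr)

keep <- c("virginica", "versicolor")

# Conditions joined with & (the form used above)
with_amp <- iris %>%
  filter(Species %in% keep & Sepal.Length > 6 & Sepal.Width > 2.5)

# The same conditions passed as separate arguments
with_commas <- iris %>%
  filter(Species %in% keep, Sepal.Length > 6, Sepal.Width > 2.5)

# Compare the two results
all.equal(with_amp, with_commas)
```

Which form to use is mostly a readability choice; the comma form is easier to split across lines.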

Now, create an iris2 data frame from iris1 that contains only the columns for Species, Sepal.Length, and Sepal.Width. How many observations and variables are in the data set?

iris2 <- select(iris1, Species, Sepal.Length, Sepal.Width)

# check the dims

dim(iris2)

## [1] 56  3

#we knocked out two columns so it makes sense that we have 56 observations still but only 3 variables
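An alternative sketch for the same selection: since both columns we want start with "Sepal", the tidyselect helper starts_with() can pick them up without typing each name. iris2_helper is a made-up name so it does not clobber iris2.

```r
library(dplyr)

iris1 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5)

# Species first, then every column whose name begins with "Sepal"
iris2_helper <- iris1 %>%
  select(Species, starts_with("Sepal"))

names(iris2_helper)
```

For two columns this saves very little typing, but the helper scales better on wide data frames.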

Create an iris3 data frame from iris2 that orders the observations from largest to smallest sepal length. Show the first 6 rows of this data set.

iris3 <- arrange(iris2,desc(Sepal.Length))

# check the first six rows

head(iris3)

##     Species Sepal.Length Sepal.Width
## 1 virginica          7.9         3.8
## 2 virginica          7.7         3.8
## 3 virginica          7.7         2.6
## 4 virginica          7.7         2.8
## 5 virginica          7.7         3.0
## 6 virginica          7.6         3.0
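Rows 2 through 5 above all tie at 7.7, and arrange() leaves tied rows in the order they already had. If a fully determined order is wanted, a second sort key can break the ties. This is only a sketch of that option, not something the exercise asks for; iris3_ties is a made-up name.

```r
library(dplyr)

iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length (largest first), then by sepal width within ties
iris3_ties <- iris2 %>%
  arrange(desc(Sepal.Length), desc(Sepal.Width))

head(iris3_ties)
```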

Create an iris4 data frame from iris3 that creates a column with a sepal area (length * width) value for each observation. How many observations and variables are in the data set?

# somebody call prof. x because we're about to make a mutant!

iris4 <- iris3 %>% mutate(Sepal.Area=(Sepal.Length * Sepal.Width))

dim(iris4)

## [1] 56  4

#with mutate we added one column but did not change the number of observations, so we have 56 observations and 4 variables (one with that new variable smell)
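A quick sanity check on the mutant, sketched under the assumption that the pipeline is rebuilt from scratch: the new column should equal length times width row by row, and the row count should not move.

```r
library(dplyr)

iris3 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length))

# mutate() adds Sepal.Area as a new last column and keeps every row
iris4 <- iris3 %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# Same rows, one more column, and the values match a manual calculation
nrow(iris4) == nrow(iris3)
ncol(iris4) == ncol(iris3) + 1
all.equal(iris4$Sepal.Area, iris3$Sepal.Length * iris3$Sepal.Width)
```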

Create iris5 that calculates the average sepal length, the average sepal width, and the sample size of the entire iris4 data frame and print iris5.

# whats that smell? It smells like SUMMARIZE TIME!!

# we will summarize to get the means and while we're here let's do the other stats too. Why not? Who's going to stop me? I fear neither god nor man.

iris5 <- iris4 %>% summarize(avg.Sepal.Length=mean(Sepal.Length), avg.Sepal.Width=mean(Sepal.Width),sd.Sepal.Length=sd(Sepal.Length), sd.Sepal.Width=sd(Sepal.Width),var.Sepal.Length=var(Sepal.Length), var.Sepal.Width=var(Sepal.Width),obs.count=n())

print(iris5)

##   avg.Sepal.Length avg.Sepal.Width sd.Sepal.Length sd.Sepal.Width
## 1         6.698214        3.041071       0.4863561      0.2535399
##   var.Sepal.Length var.Sepal.Width obs.count
## 1        0.2365422      0.06428247        56
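The summarize() call above repeats the same three functions for two columns. One way to cut the repetition, sketched here as an alternative rather than a correction, is dplyr's across(). The .names pattern is chosen to reproduce the column names used above, although the columns come out in a different order (all three stats for length, then all three for width). iris5_across is a made-up name.

```r
library(dplyr)

iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# Apply mean, sd, and var to both sepal columns in one go
iris5_across <- iris4 %>%
  summarize(
    across(c(Sepal.Length, Sepal.Width),
           list(avg = mean, sd = sd, var = var),
           .names = "{.fn}.{.col}"),
    obs.count = n()
  )

print(iris5_across)
```

As a side check, each variance should be the square of the matching standard deviation, which agrees with the printed values above (0.4863561 squared is about 0.2365422).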

Finally, create iris6 that calculates the average sepal length, the average sepal width, and the sample size for each species in the iris4 data frame and print iris6.

# here we are once again, we're summarizing, can't deny it, can't pretend. we're doing it by species this time!

# why do one stat when three will do!? we'll also do the variance and standard deviation. FOR COMPLETENESS!

iris6 <- iris4 %>% group_by(Species) %>%  summarize(avg.Sepal.Length=mean(Sepal.Length), avg.Sepal.Width=mean(Sepal.Width),sd.Sepal.Length=sd(Sepal.Length), sd.Sepal.Width=sd(Sepal.Width),var.Sepal.Length=var(Sepal.Length), var.Sepal.Width=var(Sepal.Width),obs.count=n())

print(iris6)

# interesting, our dataset is pretty virginica-heavy, that must have something to do with our selection criteria.
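To look at just the species balance without the rest of the summary, dplyr's count() is a shortcut for group_by() followed by summarize(n = n()). A small sketch (species_counts is a made-up name):

```r
library(dplyr)

iris4 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)

# One row per species that survived the filter, with its row count in column n
species_counts <- iris4 %>% count(Species)

species_counts
```

The counts should add up to the 56 observations found earlier.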

In these exercises, you have successively modified different versions of the data frame: iris1, iris2, iris3, iris4, iris5, iris6. At each stage, the output data frame from one operation serves as the input for the next. A more efficient way to do this is to use the pipe operator %>%, which comes from the magrittr package and is re-exported by dplyr. See if you can rework all of your previous statements (except for iris5) into an extended piping operation that uses iris as the input and generates irisFinal as the output.

# I was using pipes from the get go! They're so handy!

# but I am a heathen and don't believe in using the new line. one. big. line. 

irisFinal <- ir %>% filter(Species %in% c("virginica","versicolor") & Sepal.Length > 6 & Sepal.Width > 2.5) %>% select(Species, Sepal.Length, Sepal.Width) %>% arrange(desc(Sepal.Length))%>% mutate(Sepal.Area=(Sepal.Length * Sepal.Width))%>% group_by(Species) %>%  summarize(avg.Sepal.Length=mean(Sepal.Length), avg.Sepal.Width=mean(Sepal.Width),sd.Sepal.Length=sd(Sepal.Length), sd.Sepal.Width=sd(Sepal.Width),var.Sepal.Length=var(Sepal.Length), var.Sepal.Width=var(Sepal.Width),obs.count=n())
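For anyone who does believe in the new line, here is the same pipeline sketched with one verb per line. It is meant to be behaviorally identical to the one-liner above; only the layout changes. It reads from iris directly rather than the ir copy so the block runs on its own.

```r
library(dplyr)

irisFinal <- iris %>%
  # keep the two species and the larger sepals
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  # keep only the three columns of interest
  select(Species, Sepal.Length, Sepal.Width) %>%
  # largest sepal length first
  arrange(desc(Sepal.Length)) %>%
  # add the area column
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  # one summary row per species
  group_by(Species) %>%
  summarize(
    avg.Sepal.Length = mean(Sepal.Length),
    avg.Sepal.Width  = mean(Sepal.Width),
    sd.Sepal.Length  = sd(Sepal.Length),
    sd.Sepal.Width   = sd(Sepal.Width),
    var.Sepal.Length = var(Sepal.Length),
    var.Sepal.Width  = var(Sepal.Width),
    obs.count        = n()
  )

print(irisFinal)
```

Note that Sepal.Area is computed but then dropped by summarize(), since none of the summary columns use it.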

Create a ‘longer’ data frame using the original iris data set with three columns named “Species”, “Measure”, “Value”. The column “Species” will retain the species names of the data set. The column “Measure” will include whether the value corresponds to Sepal.Length, Sepal.Width, Petal.Length, or Petal.Width and the column “Value” will include the numerical values of those measurements.

ir.long <- ir %>% pivot_longer(cols = Sepal.Length:Petal.Width,names_to = "Measure",values_to = "Value")
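A quick check on the reshaped data, as a sketch: pivoting four measurement columns over 150 flowers should give 150 * 4 = 600 rows and the three requested columns, with each measure appearing 150 times.

```r
library(dplyr)
library(tidyr)

ir.long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")

# Dimensions and column names of the long data frame
dim(ir.long)
names(ir.long)

# How many rows each measure contributes
ir.long %>% count(Measure)
```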

What’s that you say? You want to see a graph of the data? Well, since you asked so nicely!

theme_set(theme_bw())

ggplot(data = ir.long,aes(x = Species,y=Value,fill=Measure))+geom_bar(stat = 'identity',position = 'dodge')+scale_fill_manual(values = c("#219ebc","#023047","#ffb703","#fb8500"))
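One caveat about the plot above: ir.long has 50 values for every Species and Measure pair, so with stat = 'identity' the 50 bars in each slot are drawn on top of one another and, as far as I can tell, the visible height ends up being the largest value. If the intent is to compare typical sizes, one option is to average first and plot the means. This is a sketch of that alternative, not the plot the exercise produced; ir.means and Mean.Value are made-up names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# One mean per species and measure: 3 species x 4 measures
ir.means <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value") %>%
  group_by(Species, Measure) %>%
  summarize(Mean.Value = mean(Value), .groups = "drop")

# geom_col() plots the values as given, so each bar is exactly one mean
p <- ggplot(ir.means, aes(x = Species, y = Mean.Value, fill = Measure)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#219ebc", "#023047", "#ffb703", "#fb8500")) +
  theme_bw()

p
```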

Homework_7

RMJ

2025-02-26
