Lab 2A: Working with Data (Easy)

Import the Data

  1. Set your working directory.
setwd("C:/Workshop/Data")
  1. Load the data from the tab-delimited data file.
movies <- read.table(
  file = "Movies.txt", 
  sep = "\t", 
  header = TRUE,
  quote = "\"")
  1. Take a peek at data.
head(movies)
##                   Title Year Rating Runtime Tomato.Meter Box.Office
## 1  The Whole Nine Yards 2000      R  98 min           45     $57.3M
## 2             Gladiator 2000      R 155 min           76    $187.3M
## 3      Cirque du Soleil 2000      G  39 min           45     $13.4M
## 4              Dinosaur 2000     PG  82 min           65    $135.6M
## 5     Big Momma's House 2000  PG-13  99 min           30      $0.5M
## 6 Gone in Sixty Seconds 2000  PG-13 118 min           24    $101.0M
  1. Inspect the column names.
names(movies)
## [1] "Title"        "Year"         "Rating"       "Runtime"     
## [5] "Tomato.Meter" "Box.Office"

Problem 1: Column name is incorrect

  1. Inspect the incorrect column name (i.e. Tomato.Meter).
names(movies)[5]
## [1] "Tomato.Meter"
  1. Rename the column to “Critic.Score”.
names(movies)[5] <- "Critic.Score"
  1. Verify that problem 1 is solved.
names(movies)
## [1] "Title"        "Year"         "Rating"       "Runtime"     
## [5] "Critic.Score" "Box.Office"

Problem 2: Missing values

  1. Count the missing values.
sum(is.na(movies))
## [1] 4
  1. Exclude rows with missing values.
movies <- na.omit(movies)
  1. Verify that Problem 2 is solved.
sum(is.na(movies))
## [1] 0

Problem 3: Units of measure in runtime column

  1. Peek at the movie runtime data.
head(movies$Runtime)
## [1] 98 min  155 min 39 min  82 min  99 min  118 min
## 114 Levels: 100 min 101 min 102 min 103 min 104 min 105 min ... 99 min
  1. Verify that we cannot perform mathematical operations on the column.
    NOTE: This next line will throw an error
mean(movies$Runtime)
## Warning in mean.default(movies$Runtime): argument is not numeric or
## logical: returning NA
## [1] NA
  1. Determine the data type.
class(movies$Runtime)
## [1] "factor"
  1. Cast from factor to a character string.
runtimes <- as.character(movies$Runtime)
  1. Verify the values are now character strings.
head(runtimes)
## [1] "98 min"  "155 min" "39 min"  "82 min"  "99 min"  "118 min"
class(runtimes)
## [1] "character"
  1. Eliminate the unit of measure.
runtimes <- sub(" min", "", runtimes)
  1. Verify the unit of measure has been eliminated.
head(runtimes)
## [1] "98"  "155" "39"  "82"  "99"  "118"
  1. Cast the character string to integer.
movies$Runtime <- as.integer(runtimes)
  1. Verify the values are now integers.
head(movies$Runtime)
## [1]  98 155  39  82  99 118
class(movies$Runtime)
## [1] "integer"
  1. Verify that we can now perform mathematical operations.
mean(movies$Runtime)
## [1] 104.4052

Problem 4: Box Office uses three units of measure

  1. Inspect the Box Office column.
head(movies$Box.Office)
## [1] $57.3M  $187.3M $13.4M  $135.6M $0.5M   $101.0M
## 1367 Levels: $0.1M $0.2M $0.3M $0.4M $0.5M $0.6M $0.7M $0.8M $0.9M ... $99.9k
  1. Create a function to convert box office revenue to a single unit of measure (i.e. millions of USD).
convertBoxOffice <- function(boxOffice)
{
  stringBoxOffice <- as.character(boxOffice)
  
  replacedBoxOffice <- gsub("[$|k|M]", "", stringBoxOffice)
  
  numericBoxOffice <- as.numeric(replacedBoxOffice)
  
  if (grepl("M", boxOffice)) {
    numericBoxOffice 
  } else if (grepl("k", boxOffice)){
    numericBoxOffice * 0.001 
  } else { 
    numericBoxOffice * 0.000001
  }
}
  1. Convert box office to single unit of measure.
movies$Box.Office <- sapply(movies$Box.Office, convertBoxOffice)
  1. Verify that Problem 4 has been solved.
head(movies$Box.Office)
## [1]  57.3 187.3  13.4 135.6   0.5 101.0
class(movies$Box.Office)
## [1] "numeric"
mean(movies$Box.Office)
## [1] 40.67558

Export the data

  1. Export the data to a CSV file.
write.csv(movies, "Movies2.csv")
  1. Verify the new CSV file exists in your C:/Workshop directory.

  2. Verify the file’s contents by opening the file.

Bonus Problem: Replace Joe’s Analysis Software

  1. Load the dplyr package
library(dplyr)
  1. Select columns
temp <- select(movies, Year, Rating, Box.Office)
  1. Filter rows
temp <- filter(temp, Year == 2014)
## Warning: package 'bindrcpp' was built under R version 3.4.1
  1. Create new revenue column ($B)
temp <- mutate(temp, Revenue = Box.Office / 1000)
  1. Group by rating
temp <- group_by(temp, Rating)
  1. Summarize by rating category
temp <- summarize(temp, Total.Revenue = sum(Revenue))
  1. Sort rows by revenue (descending)
temp <- arrange(temp, desc(Total.Revenue))
  1. Display results
print(temp)
## # A tibble: 4 x 2
##   Rating Total.Revenue
##   <fctr>         <dbl>
## 1  PG-13      4.207022
## 2      R      1.985307
## 3     PG      1.931452
## 4      G      0.159132
  1. Chain dplyr methods
report <- movies %>%
    select(Year, Rating, Box.Office) %>%
    filter(Year == 2014) %>%
    mutate(Revenue = Box.Office / 1000) %>%
    group_by(Rating) %>%
    summarize(Total.Revenue = sum(Revenue)) %>%
    arrange(desc(Total.Revenue)) %>%
    as.data.frame()
  1. Display results
print (report)
##   Rating Total.Revenue
## 1  PG-13      4.207022
## 2      R      1.985307
## 3     PG      1.931452
## 4      G      0.159132