Lab 3A: Descriptive Statistics (Easy)

  1. Load the CSV data files.
movies <- read.csv("Movies.csv")

genres <- read.csv("Genres.csv")
  1. Peek at the data.
head(movies)
##                   Title Year Rating Runtime Critic.Score Box.Office
## 1  The Whole Nine Yards 2000      R      98           45       57.3
## 2             Gladiator 2000      R     155           76      187.3
## 3      Cirque du Soleil 2000      G      39           45       13.4
## 4              Dinosaur 2000     PG      82           65      135.6
## 5     Big Momma's House 2000  PG-13      99           30        0.5
## 6 Gone in Sixty Seconds 2000  PG-13     118           24      101.0
head(genres)
##                  Title  Genre Year Rating Runtime Critic.Score Box.Office
## 1 The Whole Nine Yards  Crime 2000      R      98           45       57.3
## 2 The Whole Nine Yards Comedy 2000      R      98           45       57.3
## 3     Cirque du Soleil  Drama 2000      G      39           45       13.4
## 4     Cirque du Soleil Family 2000      G      39           45       13.4
## 5            Gladiator Action 2000      R     155           76      187.3
## 6            Gladiator  Drama 2000      R     155           76      187.3

Analyzing One Categorical Variable

  1. Create a frequency table of movies by rating category.
table(movies$Rating)
## 
##     G    PG PG-13     R 
##    93   497  1225  1423
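Counts are often easier to compare as proportions. A minimal sketch, re-entering the counts printed above by hand so it runs without Movies.csv; the names `rating_counts` and `rating_props` are introduced here for illustration only:

```r
# Counts copied from the table(movies$Rating) output above
rating_counts <- c(G = 93, PG = 497, "PG-13" = 1225, R = 1423)

# Divide each count by the total to get the share of all movies
rating_props <- prop.table(rating_counts)
round(rating_props, 3)
```

With the data loaded, the same result should come from `prop.table(table(movies$Rating))`.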

Analyzing One Numeric Variable

  1. Analyze measures of central tendency (i.e., location) for movie runtime.
mean(movies$Runtime)
## [1] 104.4052
median(movies$Runtime)
## [1] 101
  1. Analyze measures of dispersion (i.e., spread) for movie runtime.
min(movies$Runtime)
## [1] 38
max(movies$Runtime)
## [1] 219
range(movies$Runtime)
## [1]  38 219
diff(range(movies$Runtime))
## [1] 181
quantile(movies$Runtime)
##   0%  25%  50%  75% 100% 
##   38   93  101  113  219
quantile(movies$Runtime, 0.95)
## 95% 
## 135
IQR(movies$Runtime)
## [1] 20
var(movies$Runtime)
## [1] 284.4487
sd(movies$Runtime)
## [1] 16.86561
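Several of these measures are simple functions of one another. The sketch below re-enters values printed above (rather than reading Movies.csv) to show the relationships; the variable names are illustrative:

```r
# Values copied from the output above
q1 <- 93                      # 25% quantile of runtime
q3 <- 113                     # 75% quantile of runtime
runtime_var <- 284.4487       # var(movies$Runtime)

iqr_manual   <- q3 - q1       # interquartile range: Q3 minus Q1
sd_manual    <- sqrt(runtime_var)  # standard deviation: square root of variance
range_manual <- 219 - 38      # width of the range: max minus min

c(IQR = iqr_manual, SD = sd_manual, Range = range_manual)
```

Because the variance is in squared minutes, the standard deviation is usually the easier of the two to interpret.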
  1. Analyze measures of the shape of movie runtime.
library(moments)

skewness(movies$Runtime)
## [1] 1.007788
kurtosis(movies$Runtime)
## [1] 5.956355
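To see what these two numbers measure, here is a base-R sketch of the moment-based definitions on hypothetical toy runtimes (not taken from Movies.csv). It assumes the `moments` package uses these same definitions, with kurtosis reported as Pearson's kurtosis rather than excess kurtosis; check `?skewness` and `?kurtosis` to confirm.

```r
# Hypothetical toy runtimes in minutes, for illustration only
x <- c(88, 92, 95, 101, 104, 110, 118, 150)

n  <- length(x)
d  <- x - mean(x)        # deviations from the mean
m2 <- sum(d^2) / n       # second central moment (population variance)
m3 <- sum(d^3) / n       # third central moment
m4 <- sum(d^4) / n       # fourth central moment

skew <- m3 / m2^1.5      # positive values indicate a longer right tail
kurt <- m4 / m2^2        # about 3 for a normal distribution

c(skewness = skew, kurtosis = kurt)
```

Under that assumption, the runtime skewness of about 1.0 points to a right tail of long movies, and the kurtosis of about 6.0 points to heavier tails than a normal distribution.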
  1. Summarize a numeric variable (e.g., movie runtime).
summary(movies$Runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    38.0    93.0   101.0   104.4   113.0   219.0

Analyzing Two Categorical Variables

  1. Create a contingency table of movie counts by genre and rating.
table(genres$Genre, genres$Rating)
##              
##                 G  PG PG-13   R
##   Action        2  70   311 229
##   Adventure    44 179   209  64
##   Animation    43 111     8   6
##   Biography     0  27    73  93
##   Comedy       45 258   472 506
##   Crime         0   9   141 328
##   Documentary  27  73    78  65
##   Drama        12 136   586 836
##   Family       38 181    10   1
##   Fantasy       6  51   115  43
##   History       3  12    36  35
##   Horror        0   3    71 195
##   Music         5  31    81  59
##   Musical       0  11    20   6
##   Mystery       0   6   102 136
##   Sci-Fi        0   7   119  72
##   Sport         4  36    62  19
##   Thriller      0   2   167 324
##   War           1   0    19  31
##   Western       0   4     6  10
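Raw counts are hard to compare across genres of very different sizes. A minimal sketch of row proportions, using two rows copied by hand from the table above so it runs without Genres.csv; `counts` and `row_props` are illustrative names:

```r
# Two rows copied from the contingency table above
counts <- matrix(
  c(43, 111,  8,   6,     # Animation
     0,   3, 71, 195),    # Horror
  nrow = 2, byrow = TRUE,
  dimnames = list(Genre  = c("Animation", "Horror"),
                  Rating = c("G", "PG", "PG-13", "R"))
)

# margin = 1 divides each row by its row total, giving the
# distribution of ratings within each genre
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

With the data loaded, `prop.table(table(genres$Genre, genres$Rating), margin = 1)` should give the same kind of table for every genre.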

Analyzing Two Numeric Variables

  1. Analyze the correlation coefficient for runtime and box office.
cor(movies$Runtime, movies$Box.Office)
## [1] 0.347748
  1. Analyze the correlation coefficient for critic score and box office.
cor(movies$Critic.Score, movies$Box.Office)
## [1] 0.1608324
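By default, `cor()` returns Pearson's r, which is the covariance scaled by both standard deviations. The sketch below checks that on hypothetical toy values (not taken from Movies.csv) and also computes a rank-based Spearman correlation, which may be worth comparing when one variable is strongly skewed:

```r
# Hypothetical toy values, for illustration only
runtime    <- c(90, 95, 100, 110, 120, 140)
box_office <- c(5, 20, 12, 40, 35, 90)

# Pearson's r from its definition
r_manual  <- cov(runtime, box_office) / (sd(runtime) * sd(box_office))
r_builtin <- cor(runtime, box_office)
all.equal(r_manual, r_builtin)

# Rank-based alternative, less sensitive to extreme values
r_spearman <- cor(runtime, box_office, method = "spearman")
r_spearman
```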

Analyzing a Numeric Variable Grouped by a Categorical Variable

  1. Create a table of aggregate numeric values (e.g., average box office revenue) grouped by a categorical variable (e.g., rating category).
tapply(movies$Box.Office, movies$Rating, mean)
##        G       PG    PG-13        R 
## 55.47561 56.40439 54.56134 22.26118
  1. Create a table of average box office revenue grouped by genre.
tapply(genres$Box.Office, genres$Genre, mean)
##      Action   Adventure   Animation   Biography      Comedy       Crime 
##   76.530806  101.745110   96.603311   26.500308   40.860973   34.320142 
## Documentary       Drama      Family     Fantasy     History      Horror 
##    6.268575   24.740296   68.339200   93.251211   24.181583   27.932895 
##       Music     Musical     Mystery      Sci-Fi       Sport    Thriller 
##   21.978918   37.172776   40.328661   86.874763   27.739240   38.523364 
##         War     Western 
##   26.474298   36.146105
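`tapply()` accepts any summary function, not just `mean`. A minimal sketch on hypothetical toy data (not taken from Movies.csv) that puts the group size, mean, and median side by side; the data frame `toy` is introduced here for illustration:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  Rating     = c("G", "G", "PG", "PG", "PG", "R", "R"),
  Box.Office = c(10, 30, 5, 15, 100, 2, 4)
)

group_n      <- tapply(toy$Box.Office, toy$Rating, length)
group_mean   <- tapply(toy$Box.Office, toy$Rating, mean)
group_median <- tapply(toy$Box.Office, toy$Rating, median)

# Bind the three summaries into one table, one row per rating
cbind(n = group_n, mean = group_mean, median = group_median)
```

Note that genres holds one row per movie and genre pair, as the `head(genres)` output shows, so a movie with several genres contributes to each of its genre averages.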

Analyzing Many Variables

  1. Create a correlation matrix.
cor(movies[, 4:6])
##                Runtime Critic.Score Box.Office
## Runtime      1.0000000    0.1881713  0.3477480
## Critic.Score 0.1881713    1.0000000  0.1608324
## Box.Office   0.3477480    0.1608324  1.0000000
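Selecting columns by position (`4:6`) breaks silently if the column order changes. A sketch of selecting by name instead, on a hypothetical toy data frame that mirrors the column layout of movies; the values are invented for illustration:

```r
# Hypothetical toy data with the same column layout as movies
toy <- data.frame(
  Title        = c("A", "B", "C", "D", "E"),
  Year         = c(2000, 2001, 2002, 2003, 2004),
  Rating       = c("G", "PG", "R", "R", "PG-13"),
  Runtime      = c(90, 100, 110, 120, 95),
  Critic.Score = c(40, 55, 70, 65, 30),
  Box.Office   = c(10, 25, 60, 80, 5)
)

num_cols <- c("Runtime", "Critic.Score", "Box.Office")
by_name  <- cor(toy[, num_cols])   # select numeric columns by name
by_pos   <- cor(toy[, 4:6])        # select the same columns by position

identical(by_name, by_pos)
```

Both calls should give the same symmetric matrix, with ones on the diagonal because each variable correlates perfectly with itself.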
  1. Summarize an entire table.
summary(movies)
##                   Title           Year        Rating        Runtime     
##  Camp                :   2   Min.   :2000   G    :  93   Min.   : 38.0  
##  Frozen              :   2   1st Qu.:2004   PG   : 497   1st Qu.: 93.0  
##  The Other Woman     :   2   Median :2008   PG-13:1225   Median :101.0  
##  (500) Days of Summer:   1   Mean   :2008   R    :1423   Mean   :104.4  
##  (Untitled)          :   1   3rd Qu.:2011                3rd Qu.:113.0  
##  10 Items or Less    :   1   Max.   :2015                Max.   :219.0  
##  (Other)             :3229                                              
##   Critic.Score      Box.Office      
##  Min.   :  0.00   Min.   :  0.0002  
##  1st Qu.: 26.00   1st Qu.:  1.0000  
##  Median : 49.00   Median : 16.1000  
##  Mean   : 49.68   Mean   : 40.6756  
##  3rd Qu.: 74.00   3rd Qu.: 51.4750  
##  Max.   :100.00   Max.   :760.5000  
##