STAT 29000: Project 11 — Spring 2022
Motivation: Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part of any data-driven project, and sometimes can take a great deal of time. tidyverse is a great, but opinionated, suite of integrated packages to wrangle, tidy, and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed; you may even find that you enjoy using them!
Context: We have covered a few topics on the tidyverse packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks. This is the second in a series of 5 projects focused on using tidyverse packages to solve data-driven problems.
Scope: R, tidyverse, ggplot
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/beer/beers.csv
- /depot/datamine/data/beer/reviews_sample.csv
Questions
Question 1
Let’s pick up where we left off in the previous project. Copy and paste your commands from questions 1 to 3 that result in our beers_reviews dataset.
Using pipes (remember the %>% operator), combine the necessary parts of questions 2 and 3, removing the need for an intermediate reviews_summary dataset. This is a great way to practice and get a better understanding of tidyverse.
Your code should read the datasets, summarize the reviews data similarly to what you did in question 2, and combine the summarized dataset with the beers dataset. This should all be accomplished from a single chunk of "piped-together" code.
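As a rough illustration of the "summarize, then join" pattern described above, here is a minimal sketch on toy data. The column names id, beer_id, and score, the use of a mean score as beer_goodness_indicator, and the choice of inner_join are all assumptions made for the example; use whatever columns, calculation, and join you settled on in the previous project.

```r
library(dplyr)

# Toy stand-ins for the two CSV files. All column names and values here
# are assumptions for illustration, not the real dataset.
beers <- tibble(
  id   = c(1, 2, 3),
  name = c("Alpha Ale", "Beta Bock", "Gamma Gose"),
  abv  = c(5.2, 7.1, 4.3)
)
reviews <- tibble(
  beer_id = c(1, 1, 2, 2, 2),
  score   = c(4.0, 3.0, 4.5, 3.5, 4.0)
)

# Summarize the reviews and join the result onto beers in one pipeline,
# so no intermediate reviews_summary object is needed.
beers_reviews <- reviews %>%
  group_by(beer_id) %>%
  summarize(
    beer_goodness_indicator = mean(score, na.rm = TRUE),  # assumed definition
    n_reviews = n()
  ) %>%
  inner_join(beers, by = c("beer_id" = "id"))

print(beers_reviews)
```

With the real files, the two tibble() calls would be replaced by calls that read the CSVs, and the first read could sit at the top of the same pipeline.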
Feel free to remove the
|
If you want to update how you calculated your |
- Code used to solve this problem.
- Output from running the code.
Question 2
Are there any differences in terms of abv between beers that are available in specific seasons?
ABV refers to the alcohol by volume of a beer. The higher the ABV, the more alcohol is in the beer.
- Filter the beers_reviews dataset to contain only beers available in a specific season (Fall, Winter, Spring, Summer).
- Make a side-by-side boxplot comparing abv for each season availability.
- Make sure to use the labs function to give the x-axis and y-axis descriptive labels.
Use pipelines, resulting in a single chunk of "piped-together" code.
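The three steps above (filter, boxplot, labels) can be chained as in the following sketch. It runs on a small made-up tibble; the availability values and the axis label text are assumptions, so check the actual values in your beers_reviews data before filtering.

```r
library(dplyr)
library(ggplot2)

# Made-up data for illustration only; the real availability values may differ.
beers_reviews <- tibble(
  availability = c("Fall", "Fall", "Winter", "Winter",
                   "Spring", "Summer", "Year-round", "Rotating"),
  abv = c(6.1, 5.8, 8.2, 7.5, 5.0, 4.6, 5.5, 6.0)
)

# Keep only the seasonal beers, then pipe straight into ggplot.
# Note that ggplot layers are combined with +, not %>%.
p <- beers_reviews %>%
  filter(availability %in% c("Fall", "Winter", "Spring", "Summer")) %>%
  ggplot(aes(x = availability, y = abv)) +
  geom_boxplot() +
  labs(x = "Season of availability", y = "Alcohol by volume (%)")
```

Printing p (or leaving it as the last expression in a notebook cell) renders the plot.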
Write 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?
Question 3
Modify your code from question 2 to:
- Create a new variable is_good that is 1 or TRUE if beer_goodness_indicator is greater than 3.5, and 0 or FALSE otherwise.
- Facet your boxplot based on is_good. The resulting graphic should make it easy to compare the "good" vs "bad" beers for each season.
facet_grid and facet_wrap are two other functions that can be a bit confusing at first. With that being said, they are incredibly powerful and make creating really impressive graphics very straightforward.
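One possible way to add the flag and facet on it is sketched below, again on made-up data. The label text and the choice of facet_wrap with label_both are illustrative assumptions; facet_grid would work as well.

```r
library(dplyr)
library(ggplot2)

# Made-up data for illustration only.
beers_reviews <- tibble(
  availability = c("Fall", "Fall", "Winter", "Winter",
                   "Spring", "Spring", "Summer", "Summer"),
  abv = c(6.1, 5.8, 8.2, 7.5, 5.0, 5.4, 4.6, 4.9),
  beer_goodness_indicator = c(3.9, 3.2, 4.1, 3.5, 3.6, 2.9, 3.4, 3.8)
)

p <- beers_reviews %>%
  # Strictly greater than 3.5, so a beer at exactly 3.5 is not "good".
  mutate(is_good = beer_goodness_indicator > 3.5) %>%
  ggplot(aes(x = availability, y = abv)) +
  geom_boxplot() +
  # One panel per value of is_good; label_both titles panels "is_good: TRUE/FALSE".
  facet_wrap(~ is_good, labeller = label_both) +
  labs(x = "Season of availability", y = "Alcohol by volume (%)")
```

Putting the seasons on the x-axis inside each panel keeps the season-to-season comparison within a panel, while comparing the two panels shows the "good" vs "bad" difference.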
Make sure to use piping.
How do beers differ in terms of ABV and being considered good or not (based on our definition) for the different seasons? Write 1-2 sentences describing what you see based on the plots.
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences answering the question.
Question 4
Modify your code from question 3 to answer the question based on summary statistics instead of graphical displays.
Make sure you compare the ABV per season availability and is_good using mean, median, and sd. Your final dataframe should have 8 rows and the following columns: is_good, availability, mean_abv, median_abv, std_abv.
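The grouped summary described above could look roughly like the following sketch. The toy data is invented so that each of the 8 groups has two beers (otherwise sd would return NA); the na.rm = TRUE arguments are an assumption about how you want to treat missing abv values.

```r
library(dplyr)

# Invented data: 4 seasons x 2 "good" + 2 "bad" beers each.
beers_reviews <- tibble(
  availability = rep(c("Fall", "Winter", "Spring", "Summer"), each = 4),
  beer_goodness_indicator = rep(c(4.0, 3.8, 3.0, 3.2), times = 4),
  abv = c(6.0, 7.0, 5.0, 5.5,
          8.0, 9.0, 7.0, 7.5,
          5.0, 5.4, 4.5, 4.7,
          4.6, 5.0, 4.0, 4.4)
)

abv_summary <- beers_reviews %>%
  filter(availability %in% c("Fall", "Winter", "Spring", "Summer")) %>%
  mutate(is_good = beer_goodness_indicator > 3.5) %>%
  group_by(is_good, availability) %>%
  summarize(
    mean_abv   = mean(abv, na.rm = TRUE),
    median_abv = median(abv, na.rm = TRUE),
    std_abv    = sd(abv, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

print(abv_summary)
```

Grouping by both is_good and availability is what yields one row per combination: 2 values of is_good times 4 seasons gives the 8 rows the question asks for.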
- Code used to solve this problem.
- Output from running the code.
Question 5
In this question, we want to make comparisons in terms of ABV and beer_goodness_indicator for US states.
Feel free to use whichever data-driven method you desire to answer this question! You can take summary statistics, make a variety of plots, and even filter to compare specific US states — you can even create new columns combining states (based on region, political affiliation, etc).
Write a question related to US states, ABV and our "beer_goodness_indicator". Use your data-driven method(s) to answer it (if only anecdotally).
- Code used to solve this problem.
- Output from running the code.
- Write 1-2 sentences explaining your question and data-driven method(s).
- Write 1-2 sentences answering your question.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.