STAT 19000: Project 9 — Spring 2022
Motivation: Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable.
Context: At this point in the semester, we have a solid grasp of the basics of Python, and are looking to build our skills by using pandas to solve data-driven problems.
Scope: Python, pandas
Dataset(s)
The questions below use the following dataset(s):
- /depot/datamine/data/disney/total.parquet
Questions
Question 1
Let’s start by reading in the cleaned-up, combined dataset. It is essentially the same result you produced through much of your processing in project 7.
How many rows of data are there for each ride?
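A minimal sketch of one way to count rows per ride, assuming the ride identifier lives in a column named ride_name (the name used in Question 3). The toy dataframe is invented purely so the snippet runs on its own; on the cluster you would load the real file with pd.read_parquet instead, as noted in the comment.

```python
import pandas as pd

# On the cluster, the real data would be loaded with something like:
#   df = pd.read_parquet("/depot/datamine/data/disney/total.parquet")
# The toy frame below is made up for illustration only.
df = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_b", "ride_a", "ride_b"],
    "SPOSTMIN": [10.0, None, 25.0, 15.0, None],
    "SACTMIN": [None, 12.0, None, None, 30.0],
})

# size() counts every row in each group, whether or not it contains nulls.
rows_per_ride = df.groupby("ride_name").size()
print(rows_per_ride)
```

Using size() rather than count() matters here: count() reports non-null values per column, which would undercount rows because every row is missing either SPOSTMIN or SACTMIN.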
- Code used to solve this problem.
- Output from running the code.
Question 2
Recall that a single row of data has a value for either SPOSTMIN or SACTMIN, but not both. How many rows of data are there in total? How many rows have a non-null SPOSTMIN? How many have a non-null SACTMIN? Create a new dataframe called reduced where:
- Each row has a value for both SPOSTMIN and SACTMIN. The value in the SPOSTMIN column is the SPOSTMIN value whose datetime is closest (in seconds) to the datetime shown for the SACTMIN value.
- There is a new column called time_diff that holds the difference (in seconds) between the datetime of the SACTMIN value and the datetime of its associated closest SPOSTMIN value.
This is the toughest question for this project, so it is OK if it takes you a bit more time to think of a solution.
Check out the Don’t worry too much about edge cases: as long as you are close, you will get full credit.
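One possible approach to the two requirements above is a nearest-key join with pandas.merge_asof. The sketch below is not the official solution. It assumes the timestamp column is named datetime, that matches should be made within the same ride_name, and that time_diff is the SACTMIN time minus the SPOSTMIN time. The toy data and the helper column posted_datetime are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration. The column name "datetime" is an assumption.
df = pd.DataFrame({
    "ride_name": ["ride_a"] * 4 + ["ride_b"] * 3,
    "datetime": pd.to_datetime([
        "2022-01-01 09:00:00", "2022-01-01 09:05:00",
        "2022-01-01 09:07:00", "2022-01-01 09:30:00",
        "2022-01-01 10:00:00", "2022-01-01 10:02:00",
        "2022-01-01 10:10:00",
    ]),
    "SPOSTMIN": [10.0, np.nan, 20.0, np.nan, 30.0, np.nan, 45.0],
    "SACTMIN": [np.nan, 12.0, np.nan, 18.0, np.nan, 33.0, np.nan],
})

# Total rows, then non-null counts for each wait-time column.
print(len(df), df["SPOSTMIN"].notna().sum(), df["SACTMIN"].notna().sum())

# Split into posted and actual observations. merge_asof requires both
# sides to be sorted by their join key.
posted = (
    df.loc[df["SPOSTMIN"].notna(), ["ride_name", "datetime", "SPOSTMIN"]]
    .rename(columns={"datetime": "posted_datetime"})  # hypothetical helper column
    .sort_values("posted_datetime")
)
actual = (
    df.loc[df["SACTMIN"].notna(), ["ride_name", "datetime", "SACTMIN"]]
    .sort_values("datetime")
)

# For each actual wait time, attach the posted wait time for the same ride
# whose timestamp is nearest (either earlier or later).
reduced = pd.merge_asof(
    actual,
    posted,
    left_on="datetime",
    right_on="posted_datetime",
    by="ride_name",
    direction="nearest",
)

# Signed difference in seconds: SACTMIN time minus matched SPOSTMIN time.
reduced["time_diff"] = (
    reduced["datetime"] - reduced["posted_datetime"]
).dt.total_seconds()
print(reduced)
```

Renaming the posted timestamp before the merge keeps both timestamps in the result, which is what makes time_diff straightforward to compute afterwards.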
- Code used to solve this problem.
- Output from running the code.
Question 3
How many fewer rows does reduced have than the original dataset? What does the time_diff column look like?
In project 7 you calculated the median SPOSTMIN and SACTMIN by ride_name. Perform the same operation on reduced. Are the SACTMIN and SPOSTMIN medians closer together or farther apart than in the uncleaned data from project 7?
Do you think that, overall, the data in reduced is close enough (by time) to be able to draw comparisons? Why or why not?
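A short sketch of the per-ride median comparison and a quick look at time_diff. The toy reduced frame below stands in for the result of Question 2 and its values are invented; the gap column is a hypothetical helper added only to make the comparison easy to read.

```python
import pandas as pd

# Toy stand-in for the reduced dataframe from Question 2 (values are made up).
reduced = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b"],
    "SPOSTMIN": [20.0, 30.0, 40.0, 10.0, 20.0],
    "SACTMIN": [15.0, 25.0, 50.0, 5.0, 15.0],
    "time_diff": [-120.0, 60.0, 5400.0, 30.0, -45.0],
})

# With the original dataframe available as df, the row difference would be
# len(df) - len(reduced).

# Median posted and actual wait times for each ride.
medians = reduced.groupby("ride_name")[["SPOSTMIN", "SACTMIN"]].median()

# Hypothetical helper column: how far apart the two medians are per ride.
medians["gap"] = (medians["SPOSTMIN"] - medians["SACTMIN"]).abs()
print(medians)

# Summary statistics give a first impression of how time_diff is distributed.
print(reduced["time_diff"].describe())
```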
- Code used to solve this problem.
- Output from running the code.
Question 4
Any observation where the absolute value of time_diff is greater than an hour is probably not very high quality. Remove those observations from reduced. How many rows are left in reduced?
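The filter can be sketched as a boolean mask on the absolute value of time_diff, assuming the column is stored in seconds as described in Question 2. The toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for reduced; time_diff is in seconds (values are made up).
reduced = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b"],
    "time_diff": [-120.0, 60.0, 5400.0, -3700.0, 3600.0],
})

one_hour = 60 * 60  # seconds

# Keep rows whose absolute time difference is at most one hour.
# Rows strictly greater than an hour (in either direction) are dropped.
reduced = reduced.loc[reduced["time_diff"].abs() <= one_hour]
print(len(reduced))
```

Taking the absolute value first means that matches an hour too early and an hour too late are treated the same way.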
Finally, explore the refined dataset, reduced, further. Write down a question you would like to have answered and what you think the answer will be, then do your best to use the dataset to answer your question.
Your analysis should include: a question, your hypothesis, at least 1 graphic, any and all code you used, and your conclusions. You will not be graded on whether or not you are correct, but rather the effort you put into your analysis. Any good effort including the requirements will receive full credit. Have fun!
- Code used to solve this problem.
- Output from running the code.
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.