TDM 10200: Project 8 — 2023
Motivation: We are going to take a step back and work towards building a beer recommendation system. As you know already, a key part of analysis is being able to write functions.
Context: We will continue to introduce functions and practice doing so!
Scope: python, functions, pandas
Dataset(s)
The following questions will use the following dataset(s):
-/anvil/projects/tdm/data/yelp/data/parquet/
Questions
Let’s first list all of the files in this folder .Helpful Hint
Details
ls /anvil/projects/tdm/data/yelp/data/parquet/
We want to load only two pandas
data frames to use in this project. We will not be working with the other data frames.
Don’t forget to import pandas!
users = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/users.parquet")
reviews = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/reviews.parquet")
You will likely need to use 10 cores for this project because the files for this week are very large. |
ONE
Typically we can assume that we surround ourselves with people with similar interests and we would have aligning tastes in restaurants and businesses.
Let’s check that by writing a function called get_friends_data
with the user_id
as an argument.
This new function should return a pandas DataFrame with the information in the users
DataFrame for each friend of user_id
.
In this function we want to add type hints
. We have added type hints
in most of our functions so far this spring, but now we want to pay attention to them.
These type hints
provide a formal way to indicate the type of arguments passed to a function in python.
You can learn more about python type hints here or this is another good site for information.
Go ahead and use type hints
into your function. Use one for the user_id
and one for the returned data.
These are the three examples that I tested in the videos:
get_friends_data("ntlvfPzc8eglqvk92iDIAw")
get_friends_data("AY-laIws3S7YXNl_f_D6rQ")
get_friends_data("xvu8G900tezTzbbfqmTKvA")
Helpful Hint
a type hint
for a string appears as str
in our function
get_friends_data(myuserid: str)
-
Code used to answer the question.
-
Result of code.
TWO
Next, we need to write a function called calculate_avg_business_stars
that accepts the business_id
and returns the average number of stars that the business has received.
Doing this we can use the groupby()
function. This is one of the most powerful functions for (more easily) working with data frames in python.
These are the three examples that I tested in the videos:
calculate_avg_business_stars('-MhfebM0QIsKt87iDN-FNw')
calculate_avg_business_stars('5JxlZaqCnk1MnbgRirs40Q')
calculate_avg_business_stars('faPVqws-x-5k2CQKDNtHxw')
Insider Information
-
groupby()- allows us group data according to categories and also can help us compile and summarize data easily.
-
Code used to answer the question.
-
Result of code.
THREE
Next up, we need to write a function called visualize_stars_over_time
. It needs to accept a business_id
as an argument. It should make a line plot that shows the average number of stars for each year that the business has reviews.
These are the three examples that I tested in the videos:
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw")
Helpful Hint
You will need to import matplotlib.pyplot as plt
FOUR
We are now going to add an argument to our function visualize_stars_over_time
that we wrote in question 3. This argument is called granularity
. Granularity should accept one of two strings, either "years"
or "months"
(don’t forget the double quotes!) that will help show the average rating over time.
These are the nine examples that I tested in the videos:
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw")
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "years")
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "months")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "years")
visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "months")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "years")
visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "months")
Insider Information
Granularity indicates how much data can be shown on a chart. It can expressed in units of time, it can be - "minute" - "hour" - "day" - "week" - "month" - "year".
-
Code used to answer the question
-
Result of the code
FIVE
Now we continue to modify the function visualize_stars_over_time
that we were working on, from questions 3 and 4. We want to add the ability to accept multiple business_ids, and create a line for each id. (It is OK to remove the functionality about "granularity" from Project 4; just make the plots with the yearly summaries, and do not worry about the monthly granularity. You can just remove anything about the granularity.)
This is the example that I tested in the videos:
visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "5JxlZaqCnk1MnbgRirs40Q", "faPVqws-x-5k2CQKDNtHxw")
-
Code used to answer the question
-
Result of the code
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |