STAT 39000: Project 11 — Fall 2020
Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!
Scope: SQL, SQL in R
Dataset
The following questions will use the elections
database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:
/class/datamine/data/election/itcontYYYY.txt
(for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt
)
A public sample of the data can be found here:
www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt)
Questions
For this project you will need to connect to the database ````markdown
|
When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}
), or an R code chunk (with {r}
) and functions like dbGetQuery
as you did in Project 10. Please refer to Question 5 in the project template for examples.
Question 1
Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db
)? Use UNIX utilities you’ve learned about this semester to write a line of code to return the size of that .db file (in MB).
The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/
. How much data (in MB) is there?
The data in that folder has been added to the elections
database, all aggregated in the elections
table. Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table elections
?
These are some examples of how to get the sizes of collections of files in UNIX: |
The SQL query will take some time! Be patient. |
You may use more than one code chunk in your RMarkdown file for the different tasks. |
We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from |
-
Line of code (bash/awk) to show the size (in MB) of the lahman database file.
-
Approximate size of the lahman database in MB.
-
Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in
/class/datamine/data/election
. -
The size of the elections data in MB.
-
SQL query used to find the number of rows of data in the
elections
table in theelections
database. -
The number of rows in the
elections
table in theelections
database.
Question 2
Write a SQL query using the LIKE
command to find a unique list of zip_code
that start with "479".
Write another SQL query and answer: How many unique zip_code
are there that begin with "479"?
Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project. |
The first query returns a list of zip codes, and the second returns a count. |
Make sure you only select |
-
SQL queries used to answer the question.
-
The first 5 results from running the query.
Question 3
Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn’t read COUNT(*)
but rather Indiana Donations
.
You may enclose an alias’s name in quotation marks (single or double) when the name contains space. |
-
SQL query used to answer the question.
-
The result of the SQL query.
Question 4
Rewrite the query in (3) so the result is displayed like: IN: 1234567
. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567
. In other words, the state abbreviation should be dynamic, not static.
This video demonstrates how to use CONCAT in a MySQL query: |
Use CONCAT and aliasing to accomplish this. |
Remember, |
-
SQL query used to answer the question.
Question 5
In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes?
This video gives two examples of sub-queries: |
You can simply manually calculate the percent using the count in (2) and (5). |
-
SQL queries used to answer the question.
-
The percentage of donations from Indiana from `zip_code`s starting with "479".
Question 6
In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (transaction_amt
). Which 5 states gave the most donations (highest count)? Order you result from most to least.
In this video we demonstrate |
You may want to create an alias in order to sort. |
-
SQL query used to answer the question.
-
Which 5 states gave the most donations?
Question 7
Write a query that gets the number of donations, and sum of donations, by year, for Indiana. Create one or more graphics that highlights the year-by-year changes. Write a short 1-2 sentences explaining your graphic(s).
-
SQL query used to answer the question.
-
R code used to create your graphic(s).
-
1 or more graphics in png/jpeg format.
-
1-2 sentences summarizing your graphic(s).