subset
Basics
subset
is a function that helps you take subsets of data. By default, subset
removes NA rows.
subset does not perform any operation that can’t be accomplished by indexing.
|
With indexing, we would write something like:
grades[grades$year=="junior" | grades$sex=="M",]$grade
[1] 100 75 74 69 88 99 90 92
Using subset
, we can instead do this:
subset(grades, year=="junior" | sex=="M", select=grade)
grade 1 100 3 75 4 74 6 69 7 88 8 99 9 90 10 92
Unlike indexing, NA values will be removed by subset automatically:
grades$sex[8] <- NA
subset(grades, year=="junior" | sex=="M", select=grade)
grade 1 100 3 75 4 74 6 69 7 88 9 90 10 92
grades[grades$year=="junior" | grades$sex=="M",]$grade
[1] 100 75 74 69 88 NA 90 92
Examples
How can I make a subset of the 8451
data using 1 line of code?
Click to see solution
In the 84.51 data set, we look at the variables and dimensions:
myDF <- read.csv("/depot/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)
dim(myDF)
BASKET_NUM HSHD_NUM PURCHASE_ PRODUCT_NUM SPEND UNITS STORE_R WEEK_NUM YEAR <dbl> <dbl> <chr> <dbl> <dbl> <int> <chr> <int> <int> 24 1809 03-JAN-16 5817389 -1.50 -1 SOUTH 1 2016 24 1809 03-JAN-16 5829886 -1.50 -1 SOUTH 1 2016 34 1253 03-JAN-16 539501 2.19 1 EAST 1 2016 60 1595 03-JAN-16 5260099 0.99 1 WEST 1 2016 60 1595 03-JAN-16 4535660 2.50 2 WEST 1 2016 168 3393 03-JAN-16 5602916 4.50 1 SOUTH 1 2016 10625553 9
There are 10625553 rows and 9 columns.
We can use the subset
function to focus on only the purchases from the CENTRAL
store region, in the YEAR
2016. We can also pick which variables (columns) that we want to include in the new data frame.
The subset function knows which data set that we are working with, because we specify it as the first parameter in the subset function, so we don’t need to include myDF before each variable.
The structure of the subset function is as follows: subset(x, subset, select, drop=FALSE, …) .
The subset parameter describes the rows that we are interested in. (In particular, we specify the conditions that we want the rows to satisfy.)
The select parameter describes the columns that we are interested in. (We list the columns by their names, and we need to put each such column name in double quotes.)
|
myfocusedDF <- subset(myDF, subset=(STORE_R=="CENTRAL") & (YEAR==2016),
select=c("PURCHASE_","PRODUCT_NUM","SPEND","UNITS") )
myfocusedDF
This new data set has only 1246144 rows, i.e., about 12 percent of the purchases, as expected. It also has only the 4 columns that we specified in the subset
function.
dim(myfocusedDF)
1246144 4
How can I make a subset of the election
data using 1 line of code?
Click to see solution
Here is an example of how to use the subset
function with the data from the federal election campaign contributions from 2016:
library(data.table)
myDF <- fread("/depot/datamine/data/election/itcont2016.txt", sep="|")
dim(myDF)
20557796 21
mymidwestDF <- subset(myDF, subset=(STATE %in% c("IN","IL","OH","MI","WI")) & (TRANSACTION_AMT > 0),
select=c("NAME","CITY","STATE","TRANSACTION_AMT") )
We can use the subset
command to focus on the donations made from Midwest states, and limit our results to those donations that had positive TRANSACTION_AMT
values. We can extract interesting variables, e.g., the NAME
, CITY
, STATE
, and TRANSACTION_AMT
.
dim(mymidwestDF)
2435825 4
The resulting data frame has 2435825 rows.
tail( sort( tapply(mymidwestDF$TRANSACTION_AMT, mymidwestDF$NAME, sum) ) )
UIHLEIN, RICHARD E. JOBSOHIO ASSOCIATION OF REALTORS, NATIONAL 9216700 10000000 10234334 PRITZKER, JAY ROBERT UIHLEIN, RICHARD EYCHANER, FRED 10511348 11866100 37901658
From the data set, we can sum
the TRANSACTION_AMT
values, grouped according to the NAME
of the donor, and we find that "Fred Eychaner" was the top Midwesterner donor during the 2016 federal election campaigns.