Whenever I’m teaching introductory courses in data science using the R language, I often encounter students who use a different language like Python or Julia, and still others who are transitioning into data science from other fields and don’t know any data science language at all. The common thread is they need to get up to speed with R quickly to be productive, so I like to address their needs in a creative way. One technique I use is to find some commonality in terms of background and experience. I find that most newbie data scientists have some familiarity with SQL, so that’s a good starting point.
In this article, we’ll review a number of ways R has to access data sets (once stored in an R data frame) that are similar to the functionality found in SQL. If you already know SQL, it is useful to see the equivalent R way of doing things for manipulating data. To illustrate these techniques, we’ll use the CO2 data set that’s part of the base R system. Here is the structure and a short extract of the data set’s content:
> data(CO2)
> head(CO2)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
The small dataset has 84 observations along with 5 variables: Plant, Type, Treatment, conc, and uptake. Here is the SQL SELECT statement to retrieve rows from the data frame based on a filter condition:
SELECT * FROM CO2 WHERE conc>400 AND uptake>40
The R equivalent uses the following simple syntax:
CO2_subset <- CO2[CO2$conc>400 & CO2$uptake>40,]
head(CO2_subset)
Plant Type Treatment conc uptake
12 Qn2 Quebec nonchilled 500 40.6
13 Qn2 Quebec nonchilled 675 41.4
14 Qn2 Quebec nonchilled 1000 44.3
19 Qn3 Quebec nonchilled 500 42.94
20 Qn3 Quebec nonchilled 675 43.9
21 Qn3 Quebec nonchilled 1000 45.5
> dim(CO2_subset) # 8 observations selected
[1] 8 5
Now let’s consider an example of the ORDER BY clause of the SQL SELECT statement. Here, we wish to sort the result set by the variable conc in ascending sequence, and then by the variable uptake in descending sequence. Here is the SQL statement:
SELECT * FROM CO2 ORDER BY conc, uptake DESC
The R equivalent uses the following simple syntax which includes some additional R syntax to limit the result set to the first 20 records:
> CO2[order(CO2$conc, -CO2$uptake),][1:20,]
Plant Type Treatment conc uptake
15 Qn3 Quebec nonchilled 95 16.2
1 Qn1 Quebec nonchilled 95 16.0
36 Qc3 Quebec chilled 95 15.1
22 Qc1 Quebec chilled 95 14.2
8 Qn2 Quebec nonchilled 95 13.6
50 Mn2 Mississippi nonchilled 95 12.0
57 Mn3 Mississippi nonchilled 95 11.3
43 Mn1 Mississippi nonchilled 95 10.6
78 Mc3 Mississippi chilled 95 10.6
64 Mc1 Mississippi chilled 95 10.5
29 Qc2 Quebec chilled 95 9.3
71 Mc2 Mississippi chilled 95 7.7
16 Qn3 Quebec nonchilled 175 32.4
2 Qn1 Quebec nonchilled 175 30.4
9 Qn2 Quebec nonchilled 175 27.3
30 Qc2 Quebec chilled 175 27.3
23 Qc1 Quebec chilled 175 24.1
51 Mn2 Mississippi nonchilled 175 22.0
37 Qc3 Quebec chilled 175 21.0
58 Mn3 Mississippi nonchilled 175 19.4
Another powerful SQL construct is the GROUP BY clause used to compute aggregate values such as average. Continuing with our example, we’d like to calculate the average uptake value for each unique value of Plant. Here is the SQL statement to achieve this:
SELECT Plant, AVG(uptake) FROM CO2 GROUP BY Plant
The R equivalent uses the following syntax based on the aggregate() function. The first argument is x=CO2[,c(“uptake”)], which isolates the uptake column from the CO2 data frame. The second argument is by=data.frame(CO2$Plant), which is the grouping variable. Finally, the FUN=”mean” argument is the R function which is used to compute summary statistics, in this case, the mean.
> aggregate(x=CO2[,c(“uptake”)], by=data.frame(CO2$Plant), FUN=”mean”)
CO2.Plant x
1 Qn1 33.22857
2 Qn2 35.15714
3 Qn3 37.61429
4 Qc1 29.97143
5 Qc3 32.58571
6 Qc2 32.70000
7 Mn3 24.11429
8 Mn2 27.34286
9 Mn1 26.40000
10 Mc2 12.14286
11 Mc3 17.30000
12 Mc1 18.00000
The last example will be how to do a SQL JOIN using R. In this case, we’ll look up the country from a secondary table that maps states and provinces to countries. Here is the SQL statement to achieve this:
SELECT c.Type,
c.Plant,
c.Treatment,
c.conc,
c.uptake,
g.country
FROM geo_map g
LEFT JOIN CO2 c ON(c.Type = g.Type)
The R equivalent uses the following syntax based on the merge() function. Looking at the first several observations from the CO2 dataset, we see the variable Type, which contains the state or province of origin for the plant. We’ll use this variable as the common key value. Next, we’ll create a new data frame, geo_map, which plays the role of a look-up table in which the country is paired with the state/province. Then, after assigning appropriate column names for geo_map, we will use merge() to produce the equivalent of a SQL joined result set, the joinCO2 data frame. The new data frame contains a variable country, which is the lookup value based on Type. The R code has the same effect as the SQL JOIN.
> head(CO2)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
> stateprov <- c(“Mississippi”, “California”, “Victoria”, “New South Wales”, “Quebec”, “Ontario”)
> country <- c(“United States”, “United States”, “Australia”, “Australia”, “Canada”, “Canada”)
> geo_map <- data.frame(country=country, stateprov=stateprov)
> geo_map
country Type
1 United States Mississippi
2 United States California
3 Australia Victoria
4 Australia New South Wales
5 Canada Quebec
6 Canada Ontario
> colnames(geo_map) <- c(“country”, “Type”)
> joinCO2 <- merge(CO2, geo_map, by=c(“Type”))
> head(joinCO2)
Type Plant Treatment conc uptake country
1 Mississippi Mn1 nonchilled 95 10.6 United States
2 Mississippi Mn1 nonchilled 175 19.2 United States
3 Mississippi Mn1 nonchilled 250 26.2 United States
4 Mississippi Mn1 nonchilled 350 30.0 United States
5 Mississippi Mn1 nonchilled 500 30.9 United States
6 Mississippi Mn1 nonchilled 675 32.4 United States