7 High performance computing: data.table

Okay, let’s get started with this chapter on high performance computing using the R package data.table.

## [1] "Hello data.table"

If data.table is not yet installed, we install it via:

install.packages("data.table" , repos="http://cran.us.r-project.org")

We load data.table package and - to demonstrate the differences - also the tidyverse library.

library(tidyverse) 
library(data.table)

Data.table objects are similar to a data.frame object but with some extensions. Let us first load some data with the data.table file reader fread(), which is optimised for performance and creates a data.table object.

inpOlympic <- if (file.exists("data/olympic-games.csv")) {
   "data/olympic-games.csv"
} 


inpSpoti <- if (file.exists("data/spotify_charts_germany.csv")) {
   "data/spotify_charts_germany.csv"
} 

olympics <- fread(inpOlympic)
spotify <- fread(inpSpoti)

7.1 Motivation

The package data.table provides an enhanced data frame object called (big surprise: ) data.table and a different type of syntax for common types of data manipulations. In this sense it is similar to the tidyverse (especially the dplyr part of it). A data.table object has the following form DT[i,j,by] where i represent the rows, j the columns and by the grouping argument.

class(olympics)   # We see that a data.table inherits properties of data.frames

## [1] "data.table" "data.frame"

                  # and we can easily cast an data.frame object to data.table

matrix <- data.frame(c(1,2,3),c(4,5,6),c(7,8,9))
class(matrix)

## [1] "data.frame"

setDT(matrix)
class(matrix)

## [1] "data.table" "data.frame"

So what are the main differences between data.table and tidyverse? Why or when would we want to use one or the other

Performance: data.tables are heavily optimised for speed. If performance is crucial, then data.tables might be the better choice.
Syntax: data.table syntax is concise while tidyverse syntax is more expressive. Code sequences that span multiple lines are easier to read in case of tidyverse syntax due to the pipe (%>%) operator. Overall, the choice is a matter of personal preferences here.
Mutability: data.tables are mutable, i.e. they can be changed in place (by reference). If this is a desired feature, then we might opt for data.tables.
Quoted and unquoted names: in data.tables we usually have the option to use either quoted column names (which makes writing functions easier) or unquoted column names (which is convenient for exploration). In dplyr the default is unquoted column names and it is a bit trickier to use quoted column names.

Before we go into the details, let’s start with two motivating examples of differences between data.table and tidyverse syntax. First, we want to know the 5 sports with the highest average weight of players in the 2016 summer olympic games.

# data.table syntax is short
olympics[game=="2016 Summer", .(weight = mean(weight, na.rm=T)), sport][order(-weight)][1:5]

##         sport   weight
## 1: Basketball 87.86786
## 2: Water Polo 85.44961
## 3:   Handball 83.26912
## 4: Volleyball 80.29329
## 5:     Rowing 79.87226

# tidyverse syntax is more expressive and readable
olympics %>% 
  filter(game=="2016 Summer") %>%
  group_by(sport) %>%
  summarise(weight = mean(weight, na.rm=TRUE)) %>%
  arrange(-weight) %>%
  head()

Next, we want to change an existing column:

# data.tables are mutable objects, i.e. changes happen in-place
olympics[,age:=as.numeric(age)]   # age is coerced from character to numeric

# tidyverse's tibbles are immutable, i.e. we need to reassign to make permanent changes to the data
olympics <- olympics %>% mutate(age_numeric =as.numeric(age))

7.2 Data exploration

As mentioned before the datatype data.table has the following structure: dt[i,j,by], where i conditon on the rows, j select the colums and by defines the groups by which results are aggregated. We’ll start by setting conditions on the rows.

7.2.1 Subsetting rows

We can condition on the rows one time, which gives us all swimmers.

again, for repetition the code tidyverse syntax

swimmers <- olympics %>% filter(sport == "Swimming")

and in data.table syntax

swimmers <- olympics[sport == "Swimming"]
olympics[sport=="Swimming"]                # or without assinging operator

##               game city    sport                                    discipline
##     1: 1952 Summer <NA> Swimming           Swimming Men's 400 metres Freestyle
##     2: 1912 Summer <NA> Swimming        Swimming Men's 200 metres Breaststroke
##     3: 1912 Summer <NA> Swimming        Swimming Men's 400 metres Breaststroke
##     4: 1920 Summer <NA> Swimming        Swimming Men's 200 metres Breaststroke
##     5: 1920 Summer <NA> Swimming        Swimming Men's 400 metres Breaststroke
##    ---                                                                        
## 23191: 2000 Summer <NA> Swimming    Swimming Men's 4 x 100 metres Medley Relay
## 23192: 2004 Summer <NA> Swimming Swimming Men's 4 x 100 metres Freestyle Relay
## 23193: 1924 Summer <NA> Swimming        Swimming Men's 200 metres Breaststroke
## 23194: 1980 Summer <NA> Swimming           Swimming Men's 100 metres Butterfly
## 23195: 1980 Summer <NA> Swimming           Swimming Men's 200 metres Butterfly
##                                  athlete     country age height weight      bmi
##     1:  Einar Ferdinand ""Einari"" Aalto     Finland  26     NA     NA       NA
##     2:              Arvo Ossian Aaltonen     Finland  22     NA     NA       NA
##     3:              Arvo Ossian Aaltonen     Finland  22     NA     NA       NA
##     4:              Arvo Ossian Aaltonen     Finland  30     NA     NA       NA
##     5:              Arvo Ossian Aaltonen     Finland  30     NA     NA       NA
##    ---                                                                         
## 23191: Klaas Erik ""Klaas-Erik"" Zwering Netherlands  19    189     80 22.39579
## 23192: Klaas Erik ""Klaas-Erik"" Zwering Netherlands  23    189     80 22.39579
## 23193:             Marius Edmund Zwiller      France  18     NA     NA       NA
## 23194:        Bogusaw Stanisaw Zychowicz      Poland  19    189     80 22.39579
## 23195:        Bogusaw Stanisaw Zychowicz      Poland  19    189     80 22.39579
##        sex  medal
##     1:   M   <NA>
##     2:   M   <NA>
##     3:   M   <NA>
##     4:   M Bronze
##     5:   M Bronze
##    ---           
## 23191:   M   <NA>
## 23192:   M Silver
## 23193:   M   <NA>
## 23194:   M   <NA>
## 23195:   M   <NA>

And two or more times with the ‘&’ operator (‘|’ operator for the logical or),for example all gold medal winners in swimming.

swimmers_gold <- olympics[sport == "Swimming" & medal == "Gold"]


swimmers_gold_without_doping <- swimmers_gold[!country=="Russia" & country!= "China"]  #the `!` operator negates the logical expression

The i argument also allows for regular expressions

Michael_Phelps <- olympics[sport=="Swimming" & athlete %like% "Phelps"]

The columns can be treated like variables so the expression olympics$sport is not necessary here but would work too. In additon we can select for the row numbers:

tidyverse:

var <- swimmers %>% slice(1:5)

data.table

var <- swimmers[1:5]

This is especially useful after ordering the data:

height <- swimmers[order(height)]

or in decreasing order( here we pick the first element, so the tallest swimmer)

height <- swimmers[order(-height)][1:1]

The order function, called on data.table objects calls the internal forder() function which is optimised on such objects. Counting rows is possible with different aproaches:

usmedals <- olympics[sport == "Swimming" & !is.na(medal), length(medal)]

or with the .N approach

usmedals <- olympics[sport == "Swimming" & !is.na(medal), .N]
usmedals

## [1] 3048

As you see above we come now to the second argument of data.table objects:

7.2.2 Selecting Columns

Now we want to condition on the colums too. Here we can either write the column names as unquoted names inside a list. Note we can use the . as an abbreviation for list here.

swimmers_gold[, list(athlete, country, medal)][1:5]   # list with unquoted column names

##                        athlete       country medal
## 1: Rebecca ""Becky"" Adlington Great Britain  Gold
## 2: Rebecca ""Becky"" Adlington Great Britain  Gold
## 3:      Nathan Ghar-Jun Adrian United States  Gold
## 4:      Nathan Ghar-Jun Adrian United States  Gold
## 5:      Nathan Ghar-Jun Adrian United States  Gold

swimmers_gold[,    .(athlete,country, medal)][1:5]   # . is an abbreviation for lists

##                        athlete       country medal
## 1: Rebecca ""Becky"" Adlington Great Britain  Gold
## 2: Rebecca ""Becky"" Adlington Great Britain  Gold
## 3:      Nathan Ghar-Jun Adrian United States  Gold
## 4:      Nathan Ghar-Jun Adrian United States  Gold
## 5:      Nathan Ghar-Jun Adrian United States  Gold

Or we can write the quoted column names inside the c() function.

swimmers[,  c("athlete", "medal")][1:5] #

##                             athlete  medal
## 1: Einar Ferdinand ""Einari"" Aalto   <NA>
## 2:             Arvo Ossian Aaltonen   <NA>
## 3:             Arvo Ossian Aaltonen   <NA>
## 4:             Arvo Ossian Aaltonen Bronze
## 5:             Arvo Ossian Aaltonen Bronze

The tidyverse syntax for comparison was

swimmers %>% select("athlete", "medal")

This syntax also allows us to define the column in a separate step.

cols <- c("athlete", "medal")  # columns are defined in a separate step
swimmers_gold[, ..cols][1:5]        # note the .. syntax

##                        athlete medal
## 1: Rebecca ""Becky"" Adlington  Gold
## 2: Rebecca ""Becky"" Adlington  Gold
## 3:      Nathan Ghar-Jun Adrian  Gold
## 4:      Nathan Ghar-Jun Adrian  Gold
## 5:      Nathan Ghar-Jun Adrian  Gold

or using the ‘with’ statement

colums <- swimmers_gold[, cols, with = FALSE]
head(colums)

##                        athlete medal
## 1: Rebecca ""Becky"" Adlington  Gold
## 2: Rebecca ""Becky"" Adlington  Gold
## 3:      Nathan Ghar-Jun Adrian  Gold
## 4:      Nathan Ghar-Jun Adrian  Gold
## 5:      Nathan Ghar-Jun Adrian  Gold
## 6:      Nathan Ghar-Jun Adrian  Gold

and operations on the rows and colums at the same time:

var <- olympics[sport == "Swimming" & medal == "Gold", ..cols]

We can also execute functions on the columns e.g. to find out the average age of all olympians

olympics[, mean(age, na.rm=T)]

## [1] 25.45227

and last but not least we can directly plot within the j argument (even if the informative value is questionable here)

spotify[,plot(danceability,Streams)]

7.2.3 Grouping results

Now we’ll consider the third parameter of the data.table object DT[i, j, by]. Let’s count the competitions grouped by sports. (Hint: if we have a look at the data in teamsports, every athlete is listed so caution with this result.)

sports <- olympics[, .N, by = "sport"]

So using all three arguments we find all swimmers, and grouping they by the country

sw <- olympics[sport== "Swimming", .N,by = country]

grouping by more colums

sw <- olympics[sport== "Swimming", .N,by = .(country,discipline)]

If we want to order the data we have to change by to keyby

sw <- olympics[sport== "Swimming", .N,keyby = .(country,discipline)]

Chaining operations on the data.table object:

ordOlym <- olympics[,.N, by = .(country,game)][order(-game,country)]

7.2.4 Editing Data

When we want to edit a Data set of the class dataframe like

df = data.frame(name = c("D","a","t","a"), a = 1:4, b = 5:8, c = 9:12)
df

##   name a b  c
## 1    D 1 5  9
## 2    a 2 6 10
## 3    t 3 7 11
## 4    a 4 8 12

df$c <- 13:16
df

##   name a b  c
## 1    D 1 5 13
## 2    a 2 6 14
## 3    t 3 7 15
## 4    a 4 8 16

this is done via copying the data set and a bad performance. Datatable provides the ‘:=’ operator for an better performance. Here the Data object isn’t copied but edited by reference.

Adding a column

spotify[, ':='(duration_s=duration_ms/1000)]

this column is added ‘by reference’ which is done with a higher performance than adding a column to a data.frame object. In addition we do not have to assign the expression back to the variable. Edit an column

olympics[sex=='M', sex := 'Male']
olympics[sex=='F', sex := 'Female']

Deleting an column by reference

spotify[, duration_ms := NULL]

7.2.5 Side effects

Sometimes we’re in the situation that we want to work with a data.table object but not on the object itself but on a copy. For example we want to add again the column duration_ms by reference, then the variable spotify_cols change too.

spotify_cols <- names(spotify)   # Columnnames
spotify_cols

##  [1] "Position"         "Track.Name"       "Artist"           "Streams"         
##  [5] "date"             "danceability"     "energy"           "loudness"        
##  [9] "speechiness"      "acousticness"     "instrumentalness" "liveness"        
## [13] "valence"          "tempo"            "duration_s"

spotify[, ':='(duration_ms=duration_s*1000)]
spotify_cols

##  [1] "Position"         "Track.Name"       "Artist"           "Streams"         
##  [5] "date"             "danceability"     "energy"           "loudness"        
##  [9] "speechiness"      "acousticness"     "instrumentalness" "liveness"        
## [13] "valence"          "tempo"            "duration_s"       "duration_ms"

If we want to prevent these side effect we have to work on a copy. So again:

spotify[, duration_ms := NULL]   # deleting by reference
spotify_cols

##  [1] "Position"         "Track.Name"       "Artist"           "Streams"         
##  [5] "date"             "danceability"     "energy"           "loudness"        
##  [9] "speechiness"      "acousticness"     "instrumentalness" "liveness"        
## [13] "valence"          "tempo"            "duration_s"

spotify_cols <- copy(names(spotify))
spotify[, ':='(duration_ms=duration_s*1000)]
spotify_cols

##  [1] "Position"         "Track.Name"       "Artist"           "Streams"         
##  [5] "date"             "danceability"     "energy"           "loudness"        
##  [9] "speechiness"      "acousticness"     "instrumentalness" "liveness"        
## [13] "valence"          "tempo"            "duration_s"

7.3 Runtime comparision

Let us now compare the runtime of some expression made in data.table and tidyverse. Performance differs depending on the type of operation, and will likely also depend on the size of the data set. Here we test separately sorting, filtering, and grouping/aggregating for our data set of about 270.000 observations.

install.packages("microbenchmark")

library(microbenchmark)              # package for performance comparisons

## Warning: Paket 'microbenchmark' wurde unter R Version 3.6.3 erstellt

olympics_tbl <- as_tibble(olympics)  # create a tibble to compare performance

# Sorting
microbenchmark(
  olympics_tbl %>% arrange(age),     # dplyr
  olympics[order(age)],              # data.table
  times = 5                          # run 5 time to get more robust measurements
)

## Unit: milliseconds
##                           expr     min      lq      mean   median       uq
##  olympics_tbl %>% arrange(age) 96.7030 99.4314 117.22992 115.6790 135.1999
##           olympics[order(age)] 34.6687 46.2439  54.21128  47.6448  49.4354
##       max neval
##  139.1363     5
##   93.0636     5

# Filtering
microbenchmark(
  olympics_tbl %>% filter(height > 160),   # dplyr
  olympics[height > 160],                  # data.table
  times = 5
)

## Unit: milliseconds
##                                   expr     min      lq     mean  median     uq
##  olympics_tbl %>% filter(height > 160) 14.3292 15.7076 18.77070 17.0241 20.216
##                 olympics[height > 160] 12.0343 22.4867 35.87588 22.5284 25.425
##      max neval
##  26.5766     5
##  96.9050     5

# Grouping
microbenchmark(
  olympics_tbl %>% group_by(athlete) %>% summarise(mean(height, na.rm=T)),  # dplyr
  olympics[, .(mean(height, na.rm=T)), keyby=athlete],                      # data.table
  times = 5
)

## Unit: milliseconds
##                                                                            expr
##  olympics_tbl %>% group_by(athlete) %>% summarise(mean(height,      na.rm = T))
##                         olympics[, .(mean(height, na.rm = T)), keyby = athlete]
##        min        lq     mean    median        uq       max neval
##  2598.6271 2670.7910 3037.267 3115.6672 3120.4755 3680.7744     5
##   418.4699  439.0937  463.506  445.1356  494.1703  520.6606     5

7.4 Further resources

Here are links to further resources on the package data.table: