.SD in data.table in R

Question

What does .SD stand for? How is it helpful and when to use it?

According to some source, .SD is a data.table containing the subset of x's data for each group, excluding the group column(s).

It can be used when grouping by i, when grouping by by, keyed by, and ad hoc by.

Does that mean that the subset data.tables are held in memory for the next operation?

nirvana · Answer 1 · Apr 13, 2018

.SD stands for "Subset of Data.table". The leading dot has no special meaning; it only makes a clash with a user-defined column name unlikely.

Consider the following data.table:

library(data.table)
DT = data.table(a=rep(c("x","y","z"),each=2), b=c(1,3), p=1:6)
setkey(DT, b)
DT
#    a b p
# 1: x 1 1
# 2: y 1 3
# 3: z 1 5
# 4: x 3 2
# 5: y 3 4
# 6: z 3 6

Try the code below to see what .SD does:

DT[ , .SD[ , paste(a, p, sep="", collapse="_")], by=b]
#    b       V1
# 1: 1 x1_y3_z5
# 2: 3 x2_y4_z6

The by=b argument splits the original data.table into two sub-data.tables, one for each value of b

DT[ , print(.SD), by=b]
# 1st sub-data.table, called '.SD' while it's being operated on:
#    a p
# 1: x 1
# 2: y 3
# 3: z 5
# 2nd sub-data.table, called '.SD' while it's being operated on:
#    a p
# 1: x 2
# 2: y 4
# 3: z 6
# Final output, since print() doesn't return anything
# Empty data.table (0 rows) of 1 col: b
and operates on them in turn.

While it is operating on any one of the subsets, it lets you refer to the current sub-data.table through the symbol .SD.

So, you can access and operate on the columns very easily.
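As a rough sketch of what that looks like in practice (the table is rebuilt here so the snippet runs on its own, and the result names cols_per_group and sums are only illustrative), .SD can be inspected with names() or handed to lapply() to apply a function to each of its columns. The .SDcols argument restricts which columns .SD contains:

```r
library(data.table)

# Same table as in the answer, rebuilt so the snippet is self-contained
DT <- data.table(a = rep(c("x", "y", "z"), each = 2), b = c(1, 3), p = 1:6)
setkey(DT, b)

# Within each group, .SD holds every column except the grouping column b
cols_per_group <- DT[, .(cols = paste(names(.SD), collapse = ",")), by = b]
cols_per_group   # cols is "a,p" for both groups

# Restrict .SD to column p, then apply sum() to each column of .SD per group
sums <- DT[, lapply(.SD, sum), by = b, .SDcols = "p"]
sums             # p is 9 for b = 1 and 12 for b = 3
```

Without .SDcols, lapply(.SD, sum) would also try to sum the character column a and fail, so naming the columns of interest is usually the safer habit.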

data.table carries out the operation on every sub-data.table defined by the grouping column(s), "pastes" the results back together, and returns them as a single data.table.
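To make the split, apply, and combine steps concrete, the sketch below does the same job twice: once with .SD and by, and once by hand. The hand-written version is only an illustration of the idea, not how data.table works internally, and the names grouped and manual are made up for this example.

```r
library(data.table)

DT <- data.table(a = rep(c("x", "y", "z"), each = 2), b = c(1, 3), p = 1:6)
setkey(DT, b)

# Grouped form: for each value of b, keep the row of .SD with the largest p
grouped <- DT[, .SD[which.max(p)], by = b]

# Hand-written imitation of the same steps
manual <- rbindlist(lapply(unique(DT$b), function(g) {
  sd <- DT[b == g, .(a, p)]           # split: the columns .SD would hold for this group
  data.table(b = g, sd[which.max(p)]) # apply the operation, re-attach the group value
}))                                   # combine: stack the per-group results

grouped   # b = 1 gives a = "z", p = 5; b = 3 gives a = "z", p = 6
manual    # same rows and columns
```

The grouped form is shorter and lets data.table handle the bookkeeping, which is the main reason to reach for .SD rather than looping over groups yourself.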