Data.table package

data.table is an extremely fast and memory-efficient package for transforming data. Many people use it when wrestling with a big dataset, to save time and memory. A data.table object also has the class data.frame, which is good news: functions that work with a data.frame generally work with a data.table too. data.table has an SQL-like query syntax. Its general form is:

dt[i, j, by]

i = the subset of rows to extract, based on a condition

j = the calculation to perform on that subset

by = the grouping variable used as the basis for aggregation. Very often it is a column or a vector of columns.
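To see the three parts working together, here is a minimal sketch on mtcars. The result name avg_mpg and the column name mean_mpg are just names chosen for this illustration.

```r
library(data.table)

dt <- data.table(mtcars)

# i:  keep only the 4-cylinder cars
# j:  compute the average mpg (the result column is named mean_mpg here)
# by: return one result row per value of gear
avg_mpg <- dt[cyl == 4, .(mean_mpg = mean(mpg)), by = gear]
print(avg_mpg)
```

Read it as: "take dt, subset the rows using i, then calculate j, grouped by by".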

Querying data with data.table

I will again use the mtcars dataset, which is included with base R, to present some queries using data.table. mtcars has 32 observations on 11 numeric variables.

library(data.table)

dt <- data.table(mtcars)
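One thing to be aware of: a data.table has no row names, so the car names stored as row names in mtcars are dropped by the conversion above. If you need them, a possible sketch is to use as.data.table() with keep.rownames. The object name dt_named and the column name "model" are chosen here for illustration only.

```r
library(data.table)

# keep.rownames = "model" copies the row names into a new column called "model"
dt_named <- as.data.table(mtcars, keep.rownames = "model")

# Show the first three cars with their names
head(dt_named[, .(model, mpg, cyl)], 3)
```

The rest of this post keeps using dt without the car names.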

Using the commands below, we check the class of dt and of its columns:

class(dt)

[1] "data.table" "data.frame"

sapply(dt, class)

      mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
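Because dt also has the class data.frame, ordinary data.frame functions should accept it. A small sketch of that idea follows. The variable names n_rows, col_means and fit are only for this example.

```r
library(data.table)

dt <- data.table(mtcars)

# dt passes the usual data.frame check
is.data.frame(dt)

# Base R functions written for data.frame work on dt as well
n_rows    <- nrow(dt)                 # number of rows
col_means <- colMeans(dt)             # mean of every column
fit       <- lm(mpg ~ wt, data = dt)  # linear model fitted on dt

print(n_rows)
print(round(col_means, 2))
print(coef(fit))
```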

dt[3, ] # row number 3

    mpg cyl disp hp drat   wt  qsec vs am gear carb
1: 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1

dt[, 3] # column number 3 (disp)

     disp
 1: 160.0
 2: 160.0
 3: 108.0
 4: 258.0
 5: 360.0
 6: 225.0
 7: 360.0
 8: 146.7
 9: 140.8
10: 167.6
11: 167.6
12: 275.8
13: 275.8
14: 275.8
15: 472.0
16: 460.0
17: 440.0
18:  78.7
19:  75.7
20:  71.1
21: 120.1
22: 318.0
23: 304.0
24: 350.0
25: 400.0
26:  79.0
27: 120.3
28:  95.1
29: 351.0
30: 145.0
31: 301.0
32: 121.0
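Selecting by position works, but in practice it is often safer to select columns by name. The sketch below shows three common forms and what each one returns. The names disp_vec, disp_dt and two_cols are made up for this example.

```r
library(data.table)

dt <- data.table(mtcars)

# A bare column name in j returns a plain numeric vector
disp_vec <- dt[, disp]

# Wrapping the name in .() returns a one-column data.table
disp_dt <- dt[, .(disp)]

# A character vector of names returns a data.table with those columns
two_cols <- dt[, c("mpg", "disp")]

str(disp_vec)
head(disp_dt, 3)
head(two_cols, 3)
```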

dt[cyl == 4, ] # rows where cyl equals 4

     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
 5: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
 6: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
 7: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
 8: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
 9: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
10: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
11: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

dt[cyl == 4 & gear > 4] # rows where cyl equals 4 and gear is greater than 4

    mpg cyl  disp  hp drat    wt qsec vs am gear carb
1: 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
2: 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
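If you only want to know how many rows match a condition, rather than see them, data.table offers the special symbol .N for use in j. Here is a short sketch. The names n_match and n_by_cyl are just for illustration.

```r
library(data.table)

dt <- data.table(mtcars)

# Count the rows where cyl equals 4 and gear is greater than 4
n_match <- dt[cyl == 4 & gear > 4, .N]
print(n_match)

# Combine .N with by to count the rows for each value of cyl
n_by_cyl <- dt[, .N, by = cyl]
print(n_by_cyl)
```

The first count should agree with the number of rows printed by the filter above.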

My journey through the data science - by Karolina M'Goma
