Skip to main content

Data.table package


 Data.table is an extremely fast and memory efficient package for transforming data. Many people use it while struggling a big dataset to save some time and memory space. The second class of data.table is data.frame which is a good news because it means that functions that work with data.frame also work with data.table. Data.table has sql like query commands. It looks like this: 

dt[ i, j, by] 

i= subset (rows) to be extracted based on a condition 
j= calculation to be performed on the subset 
by= grouping parameter that serves as a base for aggregation. Very often it is column or a vector. 

 

Quering data with data.table 

I will use again  mtcars dataset which is included in your base R program to present some queries using data.table . Mtcars has 32 observations on 11 (numeric) variables. 

library(data.table) 
dt=data.table(mtcars) 

By using commends bellow we will check the class of dt and of its content: 

class(dt) 
[1] "data.table" "data.frame" 

sapply(dt,class) 
  mpg       cyl      disp        hp      drat        wt      qsec        vs       am    gear      carb   
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 

dt[3,] # subset of row nr 3
mpg cyl disp hp drat   wt  qsec vs am gear carb
1: 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1

dt[,3] #subset of column number 3
   disp
 1: 160.0
 2: 160.0
 3: 108.0
 4: 258.0
 5: 360.0
 6: 225.0
 7: 360.0
 8: 146.7
 9: 140.8
10: 167.6
11: 167.6
12: 275.8
13: 275.8
14: 275.8
15: 472.0
16: 460.0
17: 440.0
18:  78.7
19:  75.7
20:  71.1
21: 120.1
22: 318.0
23: 304.0
24: 350.0
25: 400.0
26:  79.0
27: 120.3
28:  95.1
29: 351.0
30: 145.0
31: 301.0
32: 121.0

dt[cyl==4,] #subset of rows where cyl=4
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
 5: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
 6: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
 7: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
 8: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
 9: 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
10: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
11: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

dt[cyl==4 & gear > 4] # subset of rows where cyl=4 and gear >4
   mpg cyl  disp  hp drat    wt qsec vs am gear carb
1: 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
2: 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2

 

Comments

Popular posts from this blog

Model Residuals in Time Series Data

Residuals are the indicator of the model quality. Based on Rob J Hyndman's book "Forecasting: Principles & Practice", residuals in forecasting is difference between observed value and its forecast based on all previous observations. Residuals are useful in checking whether a model has adequately captured the information in the data. All the patterns should be in the model, only randomness remains in the residuals. Therefore the ideal model has to be: uncorrelated has zero mean and useful properties are: constant variance  be normally distributed First I will activate some useful libraries we will be using. library(fpp) library(forecast) For our example I will use dowjones index as a data set. The idea will be to set up already well know simple models like: Mean Model, Naive model and Drift Model. In previous post I described  it more detailed. Next, knowing what attributes  the ideal model should  have we can check which one of those 3 are quite good or  def...

Random number generators, reproducibility and sampling with dplyr

Let's assume that you want to take some random observations from your data set. Dplyr helps you with the function sample_n(). To make your code reproducible you seed the ID of a “random” set of values. You need to indicate number of rows you want to extract and specify if the rows should be replaced or not. To show you how it works I will use again mtcars dataset which is included in your base R program. Let's see first six rows of this data frame.  library(dplyr) data("mtcars") head(mtcars)                    mpg cyl disp  hp drat    wt  qsec vs am gear carb Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 Datsun 710        22.8   4  108  93 3.85 2.320 18.61...

The Power of dplyr in R - part 3

Today I would like to present pipe operator which simplify our code and makes it more readable. As we can see all of the dplyr functions take a data frame (or tibble) as the first argument. Dplyr provides the %>% operator from magrittr that chains the functions so x %>% f(y) turns into f(x, y). Therefore  the result from one step is then “piped” into the next step. We will use pipe operator in further examples.  Additionally we will focus on grouping, ordering and summarising functions. As previously I will continue using mtcars dataset which is included in your R base program. count() #count the unique values of one or more variables   n()  n_distinct() #number of unique observation found in a category  group_by() # group by a column, allows to group operation in the “split-apply-combine" concept   library(dplyr) data("mtcars") head(mtcars)                    mpg cyl disp  hp drat...