How can I speed up my R code? Part 2

In my previous post, I looked at some simple first steps in speeding up your R code by learning habits of tidier, more R-friendly code.

It’s a good start, but it might not solve all your problems. Especially if you have…

Problem 2 – Big data

If you’re a lucky researcher and get handed mahooosive amounts of data to work with, and it’s all miraculously clean and ready to go, you’re probably itching to start drawing out some awesome conclusions from your insightful analyses!

You’ve read about vectorisation and you’ve sworn off for-loops, so that will speed up your code nicely.

There are probably still going to be several bottlenecks at various points in your analysis.

The problems start at line 1, when you try to load your data:

my_data <- read.csv("massive_file.csv")

Might as well go and make lunch!

Now you might think “ah, but I’ll just do it once at the start of the project and never have to touch it again, because R will keep it in the environment when I shut down!”.

Let me stop you there: that’s not best practice! If you want your work to be reproducible you should always aim to have everything you do saved in your scripts, not your environment:

  • You can commit your scripts to an online repository or archive
  • You can run the same analyses on any machine with the same starting dataset
  • You can always trace your steps again so you’re never lost
  • You can keep track of how your data has transformed, so you’re not introducing bugs
  • If you use several large datasets they don’t all have to hang around in your environment at once

So that means that every morning you should be able to open R afresh, with a clean environment, and run your data_import.R script to get everything loaded back in and ready to go again. In fact, if you’re really hardcore, you should be testing your code frequently by restarting R (Ctrl + Shift + F10) and running again from the beginning (Ctrl + Alt + B).
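
As a concrete sketch of what such a data_import.R could look like (the file name, columns and cleaning steps below are entirely made up for illustration):

# data_import.R - everything needed to rebuild the working data from the raw
# files, so a fresh R session only ever needs source("data_import.R")

my_data <- read.csv("massive_file.csv")       # raw data (placeholder file name)

# cleaning and derived columns live in here too, not in the console
my_data <- subset(my_data, !is.na(id))        # 'id' is a made-up column
my_data$year <- as.integer(my_data$year)      # another made-up cleaning step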

This becomes a problem when you have big data and big files to read in. There are always better, faster machines, of course, but before you run off to Carol in IT and beg for a fancy upgrade, there are a few tricks you can try first.

Solution 2 – Better tools

So say I have a huge data file called big_data.csv and I want to read it in. It has a million rows and 26 columns of data.
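
(If you want to follow along, you don’t need my exact file: something of a similar shape can be simulated. The sketch below is my guess at the structure, based on the outputs later in this post: 26 numeric columns named a to z, plus a group column of letters A to J.)

# Simulate a comparable big_data.csv: ~1 million rows, 26 numeric columns + group
set.seed(42)
n <- 1e6
sim <- as.data.frame(replicate(26, runif(n)))
names(sim) <- letters                                  # columns a to z
sim$group <- sample(LETTERS[1:10], n, replace = TRUE)  # groups A to J
write.csv(sim, "big_data.csv", row.names = FALSE)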

Let’s check its size and read it in:

file.info("big_data.csv")$size/1024^2
#> [1] 447.2663
# 447MB! Ooof, that's pretty massive!

t1 <- Sys.time()
my_data <- read.csv("big_data.csv")
cat("That took ", Sys.time() - t1, "minutes!")
#> That took  3.035497 minutes!
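
A small caveat on the timing code above: subtracting two Sys.time() values gives a difftime whose units are picked automatically, so the hard-coded “minutes” in the cat() call only happens to be right here. A slightly more robust habit is system.time(), which always reports elapsed time in seconds:

# system.time() prints user/system/elapsed timings, always in seconds
system.time(
  my_data <- read.csv("big_data.csv")
)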

That’s time enough to make a coffee, but frustratingly slow if I want to keep doing it to make sure my code runs well. And a real nuisance if that’s just one of many datasets (such as Year 1 of 10 or similar).

Are there ways of speeding this up? Yes of course!

This is probably a great time to look into, and get used to, some better tools designed for handling big data in R. A top one is data.table.

data.table is a whole ecosystem unto itself, and sits alongside base R and the tidyverse as three very different approaches and mindsets for using R. Until recently I’d avoided learning/using data.table, being a tidyverse fan myself, but it’s well worth getting into and getting the hang of, especially if you’re dealing with big data. In almost every use case involving big data, data.table is faster than both the tidyverse and base R.

Here’s a great introduction to data.table which walks you through all of the key data manipulation steps you’ll likely need. It’s a different syntax in lots of ways and worth learning if you’re going to be using it a lot. It only really makes a difference when you have a lot of data to munch through, and there’s a balance to be struck between time spent problem-solving new syntax, consistency and readability of style, and time saved in using the faster tools (although read on for a sneaky trick to use familiar code on a fast data.table).
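
The heart of that new syntax is the general form DT[i, j, by] - roughly “which rows, what to compute, grouped by what”. A tiny sketch with made-up data, just to show the shape of it:

library(data.table)

dt <- data.table(x = 1:4, grp = c("a", "a", "b", "b"))

dt[x > 2]                           # i:  filter rows
dt[, sum(x)]                        # j:  compute on columns
dt[, .(total = sum(x)), by = grp]   # by: compute per group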

But before diving in there, why would you learn it or use it? Let’s test out its speed. When importing data we use fread (“fast read”), which can read lots of common file types:

library(data.table)

t1 <- Sys.time()
my_data <- fread("big_data.csv")
cat("That took ", Sys.time() - t1, "seconds")
#> That took  3.238447 seconds

Wowza! That was quite a difference! Our data was all read in in a fraction of the time that read.csv took!

fread has read the data in as a data.table; printing the first three columns gives you an idea of how it’s displayed in R:

my_data[, 1:3]
#>                  a           b         c
#>       1: 0.9618882 0.796524509 0.5164788
#>       2: 0.6690679 0.586633077 0.5674156
#>       3: 0.1505604 0.839050524 0.2876428
#>       4: 0.8216251 0.095614468 0.5110775
#>       5: 0.4480956 0.386914796 0.1850743
#>      ---                                
#>  999996: 0.8216204 0.781533947 0.1490690
#>  999997: 0.3862336 0.639817097 0.1020972
#>  999998: 0.8502541 0.828530686 0.9069263
#>  999999: 0.9138978 0.098687139 0.4762799
#> 1000000: 0.7435147 0.008184642 0.1124607

The other handy thing about data.tables is that they’re also built for speedily doing things to big data, such as quick sums by group:

my_data[, lapply(.SD[,1:3], sum), by = group]
#>     group        a        b        c
#>  1:     G 49894.44 49865.78 49940.88
#>  2:     A 50396.46 50433.90 50493.61
#>  3:     E 50155.44 50064.91 50009.21
#>  4:     C 49938.98 49764.02 49741.81
#>  5:     F 50059.16 49721.46 50053.69
#>  6:     J 50030.64 50073.31 50031.77
#>  7:     H 50065.75 50138.02 49983.35
#>  8:     B 49670.46 49753.41 49687.94
#>  9:     D 49849.69 49912.20 49898.87
#> 10:     I 50126.60 50221.47 50026.92
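
(As an aside, the same aggregation is more commonly written with .SDcols, which controls which columns end up in .SD; it should give an identical result:)

# Equivalent, more idiomatic version: restrict .SD via .SDcols
my_data[, lapply(.SD, sum), by = group, .SDcols = c("a", "b", "c")]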

And joining with other data.tables is swift and relatively easy to learn.
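
For example, here’s a minimal sketch of a join using data.table’s on argument - the group_info lookup table is invented purely for illustration:

# A made-up lookup table: one label per group
group_info <- data.table(group = LETTERS[1:10],
                         label = paste("Group", LETTERS[1:10]))

# For each row of my_data, pull in the matching label from group_info
labelled <- group_info[my_data, on = "group"]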

But why should I learn new stuff?

Here’s a sneaky trick, though: if you’re used to using dplyr syntax from the tidyverse to manipulate your data and are struggling to get the hang of data.table syntax, you can get around this quickly by using the dtplyr package.

This package essentially ‘translates’ your code into data.table syntax, shows you what it’s done and produces the output for you:

library(tidyverse)
library(dtplyr)

my_data |> 
  group_by(group) |> 
  summarise(across(1:3, sum))
#> Source: local data table [10 x 4]
#> Call:   `_DT1`[, .(a = sum(a), b = sum(b), c = sum(c)), keyby = .(group)]
#> 
#>   group      a      b      c
#>   <chr>  <dbl>  <dbl>  <dbl>
#> 1 A     50396. 50434. 50494.
#> 2 B     49670. 49753. 49688.
#> 3 C     49939. 49764. 49742.
#> 4 D     49850. 49912. 49899.
#> 5 E     50155. 50065. 50009.
#> 6 F     50059. 49721. 50054.
#> # … with 4 more rows
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

That’s so easy it’s almost cheating! You can do all your manipulation using the syntax you’re used to, then turn the result into a tibble or data.table at the end (see the last line of the output) to benefit from all the speed. Or you can both cheat and not look like you’re cheating by copying the generated data.table code and running my_data[, .(a = sum(a), b = sum(b), c = sum(c)), keyby = .(group)] directly instead!
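
As a sketch of that full workflow (dtplyr is lazy, so nothing is actually computed until you collect the result; I’m assuming the columns are named a, b and c, as in the output above):

library(data.table)
library(dtplyr)
library(dplyr)

result <- my_data |>
  group_by(group) |>
  summarise(across(a:c, sum)) |>
  as.data.table()   # collecting here triggers the translated data.table call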

Clever eh? So you can make the most of your new data.table knowledge and start whizzing through your data today!

One extra tip – an even faster package: fst

Having said all that, an even nicer thing I came across recently is an even faster file-reading package called fst. It uses its own file format, and takes advantage of that format plus multi-threading to a) speed up reading/writing and b) make the files smaller:

library(fst)

write_fst(my_data, "big_data.fst")

file.info("big_data.fst")$size/1024^2
#> [1] 201.6315
# 201 MB - under half the size

t1 <- Sys.time()
my_data <- read_fst("big_data.fst", as.data.table = TRUE)
cat("That took ", Sys.time() - t1, "seconds!!")
#> That took  0.754739 seconds!!

That’s even faster than fread! The output here is a data.table as before, and can similarly be converted to tibbles and other things as needed.
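
If disk space matters more to you than write speed, it’s also worth knowing that write_fst has a compress argument (0 to 100, with 50 as the default, as far as I’m aware) that trades write time for smaller files:

# Higher compression: smaller file on disk, slower to write
write_fst(my_data, "big_data_smaller.fst", compress = 100)
file.info("big_data_smaller.fst")$size/1024^2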

Another fantastic feature of fst is that it doesn’t have to read in all your data every time! This is especially neat if you have a very large number of columns you don’t need, or a very long dataset where you only want a portion at a time. fst can read just the parts you need, rather than the usual route of reading everything in and chucking away the unnecessary bits:

library(fst)

# Read in selected columns
my_first_columns <- read_fst("big_data.fst", columns = c("a", "b", "c"),
                             as.data.table = TRUE)

my_first_columns
#>                  a         b         c
#>       1: 0.9158961 0.9034102 0.3263035
#>       2: 0.9518044 0.7306460 0.1395702
#>       3: 0.9257475 0.7572834 0.5020019
#>       4: 0.4330511 0.8438266 0.2768958
#>       5: 0.9174331 0.2129135 0.9292762
#>      ---                              
#>  999996: 0.4551166 0.7424223 0.1766266
#>  999997: 0.7887979 0.2182357 0.5705282
#>  999998: 0.7474662 0.2952766 0.8554208
#>  999999: 0.7975528 0.3795795 0.6409650
#> 1000000: 0.4271937 0.3870189 0.1662612

# Read in only first 100,000 rows
first_tenth <- read_fst("big_data.fst", from = 1, to = 100000,
                            as.data.table = TRUE)

first_tenth[,1:3]
#>                 a          b         c
#>      1: 0.9158961 0.90341024 0.3263035
#>      2: 0.9518044 0.73064605 0.1395702
#>      3: 0.9257475 0.75728336 0.5020019
#>      4: 0.4330511 0.84382660 0.2768958
#>      5: 0.9174331 0.21291353 0.9292762
#>     ---                               
#>  99996: 0.6838186 0.89080290 0.8677441
#>  99997: 0.1322674 0.10334003 0.2235338
#>  99998: 0.2311932 0.61540034 0.6702913
#>  99999: 0.6608369 0.07354918 0.8161556
#> 100000: 0.5163428 0.15634849 0.9247975

Very handy for quick processing of data if you only need certain parts at certain times. Your 3-minute read-all-in-at-once at the start of the day has turned into fraction-of-a-second snippets of just the bits you need for a tiny sub-analysis.
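
Relatedly, if you just want to peek at what’s inside an fst file before deciding which parts to read, the package has a metadata_fst() function that returns the number of rows and the column names/types without loading any of the data (check ?metadata_fst for the details in your version):

# Inspect the file's structure without reading the data itself
metadata_fst("big_data.fst")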

So, has that solved it?

That’s the top tip for dealing with Big Data – use Better tools such as data.table and fst to get things running more smoothly and cut frustrating waits.

You might still be banging your head on the desk though - your analyses are still taking ages to run! You’ve imported lots of fancy packages, set them running on your nicely shaped big data, but even then there’s something slowing you down in this uber-complex algorithm.

That’s where we might have to turn to even fancier methods in my next post…

Published: Feb 28, 2023