Saturday, January 6, 2018

Comparing different serialization methods in R

Introduction


In this article I would like to compare different approaches to data serialization available in R. The comparison covers two aspects: serialization / deserialization performance and compactness of the resulting files on disk. The analysis is performed on data.table objects, since these are the objects I most often need to serialize and deserialize in my practice.

The following approaches are reviewed:
  • Functions saveRDS / readRDS:
    • They support all R object types and provide as-is serialization / deserialization (with possible caveats for reference objects).
    • They support both compressed and uncompressed storage.
    • Essentially, this is a dump of the object's in-memory representation, so unfortunately it is an R-only serialization format.
  • Package feather:
    • This is a fast and language-agnostic alternative to the RDS format.
    • It uses a column-oriented file format (based on Apache Arrow and the FlatBuffers library).
    • The format is open source and is supported in both R & Python.
  • Package fst:
    • This is another alternative to the RDS & Feather formats which can be used for fast data frame serialization.
    • It supports compression using the LZ4 and ZSTD algorithms.
    • The big advantage of this approach is that it provides full random access to rows & columns of the stored data (see the sketch after this list).
  • Package RProtoBuf:
    • This is the R interface to the Protocol Buffers serialization format developed by Google.
    • Usually this approach is used to serialize relatively small structured objects, but it is interesting to see how it deals with data table serialization in R.
  • Functions write.csv & read.csv:
    • These are the standard R functions for storing & reading data frames in the text-based CSV format.
    • This approach applies naturally only to data frame objects, but I've included it in the comparison, since most objects which I need to serialize in my practice are data tables.
  • Functions fwrite & fread from the data.table package:
    • This is another approach for storing & reading data table objects.
    • These functions are much more optimized than the standard ones above, so it is interesting to compare them.
  • Package RSQLite:
    • This package provides an R interface to the SQLite embedded database engine.
    • Although it may be overkill to use such an approach for simple data table serialization, I've included this package in the comparison for the sake of completeness.
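
To illustrate the random access mentioned for fst, here is a minimal sketch (my own example, not part of the benchmark below):

library(fst)

# Write a sample frame, then read back only one column and a slice of rows,
# without scanning the whole file:
df = data.frame(a = 1:1e6, b = runif(1e6))
path = tempfile()
write.fst(df, path, compress = 50)
df_part = read.fst(path, columns = "b", from = 1000, to = 2000)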


Approach

I've measured the following metrics:
  • Average write and read time. Performance is measured using the microbenchmark package.
  • Compression ratio: the file size, measured with file.size(), divided by the object size in R memory, measured with object.size().

The measurements are done for two data table sizes:
  • Small: a ~2.5 MB data table.
  • Large: a ~2.5 GB data table.
The hard drive installed on my PC is an Intel 540s Series SSD. A minimal illustration of how these metrics are collected follows.
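
Here is a minimal, self-contained sketch of collecting both metrics for a single method (my own illustration; the actual benchmark code is in the "Code" section below):

library(microbenchmark)

df = data.frame(a = 1:1e6, b = runif(1e6))
path = tempfile()

# Average write time in milliseconds over 15 runs:
mb = microbenchmark(saveRDS(df, path, compress = FALSE), times = 15L, unit = "ms")
summary(mb)$mean

# Compression ratio: size on disk relative to size in R memory:
file.size(path) / as.numeric(object.size(df))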

Results

Here are the measurement results for the small data table:


Method                  Write Time (ms)   Read Time (ms)   File Size (bytes)   Object Size (bytes)   Compression Ratio (file/object)
---------------------   ---------------   --------------   -----------------   -------------------   -------------------------------
fst (0% compressed)                4.43            14.11           2,401,527             2,410,512                            0.9963
fst (50% compressed)               6.23            25.44           1,814,420             2,410,512                            0.7527
fst (100% compressed)             15.10            15.67           1,684,412             2,410,512                            0.6988
RDS (compressed)                 315.58            36.74           1,891,996             2,410,512                            0.7849
RDS (uncompressed)                 7.60            24.08           2,402,000             2,410,512                            0.9965
feather                            4.97             3.07           2,404,192             2,410,512                            0.9974
fread/fwrite                      25.49            43.54           4,143,627             2,410,512                            1.7190
write.csv                        596.58           771.41           4,143,627             2,410,512                            1.7190
protobuf                          74.56            62.62           1,818,676             2,410,512                            0.7545
sqllite                           92.98           154.77           2,576,384             2,410,512                            1.0688

Here are the measurement results for the large data table:



Method                  Write Time (ms)   Read Time (ms)   File Size (bytes)   Object Size (bytes)   Compression Ratio (file/object)
---------------------   ---------------   --------------   -----------------   -------------------   -------------------------------
fst (0% compressed)              17,956            4,914       2,400,001,527         2,400,010,512                            1.0000
fst (50% compressed)             15,347            4,968       1,811,710,706         2,400,010,512                            0.7549
fst (100% compressed)            12,731            7,145       1,680,028,768         2,400,010,512                            0.7000
RDS (compressed)                276,159           25,520       1,891,565,116         2,400,010,512                            0.7881
RDS (uncompressed)               20,310            6,690       2,400,002,000         2,400,010,512                            1.0000
feather                          20,480           32,886       2,400,004,192         2,400,010,512                            1.0000
fread/fwrite                     51,513           48,384       4,143,341,213         2,400,010,512                            1.7264
write.csv                       529,811          669,340       4,143,341,213         2,400,010,512                            1.7264
protobuf                         71,640              N/A       1,833,994,768         2,400,010,512                            0.7642
sqllite                          55,534          133,963       2,567,032,832         2,400,010,512                            1.0696

And, finally, some observations:
  • As expected, the fst package shows the best performance on the large data table. On the small data table, feather's performance is a little better.
  • What's interesting is that the performance of uncompressed RDS is quite good. On the large data table it is comparable with feather for writing and even better for reading. It is also notable that the files saved by feather and by uncompressed RDS are almost the same size. This can probably be explained by the fact that R internally stores data tables in a columnar fashion, as a list of column vectors (see the quick check after this list).
  • Another observation is that the compressed RDS approach shows quite poor write performance. A possible explanation is that my SSD is not the performance bottleneck: it is fast enough that writing the data directly is cheaper than spending CPU time on compression.
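
A quick check of the columnar-representation point (my own illustration, not from the benchmark):

library(data.table)

dt = data.table(a = 1:3, b = c(1.5, 2.5, 3.5))

# A data.table, like a data.frame, is internally a list of column vectors,
# which is why columnar formats such as feather and fst map onto it so directly:
is.list(dt)        # TRUE
sapply(dt, typeof) # "integer" "double" -- one contiguous vector per column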

Code


Here is the preparatory code (loading the required libraries, generating a sample data table, assigning file names):


library(data.table)
library(ggplot2)
library(microbenchmark)

library(fst)
library(feather)
library(RProtoBuf)
library(RSQLite)

generateDataSet = function(nrow, ncol) {

 require(data.table)

 # Build ncol integer columns of uniform random values in [0, 1e6)
 # and combine them into a single data.table (columns named V1..Vncol):
 as.data.table(do.call(cbind, sapply(seq_len(ncol), function(i) {
  as.integer(runif(n = nrow, min = 0, max = 1e6))
 }, simplify = F)))

}

x = generateDataSet(1e4, 60) ## small table; use generateDataSet(1e7, 60) for the large one

# One temporary file per serialization method:
fst_file_0 = tempfile()
fst_file_50 = tempfile()
fst_file_100 = tempfile()
rds_file_comp = tempfile()
rds_file_uncomp = tempfile()
feather_file = tempfile()
fcsv_file = tempfile()
csv_file = tempfile()
protobuf_file = tempfile()
sqllite_file = tempfile()


Here is the write & read stats measurement code:


write_stats = microbenchmark(
 list = list(
  fst_0 = bquote({ write.fst(x, fst_file_0, compress = 0) }),
  fst_50 = bquote({ write.fst(x, fst_file_50, compress = 50) }),
  fst_100 = bquote({ write.fst(x, fst_file_100, compress = 100) }),
  rds_compressed = bquote({ saveRDS(x, rds_file_comp, compress = T) }),
  rds_uncompressed = bquote({ saveRDS(x, rds_file_uncomp, compress = F) }),
  feather = bquote({ write_feather(x, feather_file) }),
  fcsv = bquote({ fwrite(x, fcsv_file) }),
  csv = bquote({ write.csv(x, csv_file, row.names = F, quote = F) }),
  protobuf = bquote({ pbf = file(protobuf_file, open='w+b'); serialize_pb(x, pbf); close(pbf); }),
  sqllite = bquote({ con <- dbConnect(RSQLite::SQLite(), sqllite_file); dbWriteTable(con, "x", x, overwrite = T); dbDisconnect(con); })
 ),
 times = 15L
, unit = "ms")

read_stats = microbenchmark(
 list = list(
  fst_0 = bquote({ x2 = read.fst(fst_file_0) }),
  fst_50 = bquote({ x2 = read.fst(fst_file_50) }),
  fst_100 = bquote({ x2 = read.fst(fst_file_100) }),
  rds_compressed = bquote({ x2 = readRDS(rds_file_comp) }),
  rds_uncompressed = bquote({ x2 = readRDS(rds_file_uncomp) }),
  feather = bquote({ x2 = read_feather(feather_file) }),
  fcsv = bquote({ x2 = fread(fcsv_file) }),
  csv = bquote({ x2 = read.csv(csv_file) }),
  protobuf = bquote({ pbf = file(protobuf_file, open='rb'); x2 = unserialize_pb(pbf); close(pbf); }),
  sqllite = bquote({ con <- dbConnect(RSQLite::SQLite(), sqllite_file); x2 = dbReadTable(con, "x"); dbDisconnect(con); })
 ),
 times = 15L
, unit = "ms")

file_stats = rbindlist(list(
 data.table(method = "fst_0", file_size = file.size(fst_file_0)),
 data.table(method = "fst_50", file_size = file.size(fst_file_50)),
 data.table(method = "fst_100", file_size = file.size(fst_file_100)),
 data.table(method = "rds_compressed", file_size = file.size(rds_file_comp)),
 data.table(method = "rds_uncompressed", file_size = file.size(rds_file_uncomp)),
 data.table(method = "feather", file_size = file.size(feather_file)),
 data.table(method = "fcsv", file_size = file.size(fcsv_file)),
 data.table(method = "csv", file_size = file.size(csv_file)),
 data.table(method = "protobuf", file_size = file.size(protobuf_file)),
 data.table(method = "sqllite", file_size = file.size(sqllite_file))
))


Here is the code which generates the summary table:


write_stats = as.data.table(summary(write_stats))
setkeyv(write_stats, "expr")

read_stats = as.data.table(summary(read_stats))
setkeyv(read_stats, "expr")

file_stats[, write_time := write_stats[J(method), mean]]
file_stats[, read_time := read_stats[J(method), mean]]

file_stats[, object_size := as.numeric(object.size(x))] # as.numeric: the large object's size overflows R's integer range
file_stats[, file_obj_ratio := file_size / object_size]

file_stats
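
The preparatory code loads ggplot2 but never uses it in the listings above; here is a minimal, hypothetical sketch of how the collected stats could be visualized (my addition, not from the original post):

plot_data = melt(file_stats, id.vars = "method",
                 measure.vars = c("write_time", "read_time"),
                 variable.name = "operation", value.name = "time_ms")

# Dodged bars: write vs. read time per method, flipped for readable labels:
ggplot(plot_data, aes(x = method, y = time_ms, fill = operation)) +
 geom_col(position = "dodge") +
 coord_flip() +
 labs(x = NULL, y = "time, ms", fill = NULL)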

