Introduction
In this article I would like to compare different approaches to data serialization available in R. The comparison covers two aspects: serialization/deserialization performance and the compactness of the resulting files on disk. I perform the analysis on data table objects, since these are the objects I most often need to serialize and deserialize in my practice.
The following approaches are reviewed:
- Functions saveRDS / readRDS:
- This is the standard serialization mechanism available in R.
- It supports all R object types and provides as-is serialization/deserialization (with possible nuances for custom reference objects).
- It supports compressed and uncompressed data storage.
- Essentially, it is a dump of the object's in-memory representation, so unfortunately it is an R-only serialization format.
- Package feather:
- This is a fast and language-agnostic alternative to the RDS format.
- It uses a column-oriented file format (based on Apache Arrow and the Flatbuffers library).
- The format is open source and is supported in both R and Python.
- Package fst:
- This is another alternative to the RDS and Feather formats, designed for fast serialization of data frames.
- It supports compression using the LZ4 and ZSTD algorithms.
- The big advantage of this approach is that it provides full random access to the rows and columns of the stored data (see the sketch after this list).
- Package RProtoBuf:
- This is the R interface to the Protocol Buffers serialization format developed by Google.
- Usually this approach is used for serializing relatively small structured objects, but it is interesting to see how it deals with data table serialization in R.
- Functions write.csv & read.csv:
- These are the standard R functions for writing and reading data frames in the text-based CSV format.
- This approach is easily applicable only to data frame objects, but I've included it in the comparison, since most objects I need to serialize in practice are data tables.
- Functions fwrite & fread from data.table package:
- This is another approach to writing and reading data table objects in CSV format.
- These functions are much more heavily optimized than the standard ones above, so it is worth comparing them.
- Package RSQLite:
- This package provides an R interface to the SQLite embedded database engine.
- Although it may be overkill to use such an approach for simple data table serialization, I've included this package in the comparison for the sake of completeness.
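The random-access feature of fst deserves a quick illustration. Here is a minimal sketch (not part of the benchmark code; the table and file names are hypothetical) showing how a subset of rows and columns can be read without touching the rest of the file:

```r
library(fst)
library(data.table)

# Hypothetical example table, written with the same API the benchmark uses.
dt = data.table(id = 1:1e6, a = runif(1e6), b = runif(1e6))
f = tempfile()
write.fst(dt, f)

# Read only column "b" of rows 1000..1999; fst seeks directly to that
# slice on disk instead of deserializing the whole file.
chunk = read.fst(f, columns = "b", from = 1000, to = 1999)
nrow(chunk)  # 1000
```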
Approach
I've measured the following metrics (a minimal measurement sketch follows the lists below):
- Average write and read time. Performance is measured using the microbenchmark package.
- Compression ratio: file size, measured with the file.size() function, relative to the object size in R memory, measured with the object.size() function.
The measurements are done for two data tables of different sizes:
- Small: ~2.5 MB data table.
- Large: ~2.5 GB data table.
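To make the methodology concrete, here is a minimal sketch of how a single method is measured under this scheme (illustrative only; the full benchmark code is in the Code section below):

```r
library(microbenchmark)
library(fst)

x = data.frame(a = runif(1e5))  # hypothetical stand-in for the benchmark table
f = tempfile()

# Average write time across repeated runs, reported in milliseconds.
wt = microbenchmark(write.fst(x, f), times = 15L, unit = "ms")
summary(wt)$mean

# Compression ratio: bytes on disk vs bytes in R memory.
file.size(f) / as.numeric(object.size(x))
```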
Results
Here are the measurement results for the small data table:
| Method | Write Time (ms) | Read Time (ms) | File Size (bytes) | Object Size (bytes) | Compression Ratio (file/object) |
|---|---|---|---|---|---|
| fst (0% compressed) | 4.43 | 14.11 | 2,401,527 | 2,410,512 | 0.9963 |
| fst (50% compressed) | 6.23 | 25.44 | 1,814,420 | 2,410,512 | 0.7527 |
| fst (100% compressed) | 15.10 | 15.67 | 1,684,412 | 2,410,512 | 0.6988 |
| RDS (compressed) | 315.58 | 36.74 | 1,891,996 | 2,410,512 | 0.7849 |
| RDS (uncompressed) | 7.60 | 24.08 | 2,402,000 | 2,410,512 | 0.9965 |
| feather | 4.97 | 3.07 | 2,404,192 | 2,410,512 | 0.9974 |
| fread/fwrite | 25.49 | 43.54 | 4,143,627 | 2,410,512 | 1.7190 |
| write.csv | 596.58 | 771.41 | 4,143,627 | 2,410,512 | 1.7190 |
| protobuf | 74.56 | 62.62 | 1,818,676 | 2,410,512 | 0.7545 |
| sqllite | 92.98 | 154.77 | 2,576,384 | 2,410,512 | 1.0688 |
Here are the measurement results for the large data table:
| Method | Write Time (ms) | Read Time (ms) | File Size (bytes) | Object Size (bytes) | Compression Ratio (file/object) |
|---|---|---|---|---|---|
| fst (0% compressed) | 17,956 | 4,914 | 2,400,001,527 | 2,400,010,512 | 1.0000 |
| fst (50% compressed) | 15,347 | 4,968 | 1,811,710,706 | 2,400,010,512 | 0.7549 |
| fst (100% compressed) | 12,731 | 7,145 | 1,680,028,768 | 2,400,010,512 | 0.7000 |
| RDS (compressed) | 276,159 | 25,520 | 1,891,565,116 | 2,400,010,512 | 0.7881 |
| RDS (uncompressed) | 20,310 | 6,690 | 2,400,002,000 | 2,400,010,512 | 1.0000 |
| feather | 20,480 | 32,886 | 2,400,004,192 | 2,400,010,512 | 1.0000 |
| fread/fwrite | 51,513 | 48,384 | 4,143,341,213 | 2,400,010,512 | 1.7264 |
| write.csv | 529,811 | 669,340 | 4,143,341,213 | 2,400,010,512 | 1.7264 |
| protobuf | 71,640 | N/A | 1,833,994,768 | 2,400,010,512 | 0.7642 |
| sqllite | 55,534 | 133,963 | 2,567,032,832 | 2,400,010,512 | 1.0696 |
And, finally, some observations:
- As expected, the fst package shows the best performance on the large data table. On the small data table, feather's performance is a little better.
- What's interesting is that the performance of uncompressed RDS is quite good: on the large data table it is comparable with feather for writing and even better for reading. It is also interesting that the files saved by feather and by uncompressed RDS are almost the same size. This can probably be explained by the fact that R internally stores data tables in columnar format, as a list of column vectors (see the sketch after this list).
- Another observation is that the compressed RDS approach showed quite poor write performance. A possible explanation is that my SSD is not the bottleneck here: it is fast enough that writing the data directly is cheaper than spending CPU time on compression.
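The columnar-layout point is easy to verify directly; a minimal illustration (not part of the benchmark code):

```r
library(data.table)

dt = data.table(a = 1:3, b = c("x", "y", "z"))

# A data.table, like any data.frame, is internally a list of column
# vectors, so columnar file formats map onto it almost one-to-one.
typeof(dt)                  # "list"
identical(dt$a, dt[["a"]])  # TRUE: each column is a plain vector
```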
Code
Here is the preparatory code (loading the required libraries, generating a sample data table, assigning file names):
```r
library(data.table)
library(ggplot2)
library(microbenchmark)
library(fst)
library(feather)
library(RProtoBuf)
library(RSQLite)

# Generate an nrow x ncol data table of random integers.
generateDataSet = function(nrow, ncol) {
  require(data.table)
  as.data.table(do.call(cbind, sapply(seq_len(ncol), function(i) {
    as.integer(runif(n = nrow, min = 0, max = 1e6))
  }, simplify = F)))
}

x = generateDataSet(1e4, 60)  ## generateDataSet(1e7, 60) for the large table

# One temporary file per serialization method.
fst_file_0 = tempfile()
fst_file_50 = tempfile()
fst_file_100 = tempfile()
rds_file_comp = tempfile()
rds_file_uncomp = tempfile()
feather_file = tempfile()
fcsv_file = tempfile()
csv_file = tempfile()
protobuf_file = tempfile()
sqllite_file = tempfile()
```
Here is the write and read stats measurement code:
```r
# Benchmark each write method 15 times.
write_stats = microbenchmark(
  list = list(
    fst_0 = bquote({ write.fst(x, fst_file_0, compress = 0) }),
    fst_50 = bquote({ write.fst(x, fst_file_50, compress = 50) }),
    fst_100 = bquote({ write.fst(x, fst_file_100, compress = 100) }),
    rds_compressed = bquote({ saveRDS(x, rds_file_comp, compress = T) }),
    rds_uncompressed = bquote({ saveRDS(x, rds_file_uncomp, compress = F) }),
    feather = bquote({ write_feather(x, feather_file) }),
    fcsv = bquote({ fwrite(x, fcsv_file) }),
    csv = bquote({ write.csv(x, csv_file, row.names = F, quote = F) }),
    protobuf = bquote({
      # Serialize through an explicit binary connection.
      pbf = file(protobuf_file, open = 'w+b')
      serialize_pb(x, pbf)
      close(pbf)
    }),
    sqllite = bquote({
      con <- dbConnect(RSQLite::SQLite(), sqllite_file)
      dbWriteTable(con, "x", x, overwrite = T)
      dbDisconnect(con)
    })
  ), times = 15L, unit = "ms")

# Benchmark each read method 15 times.
read_stats = microbenchmark(
  list = list(
    fst_0 = bquote({ x2 = read.fst(fst_file_0) }),
    fst_50 = bquote({ x2 = read.fst(fst_file_50) }),
    fst_100 = bquote({ x2 = read.fst(fst_file_100) }),
    rds_compressed = bquote({ x2 = readRDS(rds_file_comp) }),
    rds_uncompressed = bquote({ x2 = readRDS(rds_file_uncomp) }),
    feather = bquote({ x2 = read_feather(feather_file) }),
    fcsv = bquote({ x2 = fread(fcsv_file) }),
    csv = bquote({ x2 = read.csv(csv_file) }),
    protobuf = bquote({
      # Deserialize from the open binary connection.
      pbf = file(protobuf_file, open = 'rb')
      x2 = unserialize_pb(pbf)
      close(pbf)
    }),
    sqllite = bquote({
      con <- dbConnect(RSQLite::SQLite(), sqllite_file)
      x2 = dbReadTable(con, "x")
      dbDisconnect(con)
    })
  ), times = 15L, unit = "ms")

# Collect the on-disk size of each file.
file_stats = rbindlist(list(
  data.table(method = "fst_0", file_size = file.size(fst_file_0)),
  data.table(method = "fst_50", file_size = file.size(fst_file_50)),
  data.table(method = "fst_100", file_size = file.size(fst_file_100)),
  data.table(method = "rds_compressed", file_size = file.size(rds_file_comp)),
  data.table(method = "rds_uncompressed", file_size = file.size(rds_file_uncomp)),
  data.table(method = "feather", file_size = file.size(feather_file)),
  data.table(method = "fcsv", file_size = file.size(fcsv_file)),
  data.table(method = "csv", file_size = file.size(csv_file)),
  data.table(method = "protobuf", file_size = file.size(protobuf_file)),
  data.table(method = "sqllite", file_size = file.size(sqllite_file))
))
```
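Note that the benchmark measures speed and size only; it does not verify that the data survives the round trip. A quick sanity check along these lines (a sketch assuming the files written above still exist) may be worth adding:

```r
# Compare the original table against one round-tripped copy.
x2 = read.fst(fst_file_100)
all.equal(as.data.frame(x), as.data.frame(x2))  # TRUE if contents match
```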
Here is the code which generates the summary table:
```r
# Aggregate the microbenchmark runs and key by expression name,
# so the mean timings can be joined onto file_stats by method.
write_stats = as.data.table(summary(write_stats))
setkeyv(write_stats, "expr")

read_stats = as.data.table(summary(read_stats))
setkeyv(read_stats, "expr")

file_stats[, write_time := write_stats[J(method), mean]]
file_stats[, read_time := read_stats[J(method), mean]]

# Keep the size numeric: the large table exceeds .Machine$integer.max bytes.
file_stats[, object_size := as.numeric(object.size(x))]
file_stats[, file_obj_ratio := file_size / object_size]

file_stats
```
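The preparatory code loads ggplot2, although the post does not include the plotting code itself. A hypothetical sketch for visualizing the resulting summary table could look like this:

```r
library(ggplot2)

# Reshape file_stats to long format: one row per (method, operation).
plot_data = melt(file_stats, id.vars = "method",
                 measure.vars = c("write_time", "read_time"),
                 variable.name = "operation", value.name = "time_ms")

# Horizontal bar chart of mean write/read times per method.
ggplot(plot_data, aes(x = method, y = time_ms, fill = operation)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Mean time (ms)") +
  coord_flip()
```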