Saturday, January 6, 2018

Comparing different serialization methods in R

Introduction


In this article I compare the data serialization approaches available in R in terms of serialization / deserialization performance and the disk space required. The analysis uses data table objects, since these are the objects I need to serialize and deserialize most often in practice.

The following approaches are reviewed:
  • Functions saveRDS / readRDS:
    • They support all R object types and provide as-is serialization / deserialization (with possible nuances for custom reference objects).
    • They support both compressed and uncompressed data storage.
    • Essentially, the output is a dump of the object's in-memory representation in R, so unfortunately this is an R-only serialization format.
  • Package feather:
    • This is a fast, language-agnostic alternative to the RDS format.
    • It uses a column-oriented file format (based on Apache Arrow and the FlatBuffers library).
    • The format is open source and is supported in both R and Python.
  • Package fst:
    • This is another alternative to the RDS and Feather formats that can be used for fast serialization of data frames.
    • It supports compression with the LZ4 and ZSTD algorithms.
    • The big advantage of this approach is that it provides full random access to the rows and columns of the stored data.
  • Package RProtoBuf:
    • This is an R interface to Protocol Buffers, the serialization method developed by Google.
    • This approach is usually used for serializing relatively small structured objects, but it is interesting to see how it handles serialization of data tables in R.
  • Functions write.csv & read.csv:
    • These are the standard R functions for writing and reading data frames in the text-based CSV format.
    • This approach is easily applicable only to data frame objects, but I have included it in the comparison since most of the objects I need to serialize in practice are data tables.
  • Functions fwrite & fread from data.table package:
    • This is another approach for writing and reading data table objects.
    • These functions are much more optimized than the standard ones above, so it is worth comparing them.
  • Package RSQLite:
    • This package provides an R interface to the SQLite embedded database engine.
    • Although it may be overkill to use this approach for simple serialization of data tables, I have included this package in the comparison for the sake of completeness.
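
To make the comparison criteria concrete, here is a minimal sketch of how write time, read time and file size could be measured for two of the approaches listed above, using only base R functions (saveRDS / readRDS and write.csv / read.csv). The sample data frame, its column names and its size are assumptions made purely for illustration; this is not the benchmark code or the data behind the results in this article.

```r
# Hypothetical sample data: column names and sizes are illustrative only.
set.seed(1)
df <- data.frame(
  id    = 1:1000,
  value = rnorm(1000),
  label = sample(letters, 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# Native R serialization (saveRDS compresses with gzip by default).
rds_write <- system.time(saveRDS(df, rds_path))[["elapsed"]]
rds_read  <- system.time(df_rds <- readRDS(rds_path))[["elapsed"]]

# Text-based CSV serialization.
csv_write <- system.time(
  write.csv(df, csv_path, row.names = FALSE)
)[["elapsed"]]
csv_read <- system.time(
  df_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
)[["elapsed"]]

# Collect timings (seconds) and file sizes (bytes) in one summary table.
results <- data.frame(
  method     = c("RDS", "CSV"),
  write_sec  = c(rds_write, csv_write),
  read_sec   = c(rds_read, csv_read),
  size_bytes = file.size(c(rds_path, csv_path))
)
print(results)

# RDS restores the object exactly; CSV goes through text, so the numeric
# column is compared with a tolerance instead of bit-for-bit.
rds_exact <- identical(df, df_rds)
csv_close <- isTRUE(all.equal(df, df_csv))
```

On a data frame this small the timings are close to zero and mostly noise, so a meaningful comparison needs a much larger table and repeated runs. The other packages in the list could be slotted into the same pattern by swapping in their own read and write functions.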