Wednesday, February 7, 2018

Batch system database design considerations

Introduction

Continuing the topic of best practices for building batch systems, in this article I would like to describe some considerations related to database design for batch systems.

More precisely, I am referring to batch systems in the financial area (such as risk engines), but I believe these principles can be applied to all batch systems. I will also concentrate on relational databases. In theory, a batch system can be built around NoSQL storage, or use a service-oriented / microservices architecture with no central data storage at all; in practice, however, I believe relational databases are used for most batch systems. Such databases contain input data for calculations, loaded from external upstream systems, and store calculation results, which are later used for reporting purposes or extracted to downstream systems. In fact, the design of such databases often follows the traditional Extract-Transform-Load (ETL) approach.

Looking at the wider picture, such batch systems are usually part of a larger enterprise landscape. Organizations often use a federated approach, under which the organization has many batch systems, each responsible for a particular business function, that communicate with each other by exchanging data feeds.

Batch Database Design Considerations

When creating a database for such a batch system, the following considerations are often taken into account:
  1. Master Data Management. This is an enterprise-wide practice which ensures that important reference data (e.g. client data) are used consistently across the organization and that there is a single point of reference for them. There are many different approaches to setting up a proper MDM solution; for example, this presentation can be used as a reference. But MDM is rather an organization-wide problem, so I will not spend much time reviewing it in this article.
  2. Data Lineage & Traceability. This is the ability to track data sourcing & data flow between different systems and, ideally, within a single system as well. It is not a critical requirement for system/organization functioning per se, but it is often very important for troubleshooting & audit purposes. It should be relatively easy to track data sourcing between systems that follow ETL best practices, but it is usually harder to track data flows automatically within a single system, unless special solutions are used.
  3. Data Consistency. From my point of view, this is one of the most important requirements, as it is crucial for producing accurate & reproducible batch outputs. I include here consistency between different input data items (e.g. making sure that different pieces of data are consistent with each other from a timing point of view). I also include repeatability of calculations (e.g. so that we can re-run a calculation for a given combination of input data and get the same results), as illustrated in the sketch after this list.
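
To make the repeatability point more concrete, here is a minimal sketch in R, using data.table objects as stand-ins for database tables. The table and column names (cob_date, version, the run manifest) are purely hypothetical: each input snapshot is keyed by a business date and an immutable version number, and a batch run records exactly which versions it consumed, so a re-run reads the very same rows.

  library(data.table)

  # Hypothetical versioned input snapshots: every load of market data gets
  # an immutable (cob_date, version) key instead of overwriting previous data.
  market_data <- data.table(
    cob_date = as.Date(c("2018-02-07", "2018-02-07")),
    version  = c(1L, 2L),
    rate     = c(0.015, 0.016)
  )

  # Hypothetical run manifest: each batch run records which input versions it used.
  run_manifest <- data.table(
    run_id      = "RUN-001",
    input_table = "market_data",
    cob_date    = as.Date("2018-02-07"),
    version     = 1L
  )

  # Re-running RUN-001 joins on the recorded keys and therefore sees
  # exactly the same input rows as the original run.
  inputs <- market_data[run_manifest[run_id == "RUN-001"], on = .(cob_date, version)]
  print(inputs)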

Saturday, January 6, 2018

Comparing different serialization methods in R

Introduction


In this article I would like to compare different approaches to data serialization available in R. The comparison will be done from the point of view of serialization / deserialization performance and compactness of the required disk space. I will perform the analysis on data table objects, since these are the objects I most often need to serialize/deserialize in my practice.

The following approaches are reviewed (short usage sketches for each of them are given after the list):
  • Functions saveRDS / readRDS:
    • They support all R object types and provide as-is serialization / deserialization (with possible nuances for custom reference objects).
    • They support compressed & uncompressed data storage.
    • Essentially, this is a dump of the object's in-memory representation in R, so unfortunately it is an R-only serialization format.
  • Package feather:
    • This is a fast and language-agnostic alternative to the RDS format.
    • It uses a column-oriented file format (based on Apache Arrow and the Flatbuffers library).
    • The format is open source and is supported in both R & Python.
  • Package fst:
    • This is another alternative to the RDS & Feather formats which can be used for fast serialization of data frames.
    • It supports compression using the LZ4 and ZSTD algorithms.
    • The big advantage of this approach is that it provides full random access to rows & columns of the stored data.
  • Package RProtoBuf:
    • This is an R interface package for the Protocol Buffers serialization method proposed by Google.
    • Usually this approach is used for serializing relatively small structured objects, but it would be interesting to see how it deals with data table serialization in R.
  • Functions write.csv & read.csv:
    • These are the standard R functions for storing & reading data frames in the text-based CSV format.
    • This approach can easily be applied only to data frame objects, but I've included it in the comparison since most objects I need to serialize in my practice are data tables.
  • Functions fwrite & fread from data.table package:
    • This is another approach for storing & reading data table objects.
    • These functions are much better optimized than the standard ones above, so it would be nice to compare them.
  • Package RSQLite:
    • This package provides an R interface to the SQLite embedded database engine.
    • Although it may be overkill to use such an approach for simple data table serialization purposes, I've included this package in the comparison for the sake of completeness.
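
To make the comparison concrete, below is a minimal round-trip sketch for the binary formats (saveRDS / readRDS, feather and fst). The sample data table and file names are arbitrary; the function signatures are those provided by the feather and fst packages.

  library(data.table)
  library(feather)
  library(fst)

  # Sample data table used for all round-trips below (arbitrary content).
  dt <- data.table(id = 1:100000,
                   value = rnorm(100000),
                   group = sample(letters, 100000, replace = TRUE))

  # saveRDS / readRDS: R-native binary dump, compressed by default.
  saveRDS(dt, "dt.rds")
  dt_rds <- readRDS("dt.rds")

  # feather: language-agnostic columnar format (read_feather returns a tibble,
  # so convert back to a data table).
  write_feather(dt, "dt.feather")
  dt_feather <- as.data.table(read_feather("dt.feather"))

  # fst: columnar format with optional compression (0-100) and random access.
  write_fst(dt, "dt.fst", compress = 50)
  dt_fst <- read_fst("dt.fst", as.data.table = TRUE)
  # Random access: read only the first 100 rows of selected columns.
  dt_head <- read_fst("dt.fst", columns = c("id", "value"), from = 1, to = 100)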
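For RProtoBuf, one possible way to serialize an arbitrary R object is via the generic serialize_pb / unserialize_pb functions, which use the package's rexp.proto schema. This is a sketch rather than exact benchmark code; depending on the package version, the result may need to be converted back to a data table explicitly.

  library(RProtoBuf)
  library(data.table)

  dt <- data.table(id = 1:1000, value = rnorm(1000))

  # Serialize to a raw vector using the generic rexp.proto schema and write it to disk.
  raw_pb <- serialize_pb(dt, NULL)
  writeBin(raw_pb, "dt.pb")

  # Read the raw bytes back and deserialize; convert back to a data table if needed.
  raw_in <- readBin("dt.pb", what = "raw", n = file.size("dt.pb"))
  dt_pb <- as.data.table(unserialize_pb(raw_in))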
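The two text-based approaches look like this; again, the sample data and file names are arbitrary.

  library(data.table)

  dt <- data.table(id = 1:100000, value = rnorm(100000))

  # Base R: simple text-based CSV, but comparatively slow.
  write.csv(dt, "dt_base.csv", row.names = FALSE)
  dt_base <- as.data.table(read.csv("dt_base.csv"))

  # data.table: heavily optimized CSV writer/reader.
  fwrite(dt, "dt_fast.csv")
  dt_fast <- fread("dt_fast.csv")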
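Finally, a sketch of the RSQLite approach via the DBI interface, where "serialization" simply means writing the table into an SQLite database file and reading it back.

  library(DBI)
  library(RSQLite)
  library(data.table)

  dt <- data.table(id = 1:100000, value = rnorm(100000))

  # "Serialize" by writing the data table into an SQLite database file.
  con <- dbConnect(RSQLite::SQLite(), "dt.sqlite")
  dbWriteTable(con, "dt", dt, overwrite = TRUE)

  # "Deserialize" by reading the whole table back into R.
  dt_sql <- as.data.table(dbReadTable(con, "dt"))
  dbDisconnect(con)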