Wednesday, February 7, 2018

Batch system database design considerations

Introduction

Continuing the topic of best practices for building batch systems, in this article I would like to describe some considerations related to database design for batch systems.

More precisely, I am referring to batch systems in the financial area (such as risk engines), but I believe these principles can be applied to batch systems in general. I will also concentrate on relational databases. In theory, a batch system can be built around NoSQL storage, or use a service-oriented / microservices architecture with no central data storage at all. In practice, however, I believe relational databases are used in most batch systems. Such databases contain input data for calculations, loaded from external upstream systems. They also store calculation results, which are later used for reporting or extracted to downstream systems. In fact, the design of such databases often follows the traditional Extract-Transform-Load (ETL) approach.
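To make this ETL-style layout concrete, below is a minimal schema sketch (using SQLite through Python purely for illustration; a real system would use a full RDBMS). All table and column names here are hypothetical, not taken from any particular system.

    import sqlite3

    # In-memory database purely for illustration.
    conn = sqlite3.connect(":memory:")

    conn.executescript("""
    -- Staging: raw feed from an upstream system, loaded as-is (Extract).
    CREATE TABLE stg_market_data (
        feed_file      TEXT NOT NULL,   -- source file the row arrived in
        loaded_at      TEXT NOT NULL,   -- load timestamp
        instrument_id  TEXT,
        price          TEXT             -- kept as text until validated
    );

    -- Cleansed input data used by the calculations (Transform).
    CREATE TABLE in_market_data (
        business_date  TEXT NOT NULL,
        instrument_id  TEXT NOT NULL,
        price          REAL NOT NULL,
        PRIMARY KEY (business_date, instrument_id)
    );

    -- Calculation results, later used for reporting or extracted to
    -- downstream systems (Load).
    CREATE TABLE out_risk_results (
        business_date  TEXT NOT NULL,
        portfolio_id   TEXT NOT NULL,
        risk_measure   TEXT NOT NULL,
        value          REAL NOT NULL
    );
    """)

The staging table deliberately keeps raw values untyped, so a malformed feed can be inspected before it is transformed into the typed input tables.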

Looking at the wider picture, such batch systems are usually part of a larger enterprise landscape. Organizations often use a federated approach, under which the organization has many batch systems, each responsible for a particular business function, communicating with each other by exchanging data feeds.

Batch Database Design Considerations

When creating a database for such a batch system, the following considerations are often taken into account:
  1. Master Data Management. This is an enterprise-wide practice which ensures that important reference data (e.g. client data) are used consistently across the organization and that there is a single point of reference for them. There are many different approaches to setting up a proper MDM solution; for example, this presentation can be used as a reference. But MDM is rather an organization-wide problem, so I will not spend much time reviewing it in this article.
  2. Data Lineage & Traceability. This is the ability to track data sourcing and data flow between different systems, and ideally within a single system as well. It is not a critical requirement for the functioning of a system or organization per se, but it is often very important for troubleshooting and audit purposes. It should be relatively easy to track data sourcing between systems that follow ETL best practices, but it is usually harder to track data flows automatically within a single system, unless special solutions are used (a minimal lineage sketch follows this list).
  3. Data Consistency. From my point of view, this is one of the most important requirements, as it is crucial for producing accurate and reproducible batch outputs. I include here consistency between different input data items (e.g. making sure that data pieces are consistent with each other from a timing point of view). I also include repeatability of calculations (e.g. so that we can re-run a calculation for a given combination of input data and get the same results; the versioning sketch after this list illustrates one way to achieve this).
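One common way to make data flows traceable within a single system is to stamp every row with lineage metadata and to propagate a load/batch identifier from inputs into outputs. The sketch below shows the idea; the table names, column names, and the sample "UPSTREAM_A" feed are all hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Every input row carries lineage metadata saying where it came from.
    CREATE TABLE in_trades (
        trade_id       TEXT NOT NULL,
        notional       REAL NOT NULL,
        source_system  TEXT NOT NULL,    -- upstream system that produced the row
        source_feed    TEXT NOT NULL,    -- file / message the row arrived in
        load_batch_id  INTEGER NOT NULL  -- id of the load that inserted it
    );

    -- Results keep a reference to the input batch they were derived from,
    -- so any output number can be traced back to its inputs.
    CREATE TABLE out_exposures (
        trade_id        TEXT NOT NULL,
        exposure        REAL NOT NULL,
        input_batch_id  INTEGER NOT NULL
    );
    """)

    # A load from an upstream system stamps each row with its lineage.
    conn.execute(
        "INSERT INTO in_trades VALUES (?, ?, ?, ?, ?)",
        ("T-1001", 5000000.0, "UPSTREAM_A", "trades_20180207.csv", 42),
    )

    # The calculation propagates the batch id into its outputs
    # (2% of notional is just a placeholder formula).
    conn.execute(
        "INSERT INTO out_exposures "
        "SELECT trade_id, notional * 0.02, load_batch_id FROM in_trades"
    )

    # Troubleshooting: given a suspicious output row, find its source feed.
    print(conn.execute("""
        SELECT o.trade_id, o.exposure, t.source_system, t.source_feed
        FROM out_exposures o JOIN in_trades t ON t.trade_id = o.trade_id
    """).fetchone())
    # -> ('T-1001', 100000.0, 'UPSTREAM_A', 'trades_20180207.csv')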
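For consistency and repeatability, one possible approach (a sketch under assumed, hypothetical table names, not the only viable design) is to version input datasets immutably and have each batch run record exactly which versions it consumed. A re-run can then read the same data it originally used, even after corrected data has arrived.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Each load of an input dataset gets an immutable version number;
    -- old versions are never updated in place.
    CREATE TABLE in_prices (
        dataset_version INTEGER NOT NULL,
        business_date   TEXT NOT NULL,
        instrument_id   TEXT NOT NULL,
        price           REAL NOT NULL,
        PRIMARY KEY (dataset_version, business_date, instrument_id)
    );

    -- A batch run records exactly which input versions it consumed,
    -- so the same combination can be replayed later.
    CREATE TABLE batch_run (
        run_id          INTEGER PRIMARY KEY,
        business_date   TEXT NOT NULL,
        prices_version  INTEGER NOT NULL
    );
    """)

    # A correction arrives as version 2 instead of overwriting version 1.
    conn.executemany(
        "INSERT INTO in_prices VALUES (?, ?, ?, ?)",
        [(1, "2018-02-07", "BOND-X", 101.5),
         (2, "2018-02-07", "BOND-X", 101.7)],
    )
    conn.execute("INSERT INTO batch_run VALUES (1, '2018-02-07', 1)")

    # Re-running run 1 reads exactly the data it originally used, even
    # though a newer price version now exists.
    print(conn.execute("""
        SELECT p.price
        FROM batch_run r
        JOIN in_prices p ON p.dataset_version = r.prices_version
                        AND p.business_date   = r.business_date
        WHERE r.run_id = 1
    """).fetchone())
    # -> (101.5,)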