Wednesday, February 7, 2018

Batch system database design considerations

Introduction

To continue the topic of best practices for building batch systems, in this article I would like to describe some considerations related to database design for batch systems.

More precisely, I am referring to batch systems in the financial area (such as risk engines), but I believe that these principles can be applied to all batch systems. I will also concentrate on relational databases. In theory, a batch system can be built around NoSQL storage, or use a service-oriented / microservices architecture where there is no central data storage at all. But in practice, I believe that relational databases are used for most batch systems. Such databases contain input data for calculations, which are loaded from external upstream systems. They also store calculation results, which are later used for reporting purposes or extracted to downstream systems. In fact, the design of such databases often follows the traditional Extract Transform Load (ETL) approach.

Looking at the wider picture, such batch systems are usually part of a larger enterprise framework. Often organizations use a federated enterprise framework: the organization has many batch systems which are responsible for particular business functions and which communicate with each other by exchanging data feeds.

Batch Database Design Considerations

When creating a database for such a batch system, the following considerations are often taken into account:
  1. Master Data Management. This is an enterprise-wide practice which ensures that important reference data (e.g. client data) are used consistently across the organization and that there is a single point of reference for them. There are many different approaches to setting up a proper MDM solution. But MDM is rather an organization-wide problem, so I will not spend much time on it in this article.
  2. Data Lineage & Traceability. This is the ability to track data sourcing and data flow between different systems, and ideally within a single system as well. It is not a critical requirement for system/organization functioning per se, but it is often very important for troubleshooting and audit purposes. It should be relatively easy to track data sourcing between systems which follow ETL best practices, but it is usually harder to track data flows automatically within a single system, unless special solutions are used.
  3. Data Consistency. From my point of view, this is one of the most important requirements: it is crucial for producing accurate and reproducible batch outputs. I include here consistency between different input data items (e.g. making sure that data pieces are consistent with each other from a timing point of view). I also include here repeatability of calculations (i.e. the ability to re-run calculations for a given combination of input data and get the same results).

Note: I don't mention here common recommendations about database normalization (e.g. avoiding data duplication), and I also don't mention common non-functional requirements (such as system performance or security).

One more thing to note is that data governance requirements are often enforced on financial institutions by regulators.

Batch Database Design Approaches

In most cases, data stored in a batch database can be classified into one of four categories:
  1. Input data for calculations. For example, this can be company master reference data (e.g. client details, trade data), or market data loaded from 3rd party providers.
  2. Static & user data. Static data can be data dictionaries and lookup tables; user data can be manual corrections of input data or what-if calculation scenarios.
  3. Output data. These are basically the calculation results.
  4. Metadata. These link together and describe the data from the three previous categories.
All these data change with time. Depending on business needs, different reference time intervals can be chosen, but most often batch systems in the finance area deal with end-of-day extracts.

So, there are two approaches that can be chosen for data storage:
  1. Feed based. The whole end-of-day extract is loaded into the database every time and referenced as a whole in calculations. This approach is easy to implement, it is quite robust, and it allows keeping different data versions in the same database schema. The disadvantage is that it may require extra storage capacity. Usually it is applied to loading upstream feeds and storing calculation results.
    • For example, if a feed contains client data, the clients table will contain multiple feed loads, and every load will contain the full set of clients.
  2. Slowly Changing Dimension (in particular Type 6). This approach can be utilized for static data which change rarely, but for which it is still useful to be able to reference both old and new versions of the data (a minimal table sketch follows this list).
    • For example, if a static table contains some regulatory calculation parameter, then this approach allows saving several different versions of this parameter, each valid for a different time interval.
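
As an illustration, here is a minimal sketch of how these two approaches might look at the table level. It uses SQLite via Python purely for illustration; all table and column names are hypothetical and not taken from any particular system.

import sqlite3

conn = sqlite3.connect(":memory:")

# Feed-based storage: every end-of-day load is kept in full and is
# distinguished by the ETL event that loaded it.
conn.execute("""
    CREATE TABLE clients (
        etl_event_id INTEGER NOT NULL,   -- which feed load this row belongs to
        client_id    TEXT    NOT NULL,
        client_name  TEXT,
        PRIMARY KEY (etl_event_id, client_id)
    )
""")

# Slowly Changing Dimension (Type 6 style) storage for rarely changing
# static data: each parameter version carries its own validity interval.
conn.execute("""
    CREATE TABLE calc_parameters (
        parameter_name  TEXT NOT NULL,
        parameter_value REAL NOT NULL,
        valid_from      TEXT NOT NULL,   -- ISO date
        valid_to        TEXT,            -- NULL means 'still valid'
        is_current      INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (parameter_name, valid_from)
    )
""")
conn.commit()
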
The mandatory modules of such a batch system usually are:
  1. Upstream feed loaders.
    • If the organization uses standardized feed formats, this can be a standalone application which loads upstream feeds according to some configuration.
    • Otherwise it can be a set of packages / scripts / applications which load different upstream feeds into the system database.
    • This layer can also perform data quality checks (a simplified loader sketch follows this list).
  2. Batch / calculation module.
    • This is the module which performs the calculations per se. There may actually be many such modules if the system supports different calculation types.
  3. Downstream feed extractors.
    • Opposite to upstream feed loading, this module is responsible for generating output feeds for downstream systems. Again, this can be either a single module which generates extracts in a standardized format based on system configuration (e.g. a list of pre-set queries), or a set of packages/scripts/applications, each generating its own downstream feed.
  4. Data cleanup & maintenance module.
    • Removal of outdated data that is no longer required from the system.
    • Database indexes & statistics maintenance.
  5. Database management console:
    • For example, a user interface for modifying static and user data.
    • Or a user interface to monitor the database state and trigger ad-hoc actions.
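
To make the feed loading layer a bit more concrete, here is a hypothetical sketch of an upstream feed loader with a simple data quality check. The file layout, parameter names and function name are assumptions for illustration only; a real loader would also write the accepted rows into the system database.

import csv

def load_feed(feed_path, required_columns):
    """Load one CSV upstream feed; return (accepted, rejected) row counts."""
    accepted, rejected = 0, 0
    with open(feed_path, newline="") as f:
        for row in csv.DictReader(f):
            # Data quality check: reject rows with missing mandatory fields.
            if any(not row.get(col) for col in required_columns):
                rejected += 1
                continue
            # A real loader would insert the row into a staging table here.
            accepted += 1
    return accepted, rejected
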
The following metadata design approach can be used to consistently save and reference data in a batch database. This approach uses three core metadata tables (a schema sketch follows the list):
  1. ETL Types.
    • This table contains information about all data types / feeds which are available in the system. This includes upstream feed data, static data tables, calculation results, and downstream extracts.
    • Sample attributes - ETL Type ID, ETL Type Description, ETL Category ID (e.g. upstream or downstream).
  2. ETL Events.
    • This table contains a log of ETL events in the system related to data loads, transformations, and extracts - for example, a new upstream feed load, a new batch run, or a new downstream feed generation.
    • Sample attributes - ETL Event ID, ETL Type ID, Start & End Time, Event Status, Reference Date.
  3. ETL Links.
    • This table contains links between input data and results data. For example, if we run a calculation batch, then this table will contain references to the input data feeds used for that particular batch and its reference date.
    • ETL Links are populated once the ETL Event entry is created.
    • Sample attributes - Parent ETL Event ID, Child ETL Event ID.
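
As a rough illustration, the three tables could look as follows (again SQLite via Python; exact column names and types are my assumptions based on the sample attributes above).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_types (
        etl_type_id     INTEGER PRIMARY KEY,
        description     TEXT NOT NULL,
        etl_category_id INTEGER NOT NULL   -- e.g. upstream / static / batch / downstream
    );

    CREATE TABLE etl_events (
        etl_event_id   INTEGER PRIMARY KEY,
        etl_type_id    INTEGER NOT NULL REFERENCES etl_types(etl_type_id),
        start_time     TEXT,
        end_time       TEXT,
        event_status   TEXT,               -- e.g. RUNNING / COMPLETED / FAILED
        reference_date TEXT                -- business (end-of-day) date
    );

    CREATE TABLE etl_links (
        parent_etl_event_id INTEGER NOT NULL REFERENCES etl_events(etl_event_id),
        child_etl_event_id  INTEGER NOT NULL REFERENCES etl_events(etl_event_id),
        PRIMARY KEY (parent_etl_event_id, child_etl_event_id)
    );
""")
conn.commit()
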
These tables can be used as follows:

  • Every day, when new versions of upstream feeds are loaded into the system database, every load gets its own unique ETL Event ID.
  • When a calculation batch is run (see the sketch below):
    • The batch gets its own unique ETL Event ID, and this ID is linked to the ETL Event IDs of the related upstream feeds.
    • Batch results are written to the database under the batch ETL Event ID.
  • When downstream extracts are generated, each extract gets its own unique ETL Event ID, and this ID is linked to the related batch ETL Event IDs.
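
Continuing the schema sketch above, batch event registration could look roughly like this. Here conn is the SQLite connection from the previous sketch; the parent/child direction of the links, the helper name, and the example IDs are my assumptions.

def start_batch(conn, batch_etl_type_id, reference_date, input_feed_event_ids):
    """Create a batch ETL event and link it to the input feed events it consumes."""
    cur = conn.execute(
        "INSERT INTO etl_events (etl_type_id, start_time, event_status, reference_date) "
        "VALUES (?, datetime('now'), 'RUNNING', ?)",
        (batch_etl_type_id, reference_date),
    )
    batch_event_id = cur.lastrowid
    # Assumption: the upstream feed event is the parent, the batch event is the child.
    conn.executemany(
        "INSERT INTO etl_links (parent_etl_event_id, child_etl_event_id) VALUES (?, ?)",
        [(feed_event_id, batch_event_id) for feed_event_id in input_feed_event_ids],
    )
    conn.commit()
    return batch_event_id

# Example: a batch for 2018-02-07 that consumes two upstream feed loads
# (assuming ETL type 3 and feed events 101 and 102 already exist).
batch_id = start_batch(conn, batch_etl_type_id=3, reference_date="2018-02-07",
                       input_feed_event_ids=[101, 102])
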

The advantages of this approach are the following:
  1. We can keep and reference different versions of data in the same database tables.
  2. We always know which particular data version was used for a batch calculation or downstream feed generation (see the lineage query sketch below).
  3. We can re-run calculations and be sure that the input data will be consistent.
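
For example, a simple lineage query over the metadata tables sketched above could list the upstream feed loads consumed by a given batch (same connection and tables as before; names are illustrative):

def input_feeds_for_batch(conn, batch_event_id):
    """Return (event id, feed description, reference date) for each input feed of a batch."""
    rows = conn.execute(
        """
        SELECT e.etl_event_id, t.description, e.reference_date
        FROM etl_links l
        JOIN etl_events e ON e.etl_event_id = l.parent_etl_event_id
        JOIN etl_types  t ON t.etl_type_id  = e.etl_type_id
        WHERE l.child_etl_event_id = ?
        """,
        (batch_event_id,),
    )
    return rows.fetchall()
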

What's out of scope?

I don't try to cover all aspects of data repository design for batch systems, and I leave many topics out of scope:
  • Database performance & how it can be optimized with NoSQL / Big Data solutions.
  • How to handle changes in data structures.
  • Deployment aspects of batch databases.



