Saturday, February 27, 2016

Best Practices for Batch Systems Creation

Introduction


Inspired by the article "Creating a Microservice? Answer these 10 Questions First", I decided to summarize my experience of working with batch processing systems and put together a checklist of items that need to be taken into account and thought through when working on such projects.

The existing literature on this topic usually concentrates on batch architecture design and specific frameworks. In this article I distill my experience and try to reflect on general best practices for real-world applications.

Before starting, I would like to introduce some terms:

* A batch is the execution of a series of processing and calculation steps in non-interactive mode. Batches usually run daily (e.g. overnight batches), but there may also be intraday batches (several re-runs per day) or less frequent batches (e.g. monthly).

* A job (or batch job) is a single step in batch processing. A batch usually consists of multiple jobs which can be executed in parallel or sequentially. Jobs may run on a pre-defined schedule and have cascading dependencies (e.g. job B is executed after job A is done).

* An environment is the place where batches are executed. It usually includes pre-configured hosts (application boxes, database servers, cluster nodes) on which the batches run, plus messaging infrastructure (e.g. message queues, service buses). There may be multiple environment instances: production, disaster recovery, development, testing, etc.
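
To make the idea of job dependencies more concrete, here is a minimal sketch in Python (the job names are hypothetical, and Python is simply my choice for illustration). It uses the standard graphlib module to group jobs into waves: jobs within one wave do not depend on each other and could run in parallel, while the waves themselves run one after another.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs it depends on.
dependencies = {
    "load_trades": set(),
    "load_prices": set(),
    "calculate_risk": {"load_trades", "load_prices"},
    "publish_report": {"calculate_risk"},
}


def execution_waves(deps):
    """Group jobs into waves; jobs in the same wave may run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished to unlock the next one
    return waves


waves = execution_waves(dependencies)
```

A real scheduler would of course start the jobs of each wave instead of just listing them, but the ordering logic stays the same.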



Check List


01. How fast will your batch be, and how will its performance scale as the amount of input data grows?

Think about how your batch will behave if the amount of input data doubles. Will it just take more time to complete, or will it fail due to resource constraints (e.g. memory or disk size)? Would it be possible to increase batch performance by horizontal scaling (e.g. adding more nodes to the cluster)?
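
One common way to make a job run longer rather than run out of memory when the input grows is to process the input in fixed-size chunks instead of loading it all at once. A minimal sketch in Python, assuming the records can be read as a stream (the function names and the chunk size are illustrative only):

```python
from itertools import islice


def chunks(records, size):
    """Yield lists of at most `size` records without materializing the input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def total_amount(records, chunk_size=1000):
    """Aggregate a stream chunk by chunk; memory use is bounded by chunk_size."""
    total = 0
    for chunk in chunks(records, chunk_size):
        total += sum(chunk)
    return total
```

With this shape, doubling the input roughly doubles the run time, while peak memory stays about the same.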

02. How stable will your batch be, and how will sporadic failures be handled?

When a batch consists of hundreds of jobs and runs for hours, there will almost certainly be some sporadic failures during its run. These may be database connection resets or network errors. And if a batch processes millions or billions of requests, a handful of them (say 1-10) will almost certainly fail.

So one should plan a strategy which (i) reduces the number of such issues and (ii) where possible provides automatic recovery from them. This may include:

(a) Robust design and implementation. E.g. don't keep a database connection open for the whole duration of the batch; instead, open it only when you need to read or write data and close it once this is done.

(b) Automatic retries in case of failures. E.g. if a database connection has been reset, try to re-open it.

(c) But the main thing here is to keep the batch design simple and the batch time short. This will minimize the number of such issues, and you may not need to take any special actions.
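
The automatic retries from (b) can be sketched in Python as follows. This is only an illustration under my own assumptions: TransientError is a hypothetical exception type standing for failures worth retrying (such as a reset connection), and the number of attempts and the delays are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a reset connection)."""


def with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry, doubling the delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the job fail explicitly
            sleep(base_delay * 2 ** (attempt - 1))
```

Only errors known to be transient are retried here; anything else propagates immediately, so that real defects are not hidden behind retries.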

03. How will batch components be deployed?

You should plan how the batch environment will be set up (i.e. preparation of app boxes, installation of 3rd party software packages, etc.) and how batch components will be installed and uninstalled.

This may not be trivial, since complicated batch systems often operate in heterogeneous environments (e.g. RDBMS + Hadoop Cluster + App Boxes), and each component may have a different installation approach. As a result, you may not be able to use modern deployment techniques like Docker.

The ideal deployment process is:

(a) To have an installation package for each batch component, so that there are no manual steps during installation (like copying those files, running these scripts, etc.).

(b) To have a one-click deployment system which deploys the above installation packages to an environment.

(c) To have configuration set up automatically after the packages are deployed. Please refer to item 04 below for more details on batch configuration.

(d) To have a deployment rollback procedure. If your batch installation only includes the deployment of binary files, rollback is usually not a big problem: it is enough to restore the old version of the binaries. But if your deployment includes a database upgrade or spans multiple components, you must have a mechanism which quickly and automatically restores the previous system state in case of failure.

04. How will batch components be configured?

It is important to have a consistent approach to batch configuration and, if possible, a single place for all configuration parameters. It does not really matter which mechanism is chosen for this: XML config files, registry entries, configuration tables in a database, or something else. What matters more is to have it all in one place and to make sure that configuration can be set up automatically when deploying batch components to different environments.
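
As an illustration of the "single place" idea, here is a minimal Python sketch with one JSON document holding shared defaults plus per-environment overrides. The format, the parameter names, and the host names are all my assumptions; the same layout would work equally well with XML files or database tables.

```python
import json

# Hypothetical configuration: one document for all environments.
CONFIG_JSON = """
{
  "defaults": {"db_host": "localhost", "chunk_size": 1000},
  "environments": {
    "prod": {"db_host": "prod-db.example.com"},
    "dev": {}
  }
}
"""


def load_config(text, environment):
    """Return the defaults merged with the overrides of the given environment."""
    raw = json.loads(text)
    if environment not in raw["environments"]:
        raise KeyError(f"unknown environment: {environment}")
    # Later keys win, so environment-specific values override the defaults.
    return {**raw["defaults"], **raw["environments"][environment]}
```

A deployment script can then call load_config with the target environment name, which keeps the set-up automatic and the parameters in one place.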

05. How will the batch be run in a development environment?

It is important to be able to run a small batch in a development environment. Even if the batch infrastructure includes many components which run on different hosts, you should be able to quickly deploy a development environment and quickly run batches there. This is a crucial requirement for effective development of such a system when multiple teams are involved. Otherwise, you may have problems later, during the system integration testing phase.

06. How will batch progress and errors be logged?

There are three important items here:

(a) Obviously, the batch should report its progress. At any time, you should be able to know how much work is done and how much is left.

(b) In case of an error, the batch should report enough information to allow analysis without debugging. I.e. it is often not enough to report just an error message; there should be at least stack traces and memory dumps. Otherwise it may be very hard to reproduce the error in a development environment.

(c) Fail if an error occurs. It may be obvious, but a batch process should explicitly fail with an error if something goes wrong. The opposite situation is when an error is only written to the message log while the process itself completes successfully. This may create a dangerous illusion of a successful run and will require extra effort to analyse logs and check batch results.

I would not recommend any particular logging framework or technique here. It is simply better if they are consistent across the different batch components.
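
Points (b) and (c) can be illustrated with a short Python sketch. It uses the standard logging module purely as an example (not as a recommendation), and the job shown is hypothetical: the error is logged together with its stack trace, and the failure is turned into a non-zero exit code instead of being swallowed.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch")


def run_job(job):
    """Run one job and return a process exit code: 0 on success, 1 on failure."""
    name = getattr(job, "__name__", repr(job))
    try:
        job()
    except Exception:
        # log.exception records the message together with the stack trace.
        log.exception("job %s failed", name)
        return 1
    log.info("job %s completed", name)
    return 0


def failing_job():
    """Hypothetical job that hits an error."""
    raise ValueError("input file is empty")


# A real entry point would pass the code on to the scheduler, for example:
#     sys.exit(run_job(failing_job))
```

The scheduler then sees the failure through the exit code and does not have to parse the logs to find out that something went wrong.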

07. How will batch performance and capacity metrics be monitored?

There should be mechanisms in place to monitor batch failures, batch performance, and environment capacity:

(a) If a batch job fails, a notification (e.g. via email) should be sent to the responsible team or person.

(b) Environment capacity (e.g. free disk space, free database space, memory utilization, etc.) should be monitored, and alerts should be sent out automatically if thresholds are breached.

(c) Batch performance and timings. It is important to collect such statistics in order to detect performance regressions and plan future optimizations.
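
The capacity check from (b) might look like the following Python sketch for disk space. The 10% threshold and the message format are assumptions of mine, and sending the alert (email or otherwise) is left out on purpose.

```python
import shutil


def disk_alerts(paths, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return an alert message for every path whose free space is below the threshold."""
    alerts = []
    for path in paths:
        stats = usage(path)  # named tuple with total, used and free bytes
        free_fraction = stats.free / stats.total
        if free_fraction < min_free_fraction:
            alerts.append(f"{path}: only {free_fraction:.0%} free")
    return alerts
```

The usage parameter exists only so that the check can be tested without filling up a real disk; in normal use the default shutil.disk_usage is what gets called.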

08. How will a failed batch be recovered?

If batch execution takes hours and cannot be made faster, you should think about a recovery strategy for batch failures. Let's imagine that you have implemented improvements to deal with sporadic failures as described in item 02, but the batch still failed due to a more severe failure. That failure is now resolved, but it would be too costly to simply re-run the batch from scratch. It may also lead to breached SLAs with outbound systems. So you should think about a solution which allows you either to re-run the batch from the point of failure or to run the batch for partial data.
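
Re-running from the point of failure can be sketched in Python with a simple checkpoint file. This is a minimal illustration under my own assumptions (jobs run sequentially, each job is safe to repeat if it was interrupted halfway, and the checkpoint is a JSON list of completed job names), not a description of any particular scheduler.

```python
import json
import os


def run_with_checkpoint(jobs, checkpoint_path):
    """Run (name, callable) pairs in order, skipping jobs already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, job in jobs:
        if name in done:
            continue  # completed in a previous run
        job()  # an exception stops the batch; earlier progress stays recorded
        done.append(name)
        # Write to a temporary file and rename it, so that a crash in the
        # middle of writing cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done
```

After the underlying problem is fixed, calling the same function again with the same checkpoint file continues from the first job that has not completed. The checkpoint file has to be removed before the next regular run.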

09. What will the approach to batch housekeeping be?

A typical batch system runs daily and produces a lot of output data (e.g. batch results in the database, temp files on disk, log files, etc.).

So there should be a mechanism in place which periodically cleans up this old data and prevents out-of-space errors. This mechanism may also be responsible for periodic reboots of services and other routine maintenance operations.
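
For files on disk, such a clean-up can be as simple as the Python sketch below. The retention period is an assumption, and the function only looks at regular files directly inside one directory; database clean-up would need its own mechanism.

```python
import os
import time


def remove_old_files(directory, max_age_days, now=None):
    """Delete regular files in `directory` last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    with os.scandir(directory) as it:
        entries = list(it)  # take a snapshot before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Returning the list of removed files makes it easy to log what the housekeeping job actually did.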

10. How will batch health be verified?

There are two aspects here:

(a) Pre-batch checks. These checks identify potential problems before the batch starts, so you won't have to wait until the batch fails in the middle. They include checks that all input data is in place, all batch components are OK, etc. This is usually the first job in the batch flow.

(b) Post-batch checks. These checks confirm that the batch not only completed successfully but also produced meaningful results. They may be simple checks (e.g. the batch produced non-empty results, or the produced numbers are within expected bounds), or complicated checks based on business rules. This is usually the last job in the batch flow, and it produces an automatic report for the respective people / teams.
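
Both kinds of checks can be very small. The Python sketch below covers only the simple cases mentioned above (input files present, results non-empty and within bounds); the function names, the result structure, and the bounds are hypothetical.

```python
import os


def pre_batch_checks(input_files, exists=os.path.exists):
    """Return a list of problems; an empty list means the batch may start."""
    return [f"missing input: {path}" for path in input_files if not exists(path)]


def post_batch_checks(results, lower, upper):
    """Check that named numeric results exist and fall within [lower, upper]."""
    problems = []
    if not results:
        problems.append("batch produced no results")
    for name, value in results.items():
        if not lower <= value <= upper:
            problems.append(f"{name}={value} is outside [{lower}, {upper}]")
    return problems
```

The returned list can go straight into the automatic report, and a non-empty list from the pre-batch checks is a good reason not to start the batch at all.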

11. How will batch changes be tested?

I would not like to dwell on testing best practices, but I need to mention that in addition to standard unit tests, you will also need regression testing to guarantee that changes have no unexpected impact.

It is also important to test the end-to-end batch flow with extreme data. Otherwise a component in the middle may fail the whole batch flow if it cannot handle some extreme number.

12. How will batch results be reproduced?

It is often necessary to re-run batches for old data or to reproduce batch results in a different environment. So you should plan an approach which allows you to easily back up the batch environment, along with all the data required for the calculations, and restore it in another environment.

13. How will batch components communicate with external systems?

Usually there are two types of external systems in batch processing:

(a) Inbound (or upstream) systems which are responsible for providing input data
(b) Outbound (or downstream) systems which are consuming batch outputs

The main requirement here is to identify all such external systems and establish formal and strict Service Level Agreements (SLAs) with them. An SLA should include timing requirements, data formats and tolerances (e.g. expected amount of data), delivery mechanisms, and recovery actions in case of an SLA breach (e.g. silently re-use data from the previous run or escalate the failure).

14. Who will maintain the batch environment, how, and when?

It is essential to automate routine batch maintenance operations (e.g. deployment or housekeeping) as much as possible, but it is unlikely that a complex batch system will be able to function in a fully automatic and autonomous mode.

So you will have a dedicated team of Support and/or DevOps engineers who will maintain the batches and their environments.

To make their work efficient, it is important:

(a) To have a knowledge base with batch documentation, known issues, etc.
(b) To have special training and introduction courses.
(c) To decide on a maintenance window for batches.
etc.

15. What will the impact and side effects on other systems be?

When batch components are executed in a shared environment, it is important to think about side effects on and from other components. For example, if you are running a batch and a UI middleware service on the same host, how will the UI perform while the batch is running?


What's out of scope?


Here is the list of items which I left out of the scope of the checklist above:

01. Batch security.

Usually batch jobs are executed under a special dedicated account which has all the required permissions for the components. It is good practice to have separate accounts for different environment types (e.g. prod vs dev) to avoid cross-access issues.

02. Batch performance tuning.

It's important to have good batch performance, but the approaches to achieving it are out of the scope of this article.

03. Batch orchestration.

There are a lot of batch orchestration and job scheduling tools, and reviewing them is out of the scope of this article.

04. Batch architecture and batch design patterns in general.
