Fortunately, drilling disasters are rare. Failures in business data-processing systems, on the other hand, are common: the power goes out and the backups don’t kick in, machines crash even though they are redundant, disk drives fail even though they are in RAID arrays, someone pushes the wrong button despite having been trained not to, bad data gets sucked into a system and gums up the works, databases jam and refuse to load data, networks get congested and transactions get dropped… When these systems go bad, if you are lucky, things just stop. But if you are unlucky, like the oil well disaster in the Gulf, data spews out of the data pipes and makes a horrible mess! Then the operations people have to shut down the systems and try to recover whatever data they can. And restarting operations can itself be dangerous: it can lead to another blowout, more loss of data, and more mess.
Software developers don’t like to use the word “blowout” because it scares people – but that’s what a failure usually is. It’s a blowout! The data gets sprayed all over the place, and pieces of the system may be lying dead or disabled on the floor.
Software engineers have been wrestling with this type of problem for decades. But building fail-safe systems is extremely difficult because each one is different, and the “software blowout preventers” are custom designed (often by people who don’t really understand the failure modes). The associated costs are high, and so software engineers are inclined to cut corners (how many times have you heard, “It’s not supposed to fail”?). Testing these systems tends to be haphazard because the reasons failures occur are poorly understood. Even worse, a high degree of robustness is usually in direct conflict with other needs of the data-processing system, such as high performance – you can have one or the other, but not both at the same time. Since one can’t prove that a system is robust, but one can easily measure delivery of business goals such as performance, robustness is usually something that gets pushed off until later.
Ab Initio realized, as a core principle, that the only way to prevent these data-processing blowouts was to take a different and comprehensive approach. Ab Initio engineers vowed that they would not build any software until they had figured out how to design in mechanisms – from the beginning – that were both robust and easy to use, and that would not impact performance. This approach is built into Ab Initio’s software so that users can focus on their business requirements and be confident that they will not have disabling blowouts.
How does this work? First, think of an Ab Initio application as a series of processing steps connected by pipes through which data flows. Data moves at high rates through these pipes from one step to the next. Sometimes a step will have multiple incoming pipes and/or multiple outgoing pipes. The pipes eventually connect to data storage units or to other processing systems that accept or produce streams of data. These interconnected systems of processing steps and data pipes can be extremely large – much larger and more complex than a New York City subway map!
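As a mental model only – the names below are invented for illustration and are not Ab Initio APIs – such a pipeline of steps connected by data pipes might be sketched like this:

```python
# Toy sketch of a dataflow pipeline: processing steps connected by
# pipes (queues). Purely illustrative; not Ab Initio code.
from queue import Queue

SENTINEL = None  # marks the end of the data stream


def producer(out: Queue):
    """First step: pump records into the pipe."""
    for record in ["a", "b", "c"]:
        out.put(record)
    out.put(SENTINEL)


def transform(inp: Queue, out: Queue):
    """Middle step: read from one pipe, write to the next."""
    while (record := inp.get()) is not SENTINEL:
        out.put(record.upper())
    out.put(SENTINEL)


def consumer(inp: Queue, results: list):
    """Final step: drain the last pipe into storage."""
    while (record := inp.get()) is not SENTINEL:
        results.append(record)


# Wire the steps together with pipes and run the flow end to end.
pipe1, pipe2, results = Queue(), Queue(), []
producer(pipe1)
transform(pipe1, pipe2)
consumer(pipe2, results)
print(results)  # ['A', 'B', 'C']
```

A real system runs the steps concurrently across servers; here they run one after another just to show the shape of the graph.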
The Ab Initio checkpointing mechanism consists of interlocking valves on these data pipes. In general there is a set of valves on the inputs and outputs of the overall processing system, and there may be additional valves at strategic points within it. At the valves there may also be data reservoirs, which capture a copy of the data before it passes into the next section of the system. Some of these valves (checkpoints) are specified by users; others are positioned automatically by the Ab Initio software. Most are implemented in Ab Initio’s software, though some live in other technologies such as databases and message queues. The key is that all these valves are connected to a central controller – the Ab Initio Co>Operating System – that opens and closes them in a carefully synchronized manner.
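A toy illustration of the idea – with invented names, not anything from the Co>Operating System – is a central controller that closes every valve in lockstep and copies the in-flight data into a reservoir before letting it flow onward:

```python
# Illustrative sketch only: coordinated checkpointing via "valves" with
# "reservoirs", all driven by one controller. Not Ab Initio's API.
class Valve:
    def __init__(self, name):
        self.name = name
        self.in_flight = []   # data currently passing through this point
        self.reservoir = []   # checkpointed copy of the data at risk
        self.open = True

    def close_and_snapshot(self):
        self.open = False
        self.reservoir = list(self.in_flight)  # capture a copy


class Controller:
    """Central coordinator: every valve checkpoints together, or not at all."""

    def __init__(self, valves):
        self.valves = valves

    def checkpoint(self):
        for v in self.valves:      # phase 1: close all valves in lockstep
            v.close_and_snapshot()
        for v in self.valves:      # phase 2: reopen once every copy is safe
            v.open = True


valves = [Valve("input"), Valve("midpoint"), Valve("output")]
valves[0].in_flight = [1, 2, 3]
Controller(valves).checkpoint()
print(valves[0].reservoir)  # [1, 2, 3]
```

The essential property being sketched is the synchronization: no valve reopens until every valve has safely captured its copy, so the checkpoint describes one consistent moment across the whole system.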
The Co>Operating System is designed for high-performance parallel and distributed processing, and to be incredibly robust in the face of failure. So while large amounts of data are flowing through its pipes at high speed – over networks and across servers and connected to all kinds of external systems – it has to be vigilantly on the lookout for a blowout.
Because the Co>Operating System has been carefully opening and closing the valves, if a blowout occurs the loss is limited to the data flowing between valves. And because the Co>Operating System keeps a copy in its reservoirs of any data that might be at risk, and because it knows exactly how much data has flowed into and out of the system, no data is ever actually lost. Once the cause of the failure has been fixed, the Co>Operating System can be restarted and it will automatically resume the applications at the right places. It will refill the data pipes with the right data (from the reservoirs), and everything will be as if the failure had never occurred. And that’s what everyone wants to hear: blowouts happen, but nobody gets injured and no data is lost, and pumping resumes as soon as the pipes are fixed. What more can you ask for?
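The recovery story can be sketched the same way. In this hypothetical fragment (not Ab Initio code), each section of data is copied into a reservoir before processing; if a step blows out, the pipe is refilled from the reservoir and processing resumes as if the failure had never occurred:

```python
# Toy recovery-by-replay sketch (invented names, not Ab Initio code).
def run_with_recovery(records, process, checkpoint_every=2):
    done, pending = [], list(records)
    while pending:
        batch = pending[:checkpoint_every]
        reservoir = list(batch)              # snapshot the data at risk
        try:
            out = [process(r) for r in batch]  # process one section
        except Exception:
            # Blowout: refill the pipe from the reservoir and replay.
            # (In reality the operator fixes the failure before restart.)
            pending = reservoir + pending[checkpoint_every:]
            continue
        done.extend(out)                     # commit the section's results
        pending = pending[checkpoint_every:]
    return done


calls = {"n": 0}

def flaky_double(x):
    """Doubles its input, but blows out once partway through the run."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("blowout")
    return x * 2


result = run_with_recovery([1, 2, 3, 4], flaky_double)
print(result)  # [2, 4, 6, 8] - nothing lost, nothing duplicated
```

Note that results are committed only after a whole section succeeds, which is why the replay produces no duplicates; coordinating that commit with external systems is where protocols like two-phase commit come in.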
PS: Lots of details were omitted from this description of checkpointing. Stuff like 2-phase commit, checkpoint triggers, message blips, transaction managers, and the XA protocol. Ab Initio sweats the details so that you don’t have to. The good news is: It just works.
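For the curious, the 2-phase commit mentioned here is a standard protocol, sketched below in its textbook form (this is the generic protocol, not Ab Initio's implementation): a coordinator first asks every participant to prepare, and commits only if all of them vote yes.

```python
# Textbook two-phase commit sketch. Participant names are invented;
# real participants would be resources like databases and message queues.
class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "active"

    def prepare(self):
        """Phase 1: durably promise to commit if asked, or vote no."""
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    if all(p.prepare() for p in participants):  # phase 1: collect votes
        for p in participants:                  # phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:                      # any "no" vote aborts all
        p.rollback()
    return "aborted"


db, mq = Participant("database"), Participant("message-queue")
print(two_phase_commit([db, mq]))  # committed
```

Either every participant commits or every participant rolls back, which is exactly the all-or-nothing guarantee the valves and reservoirs rely on at the system's boundaries.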