Data Driven Workflow

June 2008

Matthew Rapaport

 

Why workflow:

 

Applications manipulate data. Workflow makes data available, and also triggers (explicitly or implicitly) and sequences application execution. Applications run inside the operating system and make use of operating system objects and commands to obtain resources for their own execution. These include files, connections to databases, access to clocks, and connections to the outside world: other applications, or the operating systems of other computers.

 

Why don’t applications make their own data available and run themselves? They can and sometimes do. Many application languages have the power to use most of the operating system features necessary for their complete functioning. Languages like C and its descendants have deep control of operating systems, to the point of being able to write operating systems themselves. But most applications do not need such deep control and are content with the functionality listed above. Workflow is the process of using operating system commands, objects, and features to move data between applications and to coordinate and run them. Modern workflows have grown to the scale of large applications themselves! They now include data transformation tasks formerly embedded in applications. Workflow’s importance to the enterprise necessitates status reporting, operator visibility, and control. These demands begin to blur the line between workflow and applications. Three reasons militate against writing workflow into applications themselves.

 

  1. Today’s applications are tremendously complex. Third parties develop products to appeal to and work for as many different customers in as many different environments as possible. Trying to anticipate and support workflow in applications results in even greater complexity, size, and cost.
  2. Workflow supports requirements that change often in the application’s environment. File names change daily. Execution times vary depending on business need and on other applications, while communications parameters and outside data also change frequently. Adaptations can be made inside applications, but complexity makes this difficult, expensive, and risky. Historically, script languages evolved specifically to make just these and related tasks easy. Changes made in moments can be tested quickly, mitigating the need for elaborate change control and migration procedures. Correcting mistakes is also easy, usually a matter of minutes. Applications able to work smoothly with automated workflows are more robust in the face of environmental changes.
  3. Many modern (4th generation and above) application languages have lost the power to manipulate the necessary operating system objects and do their own workflow. Part of what makes these environments so productive (often a 4- or 5-to-1 productivity advantage over 3rd generation languages) is their reliance on easy-to-use, workflow-directed languages to perform tasks that are not germane to their purpose. In today’s heavily networked and distributed computing environments, no modern large-scale application runs without some workflow. Script languages are still popular, though today large, complex applications often specialize in workflow services!

 

Early workflows (for example, MVS JCL) mapped references to I/O channels and enabled applications to refer to their inputs and outputs without having to know anything about how these related to physical data (files) on the disk. Not only were file structures abstracted away from applications, but also file names!

 

MVS and its JCL are very formal. Other multitasking and multi-user operating systems of the day (early Unix, Pick, RSTS, Oasis) were more flexible. They supported script languages providing access to the set of operating system objects and functions written in a more natural style. IBM created Rexx, still capable and competitive with today’s first-order workflow languages such as Perl, Ruby, Tcl, and Python. Even early Windows inspired a third-party workflow language called 4DOS, which today has evolved into a rich control environment able to coordinate tasks on Windows servers throughout the enterprise. Workflow languages, enriched with data manipulation and transformation features far more powerful than those built into most application languages, perform wizardry in 20 or 50 lines of code that might take 500 or more lines in an application language.
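
To make the claim concrete, here is a small, hypothetical Perl sketch of the sort of transformation meant: it rolls a comma-delimited extract (customer, date, amount) into per-customer totals and writes a fixed-width summary for another application to pick up. The file names and layout are invented for the illustration.

#!/usr/bin/perl
# Hypothetical example: roll a CSV extract (customer,date,amount) up
# into per-customer totals. File names and layout are assumptions.
use strict;
use warnings;

my %total;
open my $in, '<', 'daily_extract.csv' or die "cannot open extract: $!";
while (my $row = <$in>) {
    chomp $row;
    my ($customer, $date, $amount) = split /,/, $row;
    next unless defined $amount;         # skip malformed rows
    $total{$customer} += $amount;        # accumulate per customer
}
close $in;

# Emit a fixed-width summary another application can consume.
open my $out, '>', 'daily_summary.txt' or die "cannot write summary: $!";
printf {$out} "%-20s %12.2f\n", $_, $total{$_} for sort keys %total;
close $out;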

 

Today, the line between workflow language and application development environment has blurred, with most operating systems supporting multiple, functionally rich choices for workflow. Unix (and its variants) and Windows dominate large application arenas. A plethora of choices makes it possible (though usually not advisable) to write workflow suites in multiple languages simultaneously!

 

When there are many connections between applications, modification of a dataset becomes costly thanks to all the testing involved, and very risky without it. Creating a new connection, with its own data, takes less time and bears less risk for overtaxed development, QA, and operations departments (see textbox 1).

 

Ironically, productivity leverage often leads to misuse. A few scripts coordinating workflow between a few applications became thousands of scripts coordinating work and data transformations throughout the enterprise. Modern script languages can coordinate roles and provide many choices for audit and control features. Writing workflow to make this happen takes a little design time, often anathema to the quick fix that scripting promised to overworked IT groups. Scripts are easy to write, small, and quick. Their mushrooming numbers become a management problem, derisively called “stovepipe workflows” because they amount to so many individual 1-to-1 connections (see fig 1a).

 

 

EAI applications

 

In the late 1990s there emerged a new class of application promising to correct the excesses of the stovepipe model and provide command and control at the same time. The acronym EAI, Enterprise Application Integration, was coined to describe these applications, which are collections of programs that communicate with one another more-or-less automatically and internalize the workflow by providing the following essential services:

 

  1. Data transport around the local network as well as services for data exchange with remote networks.
  2. Mapping of data from one application into another.
  3. Visibility into the flow of data throughout the EAI application.
  4. Control (stopping, starting, debugging) of the workflow.

 

Of course, it was EAI going on in the stovepipe scripts all along, but no one thought to call it that. The new applications did deliver on some promises, but not without costs. They are complex applications with attendant bugs, and all that follows with respect to vendor relations and IT work-hours consumed finding workarounds when necessary. Expensive (in time and resources) to upgrade, they become an obstacle to rapid variation of the workflow environment. They are also expensive in dollars, costing even a medium-sized corporation or department hundreds of thousands in capital outlay while annual maintenance equals or exceeds the salary of a skilled developer or two. Even the stovepipes, though hidden, are not removed (see fig 1b). As with scripts, and for the same reason (risk), it turns out to be easier to create a new mapping for each integration!

 

True, the “internal stovepipes” run in an environment that automatically supports monitoring and control. That is not a bad feature, but it is well within the power of a thought-through scripted workflow. Another important feature of EAI software, enterprise-wide guaranteed data delivery by means of a virtual data bus, is also within the scope of modern script languages coupled with supportive operating systems! With EAI, these features come with the software rather than having to be written and tested as an underlying infrastructure.

 

However, someone must still create maps between applications that drive the behavior of the EAI-based workflow. By ‘maps’ I mean possibly complex configurations (usually including at least one explicit map) within the EAI software. Replacing strategic legacy scripts can sometimes take many months for one or more employees and several contractors! Creating a queue-based, enterprise-wide data bus, the core of EAI, takes only a few weeks in a modern script language if the enterprise is serious about it. Finding workarounds for unexpected limitations (and bugs) encountered while converting workflows to modern EAI software will typically double the estimated time needed to replace strategic scripts.

 

EAI products are complex, and complexity entails risk. Large applications cannot avoid complexity. So long as scripted workflows remain simple, they add little risk to the overall application environment. Stovepipe workflows become complex largely due to redundant data transport and transforms. Large, and sometimes even small, enterprises end up with hundreds of scripts that all perform the same role: moving data from one place to another while transforming it to suit the needs of consuming applications. EAI software succeeds in the marketplace because it consolidates control and reporting within itself. This, coupled with the genuine innovation of a built-in data bus, makes EAI products seem worth their high cost and all the issues arising from their own complexity.

 

The secret is on the bus!

 

The real technical innovation of EAI software is a reliable low-level mechanism for transporting data around the enterprise. Half the workflow battle comes down to making data produced by one application available to another without either having to know in what part of the network the other runs! Decoupling is accomplished by creating an abstraction layer, the data bus, and making it responsible for shuffling data around the enterprise.

 

Two bus architectures, queue-based and subject-based, compete in different EAI products. The queue approach is analogous to lines of commuters at a transit bus terminal. People (data) queue next to the bus that takes them to their destination (application). Each originating application (or its accompanying workflow) must know to which queue (or queues, if there are multiple consuming applications) the data belongs, but the bus itself embeds knowledge of where the consuming applications run on the network. Handshakes between the ends of the queue ensure accurate delivery of the data, or retransmission if necessary. Given the tools available on modern networks, queue-based data buses are easy to implement in scripts, C, or Java. Using Perl and sockets, I completed a prototype queue-based bus at a medium-sized financial services company in a single week.
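
The sketch below illustrates the queue idea in Perl with plain sockets. It is not the prototype just mentioned, only a minimal illustration; the port, spool directory, and one-line wire protocol are assumptions made for the example.

#!/usr/bin/perl
# Sketch of a queue endpoint. A sender transmits "QUEUENAME\n" followed
# by the payload; the listener appends the payload to a per-queue spool
# file and answers "ACK" as its delivery handshake. Port, spool path,
# and wire format are assumptions for the example.
use strict;
use warnings;
use IO::Socket::INET;

my $listener = IO::Socket::INET->new(
    LocalPort => 9099,
    Listen    => 5,
    Reuse     => 1,
) or die "cannot listen: $!";

while (my $client = $listener->accept) {
    my $queue = <$client>;                       # first line names the queue
    unless (defined $queue) { close $client; next; }
    chomp $queue;
    unless ($queue =~ /^\w+$/) { close $client; next; }

    my $payload = do { local $/; <$client> };    # slurp the rest as payload
    $payload = '' unless defined $payload;
    open my $spool, '>>', "/var/spool/bus/$queue.dat"
        or die "cannot spool to $queue: $!";
    print {$spool} $payload;
    close $spool;

    print {$client} "ACK\n";                     # handshake back to the producer
    close $client;
}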

 

In a subject-based bus architecture, originating applications (publishers) need not know anything about the consuming (subscriber) applications. An envelope containing a subject wraps the published data. Data travels throughout the bus system (subject to security controls), not only to particular queue endpoints. Subscriber applications examine the subjects traveling across the bus and extract those that apply to them. In a queue-based bus, adding a new application requires adding a new queue and enhancing the originating software to place data onto it. With a subject-based bus, in theory, a new application can subscribe to an existing subject without the publisher even being aware of it. I say “in theory” because it is rarely the case that new applications need exactly the same data as an existing subscriber. Moreover, if the data exchange requirements demand “guaranteed delivery” (data held and redelivered to an application that is initially unavailable), then the publisher must be aware of every subscriber to its subject.
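
A toy, in-process illustration of the publish-and-subscribe idea follows; a real bus does this across the network, and the subjects and handlers here are invented for the example.

#!/usr/bin/perl
# Toy illustration of subject-based delivery within one process.
# Subjects, handlers, and the payload are invented for the example.
use strict;
use warnings;

# Subscribers register a subject pattern and a handler; the publisher
# never needs to know who they are.
my @subscribers = (
    { pattern => qr/^invoice\./,        handler => \&load_billing     },
    { pattern => qr/^invoice\.shipped/, handler => \&notify_warehouse },
);

sub publish {
    my ($subject, $payload) = @_;
    for my $s (@subscribers) {
        $s->{handler}->($subject, $payload) if $subject =~ $s->{pattern};
    }
}

sub load_billing     { print "billing received $_[0]\n" }
sub notify_warehouse { print "warehouse received $_[0]\n" }

publish('invoice.shipped', '<INVOICE>...</INVOICE>');   # both subscribers fire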

 

EAI products wrap application data in a proprietary (XML-based) envelope for transport through the bus. Each of these products provides a means (GUI-based) for declaring application data structures. Publishing utilities map source data to the bus format while subscribing utilities map the bus format to the input of the consuming application. Vendor standardization of the bus format makes application-to-application mapping possible. It also demands use of the vendor’s EAI utilities to perform the connection. The bus format, though accessible, is not immediately meaningful to other enterprise applications. For all practical purposes, developers must specify at least three elements for each connection: 1) the data structure of the source, 2) the data structure of the consumer, and 3) the map between the two.

 

Data-Driven Connections:

 

EAI applications do save developers from the need to think through their own workflow control, visibility, transport, and translation mechanisms. In return, they are expensive and complex applications in their own right, and developers must still map relationships between data as well as specify the data structures on both sides of every integration (see fig 2).

 

Different applications will always demand different data formats. Even with XML as a near-universal syntactic convention, one application will want to see <applicant full name> while another will want <first>, <middle>, <last>. Each will expect different attributes inside these tags, and so on. DDLs and schemata can resolve only some of the possible differences. RosettaNet and other standards bodies can help to resolve these differences. At one level or another, however, there will be some requirement to map a semantic data standard to an application schema, whether by script or by a GUI (web) accessible utility. That leaves the I/O map as the only point around which to simplify the workflow.

 

Mapping is possible because vendors standardize their bus format. Mapping is necessary because the format is inherently meaningless to the enterprise. Making the format used to transport data meaningful to enterprise applications obviates explicit mapping. Each application could pluck the data it needs from a predefined pool.

 

In theory a vendor’s bus format can be read in the same way as a format specifically created by the enterprise. Two issues make this problematic:

 

  1. Vendor formats are always more complex than necessary for any given enterprise. While it is possible to write scripts that READ the format, it is much more difficult to write the format correctly. Every time an application needs a new data element, both the writing and the reading of that element require development work.
  2. Vendor products tie data format and bus transport protocols together. It is impossible to separate the data itself from the mechanics of the bus. This means that if an application does not write data in the vendor format correctly, it will not only fail to be meaningful to the consuming application; transport will fail too.

 

If, instead, an enterprise separates the transport from the data, two possibilities emerge:

 

  1. The enterprise can define a simple format for all its data that is both easy to write and read by applications under its direct control.
  2. The enterprise can mix and substitute different format and transport technologies at will.

 

Combining transport and semantic data format in the same layer complicates EAI transport. An enterprise cannot alter format without replacing transport, and vice versa. Yet there is nothing inherently semantic about a data bus, and data transport across the enterprise can take various forms: socket, NFS, JMS, FTP, and so on. Separating format from transport therefore makes sense. Transport need have no bearing on the relations between applications, which take place purely through data. Put another way, the connection between applications at the data layer should be agnostic with respect to transport. This allows the enterprise to substitute one transport for another, as technology improves or corporate needs change, without any impact on the transported data and therefore on the integration between applications.
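
One way to keep that separation, sketched here in Perl: the workflow hands a finished standard-format file to a named transport, and only the small handlers know about NFS paths or sockets. The share, host, and port are assumptions for the illustration.

#!/usr/bin/perl
# Sketch of a transport layer hidden behind a narrow interface. The
# calling workflow only says "send this file via that transport"; the
# NFS path, host, and port below are assumptions for the example.
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use IO::Socket::INET;

my %transport = (
    # Drop the file on an NFS-mounted share the consumer polls.
    nfs => sub {
        my ($file) = @_;
        copy($file, '/mnt/shared/inbound/' . basename($file))
            or die "nfs copy failed: $!";
    },
    # Push the file's contents over a TCP socket to a remote listener.
    socket => sub {
        my ($file) = @_;
        my $peer = IO::Socket::INET->new(PeerAddr => 'edi-host:9099')
            or die "connect failed: $!";
        open my $fh, '<', $file or die "cannot read $file: $!";
        print {$peer} $_ while <$fh>;
        close $fh;
        close $peer;
    },
);

sub send_standard_file {
    my ($file, $how) = @_;
    my $handler = $transport{$how} or die "unknown transport: $how";
    $handler->($file);
}

send_standard_file('INV_12345678.dat', 'nfs');   # swap 'nfs' for 'socket' at will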

 

How the process works:

 

Survey producing and consuming applications for the data they need to meet the requirements of the integration. Design a corporation-standard data structure from the results of the survey. The design goals are simplicity for both producers (writes) and consumers (reads) while capturing relations within the data. For example, an invoice has lines that ship in different boxes. Some lines ship in more than one box:

 

<INVOICE nbrinset=1>12345678, 20080525150205

            <ILINE linenbr=1>SKU2501,25,12.00,.10,270.00

                        <BOX boxnbr=1>TRK12345,15,20,"1 of 2"</BOX>

                        <BOX boxnbr=2>TRK12346,10,18,"2 of 2"</BOX>

            </ILINE>

            <ILINE linenbr=2>SKU3455,10,15.00,.0,150.00

                        <BOX boxnbr=1>TRK12346,5,10,"2 of 2"</BOX>

            </ILINE>

</INVOICE>

 

Notice that in this hypothetical example, the attributes of the tags are metadata used by scripts to determine where they are in the overall hierarchy. The data delivered to the application appears as comma-delimited fields in the data portion of the tag. The ILINE contains sku, quantity, pricing, discount, etc. The BOX tag has box tracking numbers, weights, the line quantity contained in the box, and a box label. Of course, each of the elements could have its own tag. Tailor the design to suit the needs of the organization!
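
To show how small the semantic ‘reader’ can be, here is a Perl sketch that walks the hypothetical structure above and builds a nested data structure (invoice, lines, boxes). Field names follow the example; a production version would add validation and error handling.

#!/usr/bin/perl
# Sketch of a reader for the hypothetical structure above: one
# line-oriented pass builds a nested Perl structure (invoice -> lines
# -> boxes). Tag and field names come from the example, not a product.
use strict;
use warnings;
use Data::Dumper;

my (%invoice, $line);                     # current invoice and invoice line
while (my $rec = <DATA>) {
    if ($rec =~ /<INVOICE[^>]*>\s*(.+)/) {
        @invoice{qw(number timestamp)} = split /\s*,\s*/, $1;
    }
    elsif ($rec =~ /<ILINE[^>]*>\s*(.+)/) {
        my %l;
        @l{qw(sku qty price discount extended)} = split /\s*,\s*/, $1;
        push @{ $invoice{lines} }, ($line = \%l);
    }
    elsif ($rec =~ /<BOX[^>]*>\s*(.+?)<\/BOX>/) {
        my %b;
        @b{qw(tracking weight qty label)} = split /\s*,\s*/, $1;
        push @{ $line->{boxes} }, \%b;
    }
}
print Dumper(\%invoice);                  # nested structure, ready for a loader

__DATA__
<INVOICE nbrinset=1>12345678, 20080525150205
  <ILINE linenbr=1>SKU2501,25,12.00,.10,270.00
    <BOX boxnbr=1>TRK12345,15,20,"1 of 2"</BOX>
  </ILINE>
</INVOICE>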

 

Customize producing applications to produce the standard, or write scripts that select data from the application’s DBMS layer and produce standard data as output. Segregate data in the abstract from the transported physical structures (files). One script understands “getting” application data and writing the standard structure. Another knows how the structure relates to the physical file or message used to transport the data. For example, an entire invoice might correspond to a single physical file. Alternatively, put header information into one file and lines in another, something that might prove more efficient for a socket-based transport if invoices are large.
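
A producer-side sketch of that split follows, assuming Perl with DBI against a hypothetical invoice_lines table; the DSN, credentials, table, columns, and file locations are invented for the illustration.

#!/usr/bin/perl
# Sketch of the two-layer producer side. build_invoice() knows the
# application's tables and the standard format; write_transport() only
# knows file naming and location. The DSN, credentials, table, columns,
# and paths are invented for the example.
use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

# Semantic layer: pull application rows and emit the standard structure.
sub build_invoice {
    my ($dbh, $inv_nbr) = @_;
    my $stamp = strftime('%Y%m%d%H%M%S', localtime);
    my $doc   = "<INVOICE nbrinset=1>$inv_nbr, $stamp\n";
    my $lines = $dbh->selectall_arrayref(
        'SELECT linenbr, sku, qty, price, discount, extended
           FROM invoice_lines WHERE invoice_nbr = ?',
        undef, $inv_nbr);
    for my $l (@$lines) {
        my ($nbr, @fields) = @$l;
        $doc .= "  <ILINE linenbr=$nbr>" . join(',', @fields) . "\n  </ILINE>\n";
    }
    return $doc . "</INVOICE>\n";
}

# Transport layer: relate the structure to a physical file on the share.
sub write_transport {
    my ($inv_nbr, $doc) = @_;
    my $file = "/mnt/shared/outbound/INV_$inv_nbr.dat";
    open my $fh, '>', $file or die "cannot write $file: $!";
    print {$fh} $doc;
    close $fh;
    return $file;
}

my $dbh = DBI->connect('dbi:Oracle:prod', 'wfuser', 'secret',
                       { RaiseError => 1 });
write_transport($_, build_invoice($dbh, $_)) for @ARGV;   # invoice numbers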

 

On the consuming side the reverse process occurs. One module knows how to read data from transport. Another makes it available to the consuming application (see textbox 2). Configure the “transport reader” to the specific mechanism used for that application. If alternative transports are in use, it must understand the alternatives and check them all for available data. The semantic layer (module) loads the consuming application database. This might be a script reading the corporate data format and writing directly to a DBMS, or a translation to some application-specific API.
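
And the consumer-side counterpart, again a sketch: a transport reader polls the shared directory while a semantic routine loads what this application needs into its own tables. The directory, DSN, and table names are again assumptions.

#!/usr/bin/perl
# Consumer-side sketch: the transport layer finds files the producer
# dropped on the share; the semantic layer pulls invoice number, sku,
# and qty out of the standard structure and loads the consuming
# application's schema. Paths, DSN, and table names are assumptions.
use strict;
use warnings;
use DBI;

my $inbound = '/mnt/shared/inbound';
my $dbh = DBI->connect('dbi:Oracle:warehouse', 'wfuser', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare(
    'INSERT INTO wh_invoice_lines (invoice_nbr, sku, qty) VALUES (?, ?, ?)');

# Transport layer: each matching file is one published invoice.
for my $file (glob "$inbound/INV_*.dat") {
    open my $fh, '<', $file or die "cannot read $file: $!";
    my ($inv_nbr, @rows);
    while (<$fh>) {
        $inv_nbr = (split /\s*,\s*/, $1)[0] if /<INVOICE[^>]*>\s*(.+)/;
        push @rows, [ (split /,/, $1)[0, 1] ] if /<ILINE[^>]*>\s*(.+)/;
    }
    close $fh;

    # Semantic layer: load what this application needs, then consume the file.
    $sth->execute($inv_nbr, @$_) for @rows;
    $dbh->commit;
    unlink $file or warn "could not remove $file: $!";
}
$dbh->disconnect;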

 

At some point, someone is going to point out that there is an implicit mapping in creating the enterprise “standard data format”. Because the SDF is a product of producer and consumer requirements, the designer is implicitly mapping what producers create to what consumers need, and this is true! But the meaningfulness of the data allows the enterprise to map once, and not every time a new connection is made between applications. Naming the data, and making the names meaningful to enterprise applications, obviates explicit mapping (see fig 3). Naming simplifies the architecture compared with either the stovepipe model or the bus model of EAI products. One could implement every module represented in the fig 3 diagram, including the bus itself, in a different tool or language. If new technology comes along, the enterprise is free to substitute, mix, or match as desired. With a vendor EAI product, by comparison, vendor-supplied tools control every step in the EAI process! Nothing can be changed without changing everything!

 

Before discussing an implementation, there is something else to notice about this proposed architecture. Implicit mapping becomes explicit if the enterprise (or department) does not control application integration at the database. Data must be extractable with the equivalent of SELECT statements and addable with INSERT, UPDATE, etc. If the only access to data or the database is through an API (third-party applications or increasingly popular “virtual applications”) then some explicit mapping to and from the API will always be required. In such cases, data driven workflow, while a sound basis for development of well layered control and reporting systems, does not remove an explicit mapping requirement.

 

A successful implementation in the late 1990s proved the data-driven idea, integrating a large OLTP application with a high-volume EDI system. A tagged transport data structure was created reflecting both trading partner and application requirements, including not only normal data (customer names and invoice amounts) but also internal data like user-printer locations. For simplicity, each specific EDI document was given its own filename mask. Two routines, a reader and a writer of the standard structure, were written for Oracle (in PL/SQL, made available to all procedures through a public package) and for Perl (available as an exported module).

 

The files themselves were made available network-wide using the simple expedient of the network file system (NFS). As applications published (outbound from the DBMS or inbound via the EDI software), they updated DBMS tables with information about the filename, timestamps, and elements written to the file. Any application could examine these tables and acquire data by opening the appropriate files. Additional control over application execution was provided by a simple socket system for secure command execution throughout the company.

 

Integration of several other applications with the primary OLTP DBMS was accomplished with relative ease. These included third-party centralized fax software supporting both inbound and outbound faxing from any application with a print queue. Every read and write script reported its status in a standard way and could be controlled (shut down, restarted, etc.) using simple tools made easily available to authorized users.

 

Where we didn’t control access to application data directly, we wrote a sublayer to map the corporate structure to an application API! Writing maps only when necessary accelerated workflow development. Working with this model through a corporate buy-out showed its flexibility. Connections to and from new applications were made quickly and hooked into the standard reporting and control structures. Overall, the entire corporate workflow was converted to a semantic model, with no capital or licensing expense, in about the same time required to bring enterprise workflow under the control of a modern EAI product.

 

Conclusions:

 

Data-driven workflow works! It is especially powerful where companies control their application databases. It results in a workflow that is easy to maintain and extend, needing far fewer computer resources than either the stovepipe model or present-day EAI software. But “easy” does not mean “effortless”. The process requires some effort: the data flowing between targeted applications must be understood, and a flexible, extensible data structure that suits the enterprise’s needs must be defined. In terms of time and expense, the effort demanded is usually far smaller than that needed to install an EAI product and convert major workflows over to it. Having made the effort, the reward, in low-cost maintenance and extensibility, is great: correcting a stovepipe nightmare, or obviating an expensive EAI application suite!