April 17, 2020

Optimizing Transforms in DataStage with CoSort

ITKeySource, an ETL consultancy in Jacksonville, FL, recently benchmarked relative performance gains running IRI CoSort — and its SortCL program in particular — alongside IBM DataStage.

If you want to Gain In-depth Knowledge on DataStage, please go through this link DataStage Training

CoSort has been used for years in this capacity in high volume data transformation environments, like this one on Solaris. But in this case, we tested them with this:

Hardware:

  • Dell PowerEdge R230 Server
  • Intel® Xeon® E31230 v5 3.4GHz, 8M cache, 4C/8T
  • 16GB 2133MT/s UDIMMs
  • 1TB 7.2K RPM SATA 6Gbps 3.5in

and

Software:

  • IBM InfoSphere DataStage® and QualityStage® 8.7
  • IRI Voracity® 1.0 & CoSort® 9.5.3
  • IRI Workbench, a free GUI option, built on Eclipse™:

In this POC, I transformed flat files with combined row counts of 2MM, 5MM, 20MM, and 50MM in the CoSort SortCL program, IBM DataStage, and then together from within DataStage. Below are the times it took to sort, join, aggregate, and filter two input files and create two output

In addition to the speed differences, which I try to account for in this article, I also explain the methodology by which I ran the same transform jobs … first alone, and then together. Before I do that, it should be noted that the raw data used for this job was randomly generated with the same SortCL executable using embedded ‘RowGen’ functionality.1

CoSort Process

The IRI CoSort product — and its SortCL data definition and manipulation 4GL program doing all the transforms — is shown above in the free IRI Workbench Eclipse GUI’s syntax-aware editor.

SortCL is fast primarily because it combines the transforms (and other functionality) in the same job script and I/O pass. It also uses the CoSort engine in the file system, which takes advantage of multi-threading, modern memory management and I/O optimization techniques. And, it is not a compiled program.

SortCL jobs can be hand-written, generated via a job wizard, or ETL workflow palette (with an IRI Voracity data management platform subscription2). The scripts are easy-to-read text files you can run (ad hoc or scheduled) in the GUI, or from the command line. That is how I called it in DataStage below.

The CoSort ‘SortCL’ script is shown above, with an outline, in the IRI Workbench (Eclipse) GUI. IRI Voracity ETL workflow and transform mapping diagrams of the same job are below those. All the metadata is: 1) re-entrant regardless of work-style, 2) manageable in eGit and other repositories, and 3) compatible with Erwin (formerly AnalytiX DS) Mapping Manager software and CATfx templates for spreadsheet-style mapping definition, stratification, and conversion to/from other ETL tools.3

DataStage Process

Below is the DataStage job design required to sort and join the same two sequential files, then aggregate, and finally, filter them before writing out to the same two separate sequential files. In addition to taking longer to build and run those jobs in DataStage, they must be compiled:

Transforming with CoSort in DataStage

To combine the processes for the benefit of my DataStage operations, I wrote a simple to call my SortCL script to do the transformation ‘magic’ externally. Rather than using a traditional sequential file-stage call, it was more seamless to use a built-in, Before-job subroutine. This allows me to execute CoSort SortCL on the command line (sortcl /spec=join.scl) from my Windows (DOS) or Unix (Korn) shell, even before any DataStage processes begin:

The benefit of running CoSort in this environment is immediately apparent:

Given the same source data and targets, it is almost 10X faster at the 50-million-row level to in-source transformations to CoSort through a DataStage subroutine.

To process sources with more than 50 million rows, many DW architects would upgrade to either Parallel Extender or run DataStage on a Unix/Linux server license where the hardware provides more power. CoSort runs in those same environments with even more speed, however, and is much less expensive than either alternative. Read On To Find More DataStage Online Training

If you would like more information or have additional suggestions, please provide feedback in the comment form below.