Multinational Energy Firm Retires Aging DataStage 11.7 Infrastructure, Migrates 3,400 Jobs to Databricks

MigryX Case Study • April 2026 • Energy & Utilities

Executive Summary

A multinational energy and utilities corporation with generation, transmission, and retail operations across three continents faced an inflection point with its IBM DataStage 11.7 estate. With IBM's extended support window closing, a complex portfolio of 3,400 parallel, server, and sequence jobs, and a strategic mandate to migrate to Google Cloud Platform with Databricks as the unified analytics engine, the organization engaged MigryX to execute the migration at machine speed. Over 16 months, MigryX parsed all DataStage .dsx export files, mapped every stage type to equivalent PySpark transformations, and orchestrated the resulting pipelines through Databricks Workflows. The program delivered over 1 million lines of PySpark, throughput improvements of 4–8X on critical settlement and grid metering jobs, and a projected $7.1 million in three-year savings from eliminated DataStage licensing and infrastructure decommissioning.

Client Overview

The client is a vertically integrated energy group with operating segments spanning natural gas production, electricity generation, high-voltage transmission network management, and direct-to-consumer retail supply. Their data infrastructure underpins real-time grid operations analytics, regulatory reporting to energy market operators in multiple jurisdictions, wholesale energy trading settlement, and customer billing for millions of residential and commercial accounts. The DataStage platform had been the backbone of their enterprise data warehouse loading processes since the early 2010s, ingesting data from SCADA systems, smart meter platforms, trading systems, and customer information systems into an Oracle Exadata-based EDW.

The organization had already committed to a multi-year cloud migration strategy anchored on Google Cloud Platform, with Databricks on GCP selected as the unified data engineering and analytics platform. The DataStage estate represented the largest single migration workstream in the cloud journey, and its complexity had repeatedly caused timeline slippage when evaluated for manual rewriting approaches.

Business Challenge

The MigryX technical discovery audit surfaced a range of challenges that had made previous migration attempts stall at the proof-of-concept stage:

- Scale and heterogeneity: a portfolio of 3,400 parallel, server, and sequence jobs accumulated since the early 2010s, under the pressure of IBM's closing extended support window.
- Custom C transformer stages whose settlement and billing logic existed only in source code, with no formal business specifications.
- IPC stages that tightly coupled job pairs and added serialization overhead to long-running settlement pipelines.
- Sequence job orchestration built on conditional branches and event-driven triggers that had to be re-expressed on a cloud-native orchestrator.
- Regulatory reporting and wholesale trading settlement workloads that required converted pipelines to produce results identical to the legacy jobs.
- Manual-rewrite complexity estimates so daunting that two prior programs never advanced past the approval stage.

The MigryX Approach

The migration program was structured in four phases: Discovery & Classification, Automated Conversion, Validation & Parallel Run, and Cutover & Decommission. MigryX's Discovery Engine performed the initial parse of all DataStage .dsx exports — which encode jobs as structured XML describing stage topology, link metadata, and property configurations — and generated a complexity heat map across all 3,400 jobs within the first two weeks of engagement. This heat map revealed that 55% of jobs were suitable for fully automated conversion with no human intervention, 35% required configuration review, and 10% (predominantly those using custom C transformer stages) required deep engineering engagement.
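For illustration, a minimal sketch of the kind of stage-census pass such a discovery engine performs on a parsed export. This is not MigryX's actual implementation: the Stage element name, type attribute, and tier thresholds below are all assumptions standing in for the real .dsx schema and scoring model.

```python
# Sketch of a stage-census pass over one exported DataStage job.
# Element and attribute names are illustrative, not the real .dsx schema.
import xml.etree.ElementTree as ET
from collections import Counter

# Stage types the case study treats as requiring deep engineering.
HIGH_EFFORT_STAGES = {"CustomCTransformer", "IPC"}

def classify_job(dsx_path: str) -> dict:
    """Count stage types in one job and assign a rough complexity tier."""
    tree = ET.parse(dsx_path)
    stage_counts = Counter(
        stage.get("type", "Unknown")
        for stage in tree.getroot().iter("Stage")  # hypothetical element name
    )
    if stage_counts.keys() & HIGH_EFFORT_STAGES:
        tier = "deep-engineering"      # ~10% of this estate
    elif sum(stage_counts.values()) > 25:  # invented threshold
        tier = "configuration-review"  # ~35%
    else:
        tier = "fully-automated"       # ~55%
    return {"stages": dict(stage_counts), "tier": tier}
```

Aggregating these per-job classifications across the estate is what produces the complexity heat map described above.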

For the core automated conversion, MigryX's DataStage-to-PySpark translation engine processed each job's stage graph, resolving link schemas from DataStage's table definition repository and emitting equivalent PySpark DataFrame transformation chains. The engine handles DataStage's partition-aware execution model by mapping partition-level operations to Spark's native distributed shuffle and repartition operations, ensuring that sort-merge joins and aggregation operations that previously relied on DataStage's sort stage configurations produce identical results on Databricks' distributed execution engine.
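To show the shape of the emitted code, here is a hedged sketch of a converted job: a hash-partitioned sort-merge join feeding an Aggregator stage, rendered as a PySpark DataFrame chain. The table names, columns, and edw schema are invented for illustration; real output resolves them from the DataStage table definition repository.

```python
# Illustrative shape of an emitted pipeline: join + aggregate on meter data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

meter_reads = spark.read.table("edw.meter_reads")
meter_points = spark.read.table("edw.meter_points")

settled = (
    meter_reads
    # A DataStage hash-partition + sort stage config maps to an explicit
    # repartition so join keys land on the same executors.
    .repartition("meter_point_id")
    .join(meter_points, on="meter_point_id", how="inner")
    .groupBy("meter_point_id", "settlement_period")
    .agg(F.sum("kwh").alias("total_kwh"))
)

settled.write.format("delta").mode("overwrite").saveAsTable("edw.settled_volumes")
```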

The IPC stage challenge was addressed through a systematic dependency analysis that determined whether each IPC-connected job pair could be collapsed into a single Spark job (where data volumes permitted) or should be decoupled via a Delta Lake intermediate table. In 78% of cases, job pairs were collapsed, reducing overall pipeline latency by eliminating the IPC serialization overhead. The remaining 22% were decoupled with Delta Lake tables that provide equivalent data sharing with the added benefit of checkpointing and restart capability for long-running settlement jobs.
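A minimal sketch of the decoupled pattern, with invented table names: the upstream job persists to a Delta handoff table that the downstream settlement task reads and can restart from, replacing the in-memory IPC link.

```python
# Decoupled IPC pattern: durable Delta handoff instead of an IPC link.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Upstream job: persist what used to flow over the IPC link.
upstream = spark.read.table("edw.trade_positions")  # illustrative source
(upstream.write.format("delta")
    .mode("overwrite")
    .saveAsTable("staging.ipc_settlement_handoff"))

# Downstream job (a separate Databricks Workflows task) is restartable,
# because the handoff now lives in a versioned Delta table.
handoff = spark.read.table("staging.ipc_settlement_handoff")
```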

Custom C transformer stages were handled through a structured source-extraction and specification process: MigryX engineers extracted the C source, worked with the client's domain SMEs to produce formal business specifications, and then generated PySpark UDFs or native DataFrame operations implementing the specified logic. All custom logic conversions were subject to extended validation against three years of historical settlement data before promotion.

Sequence job orchestration was converted to Databricks Workflows task graphs, with conditional execution branches mapped to Databricks' built-in conditional task logic and event-driven triggers replaced with Databricks REST API calls from the client's existing GCP Pub/Sub event bus.
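As a sketch of that event-driven handoff, the pattern amounts to a Pub/Sub-driven function calling the Databricks Jobs API run-now endpoint. The host, token sourcing, and job ID below are placeholders rather than the client's actual configuration.

```python
# Sketch: trigger a Databricks Workflow run from a Pub/Sub-driven function.
import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. workspace URL (placeholder)
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # token sourcing is assumed

def trigger_settlement_workflow(job_id: int) -> int:
    """Kick off a Databricks Workflow run; returns the run ID."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": job_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]
```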

DataStage Stage Mapping Reference

| DataStage Stage Type | PySpark / Databricks Equivalent | Notes |
|---|---|---|
| Sequential File Stage | spark.read.csv() / .write.csv() | Schema inferred from DataStage column definitions |
| Aggregator Stage | groupBy().agg() | All standard aggregation functions mapped 1:1 |
| Join Stage (Hash / Sort-Merge) | DataFrame.join() with broadcast hints | Join type and key columns preserved from link metadata |
| Transformer Stage (standard) | withColumn() / select() chain | BASIC-derived expression syntax converted to PySpark |
| Transformer Stage (custom C) | PySpark UDF (Python) or native DataFrame op | Requires specification review by domain SME |
| Filter Stage | DataFrame.filter() | DataStage constraint syntax converted to Spark SQL expressions |
| Sort Stage | DataFrame.orderBy() or implicit shuffle sort | Partition-level sort collapsed where downstream allows |
| Lookup Stage (Sparse/Normal) | Broadcast join or Delta Lake lookup table | Sparse lookup policy mapped to left outer join semantics |
| IPC Stage | Delta Lake intermediate table or single Spark job | Dependency analysis determines consolidation vs. decoupling |
| ODBC Stage | Databricks JDBC connector with connection pool config | Vendor-specific SQL dialect preserved in pushdown queries |
| Sequencer Job | Databricks Workflow task graph | Conditional branches mapped to Databricks conditional tasks |
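To make the custom transformer row concrete, a hedged sketch contrasting the Python UDF path with the preferred native DataFrame expression. The tariff-banding rule and table names are invented stand-ins for the real settlement logic.

```python
# Sketch of a custom C transformer rewritten from its business specification.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
settled = spark.read.table("edw.settled_volumes")  # illustrative input

# UDF path: used when the specified logic does not map to built-ins.
@F.udf(returnType=StringType())
def tariff_band(total_kwh):
    # Invented banding rule standing in for the real settlement logic.
    if total_kwh is None:
        return "UNKNOWN"
    return "PEAK" if total_kwh > 1000.0 else "STANDARD"

with_udf = settled.withColumn("tariff_band", tariff_band("total_kwh"))

# Native path: preferred where the spec reduces to simple expressions,
# avoiding Python serialization overhead on hot settlement pipelines.
with_native = settled.withColumn(
    "tariff_band",
    F.when(F.col("total_kwh") > 1000.0, F.lit("PEAK")).otherwise("STANDARD"),
)
```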

Key Migration Highlights

- 55% of jobs converted with no human intervention; 35% needed configuration review; 10% required deep engineering engagement.
- 78% of IPC-connected job pairs collapsed into single Spark jobs, eliminating IPC serialization overhead.
- All custom C transformer conversions validated against three years of historical settlement data before promotion.
- Sequence orchestration rebuilt on Databricks Workflows with conditional task logic and Pub/Sub-driven REST triggers.
- Settlement batch runtime reduced from 5 hours to 38 minutes.
- 18 physical servers decommissioned across two legacy data centers.

Security & Compliance

The client is subject to energy sector regulatory requirements including NERC CIP (North American Electric Reliability Corporation Critical Infrastructure Protection) standards, GDPR for European residential customer data, and national grid operator data reporting standards in each jurisdiction. The migration program was conducted under a strict data sovereignty framework with dedicated security review gates at each phase transition.

Results & Business Impact

The program met or exceeded the business case projections across all primary KPIs established at initiation. The 4–8X performance improvement range reflected both the inherent parallelism advantages of Databricks over DataStage's SMP-bounded execution and the architectural improvements enabled by the migration, particularly the IPC stage consolidation and Delta Lake Z-ordering on the high-cardinality meter point reference datasets.
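For reference, the Z-ordering step described above amounts to a Delta Lake OPTIMIZE command on Databricks; the table and column names here are illustrative stand-ins for the meter point reference data.

```python
# Databricks-only Delta Lake maintenance command; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("OPTIMIZE edw.meter_points ZORDER BY (meter_point_id)")
```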

- 3,400 DataStage jobs migrated
- 1M+ lines of PySpark generated
- 4–8X average pipeline performance gain
- $7.1M projected savings over 3 years
- 16 months end-to-end migration duration
- 38 min settlement batch runtime (was 5 hrs)

The $7.1 million savings projection encompasses $4.2 million in eliminated IBM DataStage licensing and InfoSphere Information Server infrastructure costs, $1.9 million in reduced operational overhead from simplified orchestration and monitoring, and $1.0 million in avoided Oracle Exadata capacity expansion that would have been required to handle projected data volume growth under the old architecture. The migration also enabled the client to decommission 18 physical servers in two legacy data centers, accelerating their data center consolidation program by an estimated 18 months.

"The DataStage estate was the crown jewel of our legacy infrastructure — and the biggest blocker to our cloud strategy. We had evaluated manual rewrites twice and each time the complexity estimates were so daunting that the program never got past the approval stage. MigryX made the timeline and cost feasible. The automated conversion coverage exceeded our expectations, and the quality of the generated code for even our most complex parallel jobs was production-ready with minimal review. We are now running settlement jobs intraday that we could never have contemplated on DataStage."

— Chief Data Officer, Multinational Energy & Utilities Group

Ready to Modernize Your DataStage Estate?

See how MigryX can accelerate your migration to Databricks with parser-driven automated conversion and full stage-level validation.

Explore Databricks Migration →