Oracle Data Integrator (ODI) occupies a unique position in the ETL landscape. Unlike traditional ETL tools that extract data, transform it in a middle-tier engine, and load results to a target, ODI uses an ELT (Extract-Load-Transform) architecture that pushes transformation logic to the target database engine. This is implemented through Knowledge Modules (KMs) — code templates in SQL, PL/SQL, or Jython that generate database-specific scripts at runtime. The Knowledge Module abstraction makes ODI powerful but also creates a deep dependency on the ODI runtime, the ODI repository, and Oracle-specific SQL dialects.
Databricks provides a modern lakehouse platform where the ELT philosophy is native: data lands in Delta Lake on cloud object storage, and PySpark or Spark SQL transformations execute on elastic Spark clusters directly against that storage. There is no middle-tier engine, no proprietary repository, and no code-generation abstraction layer. This guide maps every major ODI concept — Knowledge Modules, interfaces/mappings, packages, scenarios, load plans, variables, sequences, journalizing, and topology — to its Databricks equivalent, with production-ready code examples.
ODI Architecture vs. Databricks Lakehouse Architecture
ODI's architecture centers on three key abstractions: the Topology (physical and logical architecture definitions), the Model (metadata about data stores), and the Project (interfaces, packages, and procedures that define transformation logic). The ODI Agent orchestrates execution, calling Knowledge Modules that generate and execute SQL on source and target databases. The ODI Repository (master and work repositories) stores all metadata, versioning, and execution history.
Databricks replaces this entire stack with a unified platform. Unity Catalog serves as the metadata layer (replacing ODI Topology and Model). PySpark notebooks and Databricks SQL replace interfaces and Knowledge Modules. Databricks Workflows replace packages, scenarios, and load plans. Delta Lake provides the target storage with ACID transactions, and Spark clusters replace the ODI Agent as the compute engine.
| ODI Concept | Databricks Equivalent | Notes |
|---|---|---|
| Integration Knowledge Module (IKM) | PySpark transformation / Delta Lake MERGE | Target-side transformation and loading logic |
| Loading Knowledge Module (LKM) | Auto Loader / COPY INTO / JDBC read | Source-to-staging data extraction |
| Check Knowledge Module (CKM) | Delta Lake expectations / PySpark validation | Data quality checks with quarantine handling |
| Interface / Mapping | PySpark notebook / Spark SQL script | Source-to-target transformation definition |
| Package | Databricks Workflow (multi-task job) | Orchestrated sequence of transformation steps |
| Scenario | Databricks Job (compiled, schedulable) | Deployed, versioned execution unit |
| Load Plan | Databricks Workflow with task dependencies | DAG-based orchestration with parallelism |
| ODI Variable | Databricks widget / job parameter | Runtime parameterization |
| ODI Sequence | monotonically_increasing_id() / row_number() | Surrogate key generation |
| ODI Agent | Databricks cluster / SQL warehouse | Elastic compute with auto-scaling |
| Journalizing (CDC) | Delta Lake Change Data Feed (CDF) | Row-level change tracking with timestamps |
| ODI Topology (Physical/Logical) | Unity Catalog (catalog / schema) | Three-level namespace with governance |
| Flow Control (CKM) | Delta Lake expectations / quality checks | Constraint validation with error tables |
| ODI Repository | Unity Catalog + Git Repos | Metadata storage with version control |
Oracle ODI to Databricks migration — automated end-to-end by MigryX
Understanding Knowledge Module Translation
The core challenge of ODI migration is that Knowledge Modules are not simple transformations — they are code templates that generate SQL dynamically based on the interface definition, source/target metadata, and KM options. A single IKM like "IKM Oracle Incremental Update" contains dozens of template steps that produce INSERT, UPDATE, MERGE, and error-handling SQL. Migrating from ODI to Databricks means replacing this code-generation layer with explicit PySpark or Spark SQL that achieves the same result.
IKM Pattern: Incremental Update (Append/Merge)
The most common IKM pattern is incremental update, where incoming data is merged into a target table based on primary key matching. In ODI, this is handled by IKMs like "IKM Oracle Incremental Update" or "IKM SQL Incremental Update." In Databricks, Delta Lake MERGE provides this natively with ACID guarantees.
# ODI IKM Oracle Incremental Update equivalent
# ODI generates:
# 1. CREATE staging table (C$ table)
# 2. INSERT INTO staging FROM source (via LKM)
# 3. UPDATE target WHERE keys match AND data differs
# 4. INSERT INTO target WHERE keys don't exist in target
# 5. Optional: DELETE from target WHERE keys don't exist in source (full refresh)
# 6. DROP staging table
# Databricks Delta Lake MERGE: All steps in one atomic operation
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# Step 1: Read source data (replaces LKM + staging)
source_df = (spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
.option("dbtable", "source_schema.customer_orders")
.option("user", dbutils.secrets.get("oracle", "username"))
.option("password", dbutils.secrets.get("oracle", "password"))
.load()
)
# Step 2: Apply transformations (replaces ODI interface mappings)
transformed = (source_df
.withColumn("order_amount_usd",
F.when(F.col("currency") == "EUR", F.col("amount") * 1.08)
.when(F.col("currency") == "GBP", F.col("amount") * 1.27)
.otherwise(F.col("amount"))
)
.withColumn("order_quarter",
F.concat(
F.year("order_date").cast("string"),
F.lit("-Q"),
F.quarter("order_date").cast("string")
)
)
.withColumn("_load_timestamp", F.current_timestamp())
.select(
"order_id", "customer_id", "order_date",
"order_amount_usd", "order_quarter",
"product_id", "quantity", "status", "_load_timestamp"
)
)
# Step 3: MERGE into target (replaces IKM incremental update)
target_table = DeltaTable.forName(spark, "silver.customer_orders")
target_table.alias("tgt").merge(
transformed.alias("src"),
"tgt.order_id = src.order_id"
).whenMatchedUpdate(
condition="""
tgt.order_amount_usd != src.order_amount_usd OR
tgt.status != src.status OR
tgt.quantity != src.quantity
""",
set={
"customer_id": "src.customer_id",
"order_date": "src.order_date",
"order_amount_usd": "src.order_amount_usd",
"order_quarter": "src.order_quarter",
"product_id": "src.product_id",
"quantity": "src.quantity",
"status": "src.status",
"_load_timestamp": "src._load_timestamp"
}
).whenNotMatchedInsertAll().execute()
print(f"MERGE completed: {transformed.count()} source records processed")
ODI's IKM generates a multi-step SQL script that creates staging tables, performs separate UPDATE and INSERT operations, and drops staging tables. Delta Lake MERGE eliminates all of this ceremony — it atomically handles inserts, updates, and optionally deletes in a single transactional operation. There are no staging tables to manage, no orphaned temporary objects, and no partial-failure states.
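For the optional delete in step 5 (keeping the target synchronized with a full source extract), the same merge builder can be extended with a whenNotMatchedBySourceDelete clause, available in recent Delta Lake releases. A minimal sketch reusing target_table and transformed from the example above; apply it only to full-extract loads, since an incremental feed would otherwise delete untouched historical rows:
# Full-refresh variant: also remove target rows that no longer exist in the source
(target_table.alias("tgt").merge(
    transformed.alias("src"),
    "tgt.order_id = src.order_id"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.whenNotMatchedBySourceDelete()
.execute())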
LKM Pattern: Source Data Extraction
ODI Loading Knowledge Modules (LKMs) handle data extraction from heterogeneous sources. "LKM SQL to SQL" uses JDBC to move data between databases. "LKM File to SQL" reads flat files into staging tables. In Databricks, these patterns map to JDBC reads, Auto Loader for file ingestion, and COPY INTO for batch file loads.
# LKM File to SQL equivalent: Load CSV files from cloud storage
# ODI LKM generates: CREATE external table, SELECT INTO staging
# Databricks Auto Loader: Schema-aware, incremental file processing
df_raw = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
.option("cloudFiles.inferColumnTypes", "true")
.option("header", "true")
.option("delimiter", ",")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("/mnt/landing/oracle_export/orders/")
)
# Add audit metadata
df_bronze = (df_raw
.withColumn("_source_file", F.input_file_name())
.withColumn("_ingested_at", F.current_timestamp())
)
# Write to bronze layer
(df_bronze.writeStream
.format("delta")
.option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable("bronze.customer_orders")
)
# LKM SQL to SQL equivalent: JDBC extraction from Oracle
# ODI generates: INSERT INTO staging SELECT * FROM source@dblink
source_data = (spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
.option("dbtable", "(SELECT * FROM hr.employees WHERE modified_date > SYSDATE - 1)")
.option("user", dbutils.secrets.get("oracle", "username"))
.option("password", dbutils.secrets.get("oracle", "password"))
.option("fetchsize", "10000")
.option("numPartitions", "8")
.option("partitionColumn", "employee_id")
.option("lowerBound", "1")
.option("upperBound", "1000000")
.load()
)
source_data.write.format("delta").mode("overwrite").saveAsTable("bronze.employees")
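For batch file loads without a streaming checkpoint, COPY INTO (the third LKM replacement named above) offers an idempotent alternative to Auto Loader: files already loaded are skipped on re-runs. A minimal sketch, assuming the bronze table already exists and the landing path matches the Auto Loader example:
# LKM File to SQL equivalent (batch variant): load only files not yet ingested
spark.sql("""
    COPY INTO bronze.customer_orders
    FROM '/mnt/landing/oracle_export/orders/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")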
MigryX: Purpose-Built Parsers for Every Legacy Technology
MigryX does not rely on generic text matching or regex-based parsing. For every supported legacy technology, MigryX has built a dedicated Abstract Syntax Tree (AST) parser that understands the full grammar and semantics of that platform. This means MigryX captures not just what the code does, but why — understanding implicit behaviors, default settings, and platform-specific quirks that generic tools miss entirely.
ODI Interface Mappings to PySpark DataFrame Transformations
An ODI interface (called "mapping" in ODI 12c) defines source-to-target column mappings with transformation expressions. Each target column has an expression that can reference source columns, ODI functions, lookups, and variables. In PySpark, these become DataFrame select() or withColumn() operations.
# ODI Interface mapping example:
# Source: SALES.TRANSACTIONS (Oracle)
# Target: DW.FACT_SALES (Oracle)
#
# Mappings:
# SALE_KEY = SALES_SEQ.NEXTVAL (ODI sequence)
# CUSTOMER_SK = LOOKUP(DIM_CUSTOMER, CUST_ID) (lookup)
# PRODUCT_SK = LOOKUP(DIM_PRODUCT, PROD_CODE) (lookup)
# SALE_DATE = TRUNC(TXN_TIMESTAMP) (expression)
# SALE_AMOUNT = QUANTITY * UNIT_PRICE * (1 - DISCOUNT) (expression)
# CURRENCY_CODE = NVL(CURRENCY, 'USD') (null handling)
# REGION = DECODE(REGION_CODE, 1,'NA',2,'EU',3,'APAC','OTHER') (decode)
from pyspark.sql import functions as F
# Read source
transactions = spark.table("bronze.transactions")
# Lookup tables (replaces ODI lookup definitions)
dim_customer = spark.table("gold.dim_customer").select("customer_sk", "cust_id")
dim_product = spark.table("gold.dim_product").select("product_sk", "prod_code")
# Apply transformations (replaces ODI interface column mappings)
fact_sales = (transactions
# Joins replace ODI lookups
.join(F.broadcast(dim_customer),
transactions.cust_id == dim_customer.cust_id, "left")
.join(F.broadcast(dim_product),
transactions.prod_code == dim_product.prod_code, "left")
# Column expressions replace ODI derivations
.withColumn("sale_date", F.to_date("txn_timestamp"))
.withColumn("sale_amount",
F.col("quantity") * F.col("unit_price") * (1 - F.col("discount")))
.withColumn("currency_code",
F.coalesce(F.col("currency"), F.lit("USD")))
.withColumn("region",
F.when(F.col("region_code") == 1, "NA")
.when(F.col("region_code") == 2, "EU")
.when(F.col("region_code") == 3, "APAC")
.otherwise("OTHER"))
# Surrogate key replaces ODI sequence
.withColumn("sale_key", F.monotonically_increasing_id())
# Select final columns matching target schema
.select(
"sale_key", "customer_sk", "product_sk",
"sale_date", "sale_amount", "currency_code", "region"
)
)
fact_sales.write.format("delta").mode("append").saveAsTable("gold.fact_sales")
ODI Variables to Databricks Widgets and Parameters
ODI variables serve multiple purposes: they parameterize interfaces and packages, hold query results (refreshing variables), store global configuration, and control flow logic. Databricks provides widgets for notebook-level parameterization and job parameters for workflow-level configuration.
| ODI Variable Type | Databricks Equivalent | Scope |
|---|---|---|
| Project Variable (static) | Notebook widget with default value | Notebook-level parameter |
| Global Variable | Job parameter / cluster environment variable | Workflow-level or cluster-level |
| Refreshing Variable (query-based) | Spark SQL query result assigned to Python variable | Runtime computed value |
| Sequence Variable | MAX(column) + row_number() pattern | Incrementing numeric key |
| Declare Variable (procedure-local) | Python local variable | Function-level scope |
# ODI refreshing variable equivalent
# ODI: #GLOBAL.LAST_EXTRACT_DATE = SELECT MAX(extract_date) FROM ctl.extract_log WHERE status='SUCCESS'
# Databricks: Query-based variable assignment
last_extract_date = spark.sql("""
SELECT MAX(extract_date) AS last_date
FROM control.extract_log
WHERE status = 'SUCCESS'
""").collect()[0]["last_date"]
print(f"Last successful extract: {last_extract_date}")
# Use in subsequent queries (replaces ODI variable substitution)
new_records = spark.sql(f"""
SELECT * FROM bronze.transactions
WHERE modified_date > '{last_extract_date}'
""")
# ODI project variable equivalent: Notebook widgets
dbutils.widgets.text("source_schema", "bronze", "Source Schema")
dbutils.widgets.text("target_schema", "silver", "Target Schema")
dbutils.widgets.dropdown("load_mode", "incremental", ["full", "incremental"], "Load Mode")
source_schema = dbutils.widgets.get("source_schema")
target_schema = dbutils.widgets.get("target_schema")
load_mode = dbutils.widgets.get("load_mode")
if load_mode == "full":
    df = spark.table(f"{source_schema}.transactions")
else:
    df = spark.sql(f"""
        SELECT * FROM {source_schema}.transactions
        WHERE modified_date > '{last_extract_date}'
    """)
ODI Sequences to Surrogate Key Generation
ODI sequences generate auto-incrementing numeric values for surrogate keys, similar to Oracle database sequences. In Databricks, surrogate key generation uses monotonically_increasing_id() for unique (but not sequential) identifiers, or row_number() with a maximum-value offset for sequential keys that continue from the last loaded value.
# ODI Sequence: ODI_SEQ_CUSTOMER_KEY (native sequence, or ODI-managed identity)
# Pattern: Get current max, assign sequential keys to new records
from pyspark.sql.window import Window
# Get current maximum key from target table
max_key_row = spark.sql("""
SELECT COALESCE(MAX(customer_key), 0) AS max_key
FROM gold.dim_customer
""").collect()[0]
max_key = max_key_row["max_key"]
# Assign sequential keys to new records
window_spec = Window.orderBy("customer_id")
new_customers = (spark.table("silver.customer_staging")
.withColumn("customer_key",
F.row_number().over(window_spec) + F.lit(max_key))
)
new_customers.write.format("delta").mode("append").saveAsTable("gold.dim_customer")
print(f"Assigned keys {max_key + 1} through {max_key + new_customers.count()}")
ODI Packages and Scenarios to Databricks Workflows
ODI packages orchestrate the execution of interfaces, procedures, variables, and other packages in a defined sequence with conditional branching (OK/KO paths). Scenarios are compiled, deployable versions of packages. Load plans provide parallel execution with step-level restart capability. All of these map to Databricks Workflows, which support task DAGs, conditional execution, retry policies, and parameterized runs.
# ODI Package structure:
# Step 1: Refresh Variable (get last extract date)
# Step 2 (OK): Interface: Load staging (LKM File to SQL)
# Step 3 (OK): Interface: Transform and merge (IKM Incremental Update)
# Step 3 (KO): Procedure: Send error notification
# Step 4 (OK): Interface: Build aggregates
# Step 5: Procedure: Update control table (runs regardless of outcome)
# Databricks Workflow equivalent
{
"name": "odi_migration_daily_load",
"tasks": [
{
"task_key": "get_extract_date",
"notebook_task": {
"notebook_path": "/odi_migration/01_get_extract_date"
},
"max_retries": 1,
"timeout_seconds": 300
},
{
"task_key": "load_staging",
"depends_on": [{"task_key": "get_extract_date"}],
"notebook_task": {
"notebook_path": "/odi_migration/02_load_staging",
"base_parameters": {
"last_extract_date": "{{tasks.get_extract_date.values.last_extract_date}}"
}
},
"run_if": "ALL_SUCCESS",
"max_retries": 2
},
{
"task_key": "transform_merge",
"depends_on": [{"task_key": "load_staging"}],
"notebook_task": {
"notebook_path": "/odi_migration/03_transform_merge"
},
"run_if": "ALL_SUCCESS"
},
{
"task_key": "build_aggregates",
"depends_on": [{"task_key": "transform_merge"}],
"notebook_task": {
"notebook_path": "/odi_migration/04_build_aggregates"
},
"run_if": "ALL_SUCCESS"
},
{
"task_key": "send_error_notification",
"depends_on": [{"task_key": "transform_merge"}],
"notebook_task": {
"notebook_path": "/odi_migration/error_notification"
},
"run_if": "AT_LEAST_ONE_FAILED"
},
{
"task_key": "update_control_table",
"depends_on": [
{"task_key": "build_aggregates"},
{"task_key": "send_error_notification"}
],
"notebook_task": {
"notebook_path": "/odi_migration/05_update_control_table"
},
"run_if": "ALL_DONE"
}
],
"schedule": {
"quartz_cron_expression": "0 0 6 * * ?",
"timezone_id": "America/New_York"
}
}
ODI Load Plans: Parallel Execution with Dependencies
ODI load plans provide sophisticated orchestration with parallel steps, serial steps, exception handling, and restart capabilities. Databricks Workflows natively support parallel task execution through the dependency graph — tasks without dependency relationships run in parallel automatically.
# ODI Load Plan with parallel steps:
# Serial Step: Initialize
# Parallel Step:
# - Load Customers (interface)
# - Load Products (interface)
# - Load Regions (interface)
# Serial Step: Load Orders (depends on all dimension loads)
# Serial Step: Build Aggregates
# Databricks Workflow: Parallel tasks run automatically when dependencies allow
{
"name": "odi_load_plan_equivalent",
"tasks": [
{
"task_key": "initialize",
"notebook_task": {"notebook_path": "/pipelines/initialize"}
},
{
"task_key": "load_customers",
"depends_on": [{"task_key": "initialize"}],
"notebook_task": {"notebook_path": "/pipelines/load_customers"}
},
{
"task_key": "load_products",
"depends_on": [{"task_key": "initialize"}],
"notebook_task": {"notebook_path": "/pipelines/load_products"}
},
{
"task_key": "load_regions",
"depends_on": [{"task_key": "initialize"}],
"notebook_task": {"notebook_path": "/pipelines/load_regions"}
},
{
"task_key": "load_orders",
"depends_on": [
{"task_key": "load_customers"},
{"task_key": "load_products"},
{"task_key": "load_regions"}
],
"notebook_task": {"notebook_path": "/pipelines/load_orders"}
},
{
"task_key": "build_aggregates",
"depends_on": [{"task_key": "load_orders"}],
"notebook_task": {"notebook_path": "/pipelines/build_aggregates"}
}
]
}
ODI Journalizing (CDC) to Delta Lake Change Data Feed
ODI journalizing captures changes on source tables through journal tables (J$ tables) populated by triggers or log-based mechanisms, enabling incremental extraction where only changed rows are processed. Delta Lake provides Change Data Feed (CDF), which, once enabled on a table, automatically records row-level changes: inserts, deletes, and both the pre-image and post-image of each update.
# Enable Change Data Feed on a Delta table (replaces ODI journalizing setup)
spark.sql("""
ALTER TABLE silver.customer_orders
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")
# Read changes since last processing (replaces ODI journal table query)
# ODI equivalent: SELECT * FROM J$_CUSTOMER_ORDERS WHERE JRN_FLAG IN ('I','U')
changes = (spark.read
.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 15) # or startingTimestamp
.table("silver.customer_orders")
)
# Filter by change type
inserts = changes.filter(F.col("_change_type") == "insert")
updates = changes.filter(F.col("_change_type").isin("update_postimage"))
deletes = changes.filter(F.col("_change_type") == "delete")
print(f"Changes: {inserts.count()} inserts, {updates.count()} updates, {deletes.count()} deletes")
# Apply changes to downstream gold table using MERGE
gold_target = DeltaTable.forName(spark, "gold.customer_orders_summary")
# Process only changed records (incremental, like ODI journalizing)
changed_customers = changes.filter(
F.col("_change_type").isin("insert", "update_postimage")
).select("customer_id", "order_amount_usd", "order_date")
# Reaggregate only affected customers
updated_summary = (changed_customers
.groupBy("customer_id")
.agg(
F.sum("order_amount_usd").alias("total_orders"),
F.max("order_date").alias("last_order_date"),
F.count("*").alias("order_count")
)
)
gold_target.alias("tgt").merge(
updated_summary.alias("src"),
"tgt.customer_id = src.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
ODI journalizing requires creating journal tables (J$ tables), journal triggers or log-mining configuration, and journal subscribers in the ODI topology. Delta Lake Change Data Feed eliminates all of this infrastructure — enabling CDF is a single table property change, and the change feed is automatically maintained by Delta Lake's transaction log. No triggers, no journal tables, no subscriber configuration.
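In practice the startingVersion shown above should not be hardcoded. A small control table can play the role of the ODI journal subscriber's consumption window by recording the last Delta version processed; a sketch, assuming a control.cdf_watermark table with (table_name, last_version, processed_at) columns:
# Last processed version for this table (0 if this is the first run)
watermark = spark.sql("""
    SELECT MAX(last_version) AS v
    FROM control.cdf_watermark
    WHERE table_name = 'silver.customer_orders'
""").collect()[0]["v"]
start_version = (watermark or 0) + 1
# Latest committed version of the source table, taken from its Delta history
latest_version = (DeltaTable.forName(spark, "silver.customer_orders")
    .history(1)
    .select("version")
    .collect()[0]["version"]
)
# Guard for an empty window (start_version > latest_version) omitted for brevity
changes = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", start_version)
    .option("endingVersion", latest_version)
    .table("silver.customer_orders")
)
# ...apply the MERGE shown above, then advance the watermark
spark.sql(f"""
    INSERT INTO control.cdf_watermark
    VALUES ('silver.customer_orders', {latest_version}, current_timestamp())
""")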
From parsed legacy code to production-ready modern equivalents — MigryX automates the entire conversion pipeline
From Legacy Complexity to Modern Clarity with MigryX
Legacy ETL platforms encode business logic in visual workflows, proprietary XML formats, and platform-specific constructs that are opaque to standard analysis tools. MigryX’s deep parsers crack open these proprietary formats and extract the underlying data transformations, business rules, and data flows. The result is complete transparency into what your legacy code actually does — often revealing undocumented logic that even the original developers had forgotten.
ODI Topology to Unity Catalog
ODI Topology defines physical and logical schemas that abstract database connections. Physical schemas map to actual database schemas, while logical schemas provide a layer of indirection for environment promotion (dev/test/prod). Unity Catalog provides a three-level namespace (catalog.schema.table) that serves the same purpose with additional governance capabilities.
| ODI Topology Concept | Unity Catalog Equivalent | Notes |
|---|---|---|
| Physical Schema | Unity Catalog Schema | Direct mapping to database schema |
| Logical Schema | Unity Catalog catalog | Environment abstraction (dev/test/prod catalogs) |
| Context (DEV/TEST/PROD) | Catalog name (dev_catalog/prod_catalog) | Environment promotion via catalog switching |
| Data Server (connection) | External connection / Secret scope | Managed credentials with secret management |
| Agent (physical/standalone) | Databricks cluster / SQL warehouse | Elastic compute with auto-scaling |
| Technology definition | Built-in format support + connectors | Delta, Parquet, CSV, JSON, JDBC, etc. |
# ODI Context-based environment switching equivalent
# ODI: Logical schema "LS_DW" maps to "DW_DEV" in DEV context, "DW_PROD" in PROD context
# Databricks: Use catalog name for environment switching
environment = dbutils.widgets.get("environment") # "dev" or "prod"
if environment == "dev":
    catalog = "dev_catalog"
elif environment == "prod":
    catalog = "prod_catalog"
else:
    raise ValueError(f"Unknown environment: {environment}")
spark.sql(f"USE CATALOG {catalog}")
spark.sql("USE SCHEMA warehouse")
# All subsequent queries automatically use the correct environment
df = spark.table("fact_orders") # Resolves to dev_catalog.warehouse.fact_orders or prod_catalog.warehouse.fact_orders
CKM Flow Control to Delta Lake Data Quality
ODI Check Knowledge Modules (CKMs) validate data against constraints defined on the target datastore. Rows that fail validation are moved to error tables (E$ tables) with error messages and codes. In Databricks, data quality checks can be implemented using PySpark validation logic, Delta Lake expectations (in Delta Live Tables), or SQL-based constraint checks.
# ODI CKM Oracle equivalent: Validate data against target constraints
# ODI generates: INSERT INTO E$_TARGET SELECT * FROM I$_TARGET WHERE constraint_violated
df_staged = spark.table("staging.order_lines")
# Define validation rules (replaces ODI CKM constraint checks)
validation_rules = {
"NOT_NULL_ORDER_ID": F.col("order_id").isNotNull(),
"NOT_NULL_PRODUCT_ID": F.col("product_id").isNotNull(),
"POSITIVE_QUANTITY": F.col("quantity") > 0,
"POSITIVE_AMOUNT": F.col("amount") > 0,
"VALID_STATUS": F.col("status").isin("OPEN", "SHIPPED", "DELIVERED", "CANCELLED"),
"VALID_DATE": F.col("order_date").between("2020-01-01", "2030-12-31")
}
# Apply all validation rules
valid_mask = F.lit(True)
error_reasons = F.lit("")
for rule_name, rule_condition in validation_rules.items():
    # Treat NULL rule results (e.g. a NULL quantity compared with > 0) as failures;
    # otherwise such rows would fall through both the valid and the error filters below
    rule_passed = F.coalesce(rule_condition, F.lit(False))
    valid_mask = valid_mask & rule_passed
    error_reasons = F.when(
        ~rule_passed,
        F.concat(error_reasons, F.lit(f"{rule_name}; "))
    ).otherwise(error_reasons)
df_validated = df_staged.withColumn("_is_valid", valid_mask).withColumn("_error_reasons", error_reasons)
# Separate valid and error records (replaces E$ error table)
valid_records = df_validated.filter(F.col("_is_valid")).drop("_is_valid", "_error_reasons")
error_records = (df_validated
.filter(~F.col("_is_valid"))
.withColumn("_error_timestamp", F.current_timestamp())
.withColumn("_source_system", F.lit("ODI_MIGRATION"))
)
# Write valid records to target
valid_records.write.format("delta").mode("append").saveAsTable("silver.order_lines")
# Write errors to quarantine table (replaces E$ table)
error_records.write.format("delta").mode("append").saveAsTable("quarantine.order_lines_errors")
valid_count = valid_records.count()
error_count = error_records.count()
print(f"Flow control results: {valid_count} valid, {error_count} rejected")
# Fail the notebook if error rate exceeds threshold
error_rate = error_count / (valid_count + error_count) if (valid_count + error_count) > 0 else 0
if error_rate > 0.05:
    raise Exception(f"Error rate {error_rate:.2%} exceeds 5% threshold. Check quarantine.order_lines_errors.")
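The same rules can also be enforced declaratively with Delta Live Tables expectations, the option mentioned at the start of this section: rows violating an expect_all_or_drop rule are dropped and surfaced in the pipeline's data quality metrics, much like CKM routing to E$ tables. A sketch of the equivalent DLT definition (it runs inside a DLT pipeline, not a plain notebook):
import dlt

@dlt.table(name="order_lines_clean", comment="Order lines validated by DLT expectations")
@dlt.expect_all_or_drop({
    "not_null_order_id": "order_id IS NOT NULL",
    "not_null_product_id": "product_id IS NOT NULL",
    "positive_quantity": "quantity > 0",
    "positive_amount": "amount > 0",
    "valid_status": "status IN ('OPEN', 'SHIPPED', 'DELIVERED', 'CANCELLED')",
    "valid_date": "order_date BETWEEN '2020-01-01' AND '2030-12-31'"
})
def order_lines_clean():
    return spark.table("staging.order_lines")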
Medallion Architecture: Organizing the Migrated Pipelines
ODI projects typically organize data flows across staging schemas, intermediate work tables, and target data warehouse schemas. This maps to the Databricks Medallion Architecture pattern, which provides a standardized approach to data quality progression.
| ODI Data Layer | Medallion Tier | Delta Lake Implementation |
|---|---|---|
| Staging area (C$ tables) | Bronze | Raw ingestion with audit metadata columns |
| Work tables (I$ tables) | Silver | Cleansed, deduplicated, conformed data |
| Target data warehouse | Gold | Business-level dimensions, facts, and aggregates |
| Error tables (E$ tables) | Quarantine | Rejected records with error reasons and timestamps |
| Journal tables (J$ tables) | Delta Lake Change Data Feed | Automatic change tracking via transaction log |
MLflow: Extending Beyond ODI's ETL Scope
ODI is a pure data integration tool with no native machine learning capabilities. Databricks extends the platform with MLflow for experiment tracking, model training, model registry, and model serving. Feature engineering pipelines that previously terminated in ODI-loaded tables and fed external ML tools can now run end-to-end on Databricks.
# Feature engineering pipeline that extends beyond what ODI can do
import mlflow
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
# Features from ODI-migrated gold tables
features = spark.sql("""
SELECT
c.customer_id,
c.tenure_months,
c.credit_tier,
o.total_orders,
o.avg_order_value,
o.days_since_last_order,
CASE WHEN o.days_since_last_order > 180 THEN 1 ELSE 0 END AS is_churned
FROM gold.dim_customer c
JOIN gold.customer_order_summary o ON c.customer_id = o.customer_id
""")
# Build ML pipeline
with mlflow.start_run(run_name="churn_model_v1"):
    indexer = StringIndexer(inputCol="credit_tier", outputCol="credit_tier_idx")
    assembler = VectorAssembler(
        inputCols=["tenure_months", "credit_tier_idx", "total_orders",
                   "avg_order_value", "days_since_last_order"],
        outputCol="features"
    )
    rf = RandomForestClassifier(labelCol="is_churned", featuresCol="features", numTrees=100)
    pipeline = Pipeline(stages=[indexer, assembler, rf])
    # Hold out a test split so the logged accuracy reflects measured performance
    train_df, test_df = features.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train_df)
    accuracy = MulticlassClassificationEvaluator(
        labelCol="is_churned", predictionCol="prediction", metricName="accuracy"
    ).evaluate(model.transform(test_df))
    mlflow.spark.log_model(model, "churn_model")
    mlflow.log_metric("accuracy", accuracy)
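To cover the model registry mentioned above, the logged model can be registered in Unity Catalog straight from the run; a brief sketch, where the three-level model name is an assumption:
mlflow.set_registry_uri("databricks-uc")  # use Unity Catalog as the model registry
# run_id of the training run completed above
run_id = mlflow.last_active_run().info.run_id
mlflow.register_model(
    model_uri=f"runs:/{run_id}/churn_model",
    name="prod_catalog.ml.customer_churn"  # assumed catalog.schema.model name
)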
How MigryX Automates ODI-to-Databricks Migration
MigryX uses AST-based deterministic parsing to analyze ODI repository exports (XML-based smart exports and SDK extracts), building a complete abstract syntax tree of each interface mapping, Knowledge Module template, package flow, and load plan dependency graph. Unlike regex-based tools that cannot parse nested ODI expressions, or AI-only approaches that approximate KM behavior, MigryX's parser achieves 95%+ accuracy by understanding the full structure of ODI metadata — including KM option settings, variable substitutions, flow control conditions, and multi-technology topology mappings.
MigryX ODI Migration Workflow
- Repository Analysis — MigryX connects to the ODI repository (or processes smart exports) and inventories all projects, folders, interfaces/mappings, packages, scenarios, load plans, variables, sequences, and Knowledge Module assignments. Column-level lineage is traced from source datastores through every mapping expression to target columns.
- Knowledge Module Decomposition — Each KM assignment is decomposed into its generated SQL steps. IKM incremental patterns become Delta Lake MERGE operations. LKM extraction patterns become Auto Loader or JDBC read configurations. CKM validation patterns become PySpark quality checks with quarantine tables.
- Interface-to-PySpark Translation — ODI interface column mappings, expressions, joins, filters, and lookup definitions are parsed into an AST and deterministically translated to PySpark DataFrame operations. ODI functions (NVL, DECODE, INSTR, SUBSTR) are mapped to their PySpark/Spark SQL equivalents.
- Package-to-Workflow Conversion — ODI package step sequences with OK/KO branching are converted to Databricks Workflow task DAGs with run_if conditions (ALL_SUCCESS, AT_LEAST_ONE_FAILED) that preserve the original error-handling logic.
- STTM Documentation — Source-to-Target Mapping documents capture every column mapping, transformation expression, KM option, and data type conversion for comprehensive audit trails and regulatory compliance.
- Merlin AI — MigryX's Merlin AI handles edge cases including custom Knowledge Modules with non-standard template steps, complex Jython procedures, and Oracle-specific SQL constructs (CONNECT BY, MODEL clause, XMLTABLE) that require contextual translation beyond deterministic rules.
Key Takeaways
- ODI Knowledge Modules (IKM/LKM/CKM) are code-generation templates that produce multi-step SQL scripts. In Databricks, these are replaced by explicit PySpark transformations and Delta Lake MERGE operations that are simpler, auditable, and ACID-compliant.
- ODI interfaces (mappings) translate to PySpark DataFrame operations — column expressions become withColumn/select, lookups become joins, and filters become DataFrame filter conditions.
- ODI packages and scenarios map to Databricks Workflows with task dependencies, conditional execution (run_if), and retry policies. Load plans with parallel steps translate to workflow DAGs where tasks without dependencies run concurrently.
- ODI journalizing (J$ tables and triggers) is replaced by Delta Lake Change Data Feed (CDF), which provides row-level change tracking automatically through the Delta transaction log — no triggers or journal tables required.
- ODI Topology (physical/logical schemas with context-based switching) maps to Unity Catalog's three-level namespace, where catalog names provide environment abstraction (dev_catalog vs. prod_catalog).
- CKM flow control (E$ error tables) translates to PySpark validation logic with quarantine tables, or Delta Live Tables expectations for declarative data quality enforcement.
- MigryX automates the conversion with AST-based deterministic parsing at 95%+ accuracy, decomposing Knowledge Modules into their generated SQL, translating ODI expressions to PySpark, and generating STTM documentation. On-premise and air-gapped deployment ensures no ODI repository metadata leaves your security perimeter.
Migrating from Oracle ODI to Databricks eliminates the Knowledge Module abstraction layer that sits between your transformation logic and its execution. In ODI, you define mappings in a visual tool, assign Knowledge Modules that generate SQL, and deploy scenarios that the ODI Agent executes against target databases. In Databricks, you write PySpark or Spark SQL directly — the code is the transformation, and it executes on elastic Spark clusters against Delta Lake storage. The result is a simpler, more transparent architecture where every transformation is visible, testable, and version-controlled in Git. Unity Catalog provides governance that exceeds ODI's topology-based access control, and Delta Lake delivers ACID transactions, time travel, and Change Data Feed capabilities that ODI's journal-table approach cannot match. With MigryX's automated, AST-based approach, organizations can migrate hundreds of ODI interfaces, packages, and load plans while preserving complete business logic fidelity and generating the comprehensive STTM documentation that enterprise migrations demand.
Why MigryX Is the Only Platform That Handles This Migration
The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:
- Deep AST parsing: MigryX’s custom-built parsers achieve 95% accuracy on every supported legacy technology — not through approximation, but through true semantic understanding.
- Merlin AI augmentation: Where deterministic parsing reaches its limit, Merlin AI resolves ambiguities and implicit behaviors, pushing accuracy to 99%.
- Complete coverage: MigryX supports 25+ source technologies including SAS, Informatica, DataStage, SSIS, Alteryx, Talend, ODI, Teradata, and Oracle PL/SQL.
- End-to-end automation: From parsing to conversion to validation — MigryX automates the entire pipeline, not just one step.
MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.
Ready to migrate from Oracle ODI to Databricks?
See how MigryX converts ODI Knowledge Modules, interfaces, packages, and load plans to production-ready PySpark notebooks and Databricks Workflows.
Explore ODI Migration | Schedule a Demo