This document is the primary deliverable for the "Data Migration Planner" workflow: production-ready code and structured configurations for a robust data migration, covering data mapping, transformation rules, validation, and rollback procedures, with clear explanations and actionable code examples.
It provides the foundational components required to execute a complex migration: configuration management, core Extract-Transform-Load (ETL) logic, pre- and post-migration validation, and essential rollback mechanisms. The code is designed to be modular, extensible, and production-ready, using Python for its versatility in data processing and database interaction.
A robust data migration requires a clear and centralized configuration. We will use a YAML file for defining source-to-target mappings, transformation rules, and connection parameters, making it easy to manage and update.
data_migration_config.yaml (Example Configuration File)
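As a minimal illustrative sketch, a configuration file with this structure might look like the following. Every hostname, database, table, field, and value shown here is a placeholder assumption, not a prescribed layout:

```yaml
# data_migration_config.yaml -- illustrative sketch; all values are placeholders.
source_db:
  type: postgresql
  host: source-db.example.com
  port: 5432
  database: legacy_crm
  user: migration_user
  password_env_var: SOURCE_DB_PASSWORD   # resolved at runtime, never stored in the file

target_db:
  type: postgresql
  host: target-db.example.com
  port: 5432
  database: new_crm
  user: migration_user
  password_env_var: TARGET_DB_PASSWORD

tables_to_migrate:
  - name: accounts
    schema: public
    primary_key: account_id
    batch_size: 5000

mappings:
  accounts:
    - source_field: account_name
      target_field: name
    - source_field: phone
      target_field: phone
      transform: format_phone_e164       # function assumed to live in transformations.py
      args: {default_country_code: "+1"}

validation_rules:
  - type: row_count_check
    source_query: "SELECT COUNT(*) FROM public.accounts"
    target_query: "SELECT COUNT(*) FROM public.accounts"
    tolerance_percentage: 0.1

rollback:
  strategy: snapshot_restore
  snapshot_label: pre_migration

project_timeline:
  cutover_window_hours: 8
```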
#### Explanation of `data_migration_config.yaml`
* **`source_db` / `target_db`**: Defines connection parameters for source and target databases. Passwords are specified via environment variables for security.
* **`tables_to_migrate`**: Lists tables to be migrated, including schema, primary key (crucial for validation and rollback), and batch size for performance.
* **`mappings`**: This is the core of the data transformation.
* Each entry represents a field mapping from source to target.
* `transform`: Specifies a Python function (from `transformations.py`) to apply to the source data.
* `args`: Optional arguments passed to the transformation function.
* **`validation_rules`**: Defines rules to be executed *before* and *after* the migration.
* `type`: Type of validation (e.g., `row_count_check`, `schema_compatibility_check`).
* `source_query`/`target_query`: SQL queries for data comparison.
* `tolerance_percentage`: Allows for minor discrepancies in counts or sums.
* **`rollback`**: Configures the strategy and details for reverting the migration in case of failure.
* **`project_timeline`**: A conceptual section that embeds timeline estimates directly in the planning document; the migration code itself does not act on it.
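The `transform` entries in the mappings refer to functions in `transformations.py`. That module is not reproduced here; a minimal sketch of what it might contain (function names and signatures are illustrative assumptions) is:

```python
# transformations.py -- illustrative sketch; names and signatures are assumptions.

def truncate(value: str, max_length: int) -> str:
    """Truncate a string to the target field's maximum length."""
    return value[:max_length] if value else value

def map_picklist(value: str, mapping: dict, default: str = "Other") -> str:
    """Map a source picklist value to its target equivalent, defaulting when unmapped."""
    return mapping.get(value, default)
```

Each function takes the source value first, with any `args` from the configuration passed as keyword arguments, so new transformations can be added without touching the core migration loop.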
### 2. Core Migration Framework (Python)
This section provides the Python scripts that implement the data migration logic.
#### `config_loader.py`
Handles loading the YAML configuration file.
## Data Migration Architecture Plan

Document Version: 1.0
Date: October 26, 2023
Prepared For: [Customer Name/Department]
Prepared By: PantheraHive Solutions Team
This document outlines the proposed architectural plan for the data migration project. The objective is to establish a robust, secure, and efficient framework for transferring data from the [Source System Name] to the [Target System Name]. This plan details the high-level strategy, key architectural components, considerations for data integrity, performance, and reversibility, along with preliminary thoughts on field mapping, transformation, validation, and rollback procedures. This foundational architecture will guide subsequent detailed design, development, and execution phases, ensuring a successful and controlled data migration.
The proposed migration strategy will be a [e.g., Phased / Big Bang / Iterative / Incremental] approach, combining initial data profiling and cleansing with a structured migration process.
The migration architecture will comprise the following logical layers:
**Extraction Layer:**
* Direct Database Queries: For relational databases, optimized SQL queries will be used to extract data in batches.
* API Calls: For SaaS applications or systems with robust APIs, programmatic extraction will be utilized.
* File Exports: For systems without direct database access or APIs, flat file (CSV, XML) exports will be used.
**Staging and Cleansing Layer:**
* Raw Data Storage: Store extracted data in its original format.
* Data Profiling: Run tools to analyze data quality, completeness, and consistency.
* Data Cleansing: Apply rules to correct errors, standardize formats, and remove duplicates.
**Transformation Layer:**
* Data Transformation: Apply business rules to map source data to the target schema.
* Field Mapping: Define one-to-one, one-to-many, many-to-one, and many-to-many mappings between source and target fields.
* Data Type Conversion: Adjust data types (e.g., string to integer, date formats).
* Value Normalization: Standardize values (e.g., 'CA', 'California' to 'California').
* Data Enrichment: Add missing data from external sources if required.
* Derivations: Calculate new fields based on source data.
* De-duplication: Identify and merge duplicate records based on defined rules.
* Error Handling: Mechanisms to log and manage transformation failures.
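The transformation and error-handling steps above can be sketched as a single per-record loop that applies configured field mappings and diverts failures to an error list instead of aborting the batch. Field names and the transform registry below are assumptions for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO)

def transform_batch(records, mappings, transforms):
    """Apply configured field mappings to a batch; failed records go to an error list."""
    transformed, errors = [], []
    for record in records:
        try:
            out = {}
            for m in mappings:
                value = record.get(m["source_field"])
                if "transform" in m:
                    fn = transforms[m["transform"]]
                    value = fn(value, **m.get("args", {}))
                out[m["target_field"]] = value
            transformed.append(out)
        except Exception as exc:  # log and quarantine rather than abort the batch
            logging.error("Transformation failed for %r: %s", record, exc)
            errors.append({"record": record, "error": str(exc)})
    return transformed, errors
```

Records that fail transformation end up in the returned error list, which maps directly onto the error tables and retry mechanisms described below.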
**Loading Layer:**
* API-based Loading: Utilize target system APIs for controlled and validated data insertion (e.g., Salesforce Bulk API, REST APIs).
* Database Inserts/Updates: Direct SQL inserts/updates for relational databases.
* Batch Loading Tools: Leverage target system-specific tools for high-volume loading (e.g., Salesforce Data Loader, database bulk insert utilities).
**Validation Layer:**
* Record Count Validation: Compare the number of records migrated for each entity against source counts.
* Data Sample Validation: Randomly select records and compare field-by-field values between source and target.
* Checksum/Hash Validation: Generate checksums for key datasets before and after migration.
* Business Rule Validation: Verify that target data adheres to business logic (e.g., all accounts must have an owner).
* Referential Integrity Checks: Ensure relationships between migrated entities are correctly established.
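The checksum validation above can be sketched by hashing a canonical, order-independent representation of each dataset; identical checksums before and after migration indicate the data survived intact. The row encoding below is an assumption:

```python
import hashlib

def dataset_checksum(rows):
    """Compute an order-independent SHA-256 checksum of a list of row dicts."""
    # Canonicalize each row, then sort so row ordering does not affect the hash.
    canonical = sorted(
        "|".join(f"{k}={row[k]}" for k in sorted(row)) for row in rows
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
```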
**Error Handling:**
* Error Tables: Dedicated database tables to store failed records and error messages.
* Retry Mechanisms: Automated retries for transient errors.
* Notification System: Alerts for critical failures (e.g., email, Slack).
**Performance Considerations:**
* Batch Processing: Optimize data processing in batches to handle large volumes efficiently.
* Parallel Processing: Utilize parallel extraction, transformation, and loading where possible.
* Resource Allocation: Ensure sufficient CPU, memory, and I/O resources for migration servers/containers.
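The batch processing above can be sketched with a simple chunking helper that slices any iterable (a cursor, a file, a generator) into fixed-size batches, with the batch size taken from the per-table configuration:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```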
**Data Integrity Controls:**
* Transactional Control: Implement transactional loads where possible to ensure atomicity.
* Data Quality Gates: Establish checkpoints at each stage to prevent bad data from progressing.
**Cutover Window Minimization:**
* Pre-load Data: Migrate static and historical data in advance of the cutover.
* Delta Loads: For the cutover, only migrate changes since the last pre-load.
* Performance Tuning: Optimize all migration scripts and processes to minimize the cutover window.
**Rollback Provisions:**
* Pre-Migration Backups: Full backups of both source and target systems before cutover.
* Snapshotting: Database snapshots of the target system immediately prior to load.
* Soft Delete/Flagging: If direct deletion is not feasible, flag migrated records in the target system for easy identification and logical deletion.
```python
# config_loader.py
import yaml
import os
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def load_config(config_path: str = 'data_migration_config.yaml') -> dict:
    """
    Loads the data migration configuration from a YAML file.
    Resolves environment variables for sensitive data like passwords.

    Args:
        config_path (str): Path to the YAML configuration file.

    Returns:
        dict: The loaded configuration.

    Raises:
        FileNotFoundError: If the config file does not exist.
        yaml.YAMLError: If there's an issue parsing the YAML file.
        KeyError: If a required environment variable is not set.
    """
    if not os.path.exists(config_path):
        logging.error(f"Configuration file not found at: {config_path}")
        raise FileNotFoundError(f"Configuration file not found: {config_path}")

    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    # Resolve environment variables for database passwords
    for db_type in ['source_db', 'target_db']:
        if db_type in config and 'password_env_var' in config[db_type]:
            env_var_name = config[db_type]['password_env_var']
            password = os.getenv(env_var_name)
            if password is None:
                logging.error(f"Environment variable '{env_var_name}' for {db_type} password is not set")
                raise KeyError(f"Required environment variable not set: {env_var_name}")
            config[db_type]['password'] = password

    return config
```
## Data Migration Plan

Project Name: [Insert Project Name, e.g., CRM System Upgrade Data Migration]
Date: October 26, 2023
Version: 1.0
Prepared For: [Customer Name]
Prepared By: PantheraHive Solutions
This document outlines a comprehensive plan for the data migration from [Source System Name] to [Target System Name]. The primary objective is to ensure a secure, accurate, and efficient transfer of critical business data, minimizing downtime and mitigating risks. This plan details the scope, methodology, field mappings, transformation rules, validation procedures, rollback strategy, and estimated timeline to guide the successful execution of the migration. Adherence to this plan will ensure data integrity, business continuity, and a seamless transition to the new system.
The purpose of this document is to provide a detailed roadmap for the data migration initiative. This plan serves as a foundational guide for all stakeholders involved, ensuring a shared understanding of the process, responsibilities, and expected outcomes.
Key Objectives:
* Ensure an accurate and complete transfer of all in-scope business data.
* Minimize downtime and disruption to business operations during cutover.
* Mitigate migration risks through validation checkpoints and a tested rollback strategy.
This section identifies the systems involved in the migration, including their key characteristics relevant to the data transfer.
**Source System:**
* Name: [e.g., Legacy CRM System (Microsoft Dynamics CRM 2011)]
* Database: [e.g., SQL Server 2012]
* Key Data Entities: [e.g., Accounts, Contacts, Opportunities, Products, Orders]
* Current Data Volume: [e.g., ~500GB, 10 Million Records]
* Access Method: [e.g., ODBC, Direct Database Connection, API]
* Critical Dependencies: [e.g., Integration with ERP system for order data]
**Target System:**
* Name: [e.g., Salesforce Sales Cloud]
* Database: [e.g., Salesforce Native Database (Cloud-based)]
* Key Data Entities: [e.g., Accounts, Contacts, Opportunities, Products, Orders]
* Target Data Volume (Post-Migration): [e.g., Estimated ~600GB after transformation]
* Access Method: [e.g., Salesforce API (SOAP/REST)]
* Critical Dependencies: [e.g., User provisioning, integration with external reporting tools]
The following data entities and their associated fields will be migrated. Exclusions and inclusions are detailed below.
**In Scope:**
* Accounts (All active and inactive accounts from the last 5 years)
* Contacts (All contacts associated with in-scope accounts)
* Opportunities (All open and closed opportunities from the last 3 years)
* Products (All active products)
* Orders (All orders from the last 2 years)
* [Add other relevant entities]
**Out of Scope (Exclusions):**
* Historical Activities older than 5 years (e.g., emails, tasks, calls)
* Archived Reports
* [Add other relevant exclusions]
**Estimated Volumes:**
* Accounts: 500,000 records
* Contacts: 1,500,000 records
* Opportunities: 800,000 records
* Products: 10,000 records
* Orders: 2,000,000 records
* Total Estimated Records: ~4.8 Million
* Total Estimated Data Size: [e.g., 600 GB]
This section details the mapping of source system fields to their corresponding target system fields. This is a critical step to ensure data accuracy and proper placement.
Example Mapping Table Structure:
| Source System Entity | Source Field Name | Source Data Type | Source Max Length | Target System Entity | Target Field Name | Target Data Type | Target Max Length | Mandatory (Target) | Notes/Comments |
| :------------------- | :---------------- | :--------------- | :---------------- | :------------------- | :---------------- | :--------------- | :---------------- | :----------------- | :------------- |
| Account | AccountID | INT | N/A | Account | External_ID__c | Text | 255 | Yes | Unique Identifier |
| Account | AccountName | NVARCHAR | 255 | Account | Name | Text | 255 | Yes | |
| Account | AccountType | NVARCHAR | 50 | Account | Type | Picklist | N/A | No | See Transformation Rule A.1 |
| Account | BillingAddress1 | NVARCHAR | 255 | Account | BillingStreet | Text Area | 255 | No | Concatenation needed for full address |
| Account | LastModifiedDate | DATETIME | N/A | Account | LastModifiedDate | DateTime | N/A | Yes | Direct Map |
| Contact | ContactID | INT | N/A | Contact | External_ID__c | Text | 255 | Yes | Unique Identifier |
| Contact | FirstName | NVARCHAR | 100 | Contact | FirstName | Text | 40 | Yes | Truncate if >40 |
| Contact | LastName | NVARCHAR | 100 | Contact | LastName | Text | 80 | Yes | |
| Opportunity | OpportunityStatus | NVARCHAR | 50 | Opportunity | StageName | Picklist | N/A | Yes | See Transformation Rule O.1 |
| [Add more entities and fields as required] | | | | | | | | | |
A comprehensive mapping document will be provided as an appendix.
Data transformation rules define how source data will be modified, cleansed, or enriched to meet the target system's requirements and business logic.
A. Account Entity Transformations:
* **Rule A.1 (Account Type Mapping):** Source AccountType values ('Customer', 'Prospect', 'Partner', 'Vendor', 'Other') will be mapped to Target Type picklist values ('Client', 'Lead', 'Alliance', 'Supplier', 'Other'). Any source AccountType not explicitly mapped will default to 'Other'.
* **Rule A.2 (Address Mapping):** Source fields BillingAddress1, BillingAddress2, BillingCity, BillingState, BillingZipCode, BillingCountry will be mapped to Target BillingStreet, BillingCity, BillingState, BillingPostalCode, BillingCountry fields, respectively, with the two address lines concatenated into BillingStreet.
  * Example: BillingStreet = BillingAddress1 + ', ' + BillingAddress2 (if BillingAddress2 is not null).
B. Phone Number Transformations:
* All phone numbers will be formatted to the E.164 international standard (+[Country Code][Area Code][Local Number]).
* Non-numeric characters will be removed; missing country codes will default to +1 (USA).
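The phone rules above can be sketched as follows. This is a minimal illustration, not a full E.164 implementation (it does not validate country codes or number lengths):

```python
import re

def format_phone_e164(value: str, default_country_code: str = "+1") -> str:
    """Strip non-numeric characters; prepend the default country code when none is present."""
    if not value:
        return value
    digits = re.sub(r"\D", "", value)          # remove all non-numeric characters
    if value.strip().startswith("+"):          # a country code was already supplied
        return "+" + digits
    return default_country_code + digits       # assume +1 (USA) per the rule above
```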
C. Contact Entity Transformations:
* FirstName will be truncated to 40 characters if the source value exceeds this length.
* LastName will be truncated to 80 characters if the source value exceeds this length.
* Email addresses will be validated for correct format (e.g., name@domain.com). Invalid emails will be flagged and stored in a 'Quarantine' field for manual review.
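The email format check above can be sketched with a conservative regular expression; the pattern is an assumption, and invalid values are returned for quarantine rather than rejected outright:

```python
import re

# Conservative pattern: non-empty local part, "@", domain containing at least one dot.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email(email):
    """Return (email, None) when valid, else (None, email) so the bad value can be quarantined."""
    if email and EMAIL_PATTERN.match(email):
        return email, None
    return None, email
```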
O. Opportunity Entity Transformations:
* **Rule O.1 (Stage Mapping):** Source OpportunityStatus values ('New', 'Qualified', 'Proposal Sent', 'Negotiation', 'Won', 'Lost', 'Closed-No Sale') will be mapped to Target StageName picklist values ('Prospecting', 'Qualification', 'Proposal/Price Quote', 'Negotiation/Review', 'Closed Won', 'Closed Lost', 'Closed Lost'), with both 'Lost' and 'Closed-No Sale' mapping to 'Closed Lost'.
* For 'Closed Won' opportunities with a CloseDate older than 5 years, the CloseDate will be set to the first day of the 5-year lookback period.
G. General Transformations:
* Date Formatting: All date fields will be converted to YYYY-MM-DD format; datetime fields will be converted to YYYY-MM-DD HH:MM:SS and normalized to the UTC timezone.
* Null Handling:
  * If the target field is mandatory, a default value will be assigned (e.g., 'N/A', 'Unknown', or a specific placeholder).
  * If the target field is not mandatory, NULL will be preserved.
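The date and datetime rules above can be sketched with the standard library; the source format string and source timezone below are assumptions that would come from data profiling:

```python
from datetime import datetime, timezone, timedelta

def normalize_datetime(value, source_format="%m/%d/%Y %H:%M:%S", source_tz=timezone.utc):
    """Parse a source datetime string and render it as YYYY-MM-DD HH:MM:SS in UTC."""
    dt = datetime.strptime(value, source_format).replace(tzinfo=source_tz)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```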
Ensuring data quality is paramount. This section outlines the procedures for validating data both before and after migration.
7.1 Pre-Migration Data Quality Checks (Source System):
* Profile source data for completeness, duplicates, and format inconsistencies (e.g., invalid emails, non-standard phone numbers).
* Identify orphaned records and referential integrity issues before extraction.
* Cleanse or flag records that would violate mandatory target fields or transformation rules.
7.2 Post-Migration Validation Scripts and Procedures (Target System):
* **Record Count Validation:** Compare the total number of migrated records per entity in the target system against the expected count from the source system (after applying transformation filters).
  * SQL Query/API Call: `SELECT COUNT(*) FROM [Target Entity]` vs. `SELECT COUNT(*) FROM [Source Entity]` (with filters).
* **Data Sample Validation:**
  * Select a statistically significant random sample of records (e.g., 5-10% or N=1000 per entity) and manually verify field values against the source system.
  * Validate key fields for accuracy and correctness (e.g., Account Name, Contact Email, Opportunity Amount).
  * Script: `SELECT Target.Field FROM Target_Entity WHERE Target.External_ID__c = [Source_ID]` and compare with source.
* **Referential Integrity Checks:**
  * Verify parent-child relationships (e.g., all Contacts are correctly associated with an Account).
  * Script: Identify orphaned records (e.g., Contacts without an associated Account).
* **Transformation Rule Verification:**
  * Verify that transformation rules have been applied correctly (e.g., phone numbers are in E.164 format, Account Types are mapped correctly).
  * Script: `SELECT Target.Phone FROM Target_Entity WHERE Target.Phone NOT LIKE '+%'`
* **Business Report Comparison:** Run key business reports and dashboards in the target system and compare results with equivalent reports from the source system.
* **User Acceptance Testing (UAT):** Business users will perform UAT on a dedicated migration environment to validate data accuracy and system functionality with migrated data.
* **Error Handling and Resolution:**
  * A robust error logging mechanism will capture all failed record migrations, transformation errors, and validation failures.
  * Errors will be categorized, prioritized, and assigned for resolution.
  * Resolution Strategy: Failed records will be reviewed, remediated (either in the source system or via direct data manipulation), and re-migrated in batches.
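The record count comparison above, combined with the `tolerance_percentage` concept from the configuration, can be sketched as a small check function (the counts themselves are assumed to come from the paired SQL queries or API calls):

```python
def row_counts_match(source_count, target_count, tolerance_percentage=0.0):
    """Return True when the target count is within the allowed percentage of the source count."""
    if source_count == 0:
        return target_count == 0
    deviation = abs(source_count - target_count) / source_count * 100
    return deviation <= tolerance_percentage
```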
This section defines the overall methodology and tools for executing the data migration.
* Phased Migration: Data will be migrated in phases, starting with foundational data (e.g., Accounts, Products), followed by transactional data (e.g., Contacts, Opportunities, Orders). This allows for earlier validation and reduces risk.
* Alternative: Big Bang Cutover: All data migrated over a single, extended downtime window. (Less recommended for complex migrations).
* Recommendation: Phased approach is preferred for [Project Name] due to [reason, e.g., large data volume, complexity, need for early user feedback].
* ETL Tool: [e.g., Informatica PowerCenter, Talend Data Integration, SSIS, Custom Python/Java Scripts, Salesforce Data Loader (for specific entities)]
* Database Tools: [e.g., SQL Server Management Studio, Oracle SQL Developer]
* Version Control: [e.g., Git for transformation scripts and mapping documents]
* Project Management: [e.g., Jira, Azure DevOps]
**Migration Environments:**
* Development/Sandbox: Initial migration runs for script development and testing.
* Staging/UAT: Full migration run on a near-production environment for extensive testing and user acceptance.
* Production: Final cutover migration.
* Estimated Downtime: [e.g., 4-8 hours for the final cutover, specific to transactional data entities].
* Strategy to Minimize Downtime:
* Pre-load static and historical data where possible.
* Perform delta migrations for critical, frequently changing data leading up to cutover.
* Schedule cutover during off-peak hours (e.g., weekend).
* Thorough testing to reduce unforeseen issues during cutover.
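The delta-load approach above can be sketched with a watermark on LastModifiedDate: each run migrates only records changed since the previous run. The record shape and watermark storage below are assumptions for illustration:

```python
def select_delta(records, last_watermark):
    """Return records modified after the watermark, plus the new watermark value."""
    delta = [r for r in records if r["LastModifiedDate"] > last_watermark]
    # Advance the watermark to the newest change seen; keep the old one if nothing changed.
    new_watermark = max((r["LastModifiedDate"] for r in delta), default=last_watermark)
    return delta, new_watermark
```

ISO-formatted date strings compare correctly as plain strings, which keeps the sketch self-contained; a real run would persist the watermark between executions.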
A robust rollback plan is essential to recover from unforeseen issues or failures during the migration.
* Source System Backup: A full, verified backup of the source database will be taken immediately prior to initiating the production migration. This backup will be stored securely and be recoverable.
* Target System Backup/Snapshot: If the target system allows, a full snapshot or backup will be taken immediately prior to the production data load.