This document provides a comprehensive set of code components and configuration examples designed to support the planning and execution of your data migration. These scripts and configurations serve as a foundational framework, illustrating best practices for field mapping, data transformation, validation, and rollback procedures.
The code is written in Python, a widely adopted language for data manipulation and scripting, ensuring readability, maintainability, and extensibility.
The generated code takes a modular, configurable approach to data migration, separating concerns into distinct components: configuration loading, shared utilities (logging, credential resolution), extraction, transformation, validation, and rollback.
This framework is designed to be adapted to your specific data sources (e.g., relational databases, flat files, APIs) and target systems.
A configuration-driven approach is crucial for managing complex migrations. The `migration_config.yaml` file defines all parameters, making the migration process transparent and easy to modify without changing core logic.
migration_config.yaml
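As an illustrative sketch only, the file might be structured as follows. The `password_env_var` key mirrors the lookup performed in `utils.py`; the remaining keys (`source`, `target`, `migration`, the entity `mappings`) are assumptions about a plausible schema, not a fixed contract.

```yaml
# migration_config.yaml -- illustrative structure, adapt keys to your framework
log_level: INFO

source:
  type: sqlserver
  host: legacy-db.internal
  database: lcms
  user: migration_user
  password_env_var: SOURCE_DB_PASSWORD   # resolved at runtime, never stored here

target:
  type: salesforce
  auth: oauth2
  bulk_api_version: "2.0"

migration:
  batch_size: 5000
  entities:
    - name: customers
      source_table: Customers
      target_object: Account
      mappings:
        - {source: CustomerID, target: External_ID__c}
        - {source: CompanyName, target: Name, transform: trim}
  validation:
    row_count_tolerance: 0.0
```

Keeping every mapping and tolerance in this file means a reviewer can audit the migration without reading the Python code.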
---

### 3. Core Migration Framework (Python Code)

This section provides the Python modules that implement the logic defined in the `migration_config.yaml`.

#### 3.1. `utils.py` - General Utilities
The following study plan is designed for an individual or team aiming to master the principles and practices of data migration planning and architecture, complementing the technical framework in this document.
This study plan outlines an intensive 8-week program designed to equip individuals with the knowledge and practical skills required to plan, design, and execute successful data migration projects. The curriculum covers foundational concepts, architectural considerations, practical methodologies, and essential tools, culminating in the ability to architect robust data migration solutions.
Target Audience: Aspiring Data Migration Specialists, Solution Architects, Data Engineers, Project Managers involved in data-intensive projects.
Upon successful completion of this study plan, participants will be able to plan, design, and execute data migration projects end to end, covering discovery, architecture, mapping and transformation, validation, rollback, and cutover.
This schedule provides a structured approach, dedicating approximately 15-20 hours per week (mix of self-study, practical exercises, and resource review).
#### Week 1: Data Migration Fundamentals & Project Planning

* Introduction to Data Migration: Definition, types, common triggers (mergers, system upgrades, cloud adoption).
* Data Migration Lifecycle: Discovery, Design, Build, Test, Execute, Validate, Decommission.
* Key Roles & Responsibilities in Data Migration.
* Project Scoping & Feasibility Analysis.
* Identifying Stakeholders & Communication Planning.
* Introduction to Data Governance and Compliance in Migration.
* Read foundational articles/chapters.
* Participate in discussion forums on migration challenges.
* Begin drafting a high-level project charter for a hypothetical migration scenario.
#### Week 2: Source & Target System Analysis and Data Profiling

* Deep Dive into Source System Analysis: Data models (relational, NoSQL), schemas, data types, constraints, relationships, data volume.
* Target System Analysis: Design goals, new schema definition, integration points.
* Data Profiling Techniques: Identifying data quality issues (missing values, inconsistencies, duplicates), data distribution, cardinality.
* Metadata Management: Importance and tools.
* Legacy System Challenges & Strategies.
* Practice data profiling using sample datasets (e.g., SQL queries, Python scripts with Pandas).
* Document schema differences between a hypothetical source and target system.
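The profiling exercise above can be sketched with pandas. The sample data and column names below are hypothetical, standing in for an extract of a source customer table.

```python
import pandas as pd

# Hypothetical sample data standing in for an extract of a source Customers table.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4],
    "CompanyName": ["Acme Corp", None, "Beta Co", "Beta Co"],
    "Status": ["Active", "Pending", "Active", "Inactive"],
})

# Basic profile: missing values, duplicate keys, and value distribution.
missing_counts = df.isna().sum()                      # nulls per column
duplicate_keys = df["CustomerID"].duplicated().sum()  # repeated primary keys
status_distribution = df["Status"].value_counts()     # cardinality of a code field

print(missing_counts["CompanyName"], duplicate_keys)
```

The same three checks (nulls, key duplicates, code-field cardinality) translate directly into SQL `COUNT`/`GROUP BY` queries for profiling in place.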
#### Week 3: Data Extraction Strategies & Tooling

* Extraction Methods: Full extract, incremental extraction (CDC - Change Data Capture), API-based extraction, database replication.
* Performance Optimization during Extraction: Batching, parallelization, indexing.
* Introduction to ETL/ELT Tools: Overview of popular tools (e.g., Apache NiFi, Talend, Informatica, AWS DMS, Azure Data Factory, Google Cloud Dataflow).
* Scripting for Extraction: SQL, Python, Shell scripting.
* Research and compare 2-3 ETL tools relevant to data migration.
* Write a Python script to extract data from a CSV file and perform basic initial cleansing.
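A minimal sketch of the CSV extraction exercise above, using only the standard library. The file contents and field names are hypothetical; rejected rows would be routed to an error file in a real run.

```python
import csv
import io

# Hypothetical CSV extract; in practice this would come from open("customers.csv").
raw = io.StringIO(
    "CustomerID,CompanyName,Email\n"
    "1,  Acme Corp ,sales@acme.test\n"
    "2,,unknown\n"
)

rows, rejected = [], []
for record in csv.DictReader(raw):
    # Basic cleansing: trim whitespace and reject rows missing a required field.
    record = {k: v.strip() for k, v in record.items()}
    if not record["CompanyName"]:
        rejected.append(record)  # route to an error file in a real run
        continue
    rows.append(record)

print(len(rows), len(rejected))
```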
#### Week 4: Data Mapping & Transformation

* Data Mapping: Field-level mapping, data type conversions, composite keys.
* Transformation Rules: Cleansing (deduplication, standardization, imputation), enrichment, aggregation, derivation, lookup transformations.
* Handling Complex Data Structures: Hierarchical data, XML/JSON transformations.
* Error Handling during Transformation: Logging, error rows, data rejection.
* Business Rule Implementation & Validation.
* Create a detailed data mapping document for a small dataset, including transformation rules.
* Implement several transformation rules using SQL or Python on a sample dataset.
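Two of the transformation rules named above (case standardization and deduplication) can be sketched as follows. The records and the composite business key are illustrative assumptions.

```python
# Illustrative implementations of two common transformation rules:
# case standardization and deduplication on a composite business key.
records = [
    {"CompanyName": "ACME CORP", "Email": "sales@acme.test"},
    {"CompanyName": "acme corp", "Email": "sales@acme.test"},  # duplicate after standardization
    {"CompanyName": "Beta Co", "Email": "info@beta.test"},
]

def standardize(rec):
    rec = dict(rec)
    rec["CompanyName"] = rec["CompanyName"].strip().title()  # "ACME CORP" -> "Acme Corp"
    return rec

seen, deduped = set(), []
for rec in map(standardize, records):
    key = (rec["CompanyName"], rec["Email"])  # composite business key
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

print([r["CompanyName"] for r in deduped])
```

Standardizing *before* deduplicating matters: the first two records only collide once their names are normalized to the same form.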
#### Week 5: Data Loading & Cutover Strategies

* Loading Methods: Direct inserts, bulk loading, API-based loading, streaming inserts.
* Performance Considerations for Loading: Indexing, constraints, triggers, transaction management.
* Rollback Mechanisms: Planning for failure, transaction control.
* Error Handling during Loading: Logging, retry mechanisms.
* Migration Cutover Strategies: Big Bang vs. Phased Approach, Parallel Run.
* Simulate a bulk load operation (e.g., using COPY command in PostgreSQL/Snowflake or a Python bulk insert).
* Design a basic rollback plan for a failed loading phase.
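The bulk-load and rollback exercises above can be combined in one sketch. SQLite stands in for the real target here; the table and batches are hypothetical.

```python
import sqlite3

# Sketch of transactional bulk loading with automatic rollback on failure,
# using an in-memory SQLite database as a stand-in for the real target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

good_batch = [(1, "Acme Corp"), (2, "Beta Co")]
bad_batch = [(3, "Gamma Inc"), (4, None)]  # violates NOT NULL -> whole batch rolls back

def bulk_load(conn, batch):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.executemany("INSERT INTO account (id, name) VALUES (?, ?)", batch)
        return True
    except sqlite3.IntegrityError:
        return False  # log and send the batch to an error queue in a real run

ok1 = bulk_load(conn, good_batch)
ok2 = bulk_load(conn, bad_batch)
loaded = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
print(ok1, ok2, loaded)
```

Because the failing batch is wrapped in a single transaction, the valid `Gamma Inc` row is rolled back along with the bad one, leaving the target in a known state.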
#### Week 6: Validation, Testing & Security

* Pre-Migration Validation: Source data quality checks, data profiling revisit.
* Post-Migration Validation: Row counts, checksums, reconciliation reports, sample data verification, business rule validation.
* Data Quality Metrics & Reporting.
* Testing Strategies: Unit, Integration, User Acceptance Testing (UAT).
* Security in Data Migration: Data at rest/in transit encryption, access controls, anonymization/pseudonymization.
* Compliance: GDPR, HIPAA, PCI-DSS considerations.
* Develop a set of SQL queries for post-migration data validation (e.g., row counts, sum of key columns).
* Outline a data security plan for a cloud migration project.
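The post-migration validation exercise above (row counts plus a checksum-style sum) can be sketched like this. SQLite stands in for both source and target, and the table names are hypothetical.

```python
import sqlite3

# Sketch of post-migration reconciliation: compare row counts and SUM of a key
# numeric column between a "source" and a "target" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER, total REAL);
    CREATE TABLE target_orders (id INTEGER, total REAL);
    INSERT INTO source_orders VALUES (1, 10.5), (2, 20.0);
    INSERT INTO target_orders VALUES (1, 10.5), (2, 20.0);
""")

def reconcile(conn, source, target):
    src_count, src_sum = conn.execute(f"SELECT COUNT(*), SUM(total) FROM {source}").fetchone()
    tgt_count, tgt_sum = conn.execute(f"SELECT COUNT(*), SUM(total) FROM {target}").fetchone()
    return src_count == tgt_count and src_sum == tgt_sum

match = reconcile(conn, "source_orders", "target_orders")
print(match)
```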
#### Week 7: Cloud Migration & Advanced Topics

* Cloud Data Migration Strategies: Lift-and-shift, re-platforming, refactoring.
* Cloud-Native Migration Services (e.g., AWS DMS, Azure Migrate, Google Cloud Migrate for Compute Engine).
* Real-time Data Migration vs. Batch.
* DevOps for Data Migration: CI/CD pipelines.
* Project Management for Data Migration: Risk management, change management, communication.
* Vendor Selection & Management.
* Research a specific cloud migration service and its capabilities.
* Create a risk register for a hypothetical data migration project.
#### Week 8: Review & Capstone Project

* Comprehensive review of all previous topics.
* Presentation skills for project proposals.
* Capstone Project: Design a complete data migration plan (architecture, mapping, transformation, validation, rollback, timeline) for a complex hypothetical scenario. This will involve integrating all learned concepts.
* Peer review and feedback sessions on capstone projects.
* Final Q&A and knowledge consolidation.
**Books:**

* "Designing Data-Intensive Applications" by Martin Kleppmann (for foundational data system knowledge).
* "Practical Data Migration" by Johny Morris (specific to data migration).
* "The DAMA Guide to the Data Management Body of Knowledge (DMBOK2)" (for data governance and quality).
**Online Courses:**

* Coursera/edX: "Data Engineering with Google Cloud," "AWS Data Analytics Specialization," "Microsoft Azure Data Engineer Associate."
* Udemy/Pluralsight: Courses on specific ETL tools (Talend, Informatica), SQL, Python for Data Engineering.
* Cloud Provider Documentation: AWS, Azure, Google Cloud data migration services documentation.
**Tools for Hands-On Practice:**

* Databases: PostgreSQL, MySQL, SQL Server, MongoDB (for source/target practice).
* ETL/ELT Tools: Talend Open Studio, Apache NiFi, dbt (data build tool).
* Programming Languages: Python (with libraries like Pandas, SQLAlchemy), SQL.
* Version Control: Git/GitHub.
**Industry Reading:**

* Blogs from major cloud providers (AWS, Azure, GCP) on data migration.
* Medium articles on data engineering and migration case studies.
* Gartner/Forrester reports on data migration trends and tools.
**Relevant Certifications:**

* AWS Certified Data Analytics – Specialty
* Microsoft Certified: Azure Data Engineer Associate
* Google Cloud Professional Data Engineer
```python
import os
import yaml
import logging
from datetime import datetime


def setup_logging(log_level="INFO"):
    """
    Configures the global logger.
    """
    logging.basicConfig(
        level=getattr(logging, log_level.upper(), logging.INFO),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f"migration_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"),
            logging.StreamHandler()
        ]
    )
    return logging.getLogger(__name__)


logger = setup_logging()


def load_config(config_path="migration_config.yaml"):
    """
    Loads the YAML configuration file.

    Args:
        config_path (str): Path to the YAML configuration file.

    Returns:
        dict: The loaded configuration.
    """
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        logger.info(f"Configuration loaded successfully from {config_path}")
        return config
    except FileNotFoundError:
        logger.error(f"Configuration file not found at {config_path}")
        raise
    except yaml.YAMLError as e:
        logger.error(f"Error parsing YAML configuration: {e}")
        raise


def get_db_credentials(db_config):
    """
    Retrieves database credentials, prioritizing environment variables for passwords.

    Args:
        db_config (dict): Dictionary containing database connection details.

    Returns:
        dict: Updated database configuration with resolved password.
    """
    credentials = db_config.copy()
    if 'password_env_var' in credentials:
        env_var_name = credentials['password_env_var']
        password = os.getenv(env_var_name)
        if password:
            credentials['password'] = password
            logger.debug(f"Password for {credentials.get('database')} retrieved from environment variable '{env_var_name}'.")
        else:
            logger.warning(f"Environment variable '{env_var_name}' not set for database password. Proceeding without password or with default.")
            credentials['password'] = credentials.get('password', '')  # Fallback to empty string if not found
    return credentials


def get_db_connection(db_config):
    """
    Placeholder function to establish a database connection.
    In a real scenario, this would import and use specific DB drivers (e.g., psycopg2, sqlalchemy).

    Args:
        db_config (dict): Database connection details.

    Returns:
        object: A database connection object (e.g., psycopg2 connection).

    Raises:
        NotImplementedError: If the specific database type is not handled.
    """
    credentials = get_db_credentials(db_config)
    db_type = credentials.get('type', 'unknown')
    # Plug in the appropriate driver here (e.g., psycopg2.connect(**credentials)).
    raise NotImplementedError(f"No connection handler implemented for database type '{db_type}'.")
```
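The `password_env_var` pattern used by `get_db_credentials` above can be exercised in isolation. This standalone sketch restates the logic so it runs on its own; the environment variable name and config keys are the same illustrative ones used throughout.

```python
import os

# Standalone restatement of the password_env_var resolution pattern:
# prefer a password from the environment, never from the config file itself.
def resolve_password(db_config):
    config = dict(db_config)
    env_var = config.get("password_env_var")
    if env_var and os.getenv(env_var):
        config["password"] = os.environ[env_var]
    else:
        config.setdefault("password", "")  # fall back rather than fail hard
    return config

os.environ["SOURCE_DB_PASSWORD"] = "s3cret"  # would be set by the CI/ops environment
resolved = resolve_password({"database": "lcms", "password_env_var": "SOURCE_DB_PASSWORD"})
print(resolved["password"])
```

Keeping secrets out of `migration_config.yaml` means the file can be committed to version control safely.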
This document outlines the comprehensive plan for the upcoming data migration, serving as a critical deliverable for all stakeholders. It details the strategy, technical specifications, operational procedures, and timeline estimates required to ensure a successful and seamless transition of data from the [Source System Name] to the [Target System Name].
This document details the complete data migration plan from [Source System Name, e.g., Legacy CRM 1.0] to [Target System Name, e.g., Salesforce Sales Cloud]. The objective is to securely and accurately transfer all defined customer, product, and historical transaction data, ensuring data integrity, minimizing downtime, and providing a robust foundation for future operations. This plan covers data mapping, transformation rules, validation procedures, rollback strategies, and a detailed timeline, designed to mitigate risks and ensure a smooth transition.
* Data accuracy: >99.5% match between source and target for key fields.
* Data completeness: 100% of scoped records migrated.
* Downtime: Max [X] hours for critical systems during cutover.
* Zero critical data loss or corruption.
* Successful user acceptance testing (UAT).
The migration will follow a phased approach to minimize risk and allow for iterative testing and validation.
* Name: [e.g., "Legacy Customer Management System (LCMS)"]
* Database: [e.g., SQL Server 2016]
* Key Tables/Objects: Customers, Products, Orders, OrderItems, Users, Cases
* Authentication: [e.g., SQL Server Authentication]
* Access Method: [e.g., ODBC connection via secure VPN]
* Name: [e.g., "Salesforce Sales Cloud"]
* Objects: Account, Contact, Product2, Opportunity, Order, Case, User
* Authentication: [e.g., OAuth 2.0 via connected app]
* API: [e.g., Salesforce Bulk API 2.0 for large volumes, SOAP API for individual records]
| Data Entity | Source Table/Object | Target Object | Estimated Record Count (Source) | Estimated Data Volume (GB) | Comments |
| :---------- | :------------------ | :------------ | :------------------------------ | :------------------------- | :------- |
| Customers | Customers | Account | 500,000 | 0.5 | Includes company accounts |
| Contacts | Customers | Contact | 1,200,000 | 1.2 | Linked to Accounts |
| Products | Products | Product2 | 15,000 | 0.01 | Active products only |
| Orders | Orders, OrderItems | Order (with related OrderItem) | 3,000,000 (last 5 years) | 3.0 | Includes line items |
| Cases | Cases | Case | 800,000 (last 2 years) | 0.8 | Closed and Open cases |
| Users | Users | User | 150 | 0.001 | Active users only |
| Total | | | ~5.5 Million Records | ~5.5 GB | Excludes attachments/blobs |
This section outlines the detailed field-level mapping and the necessary transformation rules to ensure data compatibility and quality in the target system.
| Source Object.Field | Source Data Type | Target Object.Field | Target Data Type | Transformation Rule | Validation Check |
| :------------------ | :--------------- | :------------------ | :--------------- | :------------------ | :--------------- |
| Customers.CustomerID | INT | Account.External_ID__c | Text (Unique) | Direct Map | Not Null, Unique |
| Customers.CompanyName | NVARCHAR(255) | Account.Name | Text (255) | Direct Map, Trim | Not Null, Min Length 3 |
| Customers.Status | VARCHAR(10) | Account.Status__c | Picklist | CASE statement: Active->Active, Inactive->Inactive, Pending->Prospect | Valid Picklist Value |
| Customers.CreatedDate | DATETIME | Account.CreatedDate | DateTime | Direct Map, UTC Conversion | Not Null, Valid Date |
| Customers.AddressLine1 | NVARCHAR(255) | Account.BillingStreet | Text (255) | Concatenate with AddressLine2 if not null | Not Null |
| Customers.AddressLine2 | NVARCHAR(255) | (Part of BillingStreet) | - | Concatenate with AddressLine1 | - |
| Customers.City | NVARCHAR(100) | Account.BillingCity | Text (100) | Direct Map | Not Null |
| Customers.ZipCode | VARCHAR(10) | Account.BillingPostalCode | Text (20) | Format to XXXXX-XXXX if needed | Valid US Zip Code |
| Orders.OrderTotal | DECIMAL(18,2) | Order.TotalAmount | Currency (18,2) | Direct Map, Round to 2 decimal places | > 0 |
| Users.LoginName | VARCHAR(50) | User.Username | Email | Convert to email format: login@yourdomain.com | Valid Email Format, Unique |
* All date/time fields from source (e.g., DATETIME, TIMESTAMP) will be converted to ISO 8601 format and stored as DateTime in the target system, with UTC conversion where applicable.
* Numeric fields (e.g., DECIMAL, INT) will be converted to the corresponding Currency, Number, or Integer types in the target system, with precision and scale adjustments as per target system requirements.
* Status Codes: Source Customers.Status values (Active, Inactive, Pending, Archived) will be mapped to Account.Status__c picklist values (Active, Inactive, Prospect, Archived) via a lookup table.
* User IDs: Source Users.UserID will be mapped to existing or newly created User.Id in the target system. Ownership of records will be assigned based on this mapping.
* Customers.AddressLine1 and Customers.AddressLine2 will be concatenated into Account.BillingStreet, separated by a newline character if AddressLine2 is present.
* Customers.FirstName and Customers.LastName will be mapped directly to Contact.FirstName and Contact.LastName.
* If a non-nullable target field has no corresponding source field or the source field is null, a predefined default value will be applied (e.g., Account.RecordType will default to 'Standard Account').
* Trim Whitespace: All text fields will have leading/trailing whitespace removed.
* Case Standardization: Company names will be converted to Title Case where appropriate (e.g., "ACME CORP" -> "Acme Corp").
* Phone Numbers: Standardize to E.164 format (a leading +, country code, then digits with no spaces or punctuation, e.g., +14155552671).
* Email Addresses: Validate format; flag invalid emails for review or skip migration.
* Prior to migration, a de-duplication pass will be performed on source data based on a combination of CompanyName and PrimaryContactEmail for Accounts, and FirstName, LastName, Email for Contacts.
* During migration, the target system's native de-duplication rules will be leveraged, and any identified duplicates will be logged for manual review.
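Three of the rules above (the status picklist lookup, address concatenation, and UTC date conversion) can be sketched in one transformation function. The sample row is hypothetical; the field names follow the mapping table.

```python
from datetime import datetime, timezone

# Illustrative implementation of three rules from the mapping above:
# status lookup, address concatenation, and UTC/ISO 8601 date conversion.
STATUS_MAP = {"Active": "Active", "Inactive": "Inactive",
              "Pending": "Prospect", "Archived": "Archived"}

def transform_customer(row):
    street = row["AddressLine1"]
    if row.get("AddressLine2"):
        street += "\n" + row["AddressLine2"]  # newline-separated, per the rule above
    return {
        "Status__c": STATUS_MAP[row["Status"]],
        "BillingStreet": street,
        "CreatedDate": row["CreatedDate"].astimezone(timezone.utc).isoformat(),
    }

account = transform_customer({
    "Status": "Pending",
    "AddressLine1": "1 Main St",
    "AddressLine2": "Suite 4",
    "CreatedDate": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
})
print(account["Status__c"], account["BillingStreet"])
```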
A multi-stage validation approach will be implemented to ensure data quality and integrity throughout the migration process.
* Script: Python/SQL script to compare record COUNT(*) for each entity in the source and target systems.
* Expected Outcome: COUNT(Source.Entity) == COUNT(Target.Entity)
* Tolerance: 0% deviation for critical entities (Accounts, Contacts, Orders); <0.5% for less critical (e.g., historical cases where some might be intentionally skipped).
* Script: Randomly select 5% of records for each major entity. Extract key fields from source and target for these records.
* Procedure: Manual review by business users and QA team to visually confirm accuracy of mapped fields, especially transformed data.
* Script: Generate reports comparing sum of key numeric fields (e.g., SUM(OrderTotal), AVG(ProductPrice)) between source and target for migrated data.
* Expected Outcome: SUM(Source.Field) == SUM(Target.Field)
* Tolerance: 0% deviation for financial values.
* Script: Verify parent-child relationships in the target system (e.g., all Contacts are linked to an Account, all OrderItems are linked to an Order).
* All transformation errors, skipped records, and validation failures will be logged with detailed messages, timestamps, and source record identifiers.
* Daily/hourly reports will be generated during migration execution to track progress and identify issues promptly.
* Business users will perform a comprehensive User Acceptance Testing (UAT) on the migrated data in a dedicated sandbox environment.
* Test cases will include:
* Searching for specific customers/orders.
* Creating new records and verifying auto-population/defaults.
* Running standard reports to check aggregated data.
* Verifying workflows and automation rules with migrated data.
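The parent-child relationship check described in the validation steps above reduces to an "orphan report." This sketch uses SQLite and hypothetical table names; the same LEFT JOIN pattern applies to any SQL target.

```python
import sqlite3

# Sketch of a post-load referential-integrity check: find OrderItems whose
# parent Order is missing in the target (an "orphan" report).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO order_items VALUES (10, 1), (11, 2), (12, 99);  -- 99 has no parent
""")

orphans = conn.execute("""
    SELECT oi.id FROM order_items oi
    LEFT JOIN orders o ON o.id = oi.order_id
    WHERE o.id IS NULL
""").fetchall()
print(orphans)
```

An empty orphan list is the pass condition; any rows returned are logged with their source identifiers for manual review.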
A comprehensive rollback plan is essential to mitigate risks in case of unforeseen issues or critical failures during the migration.
* If a full backup of the target system was taken immediately prior to the migration attempt, restore the target system to its pre-migration state. This is the fastest and most reliable method for a clean slate.
* Estimated Time: [X] hours (dependent on backup size and restoration speed).
* Execute scripts to delete all data imported during the failed migration attempt from the target system. This requires robust delete scripts that handle related records and referential integrity.
* Estimated Time: [Y] hours (dependent on data volume and target system API limits).
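The delete-based rollback option above is safest when every migrated record is tagged with the run's batch identifier. This sketch (SQLite stand-in, hypothetical `migration_batch` column) deletes children before parents inside one transaction.

```python
import sqlite3

# Sketch of a selective rollback: delete only records tagged with this migration
# run's batch id, children before parents to respect referential integrity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, migration_batch TEXT);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, migration_batch TEXT);
    INSERT INTO orders VALUES (1, 'run-42'), (2, 'pre-existing');
    INSERT INTO order_items VALUES (10, 1, 'run-42');
""")

def rollback_batch(conn, batch_id):
    with conn:  # single transaction: all deletes succeed or none do
        conn.execute("DELETE FROM order_items WHERE migration_batch = ?", (batch_id,))
        conn.execute("DELETE FROM orders WHERE migration_batch = ?", (batch_id,))

rollback_batch(conn, "run-42")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)
```

Tagging records at load time is what makes this cheaper and less disruptive than a full restore: pre-existing target data is untouched.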
The following timeline provides estimated durations for each phase. Actual durations may vary based on discovery findings and unforeseen challenges.
| Phase | Start Date | End Date | Duration (Weeks) | Key Activities |
| :---- | :--------- | :------- | :--------------- | :------------- |