Advanced data preparation techniques, feature engineering, and data preprocessing specifically for machine learning workflows on AWS.
Learners will master the data preparation and feature engineering techniques essential for ML model success. They will practice data ingestion, cleaning, transformation, feature creation, and validation using AWS tools such as SageMaker Data Wrangler, AWS Glue, and built-in preprocessing capabilities. They will also learn how to handle missing data, outliers, categorical encoding, and feature scaling while ensuring data quality and integrity for production ML systems.
Systematic approach to data quality assessment including data profiling, completeness analysis, consistency checks, and readiness evaluation for ML workflows.
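As a minimal sketch of completeness analysis, the snippet below computes the share of non-missing values per column. The `completeness` helper, the sample rows, and the choice to treat `None` and empty strings as missing are all illustrative assumptions, not part of any AWS tool.

```python
def completeness(rows, columns):
    """Return the fraction of non-missing values for each column."""
    total = len(rows)
    report = {}
    for col in columns:
        # Assumption: None and the empty string both count as missing.
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        report[col] = present / total if total else 0.0
    return report


rows = [
    {"age": 34, "city": "Austin"},
    {"age": None, "city": "Boston"},
    {"age": 51, "city": ""},
    {"age": 29, "city": "Denver"},
]
profile = completeness(rows, ["age", "city"])
print(profile)
```

A profile like this is usually the first readiness check: columns far below full completeness are candidates for imputation or removal.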
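Comprehensive missing data analysis covering the three missingness patterns, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), plus advanced imputation techniques spanning statistical, ML-based, and domain-specific approaches.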
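A simple statistical imputation baseline can be sketched as follows. The `impute_median` helper and the sample data are hypothetical; the example fills gaps with the median and keeps a missingness indicator so a downstream model can still use the fact that a value was absent.

```python
from statistics import median


def impute_median(values):
    """Fill None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    # Indicator column: True where the original value was missing.
    was_missing = [v is None for v in values]
    return imputed, was_missing


ages = [22, None, 35, 41, None, 29]
imputed, flags = impute_median(ages)
print(imputed)
print(flags)
```

In a real workflow the fill value would be computed on the training split only and reused for validation and test data.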
Advanced outlier detection including statistical methods, isolation forests, local outlier factor, and context-aware outlier treatment strategies.
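One of the statistical methods mentioned above, the interquartile range (IQR) rule, might look like this in plain Python. The `iqr_outliers` name, the sample readings, and the conventional 1.5 multiplier are illustrative choices.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept depends on context: the same value can be a sensor fault in one dataset and a genuine rare event in another.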
Comprehensive categorical encoding including one-hot encoding, ordinal encoding, target encoding, binary encoding, and advanced techniques like entity embeddings.
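The two most basic encodings can be sketched without any library. The `one_hot` and `ordinal` helpers below are hypothetical; one-hot encoding is shown for a nominal column and ordinal encoding for a column with a known order supplied by the caller.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal(values, order):
    """Map each value to its position in an explicitly ordered level list."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
sizes = ordinal(["small", "large", "medium"], order=["small", "medium", "large"])
print(categories, encoded)
print(sizes)
```

One-hot encoding grows one column per category, which is why high-cardinality columns are often handled with target encoding or embeddings instead.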
Advanced scaling techniques including standardization, min-max scaling, robust scaling, quantile transformation, and power transformations for various ML algorithms.
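Standardization and min-max scaling reduce to a few lines each. This is a sketch with hypothetical helper names; it assumes the input is not constant, since a zero standard deviation or zero range would cause a division by zero.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [30.0, 40.0, 50.0, 60.0, 70.0]
z = standardize(incomes)
scaled = min_max(incomes)
print(z)
print(scaled)
```

In both cases the statistics (mean, standard deviation, minimum, maximum) should come from the training data and then be applied unchanged to new data.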
Creative feature engineering including polynomial features, interaction terms, binning, aggregations, time-based features, and domain-specific transformations.
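To make two of these ideas concrete, the sketch below adds an interaction term and a binned feature to a single record. The field names, the bin edges, and the `add_features` helper are invented for illustration.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin that value falls into, given sorted edges."""
    return bisect_right(edges, value)


def add_features(row):
    """Return a copy of row with an interaction term and a binned age."""
    out = dict(row)
    # Interaction term: the product of two raw features.
    out["area"] = row["width"] * row["height"]
    # Binning: ages below 18 map to 0, 18-34 to 1, 35-64 to 2, 65 and up to 3.
    out["age_bin"] = bin_index(row["age"], edges=[18, 35, 65])
    return out


features = add_features({"width": 3.0, "height": 4.0, "age": 42})
print(features)
```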
Comprehensive feature selection including filter, wrapper, and embedded methods, plus dimensionality reduction techniques like PCA, LDA, and t-SNE.
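The simplest filter method is a variance threshold: features that barely vary cannot help separate examples. The following is a minimal sketch with a hypothetical `variance_filter` helper and made-up columns, not a substitute for wrapper or embedded methods.

```python
from statistics import pvariance


def variance_filter(columns, threshold=0.0):
    """Keep the names of columns whose population variance exceeds threshold."""
    return [
        name for name, values in columns.items()
        if pvariance(values) > threshold
    ]


columns = {
    "constant": [1, 1, 1, 1],
    "height": [1.6, 1.7, 1.8, 1.9],
    "flag": [0, 1, 0, 1],
}
kept = variance_filter(columns, threshold=0.0)
print(kept)
```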
Advanced data validation including schema validation, statistical tests, data drift detection, and bias identification across different demographic groups.
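One common way to quantify data drift is the population stability index (PSI), which compares how two samples distribute over the same bins. The sketch below is an assumption-laden illustration: the `psi` helper, the bin edges, and the small floor used to avoid taking the logarithm of zero are all choices made for this example.

```python
import math


def psi(expected, actual, edges):
    """Population stability index between two samples over shared bin edges."""

    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for v in sample:
            # The bin index is the number of edges the value exceeds.
            counts[sum(v > e for e in edges)] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 9, 10, 11, 12]
no_drift = psi(baseline, baseline, edges=[3, 6])
drift = psi(baseline, shifted, edges=[3, 6])
print(no_drift, drift)
```

Identical samples score zero, and the score grows as the two distributions diverge; the alert threshold is a judgment call for each feature.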
Time-series-specific preprocessing including trend and seasonality decomposition, lag feature creation, rolling statistics, and temporal aggregations.
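Lag features and rolling statistics can be sketched in plain Python as below. The `lag` and `rolling_mean` helpers and the sales series are hypothetical; positions without enough history are filled with `None`.

```python
def lag(series, k):
    """Shift the series forward by k steps (k >= 1); the first k slots are None."""
    return [None] * k + series[:len(series) - k]


def rolling_mean(series, window):
    """Trailing mean over the last `window` values, including the current one."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out


sales = [10, 12, 14, 16, 18]
lag_1 = lag(sales, 1)
roll_3 = rolling_mean(sales, 3)
print(lag_1)
print(roll_3)
```

Because this rolling mean includes the current value, it would leak the target if the same series is being forecast; in that case the window is typically applied to the lagged series instead.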
Advanced SageMaker Data Wrangler usage including visual transformations, custom transforms, data insights, bias detection, and integration with SageMaker Pipelines.
Advanced data cleaning methods including duplicate detection and removal, data type optimization, text cleaning, and data standardization for ML preprocessing.
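Text cleaning and duplicate removal often go together, because near-identical records only match after normalization. The sketch below uses hypothetical `clean_text` and `deduplicate` helpers and invented records; the normalization rules (collapse whitespace, trim, lowercase) are assumptions that a real dataset may need to extend.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records, key):
    """Keep the first record for each normalized value of `key`."""
    seen, unique = set(), []
    for rec in records:
        normalized = clean_text(rec[key])
        if normalized not in seen:
            seen.add(normalized)
            unique.append({**rec, key: normalized})
    return unique


records = [
    {"name": "  Acme  Corp "},
    {"name": "acme corp"},
    {"name": "Globex"},
]
unique = deduplicate(records, "name")
print(unique)
```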
Comprehensive AWS Glue usage including ETL job creation, Glue Data Catalog management, crawlers, transformations, and integration with ML pipelines.