Machine Learning Integration
75 min

Pandas integrates seamlessly with scikit-learn, the standard Python machine learning library. scikit-learn accepts NumPy arrays and Pandas DataFrames as input, which makes Pandas the de facto tool for preparing data for ML models and for building complete ML pipelines.
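A minimal sketch of what "seamless" means in practice: a scikit-learn estimator can be fit directly on DataFrame columns with no manual conversion. The data here is a toy, made-up example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy DataFrame: scikit-learn estimators accept it directly
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "passed": [0, 0, 0, 1, 1, 1],
})

model = LogisticRegression()
model.fit(df[["hours_studied"]], df["passed"])  # DataFrame in, no conversion needed
print(model.predict(df[["hours_studied"]]))
```

Passing a DataFrame (rather than `.to_numpy()`) also lets scikit-learn remember feature names, which improves error messages and downstream tooling.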
Data preprocessing and feature engineering are crucial for ML success: clean, well-transformed data often improves model performance more than switching algorithms. Preprocessing covers handling missing values, encoding categorical variables, and scaling or normalizing features; feature engineering derives new, more informative features from existing ones.
ML models require numeric, complete input. Categorical variables must therefore be encoded as numbers, and missing values must be imputed or removed before training; checking data types up front prevents subtle errors later.
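The two requirements above can be shown in a few lines. This is a sketch on a toy DataFrame with hypothetical column names; median imputation and one-hot encoding are one common choice among several.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [50000, None, 62000, 58000],
    "city": ["Austin", "Boston", "Austin", "Chicago"],
})

# Impute the missing numeric value with the column median
df["income"] = df["income"].fillna(df["income"].median())

# One-hot encode the nominal column; drop_first avoids a redundant dummy
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())
# -> ['income', 'city_Boston', 'city_Chicago']
```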
Feature selection and dimensionality reduction improve ML models by removing irrelevant features and reducing complexity. Feature selection keeps only the most informative columns, while dimensionality reduction techniques such as PCA compress many features into fewer components while preserving most of the information.
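The two approaches can be contrasted in a short sketch on synthetic data: SelectKBest keeps a subset of the original columns, while PCA produces new composite components.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features most associated with the target (ANOVA F-test)
X_sel = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Compress all 10 features into 3 principal components
X_pca = PCA(n_components=3).fit_transform(X)

print(X_sel.shape, X_pca.shape)  # (200, 3) (200, 3)
```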
Train-test splitting separates data into training and testing sets so a model can be evaluated on data it has never seen. Use train_test_split() to create the split, typically 80/20 or 70/30. Keeping the test set untouched during preprocessing and training prevents data leakage and makes the evaluation valid.
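A minimal split on synthetic data; `stratify=y` keeps the class proportions identical in both halves, which matters for imbalanced targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0] * 25 + [1] * 25)

# 80/20 split; stratify preserves the 50/50 class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 40 10
```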
Best practices include preprocessing data before modeling, handling missing values appropriately, encoding categorical variables correctly, scaling features when the algorithm is distance- or gradient-based, splitting data before fitting any transformer, and validating each preprocessing step.
Key Concepts
- Pandas integrates seamlessly with scikit-learn for ML workflows.
- Data preprocessing and feature engineering are crucial for ML success.
- Understanding data types and handling missing values improves performance.
- Feature selection and dimensionality reduction enhance ML models.
- Train-test splitting enables proper model evaluation.
Learning Objectives
Master
- Preparing data for machine learning with Pandas
- Integrating Pandas with scikit-learn
- Performing feature engineering and preprocessing
- Splitting data for training and testing
Develop
- Understanding ML data preparation workflows
- Designing effective ML pipelines
- Appreciating Pandas' role in machine learning
Tips
- Most scikit-learn estimators accept DataFrames directly; when you need a plain array, use X = df.to_numpy() (preferred over the older df.values).
- Handle missing values before modeling: df.fillna() or df.dropna().
- Encode categorical variables: use OneHotEncoder or pd.get_dummies() for nominal features (LabelEncoder is intended for target labels).
- Scale features when using distance-based algorithms: StandardScaler or MinMaxScaler.
Common Pitfalls
- Not preprocessing data, causing poor model performance.
- Not handling missing values, causing errors.
- Not encoding categorical variables, causing type errors.
- Data leakage from test set to training set.
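The last pitfall is the subtlest: fitting a scaler (or any transformer) on the full dataset lets test-set statistics leak into training. A sketch of the leak-free pattern, using a scikit-learn Pipeline so the scaler is only ever fit on the training fold:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits StandardScaler on the training data only; the test
# set is merely transformed, so its statistics never influence training
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The same pipeline object can be passed to cross_val_score, which re-fits the scaler inside every fold and keeps the cross-validation estimate honest.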
Summary
- Pandas integrates seamlessly with scikit-learn for ML workflows.
- Data preprocessing and feature engineering are crucial for ML success.
- Proper data preparation is often more important than algorithm choice.
Exercise
Integrate pandas with scikit-learn for machine learning data preparation and analysis.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed for reproducibility
np.random.seed(42)
# Create comprehensive sample dataset
n_samples = 1000
data = {
    'age': np.random.normal(35, 12, n_samples),
    'income': np.random.normal(50000, 20000, n_samples),
    'credit_score': np.random.normal(700, 100, n_samples),
    'education_years': np.random.normal(16, 3, n_samples),
    'employment_length': np.random.normal(8, 5, n_samples),
    'debt_to_income': np.random.normal(0.3, 0.15, n_samples),
    'loan_amount': np.random.normal(200000, 100000, n_samples),
    'loan_term': np.random.choice([15, 30], n_samples),
    'property_type': np.random.choice(['Single Family', 'Condo', 'Townhouse'], n_samples),
    'occupancy': np.random.choice(['Primary', 'Secondary', 'Investment'], n_samples),
    'loan_purpose': np.random.choice(['Purchase', 'Refinance', 'Cash-out'], n_samples)
}
# Create target variables
data['loan_approved'] = (
    (data['credit_score'] > 650) &
    (data['debt_to_income'] < 0.5) &
    (data['income'] > 40000)
).astype(int)
data['interest_rate'] = (
    3.5 +
    (800 - data['credit_score']) * 0.01 +
    data['debt_to_income'] * 2 +
    (data['loan_amount'] > 300000) * 0.5
)
# Create DataFrame
df = pd.DataFrame(data)
# Add some missing values and outliers for realistic data
df.loc[np.random.choice(df.index, 50, replace=False), 'credit_score'] = np.nan
df.loc[np.random.choice(df.index, 30, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 20, replace=False), 'employment_length'] = np.nan
# Add some outliers (replace=False keeps the labels unique so the
# value arrays align with the selected rows)
df.loc[np.random.choice(df.index, 10, replace=False), 'income'] = np.random.uniform(200000, 500000, 10)
df.loc[np.random.choice(df.index, 5, replace=False), 'credit_score'] = np.random.uniform(300, 500, 5)
print("=== Dataset Overview ===")
print(f"Dataset shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nSample data:")
print(df.head())
# 1. Data Preprocessing
print("\n=== Data Preprocessing ===")
# Handle missing values
print("Handling missing values...")
df_cleaned = df.copy()
# Fill missing values with the column median (assign the result back
# rather than using chained inplace fillna, which is unreliable under
# pandas copy-on-write)
df_cleaned['credit_score'] = df_cleaned['credit_score'].fillna(df_cleaned['credit_score'].median())
df_cleaned['income'] = df_cleaned['income'].fillna(df_cleaned['income'].median())
df_cleaned['employment_length'] = df_cleaned['employment_length'].fillna(df_cleaned['employment_length'].median())
# Handle outliers using IQR method
def remove_outliers(df, column, iqr_factor=1.5):
    """Drop rows whose value lies outside Q1/Q3 +/- iqr_factor * IQR."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - iqr_factor * IQR
    upper_bound = Q3 + iqr_factor * IQR
    outliers_mask = (df[column] < lower_bound) | (df[column] > upper_bound)
    print(f"Removing {outliers_mask.sum()} outliers from {column}")
    return df[~outliers_mask]
# Remove outliers from numerical columns
numerical_cols = ['age', 'income', 'credit_score', 'education_years',
'employment_length', 'debt_to_income', 'loan_amount']
for col in numerical_cols:
    df_cleaned = remove_outliers(df_cleaned, col)
print(f"\nCleaned dataset shape: {df_cleaned.shape}")
# 2. Feature Engineering
print("\n=== Feature Engineering ===")
# Create new features
df_cleaned['income_per_year'] = df_cleaned['income'] / df_cleaned['age']
df_cleaned['credit_income_ratio'] = df_cleaned['credit_score'] / (df_cleaned['income'] / 10000)
df_cleaned['loan_to_income'] = df_cleaned['loan_amount'] / df_cleaned['income']
df_cleaned['age_group'] = pd.cut(df_cleaned['age'],
bins=[0, 25, 35, 45, 55, 100],
labels=['Young', 'Early Career', 'Mid Career', 'Late Career', 'Senior'])
# Create interaction features
df_cleaned['credit_employment'] = df_cleaned['credit_score'] * df_cleaned['employment_length']
df_cleaned['income_education'] = df_cleaned['income'] * df_cleaned['education_years']
print("New features created:")
print(df_cleaned[['income_per_year', 'credit_income_ratio', 'loan_to_income',
'credit_employment', 'income_education']].head())
# 3. Categorical Encoding
print("\n=== Categorical Encoding ===")
# Ordinal encoding: pd.cut produces an ordered categorical, and
# .cat.codes preserves the bin order (LabelEncoder would sort the
# labels alphabetically instead)
df_cleaned['age_group_encoded'] = df_cleaned['age_group'].cat.codes
# One-hot encoding for nominal categories
categorical_cols = ['property_type', 'occupancy', 'loan_purpose']
df_encoded = pd.get_dummies(df_cleaned, columns=categorical_cols, drop_first=True)
print("Categorical encoding completed:")
print(f"Encoded dataset shape: {df_encoded.shape}")
print(f"New columns: {[col for col in df_encoded.columns if col not in df_cleaned.columns]}")
# 4. Feature Selection
print("\n=== Feature Selection ===")
# Prepare features and targets
feature_cols = [col for col in df_encoded.columns if col not in
['loan_approved', 'interest_rate', 'age_group']]
X = df_encoded[feature_cols]
y_classification = df_encoded['loan_approved']
y_regression = df_encoded['interest_rate']
# Split data
X_train, X_test, y_train_class, y_test_class = train_test_split(
X, y_classification, test_size=0.2, random_state=42, stratify=y_classification
)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
X, y_regression, test_size=0.2, random_state=42
)
# Feature selection using statistical tests
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train_class)
X_test_selected = selector.transform(X_test)
selected_features = X.columns[selector.get_support()]
print(f"Selected features: {list(selected_features)}")
# Feature importance using Random Forest
rf_selector = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selector.fit(X_train, y_train_class)
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf_selector.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 most important features:")
print(feature_importance.head(10))
# 5. Machine Learning Models
print("\n=== Machine Learning Models ===")
# Classification Model
print("Training Random Forest Classifier...")
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_selected, y_train_class)
# Make predictions
y_pred_class = rf_classifier.predict(X_test_selected)
y_pred_proba = rf_classifier.predict_proba(X_test_selected)[:, 1]
# Evaluate classification model
print("\nClassification Results:")
print(f"Accuracy: {rf_classifier.score(X_test_selected, y_test_class):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_class, y_pred_class))
# Confusion Matrix
cm = confusion_matrix(y_test_class, y_pred_class)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Rejected', 'Approved'],
yticklabels=['Rejected', 'Approved'])
plt.title('Confusion Matrix - Loan Approval')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
# Regression Model
print("\nTraining Random Forest Regressor...")
# The classification split was stratified, so its rows are shuffled
# differently from the regression split; transform the regression split
# with the fitted selector instead of reusing X_train_selected
X_train_reg_selected = selector.transform(X_train_reg)
X_test_reg_selected = selector.transform(X_test_reg)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train_reg_selected, y_train_reg)
# Make predictions
y_pred_reg = rf_regressor.predict(X_test_reg_selected)
# Evaluate regression model
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)
print("\nRegression Results:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
# 6. Cross-Validation and Model Comparison
print("\n=== Cross-Validation and Model Comparison ===")
# Compare multiple models
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Linear Regression': LinearRegression()
}
# Classification models
for name, model in models.items():
    if hasattr(model, 'predict_proba'):  # classification models only
        scores = cross_val_score(model, X_train_selected, y_train_class, cv=5, scoring='accuracy')
        print(f"{name} - Cross-validation accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
# 7. Feature Importance Visualization
print("\n=== Feature Importance Visualization ===")
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)
bars = plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 15 Most Important Features')
plt.gca().invert_yaxis()
# Add value labels on bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width + 0.001, bar.get_y() + bar.get_height()/2,
             f'{width:.3f}', ha='left', va='center')
plt.tight_layout()
plt.show()
# 8. Model Performance Analysis
print("\n=== Model Performance Analysis ===")
# Create performance comparison DataFrame
performance_data = []
for name, model in models.items():
    if hasattr(model, 'predict_proba'):  # classification models only
        model.fit(X_train_selected, y_train_class)
        train_score = model.score(X_train_selected, y_train_class)
        test_score = model.score(X_test_selected, y_test_class)
        performance_data.append({
            'Model': name,
            'Training Score': train_score,
            'Test Score': test_score,
            'Overfitting': train_score - test_score
        })
performance_df = pd.DataFrame(performance_data)
print("\nModel Performance Comparison:")
print(performance_df)
# Visualize performance comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Training vs Test scores
x = np.arange(len(performance_df))
width = 0.35
ax1.bar(x - width/2, performance_df['Training Score'], width, label='Training Score', alpha=0.8)
ax1.bar(x + width/2, performance_df['Test Score'], width, label='Test Score', alpha=0.8)
ax1.set_xlabel('Models')
ax1.set_ylabel('Score')
ax1.set_title('Training vs Test Scores')
ax1.set_xticks(x)
ax1.set_xticklabels(performance_df['Model'], rotation=45)
ax1.legend()
ax1.grid(True, alpha=0.3)
# Overfitting analysis
ax2.bar(performance_df['Model'], performance_df['Overfitting'], color='red', alpha=0.7)
ax2.set_xlabel('Models')
ax2.set_ylabel('Overfitting (Train - Test)')
ax2.set_title('Model Overfitting Analysis')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nMachine Learning integration completed!")
print("Key takeaways:")
print("1. Proper data preprocessing improves model performance")
print("2. Feature engineering creates valuable insights")
print("3. Feature selection reduces noise and improves efficiency")
print("4. Cross-validation provides reliable performance estimates")
print("5. Model comparison helps choose the best approach")