May 14, 2025

Why Your Data Science Code Needs Comments (And How to Write Good Ones)

Let’s face it: data science and machine learning code can be some of the most confusing to revisit months later. Those elegant transformations and model architectures that made perfect sense when you wrote them? They’ve morphed into cryptic puzzles that make you question your past self’s sanity.

I’ve been there—staring at a notebook full of seemingly random transformations, mysterious hyperparameters, and a model architecture that now looks like it was designed by a caffeinated octopus. The solution isn’t just cleaner code; it’s thoughtful comments that preserve your reasoning and make your future self (and teammates) much happier.

The Myth of “Self-Documenting” Data Science Code

I often hear data scientists claim their code is “self-documenting.” While that might work for basic data manipulations, it falls apart when dealing with:

Feature engineering decisions
Model selection rationale
Hyperparameter choices
Business constraints that guided the solution
Dataset quirks and preprocessing decisions
Experiment contexts and outcomes

Sure, your variable names might explain what you’re doing, but they rarely explain why you chose that approach or what trade-offs you considered.

Data Science Comments That Actually Help

Bad Comment:

# Calculate the mean
mean_val = df['salary'].mean()

Good Comment:

# Using mean rather than median for salary normalization to match finance team's reporting standards
# See discussion in issue DS-342 for business context
mean_val = df['salary'].mean()

The difference? The good comment explains the reasoning behind the choice and connects it to broader context.

When You Absolutely Must Comment in ML Code

1. Data Transformations and Feature Engineering

Comments here save countless hours of reverse engineering:

# Log-transform price to handle right-skewed distribution
# Reduces RMSE by 14% in cross-validation
df['log_price'] = np.log1p(df['price'])

# Winsorizing outliers at the 1st and 99th percentiles
# Prevents GBM model from overfitting to extreme values
df['winsorized_returns'] = winsorize(df['returns'], limits=[0.01, 0.01])
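
To make the two transformations above concrete, here is a minimal pure-Python sketch on invented numbers. The `winsorize_simple` helper is hypothetical and only approximates what a library winsorize function does; treat it as an illustration of the idea, not a drop-in replacement.

```python
import math

def winsorize_simple(values, lower=0.01, upper=0.01):
    """Clamp the lowest and highest fractions of values to the nearest kept value.

    Hypothetical helper for illustration; not a library implementation.
    """
    n = len(values)
    ordered = sorted(values)
    k_low = round(lower * n)    # how many values to clamp at the bottom
    k_high = round(upper * n)   # how many values to clamp at the top
    low_bound = ordered[k_low]
    high_bound = ordered[n - k_high - 1]
    return [min(max(v, low_bound), high_bound) for v in values]

# Invented daily returns: 98 ordinary values plus one extreme on each side
returns = [0.01 * (i % 5 - 2) for i in range(98)] + [-0.9, 1.5]
clipped = winsorize_simple(returns, lower=0.01, upper=0.01)

# log1p(x) computes log(1 + x), so it is defined at x = 0 where log(x) is not
prices = [0.0, 9.0, 99.0, 9999.0]
log_prices = [math.log1p(p) for p in prices]
```

After clamping, the two extreme returns take the value of their nearest ordinary neighbor, which is the behavior the winsorizing comment above is describing.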

2. Model Architecture Decisions

Explain non-obvious architecture choices:

# Using 3 LSTM layers (tried 1-5 in experiments)
# More layers didn't improve validation accuracy but increased training time by 3x
# Dropout between layers prevents overfitting on our limited dataset
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(seq_length, features)),
    Dropout(0.2),  # 0.2 dropout provided optimal regularization in grid search
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1)
])

3. Hyperparameter Selections

Document the reasoning behind hyperparameter choices:

# Learning rate of 0.001 works best for our dataset
# - 0.01 caused oscillation around minima
# - 0.0001 converged too slowly (>100 epochs to match 0.001's 20-epoch performance)
# See experiment logs: https://wandb.ai/team/project/runs/exp42
optimizer = Adam(learning_rate=0.001)
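
One way to keep a hyperparameter and its reasoning from drifting apart is to store them together. The sketch below is one possible pattern, assuming a small hypothetical `Hyperparameter` dataclass; the rationale text is taken from the comment above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameter:
    """A tuned value stored next to the reason it was chosen (hypothetical helper)."""
    name: str
    value: float
    rationale: str

LEARNING_RATE = Hyperparameter(
    name="learning_rate",
    value=0.001,
    rationale="0.01 oscillated around minima; 0.0001 converged too slowly",
)

def describe(param: Hyperparameter) -> str:
    """Render one line suitable for an experiment log."""
    return f"{param.name}={param.value} ({param.rationale})"

summary = describe(LEARNING_RATE)
```

Because the dataclass is frozen, the value cannot be changed silently elsewhere in the notebook without also touching the line that holds the rationale.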

4. Dataset-Specific Handling

Explain unique data characteristics:

# Filtering transactions below $5 as they're primarily test transactions
# This aligns with business logic in the payment processing system
# See data documentation: https://wiki.company.com/data/transactions
df = df[df['amount'] >= 5]

# Handling missing values with median instead of mean
# Our time series data contains outliers that skew means significantly
df['duration'] = df['duration'].fillna(df['duration'].median())
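
The reasoning behind choosing the median is easy to check on a toy example. This sketch uses only the standard library and invented durations, so the numbers are illustrative rather than taken from any real dataset.

```python
import statistics

# Invented durations in seconds; the last value is an obvious outlier
durations = [10, 12, 11, 13, 500]

mean_duration = statistics.mean(durations)      # pulled upward by the outlier
median_duration = statistics.median(durations)  # unaffected by the outlier

# Fill gaps (None) with the median, leaving observed values untouched
observed = [10, None, 11, 13, None, 500, 12]
known = [d for d in observed if d is not None]
fill_value = statistics.median(known)
filled = [fill_value if d is None else d for d in observed]
```

Here the mean lands far above every typical value while the median stays among them, which is exactly the situation the comment is warning about.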

The ML Code Comment Hierarchy

Follow this structure to maintain consistent commenting throughout your ML codebase:

Level 1: Notebook/Script Purpose

Start with a high-level overview:

"""
Customer Churn Prediction Model
================================

This notebook develops a gradient boosting model to predict customer churn
based on transaction history and customer profile data.

Key outcomes:
- Feature importance analysis for business insights
- Model achieves 0.82 ROC-AUC on validation set
- Deployed as API endpoint for real-time scoring

Data sources:
- customer_profiles.csv: Monthly snapshot from CRM system
- transactions.parquet: Daily transaction logs from payment system

Author: Brave
Last updated: 2025-04-12
"""

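If your team agrees on a header like the one above, a small check can flag scripts that are missing parts of it. The sketch below assumes a hypothetical convention of three required section labels; adjust `REQUIRED_SECTIONS` to whatever your team actually uses.

```python
import ast

# Assumed team convention, not a standard
REQUIRED_SECTIONS = ("Key outcomes:", "Data sources:", "Last updated:")

def missing_header_sections(source: str) -> list[str]:
    """Return the required sections absent from a script's module docstring."""
    docstring = ast.get_docstring(ast.parse(source)) or ""
    return [section for section in REQUIRED_SECTIONS if section not in docstring]

example_script = '''"""
Customer Churn Prediction Model

Key outcomes:
- Feature importance analysis for business insights

Data sources:
- customer_profiles.csv: Monthly snapshot from CRM system
"""
import math
'''

missing = missing_header_sections(example_script)
```

In this example only the "Last updated:" label is reported as missing, since the other two sections appear in the docstring.
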
Level 2: Section-Level Documentation

Break down analysis steps with clear section comments:

# ===== 1. DATA PREPROCESSING =====
# This section handles:
# - Merging customer and transaction data
# - Cleaning missing values and outliers
# - Feature engineering based on transaction patterns
# - Train/validation/test splitting with time-based validation

# Merge strategy preserves all customers even if they have no transactions
merged_df = pd.merge(customers, transactions, on='customer_id', how='left')

# ...more code...

# ===== 2. EXPLORATORY DATA ANALYSIS =====
# Key insights from EDA:
# - Churned customers have 45% fewer transactions in the last month
# - Average transaction value is not significantly different between churned/retained
# - Geographic distribution shows higher churn in urban areas

Level 3: Critical Code Explanations

Explain the critical methodology choices:

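# Time-series split keeps every training fold earlier than its test fold, preventing data leakage
# A random split would leak future information into the training set
# Note: TimeSeriesSplit does not stratify, so check the class balance within each fold
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Assumes X and y are pandas objects sorted by time
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]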

# SMOTE applied only to training data after splitting to prevent data leakage
# Synthetic samples generated for minority class to address 9:1 class imbalance
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)
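
To see why a time-ordered split prevents leakage, it helps to look at the indices it produces. The function below is a simplified, hand-rolled illustration of an expanding-window split; it assumes rows are already sorted by time and is not scikit-learn's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training always precedes testing.

    Simplified illustration; assumes rows are sorted by time.
    """
    fold_size = n_samples // (n_splits + 1)
    for fold in range(1, n_splits + 1):
        train_end = fold * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in splits:
    # Every training index is earlier than every test index,
    # so the model never sees rows from the future
    assert max(train_idx) < min(test_idx)
```

Any resampling step such as SMOTE would then be applied to the training indices of each fold only, never to the test indices.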

ML-Specific Commenting Pitfalls

1. Undocumented Magic Numbers

Avoid unexplained constants:

# BAD:
model = XGBClassifier(max_depth=6, n_estimators=250, learning_rate=0.08)

# GOOD:
# Max depth of 6 prevents overfitting on our noisy financial data
# 250 estimators balance performance and training time
# Learning rate of 0.08 determined through grid search (see grid_search_results.csv)
model = XGBClassifier(max_depth=6, n_estimators=250, learning_rate=0.08)

2. Missing Experiment Context

Document what you’ve already tried:

# BAD:
final_model = RandomForestClassifier()

# GOOD:
# Random Forest selected after comparing performance of:
# - Logistic Regression (baseline): 0.72 AUC
# - Random Forest: 0.81 AUC
# - XGBoost: 0.83 AUC but 3x longer inference time
# - Neural Network: 0.79 AUC with more feature engineering requirements
# RF chosen for balance of performance, inference time, and interpretability
final_model = RandomForestClassifier()
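
If the comparison numbers already live in your experiment records, you can generate the comment from them instead of typing it by hand. This is a small sketch with a hypothetical `comparison_comment` helper; the AUC values are the ones quoted in the example above.

```python
# AUC values copied from the comment in the example above
results = [
    ("Logistic Regression (baseline)", 0.72),
    ("Random Forest", 0.81),
    ("XGBoost", 0.83),
    ("Neural Network", 0.79),
]

def comparison_comment(rows, chosen, reason):
    """Format experiment results as a comment block (hypothetical helper)."""
    lines = [f"# {chosen} selected after comparing performance of:"]
    lines += [f"# - {name}: {auc:.2f} AUC" for name, auc in rows]
    lines.append(f"# {chosen} chosen for {reason}")
    return "\n".join(lines)

comment = comparison_comment(
    results,
    chosen="Random Forest",
    reason="balance of performance, inference time, and interpretability",
)
```

Qualitative trade-offs such as inference time or feature engineering effort still need a human-written note, since they are not captured by a single metric.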

3. Unexplained Preprocessing Steps

Clarify the purpose of transformations:

# BAD:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# GOOD:
# Standardization is critical for distance-based algorithms and regularized models
# MinMaxScaler was tested but StandardScaler improved convergence speed by ~15%
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
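
A detail worth documenting alongside the scaler choice is that its statistics come from the training data only and are then reused on the test data. The sketch below shows that idea with the standard library and invented values; the helper names are hypothetical.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics; never refit on validation or test data."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0, 8.0]   # invented feature values
test = [5.0, 10.0]

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuses the training statistics
```

The scaled test values are not centered on zero themselves, and that is expected: they are expressed relative to the training distribution.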

ML Experiment Tracking and Reproducibility

Great comments also facilitate reproducibility:

# Experiment: Churn-XGB-2025-04-10
# Seeds fixed for reproducibility across all randomized components
# Results logged to MLflow: experiment_id=42
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Model configuration
# Hyperparameters tuned via Bayesian optimization (see tuning_report.pdf)
config = {
    'learning_rate': 0.01,    # Lower learning rate for stable convergence
    'max_depth': 5,           # Limited depth to avoid overfitting
    'subsample': 0.8,         # Reduces variance in tree construction
    'colsample_bytree': 0.7,  # Introduces feature randomness
    'n_estimators': 200       # Sufficient for convergence without overfitting
}
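
Two small habits support the comments above: checking that a fixed seed really reproduces the same draws, and fingerprinting the config so a logged result can be matched to its exact settings. This is a minimal standard-library sketch; `sample_with_seed` and `config_fingerprint` are hypothetical helpers, not part of any tracking tool.

```python
import hashlib
import json
import random

RANDOM_SEED = 42

def sample_with_seed(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)   # local generator avoids touching global state
    return [rng.random() for _ in range(n)]

def config_fingerprint(config):
    """Hash a config dict so a logged run can be matched to its exact settings."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

first_run = sample_with_seed(RANDOM_SEED)
second_run = sample_with_seed(RANDOM_SEED)   # identical to first_run

config = {"learning_rate": 0.01, "max_depth": 5, "subsample": 0.8}
reordered = {"subsample": 0.8, "max_depth": 5, "learning_rate": 0.01}
```

Sorting the keys before hashing means `config` and `reordered` produce the same fingerprint, so key order in a notebook cell does not create a false mismatch.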

Bringing It All Together: A Real-World Example

Before (Unclear Data Science Code):

# Load data
df = pd.read_csv('data.csv')

# Process 
df['col1'] = df['col1'].fillna(0)
df['col2'] = df['col2'] / df['col3']
df['col4'] = np.log(df['col4'] + 1)

# Model
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=100)
model.fit(X_train, y_train)

After (Well-Commented Data Science Code):

# Load customer transaction data (daily exports from payment system)
# Dataset contains 2.3M transactions from 150K customers over 6 months
df = pd.read_csv('transaction_data.csv')

# Data preprocessing based on business rules and feature importance analysis
# Replace missing values in transaction amount with 0 (missing = no transaction)
df['transaction_amount'] = df['transaction_amount'].fillna(0)

# Create transaction frequency ratio normalized by customer tenure
# Strong predictor of churn according to previous analysis (correlation=0.72)
df['transaction_frequency_ratio'] = df['transaction_count'] / df['tenure_days']

# Log-transform customer revenue to handle high skew (skewness reduced from 8.2 to 0.5)
# Improves model performance across all algorithms tested by ~4% AUC
df['log_revenue'] = np.log(df['total_revenue'] + 1)  # +1 to handle zero values

# Hold-out split: 80% for training, 20% for testing, stratified to maintain churn ratio
# Note: this is a random split, not a time-based one; random_state is fixed for reproducibility
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# LightGBM selected after comparison with Random Forest and XGBoost
# - Faster training time (3min vs 15min for RF with similar performance)
# - Better handling of categorical features without extensive encoding
# Hyperparameters tuned via 5-fold cross-validation grid search
model = LGBMClassifier(
    num_leaves=31,           # Kept at the LightGBM default (31) to limit tree complexity
    learning_rate=0.05,      # Conservative learning rate for stable convergence
    n_estimators=100,        # Upper bound on trees; early stopping may halt sooner
    importance_type='gain'   # Using gain for more reliable feature importance
)
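# LightGBM >= 4.0 configures early stopping and logging through callbacks
# (requires `import lightgbm as lgb`)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=10)]
)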

# Results tracked in experiment log: exp_tracking.com/project/churn-pred/run-123

Tools That Help Data Scientists Document Their Work

Beyond comments, these tools help preserve your thought process:

1. Experiment Tracking

MLflow for tracking experiments, parameters, and metrics
Weights & Biases for visualizing experiment results
DVC for versioning datasets alongside code

2. Computational Notebooks

Jupyter Notebooks with properly structured markdown between code cells
Google Colab with collaborative comments
Deepnote with team documentation capabilities

3. Documentation Generators

Sphinx for building comprehensive documentation
pdoc for auto-generating Python module documentation
Quarto for creating reproducible, publication-quality reports

The Golden Rule of ML Code Comments

Write for the data scientist who has to reproduce your results or update your model six months from now.

That might be you, a teammate, or someone who joined the team after you left. Without proper documentation, they’ll waste days trying to understand:

Why you chose that feature engineering approach
What data quirks influenced your decisions
What experiments failed and why
How the business requirements shaped your methodology

Conclusion

Data science and machine learning code demands even more thorough commenting than standard software development. The experimental nature, complex transformations, and statistical reasoning all need clear documentation to be truly reproducible and maintainable.

Remember: A machine learning solution is only as good as its ability to be understood, maintained, and improved over time. Your brilliant algorithm might solve today’s problem, but clear documentation ensures it can keep solving tomorrow’s problems too.

The next time you’re tempted to skip comments because “the code is obvious,” imagine explaining your work to a new team member who needs to build on your foundation. The extra minutes spent documenting your reasoning will save hours or days of confusion later.

Now go forth and comment your data science code—your future self (and team) will thank you.