
Why Your Data Science Code Needs Comments (And How to Write Good Ones)
Let’s face it: data science and machine learning code can be some of the most confusing to revisit months later. Those elegant transformations and model architectures that made perfect sense when you wrote them? They’ve morphed into cryptic puzzles that make you question your past self’s sanity.
I’ve been there—staring at a notebook full of seemingly random transformations, mysterious hyperparameters, and a model architecture that now looks like it was designed by a caffeinated octopus. The solution isn’t just cleaner code; it’s thoughtful comments that preserve your reasoning and make your future self (and teammates) much happier.
The Myth of “Self-Documenting” Data Science Code
I often hear data scientists claim their code is “self-documenting.” While that might work for basic data manipulations, it falls apart when dealing with:
- Feature engineering decisions
- Model selection rationale
- Hyperparameter choices
- Business constraints that guided the solution
- Dataset quirks and preprocessing decisions
- Experiment contexts and outcomes
Sure, your variable names might explain what you’re doing, but they rarely explain why you chose that approach or what trade-offs you considered.
Data Science Comments That Actually Help
Bad Comment:
# Calculate the mean
mean_val = df['salary'].mean()
Good Comment:
# Using mean rather than median for salary normalization to match finance team's reporting standards
# See discussion in issue DS-342 for business context
mean_val = df['salary'].mean()
The difference? The good comment explains the reasoning behind the choice and connects it to broader context.
When You Absolutely Must Comment in ML Code
1. Data Transformations and Feature Engineering
Comments here save countless hours of reverse engineering:
import numpy as np
from scipy.stats.mstats import winsorize
# Log-transform price to handle right-skewed distribution
# Reduces RMSE by 14% in cross-validation
df['log_price'] = np.log1p(df['price'])
# Winsorizing outliers at the 1st and 99th percentiles
# Prevents GBM model from overfitting to extreme values
df['winsorized_returns'] = winsorize(df['returns'], limits=[0.01, 0.01])
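A comment that claims a transformation reduces skew is easier to trust when the claim can be re-checked. The following is a minimal sketch of such a check, assuming synthetic log-normal prices generated purely for illustration; the column names mirror the snippet above, but the data is not from any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (illustrative only, not real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=3.0, sigma=1.0, size=1_000)})

# log1p computes log(1 + x), so a price of zero stays finite
df["log_price"] = np.log1p(df["price"])

# Sample skewness before and after the transform
skew_before = df["price"].skew()
skew_after = df["log_price"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Recording the measured values next to the transformation, as the comments above do, lets a reader rerun the check when the data changes.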
2. Model Architecture Decisions
Explain non-obvious architecture choices:
# Using 3 LSTM layers (tried 1-5 in experiments)
# More layers didn't improve validation accuracy but increased training time by 3x
# Dropout between layers prevents overfitting on our limited dataset
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(seq_length, features)),
    Dropout(0.2),  # 0.2 dropout provided optimal regularization in grid search
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(32),
    Dense(1)
])
3. Hyperparameter Selections
Document the reasoning behind hyperparameter choices:
# Learning rate of 0.001 works best for our dataset
# - 0.01 caused oscillation around minima
# - 0.0001 converged too slowly (>100 epochs to match 0.001's 20-epoch performance)
# See experiment logs: https://wandb.ai/team/project/runs/exp42
optimizer = Adam(learning_rate=0.001)
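The oscillation and slow-convergence behavior described in those comments can be reproduced on a toy problem. This sketch runs plain gradient descent on f(x) = x**2, which is an assumption made for illustration and not the model or optimizer from the example above; the learning rates are chosen for this toy objective and are not comparable to the values in the comment.

```python
def gradient_descent(learning_rate, steps, start=1.0):
    """Minimize f(x) = x**2 (gradient 2x) and return every iterate."""
    xs = [start]
    for _ in range(steps):
        xs.append(xs[-1] - learning_rate * 2 * xs[-1])
    return xs


too_high = gradient_descent(0.9, steps=20)    # overshoots the minimum each step
moderate = gradient_descent(0.3, steps=20)    # shrinks quickly toward zero
too_low = gradient_descent(0.001, steps=20)   # barely moves in 20 steps

# Count how often consecutive iterates land on opposite sides of the minimum
sign_flips = sum(a * b < 0 for a, b in zip(too_high, too_high[1:]))

print(f"high rate: {sign_flips} sign flips, final |x| = {abs(too_high[-1]):.4f}")
print(f"moderate rate: final |x| = {abs(moderate[-1]):.2e}")
print(f"low rate: final |x| = {abs(too_low[-1]):.4f}")
```

Numbers like these are what the comment should summarize, ideally with a link to the full run as in the example above.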
4. Dataset-Specific Handling
Explain unique data characteristics:
# Filtering transactions below $5 as they're primarily test transactions
# This aligns with business logic in the payment processing system
# See data documentation: https://wiki.company.com/data/transactions
df = df[df['amount'] >= 5]
# Handling missing values with median instead of mean
# Our time series data contains outliers that skew means significantly
# Assign back rather than using inplace=True on a column selection
df['duration'] = df['duration'].fillna(df['duration'].median())
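The reasoning behind the median choice is easy to demonstrate. Here is a small sketch with made-up durations (one extreme outlier, one missing value) showing how far the mean is pulled compared with the median; the values are illustrative assumptions only.

```python
import pandas as pd

# Toy durations with one extreme outlier and one missing value (illustrative only)
durations = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, None])

mean_fill = durations.mean()      # pulled upward by the 500.0 outlier
median_fill = durations.median()  # unaffected by the single outlier

# Assign the result instead of mutating in place
filled = durations.fillna(median_fill)

print(f"mean: {mean_fill}, median: {median_fill}")
print(filled.tolist())
```

Filling with the mean here would insert a value roughly nine times larger than any typical observation, which is the kind of fact worth stating in the comment.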
The ML Code Comment Hierarchy
Follow this structure to maintain consistent commenting throughout your ML codebase:
Level 1: Notebook/Script Purpose
Start with a high-level overview:
"""
Customer Churn Prediction Model
================================
This notebook develops a gradient boosting model to predict customer churn
based on transaction history and customer profile data.
Key outcomes:
- Feature importance analysis for business insights
- Model achieves 0.82 ROC-AUC on validation set
- Deployed as API endpoint for real-time scoring
Data sources:
- customer_profiles.csv: Monthly snapshot from CRM system
- transactions.parquet: Daily transaction logs from payment system
Author: Brave
Last updated: 2025-04-12
"""
Level 2: Section-Level Documentation
Break down analysis steps with clear section comments:
# ===== 1. DATA PREPROCESSING =====
# This section handles:
# - Merging customer and transaction data
# - Cleaning missing values and outliers
# - Feature engineering based on transaction patterns
# - Train/validation/test splitting with time-based validation
# Merge strategy preserves all customers even if they have no transactions
merged_df = pd.merge(customers, transactions, on='customer_id', how='left')
# ...more code...
# ===== 2. EXPLORATORY DATA ANALYSIS =====
# Key insights from EDA:
# - Churned customers have 45% fewer transactions in the last month
# - Average transaction value is not significantly different between churned/retained
# - Geographic distribution shows higher churn in urban areas
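The claim in section 1 that the left merge keeps customers with no transactions can be verified rather than only asserted. A minimal sketch, assuming two tiny hand-made frames that stand in for the real `customers` and `transactions` tables:

```python
import pandas as pd

# Toy stand-ins for the real tables (illustrative only)
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 12.5]}
)

# indicator=True adds a _merge column recording where each row came from
merged_df = pd.merge(
    customers, transactions, on="customer_id", how="left", indicator=True
)

# Customers without transactions appear once, with NaN amount and "left_only"
no_transactions = merged_df.loc[
    merged_df["_merge"] == "left_only", "customer_id"
].tolist()

print(merged_df)
print("customers without transactions:", no_transactions)
```

A one-line note such as "N customers have no transactions after the merge" in the section comment then documents a checked fact instead of an intention.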
Level 3: Critical Code Explanations
Explain the critical methodology choices:
# Time-ordered split: each fold trains on earlier rows and validates on later ones
# Random splits would leak future information into the training set
# TimeSeriesSplit does not stratify, so check class balance per fold separately
# Assumes X and y are pandas objects already sorted by time
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
    # SMOTE applied only to training data after splitting to prevent data leakage
    # Synthetic samples generated for minority class to address 9:1 class imbalance
    X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)
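The leakage argument rests on every training row coming before every test row. The sketch below checks that property for scikit-learn's `TimeSeriesSplit` on a small placeholder array; the array contents are an assumption for illustration, and SMOTE is left out so the example needs only scikit-learn and NumPy.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 rows assumed to be in chronological order (placeholder values)
X = np.arange(24).reshape(12, 2)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # No training row may come after any test row
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train rows 0..{train_idx.max()}, test rows {test_idx.tolist()}")

train_sizes = [len(train_idx) for train_idx, _ in folds]
```

The training window grows with each fold while the test window always lies strictly later, which is the behavior the comment should describe.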
ML-Specific Commenting Pitfalls
1. Undocumented Magic Numbers
Avoid unexplained constants:
from xgboost import XGBClassifier
# BAD:
model = XGBClassifier(max_depth=6, n_estimators=250, learning_rate=0.08)
# GOOD:
# Max depth of 6 prevents overfitting on our noisy financial data
# 250 estimators balances performance and training time
# Learning rate of 0.08 determined through grid search (see grid_search_results.csv)
model = XGBClassifier(max_depth=6, n_estimators=250, learning_rate=0.08)
2. Missing Experiment Context
Document what you’ve already tried:
# BAD:
final_model = RandomForestClassifier()
# GOOD:
# Random Forest selected after comparing performance of:
# - Logistic Regression (baseline): 0.72 AUC
# - Random Forest: 0.81 AUC
# - XGBoost: 0.83 AUC but 3x longer inference time
# - Neural Network: 0.79 AUC with more feature engineering requirements
# RF chosen for balance of performance, inference time, and interpretability
final_model = RandomForestClassifier()
3. Unexplained Preprocessing Steps
Clarify the purpose of transformations:
# BAD:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# GOOD:
# Standardization is critical for distance-based algorithms and regularized models
# MinMaxScaler was tested but StandardScaler improved convergence speed by ~15%
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
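One detail worth documenting alongside the scaler choice is where it is fitted. A minimal sketch, assuming tiny hand-made arrays, of fitting `StandardScaler` on the training data only and reusing those statistics for the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (illustrative only)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics; refitting on test data would leak information
X_test_scaled = scaler.transform(X_test)

print("training mean used for scaling:", scaler.mean_[0])
print("scaled test value:", X_test_scaled[0, 0])
```

A short comment stating that the scaler was fitted on training data only saves the next reader from auditing the pipeline for leakage.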
ML Experiment Tracking and Reproducibility
Great comments also facilitate reproducibility:
# Experiment: Churn-XGB-2025-04-10
# Seeds fixed for reproducibility across all randomized components
# Results logged to MLflow: experiment_id=42
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
# Model configuration
# Hyperparameters tuned via Bayesian optimization (see tuning_report.pdf)
config = {
    'learning_rate': 0.01,    # Lower learning rate for stable convergence
    'max_depth': 5,           # Limited depth to avoid overfitting
    'subsample': 0.8,         # Reduces variance in tree construction
    'colsample_bytree': 0.7,  # Introduces feature randomness
    'n_estimators': 200       # Sufficient for convergence without overfitting
}
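The seeding lines above can be wrapped in a helper so every script seeds its generators the same way. This is a sketch under assumptions: `set_global_seed` and `draw_sample` are hypothetical names introduced here, and only the standard library and NumPy generators are covered (TensorFlow seeding from the example above is omitted to keep the block dependency-light).

```python
import random

import numpy as np


def set_global_seed(seed):
    """Seed the Python and NumPy global random generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


def draw_sample():
    """Draw one value from each seeded generator."""
    return random.random(), float(np.random.rand())


RANDOM_SEED = 42

set_global_seed(RANDOM_SEED)
first = draw_sample()

set_global_seed(RANDOM_SEED)
second = draw_sample()

print("draws match after reseeding:", first == second)
```

Any library with its own generator needs its own seeding call, so the comment next to the helper should list which ones are covered.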
Bringing It All Together: A Real-World Example
Before (Unclear Data Science Code):
# Load data
df = pd.read_csv('data.csv')
# Process
df['col1'] = df['col1'].fillna(0)
df['col2'] = df['col2'] / df['col3']
df['col4'] = np.log(df['col4'] + 1)
# Model
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=100)
model.fit(X_train, y_train)
After (Well-Commented Data Science Code):
# Load customer transaction data (daily exports from payment system)
# Dataset contains 2.3M transactions from 150K customers over 6 months
df = pd.read_csv('transaction_data.csv')
# Data preprocessing based on business rules and feature importance analysis
# Replace missing values in transaction amount with 0 (missing = no transaction)
df['transaction_amount'] = df['transaction_amount'].fillna(0)
# Create transaction frequency ratio normalized by customer tenure
# Strong predictor of churn according to previous analysis (correlation=0.72)
df['transaction_frequency_ratio'] = df['transaction_count'] / df['tenure_days']
# Log-transform customer revenue to handle high skew (skewness reduced from 8.2 to 0.5)
# Improves model performance across all algorithms tested by ~4% AUC
df['log_revenue'] = np.log(df['total_revenue'] + 1) # +1 to handle zero values
# Stratified random split: 80% for training, 20% for testing, preserving the churn ratio
# Caveat: this is not a time-based split; use a date cutoff if the model must predict future from past
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
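import lightgbm as lgb  # provides the early-stopping and logging callbacks
# LightGBM selected after comparison with Random Forest and XGBoost
# - Faster training time (3min vs 15min for RF with similar performance)
# - Better handling of categorical features without extensive encoding
# Hyperparameters tuned via 5-fold cross-validation grid search
model = LGBMClassifier(
    num_leaves=31,          # Kept at the LightGBM default to limit overfitting
    learning_rate=0.05,     # Conservative learning rate for stable convergence
    n_estimators=100,       # Upper bound; early stopping can halt training sooner
    importance_type='gain'  # Using gain for more reliable feature importance
)
# Early stopping and logging are passed as callbacks (the fit() keywords were removed in LightGBM 4.0)
# Caution: stopping on the test set makes the reported test score optimistic; a separate validation set is safer
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=10)]
)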
# Results tracked in experiment log: exp_tracking.com/project/churn-pred/run-123
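When the model must predict future churn from past data, a date-ordered holdout is an alternative to a random split. The sketch below assumes a hypothetical `snapshot_date` column and a ten-row toy frame; neither comes from the example above.

```python
import pandas as pd

# Toy frame with a hypothetical date column (illustrative only)
df = pd.DataFrame(
    {
        "snapshot_date": pd.date_range("2025-01-01", periods=10, freq="D"),
        "churned": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    }
)

# Sort chronologically, then hold out the most recent 20% of rows
df = df.sort_values("snapshot_date").reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cutoff], df.iloc[cutoff:]

print("train ends:", train_df["snapshot_date"].max().date())
print("test starts:", test_df["snapshot_date"].min().date())
```

Unlike a stratified random split, this does not guarantee the same churn ratio in both parts, so the comment should record the ratio observed in each.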
Tools That Help Data Scientists Document Their Work
Beyond comments, these tools help preserve your thought process:
1. Experiment Tracking
- MLflow for tracking experiments, parameters, and metrics
- Weights & Biases for visualizing experiment results
- DVC for versioning datasets alongside code
2. Computational Notebooks
- Jupyter Notebooks with properly structured markdown between code cells
- Google Colab with collaborative comments
- Deepnote with team documentation capabilities
3. Documentation Generators
- Sphinx for building comprehensive documentation
- pdoc for auto-generating Python module documentation
- Quarto for creating reproducible, publication-quality reports
The Golden Rule of ML Code Comments
Write for the data scientist who has to reproduce your results or update your model six months from now.
That might be you, a teammate, or someone who joined the team after you left. Without proper documentation, they’ll waste days trying to understand:
- Why you chose that feature engineering approach
- What data quirks influenced your decisions
- What experiments failed and why
- How the business requirements shaped your methodology
Conclusion
Data science and machine learning code demands even more thorough commenting than standard software development. The experimental nature, complex transformations, and statistical reasoning all need clear documentation to be truly reproducible and maintainable.
Remember: A machine learning solution is only as good as its ability to be understood, maintained, and improved over time. Your brilliant algorithm might solve today’s problem, but clear documentation ensures it can keep solving tomorrow’s problems too.
The next time you’re tempted to skip comments because “the code is obvious,” imagine explaining your work to a new team member who needs to build on your foundation. The extra minutes spent documenting your reasoning will save hours or days of confusion later.
Now go forth and comment your data science code—your future self (and team) will thank you.