Python Machine Learning Data Visualization

4. Transforming, Scaling, and Fitting the Data

2025-09-21

Methodology

Because our target variable, HP, is a continuous numerical value, we will be using a regression model to predict it. And because we have a large number of features, many of which do not correlate strongly with our target variable, we could use Lasso regression for its ability to perform feature selection and regularization. However, our correlation heatmap also shows that some features are highly correlated with each other, which can lead to multicollinearity issues. To address this, we will use elastic net, which provides a balance between Lasso and Ridge regression.
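
To make that trade-off concrete, here is a minimal sketch (illustration only, not part of the modeling pipeline below) of how the l1_ratio parameter blends the L1 and L2 penalties in scikit-learn's elastic net objective; the coefficient vector w is just a made-up example.

import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    # Penalty scikit-learn adds to the mean squared error:
    # alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)
    # l1_ratio=1.0 recovers Lasso, l1_ratio=0.0 recovers Ridge.
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

w = np.array([0.0, 1.5, -2.0, 0.3])          # hypothetical coefficients
print(elastic_net_penalty(w, l1_ratio=1.0))   # pure Lasso penalty
print(elastic_net_penalty(w, l1_ratio=0.5))   # blended elastic net penalty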

Feature Selection

Now we need to prepare our feature set. Based on our exploratory data analysis, I decided to drop total_resistance_multiplier since it had very little variation and didn’t correlate well with HP. We also drop the identifier and free-text columns (id, flavorText, reasoning, ability_name, attack_name), since they are not numeric or binary inputs for the model.

import pandas as pd

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer

df = pd.read_csv('data/first_gen_pokemon_final_features.csv')
df.drop(columns=['id', 'flavorText', 'reasoning', 'ability_name', 'attack_name'], inplace=True)

target = 'hp'
y = df[target]

cols_to_drop = [
  'hp', 'total_resistance_multiplier'
]
X = df.drop(columns=cols_to_drop)

Categorizing Features by Type

Different types of features need different preprocessing steps. Let’s organize our features into two categories: numeric features (which need scaling) and binary features (which can be used as-is).

numeric_features = [
    'level', 'convertedRetreatCost', 'number', 'primary_pokedex_number',
    'pokemon_count', 'total_weakness_multiplier', 'total_weakness_modifier',
    'total_resistance_modifier', 'pokedex_frequency', 'artist_frequency',
    'ability_count', 'attack_count', 'base_damage', 'attack_cost',
    'attack_damage', 'attack_utility', 'ability_damage', 'ability_utility',
    'text_depth', 'word_count', 'keyword_score'
]

binary_features = [
    col for col in X.columns if col not in numeric_features
]

print(f"Numeric features: {len(numeric_features)}")
print(f"Binary features: {len(binary_features)}")
Numeric features: 21
Binary features: 93

We can see that we have 21 numeric features and 93 binary features (mostly from the one-hot encoding in our feature engineering step).

Train-Test Split

Before we do any transformations, we need to split our data into training and testing sets. This ensures we don’t leak information from the test set into our model. I’m using an 80/20 split with a fixed random state for reproducibility.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

Building the Preprocessing Pipeline

Now comes the important part - setting up our preprocessing pipeline. We need to:

  1. Numeric features: Impute missing values with the median, then standardize (scale to mean=0, std=1)
  2. Binary features: Pass through unchanged since they’re already in the right format

We’ll use scikit-learn’s ColumnTransformer to apply different preprocessing steps to each feature type, and then combine everything with the elastic net model in a single pipeline.

Additionally, we will be using ElasticNetCV, which performs cross-validated elastic net regression to pick the regularization hyperparameters. We supply a list of candidate l1_ratio values to search over, and ElasticNetCV selects the best alpha along an automatically generated path for each of them.

numeric_pipeline = Pipeline(steps=[
  ('impute', SimpleImputer(strategy='median')), 
  ('scale', StandardScaler()),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_pipeline, numeric_features),
        ('pass', 'passthrough', binary_features),
    ],
    remainder='drop'
)

# Candidate mixes between the L1 (Lasso) and L2 (Ridge) penalties
l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0]

model_pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('model', ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=123))
])

model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)

Results

After training our elastic net regression model, we can evaluate how many features were selected (non-zero coefficients) and the model’s performance on the test set. We see that out of the 114 features we started with, we dropped 48 features, leaving us with 66 features that the model found useful for predicting HP. This could mean a few things: some features may be redundant, some may not have a strong relationship with HP, or the model may be prioritizing simpler explanations.

final_model = model_pipeline.named_steps['model']

# Feature names in the same order the ColumnTransformer outputs them
feature_names = numeric_features + binary_features

coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': final_model.coef_
})

dropped_features = coef_df[coef_df['Coefficient'] == 0]
print(f"Total features: {len(feature_names)}")
print(f"Features dropped (Noise): {len(dropped_features)}")
Total features: 114
Features dropped (Noise): 48
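
If you want to dig a little deeper, a quick sketch like the following (not part of the original run) prints the hyperparameters ElasticNetCV settled on and the ten surviving coefficients with the largest magnitude, using the alpha_ and l1_ratio_ attributes that scikit-learn exposes after fitting.

# Hyperparameters chosen by cross-validation
print(f"Selected alpha: {final_model.alpha_:.4f}")
print(f"Selected l1_ratio: {final_model.l1_ratio_}")

# Ten non-zero coefficients with the largest absolute value
nonzero = coef_df[coef_df['Coefficient'] != 0].copy()
nonzero['abs_coef'] = nonzero['Coefficient'].abs()
print(nonzero.sort_values('abs_coef', ascending=False).head(10))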

Evaluating the model’s performance on the test set, we achieved an R² score of 0.936 and an RMSE of 15.50. An R² of 0.936 indicates that the model explains about 94% of the variance in HP, which is quite good. The RMSE of 15.50 means a typical prediction error is around 15.5 HP, with larger misses weighted more heavily; the MAE of 11.51 shows the average absolute error is about 11.5 HP.

rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f} hp")

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f} hp")

r2 = r2_score(y_test, y_pred)
print(f"R-Squared: {r2:.3f}")
RMSE: 15.50 hp
MAE: 11.51 hp
R-Squared: 0.936

Finally, we can plot the predicted vs actual HP values to visually assess the model’s performance. Ideally, we want the points to fall along the diagonal line, indicating perfect predictions. We can see that most points are close to the line, but there are some deviations, especially for mid-range and high HP values. This suggests that while the model performs well overall, there may be room for improvement in predicting certain HP ranges.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 8))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("Actual HP (y_test)")
plt.ylabel("Predicted HP (y_pred)")
plt.title("Actual vs. Predicted HP")

# Reference line where predicted HP equals actual HP
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.show()

Discussion and Conclusion

Overall, our elastic net regression model performed well in predicting the HP of Pokemon cards from the engineered features. The high R² and low RMSE indicate that the model captured the underlying patterns in the data, and the built-in feature selection reduced model complexity by eliminating less important features. However, there is still room for improvement: the deviations in the predicted vs. actual plot suggest the model struggles with certain HP ranges, particularly mid-range and high values. Future work could involve exploring more advanced modeling techniques, as sketched below.
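
As one possible starting point for that future work, here is a minimal sketch (my suggestion, not something evaluated in this project) of swapping the elastic net for a gradient-boosted tree model while reusing the exact same preprocessing pipeline:

from sklearn.ensemble import HistGradientBoostingRegressor

# Reuse the existing preprocessor; only the final estimator changes
boosted_pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('model', HistGradientBoostingRegressor(random_state=123))
])

boosted_pipeline.fit(X_train, y_train)
y_pred_boosted = boosted_pipeline.predict(X_test)
print(f"Boosted R-Squared: {r2_score(y_test, y_pred_boosted):.3f}")

Tree ensembles can capture non-linear relationships and feature interactions that a linear model with an L1/L2 penalty cannot, which may help with the mid-range and high HP cards where the current model drifts.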