Python Machine Learning Data Visualization

2. Feature Engineering

2025-09-21

Brief Look at the Dataset

First we need to load our dataframe from the CSV file we created in part 1. Then, let’s take a look at all the columns in the dataset.

import pandas as pd
import ast

df = pd.read_csv('data/first_gen_pokemon_cards.csv')

columns_to_parse = ['weaknesses', 'resistances', 'subtypes', 'types', 'abilities', 'attacks', 'nationalPokedexNumbers', 'evolvesTo', 'rules']
for col in columns_to_parse:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x != 'nan' and pd.notna(x) else ([] if col != 'nationalPokedexNumbers' else None))

print(df.columns)
Index(['id', 'name', 'supertype', 'subtypes', 'level', 'hp', 'types',
       'evolvesFrom', 'abilities', 'attacks', 'weaknesses', 'retreatCost',
       'convertedRetreatCost', 'number', 'artist', 'rarity', 'flavorText',
       'nationalPokedexNumbers', 'legalities', 'images', 'evolvesTo',
       'resistances', 'rules', 'regulationMark', 'ancientTrait'],
      dtype='object')
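
The parsing loop above is needed because a CSV file stores list and dictionary columns as plain strings. Here is a minimal sketch of what ast.literal_eval does to one such cell. The sample strings are made up for illustration and are not rows from the dataset.

```python
import ast

# Hypothetical cell values as they would come back from a CSV file
raw_types = "['Fire', 'Colorless']"
raw_weaknesses = "[{'type': 'Water', 'value': '×2'}]"

types = ast.literal_eval(raw_types)            # str -> list of str
weaknesses = ast.literal_eval(raw_weaknesses)  # str -> list of dict

print(type(raw_types).__name__, '->', type(types).__name__)
print(weaknesses[0]['type'], weaknesses[0]['value'])
```

Unlike eval, literal_eval only accepts Python literals, which makes it the safer choice for data read from disk.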

I took these columns and created a simple data dictionary for reference:

| Column Name | Data Type | Description | Allowed Values | Examples | Missing Values |
| --- | --- | --- | --- | --- | --- |
| id | String | Unique identifier for each card | Alphanumeric strings | “xy7-54”, “sm3-22” | No |
| name | String | Name of the Pokemon card | Alphanumeric strings | “Pikachu”, “Charizard” | No |
| supertype | String | Broad category of the card | “Pokémon” | “Pokémon” | No |
| subtypes | Array of strings | More specific category within the supertype | Alphanumeric strings | [“Basic”, “Stage 1”, “Stage 2”, “EX”, “Team Plasma”…] | No |
| level | String | Level of the Pokémon (if applicable) | Alphanumeric strings or X | “12”, “45”, “X” | Yes |
| hp | Integer | Hit points of the Pokémon | Positive integers | 60, 120, 200 | No |
| types | Array of strings | Types of the Pokémon | [“Fire”, “Water”, “Grass”, “Electric”, “Psychic”, “Fighting”, “Darkness”, “Metal”, “Fairy”, “Dragon”, “Colorless”] | [“Fire”], [“Water”, “Flying”] | No |
| evolvesFrom | String | Name of the Pokémon this card evolves from (if applicable) | Alphanumeric strings | “Pikachu”, “Charmander” | Yes |
| abilities | Array of objects | Special abilities of the Pokémon | Objects with name, text, and type fields | [{name: “Static”, text: “May paralyze opponent’s Pokémon”, type: “Poké-Body”}] | Yes |
| attacks | Array of objects | Attacks that the Pokémon can perform | Objects with name, cost, convertedEnergyCost, damage, and text fields | [{name: “Thunder Shock”, cost: [“Electric”, “Colorless”], convertedEnergyCost: 2, damage: “30”, text: “May paralyze opponent’s Pokémon”}] | Yes |
| weaknesses | Array of objects | Weaknesses of the Pokémon | Objects with type and value fields | [{type: “Fighting”, value: “×2”}] | Yes |
| retreatCost | Array of strings | Energy types required to retreat the Pokémon | [“Colorless”] | [“Colorless”, “Colorless”] | Yes |
| convertedRetreatCost | Integer | Total number of energy required to retreat the Pokémon | Non-negative integers | 1, 2, 3 | Yes |
| number | String | Card number within its set | Alphanumeric strings | “54”, “22” | No |
| artist | String | Name of the card’s illustrator | Alphanumeric strings | “Mitsuhiro Arita”, “5ban Graphics” | Yes |
| rarity | String | Rarity level of the card | “Common”, “Uncommon”, “Rare”, “Holo Rare”, “Ultra Rare”, “Secret Rare”, etc. | “Common”, “Holo Rare” | Yes |
| flavorText | String | Flavor text providing background or lore about the Pokémon | Alphanumeric strings | “When several of these Pokémon gather, their electricity could build and cause lightning storms.” | Yes |
| nationalPokedexNumbers | Array of integers | National Pokédex numbers associated with the Pokémon | Positive integers | [25], [6] | No |
| legalities | Object | Legality of the card in various formats | Fields for “expanded”, “standard”, “unlimited” with values “Legal” or “Not Legal” | {expanded: “Legal”, standard: “Not Legal”, unlimited: “Legal”} | No |
| images | Object | URLs for the card’s images | Fields for “small” and “large” with URL strings | {small: “http://…”, large: “http://…”} | No |
| evolvesTo | Array of strings | Names of Pokémon this card can evolve into (if applicable) | Alphanumeric strings | [“Raichu”, “Pikachu Libre”] | Yes |
| resistances | Array of objects | Resistances of the Pokémon | Objects with type and value fields | [{type: “Metal”, value: “-20”}] | Yes |
| rules | Array of strings | Special rules that apply to the card | Alphanumeric strings | [“If this Pokémon is your Active Pokémon, your opponent’s attacks do 20 less damage (before applying Weakness and Resistance).”] | Yes |
| regulationMark | String | Regulation mark for tournament legality | Single uppercase letters | “D”, “E” | Yes |
| ancientTrait | Object | Ancient Trait of the Pokémon (if applicable) | Object with name and text fields | {name: “Delta Evolution”, text: “This Pokémon can evolve from any type of basic Pokémon.”} | Yes |

We can see that quite a few features are not necessary. The obvious ones are id and images, since these are unique identifiers and URLs, so we can drop them from the dataframe. Now we can focus on the features that would help a model learn the game mechanics that determine the hit points of a pokemon card. Given this goal, we can also drop the legalities and regulationMark columns, since they pertain to the card game rules and not to the pokemon card itself. Finally, we can drop the supertype column, since every card in our dataset has the same supertype, Pokémon.

Some of the remaining columns may also turn out not to be useful for predicting the hit points of a pokemon card, but that is hard to tell without running some analysis.
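
A quick way to back up this kind of column triage is to look for columns with a single unique value and to measure how much of each column is missing. The sketch below runs on a toy frame with made-up values, not on the real card data.

```python
import pandas as pd

# Toy frame standing in for the real card data (values are illustrative)
toy = pd.DataFrame({
    'supertype': ['Pokémon', 'Pokémon', 'Pokémon'],
    'level': ['12', None, None],
    'hp': [60, 120, 200],
})

# Columns with a single unique value carry no signal for a model
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Fraction of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)

print(constant_cols)
print(missing_share)
```

On the real dataframe, the same two checks could be run directly on df before deciding what to drop.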

Feature Engineering

I looked at all the columns in the dataset and decided on the following feature engineering steps:

| Column Name | Feature Engineering Steps |
| --- | --- |
| id | We will drop this column since it is a unique identifier and does not provide any useful information for predicting hit points. |
| images | We will drop this column since it contains URLs to images and does not provide any useful information for predicting hit points. |
| legalities | We will drop this column since it pertains to the card game rules and not the pokemon card itself. |
| regulationMark | We will drop this column since it pertains to the card game rules and not the pokemon card itself. |
| supertype | We will drop this column since all of the cards in our dataset are of the same supertype, Pokémon. |
| hp | This is our target variable that we are trying to predict. We don’t need to do any feature engineering on this column. |
| level | Most of the values in this column are missing, but we can fill in the missing values with the median level and create a new feature indicating whether the level was missing or not. |
| nationalPokedexNumbers | We will convert this to a numerical value by taking the first number in the array. |
| convertedRetreatCost | This is already a numerical value and can be used as is. We will fill in any missing values with 0. |
| rarity | We can use one-hot encoding to convert this categorical feature into multiple binary features. |
| evolvesFrom | We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes. |
| evolvesTo | As with evolvesFrom, we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes. |
| subtypes | We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. |
| types | Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. |
| weaknesses | We can extract three features from this column: weakness_types (extract the types from the weaknesses and use MultiLabelBinarizer to convert them into multiple binary features); total_weakness_multiplier (sum the multiplier values, e.g. “×2” -> 2); total_weakness_modifier (sum the modifier values, e.g. “+20” -> 20). |
| resistances | Similar to weaknesses, we can extract three features from this column: resistance_types (extract the types from the resistances and use MultiLabelBinarizer to convert them into multiple binary features); total_resistance_modifier (sum the modifier values, e.g. “-20” -> -20); total_resistance_multiplier (sum the multiplier values). |
| retreatCost | Since this column carries the same information as convertedRetreatCost, we can drop it. |
| name | This feature has a very high cardinality. Originally my idea was to count the number of times each name appears in the dataset and use that as a feature. However, we can already count this using the nationalPokedexNumbers feature since each pokemon name corresponds to a unique pokedex number. Therefore, we can drop this feature. |
| artist | Similar to name, we can count the number of times each artist appears in the dataset and use that as a feature. |
| abilities | I will split this into three features: ability_count (the number of abilities the pokemon has); ability_text (the combined text of all abilities); has_pokemon_power (a binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability). |
| attacks | Similar to abilities, I will split this into three features: attack_count (the number of attacks the pokemon has); max_damage (the explicit maximum damage value among all attacks; since some damage values contain non-numeric characters, e.g. “30+” or “50×”, we extract the numeric part and convert it to an integer, and if no numeric value is present we search for a number in the attack text instead; in the future we can consider more complex parsing methods to better estimate the maximum damage); attack_cost (the total converted energy cost of all attacks). |
| rules | I will create a binary feature indicating whether the pokemon has any special rules or not. |
| ancientTrait | I will create a binary feature indicating whether the pokemon has an ancient trait or not. |
| flavorText | I believe the flavor text does not provide any information that could help us predict the HP of a pokemon card, but let’s use TfidfVectorizer and run an analysis on it to see. |

Cleaning the Data

In this section we will focus on dropping columns and extracting features from our initial list of features. We will then transform and scale them in the next section. Let’s first drop the columns that we decided aren’t useful:

  • id
  • images
  • legalities
  • regulationMark
  • supertype
  • retreatCost
  • name
df.drop(columns=['id', 'images', 'legalities', 'regulationMark', 'supertype', 'retreatCost', 'name'], inplace=True)

Direct Numerical Features

We can start with the columns that are already numerical values. These columns are:

  • level: I am replacing the X found in levels with 100, which is the highest level you can train a pokemon to in the games. I will fill in the missing data later.
df['level'] = df['level'].apply(
  lambda x: int(x.replace('X', '100')) if isinstance(x, str) and x != 'nan' and pd.notna(x) else None
)
  • level_was_missing: A binary feature indicating whether the level was missing or not.
df['level_was_missing'] = df['level'].isnull().astype(int)
  • nationalPokedexNumbers: We will convert this to a numerical value by taking the first number in the array.
df['primary_pokedex_number'] = df['nationalPokedexNumbers'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
)
  • pokemon_count: Counts how many Pokemon are in the nationalPokedexNumbers array.
df['pokemon_count'] = df['nationalPokedexNumbers'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)
  • convertedRetreatCost: This is already a numerical value and can be used as is. We just need to fill in any missing values with 0.
df['convertedRetreatCost'] = df['convertedRetreatCost'].fillna(0)
df['convertedRetreatCost'] = df['convertedRetreatCost'].replace('.', 0).astype(int)
  • number: We will convert this to a numerical value by taking the first run of digits in the card number. For example, “54a” would be converted to 54.
import re

df['number'] = df['number'].apply(
    lambda x: int(re.search(r'\d+', str(x)).group()) if pd.notna(x) and re.search(r'\d+', str(x)) else None
)
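
The level bullet above defers filling in the missing values. A minimal sketch of that later step, assuming the median imputation described in the plan table, could look like this. The toy values are made up.

```python
import pandas as pd

# Toy level column after parsing: numbers where present, NaN where missing
toy = pd.DataFrame({'level': [12.0, None, 45.0, 100.0, None]})

# Record missingness first, otherwise the flag would be all zeros
toy['level_was_missing'] = toy['level'].isna().astype(int)

# Fill the gaps with the median of the observed levels
median_level = toy['level'].median()
toy['level'] = toy['level'].fillna(median_level)

print(toy)
```

When the data is split for training and testing, the median would normally be computed on the training split only and then reused for the test split.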

Simple Categorical Features

Next we can look at the simple categorical features that have a limited number of unique values:

  • rarity: We can use one-hot encoding to convert this categorical feature into multiple binary features.
from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df['rarity'] = df['rarity'].fillna('Unknown')
rarity_encoded = hot_encoder.fit_transform(df[['rarity']])
rarity_encoded_df = pd.DataFrame(rarity_encoded, columns=hot_encoder.get_feature_names_out(['rarity']))
df = pd.concat([df, rarity_encoded_df], axis=1)
df.drop(columns=['rarity'], inplace=True)
  • evolvesFrom: We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
df['evolvesFrom'] = df['evolvesFrom'].notnull().astype(int)
  • evolvesTo: As with evolvesFrom, we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
df['evolvesTo'] = df['evolvesTo'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))

List-Based Categorical Features

Next we can look at the list-based categorical features. Weaknesses and resistances store their values as strings such as “×2” or “-20”, so we first define a helper function that extracts the multipliers and modifiers from them. Then we can proceed with the feature extraction.

def extract_modifiers(modifier_list):
  if not isinstance(modifier_list, list):
    return (0, 0)

  total_multiplier = 0
  total_modifier = 0

  for item in modifier_list:
    value_str = item['value'].strip()

    if '×' in value_str:
      numeric_part = value_str.replace('×', '')
      total_multiplier += int(numeric_part)
    elif '+' in value_str or '-' in value_str:
      total_modifier += int(value_str)
          
  return (total_multiplier, total_modifier)
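
Here is a quick check of how extract_modifiers behaves on a few hand-written inputs. The function is repeated so that the snippet runs on its own, and the sample values are hypothetical rather than taken from the dataset.

```python
def extract_modifiers(modifier_list):
    # Same logic as the function above, repeated so this snippet is standalone
    if not isinstance(modifier_list, list):
        return (0, 0)
    total_multiplier = 0
    total_modifier = 0
    for item in modifier_list:
        value_str = item['value'].strip()
        if '×' in value_str:
            total_multiplier += int(value_str.replace('×', ''))
        elif '+' in value_str or '-' in value_str:
            total_modifier += int(value_str)
    return (total_multiplier, total_modifier)

# Hand-written sample inputs, not rows from the dataset
single_multiplier = extract_modifiers([{'type': 'Fighting', 'value': '×2'}])
single_modifier = extract_modifiers([{'type': 'Metal', 'value': '-20'}])
mixed = extract_modifiers([{'type': 'Fire', 'value': '×2'},
                           {'type': 'Water', 'value': '+20'}])
missing = extract_modifiers(None)

print(single_multiplier, single_modifier, mixed, missing)
```

Both totals are sums, so a card with no weaknesses or resistances ends up with (0, 0).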
  • subtypes: We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

subtypes_encoded = mlb.fit_transform(df['subtypes'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
subtypes_encoded_df = pd.DataFrame(subtypes_encoded, columns=[f'subtype_{cls}' for cls in mlb.classes_])
df = pd.concat([df, subtypes_encoded_df], axis=1)
df.drop(columns=['subtypes'], inplace=True)
  • types: Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
types_encoded = mlb.fit_transform(df['types'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
types_encoded_df = pd.DataFrame(types_encoded, columns=[f'type_{cls}' for cls in mlb.classes_])
df = pd.concat([df, types_encoded_df], axis=1)
df.drop(columns=['types'], inplace=True)
  • weaknesses: We can extract three features from this column:
    • weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
    • total_weakness_multiplier: We can extract the multiplier values from the weaknesses and sum them up to create a new numerical feature (e.g. “×2” -> 2).
    • total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
mlb_weakness = MultiLabelBinarizer()

weakness_encoded = mlb_weakness.fit_transform(
    df['weaknesses'].apply(
        lambda x: [w['type'] for w in x] if isinstance(x, list) else []
    )
)
weakness_encoded_df = pd.DataFrame(weakness_encoded, columns=[f'weakness_{cls}' for cls in mlb_weakness.classes_])
df = pd.concat([df, weakness_encoded_df], axis=1)

total_weakness_values = df['weaknesses'].apply(extract_modifiers)
df[['total_weakness_multiplier', 'total_weakness_modifier']] = pd.DataFrame(
  total_weakness_values.tolist(), 
  index=df.index
)
df.drop(columns=['weaknesses'], inplace=True)
  • resistances: Similar to weaknesses, we can extract three features from this column:
    • resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
    • total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
    • total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
mlb_resistance = MultiLabelBinarizer()

resistance_encoded = mlb_resistance.fit_transform(
    df['resistances'].apply(
        lambda x: [r['type'] for r in x] if isinstance(x, list) else []
    )
)
resistance_encoded_df = pd.DataFrame(resistance_encoded, columns=[f'resistance_{cls}' for cls in mlb_resistance.classes_])
df = pd.concat([df, resistance_encoded_df], axis=1)

total_resistance_values = df['resistances'].apply(extract_modifiers)
df[['total_resistance_multiplier', 'total_resistance_modifier']] = pd.DataFrame(
    total_resistance_values.tolist(),
    index=df.index
)
df.drop(columns=['resistances'], inplace=True)
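
To make the list-based encoding concrete, here is a small sketch of MultiLabelBinarizer on a toy column. The values are illustrative. The only difference from the code above is that the encoded frame is built with index=toy.index, which is an extra safeguard I am assuming is useful rather than something the original code needs.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy list-valued column (illustrative values)
toy = pd.DataFrame({'types': [['Fire'], ['Water', 'Colorless'], []]})

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy['types'])

# Passing index=toy.index keeps rows aligned when concatenating
encoded_df = pd.DataFrame(
    encoded,
    columns=[f'type_{cls}' for cls in mlb.classes_],
    index=toy.index,
)
toy = pd.concat([toy.drop(columns=['types']), encoded_df], axis=1)
print(toy)
```

An empty list becomes a row of zeros. The explicit index only matters when the dataframe has been filtered or reordered, because pd.concat aligns on index labels rather than on row position.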

High-Cardinality Categorical Features

Let’s take a look at the categorical features that have a high number of unique values:

  • pokedex_frequency: We can count the number of times each pokedex number appears in the dataset and use that as a feature.
# Convert lists to tuples (hashable) for frequency counting
pokedex_keys = df['nationalPokedexNumbers'].apply(
    lambda x: tuple(x) if isinstance(x, list) else None
)
# Map each card to the number of cards sharing the same pokedex numbers
df['pokedex_frequency'] = pokedex_keys.map(pokedex_keys.value_counts())
df.drop(columns=['nationalPokedexNumbers'], inplace=True)
  • artist: We can count the number of times each artist appears in the dataset and use that as a feature.
df['artist_frequency'] = df['artist'].map(df['artist'].value_counts())
df.drop(columns=['artist'], inplace=True)
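
One detail of frequency encoding is worth spelling out: value_counts skips missing values, so rows with a missing artist map to NaN. The sketch below shows this on a toy column with placeholder names. Filling with 0 is one possible choice and an assumption on my part, not a step taken in the code above.

```python
import pandas as pd

# Toy artist column with one missing value (names are placeholders)
toy = pd.DataFrame({'artist': ['A', 'B', 'A', None, 'A']})

counts = toy['artist'].value_counts()  # missing values are excluded by default
toy['artist_frequency'] = toy['artist'].map(counts)

# Rows with a missing artist map to NaN; filling with 0 is one possible choice
toy['artist_frequency'] = toy['artist_frequency'].fillna(0).astype(int)
print(toy)
```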

Complex JSON/Text Features

Finally, we have the more complex features that are in JSON format or text:

  • abilities: I will split this into three features:
    • ability_count: The number of abilities the pokemon has.
    • ability_text: The combined text of all abilities.
    • has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
df['ability_count'] = df['abilities'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['ability_text'] = df['abilities'].apply(lambda x: ' '.join([ability['text'] for ability in x]) if isinstance(x, list) else '')
df['has_pokemon_power'] = df['abilities'].apply(lambda x: int(any(ability['type'] in ['Poké-Body', 'Poké-Power'] for ability in x)) if isinstance(x, list) else 0)
df.drop(columns=['abilities'], inplace=True)
  • attacks: Similar to abilities, I will split this into three features:
    • attack_count: The number of attacks the pokemon has.
    • max_damage: The explicit maximum damage value among all attacks. Since some damage values may contain non-numeric characters (e.g., “30+”, “50×”), we will extract the numeric part and convert it to an integer. If no numeric value is present, we will search for a number in the attack text to use. In the future we can also consider more complex parsing methods to better estimate the maximum damage.
    • attack_cost: The total converted energy cost of all attacks.
df['attack_count'] = df['attacks'].apply(lambda x: len(x) if isinstance(x, list) else 0)

def extract_max_damage(attacks):
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    
    damages = []
    
    for attack in attacks:
        if isinstance(attack.get('damage'), str):
            damage_str = attack['damage'].replace('+', '').replace('-', '').replace('×', '').strip()
            if damage_str.isdigit():
                damages.append(int(damage_str))
                continue
            
        if isinstance(attack.get('text'), str):
            numbers = re.findall(r'\b(\d+)\b', attack['text'])
            if numbers:
                damages.append(max(int(num) for num in numbers))
    
    return max(damages, default=0)

df['max_damage'] = df['attacks'].apply(extract_max_damage)

df['attack_cost'] = df['attacks'].apply(lambda x: sum([len(attack['cost']) for attack in x]) if isinstance(x, list) else 0)
df.drop(columns=['attacks'], inplace=True)
  • rules: This can be converted into a binary feature indicating whether the pokemon has any special rules or not.
df['has_rules'] = df['rules'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
df.drop(columns=['rules'], inplace=True)
  • ancientTrait: This can also be converted into a binary feature indicating whether the pokemon has an ancient trait or not.
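# ancientTrait was not parsed with ast.literal_eval, so after read_csv it is a string (or NaN when missing)
df["has_ancient_trait"] = df['ancientTrait'].apply(lambda x: int(isinstance(x, (dict, str))))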
df.drop(columns=['ancientTrait'], inplace=True)
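
To make the max_damage fallback logic easier to follow, here is a condensed, standalone sketch of the same idea applied to hand-written attacks. It uses a regular expression on the damage field instead of chained replace calls, so it differs slightly from the function above. The helper name and the sample attacks are hypothetical.

```python
import re

def max_damage_sketch(attacks):
    """Largest damage number found across attacks, 0 if none."""
    damages = []
    for attack in attacks or []:
        # Prefer the explicit damage field, e.g. "30+" or "50×"
        match = re.search(r'\d+', attack.get('damage') or '')
        if match:
            damages.append(int(match.group()))
            continue
        # Fall back to the largest standalone number in the attack text
        numbers = re.findall(r'\b\d+\b', attack.get('text') or '')
        if numbers:
            damages.append(max(int(n) for n in numbers))
    return max(damages, default=0)

# Hypothetical attacks for illustration
attacks = [
    {'name': 'Tackle', 'damage': '30+', 'text': 'Flip a coin.'},
    {'name': 'Curse', 'damage': '', 'text': 'Put 5 damage counters on it.'},
]
print(max_damage_sketch(attacks))
print(max_damage_sketch([]))
```

The second attack shows the weakness of the text fallback: it picks up any number in the text, whether or not that number is a damage value.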

What Our New Dataset Looks Like

We can see that we have successfully transformed our original dataframe into a more structured format that is suitable for machine learning models. From 25 original columns, we now have 111 features that capture various aspects of the pokemon cards. We save this new dataframe to a csv file for future use.
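
A quick sanity check before saving can confirm what still needs work in the next section. The sketch below uses a toy frame with made-up values rather than the real dataframe. On the real one, text columns such as ability_text and flavorText would be expected to show up as non-numeric, since they have not been vectorized yet.

```python
import pandas as pd

# Toy frame with a few of the engineered columns (values are illustrative)
toy = pd.DataFrame({
    'hp': [60, 120],
    'has_rules': [0, 1],
    'ability_text': ['', 'Heal 20 damage from this Pokémon.'],
})

# Columns that are still non-numeric and need further processing
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

# Columns that still contain missing values
has_missing = toy.columns[toy.isna().any()].tolist()

print(toy.shape, non_numeric, has_missing)
```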

Creating Accurate Attack Damage and Utility Features

Up to this point we have created and extracted features using simple parsing methods without looking deeply into the dataset. One feature that I believe can be improved is the max_damage feature that we extracted from the attacks column. Currently, we rely on the explicit numeric value in the damage field, with a simple fallback to numbers found in the attack text, which may not accurately represent the maximum potential damage of an attack. For example, an attack with a damage value of “30+” could deal more than 30 damage, depending on certain conditions. To improve the accuracy of this feature, I am using a large language model (LLM) to parse the attack text and estimate the maximum potential damage based on the description provided.

Additionally we can also perform a similar process to extract utility features from the ability text. Abilities can provide various benefits to the pokemon, such as healing, drawing cards, or manipulating energy. By using an LLM to analyze the ability text, we can identify and extract these utility features, which may contribute to the overall effectiveness of the pokemon card.

I will be creating four new features using the LLM:

  • attack_damage: The estimated maximum damage of the attack based on the attack text.
  • attack_utility: A binary feature indicating whether the attack has any utility effects (e.g., healing, status effects).
  • ability_damage: A binary feature indicating whether the ability has any damage-related effects.
  • ability_utility: A binary feature indicating whether the ability has any utility effects.

I will be using the Gemini-2.5-flash model from Google Cloud for this task. The prompt used for extracting the maximum damage from the attack text is in the appendix.
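
Since the prompt lives in the appendix, the sketch below only covers the step after the model replies: validating the response before writing it to the dataframe. The JSON shape, the helper name parse_llm_features, and the fallback values are all assumptions made for illustration, and no Gemini API call is shown.

```python
import json

# Assumed field names, matching the four features listed above
EXPECTED_FIELDS = ('attack_damage', 'attack_utility',
                   'ability_damage', 'ability_utility')

def parse_llm_features(response_text):
    """Parse a hypothetical JSON reply into the four features.

    Falls back to zeros when the reply is not valid JSON, so one bad
    response does not break the whole batch.
    """
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    features = {}
    for field in EXPECTED_FIELDS:
        try:
            features[field] = int(data.get(field, 0))
        except (TypeError, ValueError):
            features[field] = 0
    # The three flags are binary, so clamp them to 0 or 1
    for flag in EXPECTED_FIELDS[1:]:
        features[flag] = int(features[flag] > 0)
    return features

good = parse_llm_features('{"attack_damage": 120, "attack_utility": 1}')
bad = parse_llm_features('not json')
print(good)
print(bad)
```

Defaulting to zero keeps the pipeline running, but it may be worth logging the failed rows so they can be retried or reviewed by hand.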

Creating Text Complexity Features

In addition to the features we have already created, we can also analyze the text complexity of the ability and attack text. Text complexity can provide insights into how difficult it is to understand the abilities and attacks of a pokemon card. This can be important, as more complex text may indicate more powerful or nuanced effects.

To quantify text complexity, we can use the spaCy NLP library to perform dependency parsing. By constructing a syntactic tree for each card’s ability and attack text, I will calculate the maximum tree depth, which serves as a proxy for the ‘cognitive load’ required to execute the card’s effects. This text_depth feature will be combined with a frequency count of game-specific keywords (e.g., ‘Discard’, ‘Shuffle’) and the text length to create a ‘Mechanic Complexity’ predictor using Principal Component Analysis (PCA). This composite feature aims to capture the overall complexity of a card’s mechanics, which may correlate with its effectiveness in gameplay.
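
Maximum tree depth can be computed from any dependency parse once each token’s head is known. The sketch below works on a plain list of head indices so that it runs without a spaCy model. The heads list, the keyword set, and the helper names are assumptions for illustration. With spaCy, a token’s head index is available as token.head.i, and the root token is its own head.

```python
def max_tree_depth(heads):
    """Depth of a dependency tree given each token's head index.

    heads[i] is the index of token i's head; the root points to itself.
    The root has depth 1.
    """
    def depth(i):
        d = 1
        while heads[i] != i:
            i = heads[i]
            d += 1
        return d
    return max((depth(i) for i in range(len(heads))), default=0)

# Hypothetical keyword list; the real one would come from the card text
KEYWORDS = {'discard', 'shuffle', 'draw', 'heal'}

def keyword_count(text):
    # Strip basic punctuation and lowercase before matching
    words = [w.strip('.,;:!?()').lower() for w in text.split()]
    return sum(w in KEYWORDS for w in words)

# "Discard a card": 'Discard' is the root, 'card' depends on it, 'a' on 'card'
heads = [0, 2, 0]
text = "Discard a card. Then shuffle your deck."

text_depth = max_tree_depth(heads)
word_count = len(text.split())
n_keywords = keyword_count(text)
print(text_depth, word_count, n_keywords)
```

Whitespace splitting is a rough word count. A tokenizer would give slightly different numbers, but the relative ordering between cards should be similar.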

I will be creating the following features:

  • text_depth: The maximum depth of the syntactic tree for the ability and attack text.
  • word_count: The total number of words in the ability and attack text.
  • keyword_count: The count of game-specific keywords in the ability and attack text.
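
The three features listed above can then be combined into a single score with PCA. The following is a minimal sketch on made-up numbers, assuming the features are standardized first and that only the first principal component is kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy rows of [text_depth, word_count, keyword_count]; values are made up
X = np.array([
    [2, 8, 0],
    [3, 15, 1],
    [5, 32, 3],
    [6, 40, 4],
], dtype=float)

# Standardize first so word_count does not dominate through its larger scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
mechanic_complexity = pca.fit_transform(X_scaled).ravel()

print(mechanic_complexity)
print(pca.explained_variance_ratio_)
```

The sign of a principal component is arbitrary, so it is worth checking that higher scores really correspond to more complex cards and flipping the sign if they do not.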
df.to_csv('data/processed_pokemon_cards.csv', index=False)