2. Feature Engineering

2025-09-21

Brief Look at the Dataset

First we need to load our dataframe from the CSV file we created in part 1. Then, let's take a look at all the columns in the dataset.

import pandas as pd
import ast

df = pd.read_csv('data/first_gen_pokemon_cards.csv')

columns_to_parse = ['weaknesses', 'resistances', 'subtypes', 'types', 'abilities', 'attacks', 'nationalPokedexNumbers', 'evolvesTo', 'rules']
for col in columns_to_parse:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x != 'nan' and pd.notna(x) else ([] if col != 'nationalPokedexNumbers' else None))

print(df.columns)

Index(['id', 'name', 'supertype', 'subtypes', 'level', 'hp', 'types',
       'evolvesFrom', 'abilities', 'attacks', 'weaknesses', 'retreatCost',
       'convertedRetreatCost', 'number', 'artist', 'rarity', 'flavorText',
       'nationalPokedexNumbers', 'legalities', 'images', 'evolvesTo',
       'resistances', 'rules', 'regulationMark', 'ancientTrait'],
      dtype='object')
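The parsing loop above relies on ast.literal_eval, which evaluates a string containing a Python literal without running arbitrary code. Here is a minimal sketch of what happens to a single cell. The cell contents are made up, and parse_cell is a hypothetical helper name, not something from the original code:

```python
import ast

def parse_cell(value, default):
    """Hypothetical helper: turn a stringified Python literal back into an object."""
    # Missing cells arrive as float NaN (not str) after read_csv, so use the default
    if not isinstance(value, str) or value == 'nan':
        return default
    return ast.literal_eval(value)

# Made-up cell contents, shaped like the CSV from part 1
weaknesses_cell = "[{'type': 'Fighting', 'value': '×2'}]"
parsed = parse_cell(weaknesses_cell, default=[])
print(parsed[0]['type'])                      # Fighting
print(parse_cell(float('nan'), default=[]))   # []
```

Columns that are not in columns_to_parse (for example ancientTrait) stay as plain strings or NaN, which matters for any later isinstance checks.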

I took these columns and created a simple data dictionary for reference:

Column Name	Data Type	Description	Allowed Values	Examples	Missing Values
id	String	Unique identifier for each card	Alphanumeric strings	“xy7-54”, “sm3-22”	No
name	String	Name of the Pokemon card	Alphanumeric strings	“Pikachu”, “Charizard”	No
supertype	String	Broad category of the card	“Pokémon”	“Pokémon”	No
subtypes	Array of strings	More specific categories within the supertype	Array of strings	[“Basic”, “Stage 1”, “Stage 2”, “EX”, “Team Plasma”…]	No
level	String	Level of the Pokémon (if applicable)	Alphanumeric strings or X	“12”, “45”, “X”	Yes
hp	Integer	Hit points of the Pokémon	Positive integers	60, 120, 200	No
types	Array of strings	Types of the Pokémon	[“Fire”, “Water”, “Grass”, “Lightning”, “Psychic”, “Fighting”, “Darkness”, “Metal”, “Fairy”, “Dragon”, “Colorless”]	[“Fire”], [“Water”, “Metal”]	No
evolvesFrom	String	Name of the Pokémon this card evolves from (if applicable)	Alphanumeric strings	“Pikachu”, “Charmander”	Yes
abilities	Array of objects	Special abilities of the Pokémon	Objects with name, text, and type fields	[{name: “Static”, text: “May paralyze opponent’s Pokémon”, type: “Poké-Body”}]	Yes
attacks	Array of objects	Attacks that the Pokémon can perform	Objects with name, cost, convertedEnergyCost, damage, and text fields	[{name: “Thunder Shock”, cost: [“Lightning”, “Colorless”], convertedEnergyCost: 2, damage: “30”, text: “May paralyze opponent’s Pokémon”}]	Yes
weaknesses	Array of objects	Weaknesses of the Pokémon	Objects with type and value fields	[{type: “Fighting”, value: “×2”}]	Yes
retreatCost	Array of strings	Energy types required to retreat the Pokémon	[“Colorless”]	[“Colorless”, “Colorless”]	Yes
convertedRetreatCost	Integer	Total number of energy required to retreat the Pokémon	Non-negative integers	1, 2, 3	Yes
number	String	Card number within its set	Alphanumeric strings	“54”, “22”	No
artist	String	Name of the card’s illustrator	Alphanumeric strings	“Mitsuhiro Arita”, “5ban Graphics”	Yes
rarity	String	Rarity level of the card	“Common”, “Uncommon”, “Rare”, “Holo Rare”, “Ultra Rare”, “Secret Rare”, etc.	“Common”, “Holo Rare”	Yes
flavorText	String	Flavor text providing background or lore about the Pokémon	Alphanumeric strings	“When several of these Pokémon gather, their electricity could build and cause lightning storms.”	Yes
nationalPokedexNumbers	Array of integers	National Pokédex numbers associated with the Pokémon	Positive integers	[25], [6]	No
legalities	Object	Legality of the card in various formats	Fields for “expanded”, “standard”, “unlimited” with values “Legal” or “Not Legal”	{expanded: “Legal”, standard: “Not Legal”, unlimited: “Legal”}	No
images	Object	URLs for the card’s images	Fields for “small” and “large” with URL strings	{small: “http://…”, large: “http://…”}	No
evolvesTo	Array of strings	Names of Pokémon this card can evolve into (if applicable)	Alphanumeric strings	[“Raichu”, “Pikachu Libre”]	Yes
resistances	Array of objects	Resistances of the Pokémon	Objects with type and value fields	[{type: “Metal”, value: “-20”}]	Yes
rules	Array of strings	Special rules that apply to the card	Alphanumeric strings	[“If this Pokémon is your Active Pokémon, your opponent’s attacks do 20 less damage (before applying Weakness and Resistance).”]	Yes
regulationMark	String	Regulation mark for tournament legality	Single uppercase letters	“D”, “E”	Yes
ancientTrait	Object	Ancient Trait of the Pokémon (if applicable)	Object with name and text fields	{name: “Delta Evolution”, text: “This Pokémon can evolve from any type of basic Pokémon.”}	Yes

We can see that there are quite a few features that are not necessary; the obvious ones are id and images, since these are unique identifiers and URLs. We can drop these columns from the dataframe. Now we can focus on the features that would help a model learn the game mechanics that determine the hit points of a pokemon card. Given that this is our goal, we can also drop the legalities and regulationMark columns, since they pertain to the rules of the card game and not to the pokemon card itself. Finally, we can also drop the supertype column, since all of the cards in our dataset have the same supertype, Pokémon.

Some of the remaining columns are probably not useful for predicting the hit points of a pokemon card either, but that is hard to tell without running some analysis.

Feature Engineering

I looked at all the columns in the dataset and decided on the following feature engineering steps:

Column Name	Feature Engineering Steps
`id`	We will drop this column since it is a unique identifier and does not provide any useful information for predicting hit points.
`images`	We will drop this column since it contains URLs to images and does not provide any useful information for predicting hit points.
`legalities`	We will drop this column since it pertains to the card game rules and not the pokemon card itself.
`regulationMark`	We will drop this column since it pertains to the card game rules and not the pokemon card itself.
`supertype`	We will drop this column since all of the cards in our dataset are of the same supertype `Pokémon`.
`hp`	This is our target variable that we are trying to predict. We don’t need to do any feature engineering on this column.
`level`	Most of the values in this column are missing, but we can fill in the missing values with the median level and create a new feature indicating whether the level was missing or not.
`nationalPokedexNumbers`	We will convert this to a numerical value by taking the first number in the array.
`convertedRetreatCost`	This is already a numerical value and can be used as is. We will fill in any missing values with 0.
`rarity`	We can use one-hot encoding to convert this categorical feature into multiple binary features.
`evolvesFrom`	We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
`evolvesTo`	Handled the same way as `evolvesFrom`: we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
`subtypes`	We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
`types`	Similar to `subtypes`, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features
`weaknesses`	We can extract three features from this column: - `weakness_types`: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - `total_weakness_multiplier`: We can extract the multiplier values from the weaknesses (e.g. “×2” -> 2) and sum them up to create a new numerical feature. - `total_weakness_modifier`: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
`resistances`	Similar to `weaknesses`, we can extract three features from this column: - `resistance_types`: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - `total_resistance_modifier`: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature. - `total_resistance_multiplier`: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
`retreatCost`	Since this column carries the same information as `convertedRetreatCost` (every retreat cost energy is Colorless, so only the count matters), we can drop this column.
`name`	This feature has a very high cardinality. Originally my idea was to count the number of times each name appears in the dataset and use that as a feature. However, we can already count this using the `nationalPokedexNumbers` feature since each pokemon name corresponds to a unique pokedex number. Therefore, we can drop this feature.
`artist`	Like `name`, this feature has a very high cardinality. We can count the number of times each artist appears in the dataset and use that as a feature.
`abilities`	I will split this into three features: - `ability_count`: The number of abilities the pokemon has. - `ability_text`: The combined text of all abilities. - `has_pokemon_power`: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
`attacks`	Similar to `abilities`, I will split this into three features: - `attack_count`: The number of attacks the pokemon has. - `max_damage`: The explicit maximum damage value among all attacks. Since some damage values may contain non-numeric characters (e.g., “30+”, “50x”), we will extract the numeric part and convert it to an integer. If no numeric value is present, we will search for a number in the `attack` text to use. In the future we can also consider more complex parsing methods to better estimate the maximum damage. - `attack_cost`: The total converted energy cost of all attacks.
`rules`	I will create a binary feature indicating whether the pokemon has any special rules or not.
`ancientTrait`	I will create a binary feature indicating whether the pokemon has an ancient trait or not.
`flavorText`	I believe the flavor text does not provide any information that could help us predict the HP of a pokemon card, but let's use TfidfVectorizer and run an analysis on it to check.

Cleaning the Data

In this section we will focus on dropping columns and extracting features from our initial list of features. We will then transform and scale them in the next section. Let's first drop the columns that we decided aren’t useful from the dataframe:

id
images
legalities
regulationMark
supertype
retreatCost
name

df.drop(columns=['id', 'images', 'legalities', 'regulationMark', 'supertype', 'retreatCost', 'name'], inplace=True)

Direct Numerical Features

We can start with the columns that are already numerical values. These columns are:

level: I am replacing the X found in some levels with 100, which is the highest level a pokemon can reach in the video games. I will fill in the missing data later.

df['level'] = df['level'].apply(
  lambda x: int(x.replace('X', '100')) if isinstance(x, str) and x != 'nan' and pd.notna(x) else None
)

level_was_missing: A binary feature indicating whether the level was missing or not.

df['level_was_missing'] = df['level'].isnull().astype(int)
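The missing levels themselves are filled in later. Here is a minimal sketch of that step on toy values, assuming the median imputation planned in the table above. The key detail is that the indicator has to be computed before the fill; otherwise it would be all zeros:

```python
import pandas as pd

# Toy stand-in for df['level'] after the 'X' -> 100 replacement
toy = pd.DataFrame({'level': [12.0, None, 100.0, None, 45.0]})

# Record missingness first, then impute with the median of the observed values
toy['level_was_missing'] = toy['level'].isnull().astype(int)
toy['level'] = toy['level'].fillna(toy['level'].median())

print(toy['level'].tolist())              # [12.0, 45.0, 100.0, 45.0, 45.0]
print(toy['level_was_missing'].tolist())  # [0, 1, 0, 1, 0]
```

When the data is later split for training, it is usually safer to compute the median on the training rows only, so that no information from the test rows leaks into the feature.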

nationalPokedexNumbers: We will convert this to a numerical value by taking the first number in the array.

df['primary_pokedex_number'] = df['nationalPokedexNumbers'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
)

pokemon_count: Counts how many Pokemon are in the nationalPokedexNumbers array.

df['pokemon_count'] = df['nationalPokedexNumbers'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

convertedRetreatCost: This is already a numerical value and can be used as is. We just need to fill in any missing values with 0.

df['convertedRetreatCost'] = df['convertedRetreatCost'].fillna(0)
df['convertedRetreatCost'] = df['convertedRetreatCost'].replace('.', 0).astype(int)

number: We will convert this to a numerical value by taking the first run of digits in the card number and ignoring any non-numeric characters around it. For example, “54a” would be converted to 54.

import re

df['number'] = df['number'].apply(
    lambda x: int(re.search(r'\d+', str(x)).group()) if pd.notna(x) and re.search(r'\d+', str(x)) else None
)
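To see what the regular expression does with different card number formats, here is a small standalone sketch. The sample values are made up, and first_int is a hypothetical helper that mirrors the lambda above:

```python
import re

def first_int(card_number):
    """Hypothetical helper: return the first run of digits as an int, or None."""
    match = re.search(r'\d+', str(card_number))
    return int(match.group()) if match else None

# Made-up card numbers in a few plausible formats
for raw in ['54', '54a', 'SM10', 'TG05', 'ABC']:
    print(raw, '->', first_int(raw))
```

One side effect worth keeping in mind: prefixes are discarded, so a promo-style number such as 'SM10' and a plain '10' end up with the same value.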

Simple Categorical Features

Next we can look at the simple categorical features that have a limited number of unique values:

rarity: We can use one-hot encoding to convert this categorical feature into multiple binary features.

from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df['rarity'] = df['rarity'].fillna('Unknown')
rarity_encoded = hot_encoder.fit_transform(df[['rarity']])
rarity_encoded_df = pd.DataFrame(rarity_encoded, columns=hot_encoder.get_feature_names_out(['rarity']))
df = pd.concat([df, rarity_encoded_df], axis=1)
df.drop(columns=['rarity'], inplace=True)

evolvesFrom: We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.

df['evolvesFrom'] = df['evolvesFrom'].notnull().astype(int)

evolvesTo: This is handled the same way as evolvesFrom. We can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.

df['evolvesTo'] = df['evolvesTo'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))

List-Based Categorical Features

Next we can look at the list-based categorical features. For weaknesses and resistances we also need to extract the numeric modifiers, so we first define a helper function for that and then proceed with the feature extraction.

def extract_modifiers(modifier_list):
  if not isinstance(modifier_list, list):
    return (0, 0)

  total_multiplier = 0
  total_modifier = 0

  for item in modifier_list:
    value_str = item['value'].strip()

    if '×' in value_str:
      numeric_part = value_str.replace('×', '')
      total_multiplier += int(numeric_part)
    elif '+' in value_str or '-' in value_str:
      total_modifier += int(value_str)
          
  return (total_multiplier, total_modifier)
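To make the return value concrete, here is a standalone restatement of the same summing logic, under the hypothetical name sum_modifiers so that the snippet runs on its own, applied to a few made-up entries:

```python
def sum_modifiers(entries):
    """Standalone restatement of the summing logic, for illustration only."""
    total_multiplier = 0
    total_modifier = 0
    for item in entries:
        value = item['value'].strip()
        if '×' in value:
            total_multiplier += int(value.replace('×', ''))
        elif '+' in value or '-' in value:
            total_modifier += int(value)  # int() accepts a leading sign
    return total_multiplier, total_modifier

# Made-up weakness and resistance lists
print(sum_modifiers([{'type': 'Fighting', 'value': '×2'}]))   # (2, 0)
print(sum_modifiers([{'type': 'Metal', 'value': '-20'},
                     {'type': 'Psychic', 'value': '-20'}]))   # (0, -40)
print(sum_modifiers([{'type': 'Fire', 'value': '+20'}]))      # (0, 20)
```

Both totals are sums, so a card with no weaknesses or resistances simply gets (0, 0).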

subtypes: We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

subtypes_encoded = mlb.fit_transform(df['subtypes'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
subtypes_encoded_df = pd.DataFrame(subtypes_encoded, columns=[f'subtype_{cls}' for cls in mlb.classes_])
df = pd.concat([df, subtypes_encoded_df], axis=1)
df.drop(columns=['subtypes'], inplace=True)

types: Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.

types_encoded = mlb.fit_transform(df['types'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
types_encoded_df = pd.DataFrame(types_encoded, columns=[f'type_{cls}' for cls in mlb.classes_])
df = pd.concat([df, types_encoded_df], axis=1)
df.drop(columns=['types'], inplace=True)
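For readers who have not used MultiLabelBinarizer before, the encoding it produces can be illustrated in a few lines of plain Python. This is only a sketch of the idea (one column per distinct label, 1 if the row contains it), not the scikit-learn implementation, and the sample rows are made up:

```python
def multi_hot(rows):
    """Pure-Python illustration of multi-label binarization."""
    # One column per distinct label, in sorted order
    classes = sorted({label for row in rows for label in row})
    matrix = [[int(cls in row) for cls in classes] for row in rows]
    return classes, matrix

# Made-up `types` values for three cards; the last one has no type listed
classes, matrix = multi_hot([['Fire'], ['Water', 'Metal'], []])
print(classes)  # ['Fire', 'Metal', 'Water']
print(matrix)   # [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
```

A card with an empty list gets a row of zeros, which is why missing values are turned into empty lists before encoding.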

weaknesses: We can extract three features from this column:
- weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_weakness_multiplier: We can extract the multiplier values from the weaknesses and sum them up to create a new numerical feature (e.g. “×2” -> 2).
- total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.

mlb_weakness = MultiLabelBinarizer()

weakness_encoded = mlb_weakness.fit_transform(
    df['weaknesses'].apply(
        lambda x: [w['type'] for w in x] if isinstance(x, list) else []
    )
)
weakness_encoded_df = pd.DataFrame(weakness_encoded, columns=[f'weakness_{cls}' for cls in mlb_weakness.classes_])
df = pd.concat([df, weakness_encoded_df], axis=1)

total_weakness_values = df['weaknesses'].apply(extract_modifiers)
df[['total_weakness_multiplier', 'total_weakness_modifier']] = pd.DataFrame(
  total_weakness_values.tolist(), 
  index=df.index
)
df.drop(columns=['weaknesses'], inplace=True)

resistances: Similar to weaknesses, we can extract three features from this column:
- resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
- total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.

mlb_resistance = MultiLabelBinarizer()

resistance_encoded = mlb_resistance.fit_transform(
    df['resistances'].apply(
        lambda x: [r['type'] for r in x] if isinstance(x, list) else []
    )
)
resistance_encoded_df = pd.DataFrame(resistance_encoded, columns=[f'resistance_{cls}' for cls in mlb_resistance.classes_])
df = pd.concat([df, resistance_encoded_df], axis=1)

total_resistance_values = df['resistances'].apply(extract_modifiers)
df[['total_resistance_multiplier', 'total_resistance_modifier']] = pd.DataFrame(
    total_resistance_values.tolist(),
    index=df.index
)
df.drop(columns=['resistances'], inplace=True)

High-Cardinality Categorical Features

Let's take a look at the categorical features that have a high number of unique values:

pokedex_frequency: We can count the number of times each pokedex number appears in the dataset and use that as a feature.

# Convert lists to tuples (hashable) for frequency counting
pokedex_keys = df['nationalPokedexNumbers'].apply(
    lambda x: tuple(x) if isinstance(x, list) else None
)
# Map each card's tuple to the number of cards sharing that tuple
df['pokedex_frequency'] = pokedex_keys.map(pokedex_keys.value_counts())
df.drop(columns=['nationalPokedexNumbers'], inplace=True)

artist: We can count the number of times each artist appears in the dataset and use that as a feature.

df['artist_frequency'] = df['artist'].map(df['artist'].value_counts())
df.drop(columns=['artist'], inplace=True)
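One detail of this frequency encoding is easy to miss: value_counts() skips missing values, so rows without an artist end up with NaN instead of a count. A minimal sketch on a made-up column shows this; filling with 0 is just one possible choice, and an assumption on my part, not part of the code above:

```python
import pandas as pd

# Made-up artist column with one missing entry
artist = pd.Series(['Ken Sugimori', 'Mitsuhiro Arita', 'Ken Sugimori', None])

# value_counts() ignores the missing row, so it maps to NaN
freq = artist.map(artist.value_counts())
print(freq.isna().tolist())   # [False, False, False, True]

# Assumption: treat an unknown artist as frequency 0
freq_filled = freq.fillna(0).astype(int)
print(freq_filled.tolist())   # [2, 1, 2, 0]
```

Whether 0, 1, or a separate indicator column is the better fill depends on the model, so it is worth deciding explicitly.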

Complex JSON/Text Features

Finally, we have the more complex features that are in JSON format or text:

abilities: I will split this into three features:
- ability_count: The number of abilities the pokemon has.
- ability_text: The combined text of all abilities.
- has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.

df['ability_count'] = df['abilities'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['ability_text'] = df['abilities'].apply(lambda x: ' '.join([ability['text'] for ability in x]) if isinstance(x, list) else '')
df['has_pokemon_power'] = df['abilities'].apply(lambda x: int(any(ability['type'] in ['Poké-Body', 'Poké-Power'] for ability in x)) if isinstance(x, list) else 0)
df.drop(columns=['abilities'], inplace=True)

attacks: Similar to abilities, I will split this into three features:
- attack_count: The number of attacks the pokemon has.
- max_damage: The explicit maximum damage value among all attacks. Since some damage values may contain non-numeric characters (e.g., “30+”, “50x”), we will extract the numeric part and convert it to an integer. If no numeric value is present, we will search for a number in the attack text to use. In the future we can also consider more complex parsing methods to better estimate the maximum damage.
- attack_cost: The total converted energy cost of all attacks.

df['attack_count'] = df['attacks'].apply(lambda x: len(x) if isinstance(x, list) else 0)

def extract_max_damage(attacks):
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    
    damages = []
    
    for attack in attacks:
        if isinstance(attack.get('damage'), str):
            damage_str = attack['damage'].replace('+', '').replace('-', '').replace('×', '').strip()
            if damage_str.isdigit():
                damages.append(int(damage_str))
                continue
            
        if isinstance(attack.get('text'), str):
            numbers = re.findall(r'\b(\d+)\b', attack['text'])
            if numbers:
                damages.append(max(int(num) for num in numbers))
    
    return max(damages, default=0)

df['max_damage'] = df['attacks'].apply(extract_max_damage)

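# Prefer the convertedEnergyCost field; fall back to the length of the cost list if it is absent
df['attack_cost'] = df['attacks'].apply(
    lambda x: sum(attack.get('convertedEnergyCost', len(attack.get('cost', []))) for attack in x) if isinstance(x, list) else 0
)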
df.drop(columns=['attacks'], inplace=True)
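The damage strings carry a little more information than the number alone: a trailing “+” or “×” signals that the printed value is only a base. Here is a small sketch of how the value and the symbol could be separated with a regular expression. parse_damage is a hypothetical helper and the inputs are made up; it expects a string or None:

```python
import re

def parse_damage(damage):
    """Hypothetical helper: split a damage string into (base value, modifier symbol)."""
    match = re.fullmatch(r'\s*(\d+)\s*([+\-×x]?)\s*', damage or '')
    if not match:
        return None, ''
    return int(match.group(1)), match.group(2)

# Made-up damage strings
for raw in ['30', '30+', '50×', '']:
    print(repr(raw), '->', parse_damage(raw))
```

The symbol could feed an extra flag such as "damage is variable", which the plain maximum does not capture.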

rules: This can be converted into a binary feature indicating whether the pokemon has any special rules or not.

df['has_rules'] = df['rules'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
df.drop(columns=['rules'], inplace=True)

ancientTrait: This can also be converted into a binary feature indicating whether the pokemon has an ancient trait or not.

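# ancientTrait was not parsed with ast.literal_eval, so it is a string or NaN here;
# an isinstance(x, dict) check would always be False, so test for presence instead
df['has_ancient_trait'] = df['ancientTrait'].notna().astype(int)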
df.drop(columns=['ancientTrait'], inplace=True)

What Our New Dataset Looks Like

We have now transformed our original dataframe into a more structured format that is suitable for machine learning models. From 25 original columns, we now have 111 features that capture various aspects of the pokemon cards. We save this new dataframe to a CSV file for future use.

Creating Accurate Attack Damage and Utility Features

Up to this point we have created and extracted features using simple parsing methods without looking deeply into the dataset. One feature that I believe can be improved is the max_damage feature that we extracted from the attacks column. Currently, we mostly rely on the explicit numeric value in the damage field, which may not accurately represent the maximum potential damage of an attack. For example, an attack with a damage value of “30+” could potentially deal more than 30 damage, depending on certain conditions. To improve the accuracy of this feature, I am using a large language model (LLM) to parse the attack text and estimate the maximum potential damage based on the description provided.

We can use a similar process to extract utility features from the ability text. Abilities can provide various benefits to the pokemon, such as healing, drawing cards, or manipulating energy. By using an LLM to analyze the ability text, we can identify and extract these utility features, which may contribute to the overall effectiveness of the pokemon card.

I will be creating four new features using the LLM:

attack_damage: The estimated maximum damage of the attack based on the attack text.
attack_utility: A binary feature indicating whether the attack has any utility effects (e.g., healing, status effects).
ability_damage: A binary feature indicating whether the ability has any damage-related effects.
ability_utility: A binary feature indicating whether the ability has any utility effects.

I will be using the Gemini-2.5-flash model from Google Cloud for this task. The prompt used for extracting the maximum damage from the attack text is in the appendix.
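Whatever the exact prompt looks like, the model's reply has to be validated before it becomes a feature. Here is a minimal sketch, assuming (this is my assumption, not something stated above) that the model is asked to reply with a JSON object holding the four feature names. parse_llm_features is a hypothetical helper, and no Gemini API call is shown:

```python
import json

EXPECTED_KEYS = ('attack_damage', 'attack_utility', 'ability_damage', 'ability_utility')

def parse_llm_features(reply_text):
    """Hypothetical validator: parse a JSON reply and coerce it to the four features."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None  # caller can retry or fall back to the parsed max_damage
    if not isinstance(data, dict) or any(key not in data for key in EXPECTED_KEYS):
        return None
    try:
        return {
            'attack_damage': max(0, int(data['attack_damage'])),
            'attack_utility': int(bool(data['attack_utility'])),
            'ability_damage': int(bool(data['ability_damage'])),
            'ability_utility': int(bool(data['ability_utility'])),
        }
    except (TypeError, ValueError):
        return None

# Made-up reply in the assumed format
reply = '{"attack_damage": 60, "attack_utility": 1, "ability_damage": 0, "ability_utility": 1}'
print(parse_llm_features(reply))
print(parse_llm_features('not json'))  # None
```

Returning None for anything malformed keeps bad replies out of the dataframe and makes it easy to count how often the model failed to follow the format.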

Creating Text Complexity Features

In addition to the features we have already created, we can also analyze the complexity of the ability and attack text. Text complexity can provide insight into how difficult it is to understand the abilities and attacks of a pokemon card. This can be important, as more complex text may indicate more powerful or nuanced effects.

To quantify text complexity, we can use the spaCy NLP library to perform dependency parsing. By constructing a syntactic tree for each card’s ability text, I will calculate the maximum tree depth, which serves as a proxy for the ‘cognitive load’ required to execute the card’s effects. This syntactic depth feature will be combined with a frequency count of game-specific keywords (e.g., ‘Discard’, ‘Shuffle’) and text length to create a ‘Mechanic Complexity’ predictor using Principal Component Analysis (PCA). This composite feature aims to capture the overall complexity of a card’s mechanics, which may correlate with its effectiveness in gameplay.

I will be creating the following features:

text_depth: The maximum depth of the syntactic tree for the ability and attack text.
word_count: The total number of words in the ability and attack text.
keyword_count: The count of game-specific keywords in the ability and attack text.
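The three features can be sketched without loading a spaCy model. In the snippet below the dependency tree is given as a hand-written list of head indices (each token points at its head and the root points at itself, which is the convention spaCy uses for token.head). The keyword list beyond ‘discard’ and ‘shuffle’ is a placeholder assumption, and all function names are hypothetical:

```python
import re

# 'discard' and 'shuffle' come from the text above; the others are placeholder assumptions
KEYWORDS = ('discard', 'shuffle', 'draw', 'search')

def word_count(text):
    """Number of whitespace-separated tokens."""
    return len(text.split())

def keyword_count(text):
    """Count words that start with a keyword, so 'discards' counts as 'discard'."""
    words = re.findall(r'[a-z]+', text.lower())
    return sum(word.startswith(KEYWORDS) for word in words)

def tree_depth(heads):
    """Nodes on the longest root-to-leaf path; heads[i] is the head of token i."""
    def depth(i):
        steps = 1
        while heads[i] != i:  # walk up until we reach the root
            i = heads[i]
            steps += 1
        return steps
    return max((depth(i) for i in range(len(heads))), default=0)

text = 'Discard 2 cards, then shuffle your deck.'
# Hand-written parse for: Discard 2 cards , then shuffle your deck .
heads = [0, 2, 0, 0, 5, 0, 7, 5, 0]

print(word_count(text), keyword_count(text), tree_depth(heads))  # 7 2 4
```

With a real spaCy pipeline, heads would be built from the parsed document (for example from each token's head index) instead of being written by hand.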

df.to_csv('data/processed_pokemon_cards.csv', index=False)