2. Feature Engineering
2025-09-21
Brief Look at the Dataset
First, we need to load our dataframe from the CSV file we created in part 1. Then, let's take a look at all the columns in the dataset.
import pandas as pd
import ast
df = pd.read_csv('data/first_gen_pokemon_cards.csv')
columns_to_parse = ['weaknesses', 'resistances', 'subtypes', 'types', 'abilities', 'attacks', 'nationalPokedexNumbers', 'evolvesTo', 'rules']
for col in columns_to_parse:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x != 'nan' and pd.notna(x) else ([] if col != 'nationalPokedexNumbers' else None))
print(df.columns)
Index(['id', 'name', 'supertype', 'subtypes', 'level', 'hp', 'types',
'evolvesFrom', 'abilities', 'attacks', 'weaknesses', 'retreatCost',
'convertedRetreatCost', 'number', 'artist', 'rarity', 'flavorText',
'nationalPokedexNumbers', 'legalities', 'images', 'evolvesTo',
'resistances', 'rules', 'regulationMark', 'ancientTrait'],
dtype='object')
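List- and dict-valued columns come back from the CSV as plain strings, which is why the loop above runs them through ast.literal_eval. A minimal sketch of that per-cell logic as a standalone function (parse_cell is a hypothetical name used only for this illustration, not part of the original code):

```python
import ast


def parse_cell(value, default):
    """Turn a stringified Python literal back into an object.

    Hypothetical helper: anything that is not a usable string
    (NaN, None, or the literal text 'nan') falls back to `default`.
    """
    if isinstance(value, str) and value != 'nan':
        return ast.literal_eval(value)
    return default


print(parse_cell("['Basic', 'EX']", []))
print(parse_cell("[{'type': 'Fighting', 'value': '×2'}]", []))
print(parse_cell(float('nan'), []))
```

ast.literal_eval only evaluates literals (lists, dicts, strings, numbers), so it is a safer choice than eval for data read from a file.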
I took these columns and created a simple data dictionary for reference:
| Column Name | Data Type | Description | Allowed Values | Examples | Missing Values |
|---|---|---|---|---|---|
| id | String | Unique identifier for each card | Alphanumeric strings | “xy7-54”, “sm3-22” | No |
| name | String | Name of the Pokemon card | Alphanumeric strings | “Pikachu”, “Charizard” | No |
| supertype | String | Broad category of the card | “Pokémon” | “Pokémon” | No |
| subtypes | Array of strings | More specific categories within the supertype | Strings such as “Basic”, “Stage 1”, “Stage 2”, “EX”, “Team Plasma”… | [“Basic”], [“Basic”, “EX”] | No |
| level | String | Level of the Pokémon (if applicable) | Alphanumeric strings or X | “12”, “45”, “X” | Yes |
| hp | Integer | Hit points of the Pokémon | Positive integers | 60, 120, 200 | No |
| types | Array of strings | Types of the Pokémon | [“Fire”, “Water”, “Grass”, “Lightning”, “Psychic”, “Fighting”, “Darkness”, “Metal”, “Fairy”, “Dragon”, “Colorless”] | [“Fire”], [“Fire”, “Water”] | No |
| evolvesFrom | String | Name of the Pokémon this card evolves from (if applicable) | Alphanumeric strings | “Pikachu”, “Charmander” | Yes |
| abilities | Array of objects | Special abilities of the Pokémon | Objects with name, text, and type fields | [{name: “Static”, text: “May paralyze opponent’s Pokémon”, type: “Poké-Body”}] | Yes |
| attacks | Array of objects | Attacks that the Pokémon can perform | Objects with name, cost, convertedEnergyCost, damage, and text fields | [{name: “Thunder Shock”, cost: [“Lightning”, “Colorless”], convertedEnergyCost: 2, damage: “30”, text: “May paralyze opponent’s Pokémon”}] | Yes |
| weaknesses | Array of objects | Weaknesses of the Pokémon | Objects with type and value fields | [{type: “Fighting”, value: “×2”}] | Yes |
| retreatCost | Array of strings | Energy types required to retreat the Pokémon | [“Colorless”] | [“Colorless”, “Colorless”] | Yes |
| convertedRetreatCost | Integer | Total number of energy required to retreat the Pokémon | Non-negative integers | 1, 2, 3 | Yes |
| number | String | Card number within its set | Alphanumeric strings | “54”, “22” | No |
| artist | String | Name of the card’s illustrator | Alphanumeric strings | “Mitsuhiro Arita”, “5ban Graphics” | Yes |
| rarity | String | Rarity level of the card | “Common”, “Uncommon”, “Rare”, “Holo Rare”, “Ultra Rare”, “Secret Rare”, etc. | “Common”, “Holo Rare” | Yes |
| flavorText | String | Flavor text providing background or lore about the Pokémon | Alphanumeric strings | “When several of these Pokémon gather, their electricity could build and cause lightning storms.” | Yes |
| nationalPokedexNumbers | Array of integers | National Pokédex numbers associated with the Pokémon | Positive integers | [25], [6] | No |
| legalities | Object | Legality of the card in various formats | Fields for “expanded”, “standard”, “unlimited” with values “Legal” or “Not Legal” | {expanded: “Legal”, standard: “Not Legal”, unlimited: “Legal”} | No |
| images | Object | URLs for the card’s images | Fields for “small” and “large” with URL strings | {small: “http://…”, large: “http://…”} | No |
| evolvesTo | Array of strings | Names of Pokémon this card can evolve into (if applicable) | Alphanumeric strings | [“Raichu”, “Pikachu Libre”] | Yes |
| resistances | Array of objects | Resistances of the Pokémon | Objects with type and value fields | [{type: “Metal”, value: “-20”}] | Yes |
| rules | Array of strings | Special rules that apply to the card | Alphanumeric strings | [“If this Pokémon is your Active Pokémon, your opponent’s attacks do 20 less damage (before applying Weakness and Resistance).”] | Yes |
| regulationMark | String | Regulation mark for tournament legality | Single uppercase letters | “D”, “E” | Yes |
| ancientTrait | Object | Ancient Trait of the Pokémon (if applicable) | Object with name and text fields | {name: “Delta Evolution”, text: “This Pokémon can evolve from any type of basic Pokémon.”} | Yes |
We can see that quite a few features are not necessary. The obvious
ones are id and images, since these are unique identifiers and URLs, so
we can drop them from the dataframe. Now we can focus on the features
that would help a model learn the game mechanics that determine the hit
points of a pokemon card. Given that goal, we can also drop the
legalities and regulationMark columns, since they pertain to the card
game's tournament rules and not to the pokemon card itself. Finally, we
can drop the supertype column, since every card in our dataset has the
same supertype, Pokémon.
Some of the remaining columns are probably not useful for predicting the hit points of a pokemon card either, but that is hard to tell without running some analysis.
Feature Engineering
I looked at all the columns in the dataset and decided on the following feature engineering steps:
| Column Name | Feature Engineering Steps |
|---|---|
| id | We will drop this column since it is a unique identifier and does not provide any useful information for predicting hit points. |
| images | We will drop this column since it contains URLs to images and does not provide any useful information for predicting hit points. |
| legalities | We will drop this column since it pertains to the card game rules and not the pokemon card itself. |
| regulationMark | We will drop this column since it pertains to the card game rules and not the pokemon card itself. |
| supertype | We will drop this column since all of the cards in our dataset are of the same supertype, Pokémon. |
| hp | This is our target variable that we are trying to predict. We don’t need to do any feature engineering on this column. |
| level | Most of the values in this column are missing, but we can fill in the missing values with the median level and create a new feature indicating whether the level was missing or not. |
| nationalPokedexNumbers | We will convert this to a numerical value by taking the first number in the array. |
| convertedRetreatCost | This is already a numerical value and can be used as is. We will fill in any missing values with 0. |
| rarity | We can use one-hot encoding to convert this categorical feature into multiple binary features. |
| evolvesFrom | We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes. |
| evolvesTo | Handled the same way as evolvesFrom: we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes. |
| subtypes | We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. |
| types | Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. |
| weaknesses | We can extract three features from this column: - weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - total_weakness_multiplier: We can extract the multiplier values from the weaknesses (e.g. “×2” -> 2) and sum them up to create a new numerical feature. - total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature. |
| resistances | Similar to weaknesses, we can extract three features from this column: - resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature. - total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature. |
| retreatCost | Since this column carries the same information as convertedRetreatCost, we can drop this column. |
| name | This feature has a very high cardinality. Originally my idea was to count the number of times each name appears in the dataset and use that as a feature. However, we can already count this using the nationalPokedexNumbers feature since each pokemon name corresponds to a unique pokedex number. Therefore, we can drop this feature. |
| artist | Similar to name, we can count the number of times each artist appears in the dataset and use that as a feature. |
| abilities | I will split this into three features: - ability_count: The number of abilities the pokemon has. - ability_text: The combined text of all abilities. - has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability. |
| attacks | Similar to abilities, I will split this into three features: - attack_count: The number of attacks the pokemon has. - max_damage: The explicit maximum damage value among all attacks. Since some damage values may contain non-numeric characters (e.g., “30+”, “50x”), we will extract the numeric part and convert it to an integer. If no numeric value is present, we will search for a number in the attack text to use. In the future we can also consider more complex parsing methods to better estimate the maximum damage. - attack_cost: The total converted energy cost of all attacks. |
| rules | I will create a binary feature indicating whether the pokemon has any special rules or not. |
| ancientTrait | I will create a binary feature indicating whether the pokemon has an ancient trait or not. |
| flavorText | I believe the flavor text does not provide any information that could help us predict the HP of a pokemon card, but let's use TfidfVectorizer and run analysis on it to see. |
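The level row above combines two steps: recording which values were missing, then filling the gaps with the median. A minimal sketch of that idea in plain Python follows. The function name impute_with_flag and the fallback of 0 when nothing is observed are assumptions made for this illustration:

```python
from statistics import median


def impute_with_flag(values):
    """Return (filled_values, missing_flags) for a list that may contain None.

    The flags are computed from the raw values, before any filling,
    so the model can still see which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed) if observed else 0  # assumed fallback
    flags = [int(v is None) for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags


levels = [12, None, 45, None, 100]
filled, flags = impute_with_flag(levels)
print(filled)
print(flags)
```

The order matters: if the indicator were computed after filling, it would be 0 everywhere.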
Cleaning the Data
In this section we will focus on dropping columns and extracting features from our initial list of features. We will then transform and scale them in the next section. Let's first drop the columns that we decided aren’t useful: id, images, legalities, regulationMark, supertype, retreatCost, and name.
df.drop(columns=['id', 'images', 'legalities', 'regulationMark', 'supertype', 'retreatCost', 'name'], inplace=True)
Direct Numerical Features
We can start with the columns that are already numerical values. These columns are:
level: I am replacing the X found in levels with 100, which is the highest level you can train a pokemon to in the games. I will fill in missing data later.
df['level'] = df['level'].apply(
    lambda x: int(x.replace('X', '100')) if isinstance(x, str) and x != 'nan' and pd.notna(x) else None
)
level_was_missing: A binary feature indicating whether the level was missing or not.
df['level_was_missing'] = df['level'].isnull().astype(int)
nationalPokedexNumbers: We will convert this to a numerical value by taking the first number in the array.
df['primary_pokedex_number'] = df['nationalPokedexNumbers'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
)
pokemon_count: Counts how many Pokemon are in the nationalPokedexNumbers array.
df['pokemon_count'] = df['nationalPokedexNumbers'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)
convertedRetreatCost: This is already a numerical value and can be used as is. We just need to fill in any missing values with 0.
df['convertedRetreatCost'] = df['convertedRetreatCost'].fillna(0)
df['convertedRetreatCost'] = df['convertedRetreatCost'].replace('.', 0).astype(int)
number: We will convert this to a numerical value by taking the first run of digits in the card number and ignoring any non-numeric characters around it. For example, “54a” would be converted to 54.
import re
df['number'] = df['number'].apply(
    lambda x: int(re.search(r'\d+', str(x)).group()) if pd.notna(x) and re.search(r'\d+', str(x)) else None
)
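To make the card number rule concrete, here is a small standalone sketch of the same regular expression applied to a few inputs. The function name first_number and the sample strings other than “54a” are made up for illustration:

```python
import re


def first_number(card_number):
    """Return the first run of digits in a card number, or None if there is none."""
    match = re.search(r'\d+', str(card_number))
    return int(match.group()) if match else None


for raw in ['54', '54a', 'SM22', 'promo']:
    print(raw, '->', first_number(raw))
```

One consequence of this choice is that a prefixed or suffixed number and its plain counterpart (for example “54a” and “54”) end up with the same value.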
Simple Categorical Features
Next we can look at the simple categorical features that have a limited number of unique values:
rarity: We can use one-hot encoding to convert this categorical feature into multiple binary features.
from sklearn.preprocessing import OneHotEncoder
hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df['rarity'] = df['rarity'].fillna('Unknown')
rarity_encoded = hot_encoder.fit_transform(df[['rarity']])
rarity_encoded_df = pd.DataFrame(rarity_encoded, columns=hot_encoder.get_feature_names_out(['rarity']))
df = pd.concat([df, rarity_encoded_df], axis=1)
df.drop(columns=['rarity'], inplace=True)
evolvesFrom: We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
df['evolvesFrom'] = df['evolvesFrom'].notnull().astype(int)
evolvesTo: This is handled the same way as evolvesFrom. We can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
df['evolvesTo'] = df['evolvesTo'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
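For readers who have not used one-hot encoding before, the rarity step above can be sketched without any libraries. This is only an illustration of the idea, not how OneHotEncoder is implemented. The function name one_hot is hypothetical, and None stands in for a missing rarity:

```python
def one_hot(values, missing_label='Unknown'):
    """One-hot encode a list of category labels.

    Missing entries (None) become their own category, mirroring the
    fillna('Unknown') step used for rarity.
    """
    cleaned = [missing_label if v is None else v for v in values]
    categories = sorted(set(cleaned))
    rows = [[int(v == c) for c in categories] for v in cleaned]
    return categories, rows


categories, rows = one_hot(['Common', 'Rare', None, 'Common'])
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column of the card's rarity.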
List-Based Categorical Features
Next we can look at the list-based categorical features. For weaknesses
and resistances we also need to extract the numeric modifiers, so we
first define a helper function to do that. Then we can proceed with the
feature extraction.
def extract_modifiers(modifier_list):
    if not isinstance(modifier_list, list):
        return (0, 0)
    total_multiplier = 0
    total_modifier = 0
    for item in modifier_list:
        value_str = item['value'].strip()
        if '×' in value_str:
            numeric_part = value_str.replace('×', '')
            total_multiplier += int(numeric_part)
        elif '+' in value_str or '-' in value_str:
            total_modifier += int(value_str)
    return (total_multiplier, total_modifier)
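The value strings come in two shapes: multipliers such as “×2” and signed modifiers such as “+20” or “-20”. The sketch below is a standalone variant of the helper above that classifies a single value and then totals a list. The names split_modifier and total_modifiers and the ('unknown', 0) fallback are assumptions of this example:

```python
def split_modifier(value):
    """Classify one weakness/resistance value string.

    Returns ('multiplier', n) for strings like '×2', ('modifier', n) for
    signed strings like '+20' or '-20', and ('unknown', 0) otherwise.
    """
    text = value.strip()
    try:
        if text.startswith('×'):
            return ('multiplier', int(text[1:]))
        if text.startswith(('+', '-')):
            return ('modifier', int(text))  # int() accepts a leading sign
    except ValueError:
        pass
    return ('unknown', 0)


def total_modifiers(entries):
    """Sum multipliers and modifiers separately over a list of entries."""
    totals = {'multiplier': 0, 'modifier': 0, 'unknown': 0}
    for entry in entries:
        kind, amount = split_modifier(entry['value'])
        totals[kind] += amount
    return totals['multiplier'], totals['modifier']


weaknesses = [{'type': 'Fighting', 'value': '×2'}, {'type': 'Fire', 'value': '+20'}]
print(total_modifiers(weaknesses))
print(total_modifiers([{'type': 'Metal', 'value': '-20'}]))
```

Keeping the two kinds in separate totals avoids mixing a “×2” with a “+20”, which are on different scales.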
subtypes: We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
subtypes_encoded = mlb.fit_transform(df['subtypes'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
subtypes_encoded_df = pd.DataFrame(subtypes_encoded, columns=[f'subtype_{cls}' for cls in mlb.classes_])
df = pd.concat([df, subtypes_encoded_df], axis=1)
df.drop(columns=['subtypes'], inplace=True)
types: Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
types_encoded = mlb.fit_transform(df['types'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
types_encoded_df = pd.DataFrame(types_encoded, columns=[f'type_{cls}' for cls in mlb.classes_])
df = pd.concat([df, types_encoded_df], axis=1)
df.drop(columns=['types'], inplace=True)
weaknesses: We can extract three features from this column:
- weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_weakness_multiplier: We can extract the multiplier values from the weaknesses (e.g. “×2” -> 2) and sum them up to create a new numerical feature.
- total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
mlb_weakness = MultiLabelBinarizer()
weakness_encoded = mlb_weakness.fit_transform(
    df['weaknesses'].apply(
        lambda x: [w['type'] for w in x] if isinstance(x, list) else []
    )
)
weakness_encoded_df = pd.DataFrame(weakness_encoded, columns=[f'weakness_{cls}' for cls in mlb_weakness.classes_])
df = pd.concat([df, weakness_encoded_df], axis=1)
total_weakness_values = df['weaknesses'].apply(extract_modifiers)
df[['total_weakness_multiplier', 'total_weakness_modifier']] = pd.DataFrame(
    total_weakness_values.tolist(),
    index=df.index
)
df.drop(columns=['weaknesses'], inplace=True)
resistances: Similar to weaknesses, we can extract three features from this column:
- resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
- total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
mlb_resistance = MultiLabelBinarizer()
resistance_encoded = mlb_resistance.fit_transform(
    df['resistances'].apply(
        lambda x: [r['type'] for r in x] if isinstance(x, list) else []
    )
)
resistance_encoded_df = pd.DataFrame(resistance_encoded, columns=[f'resistance_{cls}' for cls in mlb_resistance.classes_])
df = pd.concat([df, resistance_encoded_df], axis=1)
total_resistance_values = df['resistances'].apply(extract_modifiers)
df[['total_resistance_multiplier', 'total_resistance_modifier']] = pd.DataFrame(
    total_resistance_values.tolist(),
    index=df.index
)
df.drop(columns=['resistances'], inplace=True)
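All four encodings in this section rely on the same multi-label idea: each distinct label becomes a column, and a row gets a 1 in every column whose label it carries. A library-free sketch of that idea follows. The function name multi_hot is hypothetical, and this is an illustration rather than scikit-learn's implementation:

```python
def multi_hot(label_lists):
    """Encode a list of label lists as 0/1 rows, one column per distinct label."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


classes, rows = multi_hot([['Basic'], ['Basic', 'EX'], []])
print(classes)
for row in rows:
    print(row)
```

Unlike one-hot encoding, a row may contain several 1s or none at all, which is what makes this suitable for cards with zero or multiple weaknesses.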
High-Cardinality Categorical Features
Let's take a look at the categorical features that have a high number of unique values:
pokedex_frequency: We can count the number of times each pokedex number appears in the dataset and use that as a feature.
# Convert lists to tuples (hashable) for frequency counting
pokedex_keys = df['nationalPokedexNumbers'].apply(
    lambda x: tuple(x) if isinstance(x, list) else None
)
# Map each key to the number of cards that share it
df['pokedex_frequency'] = pokedex_keys.map(pokedex_keys.value_counts())
df.drop(columns=['nationalPokedexNumbers'], inplace=True)
artist: We can count the number of times each artist appears in the dataset and use that as a feature.
df['artist_frequency'] = df['artist'].map(df['artist'].value_counts())
df.drop(columns=['artist'], inplace=True)
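Both features above are frequency encodings: a high-cardinality value is replaced by how often it occurs. A minimal sketch with collections.Counter, where the function name frequency_encode and the use of None for a missing value are assumptions of this example:

```python
from collections import Counter


def frequency_encode(values):
    """Replace each value with its number of occurrences; None stays None."""
    counts = Counter(v for v in values if v is not None)
    return [counts[v] if v is not None else None for v in values]


artists = ['Mitsuhiro Arita', '5ban Graphics', 'Mitsuhiro Arita', None]
print(frequency_encode(artists))

# Lists are not hashable, so pokedex numbers are counted as tuples
pokedex = [(25,), (25,), (25, 26)]
print(frequency_encode(pokedex))
```

Note that the counts here are taken over the whole dataset. If the data is later split into training and test sets, it may be worth computing them on the training portion only.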
Complex JSON/Text Features
Finally, we have the more complex features that are in JSON format or text:
abilities: I will split this into three features:
- ability_count: The number of abilities the pokemon has.
- ability_text: The combined text of all abilities.
- has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
df['ability_count'] = df['abilities'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['ability_text'] = df['abilities'].apply(lambda x: ' '.join([ability['text'] for ability in x]) if isinstance(x, list) else '')
df['has_pokemon_power'] = df['abilities'].apply(lambda x: int(any(ability['type'] in ['Poké-Body', 'Poké-Power'] for ability in x)) if isinstance(x, list) else 0)
df.drop(columns=['abilities'], inplace=True)
attacks: Similar to abilities, I will split this into three features:
- attack_count: The number of attacks the pokemon has.
- max_damage: The explicit maximum damage value among all attacks. Since some damage values may contain non-numeric characters (e.g., “30+”, “50x”), we will extract the numeric part and convert it to an integer. If no numeric value is present, we will search for a number in the attack text to use. In the future we can also consider more complex parsing methods to better estimate the maximum damage.
- attack_cost: The total converted energy cost of all attacks.
df['attack_count'] = df['attacks'].apply(lambda x: len(x) if isinstance(x, list) else 0)
def extract_max_damage(attacks):
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    damages = []
    for attack in attacks:
        if isinstance(attack.get('damage'), str):
            damage_str = attack['damage'].replace('+', '').replace('-', '').replace('×', '').strip()
            if damage_str.isdigit():
                damages.append(int(damage_str))
                continue
        if isinstance(attack.get('text'), str):
            numbers = re.findall(r'\b(\d+)\b', attack['text'])
            if numbers:
                damages.append(max(int(num) for num in numbers))
    return max(damages, default=0)
df['max_damage'] = df['attacks'].apply(extract_max_damage)
df['attack_cost'] = df['attacks'].apply(lambda x: sum([len(attack['cost']) for attack in x]) if isinstance(x, list) else 0)
df.drop(columns=['attacks'], inplace=True)
rules: This can be converted into a binary feature indicating whether the pokemon has any special rules or not.
df['has_rules'] = df['rules'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
df.drop(columns=['rules'], inplace=True)
ancientTrait: This can also be converted into a binary feature indicating whether the pokemon has an ancient trait or not.
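# ancientTrait was not parsed with ast.literal_eval above, so present values are still strings, not dicts; check for non-missing values instead
df['has_ancient_trait'] = df['ancientTrait'].notna().astype(int)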
df.drop(columns=['ancientTrait'], inplace=True)
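The damage parsing for attacks is the most heuristic step in this section, so it helps to see how it behaves on individual strings. The sketch below is a simplified standalone variant, not the exact function above: it takes the first run of digits in the damage field and falls back to the largest number in the attack text. The name parse_damage and the sample attack texts are made up for illustration:

```python
import re


def parse_damage(damage, text=''):
    """Return the printed base damage, else the largest number in the text, else 0."""
    match = re.search(r'\d+', damage or '')
    if match:
        return int(match.group())
    numbers = re.findall(r'\b\d+\b', text or '')
    return max((int(n) for n in numbers), default=0)


print(parse_damage('30+'))
print(parse_damage('50×'))
print(parse_damage('', 'Flip 3 coins. This attack does 20 damage times the number of heads.'))
print(parse_damage('', 'Draw 3 cards.'))
```

The last call shows the weakness of the text fallback: any number in the text is treated as damage, even when the attack deals none.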
What Our New Dataset Looks Like
We have now transformed our original dataframe into a more structured format that is suitable for machine learning models. From 25 original columns, we now have 111 features that capture various aspects of the pokemon cards. We save this new dataframe to a CSV file for future use.
Creating Accurate Attack Damage and Utility Features
Up to this point we have created and extracted features using simple
parsing methods without looking deeply into the dataset. One feature
that I believe can be improved is the max_damage feature that we
extracted from the attacks column. Currently, we are only extracting
explicit numeric values from the damage field, which may not accurately
represent the maximum potential damage of an attack. For example, an
attack with a damage value of “30+” could potentially deal more than 30
damage, depending on certain conditions. To improve the accuracy of this
feature, I am using a large language model (LLM) to parse the attack
text and estimate the maximum potential damage based on the description
provided.
We can also apply a similar process to extract utility features from the ability text. Abilities can provide various benefits to the pokemon, such as healing, drawing cards, or manipulating energy. By using an LLM to analyze the ability text, we can identify and extract these utility features, which may contribute to the overall effectiveness of the pokemon card.
I will be creating four new features using the LLM:
- attack_damage: The estimated maximum damage of the attack based on the attack text.
- attack_utility: A binary feature indicating whether the attack has any utility effects (e.g., healing, status effects).
- ability_damage: A binary feature indicating whether the ability has any damage-related effects.
- ability_utility: A binary feature indicating whether the ability has any utility effects.
I will be using the Gemini-2.5-flash model from Google Cloud for this task. The prompt used for extracting the maximum damage from the attack text is in the appendix.
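Whatever prompt is used, the model's reply has to be checked before it becomes a feature row. The sketch below assumes the model is asked to answer with a JSON object containing the four feature names. That reply format is an assumption made for this example and may differ from the actual prompt in the appendix. The function name parse_llm_features is hypothetical, and no API call is shown:

```python
import json

EXPECTED_KEYS = ('attack_damage', 'attack_utility', 'ability_damage', 'ability_utility')


def parse_llm_features(raw_reply):
    """Validate a JSON reply and coerce it into the four features.

    Returns None when the reply is not valid JSON, is missing a key,
    or holds a value that cannot be converted.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in EXPECTED_KEYS):
        return None
    try:
        return {
            'attack_damage': max(0, int(data['attack_damage'])),  # non-negative integer
            'attack_utility': int(bool(data['attack_utility'])),  # binary flags
            'ability_damage': int(bool(data['ability_damage'])),
            'ability_utility': int(bool(data['ability_utility'])),
        }
    except (TypeError, ValueError):
        return None


reply = '{"attack_damage": 120, "attack_utility": true, "ability_damage": false, "ability_utility": 1}'
print(parse_llm_features(reply))
print(parse_llm_features('Sorry, I cannot parse this card.'))
```

Returning None for unusable replies makes it easy to retry those cards or fall back to the rule-based max_damage value.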
Creating Text Complexity Features
In addition to the features we have already created, we can also analyze
the text complexity of the ability_text and attack_text features.
Text complexity can provide insights into how difficult it is to
understand the abilities and attacks of a pokemon card. This can be
important as more complex text may indicate more powerful or nuanced
effects.
To quantify text complexity, we can use the spaCy NLP library to perform dependency parsing. By constructing a syntactic tree for each card’s ability text, I will calculate the maximum tree depth, which serves as a proxy for the ‘cognitive load’ required to execute the card’s effects. This syntactic_depth feature will be combined with a frequency count of game-specific keywords (e.g., ‘Discard’, ‘Shuffle’) and text length to create a ‘Mechanic Complexity’ predictor using Principal Component Analysis (PCA). This composite feature aims to capture the overall complexity of a card’s mechanics, which may correlate with its effectiveness in gameplay.
I will be creating the following features:
- text_depth: The maximum depth of the syntactic tree for the ability and attack text.
- word_count: The total number of words in the ability and attack text.
- keyword_count: The count of game-specific keywords in the ability and attack text.
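The three measures can be sketched without running a parser. In the example below, the dependency tree is given as a list of head indices in which the root points at itself, which is the convention spaCy uses for token.head (with spaCy, such a list could be built as [token.head.i for token in doc]). The function names, the two-word keyword list, and the hand-written parse of “Discard 2 cards” are assumptions made for illustration:

```python
import re

# Illustrative subset only; the real keyword list would be longer
KEYWORDS = {'discard', 'shuffle'}


def tree_depth(heads):
    """Maximum depth of a dependency tree; the root counts as depth 1."""
    best = 0
    for start in range(len(heads)):
        depth, node = 1, start
        while heads[node] != node:  # walk up until we reach the root
            node = heads[node]
            depth += 1
            if depth > len(heads):
                raise ValueError('heads contain a cycle')
        best = max(best, depth)
    return best


def word_count(text):
    """Number of whitespace-separated words."""
    return len(text.split())


def keyword_count(text):
    """Number of words that match a game-specific keyword, ignoring case."""
    return sum(word in KEYWORDS for word in re.findall(r'[a-z]+', text.lower()))


# Hand-written parse of "Discard 2 cards": "2" -> "cards" -> "Discard" (root)
print(tree_depth([0, 2, 0]))
text = 'Discard 2 cards, then shuffle your deck.'
print(word_count(text), keyword_count(text))
```

Longer chains of modifiers and nested clauses produce deeper trees, which is what makes depth a plausible proxy for how involved an effect is to resolve.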
df.to_csv('data/processed_pokemon_cards.csv', index=False)