2.7. Classes#
2.7.1. Inheritance in Python#
Have you ever had multiple classes that have similar attributes and methods? In the code below, the class Dachshund
and Poodle
have similar attributes (color
) and methods (show_info
).
class Dachshund:
def __init__(self, color: str):
self.color = color
def show_info(self):
print(f"This is a Dachshund with {self.color} color.")
class Poodle:
def __init__(self, color: str):
self.color = color
def show_info(self):
print(f"This is a Poodle with {self.color} color.")
bim = Dachshund("black")
bim.show_info()
This is a Dachshund with black color.
If so, use inheritance to organize your classes. Inheritance allows us to define a parent class and child classes. A child class inherits all the methods and attributes of the parent class.
super().__init__
makes the child class inherit all the methods and properties from its parent.
In the code below, we define the parent class to be Dog
and the child classes to be Dachshund
and Poodle
. With class inheritance, we avoid repeating the same piece of code multiple times.
class Dog:
def __init__(self, type_: str, color: str):
self.type = type_
self.color = color
def show_info(self):
print(f"This is a {self.type} with {self.color} color.")
class Dachshund(Dog):
def __init__(self, color: str):
super().__init__(type_="Dachshund", color=color)
class Poodle(Dog):
def __init__(self, color: str):
super().__init__(type_="Poodle", color=color)
bim = Dachshund("black")
bim.show_info()
This is a Dachshund with black color.
coco = Poodle("brown")
coco.show_info()
This is a Poodle with brown color.
Learn more about inheritance in Python here.
2.7.2. Abstract Classes: Declare Methods without Implementation#
To ensure that all subclasses implement a set of methods and properties, use abstract methods within an abstract class. This promotes code reusability and a consistent interface across different implementations.
In the following code, Drink serves as an abstract class with an abstract method consume.
from abc import ABC, abstractmethod
class Drink(ABC):
def __init__(self, name, volume):
self.name = name
self.volume = volume
@abstractmethod
def consume(self):
pass
class Tea(Drink):
def consume(self):
print(f"Drinking {self.name} tea...")
class Smoothie(Drink):
def consume(self):
print(f"Drinking {self.name} smoothie...")
Tea("English Breakfast", 250).consume()
Smoothie("Tropical Blast", 500).consume()
Drinking English Breakfast tea...
Drinking Tropical Blast smoothie...
2.7.3. Distinguishing Instance-Level and Class Methods#
An instance-level method requires instantiating a class object to operate, while a class method doesn’t.
Class methods can provide alternate ways to construct objects. In the code below, the from_csv
class method instantiates the class by reading data from a CSV file.
import pandas as pd
class DataAnalyzer:
def __init__(self, data):
self.data = data
def analyze(self): # instance-level method
print(f"Shape of data: {self.data.shape}")
@classmethod
def from_csv(cls, csv_path): # class method
data = pd.read_csv(csv_path)
return cls(data)
data = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
analyzer = DataAnalyzer(data)
analyzer.analyze()
Shape of data: (3, 2)
# Using the class method to create an instance from a CSV file
csv_file_path = "data.csv"
analyzer = DataAnalyzer.from_csv(csv_file_path)
analyzer.analyze()
Shape of data: (2, 3)
2.7.4. getattr: a Better Way to Get the Attribute of a Class#
If you want to get a default value when calling an attribute that is not in a class, use getattr()
method.
The getattr(class, attribute_name)
method simply gets the value of an attribute of a class. However, if the attribute is not found in a class, it returns the default value provided to the function.
class Food:
def __init__(self, name: str, color: str):
self.name = name
self.color = color
apple = Food("apple", "red")
print("The color of apple is", getattr(apple, "color", "yellow"))
The color of apple is red
print("The flavor of apple is", getattr(apple, "flavor", "sweet"))
The flavor of apple is sweet
print("The flavor of apple is", apple.sweet)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_337430/3178150741.py in <module>
----> 1 print("The flavor of apple is", apple.sweet)
AttributeError: 'Food' object has no attribute 'sweet'
2.7.5. __call__
: Call your Class Instance like a Function#
If you want to call your class instance like a function, add __call__
method to your class.
class DataLoader:
def __init__(self, data_dir: str):
self.data_dir = data_dir
print("Instance is created")
def __call__(self):
print("Instance is called")
data_loader = DataLoader("my_data_dir")
# Instance is created
data_loader()
# Instance is called
Instance is created
Instance is called
2.7.6. Instance Comparison in Python Classes#
Even if two class instances have the same attributes, they are not equal because they are stored in separate memory locations.
To define how class instances should be compared, add the __eq__
method.
class Dog:
def __init__(self, name: str):
self.name = name
dog1 = Dog("Bim")
dog2 = Dog("Bim")
dog1 == dog2
False
class Dog:
def __init__(self, name: str):
self.name = name
def __eq__(self, other):
return self.name == other.name
dog1 = Dog("Bim")
dog2 = Dog("Bim")
dog1 == dog2
True
2.7.7. Static method: use the function without adding the attributes required for a new instance#
Have you ever had a function in your class that doesn’t access any properties of a class but fits well in a class? You might find it redundant to instantiate the class to use that function. That is when you can turn your function into a static method.
All you need to turn your function into a static method is the decorator @staticmethod
. Now you can use the function without adding the attributes required for a new instance.
import re
class ProcessText:
def __init__(self, text_column: str):
self.text_column = text_column
@staticmethod
def remove_URL(sample: str) -> str:
"""Replace url with empty space"""
return re.sub(r"http\S+", "", sample)
text = ProcessText.remove_URL("My favorite page is https://www.google.com")
print(text)
My favorite page is
2.7.8. Minimize Data Risks with Python Private Variables#
To restrict external access and modification of a variable outside of a class, make it a private variable by using double underscores. This helps minimize the chances of unintended alterations.
class Grocery:
def __init__(self, item, price):
# Making 'price' a private variable
self.__price = price
self.item = item
# Getter method to access the private variable 'price'
def get_price(self):
print(f"The price of {self.item} is ${self.__price}")
# Create an instance of the Grocery class
grocery_item = Grocery("Apples", 2.99)
# Access the private variable 'price' using the getter method
grocery_item.get_price()
# Access the private variable directly
grocery_item.__price
The price of Apples is $2.99
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[4], line 18
15 grocery_item.get_price()
17 # Access the private variable directly
---> 18 grocery_item.__price
AttributeError: 'Grocery' object has no attribute '__price'
2.7.9. Ensure Data Integrity with Getters and Setters in Python#
In Python, the property decorator controls property access and modification through getters and setters.
For example, in a BankAccount
class, without getters and setters, the balance can be directly modified, potentially leading to an invalid state.
class BankAccount:
def __init__(self, initial_balance: float):
self.balance = initial_balance
account = BankAccount(100)
print(account.balance)
# Directly modifying the balance (no validation)
account.balance = -50
print(account.balance)
100
-50
Using getters and setters ensures the balance cannot be set to an invalid value.
class BankAccount:
def __init__(self, initial_balance: float):
self._balance = initial_balance
@property
def balance(self):
return self._balance
@balance.setter
def balance(self, value):
if value < 0:
raise ValueError("Balance cannot be negative")
self._balance = value
# Example usage
account = BankAccount(100)
print(account.balance)
# Attempting to set an invalid balance
account.balance = -50
100
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 21
18 print(account.balance)
20 # Attempting to set an invalid balance
---> 21 account.balance = -50
Cell In[6], line 12, in BankAccount.balance(self, value)
9 @balance.setter
10 def balance(self, value):
11 if value < 0:
---> 12 raise ValueError("Balance cannot be negative")
13 self._balance = value
ValueError: Balance cannot be negative
2.7.10. __str__
and __repr__
: Create a String Representation of a Python Object#
If you want to create a string representation of an object, add __str__
and __repr__
.
__str__
shows readable outputs when printing the object. __repr__
shows outputs that are useful for displaying and debugging the object.
class Food:
def __init__(self, name: str, color: str):
self.name = name
self.color = color
def __str__(self):
return f"{self.color} {self.name}"
def __repr__(self):
return f"Food({self.color}, {self.name})"
food = Food("apple", "red")
print(food) # str__
red apple
food # __repr__
Food(red, apple)
2.7.11. __add__
: Add the Attributes of Two Class Instances#
If you want to add the attributes of class instances, use __add__
. In the code below, I use __add__
to add the ages of two class instances bim
and coco
when calling bim + coco
.
class Dog:
def __init__(self, age: int):
self.age = age
def __add__(self, other):
return self.age + other.age
class Cat:
def __init__(self, age: int):
self.age = age
def __add__(self, other):
return self.age + other.age
bim = Dog(age=5)
coco = Cat(age=2)
bim + coco
7
2.7.12. Optimizing Memory Usage in Python with Slots#
Show code cell content
!pip install memory_profiler
In Python, objects can store their attributes in a flexible dictionary-like structure that can use a lot of memory. Slots make your objects more memory-efficient by reserving space for their attributes ahead of time.
The code below shows that using slots significant reduces the memory usage.
%%writefile without_slot.py
from random import randint
from memory_profiler import profile
class Dog:
def __init__(self, age):
self.age = age
@profile
def main():
return [Dog(age=randint(0, 30)) for _ in range(100000)]
if __name__ == "__main__":
main()
$ python -m memory_profiler without_slot.py
Filename: without_slot.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 41.6 MiB 41.6 MiB 1 @profile
11 def main():
12 57.8 MiB 16.2 MiB 100003 return [Dog(randint(0, 30)) for _ in range(100000)]
%%writefile with_slot.py
from random import randint
from memory_profiler import profile
class Dog:
# defining slots
__slots__ = ["age"]
def __init__(self, age):
self.age = age
@profile
def main():
return [Dog(age=randint(0, 30)) for _ in range(100000)]
if __name__ == "__main__":
main()
$ python -m memory_profiler with_slot.py
Filename: with_slot.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
11 41.3 MiB 41.3 MiB 1 @profile
12 def main():
13 46.7 MiB 5.4 MiB 100003 return [Dog(randint(0, 30)) for _ in range(100000)]
2.7.13. Improve Code Readability with Enums#
Hard-coded values without proper context can decrease code readability.
By using an Enum, meaningful names can be assigned to these values, improving code readability.
current_status = 200
if current_status == 200:
print("You can go through the gate.")
elif current_status == 500:
print("You can't go through the gate.")
else:
print("Invalid status code.")
You can go through the gate.
from enum import Enum
class StatusCode(Enum):
OK = 200
ERROR = 500
current_status = 200
if current_status == StatusCode.OK.value:
print("You can go through the gate.")
elif current_status == StatusCode.ERROR.value:
print("You can't go through the gate.")
else:
print("Invalid status code.")
You can go through the gate.
2.7.14. Embrace the Open-Closed Principle to Design Extensible Classes#
You should design classes that are open for extension but closed for modification (Open-Closed Principle).
In the current implementation of the DataPipeline
class, the data processing methods are directly implemented within the class.
import pandas as pd
class DataPipeline:
def drop_missing_data(self, data: pd.DataFrame) -> pd.DataFrame:
return data.dropna()
def standardize_data(self, data: pd.DataFrame) -> pd.DataFrame:
return (data - data.mean()) / data.std()
def process(self, data: pd.DataFrame) -> pd.DataFrame:
return data.pipe(self.drop_missing_data).pipe(self.standardize_data)
pipeline = DataPipeline()
data = pd.DataFrame({"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3]})
processed_data = pipeline.process(data)
Adding another processing step to the pipeline requires modifying the process
method.
import pandas as pd
class DataPipeline:
def drop_missing_data(self, data: pd.DataFrame) -> pd.DataFrame:
return data.dropna()
def standardize_data(self, data: pd.DataFrame) -> pd.DataFrame:
return (data - data.mean()) / data.std()
# Adding this
def encode_categorical_data(self, data: pd.DataFrame) -> pd.DataFrame:
return pd.get_dummies(data)
# Requires modifying the code
def process(self, data: pd.DataFrame) -> pd.DataFrame:
return (
data.pipe(self.drop_missing_data)
.pipe(self.encode_categorical_data)
.pipe(self.standardize_data)
)
pipeline = DataPipeline()
data = pd.DataFrame(
{"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3], "C": ["a", "a", "b", "b", "a"]}
)
pipeline.process(data)
A | B | C_a | C_b | |
---|---|---|---|---|
0 | -1.024695 | 1.161895 | 0.5 | -0.5 |
1 | -0.439155 | 0.387298 | 0.5 | -0.5 |
2 | 0.146385 | -1.161895 | -1.5 | 1.5 |
4 | 1.317465 | -0.387298 | 0.5 | -0.5 |
Refactor the DataPipeline
class to accept pluggable strategies.
from abc import ABC, abstractmethod
import pandas as pd
class DataProcessingStrategy(ABC):
@abstractmethod
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
pass
class DropMissingDataStrategy(DataProcessingStrategy):
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
return data.dropna()
class StandardizeDataStrategy(DataProcessingStrategy):
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
return (data - data.mean()) / data.std()
class DataPipeline:
def __init__(self):
self.strategies = []
def add_strategy(self, strategy: DataProcessingStrategy):
self.strategies.append(strategy)
def process(self, data: pd.DataFrame) -> pd.DataFrame:
for strategy in self.strategies:
data = strategy.apply(data)
return data
pipeline = DataPipeline()
pipeline.add_strategy(DropMissingDataStrategy())
pipeline.add_strategy(StandardizeDataStrategy())
# Imagine we have some sample data
data = pd.DataFrame({"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3]})
pipeline.process(data)
A | B | |
---|---|---|
0 | -1.024695 | 1.161895 |
1 | -0.439155 | 0.387298 |
2 | 0.146385 | -1.161895 |
4 | 1.317465 | -0.387298 |
This design ensures the pipeline’s processing strategies can be swapped, extended, or reordered without modifying the DataPipeline
class.
from abc import ABC, abstractmethod
import pandas as pd
class DataProcessingStrategy(ABC):
@abstractmethod
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
pass
class DropMissingDataStrategy(DataProcessingStrategy):
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
return data.dropna()
class StandardizeDataStrategy(DataProcessingStrategy):
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
return (data - data.mean()) / data.std()
class EncodeDataStrategy(DataProcessingStrategy):
def apply(self, data: pd.DataFrame) -> pd.DataFrame:
return pd.get_dummies(data)
class DataPipeline:
def __init__(self):
self.strategies = []
def add_strategy(self, strategy: DataProcessingStrategy):
self.strategies.append(strategy)
def process(self, data: pd.DataFrame) -> pd.DataFrame:
for strategy in self.strategies:
data = strategy.apply(data)
return data
pipeline = DataPipeline()
pipeline.add_strategy(DropMissingDataStrategy())
pipeline.add_strategy(EncodeDataStrategy())
pipeline.add_strategy(StandardizeDataStrategy())
data = pd.DataFrame(
{"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3], "C": ["a", "a", "b", "b", "a"]}
)
pipeline.process(data)
A | B | C_a | C_b | |
---|---|---|---|---|
0 | -1.024695 | 1.161895 | 0.5 | -0.5 |
1 | -0.439155 | 0.387298 | 0.5 | -0.5 |
2 | 0.146385 | -1.161895 | -1.5 | 1.5 |
4 | 1.317465 | -0.387298 | 0.5 | -0.5 |
2.7.15. Use Mixins Over Inheritance for Enhanced Modularity#
Use mixin instead of inheritance to add shared functionality without changing the primary structure of a class.
In the following example, the inheritance-based approach assumes that the PickleableModel
class has a .model
attribute that needs to be serialized. This assumption may not be true for all subclasses.
import joblib
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
class PickleableModel:
def to_pickle(self, file_path):
"""
Serialize the object to a pickle file.
"""
with open(file_path, "wb") as file:
# It's important to only serialize the model attribute here
# otherwise joblib might fail to serialize the object
joblib.dump(self.model, file)
@classmethod
def from_pickle(cls, file_path):
"""
Deserialize pickle file to an object.
"""
with open(file_path, "rb") as file:
obj = cls()
obj.model = joblib.load(file)
return obj
class PickleableKmeans(PickleableModel):
def __init__(self, n_clusters=3, **kwargs):
self.model = KMeans(n_clusters=n_clusters, **kwargs)
def fit(self, X, y=None):
self.model.fit(X)
def predict(self, X):
return self.model.predict(X)
class PickleableSVM(PickleableModel):
def __init__(self, C=1.0, kernel="rbf", **kwargs):
self.model = SVC(C=C, kernel=kernel, **kwargs)
def fit(self, X, y):
self.model.fit(X, y)
def predict(self, X):
return self.model.predict(X)
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)
kmeans = PickleableKmeans(n_clusters=3, n_init="auto")
kmeans.fit(X)
# Serialize the model to a file
kmeans_file_path = "kmeans_model.pkl"
kmeans.to_pickle(kmeans_file_path)
The PickleableMixin
can be applied to any class that needs serialization, not just machine learning models.
class PickleableMixin:
def to_pickle(self, file_path):
"""
Serialize the object to a pickle file.
"""
with open(file_path, "wb") as file:
joblib.dump(self, file)
@staticmethod
def from_pickle(file_path):
"""
Deserialize pickle file to an object.
"""
with open(file_path, "rb") as file:
return joblib.load(file)
# Enhanced models with serialization capability
class PickleableKmeans(KMeans, PickleableMixin):
pass
class PickleableSVM(SVC, PickleableMixin):
pass
kmeans = PickleableKmeans(n_clusters=3, n_init="auto")
kmeans.fit(X)
# Serialize the model to a file
kmeans_file_path = "kmeans_model.pkl"
kmeans.to_pickle(kmeans_file_path)
2.7.16. Embracing Duck Typing for Cleaner, More Adaptable Data Science Code#
Duck typing comes from the phrase “If it walks like a duck and quacks like a duck, then it must be a duck.”
This lets you write flexible code that works with different types of objects, as long as they have the methods or attributes you’re using.
For data scientists, duck typing allows creating versatile functions that work with various data structures without explicitly checking their types.
Here’s a simple example:
import numpy as np
import pandas as pd
class CustomDataFrame:
def __init__(self, data):
self.data = data
def mean(self):
return np.mean(self.data)
def std(self):
return np.std(self.data)
def analyze_data(data):
print(f"Mean: {data.mean()}")
print(f"Standard Deviation: {data.std()}")
# These all work, thanks to duck typing
numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])
analyze_data(numpy_array)
analyze_data(pandas_series)
analyze_data(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951
In this example, the analyze_data
function works with NumPy arrays, Pandas Series, and our custom CustomDataFrame
class, because they all have mean
and std
methods. This flexibility is powerful in data science workflows where you might be working with various data structures.
This flexibility is valuable in data science because:
It saves time: You don’t need separate functions for different data types.
Bad example:
def analyze_numpy_array(data):
print(f"Mean: {np.mean(data)}")
print(f"Standard Deviation: {np.std(data)}")
def analyze_pandas_series(data):
print(f"Mean: {data.mean()}")
print(f"Standard Deviation: {data.std()}")
def analyze_custom_df(data):
print(f"Mean: {data.mean()}")
print(f"Standard Deviation: {data.std()}")
numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])
analyze_numpy_array(numpy_array)
analyze_pandas_series(pandas_series)
analyze_custom_df(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951
It’s cleaner: You avoid lots of
if
statements checking for types.
def analyze_data(data):
if isinstance(data, np.ndarray):
mean = np.mean(data)
std = np.std(data)
elif isinstance(data, pd.Series):
mean = data.mean()
std = data.std()
elif isinstance(data, CustomDataFrame):
mean = data.mean()
std = data.std()
else:
raise TypeError("Unsupported data type")
print(f"Mean: {mean}")
print(f"Standard Deviation: {std}")
numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])
python_list = [1, 2, 3, 4, 5]
analyze_data(numpy_array)
analyze_data(pandas_series)
analyze_data(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951
It’s more adaptable: Your code can handle new data types easily.
Bad example:
def analyze_data(data):
if isinstance(data, np.ndarray):
print(f"Mean: {np.mean(data)}")
print(f"Standard Deviation: {np.std(data)}")
elif isinstance(data, pd.Series):
print(f"Mean: {data.mean()}")
print(f"Standard Deviation: {data.std()}")
elif isinstance(data, CustomDataFrame):
print(f"Mean: {data.mean()}")
print(f"Standard Deviation: {data.std()}")
else:
raise TypeError("Unsupported data type")
# If you introduce a new data type, you have to modify the function
class NewDataType:
def __init__(self, data):
self.data = data
def mean(self):
return sum(self.data) / len(self.data)
def std(self):
mean = self.mean()
return (sum((x - mean) ** 2 for x in self.data) / len(self.data)) ** 0.5
new_data = NewDataType([1, 2, 3, 4, 5])
# This will raise a TypeError
# analyze_data(new_data)