2.7. Classes#

2.7.1. Inheritance in Python#

Have you ever had multiple classes that have similar attributes and methods? In the code below, the class Dachshund and Poodle have similar attributes (color) and methods (show_info).

class Dachshund:
    def __init__(self, color: str):
        self.color = color

    def show_info(self):
        print(f"This is a Dachshund with {self.color} color.")


class Poodle:
    def __init__(self, color: str):
        self.color = color

    def show_info(self):
        print(f"This is a Poodle with {self.color} color.")


bim = Dachshund("black")
bim.show_info()
This is a Dachshund with black color.

If so, use inheritance to organize your classes. Inheritance allows us to define a parent class and child classes. A child class inherits all the methods and attributes of the parent class.

super().__init__ makes the child class inherit all the methods and properties from its parent.

In the code below, we define the parent class to be Dog and the child classes to be Dachshund and Poodle. With class inheritance, we avoid repeating the same piece of code multiple times.

class Dog:
    def __init__(self, type_: str, color: str):
        self.type = type_
        self.color = color

    def show_info(self):
        print(f"This is a {self.type} with {self.color} color.")


class Dachshund(Dog):
    def __init__(self, color: str):
        super().__init__(type_="Dachshund", color=color)


class Poodle(Dog):
    def __init__(self, color: str):
        super().__init__(type_="Poodle", color=color)


bim = Dachshund("black")
bim.show_info()
This is a Dachshund with black color.
coco = Poodle("brown")
coco.show_info()
This is a Poodle with brown color.

Learn more about inheritance in Python here.

2.7.2. Abstract Classes: Declare Methods without Implementation#

To ensure that all subclasses implement a set of methods and properties, use abstract methods within an abstract class. This promotes code reusability and a consistent interface across different implementations.

In the following code, Drink serves as an abstract class with an abstract method consume.

from abc import ABC, abstractmethod


class Drink(ABC):
    def __init__(self, name, volume):
        self.name = name
        self.volume = volume

    @abstractmethod
    def consume(self):
        pass


class Tea(Drink):
    def consume(self):
        print(f"Drinking {self.name} tea...")


class Smoothie(Drink):
    def consume(self):
        print(f"Drinking {self.name} smoothie...")


Tea("English Breakfast", 250).consume()
Smoothie("Tropical Blast", 500).consume()
Drinking English Breakfast tea...
Drinking Tropical Blast smoothie...

2.7.3. Distinguishing Instance-Level and Class Methods#

An instance-level method requires instantiating a class object to operate, while a class method doesn’t.

Class methods can provide alternate ways to construct objects. In the code below, the from_csv class method instantiates the class by reading data from a CSV file.

import pandas as pd

class DataAnalyzer:
    def __init__(self, data):
        self.data = data

    def analyze(self): # instance-level method
        print(f"Shape of data: {self.data.shape}")

    @classmethod
    def from_csv(cls, csv_path): # class method
        data = pd.read_csv(csv_path)
        return cls(data)


data = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
analyzer = DataAnalyzer(data)
analyzer.analyze()
Shape of data: (3, 2)
# Using the class method to create an instance from a CSV file
csv_file_path = "data.csv"
analyzer = DataAnalyzer.from_csv(csv_file_path) 
analyzer.analyze()
Shape of data: (2, 3)

2.7.4. getattr: a Better Way to Get the Attribute of a Class#

If you want to get a default value when calling an attribute that is not in a class, use getattr() method.

The getattr(class, attribute_name) method simply gets the value of an attribute of a class. However, if the attribute is not found in a class, it returns the default value provided to the function.

class Food:
    def __init__(self, name: str, color: str):
        self.name = name
        self.color = color


apple = Food("apple", "red")

print("The color of apple is", getattr(apple, "color", "yellow"))
The color of apple is red
print("The flavor of apple is", getattr(apple, "flavor", "sweet"))
The flavor of apple is sweet
print("The flavor of apple is", apple.sweet)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_337430/3178150741.py in <module>
----> 1 print("The flavor of apple is", apple.sweet)

AttributeError: 'Food' object has no attribute 'sweet'

2.7.5. __call__: Call your Class Instance like a Function#

If you want to call your class instance like a function, add __call__ method to your class.

class DataLoader:
    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        print("Instance is created")

    def __call__(self):
        print("Instance is called")


data_loader = DataLoader("my_data_dir")
# Instance is created

data_loader()
# Instance is called
Instance is created
Instance is called

2.7.6. Instance Comparison in Python Classes#

Even if two class instances have the same attributes, they are not equal because they are stored in separate memory locations.

To define how class instances should be compared, add the __eq__ method.

class Dog:
    def __init__(self, name: str):
        self.name = name


dog1 = Dog("Bim")
dog2 = Dog("Bim")
dog1 == dog2
False
class Dog:
    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name


dog1 = Dog("Bim")
dog2 = Dog("Bim")
dog1 == dog2
True

2.7.7. Static method: use the function without adding the attributes required for a new instance#

Have you ever had a function in your class that doesn’t access any properties of a class but fits well in a class? You might find it redundant to instantiate the class to use that function. That is when you can turn your function into a static method.

All you need to turn your function into a static method is the decorator @staticmethod. Now you can use the function without adding the attributes required for a new instance.

import re


class ProcessText:
    def __init__(self, text_column: str):
        self.text_column = text_column

    @staticmethod
    def remove_URL(sample: str) -> str:
        """Replace url with empty space"""
        return re.sub(r"http\S+", "", sample)


text = ProcessText.remove_URL("My favorite page is https://www.google.com")
print(text)
My favorite page is 

2.7.8. Minimize Data Risks with Python Private Variables#

To restrict external access and modification of a variable outside of a class, make it a private variable by using double underscores. This helps minimize the chances of unintended alterations.

class Grocery:
    def __init__(self, item, price):
        # Making 'price' a private variable
        self.__price = price
        self.item = item

    # Getter method to access the private variable 'price'
    def get_price(self):
        print(f"The price of {self.item} is ${self.__price}")


# Create an instance of the Grocery class
grocery_item = Grocery("Apples", 2.99)

# Access the private variable 'price' using the getter method
grocery_item.get_price()

# Access the private variable directly
grocery_item.__price
The price of Apples is $2.99
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 18
     15 grocery_item.get_price()
     17 # Access the private variable directly
---> 18 grocery_item.__price

AttributeError: 'Grocery' object has no attribute '__price'

2.7.9. Ensure Data Integrity with Getters and Setters in Python#

In Python, the property decorator controls property access and modification through getters and setters.

For example, in a BankAccount class, without getters and setters, the balance can be directly modified, potentially leading to an invalid state.

class BankAccount:
    def __init__(self, initial_balance: float):
        self.balance = initial_balance

account = BankAccount(100)
print(account.balance) 

# Directly modifying the balance (no validation)
account.balance = -50
print(account.balance) 
100
-50

Using getters and setters ensures the balance cannot be set to an invalid value.

class BankAccount:
    def __init__(self, initial_balance: float):
        self._balance = initial_balance

    @property
    def balance(self):
        return self._balance

    @balance.setter
    def balance(self, value):
        if value < 0:
            raise ValueError("Balance cannot be negative")
        self._balance = value


# Example usage
account = BankAccount(100)
print(account.balance) 

# Attempting to set an invalid balance
account.balance = -50 
100
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 21
     18 print(account.balance) 
     20 # Attempting to set an invalid balance
---> 21 account.balance = -50 

Cell In[6], line 12, in BankAccount.balance(self, value)
      9 @balance.setter
     10 def balance(self, value):
     11     if value < 0:
---> 12         raise ValueError("Balance cannot be negative")
     13     self._balance = value

ValueError: Balance cannot be negative

2.7.10. __str__ and __repr__: Create a String Representation of a Python Object#

If you want to create a string representation of an object, add __str__ and __repr__.

__str__ shows readable outputs when printing the object. __repr__ shows outputs that are useful for displaying and debugging the object.

class Food:
    def __init__(self, name: str, color: str):
        self.name = name
        self.color = color

    def __str__(self):
        return f"{self.color} {self.name}"

    def __repr__(self):
        return f"Food({self.color}, {self.name})"


food = Food("apple", "red")

print(food)  #  str__
red apple
food  # __repr__
Food(red, apple)

2.7.11. Simplify Custom Object Operations with Python Magic Methods#

Manually defining arithmetic operations between custom objects can make code less readable and harder to understand.

Python’s special methods, such as __add__, enable natural arithmetic syntax between custom objects, making code more readable and intuitive, and allowing you to focus on program logic.

Let’s consider an example to illustrate the problem. Suppose we have a class Animal with attributes species and weight, and we want to calculate the total weight of two animals.

class Animal:
    def __init__(self, species: str, weight: float):
        self.species = species
        self.weight = weight


lion = Animal("Lion", 200)
tiger = Animal("Tiger", 180)

# Messy calculations
total_weight = lion.weight + tiger.weight
total_weight
380

In this example, we have to manually access the weight attribute of each Animal object and perform the addition. This approach is not only verbose but also error-prone, as we might accidentally access the wrong attribute or perform the wrong operation.

To enable natural arithmetic syntax, we can implement the __add__ method in the Animal class. This method will be called when we use the + operator between two Animal objects.

class Animal:
    def __init__(self, species: str, weight: float):
        self.species = species
        self.weight = weight

    def __add__(self, other):
        # Combine weights
        return Animal(f"{self.species}+{other.species}", self.weight + other.weight)


lion = Animal("Lion", 200)
tiger = Animal("Tiger", 180)
combined = lion + tiger
print(combined.weight)
380

By implementing the __add__ method, we can now use the + operator to combine the weights of two Animal objects. This approach is not only more readable but also more intuitive.

2.7.12. Optimizing Memory Usage in Python with Slots#

Hide code cell content
!pip install memory_profiler

In Python, objects can store their attributes in a flexible dictionary-like structure that can use a lot of memory. Slots make your objects more memory-efficient by reserving space for their attributes ahead of time.

The code below shows that using slots significant reduces the memory usage.

%%writefile without_slot.py
from random import randint
from memory_profiler import profile


class Dog:
    def __init__(self, age):
        self.age = age


@profile
def main():
    return [Dog(age=randint(0, 30)) for _ in range(100000)]


if __name__ == "__main__":
    main()
$ python -m memory_profiler without_slot.py
Filename: without_slot.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    10     41.6 MiB     41.6 MiB           1   @profile
    11                                         def main():
    12     57.8 MiB     16.2 MiB      100003       return [Dog(randint(0, 30)) for _ in range(100000)]
%%writefile with_slot.py
from random import randint
from memory_profiler import profile


class Dog:
    # defining slots
    __slots__ = ["age"] 
    def __init__(self, age):
        self.age = age


@profile
def main():
    return [Dog(age=randint(0, 30)) for _ in range(100000)]


if __name__ == "__main__":
    main()
$ python -m memory_profiler with_slot.py
Filename: with_slot.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    11     41.3 MiB     41.3 MiB           1   @profile
    12                                         def main():
    13     46.7 MiB      5.4 MiB      100003       return [Dog(randint(0, 30)) for _ in range(100000)]

2.7.13. Improve Code Readability with Enums#

Hard-coded values without proper context can decrease code readability.

By using an Enum, meaningful names can be assigned to these values, improving code readability.

current_status = 200
if current_status == 200:
    print("You can go through the gate.")
elif current_status == 500:
    print("You can't go through the gate.")
else:
    print("Invalid status code.")
You can go through the gate.
from enum import Enum


class StatusCode(Enum):
    OK = 200
    ERROR = 500


current_status = 200
if current_status == StatusCode.OK.value:
    print("You can go through the gate.")
elif current_status == StatusCode.ERROR.value:
    print("You can't go through the gate.")
else:
    print("Invalid status code.")
You can go through the gate.

Learn more about Enum.

2.7.14. Embrace the Open-Closed Principle to Design Extensible Classes#

You should design classes that are open for extension but closed for modification (Open-Closed Principle).

In the current implementation of the DataPipeline class, the data processing methods are directly implemented within the class.

import pandas as pd


class DataPipeline:
    def drop_missing_data(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.dropna()

    def standardize_data(self, data: pd.DataFrame) -> pd.DataFrame:
        return (data - data.mean()) / data.std()

    def process(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.pipe(self.drop_missing_data).pipe(self.standardize_data)


pipeline = DataPipeline()
data = pd.DataFrame({"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3]})
processed_data = pipeline.process(data)

Adding another processing step to the pipeline requires modifying the process method.

import pandas as pd


class DataPipeline:
    def drop_missing_data(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.dropna()

    def standardize_data(self, data: pd.DataFrame) -> pd.DataFrame:
        return (data - data.mean()) / data.std()

    # Adding this
    def encode_categorical_data(self, data: pd.DataFrame) -> pd.DataFrame:
        return pd.get_dummies(data)

    # Requires modifying the code
    def process(self, data: pd.DataFrame) -> pd.DataFrame:
        return (
            data.pipe(self.drop_missing_data)
            .pipe(self.encode_categorical_data)
            .pipe(self.standardize_data)
        )


pipeline = DataPipeline()
data = pd.DataFrame(
    {"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3], "C": ["a", "a", "b", "b", "a"]}
)
pipeline.process(data)
A B C_a C_b
0 -1.024695 1.161895 0.5 -0.5
1 -0.439155 0.387298 0.5 -0.5
2 0.146385 -1.161895 -1.5 1.5
4 1.317465 -0.387298 0.5 -0.5

Refactor the DataPipeline class to accept pluggable strategies.

from abc import ABC, abstractmethod
import pandas as pd


class DataProcessingStrategy(ABC):
    @abstractmethod
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        pass


class DropMissingDataStrategy(DataProcessingStrategy):
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.dropna()


class StandardizeDataStrategy(DataProcessingStrategy):
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        return (data - data.mean()) / data.std()


class DataPipeline:
    def __init__(self):
        self.strategies = []

    def add_strategy(self, strategy: DataProcessingStrategy):
        self.strategies.append(strategy)

    def process(self, data: pd.DataFrame) -> pd.DataFrame:
        for strategy in self.strategies:
            data = strategy.apply(data)
        return data


pipeline = DataPipeline()
pipeline.add_strategy(DropMissingDataStrategy())
pipeline.add_strategy(StandardizeDataStrategy())

# Imagine we have some sample data
data = pd.DataFrame({"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3]})

pipeline.process(data)
A B
0 -1.024695 1.161895
1 -0.439155 0.387298
2 0.146385 -1.161895
4 1.317465 -0.387298

This design ensures the pipeline’s processing strategies can be swapped, extended, or reordered without modifying the DataPipeline class.

from abc import ABC, abstractmethod
import pandas as pd


class DataProcessingStrategy(ABC):
    @abstractmethod
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        pass


class DropMissingDataStrategy(DataProcessingStrategy):
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.dropna()


class StandardizeDataStrategy(DataProcessingStrategy):
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        return (data - data.mean()) / data.std()


class EncodeDataStrategy(DataProcessingStrategy):
    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        return pd.get_dummies(data)


class DataPipeline:
    def __init__(self):
        self.strategies = []

    def add_strategy(self, strategy: DataProcessingStrategy):
        self.strategies.append(strategy)

    def process(self, data: pd.DataFrame) -> pd.DataFrame:
        for strategy in self.strategies:
            data = strategy.apply(data)
        return data


pipeline = DataPipeline()
pipeline.add_strategy(DropMissingDataStrategy())
pipeline.add_strategy(EncodeDataStrategy())
pipeline.add_strategy(StandardizeDataStrategy())

data = pd.DataFrame(
    {"A": [1, 2, 3, None, 5], "B": [5, 4, 2, 1, 3], "C": ["a", "a", "b", "b", "a"]}
)

pipeline.process(data)
A B C_a C_b
0 -1.024695 1.161895 0.5 -0.5
1 -0.439155 0.387298 0.5 -0.5
2 0.146385 -1.161895 -1.5 1.5
4 1.317465 -0.387298 0.5 -0.5

2.7.15. Use Mixins Over Inheritance for Enhanced Modularity#

Use mixin instead of inheritance to add shared functionality without changing the primary structure of a class.

In the following example, the inheritance-based approach assumes that the PickleableModel class has a .model attribute that needs to be serialized. This assumption may not be true for all subclasses.

import joblib
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_blobs


class PickleableModel:
    def to_pickle(self, file_path):
        """
        Serialize the object to a pickle file.
        """
        with open(file_path, "wb") as file:
            # It's important to only serialize the model attribute here
            # otherwise joblib might fail to serialize the object
            joblib.dump(self.model, file)

    @classmethod
    def from_pickle(cls, file_path):
        """
        Deserialize pickle file to an object.
        """
        with open(file_path, "rb") as file:
            obj = cls()
            obj.model = joblib.load(file)
            return obj


class PickleableKmeans(PickleableModel):
    def __init__(self, n_clusters=3, **kwargs):
        self.model = KMeans(n_clusters=n_clusters, **kwargs)

    def fit(self, X, y=None):
        self.model.fit(X)

    def predict(self, X):
        return self.model.predict(X)


class PickleableSVM(PickleableModel):
    def __init__(self, C=1.0, kernel="rbf", **kwargs):
        self.model = SVC(C=C, kernel=kernel, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)

    def predict(self, X):
        return self.model.predict(X)
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

kmeans = PickleableKmeans(n_clusters=3, n_init="auto")
kmeans.fit(X)

# Serialize the model to a file
kmeans_file_path = "kmeans_model.pkl"
kmeans.to_pickle(kmeans_file_path)

The PickleableMixin can be applied to any class that needs serialization, not just machine learning models.

class PickleableMixin:
    def to_pickle(self, file_path):
        """
        Serialize the object to a pickle file.
        """
        with open(file_path, "wb") as file:
            joblib.dump(self, file)

    @staticmethod
    def from_pickle(file_path):
        """
        Deserialize pickle file to an object.
        """
        with open(file_path, "rb") as file:
            return joblib.load(file)


# Enhanced models with serialization capability
class PickleableKmeans(KMeans, PickleableMixin):
    pass


class PickleableSVM(SVC, PickleableMixin):
    pass
kmeans = PickleableKmeans(n_clusters=3, n_init="auto")
kmeans.fit(X)

# Serialize the model to a file
kmeans_file_path = "kmeans_model.pkl"
kmeans.to_pickle(kmeans_file_path)

2.7.16. Embracing Duck Typing for Cleaner, More Adaptable Data Science Code#

Duck typing comes from the phrase “If it walks like a duck and quacks like a duck, then it must be a duck.”

This lets you write flexible code that works with different types of objects, as long as they have the methods or attributes you’re using.

For data scientists, duck typing allows creating versatile functions that work with various data structures without explicitly checking their types.

Here’s a simple example:

import numpy as np
import pandas as pd


class CustomDataFrame:
    def __init__(self, data):
        self.data = data

    def mean(self):
        return np.mean(self.data)

    def std(self):
        return np.std(self.data)


def analyze_data(data):
    print(f"Mean: {data.mean()}")
    print(f"Standard Deviation: {data.std()}")


# These all work, thanks to duck typing
numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])

analyze_data(numpy_array)
analyze_data(pandas_series)
analyze_data(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951

In this example, the analyze_data function works with NumPy arrays, Pandas Series, and our custom CustomDataFrame class, because they all have mean and std methods. This flexibility is powerful in data science workflows where you might be working with various data structures.

This flexibility is valuable in data science because:

  1. It saves time: You don’t need separate functions for different data types.

Bad example:

def analyze_numpy_array(data):
    print(f"Mean: {np.mean(data)}")
    print(f"Standard Deviation: {np.std(data)}")

def analyze_pandas_series(data):
    print(f"Mean: {data.mean()}")
    print(f"Standard Deviation: {data.std()}")

def analyze_custom_df(data):
    print(f"Mean: {data.mean()}")
    print(f"Standard Deviation: {data.std()}")

numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])

analyze_numpy_array(numpy_array)
analyze_pandas_series(pandas_series)
analyze_custom_df(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951
  1. It’s cleaner: You avoid lots of if statements checking for types.

def analyze_data(data):
    if isinstance(data, np.ndarray):
        mean = np.mean(data)
        std = np.std(data)
    elif isinstance(data, pd.Series):
        mean = data.mean()
        std = data.std()
    elif isinstance(data, CustomDataFrame):
        mean = data.mean()
        std = data.std()
    else:
        raise TypeError("Unsupported data type")
    
    print(f"Mean: {mean}")
    print(f"Standard Deviation: {std}")

numpy_array = np.array([1, 2, 3, 4, 5])
pandas_series = pd.Series([1, 2, 3, 4, 5])
custom_df = CustomDataFrame([1, 2, 3, 4, 5])
python_list = [1, 2, 3, 4, 5]

analyze_data(numpy_array)
analyze_data(pandas_series)
analyze_data(custom_df)
Mean: 3.0
Standard Deviation: 1.4142135623730951
Mean: 3.0
Standard Deviation: 1.5811388300841898
Mean: 3.0
Standard Deviation: 1.4142135623730951
  1. It’s more adaptable: Your code can handle new data types easily.

Bad example:

def analyze_data(data):
    if isinstance(data, np.ndarray):
        print(f"Mean: {np.mean(data)}")
        print(f"Standard Deviation: {np.std(data)}")
    elif isinstance(data, pd.Series):
        print(f"Mean: {data.mean()}")
        print(f"Standard Deviation: {data.std()}")
    elif isinstance(data, CustomDataFrame):
        print(f"Mean: {data.mean()}")
        print(f"Standard Deviation: {data.std()}")
    else:
        raise TypeError("Unsupported data type")


# If you introduce a new data type, you have to modify the function
class NewDataType:
    def __init__(self, data):
        self.data = data
    def mean(self):
        return sum(self.data) / len(self.data)
    def std(self):
        mean = self.mean()
        return (sum((x - mean) ** 2 for x in self.data) / len(self.data)) ** 0.5

new_data = NewDataType([1, 2, 3, 4, 5])
# This will raise a TypeError
# analyze_data(new_data)