4.5. Create a DataFrame#

This section shows some tips to read or create a DataFrame.

4.5.1. Fix Unnamed:0 When Reading a CSV in pandas#

Sometimes, when reading a CSV in pandas, you will get an Unnamed:0 column.

# Create data
import pandas as pd  

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.to_csv('data.csv')
import pandas as pd  

df = pd.read_csv('data.csv')
print(df)
   Unnamed: 0  a  b
0           0  1  4
1           1  2  5
2           2  3  6

To fix this, add index_col=0 to pandas.read_csv.

df = pd.read_csv('data.csv', index_col=0)
print(df)
   a  b
0  1  4
1  2  5
2  3  6

4.5.2. Read Data from a Website#

pandas allows you to read data from a website without downloading the data.

For example, to read a CSV from GitHub, click Raw then copy the link.

import pandas as pd  

df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/exercise.csv", index_col=0)
df.head(5)
id diet pulse time kind
0 1 low fat 85 1 min rest
1 1 low fat 85 15 min rest
2 1 low fat 88 30 min rest
3 2 low fat 90 1 min rest
4 2 low fat 92 15 min rest

4.5.3. Read Data From a SQL Table#

Loading SQL tables into DataFrames allows you to analyze and preprocess the data using the rich functionality of pandas.

To read a SQL table into a pandas DataFrame, pass the database connection obtained from the SQLAlchemy Engine to the pandas.read_sql method.

import pandas as pd
import sqlalchemy

# Create a SQLAlchemy engine
engine = sqlalchemy.create_engine(
    "postgresql://username:password@host:port/database_name"
)


# Read a SQL table into a DataFrame
df = pd.read_sql("SELECT * FROM table_name", engine)

4.5.4. Divide a Large pandas DataFrame into Chunks#

As a data professional, you might work with large datasets that don’t fit in the memory.

When dealing with these datasets, you can divide the data into smaller portions using the pandas’s chunksize parameter.

In the code below, using no chunksize takes an average of 2.33s to run. Using chunksize=100000 takes an average of 424 µs to run, which is 563k times faster.

import pandas as pd  
import warnings
warnings.filterwarnings("ignore")
%%timeit 
pd.read_csv('../data/flight_data_2018_to_2022.csv')
2.33 s ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.read_csv('../data/flight_data_2018_to_2022.csv')
df.shape 
(563737, 120)
%%timeit  
pd.read_csv('../data/flight_data_2018_to_2022.csv', chunksize=100000)
424 µs ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We can see that using chunksize=100000 divides the DataFrame into 6 portions, 5 of which have 100000 rows.

df_chunks = pd.read_csv('../data/flight_data_2018_to_2022.csv', chunksize=100000)
for df_ in df_chunks:
    print(df_.shape)
(100000, 120)
(100000, 120)
(100000, 120)
(100000, 120)
(100000, 120)
(63737, 120)

4.5.5. Read HTML Tables Using Pandas#

If you want to quickly extract a table on a website and turn it into a Pandas DataFrame, use pd.read_html. In the code below, I extracted the table from a Wikipedia page in one line of code.

import pandas as pd  

df = pd.read_html('https://en.wikipedia.org/wiki/Poverty')
df[1]
Region $1 per day $1.25 per day[94] $1.90 per day[95]
Region 1990 2002 2004 1981 2008 1981 1990 2000 2010 2015 2018
0 East Asia and Pacific 15.4% 12.3% 9.1% 77.2% 14.3% 80.2% 60.9% 34.8% 10.8% 2.1% 1.2%
1 Europe and Central Asia 3.6% 1.3% 1.0% 1.9% 0.5% 7.3% 2.4% 1.5% 1.1%
2 Latin America and the Caribbean 9.6% 9.1% 8.6% 11.9% 6.5% 13.7% 15.5% 12.7% 6% 3.7% 3.7%
3 Middle East and North Africa 2.1% 1.7% 1.5% 9.6% 2.7% 6.5% 3.5% 2% 4.3% 7%
4 South Asia 35.0% 33.4% 30.8% 61.1% 36% 58% 49.1% 26%
5 Sub-Saharan Africa 46.1% 42.6% 41.1% 51.5% 47.5% 54.9% 58.4% 46.6% 42.3% 40.4%
6 World 52.2% 22.4% 42.7% 36.2% 27.8% 16% 10.1%

4.5.6. DataFrame.copy(): Make a Copy of a DataFrame#

Have you ever tried to make a copy of a DataFrame using =? You will not get a copy but a reference to the original DataFrame. Thus, changing the new DataFrame will also change the original DataFrame.

import pandas as pd 

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df  
col1 col2
0 1 4
1 2 5
2 3 6
df2 = df
df2['col1'] = [7, 8, 9]
df
col1 col2
0 7 4
1 8 5
2 9 6

A better way to make a copy is to use df.copy(). Now, changing the copy will not affect the original DataFrame.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Create a copy of the original DataFrame
df3 = df.copy()

# Change the value of the copy
df3['col1'] = [7, 8, 9]

# Check if the original DataFrame has been changed
df
col1 col2
0 1 4
1 2 5
2 3 6

4.5.7. Copy on Write Mode in pandas 2.0#

Hide code cell content
!pip install pandas==2.0.0

pandas DataFrame returns a view by default when selecting a subset, meaning changes to the view will change the original.

import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Create a view of the original DataFrame
df2 = df1["col1"]

# Change the value of the view
df2.iloc[0] = 10

# The original DataFrame has been changed
df1
col1 col2
0 10 3
1 2 4

pandas 2.0 offers the option to return a copy instead of a view by default, preventing changes to the copy from affecting the original.

pd.options.mode.copy_on_write = True

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Create a copy of the original DataFrame
df2 = df1["col1"]

# Change the value of the copy
df2.iloc[0] = 10

# The original DataFrame has not been changed
df1
col1 col2
0 1 3
1 2 4