4.3. Work with Datetime#
4.3.1. parse_dates: Convert Columns into Datetime When Using pandas to Read CSV Files#
If there are datetime columns in your CSV file, use the parse_dates
parameter when reading CSV file with pandas. This reduces one extra step to convert these columns from string to datetime after reading the file.
# Create data files
import pandas as pd
data = pd.DataFrame(
{
"date_column_1": ["2021/02/10", "2021/02/12"],
"date_column_2": ["2021/02/11", "2021/02/13"],
"value": [3, 3],
}
)
data.to_csv("data.csv", index=False)
import pandas as pd
df = pd.read_csv("data.csv", parse_dates=["date_column_1", "date_column_2"])
df
date_column_1 | date_column_2 | value | |
---|---|---|---|
0 | 2021-02-10 | 2021-02-11 | 3 |
1 | 2021-02-12 | 2021-02-13 | 3 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_column_1 2 non-null datetime64[ns]
1 date_column_2 2 non-null datetime64[ns]
2 value 2 non-null int64
dtypes: datetime64[ns](2), int64(1)
memory usage: 176.0 bytes
4.3.2. pandas’ DateOffset: Add a Time Interval to a pandas Timestamp#
If you want to add days, months, or other time intervals to a pandas Timestamp
, use pd.DateOffset
.
import pandas as pd
from pandas.tseries.offsets import DateOffset, BDay
ts = pd.Timestamp('2021-10-10 9:00:00')
# Increase the timestamp by 3 months
ts + DateOffset(months=3)
Timestamp('2022-01-10 09:00:00')
# Increase the timestamp by 3 years and 3 hours
ts + DateOffset(years=3, hours=3)
Timestamp('2024-10-10 12:00:00')
You can also increase the timestamp by n
business days using BDay
.
# Increase the timestamp by 6 business days
ts + BDay(n=6)
Timestamp('2021-10-18 09:00:00')
4.3.3. DataFrame rolling: Find The Average of The Previous n Datapoints Using pandas#
If you want to find the average of the previous n data points (simple moving average) with pandas, use df.rolling(time_period).mean()
.
The code below shows how to find the simple moving average of the previous 3 data-points.
from datetime import date
import pandas as pd
df = pd.DataFrame(
{
"date": [
date(2021, 1, 20),
date(2021, 1, 21),
date(2021, 1, 22),
date(2021, 1, 23),
date(2021, 1, 24),
],
"value": [1, 2, 3, 4, 5],
}
).set_index("date")
df
value | |
---|---|
date | |
2021-01-20 | 1 |
2021-01-21 | 2 |
2021-01-22 | 3 |
2021-01-23 | 4 |
2021-01-24 | 5 |
df.rolling(3).mean()
value | |
---|---|
date | |
2021-01-20 | NaN |
2021-01-21 | NaN |
2021-01-22 | 2.0 |
2021-01-23 | 3.0 |
2021-01-24 | 4.0 |
4.3.4. pandas Grouper: Group Values Based on a Specific Frequency#
Imagine you are given a DataFrame with a date column. If you want to group your DataFrame by a specific frequency, use pd.Grouper
. A Grouper
allows you to customize your groupby instruction.
In the code below, I set freq=1W
to group my DataFrame by weeks.
import pandas as pd
df = pd.DataFrame(
{"date": ["3-15-22", "3-16-22", "3-22-22"], "price": [2, 3, 4]}
)
df["date"] = pd.to_datetime(df["date"])
df
date | price | |
---|---|---|
0 | 2022-03-15 | 2 |
1 | 2022-03-16 | 3 |
2 | 2022-03-22 | 4 |
df.groupby(pd.Grouper(key="date", freq="1W")).mean()
price | |
---|---|
date | |
2022-03-20 | 2.5 |
2022-03-27 | 4.0 |
4.3.5. pandas.Series.dt: Access Datetime Properties of a pandas Series#
The easiest way to access datetime properties of pandas Series values is to use pandas.Series.dt
.
import pandas as pd
df = pd.DataFrame({"date": ["2021/05/13 15:00", "2022-6-20 14:00"], "values": [1, 3]})
df["date"] = pd.to_datetime(df["date"])
df["date"].dt.year
0 2021
1 2022
Name: date, dtype: int64
df["date"].dt.time
0 15:00:00
1 14:00:00
Name: date, dtype: object
4.3.6. Get Rows within a Year Range#
If you want to get all data starting in a particular year and exclude the previous years, simply use df.loc['year':]
like below. This works when the index of your pd.Dataframe
is DatetimeIndex
.
from datetime import datetime
import pandas as pd
df = pd.DataFrame(
{
"date": [datetime(2018, 10, 1), datetime(2019, 10, 1), datetime(2020, 10, 1)],
"val": [1, 2, 3],
}
).set_index("date")
df
val | |
---|---|
date | |
2018-10-01 | 1 |
2019-10-01 | 2 |
2020-10-01 | 3 |
df.loc["2019":]
val | |
---|---|
date | |
2019-10-01 | 2 |
2020-10-01 | 3 |
4.3.7. pandas.reindex: Replace the Values of the Missing Dates with 0#
Have you ever got a time series with missing dates? This can cause a problem since many time series methods require a fixed frequency index.
To fix this issue, you can replace the values of the missing dates with 0 using pd.date_range
and pd.reindex
.
import pandas as pd
s = pd.Series([1, 2, 3], index=["2021-07-20", "2021-07-23", "2021-07-25"])
s.index = pd.to_datetime(s.index)
s
2021-07-20 1
2021-07-23 2
2021-07-25 3
dtype: int64
# Get dates ranging from 2021/7/20 to 2021/7/25
new_index = pd.date_range("2021-07-20", "2021-07-25")
# Conform Series to new index
new_s = s.reindex(new_index, fill_value=0)
new_s
2021-07-20 1
2021-07-21 0
2021-07-22 0
2021-07-23 2
2021-07-24 0
2021-07-25 3
Freq: D, dtype: int64
4.3.8. Select DataFrame Rows Before or After a Specific Date#
If you want to get the rows whose dates are before or after a specific date, use the comparison operator and a date string.
import pandas as pd
df = pd.DataFrame(
{"date": pd.date_range(start="2021-7-19", end="2021-7-23"), "value": list(range(5))}
)
df
date | value | |
---|---|---|
0 | 2021-07-19 | 0 |
1 | 2021-07-20 | 1 |
2 | 2021-07-21 | 2 |
3 | 2021-07-22 | 3 |
4 | 2021-07-23 | 4 |
filtered_df = df[df.date <= "2021-07-21"]
filtered_df
date | value | |
---|---|---|
0 | 2021-07-19 | 0 |
1 | 2021-07-20 | 1 |
2 | 2021-07-21 | 2 |
4.3.9. resample: Resample Time-Series Data#
If you want to change the frequency of time-series data, use resample
. In the code below, I use resample
to show the records every two days instead of every day.
import pandas as pd
from numpy.random import randint
index = pd.date_range("2022-02-01", "2022-02-6")
s = pd.Series(index=index, data=randint(0, 10, 6))
s
2022-02-01 9
2022-02-02 9
2022-02-03 8
2022-02-04 3
2022-02-05 3
2022-02-06 4
Freq: D, dtype: int64
s.resample('2D').sum()
2022-02-01 18
2022-02-03 11
2022-02-05 7
Freq: 2D, dtype: int64
4.3.10. Shift the Index of a DataFrame by a Specific Number of Periods#
If you want to shift the index of a DataFrame by a specific number of periods, use pandas.DataFrame.shift
.
import pandas as pd
df = pd.DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6]}, index=pd.date_range("2022-01-01", "2022-01-03")
)
df
a | b | |
---|---|---|
2022-01-01 | 1 | 4 |
2022-01-02 | 2 | 5 |
2022-01-03 | 3 | 6 |
shifted = df.shift(periods=1)
shifted
a | b | |
---|---|---|
2022-01-01 | NaN | NaN |
2022-01-02 | 1.0 | 4.0 |
2022-01-03 | 2.0 | 5.0 |
shifted.dropna()
a | b | |
---|---|---|
2022-01-02 | 1.0 | 4.0 |
2022-01-03 | 2.0 | 5.0 |