{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "35e0b76d",
"metadata": {},
"source": [
"## Transform a DataFrame"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "18322b2c",
"metadata": {},
"source": [
"This section covers some pandas methods to transform a DataFrame into another form using methods such as aggregation and groupby."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1d656b71",
"metadata": {},
"source": [
"### pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "45077b51",
"metadata": {},
"source": [
"If you want to aggregate over columns or rows using one or more operations, try `pd.DataFrame.agg`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "502ce1e2",
"metadata": {
"ExecuteTime": {
"end_time": "2021-09-11T15:21:40.635035Z",
"start_time": "2021-09-11T15:21:40.595611Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coll | \n",
" col2 | \n",
"
\n",
" \n",
" \n",
" \n",
" sum | \n",
" 9 | \n",
" 12 | \n",
"
\n",
" \n",
" count_two | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coll col2\n",
"sum 9 12\n",
"count_two 0 1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"import pandas as pd\n",
"\n",
"\n",
"def count_two(nums: list):\n",
" return Counter(nums)[2]\n",
"\n",
"\n",
"df = pd.DataFrame({\"coll\": [1, 3, 5], \"col2\": [2, 4, 6]})\n",
"df.agg([\"sum\", count_two])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "adf7c712",
"metadata": {},
"source": [
"### pandas.DataFrame.agg: Apply Different Aggregations to Different Columns"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b1104500",
"metadata": {},
"source": [
"If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the `pd.DataFrame.agg` method."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f8ab39e9",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-18T14:31:25.153410Z",
"start_time": "2021-11-18T14:31:25.125430Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" sum | \n",
" 10.0 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 2.5 | \n",
" NaN | \n",
"
\n",
" \n",
" min | \n",
" NaN | \n",
" 2.0 | \n",
"
\n",
" \n",
" max | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b\n",
"sum 10.0 NaN\n",
"mean 2.5 NaN\n",
"min NaN 2.0\n",
"max NaN 5.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"df = pd.DataFrame({\"a\": [1, 2, 3, 4], \"b\": [2, 3, 4, 5]})\n",
"\n",
"df.agg({\"a\": [\"sum\", \"mean\"], \"b\": [\"min\", \"max\"]})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ce1597d5",
"metadata": {},
"source": [
"### Group DataFrame's Rows into a List Using groupby"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b9e19f45",
"metadata": {},
"source": [
"It is common to use `groupby` to get the statistics of rows in the same group such as count, mean, median, etc. If you want to group rows into a list instead, use `lambda x: list(x)`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "322370d0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col1 | \n",
" col3 | \n",
"
\n",
" \n",
" col2 | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1.5 | \n",
" [d, e] | \n",
"
\n",
" \n",
" b | \n",
" 3.5 | \n",
" [f, g] | \n",
"
\n",
" \n",
" c | \n",
" 3.0 | \n",
" [h] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col1 col3\n",
"col2 \n",
"a 1.5 [d, e]\n",
"b 3.5 [f, g]\n",
"c 3.0 [h]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
" {\n",
" \"col1\": [1, 2, 3, 4, 3],\n",
" \"col2\": [\"a\", \"a\", \"b\", \"b\", \"c\"],\n",
" \"col3\": [\"d\", \"e\", \"f\", \"g\", \"h\"],\n",
" }\n",
")\n",
"\n",
"df.groupby([\"col2\"]).agg({\"col1\": \"mean\", \"col3\": lambda x: list(x)})\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2321e6f3",
"metadata": {},
"source": [
"### Get the N Largest Values for Each Category in a DataFrame"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "13cea017",
"metadata": {},
"source": [
"If you want to get the `n` largest values for each category in a pandas DataFrame, use the combination of `groupby` and `nlargest`. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6b7f376e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" a | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" b | \n",
" 2 | \n",
"
\n",
" \n",
" 3 | \n",
" b | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type value\n",
"0 a 3\n",
"1 a 2\n",
"2 b 2\n",
"3 b 1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"type\": [\"a\", \"a\", \"a\", \"b\", \"b\"], \"value\": [1, 2, 3, 1, 2]})\n",
"\n",
"# Get n largest values for each type\n",
"(\n",
" df.groupby(\"type\")\n",
" .apply(lambda df_: df_.nlargest(2, \"value\"))\n",
" .reset_index(drop=True)\n",
")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "04b9863a",
"metadata": {},
"source": [
"### Assign Name to a Pandas Aggregation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ddda0cf0",
"metadata": {},
"source": [
"By default, aggregating a column returns the name of that column."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0991ea21",
"metadata": {
"ExecuteTime": {
"end_time": "2022-08-06T14:54:23.431477Z",
"start_time": "2022-08-06T14:54:22.542597Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" price | \n",
"
\n",
" \n",
" size | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" L | \n",
" 5.0 | \n",
"
\n",
" \n",
" M | \n",
" 4.0 | \n",
"
\n",
" \n",
" S | \n",
" 2.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" price\n",
"size \n",
"L 5.0\n",
"M 4.0\n",
"S 2.5"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"df = pd.DataFrame({\"size\": [\"S\", \"S\", \"M\", \"L\"], \"price\": [2, 3, 4, 5]})\n",
"\n",
"df.groupby('size').agg({'price': 'mean'})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bfd086bb",
"metadata": {},
"source": [
"If you want to assign a new name to an aggregation, add `name = (column, agg_method)` to `agg`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "187cfe4d",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-25T13:50:35.839507Z",
"start_time": "2021-11-25T13:50:35.810041Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_price | \n",
"
\n",
" \n",
" size | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" L | \n",
" 5.0 | \n",
"
\n",
" \n",
" M | \n",
" 4.0 | \n",
"
\n",
" \n",
" S | \n",
" 2.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_price\n",
"size \n",
"L 5.0\n",
"M 4.0\n",
"S 2.5"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('size').agg(mean_price=('price', 'mean'))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9b4bb39b",
"metadata": {},
"source": [
"### pandas.pivot_table: Turn Your DataFrame Into a Pivot Table"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e53d804a",
"metadata": {},
"source": [
"A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use `pandas.pivot_table`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "55ab2d62",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-18T14:31:18.972681Z",
"start_time": "2021-11-18T14:31:18.947196Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item | \n",
" size | \n",
" location | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" apple | \n",
" small | \n",
" Walmart | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" apple | \n",
" small | \n",
" Aldi | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2 | \n",
" apple | \n",
" large | \n",
" Walmart | \n",
" 4.0 | \n",
"
\n",
" \n",
" 3 | \n",
" apple | \n",
" large | \n",
" Aldi | \n",
" 3.0 | \n",
"
\n",
" \n",
" 4 | \n",
" apple | \n",
" large | \n",
" Aldi | \n",
" 2.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item size location price\n",
"0 apple small Walmart 3.0\n",
"1 apple small Aldi 2.0\n",
"2 apple large Walmart 4.0\n",
"3 apple large Aldi 3.0\n",
"4 apple large Aldi 2.5"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"df = pd.DataFrame(\n",
" {\n",
" \"item\": [\"apple\", \"apple\", \"apple\", \"apple\", \"apple\"],\n",
" \"size\": [\"small\", \"small\", \"large\", \"large\", \"large\"],\n",
" \"location\": [\"Walmart\", \"Aldi\", \"Walmart\", \"Aldi\", \"Aldi\"],\n",
" \"price\": [3, 2, 4, 3, 2.5],\n",
" }\n",
")\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "4fcb498b",
"metadata": {
"ExecuteTime": {
"end_time": "2021-09-11T15:55:44.866288Z",
"start_time": "2021-09-11T15:55:44.810075Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" location | \n",
" Aldi | \n",
" Walmart | \n",
"
\n",
" \n",
" item | \n",
" size | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" apple | \n",
" large | \n",
" 2.75 | \n",
" 4.0 | \n",
"
\n",
" \n",
" small | \n",
" 2.00 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"location Aldi Walmart\n",
"item size \n",
"apple large 2.75 4.0\n",
" small 2.00 3.0"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 51;\n var nbb_unformatted_code = \"pivot = pd.pivot_table(\\n df, values=\\\"price\\\", index=[\\\"item\\\", \\\"size\\\"], columns=[\\\"location\\\"], aggfunc=\\\"mean\\\"\\n)\\npivot \";\n var nbb_formatted_code = \"pivot = pd.pivot_table(\\n df, values=\\\"price\\\", index=[\\\"item\\\", \\\"size\\\"], columns=[\\\"location\\\"], aggfunc=\\\"mean\\\"\\n)\\npivot\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pivot = pd.pivot_table(\n",
" df, values=\"price\", index=[\"item\", \"size\"], columns=[\"location\"], aggfunc=\"mean\"\n",
")\n",
"pivot"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ec81dac9",
"metadata": {},
"source": [
"### DataFrame.groupby.sample: Get a Random Sample of Items from Each Category in a Column\t"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a3d347c5",
"metadata": {},
"source": [
"If you want to get a random sample of items from each category in a column, use `pandas.DataFrame.groupby.sample`.This method is useful when you want to get a subset of a DataFrame while keeping all categories in a column."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3181fa65",
"metadata": {
"ExecuteTime": {
"end_time": "2021-11-18T14:31:13.800210Z",
"start_time": "2021-11-18T14:31:13.450771Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col1 | \n",
" col2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" 4 | \n",
"
\n",
" \n",
" 2 | \n",
" b | \n",
" 6 | \n",
"
\n",
" \n",
" 4 | \n",
" c | \n",
" 8 | \n",
"
\n",
" \n",
" 5 | \n",
" d | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col1 col2\n",
"0 a 4\n",
"2 b 6\n",
"4 c 8\n",
"5 d 9"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"df = pd.DataFrame({\"col1\": [\"a\", \"a\", \"b\", \"c\", \"c\", \"d\"], \"col2\": [4, 5, 6, 7, 8, 9]})\n",
"df.groupby(\"col1\").sample(n=1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "be68d598",
"metadata": {},
"source": [
"To get 2 items from each category, use `n=2`."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "7d2d9319",
"metadata": {
"ExecuteTime": {
"end_time": "2021-09-11T16:18:36.604036Z",
"start_time": "2021-09-11T16:18:36.557369Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col1 | \n",
" col2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" a | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" b | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" b | \n",
" 6 | \n",
"
\n",
" \n",
" 5 | \n",
" c | \n",
" 9 | \n",
"
\n",
" \n",
" 6 | \n",
" c | \n",
" 10 | \n",
"
\n",
" \n",
" 8 | \n",
" d | \n",
" 12 | \n",
"
\n",
" \n",
" 7 | \n",
" d | \n",
" 11 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col1 col2\n",
"0 a 4\n",
"1 a 5\n",
"4 b 8\n",
"2 b 6\n",
"5 c 9\n",
"6 c 10\n",
"8 d 12\n",
"7 d 11"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 37;\n var nbb_unformatted_code = \"df = pd.DataFrame({\\\"col1\\\": [\\\"a\\\", \\\"a\\\", \\\"b\\\", \\\"b\\\", \\\"b\\\", \\\"c\\\", \\\"c\\\", \\\"d\\\", \\\"d\\\"], \\\"col2\\\": [4, 5, 6, 7, 8, 9, 10, 11, 12]})\\ndf.groupby(\\\"col1\\\").sample(n=2)\";\n var nbb_formatted_code = \"df = pd.DataFrame(\\n {\\n \\\"col1\\\": [\\\"a\\\", \\\"a\\\", \\\"b\\\", \\\"b\\\", \\\"b\\\", \\\"c\\\", \\\"c\\\", \\\"d\\\", \\\"d\\\"],\\n \\\"col2\\\": [4, 5, 6, 7, 8, 9, 10, 11, 12],\\n }\\n)\\ndf.groupby(\\\"col1\\\").sample(n=2)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df = pd.DataFrame(\n",
" {\n",
" \"col1\": [\"a\", \"a\", \"b\", \"b\", \"b\", \"c\", \"c\", \"d\", \"d\"],\n",
" \"col2\": [4, 5, 6, 7, 8, 9, 10, 11, 12],\n",
" }\n",
")\n",
"df.groupby(\"col1\").sample(n=2)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "77854c48",
"metadata": {},
"source": [
"### size: Compute the Size of Each Group"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "295778a3",
"metadata": {},
"source": [
"If you want to get the count of elements in one column, use `groupby` and `count`. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b1caa12a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col2 | \n",
"
\n",
" \n",
" col1 | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 1 | \n",
"
\n",
" \n",
" b | \n",
" 2 | \n",
"
\n",
" \n",
" c | \n",
" 2 | \n",
"
\n",
" \n",
" d | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col2\n",
"col1 \n",
"a 1\n",
"b 2\n",
"c 2\n",
"d 1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
" {\"col1\": [\"a\", \"b\", \"b\", \"c\", \"c\", \"d\"], \"col2\": [\"S\", \"S\", \"M\", \"L\", \"L\", \"L\"]}\n",
")\n",
"\n",
"df.groupby(['col1']).count()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "177840fc",
"metadata": {},
"source": [
"If you want to get the size of groups composed of 2 or more columns, use `groupby` and `size` instead. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f85a7e71",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"col1 col2\n",
"a S 1\n",
"b M 1\n",
" S 1\n",
"c L 2\n",
"d L 1\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(['col1', 'col2']).size()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6aa68674",
"metadata": {},
"source": [
"### pandas.melt: Unpivot a DataFrame"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ac8c3e45",
"metadata": {},
"source": [
"If you want to unpivot a DataFrame from wide to long format, use pandas.melt."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7636d047",
"metadata": {},
"source": [
"For example, you can use pandas.melt to turn multiple columns (`Aldi`, `Walmart`, `Costco`) into values of one column (`store`). "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "87591963",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fruit | \n",
" Aldi | \n",
" Walmart | \n",
" Costco | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" apple | \n",
" 1 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" orange | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fruit Aldi Walmart Costco\n",
"0 apple 1 3 5\n",
"1 orange 2 4 6"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
" {\"fruit\": [\"apple\", \"orange\"], \"Aldi\": [1, 2], \"Walmart\": [3, 4], \"Costco\": [5, 6]}\n",
")\n",
"df\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "88164efd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" fruit | \n",
" store | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" apple | \n",
" Aldi | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" orange | \n",
" Aldi | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" apple | \n",
" Walmart | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" orange | \n",
" Walmart | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" apple | \n",
" Costco | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" orange | \n",
" Costco | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" fruit store value\n",
"0 apple Aldi 1\n",
"1 orange Aldi 2\n",
"2 apple Walmart 3\n",
"3 orange Walmart 4\n",
"4 apple Costco 5\n",
"5 orange Costco 6"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.melt(id_vars=[\"fruit\"], value_vars=[\"Aldi\", \"Walmart\", \"Costco\"], var_name='store')\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "db60ba89",
"metadata": {},
"source": [
"### pandas.crosstab: Create a Cross Tabulation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3a7ba408",
"metadata": {},
"source": [
"Cross tabulation allows you to analyze the relationship between multiple variables. To turn a pandas DataFrame into a cross tabulation, use `pandas.crosstab`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1608b004",
"metadata": {
"ExecuteTime": {
"end_time": "2022-08-06T14:54:32.420386Z",
"start_time": "2022-08-06T14:54:32.393511Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" person2 | \n",
" Ben | \n",
" Josh | \n",
" Khuyen | \n",
" Lauren | \n",
" Thinh | \n",
"
\n",
" \n",
" person1 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Ben | \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Josh | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" Khuyen | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" Lauren | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" Thinh | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"person2 Ben Josh Khuyen Lauren Thinh\n",
"person1 \n",
"Ben 0 1 1 0 0\n",
"Josh 1 0 1 0 0\n",
"Khuyen 1 1 0 1 0\n",
"Lauren 0 0 1 0 1\n",
"Thinh 0 0 0 1 0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"network = [\n",
" (\"Ben\", \"Khuyen\"),\n",
" (\"Ben\", \"Josh\"),\n",
" (\"Lauren\", \"Thinh\"),\n",
" (\"Lauren\", \"Khuyen\"),\n",
" (\"Khuyen\", \"Josh\"),\n",
"]\n",
"\n",
"# Create a dataframe of the network\n",
"friends1 = pd.DataFrame(network, columns=[\"person1\", \"person2\"])\n",
"\n",
"# Reverse the order of the columns\n",
"friends2 = pd.DataFrame(network, columns=[\"person2\", \"person1\"])\n",
"\n",
"# Create a symmetric dataframe\n",
"friends = pd.concat([friends1, friends2])\n",
"\n",
"# Create a cross tabulation \n",
"pd.crosstab(friends.person1, friends.person2)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1c125580",
"metadata": {},
"source": [
"### Stack Columns into Rows in Pandas"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b73286d0",
"metadata": {},
"source": [
"If you want to stack the columns into rows, use `DataFrame.stack()`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "d012ff27",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original DataFrame:\n",
" A B\n",
"x 1 4\n",
"y 2 5\n",
"z 3 6\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Create a DataFrame with two columns and a MultiIndex\n",
"df = pd.DataFrame(\n",
" {\"A\": [1, 2, 3], \"B\": [4, 5, 6]}, index=[\"x\", \"y\", \"z\"]\n",
")\n",
"\n",
"# Original DataFrame\n",
"print(\"Original DataFrame:\")\n",
"print(df)\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "28164283",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Stacked DataFrame:\n",
"x A 1\n",
" B 4\n",
"y A 2\n",
" B 5\n",
"z A 3\n",
" B 6\n",
"dtype: int64\n"
]
}
],
"source": [
"# Stacked DataFrame\n",
"stacked_df = df.stack()\n",
"\n",
"print(\"\\nStacked DataFrame:\")\n",
"print(stacked_df)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cb40ca09",
"metadata": {},
"source": [
"### Turn a pandas Series into a pandas DataFrame"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a9a174fc",
"metadata": {},
"source": [
"If you want to turn a pandas Series into a pandas DataFrame, use `str.get_dummies()`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "aa06a80f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 a\n",
"1 b\n",
"2 a,b\n",
"3 a,c\n",
"dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"s = pd.Series([\"a\", \"b\", \"a,b\", \"a,c\"])\n",
"s "
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "263bd90c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" a b c\n",
"0 1 0 0\n",
"1 0 1 0\n",
"2 1 1 0\n",
"3 1 0 1"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Split the string by \",\"\n",
"s.str.get_dummies(sep=\",\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "061c58ab",
"metadata": {},
"source": [
"### Align Pandas Objects for Effective Data Manipulation"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "02e1dfdf",
"metadata": {},
"source": [
"To perform operations between two pandas objects, it's often necessary to ensure that two pandas objects have the same row or column labels. The `df.align` method allows you to align two pandas objects along specified axes."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "42f654ee",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df1:\n",
" b a\n",
"0 1 2\n",
"1 4 5 \n",
"\n",
"df2:\n",
" a b c\n",
"0 4 3 2\n",
"1 6 5 4\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame([[1, 2], [4, 5]], columns=[\"b\", \"a\"])\n",
"df2 = pd.DataFrame([[4, 3, 2], [6, 5, 4]], columns=[\"a\", \"b\", \"c\"])\n",
"\n",
"print(\"df1:\\n\", df1, \"\\n\")\n",
"print(\"df2:\\n\", df2)\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "53d95563",
"metadata": {},
"outputs": [],
"source": [
"# ensure df1 and df2 have the same column labels\n",
"# by including all unique column labels from both objects\n",
"left, right = df1.align(df2, join='outer', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "23ba84dc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df1:\n",
" a b c\n",
"0 2 1 NaN\n",
"1 5 4 NaN \n",
"\n",
"df2:\n",
" a b c\n",
"0 4 3 2\n",
"1 6 5 4\n"
]
}
],
"source": [
"print(\"df1:\\n\", left, \"\\n\")\n",
"print(\"df2:\\n\", right)"
]
}
],
"metadata": {
"hide_input": false,
"interpreter": {
"hash": "484329849bb907480cd798e750759bc6f1d66c93f9e78e7055aa0a2c2de6b47b"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}