{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "35e0b76d", "metadata": {}, "source": [ "## Transform a DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "id": "18322b2c", "metadata": {}, "source": [ "This section covers some pandas methods to transform a DataFrame into another form using methods such as aggregation and groupby." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1d656b71", "metadata": {}, "source": [ "### pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations" ] }, { "attachments": {}, "cell_type": "markdown", "id": "45077b51", "metadata": {}, "source": [ "If you want to aggregate over columns or rows using one or more operations, try `pd.DataFrame.agg`." ] }, { "cell_type": "code", "execution_count": 5, "id": "502ce1e2", "metadata": { "ExecuteTime": { "end_time": "2021-09-11T15:21:40.635035Z", "start_time": "2021-09-11T15:21:40.595611Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
collcol2
sum912
count_two01
\n", "
" ], "text/plain": [ " coll col2\n", "sum 9 12\n", "count_two 0 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "import pandas as pd\n", "\n", "\n", "def count_two(nums: list):\n", " return Counter(nums)[2]\n", "\n", "\n", "df = pd.DataFrame({\"coll\": [1, 3, 5], \"col2\": [2, 4, 6]})\n", "df.agg([\"sum\", count_two])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "adf7c712", "metadata": {}, "source": [ "### pandas.DataFrame.agg: Apply Different Aggregations to Different Columns" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b1104500", "metadata": {}, "source": [ "If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the `pd.DataFrame.agg` method." ] }, { "cell_type": "code", "execution_count": 3, "id": "f8ab39e9", "metadata": { "ExecuteTime": { "end_time": "2021-11-18T14:31:25.153410Z", "start_time": "2021-11-18T14:31:25.125430Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
sum10.0NaN
mean2.5NaN
minNaN2.0
maxNaN5.0
\n", "
" ], "text/plain": [ " a b\n", "sum 10.0 NaN\n", "mean 2.5 NaN\n", "min NaN 2.0\n", "max NaN 5.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "df = pd.DataFrame({\"a\": [1, 2, 3, 4], \"b\": [2, 3, 4, 5]})\n", "\n", "df.agg({\"a\": [\"sum\", \"mean\"], \"b\": [\"min\", \"max\"]})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ce1597d5", "metadata": {}, "source": [ "### Group DataFrame's Rows into a List Using groupby" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b9e19f45", "metadata": {}, "source": [ "It is common to use `groupby` to get the statistics of rows in the same group such as count, mean, median, etc. If you want to group rows into a list instead, use `lambda x: list(x)`." ] }, { "cell_type": "code", "execution_count": 4, "id": "322370d0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col3
col2
a1.5[d, e]
b3.5[f, g]
c3.0[h]
\n", "
" ], "text/plain": [ " col1 col3\n", "col2 \n", "a 1.5 [d, e]\n", "b 3.5 [f, g]\n", "c 3.0 [h]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(\n", " {\n", " \"col1\": [1, 2, 3, 4, 3],\n", " \"col2\": [\"a\", \"a\", \"b\", \"b\", \"c\"],\n", " \"col3\": [\"d\", \"e\", \"f\", \"g\", \"h\"],\n", " }\n", ")\n", "\n", "df.groupby([\"col2\"]).agg({\"col1\": \"mean\", \"col3\": lambda x: list(x)})\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2321e6f3", "metadata": {}, "source": [ "### Get the N Largest Values for Each Category in a DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "id": "13cea017", "metadata": {}, "source": [ "If you want to get the `n` largest values for each category in a pandas DataFrame, use the combination of `groupby` and `nlargest`. " ] }, { "cell_type": "code", "execution_count": 8, "id": "6b7f376e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typevalue
0a3
1a2
2b2
3b1
\n", "
" ], "text/plain": [ " type value\n", "0 a 3\n", "1 a 2\n", "2 b 2\n", "3 b 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame({\"type\": [\"a\", \"a\", \"a\", \"b\", \"b\"], \"value\": [1, 2, 3, 1, 2]})\n", "\n", "# Get n largest values for each type\n", "(\n", " df.groupby(\"type\")\n", " .apply(lambda df_: df_.nlargest(2, \"value\"))\n", " .reset_index(drop=True)\n", ")\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "04b9863a", "metadata": {}, "source": [ "### Assign Name to a Pandas Aggregation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ddda0cf0", "metadata": {}, "source": [ "By default, aggregating a column returns the name of that column." ] }, { "cell_type": "code", "execution_count": 1, "id": "0991ea21", "metadata": { "ExecuteTime": { "end_time": "2022-08-06T14:54:23.431477Z", "start_time": "2022-08-06T14:54:22.542597Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
price
size
L5.0
M4.0
S2.5
\n", "
" ], "text/plain": [ " price\n", "size \n", "L 5.0\n", "M 4.0\n", "S 2.5" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "df = pd.DataFrame({\"size\": [\"S\", \"S\", \"M\", \"L\"], \"price\": [2, 3, 4, 5]})\n", "\n", "df.groupby('size').agg({'price': 'mean'})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bfd086bb", "metadata": {}, "source": [ "If you want to assign a new name to an aggregation, add `name = (column, agg_method)` to `agg`." ] }, { "cell_type": "code", "execution_count": 13, "id": "187cfe4d", "metadata": { "ExecuteTime": { "end_time": "2021-11-25T13:50:35.839507Z", "start_time": "2021-11-25T13:50:35.810041Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_price
size
L5.0
M4.0
S2.5
\n", "
" ], "text/plain": [ " mean_price\n", "size \n", "L 5.0\n", "M 4.0\n", "S 2.5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('size').agg(mean_price=('price', 'mean'))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9b4bb39b", "metadata": {}, "source": [ "### pandas.pivot_table: Turn Your DataFrame Into a Pivot Table" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e53d804a", "metadata": {}, "source": [ "A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use `pandas.pivot_table`." ] }, { "cell_type": "code", "execution_count": 2, "id": "55ab2d62", "metadata": { "ExecuteTime": { "end_time": "2021-11-18T14:31:18.972681Z", "start_time": "2021-11-18T14:31:18.947196Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemsizelocationprice
0applesmallWalmart3.0
1applesmallAldi2.0
2applelargeWalmart4.0
3applelargeAldi3.0
4applelargeAldi2.5
\n", "
" ], "text/plain": [ " item size location price\n", "0 apple small Walmart 3.0\n", "1 apple small Aldi 2.0\n", "2 apple large Walmart 4.0\n", "3 apple large Aldi 3.0\n", "4 apple large Aldi 2.5" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "df = pd.DataFrame(\n", " {\n", " \"item\": [\"apple\", \"apple\", \"apple\", \"apple\", \"apple\"],\n", " \"size\": [\"small\", \"small\", \"large\", \"large\", \"large\"],\n", " \"location\": [\"Walmart\", \"Aldi\", \"Walmart\", \"Aldi\", \"Aldi\"],\n", " \"price\": [3, 2, 4, 3, 2.5],\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 51, "id": "4fcb498b", "metadata": { "ExecuteTime": { "end_time": "2021-09-11T15:55:44.866288Z", "start_time": "2021-09-11T15:55:44.810075Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
locationAldiWalmart
itemsize
applelarge2.754.0
small2.003.0
\n", "
" ], "text/plain": [ "location Aldi Walmart\n", "item size \n", "apple large 2.75 4.0\n", " small 2.00 3.0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" }, { "data": { "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 51;\n var nbb_unformatted_code = \"pivot = pd.pivot_table(\\n df, values=\\\"price\\\", index=[\\\"item\\\", \\\"size\\\"], columns=[\\\"location\\\"], aggfunc=\\\"mean\\\"\\n)\\npivot \";\n var nbb_formatted_code = \"pivot = pd.pivot_table(\\n df, values=\\\"price\\\", index=[\\\"item\\\", \\\"size\\\"], columns=[\\\"location\\\"], aggfunc=\\\"mean\\\"\\n)\\npivot\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pivot = pd.pivot_table(\n", " df, values=\"price\", index=[\"item\", \"size\"], columns=[\"location\"], aggfunc=\"mean\"\n", ")\n", "pivot" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ec81dac9", "metadata": {}, "source": [ "### DataFrame.groupby.sample: Get a Random Sample of Items from Each Category in a Column\t" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a3d347c5", "metadata": {}, "source": [ "If you want to get a random sample of items from each category in a column, use `pandas.DataFrame.groupby.sample`.This method is useful when you want to get a subset of a DataFrame while keeping all categories in a column." ] }, { "cell_type": "code", "execution_count": 1, "id": "3181fa65", "metadata": { "ExecuteTime": { "end_time": "2021-11-18T14:31:13.800210Z", "start_time": "2021-11-18T14:31:13.450771Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
0a4
2b6
4c8
5d9
\n", "
" ], "text/plain": [ " col1 col2\n", "0 a 4\n", "2 b 6\n", "4 c 8\n", "5 d 9" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "df = pd.DataFrame({\"col1\": [\"a\", \"a\", \"b\", \"c\", \"c\", \"d\"], \"col2\": [4, 5, 6, 7, 8, 9]})\n", "df.groupby(\"col1\").sample(n=1)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "be68d598", "metadata": {}, "source": [ "To get 2 items from each category, use `n=2`." ] }, { "cell_type": "code", "execution_count": 37, "id": "7d2d9319", "metadata": { "ExecuteTime": { "end_time": "2021-09-11T16:18:36.604036Z", "start_time": "2021-09-11T16:18:36.557369Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2
0a4
1a5
4b8
2b6
5c9
6c10
8d12
7d11
\n", "
" ], "text/plain": [ " col1 col2\n", "0 a 4\n", "1 a 5\n", "4 b 8\n", "2 b 6\n", "5 c 9\n", "6 c 10\n", "8 d 12\n", "7 d 11" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, { "data": { "application/javascript": "\n setTimeout(function() {\n var nbb_cell_id = 37;\n var nbb_unformatted_code = \"df = pd.DataFrame({\\\"col1\\\": [\\\"a\\\", \\\"a\\\", \\\"b\\\", \\\"b\\\", \\\"b\\\", \\\"c\\\", \\\"c\\\", \\\"d\\\", \\\"d\\\"], \\\"col2\\\": [4, 5, 6, 7, 8, 9, 10, 11, 12]})\\ndf.groupby(\\\"col1\\\").sample(n=2)\";\n var nbb_formatted_code = \"df = pd.DataFrame(\\n {\\n \\\"col1\\\": [\\\"a\\\", \\\"a\\\", \\\"b\\\", \\\"b\\\", \\\"b\\\", \\\"c\\\", \\\"c\\\", \\\"d\\\", \\\"d\\\"],\\n \\\"col2\\\": [4, 5, 6, 7, 8, 9, 10, 11, 12],\\n }\\n)\\ndf.groupby(\\\"col1\\\").sample(n=2)\";\n var nbb_cells = Jupyter.notebook.get_cells();\n for (var i = 0; i < nbb_cells.length; ++i) {\n if (nbb_cells[i].input_prompt_number == nbb_cell_id) {\n if (nbb_cells[i].get_text() == nbb_unformatted_code) {\n nbb_cells[i].set_text(nbb_formatted_code);\n }\n break;\n }\n }\n }, 500);\n ", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"col1\": [\"a\", \"a\", \"b\", \"b\", \"b\", \"c\", \"c\", \"d\", \"d\"],\n", " \"col2\": [4, 5, 6, 7, 8, 9, 10, 11, 12],\n", " }\n", ")\n", "df.groupby(\"col1\").sample(n=2)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "77854c48", "metadata": {}, "source": [ "### size: Compute the Size of Each Group" ] }, { "attachments": {}, "cell_type": "markdown", "id": "295778a3", "metadata": {}, "source": [ "If you want to get the count of elements in one column, use `groupby` and `count`. " ] }, { "cell_type": "code", "execution_count": 12, "id": "b1caa12a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2
col1
a1
b2
c2
d1
\n", "
" ], "text/plain": [ " col2\n", "col1 \n", "a 1\n", "b 2\n", "c 2\n", "d 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(\n", " {\"col1\": [\"a\", \"b\", \"b\", \"c\", \"c\", \"d\"], \"col2\": [\"S\", \"S\", \"M\", \"L\", \"L\", \"L\"]}\n", ")\n", "\n", "df.groupby(['col1']).count()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "177840fc", "metadata": {}, "source": [ "If you want to get the size of groups composed of 2 or more columns, use `groupby` and `size` instead. " ] }, { "cell_type": "code", "execution_count": 8, "id": "f85a7e71", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "col1 col2\n", "a S 1\n", "b M 1\n", " S 1\n", "c L 2\n", "d L 1\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['col1', 'col2']).size()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6aa68674", "metadata": {}, "source": [ "### pandas.melt: Unpivot a DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ac8c3e45", "metadata": {}, "source": [ "If you want to unpivot a DataFrame from wide to long format, use pandas.melt." ] }, { "attachments": {}, "cell_type": "markdown", "id": "7636d047", "metadata": {}, "source": [ "For example, you can use pandas.melt to turn multiple columns (`Aldi`, `Walmart`, `Costco`) into values of one column (`store`). " ] }, { "cell_type": "code", "execution_count": 7, "id": "87591963", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fruitAldiWalmartCostco
0apple135
1orange246
\n", "
" ], "text/plain": [ " fruit Aldi Walmart Costco\n", "0 apple 1 3 5\n", "1 orange 2 4 6" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(\n", " {\"fruit\": [\"apple\", \"orange\"], \"Aldi\": [1, 2], \"Walmart\": [3, 4], \"Costco\": [5, 6]}\n", ")\n", "df\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "88164efd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fruitstorevalue
0appleAldi1
1orangeAldi2
2appleWalmart3
3orangeWalmart4
4appleCostco5
5orangeCostco6
\n", "
" ], "text/plain": [ " fruit store value\n", "0 apple Aldi 1\n", "1 orange Aldi 2\n", "2 apple Walmart 3\n", "3 orange Walmart 4\n", "4 apple Costco 5\n", "5 orange Costco 6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt(id_vars=[\"fruit\"], value_vars=[\"Aldi\", \"Walmart\", \"Costco\"], var_name='store')\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "db60ba89", "metadata": {}, "source": [ "### pandas.crosstab: Create a Cross Tabulation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3a7ba408", "metadata": {}, "source": [ "Cross tabulation allows you to analyze the relationship between multiple variables. To turn a pandas DataFrame into a cross tabulation, use `pandas.crosstab`." ] }, { "cell_type": "code", "execution_count": 2, "id": "1608b004", "metadata": { "ExecuteTime": { "end_time": "2022-08-06T14:54:32.420386Z", "start_time": "2022-08-06T14:54:32.393511Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
person2BenJoshKhuyenLaurenThinh
person1
Ben01100
Josh10100
Khuyen11010
Lauren00101
Thinh00010
\n", "
" ], "text/plain": [ "person2 Ben Josh Khuyen Lauren Thinh\n", "person1 \n", "Ben 0 1 1 0 0\n", "Josh 1 0 1 0 0\n", "Khuyen 1 1 0 1 0\n", "Lauren 0 0 1 0 1\n", "Thinh 0 0 0 1 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "network = [\n", " (\"Ben\", \"Khuyen\"),\n", " (\"Ben\", \"Josh\"),\n", " (\"Lauren\", \"Thinh\"),\n", " (\"Lauren\", \"Khuyen\"),\n", " (\"Khuyen\", \"Josh\"),\n", "]\n", "\n", "# Create a dataframe of the network\n", "friends1 = pd.DataFrame(network, columns=[\"person1\", \"person2\"])\n", "\n", "# Reverse the order of the columns\n", "friends2 = pd.DataFrame(network, columns=[\"person2\", \"person1\"])\n", "\n", "# Create a symmetric dataframe\n", "friends = pd.concat([friends1, friends2])\n", "\n", "# Create a cross tabulation \n", "pd.crosstab(friends.person1, friends.person2)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1c125580", "metadata": {}, "source": [ "### Stack Columns into Rows in Pandas" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b73286d0", "metadata": {}, "source": [ "If you want to stack the columns into rows, use `DataFrame.stack()`." ] }, { "cell_type": "code", "execution_count": 21, "id": "d012ff27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original DataFrame:\n", " A B\n", "x 1 4\n", "y 2 5\n", "z 3 6\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# Create a DataFrame with two columns and a MultiIndex\n", "df = pd.DataFrame(\n", " {\"A\": [1, 2, 3], \"B\": [4, 5, 6]}, index=[\"x\", \"y\", \"z\"]\n", ")\n", "\n", "# Original DataFrame\n", "print(\"Original DataFrame:\")\n", "print(df)\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "28164283", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Stacked DataFrame:\n", "x A 1\n", " B 4\n", "y A 2\n", " B 5\n", "z A 3\n", " B 6\n", "dtype: int64\n" ] } ], "source": [ "# Stacked DataFrame\n", "stacked_df = df.stack()\n", "\n", "print(\"\\nStacked DataFrame:\")\n", "print(stacked_df)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cb40ca09", "metadata": {}, "source": [ "### Turn a pandas Series into a pandas DataFrame" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a9a174fc", "metadata": {}, "source": [ "If you want to turn a pandas Series into a pandas DataFrame, use `str.get_dummies()`." ] }, { "cell_type": "code", "execution_count": 15, "id": "aa06a80f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 a,b\n", "3 a,c\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "s = pd.Series([\"a\", \"b\", \"a,b\", \"a,c\"])\n", "s " ] }, { "cell_type": "code", "execution_count": 18, "id": "263bd90c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
0100
1010
2110
3101
\n", "
" ], "text/plain": [ " a b c\n", "0 1 0 0\n", "1 0 1 0\n", "2 1 1 0\n", "3 1 0 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Split the string by \",\"\n", "s.str.get_dummies(sep=\",\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "061c58ab", "metadata": {}, "source": [ "### Align Pandas Objects for Effective Data Manipulation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "02e1dfdf", "metadata": {}, "source": [ "To perform operations between two pandas objects, it's often necessary to ensure that two pandas objects have the same row or column labels. The `df.align` method allows you to align two pandas objects along specified axes." ] }, { "cell_type": "code", "execution_count": 17, "id": "42f654ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "df1:\n", " b a\n", "0 1 2\n", "1 4 5 \n", "\n", "df2:\n", " a b c\n", "0 4 3 2\n", "1 6 5 4\n" ] } ], "source": [ "import pandas as pd\n", "\n", "df1 = pd.DataFrame([[1, 2], [4, 5]], columns=[\"b\", \"a\"])\n", "df2 = pd.DataFrame([[4, 3, 2], [6, 5, 4]], columns=[\"a\", \"b\", \"c\"])\n", "\n", "print(\"df1:\\n\", df1, \"\\n\")\n", "print(\"df2:\\n\", df2)\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "53d95563", "metadata": {}, "outputs": [], "source": [ "# ensure df1 and df2 have the same column labels\n", "# by including all unique column labels from both objects\n", "left, right = df1.align(df2, join='outer', axis=1)" ] }, { "cell_type": "code", "execution_count": 19, "id": "23ba84dc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "df1:\n", " a b c\n", "0 2 1 NaN\n", "1 5 4 NaN \n", "\n", "df2:\n", " a b c\n", "0 4 3 2\n", "1 6 5 4\n" ] } ], "source": [ "print(\"df1:\\n\", left, \"\\n\")\n", "print(\"df2:\\n\", right)" ] } ], "metadata": { "hide_input": false, "interpreter": { "hash": "484329849bb907480cd798e750759bc6f1d66c93f9e78e7055aa0a2c2de6b47b" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }