Efficient Pandas Data Processing Patterns

Mar 24, 2022

Pandas is a popular Python library for data analysis. Here are quick patterns for processing data with Pandas:

Load Data

Read CSV data into a DataFrame:

import pandas as pd
df = pd.read_csv("file.csv")
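
If the file is wide or large, it can help to tell `read_csv` what you need up front. Here is a minimal sketch; the in-memory CSV and its column names are made up for illustration and stand in for "file.csv":

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for "file.csv".
csv_text = """name,age,joined,notes
Ann,34,2021-05-01,a
Bob,28,2020-11-15,b
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "age", "joined"],  # skip columns you do not need
    dtype={"age": "int32"},             # set a compact dtype up front
    parse_dates=["joined"],             # parse this column as datetimes
)
print(df.dtypes)
```

Reading fewer columns and choosing dtypes early usually lowers memory use, though the gain depends on your data.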

Filter Rows

Select rows matching a condition:

`df_filtered = df[df["col"] > 10]`
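
Filters often need more than one condition. A small sketch, assuming a toy DataFrame with a made-up "group" column next to "col":

```python
import pandas as pd

# Toy data; the "group" column is an assumption for illustration.
df = pd.DataFrame({"col": [5, 12, 20, 8], "group": ["a", "b", "a", "c"]})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
both = df[(df["col"] > 10) & (df["group"] == "a")]

# isin() keeps rows whose value is in a list of allowed values.
in_groups = df[df["group"].isin(["a", "c"])]

print(both)
print(in_groups)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.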

Select Columns

Pick one or more columns:

`df_cols = df[["name", "age"]]`
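
Row filtering and column selection can also be done in one step with `.loc`. A rough sketch on toy data (the "city" column and the values are assumptions):

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [34, 28, 41],
    "city": ["Oslo", "Rome", "Lima"],
})

# A list of names returns a DataFrame; a single name returns a Series.
df_cols = df[["name", "age"]]
names = df["name"]

# .loc takes a row condition and a column list together.
older = df.loc[df["age"] > 30, ["name", "age"]]
print(older)
```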

Apply Functions

Apply a function to each value in a column (use `DataFrame.apply` with `axis` to work on whole rows or columns):

`df["output"] = df["input"].apply(lambda x: x * 2)`
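
For logic that needs several columns at once, `DataFrame.apply` with `axis=1` passes each row to the function. A sketch with made-up "price" and "qty" columns, shown next to the equivalent column arithmetic:

```python
import pandas as pd

# Toy data; column names are assumptions for illustration.
df = pd.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})

# Row-wise apply: flexible, but calls the Python function once per row.
df["total_apply"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Same result with plain column arithmetic.
df["total_vec"] = df["price"] * df["qty"]
print(df)
```

Row-wise apply is handy when the logic is hard to express otherwise; when column arithmetic can do the job, it is typically the faster choice.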

Group and Aggregate

Group data and aggregate values:

`df_grouped = df.groupby("category")["value"].sum()`
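
To compute several statistics per group in one pass, named aggregation is one option. A minimal sketch on toy data (the output column names are my own):

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"category": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Each keyword is an output column: (input column, aggregation name).
summary = df.groupby("category").agg(
    total=("value", "sum"),
    mean=("value", "mean"),
    n=("value", "count"),
)
print(summary)
```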

Handle Missing Data

Fill or drop missing values:

df_filled = df.fillna(0)   # replace missing values with 0
df_dropped = df.dropna()   # or drop rows that contain any missing value
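
A single fill value rarely suits every column. Here is a sketch that counts missing values first, then fills per column or drops only where a key column is missing; the columns and fill choices are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with gaps; names and values are made up.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [1.0, np.nan, 3.0],
    "label": ["x", None, "z"],
})

# Count missing values per column before deciding what to do.
print(df.isna().sum())

# Fill each column with its own value via a dict.
filled = df.fillna({"score": df["score"].mean(), "label": "unknown"})

# Drop rows only when a specific column is missing.
dropped = df.dropna(subset=["score"])
```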

Speed Up With Vectorized Methods

Arithmetic and comparisons on whole columns are vectorized, so prefer them over row-by-row loops or apply(), e.g.,

`df["new_col"] = df["old_col"] * 4`
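
Conditional logic can often stay vectorized too. A small sketch using NumPy's `where` and the Series `clip` method; the thresholds and new column names are made up:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"old_col": [1, 5, 10]})

# np.where picks a value per row without a Python-level loop.
df["size"] = np.where(df["old_col"] >= 5, "large", "small")

# clip() caps values at a bound in one call.
df["clipped"] = df["old_col"].clip(upper=6)
print(df)
```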

For files too large to load into memory at once, read and process them in chunks:

for chunk in pd.read_csv("bigfile.csv", chunksize=100000):
    process(chunk)  # process() is a placeholder for your own per-chunk logic
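
One way to fill in the per-chunk step is to aggregate each chunk and combine the partial results at the end. This is only a sketch: the tiny in-memory CSV stands in for "bigfile.csv", and the column names and chunk size are assumptions:

```python
import io
import pandas as pd

# Hypothetical small CSV standing in for "bigfile.csv".
rows = "\n".join(f"{'a' if i % 2 else 'b'},{i}" for i in range(10))
csv_text = "category,value\n" + rows

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Reduce each chunk to a small per-category sum.
    partials.append(chunk.groupby("category")["value"].sum())

# Combine the partial sums into final totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works for sums and counts because partial results can be added together; statistics such as the median need a different approach.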

Use query() for cleaner filtering syntax:

`df.query("age > 21 and income > 50000")`
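
`query()` returns a new DataFrame, and it can reference Python variables with an `@` prefix. A short sketch; the data and variable names are made up:

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"age": [18, 25, 40], "income": [20000, 60000, 90000]})

min_age = 21
min_income = 50000

# @name refers to a Python variable in the surrounding scope.
result = df.query("age > @min_age and income > @min_income")
print(result)
```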

See the official user guide for full pattern coverage.

Tags: [pandas] [data] [python]