~/Efficient Pandas Data Processing Patterns

Mar 24, 2022


Pandas is a popular Python library for data analysis. Here are quick patterns for processing data with Pandas:

Load Data

Read CSV data into a DataFrame:

import pandas as pd
df = pd.read_csv("file.csv")
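When only some columns are needed, `read_csv` can be told so up front, which may save memory and parsing time. Below is a minimal sketch, assuming hypothetical `name`, `age`, and `city` columns; an in-memory `io.StringIO` buffer stands in for the file so the example runs on its own:

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice this would be a path such as "file.csv".
csv_text = "name,age,city\nAda,36,London\nLinus,28,Helsinki\n"

# usecols limits which columns are parsed; dtype sets column types while reading.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "age"],
    dtype={"age": "int32"},
)
```

The `city` column is never loaded, and `age` is stored as a 32-bit integer instead of the default 64-bit one.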

Filter Rows

Select rows matching a condition:

df_filtered = df[df["col"] > 10]
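Several conditions can be combined in one filter. The sketch below uses made-up data, and the `group` column is an assumption added for illustration:

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
df = pd.DataFrame({"col": [5, 12, 20, 8], "group": ["a", "b", "a", "b"]})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
df_both = df[(df["col"] > 10) & (df["group"] == "a")]

# isin() keeps rows whose value appears in a list of allowed values.
df_in = df[df["group"].isin(["b"])]
```

Python's `and` and `or` keywords do not work on whole columns, which is why the element-wise `&` and `|` operators are used here.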

Select Columns

Pick one or more columns:

df_cols = df[["name", "age"]]
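The bracket style changes what comes back, and `.loc` can pick rows and columns in one step. A small sketch with made-up data (the `city` column is an assumption):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"name": ["Ada", "Linus"], "age": [36, 28], "city": ["London", "Helsinki"]}
)

one_col = df["name"]          # a single name returns a Series
sub_df = df[["name", "age"]]  # a list of names returns a DataFrame

# .loc takes a row condition and a column list together.
names_over_30 = df.loc[df["age"] > 30, ["name"]]
```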

Apply Functions

Apply a function to each value in a column:

df["output"] = df["input"].apply(lambda x: x * 2)
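`apply` also works on a whole DataFrame, either per row or per column. A minimal sketch, assuming hypothetical `price` and `qty` columns:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"price": [2.0, 3.5], "qty": [3, 2]})

# axis=1 passes each row to the function as a Series.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Without axis (the default, axis=0) the function is called once per column.
col_max = df[["price", "qty"]].apply(lambda col: col.max())
```

Row-wise `apply` runs a Python function call per row, so it is usually best kept for logic that cannot be written as column arithmetic.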

Group and Aggregate

Group data and aggregate values:

df_grouped = df.groupby("category")["value"].sum()
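To compute several aggregates at once, `agg` accepts named aggregations. A sketch with made-up data; the output column names `total`, `average`, and `rows` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"category": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = df.groupby("category").agg(
    total=("value", "sum"),
    average=("value", "mean"),
    rows=("value", "count"),
)
```

The result is a DataFrame indexed by `category`, with one column per named aggregate.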

Handle Missing Data

Fill or drop missing values:

df_filled = df.fillna(0)    # replace missing values with 0
df_dropped = df.dropna()    # or drop rows that contain any missing value
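Different columns often call for different treatment. The sketch below uses made-up data and assumed column names to show counting gaps, filling per column, and dropping rows based on one column only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame(
    {
        "age": [30, np.nan, 41],
        "city": ["Oslo", None, "Lima"],
        "score": [1.0, 2.0, np.nan],
    }
)

# Count missing values per column before deciding what to do.
missing = df.isna().sum()

# A dict fills each listed column with its own value; other columns are untouched.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})

# subset restricts the check to the listed columns.
scored = df.dropna(subset=["score"])
```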

Speed Up With Vectorized Methods

Arithmetic and comparisons on whole columns are vectorized, so they run without an explicit Python loop, e.g.,

df["new_col"] = df["old_col"] * 4
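Conditional logic can often be vectorized too. As one possible approach, the sketch below uses NumPy's `np.where` in place of a per-value `apply`; the data and the `level` column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"old_col": [1, 5, 10]})

# Per-value version, shown only for comparison.
slow = df["old_col"].apply(lambda x: "high" if x > 4 else "low")

# Vectorized version: the condition is evaluated for the whole column at once.
df["level"] = np.where(df["old_col"] > 4, "high", "low")
```

Both lines produce the same labels; the vectorized form avoids one Python function call per value.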

For files too large to fit in memory, read and process the data in chunks:

# process() is a placeholder for your own per-chunk function
for chunk in pd.read_csv("bigfile.csv", chunksize=100000):
    process(chunk)
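One common shape for the per-chunk step is to compute a partial result for each chunk and combine the partials at the end. A minimal sketch, assuming hypothetical `category` and `value` columns, with a small in-memory buffer and a tiny `chunksize` standing in for a large file:

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice this would be a large file on disk.
csv_text = "category,value\na,1\nb,2\na,3\nb,4\na,5\n"

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
    partials.append(chunk.groupby("category")["value"].sum())

# Combine the per-chunk sums into one total per category.
totals = pd.concat(partials).groupby(level=0).sum()
```

This works for aggregates that can be merged from partial results, such as sums and counts; a mean would need the sum and the count tracked separately.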

Use query() for cleaner filtering syntax:

df.query("age > 21 and income > 50000")
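Thresholds do not have to be hard-coded in the query string: `query()` can read local Python variables prefixed with `@`. A sketch with made-up data, where `min_age` and `min_income` are illustrative names:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"age": [18, 25, 40], "income": [20000, 60000, 90000]})

min_age = 21
min_income = 50000

# @name refers to a Python variable in the surrounding scope.
adults = df.query("age > @min_age and income > @min_income")

# Equivalent boolean-mask form, for comparison.
same = df[(df["age"] > min_age) & (df["income"] > min_income)]
```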

See the official user guide for full pattern coverage.

Tags: [pandas] [data] [python]