Python for Data Analysis

 


“Python for Data Analysis” can mean two things:

  1. You want to learn how to use Python for data analysis, or
  2. You are referring to the popular book Python for Data Analysis by Wes McKinney (Pandas creator).

Below I cover both. If you want something different, tell me!


✅ 1. Using Python for Data Analysis — Quick Guide

Core Libraries

Task                | Library                     | Notes
--------------------|-----------------------------|------------------------------------------
Data manipulation   | pandas                      | Tables, CSV/Excel, grouping, merging
Numerical computing | numpy                       | Arrays, math, performance
Visualization       | matplotlib, seaborn, plotly | Static + interactive charts
Machine learning    | scikit-learn                | Regression, classification, clustering
Big data            | pyspark, dask               | Scales beyond RAM (see the dask sketch below)
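
The big-data row isn't demonstrated later in this guide, so here is a minimal sketch of the dask side, assuming a large CSV called big_data.csv with an income column (both names are only illustrative):

import dask.dataframe as dd

# dask mirrors the pandas API but reads the file lazily in chunks,
# so the dataset never has to fit into RAM all at once
ddf = dd.read_csv("big_data.csv")

# nothing is computed until .compute() is called
print(ddf["income"].mean().compute())

The same idea scales to pyspark, but dask is the closer drop-in for the pandas code used throughout this guide.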

📊 Basic Workflow Example

Load data

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())

Clean data

df = df.dropna()                     # remove missing values
df["age"] = df["age"].astype(int)   # type conversion

Exploratory analysis

print(df.describe())
print(df.groupby("gender")["income"].mean())

Visualization

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df["income"], kde=True)
plt.show()

Modeling

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df[["age", "education_years"]]
y = df["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))   # R^2 of the model on the held-out test set

📚 2. About the book Python for Data Analysis

Written by Wes McKinney, creator of pandas.
Covers:

  • Pandas fundamentals
  • Data wrangling
  • Merging, reshaping, time series
  • NumPy basics
  • Practical datasets

If you want a summary, table of contents, or exercises, just ask.

Here’s more depth, including step-by-step workflows, intermediate/advanced techniques, and practical examples you can use immediately.


🔥 1. Full Data Analysis Workflow in Python

A complete project usually follows these steps:

  1. Import libraries
  2. Load data
  3. Inspect the dataset
  4. Clean the data
  5. Analyze patterns
  6. Visualize results
  7. Build predictive models (optional)
  8. Report insights

Below is each step with practical code.


📌 2. Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

📌 3. Load Data From Different Sources

CSV

df = pd.read_csv("data.csv")

Excel

df = pd.read_excel("file.xlsx", sheet_name="Sheet1")

SQL

import sqlite3

conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM customers", conn)

JSON

df = pd.read_json("data.json")

📌 4. Inspect Data (Exploratory Data Analysis)

df.head()       # first rows
df.tail()       # last rows
df.info()       # column dtypes and non-null counts
df.describe()   # summary statistics for numeric columns
df.nunique()    # number of unique values per column
df.shape        # (rows, columns)

Check missing values:

df.isna().sum()

📌 5. Data Cleaning Techniques

Remove duplicates

df = df.drop_duplicates()

Replace missing values

df['salary'] = df['salary'].fillna(df['salary'].median())

Apply transformations

df['date'] = pd.to_datetime(df['date'])
df['price_log'] = np.log(df['price'] + 1)
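
Once date is a real datetime, time-based operations such as resampling become available. A minimal sketch, assuming a monthly average of price is wanted (column names follow the transformation example above):

# index by the datetime column, then aggregate by calendar month
monthly_price = df.set_index('date')['price'].resample('M').mean()
print(monthly_price.head())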

Filter rows

df = df[df['age'] > 25]

📌 6. Data Visualization (More Examples)

Histogram

sns.histplot(df['age'])
plt.show()

Boxplot (detect outliers)

sns.boxplot(x=df['salary'])
plt.show()

Scatter plot

sns.scatterplot(x='age', y='salary', data=df)
plt.show()

Heatmap (correlation)

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
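
Interactive chart (plotly)

The core-libraries table above also lists plotly for interactive charts. A minimal sketch with plotly.express, assuming the same age and salary columns and that plotly is installed:

import plotly.express as px

# interactive scatter plot; renders inline in a notebook or opens in the browser
fig = px.scatter(df, x='age', y='salary', title='Salary vs. Age')
fig.show()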

📌 7. Feature Engineering

Create new columns

df['income_per_year'] = df['salary'] * 12

Binning

df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 60, 100],
                         labels=['child', 'young', 'adult', 'senior'])

One-hot encoding (for ML)

df = pd.get_dummies(df, columns=['gender', 'city'])

📌 8. Machine Learning Examples

a) Train/Test Split

from sklearn.model_selection import train_test_split

X = df[['age', 'experience']]
y = df['salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

b) Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print("Score:", model.score(X_test, y_test))

c) Classification (Logistic Regression)

from sklearn.linear_model import LogisticRegression

# classification needs a categorical target, not the continuous salary used
# above; here a binary "high salary" label is derived as an example
y_class = (df['salary'] > df['salary'].median()).astype(int)

clf = LogisticRegression()
clf.fit(X, y_class)

d) Random Forest

from sklearn.ensemble import RandomForestClassifier

# also a classifier, so it needs the same categorical target as above
rf = RandomForestClassifier()
rf.fit(X, y_class)

📌 9. Save & Export Results

Save to CSV

df.to_csv("cleaned_data.csv", index=False)

Save model

import joblib
joblib.dump(model, "model.pkl")
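
Write a short report

Step 8 of the workflow above, reporting insights, is the one step without an example. A minimal sketch that writes a few headline numbers to a plain-text report (the file name and columns are only illustrative):

# collect a few key figures from the cleaned DataFrame
summary = (
    f"Rows analysed: {len(df)}\n"
    f"Mean salary:   {df['salary'].mean():.2f}\n"
    f"Mean age:      {df['age'].mean():.1f}\n"
)

with open("report.txt", "w") as f:
    f.write(summary)

print(summary)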

🎁 10. Real Project Example

If you upload a dataset, I can produce:

✔ Complete analysis
✔ Plots
✔ Statistical insights
✔ ML model
✔ Code in Python or Jupyter Notebook


If you want, I can also show:

➡ A full end-to-end case study
➡ Advanced data cleaning
➡ Time-series analysis
➡ Deep learning with TensorFlow or PyTorch
➡ SQL + Python workflows

Just tell me “Show an example project” or send your data!

