“Python for Data Analysis” can mean two things:
- You want to learn how to use Python for data analysis, or
- You are referring to the popular book Python for Data Analysis by Wes McKinney (the creator of pandas).
Below I cover both. If you want something different, tell me!
✅ 1. Using Python for Data Analysis — Quick Guide
Core Libraries
| Task | Library | Notes |
|---|---|---|
| Data manipulation | pandas | Tables, CSV/Excel, grouping, merging |
| Numerical computing | numpy | Arrays, math, performance |
| Visualization | matplotlib, seaborn, plotly | Static + interactive charts |
| Machine learning | scikit-learn | Regression, classification, clustering |
| Big data | pyspark, dask | Scales beyond RAM |
📊 Basic Workflow Example
Load data
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
Clean data
df = df.dropna() # remove missing values
df["age"] = df["age"].astype(int) # type conversion
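A blanket dropna() plus astype(int) works on tidy data, but it fails as soon as a column contains text such as "unknown". Here is a minimal sketch of a more defensive version. The sample values are made up for illustration, and only the "age" and "income" column names come from the example above.

```python
import pandas as pd

# Hypothetical sample data: "age" arrives as text with one unparseable entry.
df = pd.DataFrame({
    "age": ["34", "27", "unknown", None],
    "income": [52000, 48000, 61000, 39000],
})

# Turn anything that is not a number into NaN instead of raising an error.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Drop only the rows where "age" is missing, not rows missing any column.
df = df.dropna(subset=["age"])

# The integer conversion is safe now because no NaN values remain in "age".
df["age"] = df["age"].astype(int)
print(df)
```

Restricting dropna() to a subset keeps rows that are only missing values in columns you do not need.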
Exploratory analysis
print(df.describe())
print(df.groupby("gender")["income"].mean())
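A single mean per group can hide small or skewed groups. As a rough sketch, agg() returns several statistics at once. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data using the same column names as above.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "income": [50000, 42000, 58000, 46000, 54000],
})

# One row per group, one column per statistic.
summary = df.groupby("gender")["income"].agg(["count", "mean", "median"])
print(summary)
```

Seeing the count next to the mean makes it easier to judge how much weight each group average deserves.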
Visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df["income"], kde=True)
plt.show()
Modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df[["age", "education_years"]]
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # fixed seed so the split is reproducible
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test)) # R² on the held-out 20%
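The modeling step can be tried without a data.csv file. The sketch below generates synthetic data (the coefficients and noise level are arbitrary assumptions), fixes the random seed, and reports mean absolute error next to R², since an error in income units is often easier to interpret.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: income is linear in both features plus random noise.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "education_years": rng.integers(8, 21, size=n),
})
df["income"] = (
    500 * df["age"] + 2000 * df["education_years"] + rng.normal(0, 1000, size=n)
)

X = df[["age", "education_years"]]
y = df["income"]

# random_state makes the split (and therefore the scores) repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"R^2: {r2:.3f}, MAE: {mae:.0f}")
```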
📚 2. About the book Python for Data Analysis
Written by Wes McKinney, creator of pandas.
Covers:
- Pandas fundamentals
- Data wrangling
- Merging, reshaping, time series
- NumPy basics
- Practical datasets
If you want a summary, table of contents, or exercises, just ask.
Great! Here’s more depth, including step-by-step workflows, intermediate/advanced techniques, and practical examples you can use immediately.
🔥 1. Full Data Analysis Workflow in Python
A complete project usually follows these steps:
- Import libraries
- Load data
- Inspect the dataset
- Clean the data
- Analyze patterns
- Visualize results
- Build predictive models (optional)
- Report insights
Below is each step with practical code.
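Before going step by step, here is one way the workflow could be organized as small functions. This is only a sketch: the function names load, clean, and analyze and the inline CSV are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data with one duplicate row and one missing age.
RAW_CSV = """name,age,salary
Ann,34,5200
Bob,,4100
Ann,34,5200
Cara,45,6800
"""


def load(source):
    """Read the raw data; source can be a path or a file-like object."""
    return pd.read_csv(source)


def clean(df):
    """Drop exact duplicate rows, then rows with no age."""
    df = df.drop_duplicates()
    return df.dropna(subset=["age"])


def analyze(df):
    """Return summary statistics for the salary column."""
    return df["salary"].describe()


df = clean(load(io.StringIO(RAW_CSV)))
report = analyze(df)
print(report)
```

Keeping each step in its own function makes it easier to rerun or test one stage without repeating the others.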
📌 2. Import Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
📌 3. Load Data From Different Sources
CSV
df = pd.read_csv("data.csv")
Excel
df = pd.read_excel("file.xlsx", sheet_name="Sheet1")
SQL
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM customers", conn)
conn.close() # release the database file once the data is loaded
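When a query depends on a value, it is safer to pass that value as a parameter than to paste it into the SQL string. Here is a small sketch with an in-memory SQLite database. The table contents and the city filter are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database, so no file is needed to try this out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "Oslo", 120.0), ("Bob", "Lima", 80.0), ("Cara", "Oslo", 200.0)],
)
conn.commit()

# The "?" placeholder is filled in by the driver, not by string formatting.
df = pd.read_sql(
    "SELECT * FROM customers WHERE city = ?", conn, params=("Oslo",)
)
conn.close()
print(df)
```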
JSON
df = pd.read_json("data.json")
📌 4. Inspect Data (Exploratory Data Analysis)
df.head()
df.tail()
df.info()
df.describe()
df.nunique()
df.shape
Check missing values:
df.isna().sum()
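Raw counts are hard to compare across columns, so it can help to show the share of missing values as well. Here is a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "salary": [3000, 4200, np.nan, np.nan],
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
})

# isna().mean() gives the fraction of missing values per column.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)
```

Sorting by percentage puts the columns that need the most attention at the top.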
📌 5. Data Cleaning Techniques
Remove duplicates
df = df.drop_duplicates()
Replace missing values
df['salary'] = df['salary'].fillna(df['salary'].median())
Apply transformations
df['date'] = pd.to_datetime(df['date'])
df['price_log'] = np.log1p(df['price']) # log(1 + price); assumes price >= 0
Filter rows
df = df[df['age'] > 25]
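The cleaning steps above can be combined on a small example. The sketch below uses invented values, including one impossible date, to show median imputation, tolerant date parsing, and a filter with two conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing salary and one invalid date (Feb 30).
df = pd.DataFrame({
    "age": [22, 31, 45, 52],
    "salary": [3000.0, np.nan, 5200.0, 6100.0],
    "date": ["2024-01-05", "2024-02-30", "2024-03-12", "2024-04-01"],
})

# Fill the gap with the median of the known salaries.
df["salary"] = df["salary"].fillna(df["salary"].median())

# errors="coerce" turns unparseable dates into NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Combine conditions with & and wrap each comparison in parentheses.
subset = df[(df["age"] > 25) & df["date"].notna()]
print(subset)
```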
📌 6. Data Visualization (More Examples)
Histogram
sns.histplot(df['age'])
plt.show()
Boxplot (detect outliers)
sns.boxplot(x=df['salary'])
plt.show()
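A boxplot marks points beyond 1.5 times the interquartile range as outliers by default. The same rule can be applied numerically to list the flagged rows. The salary values below are made up for illustration.

```python
import pandas as pd

# Hypothetical salaries with one obvious outlier.
salary = pd.Series([3000, 3200, 3100, 2900, 3300, 3050, 12000], name="salary")

# Quartiles and interquartile range.
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these fences are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]
print(outliers)
```

Whether a flagged value should be removed depends on the data. The rule only points out candidates to review.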
Scatter plot
sns.scatterplot(x='age', y='salary', data=df)
plt.show()
Heatmap (correlation)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
📌 7. Feature Engineering
Create new columns
df['income_per_year'] = df['salary'] * 12 # assumes salary is a monthly amount
Binning
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,60,100], labels=['child','young','adult','senior'])
One-hot encoding (for ML)
df = pd.get_dummies(df, columns=['gender', 'city'])
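Binning and one-hot encoding are easier to follow on a tiny table. A minimal sketch with invented rows follows. Note that pd.cut includes the right edge of each bin by default, so an age of exactly 18 falls into the first group.

```python
import pandas as pd

# Hypothetical data using the column names from this section.
df = pd.DataFrame({
    "age": [10, 18, 30, 70],
    "gender": ["F", "M", "F", "M"],
})

# Five bin edges define four intervals: (0, 18], (18, 35], (35, 60], (60, 100].
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 100],
    labels=["child", "young", "adult", "senior"],
)

# get_dummies replaces "gender" with one indicator column per category.
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded)
```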
📌 8. Machine Learning Examples
a) Train/Test Split
from sklearn.model_selection import train_test_split
X = df[['age', 'experience']]
y = df['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # fixed seed for a reproducible split
b) Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print("Score:", model.score(X_test, y_test))
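c) Classification (Logistic Regression)
from sklearn.linear_model import LogisticRegression
# Classifiers need a categorical target. salary is continuous, so derive a binary label (above-median salary) for illustration.
y_class = (df['salary'] > df['salary'].median()).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y_class, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(Xc_train, yc_train)
print("Accuracy:", clf.score(Xc_test, yc_test))
d) Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(Xc_train, yc_train) # same binary target as in c)
print("Accuracy:", rf.score(Xc_test, yc_test))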
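Accuracy alone does not show which kind of mistakes a classifier makes. The sketch below trains both classifiers on synthetic data and prints a confusion matrix for one of them. The data-generating formula and the binary "high_salary" label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: salary depends on age and experience plus noise.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "experience": rng.integers(0, 40, size=n),
})
salary = 1000 + 40 * df["age"] + 120 * df["experience"] + rng.normal(0, 300, size=n)

# Hypothetical binary target: 1 if salary is above the median, else 0.
df["high_salary"] = (salary > salary.median()).astype(int)

X = df[["age", "experience"]]
y = df["high_salary"]

# stratify keeps the class balance the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

log_acc = clf.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print("Logistic regression accuracy:", log_acc)
print("Random forest accuracy:", rf_acc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```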
📌 9. Save & Export Results
Save to CSV
df.to_csv("cleaned_data.csv", index=False)
Save model
import joblib
joblib.dump(model, "model.pkl")
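A saved model is only useful if it can be loaded again. Here is a short round-trip sketch. The tiny training set and the temporary directory are just for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data where y is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Save to a temporary folder, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict like the original one.
pred = restored.predict(np.array([[5.0]]))
print(pred)
```

Only load model files from sources you trust, because joblib files are pickle-based and loading one can execute arbitrary code.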
🎁 10. Real Project Example
If you upload a dataset, I can produce:
✔ Complete analysis
✔ Plots
✔ Statistical insights
✔ ML model
✔ Code in Python or Jupyter Notebook
If you want, I can also show:
➡ A full end-to-end case study
➡ Advanced data cleaning
➡ Time-series analysis
➡ Deep learning with TensorFlow or PyTorch
➡ SQL + Python workflows
Just tell me “Show an example project” or send your data!