Module 1 – Data Science Project Lifecycle
- Recap of Demo
- Introduction to Types of Analytics
- Project life cycle
Module 2 – Introduction to Python, R and Basic Statistics
- Installation of Python IDE
- Anaconda and Spyder
- Working with Python: basic commands and examples
- Introduction to R and RStudio with some basics
Various graphical techniques to understand data
- Bar plot
- Histogram
- Box plot
- Scatter plot
- Data types (continuous, discrete, categorical, count, qualitative, quantitative): their identification and application. Further classification of data into Nominal, Ordinal, Interval and Ratio types
- Random Variable and its definition
- Probability and Probability Distributions – Continuous probability distribution (probability density function) and Discrete probability distribution (probability mass function)
- Various sampling techniques
- Measures of central tendency
- Mean / Average
- Median
- Mode
- Measures of Dispersion
- Variance
- Standard Deviation
- Range
- Expected value of a probability distribution
- Measure of Skewness
- Measure of Kurtosis
- Normal Distribution
- Standard Normal Distribution / Z distribution
- Z scores and Z table
- QQ Plot / Quantile-Quantile plot
Advanced Statistics
- Sampling Variation
- Central Limit Theorem
- Sample size calculator
- t-distribution / Student’s t-distribution
- Confidence interval
- Population parameter – Standard deviation known
- Population parameter – Standard deviation unknown
Introduction to Hypothesis testing: test statistics, the Null Hypothesis and Alternative Hypothesis, and types of hypothesis tests
- Type I and Type II errors
- ANOVA
- Chi-Square test
High-Level overview of Machine Learning
- Supervised Learning
- Classifier
- Regression
- Unsupervised Learning
- Clustering
Supervised – Classifiers
Module 4 – Machine Learning Classifiers – KNN
Module 5 – Classifier – Naive Bayes
Module 7 – Logistic Regression
- Simple Logistic Regression
- Multiple Logistic Regression
- Confusion matrix
- False Positive, False Negative
- True Positive, True Negative
- Sensitivity (Recall), Specificity, F1 score
- Receiver operating characteristic curve (ROC curve)
Module 8 – Bagging and Boosting
- Network Topology
- Support Vector Machines
- Concept with a business case
- ARMA (Auto-Regressive Moving Average), Order p and q
- ARIMA (Auto-Regressive Integrated Moving Average), Order p, d and q
Supervised – Regression
- Scatter Diagram
- Correlation Analysis
- Principles of Regression
- Ordinary least squares
- Simple Linear Regression
- Understanding Overfitting (Variance) vs Underfitting (Bias)
- LINE assumptions (Linearity, Independence, Normality, Equal variance)
- Collinearity (Variance Inflation Factor)
- Linearity
- Normality
- Multiple Linear Regression
Module 13 – Polynomial Regression
Module 14 – Decision Tree & Random Forest
Module 15 – Regularization Techniques
- Lasso and Ridge Regression
Module 16 – Multinomial Regression
- Logit and Log Likelihood
- Category Baselining
- Modeling Nominal categorical data
Data Mining Unsupervised – Clustering
Module 17 – Data Mining Unsupervised – Clustering
- Hierarchical Clustering / Agglomerative Clustering
- K-Means Clustering
Module 18 – Dimension Reduction
- Why dimension reduction
- Advantages of PCA
- Calculation of PCA weights
- 2D Visualization using Principal components
- Basics of Matrix algebra
- SVD – Decomposition of matrix data
Module 19 – Data Mining Unsupervised – Network Analytics
- Definition of a network (the LinkedIn analogy)
- Introduction to Google PageRank
Module 20 – Data Mining Unsupervised – Association Rules
- What is Market Basket / Affinity Analysis
- Measures of association
- Support
- Confidence
- Lift Ratio
- Apriori Algorithm
- Sequential Pattern Mining
Module 21 – Data Mining Unsupervised – Recommender System
Module 23 – Natural Language Processing
Assignments/Projects/Placement Support