This is my review of the course, having just taken it. This review serves as a learning aid: to help me synthesize the information after first learning it, and hopefully retain the content much longer. This post also provides a personal assessment of the course and whether it was effective in teaching the subject.
The course begins by describing machine learning in general and the difference between its two main categories: unsupervised learning and supervised learning. Then, for the focus of this course, supervised learning, there are two types:
1. Classification, for categorical target variables
2. Regression, for continuous target variables
Throughout the course, most of the Python code follows this scheme (a minimal sketch follows the list):
- Import ML model
- Instantiate ML model
- Fit the model to training data: `model.fit(X_train, y_train)`
- Use the recently trained model to make predictions.
- Optionally, assess the performance of the model
- Another option is to tune the model’s hyperparameters
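As a rough illustration of that pattern (using k-Nearest Neighbors and a synthetic dataset as stand-ins for the course's own examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data as a stand-in for the course's datasets
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate the model, then fit it to the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Use the trained model to make predictions, then assess performance
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out test set
```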
The models this course teaches are:
- k-Nearest Neighbors
- Linear Regression
- Ridge and Lasso Regression, regularized alternatives to classic Linear Regression
- Logistic Regression
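All of these follow the same fit/predict pattern. For instance, Ridge and Lasso can be swapped in for LinearRegression directly (the alpha values below are illustrative, not from the course):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Ridge (L2) and Lasso (L1) add regularization on top of plain linear
# regression; alpha sets its strength (values here are illustrative)
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # R^2 on the training data
```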
Sklearn Cross-Validation (CV)
Cross-validation is a method of assessing an ML model by splitting the training data into several folds, then repeatedly training on all but one fold and evaluating on the held-out fold. This mitigates the problem that training and evaluating on a single narrow split can lead to overfitting on that particular dataset, rather than the preferred generalized fit that can accurately predict on more varied data. This seems like a standard practice when employing ML models.
Cross-validation in sklearn can be implemented like this (a minimal sketch using `cross_val_score` on synthetic data; the estimator and fold count are illustrative):
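```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data as a stand-in
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)  # one R^2 per fold
print(scores.mean())
```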
The ROC (Receiver Operating Characteristic) curve is plotted to determine the nature of your fit. The goal is to maximize the Area Under the Curve (AUC) of the ROC. For more info: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
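A sketch of how this could look, assuming a logistic regression classifier on synthetic data (the model and data are stand-ins, not the course's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
y_prob = logreg.predict_proba(X_test)[:, 1]  # probability of the positive class

# Points on the ROC curve (one per threshold), and the area under it
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(roc_auc_score(y_test, y_prob))
```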
Another way of evaluating your model’s performance is the confusion matrix.
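Continuing the sketch above, the confusion matrix is built from hard class predictions rather than probabilities:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = logreg.predict(X_test)  # hard class labels this time
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1
```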
Hyperparameter Tuning (HPT)
This aspect of employing ML models was taught in the latter half of the course. Hyperparameters are the tunable parameters of your ML models. If you instantiate a model without arguments, it uses default values, but the defaults may not be optimal. If you are serious about optimizing the accuracy of your ML predictions, you must do hyperparameter tuning combined with cross-validation. The finer details of how you perform HPT are up to you: which parameters to tune, how many values to try, how many folds in CV. GridSearchCV exhaustively evaluates every combination in the defined parameter grid, running CV at each one. If that is too much wait time, RandomizedSearchCV does the same thing but evaluates only a random sample of the defined parameter space. A minimal sketch of both, with an illustrative model and parameter grid:
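```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# GridSearchCV: try every value in the grid, running 5-fold CV at each point
param_grid = {"C": np.logspace(-3, 3, 7)}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# RandomizedSearchCV: sample only n_iter random points from the same space
rand = RandomizedSearchCV(LogisticRegression(), param_grid, n_iter=3, cv=5,
                          random_state=42)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```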
Review of the course
The course attempts to present the subject in an organized manner, but no matter how good a course is, seeing this material for the first time can be confusing. The lectures weren't presented in a particularly interesting way; I found myself falling asleep and having to go back to re-listen.
The coding exercises were also not very engaging. They felt like a game of copy, paste, and modify from the lecture slides, without critically thinking about how to apply the machine learning concepts. However, upon reviewing the content after finishing the course, I see that the material is actually well structured and informative. I learned more from reviewing the course and connecting the common patterns in each ML model. This course suffers from the same drawback as other DataCamp courses: the exercises and lectures are the first time you see the material, and that alone does not foster a deep understanding, especially for something as complex as machine learning.