Machine Learning Basics

This article summarizes the basics of two core machine learning techniques: classification and clustering.

Classification is a supervised learning technique in which the goal is to assign data points to predefined classes or categories.

Types of classification include:

  • Binary Classification: Two possible classes (e.g., “spam” or “not spam”).
  • Multi-class Classification: More than two classes (e.g., categorizing images as “landscape,” “portrait,” or “animal”).
  • Multi-label Classification: Instances can belong to multiple classes simultaneously (e.g., tagging an image as both “sunset” and “landscape”).

The algorithm learns from a labeled training dataset and predicts the classes of new, unseen data. Examples of classification include:

  • Binary Classification: Classifying emails as “spam” or “not spam.”
  • Multi-class Classification: Categorizing customers based on purchasing behavior.
  • Image Recognition: Categorizing images (e.g., as “landscape,” “portrait,” or “animal”).

Algorithms used for classification include Logistic Regression, Decision Trees, Support Vector Machines (SVMs), K-Nearest Neighbors (kNN), and Neural Networks.
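
As a minimal sketch of how such a classifier is trained and used, the example below fits a scikit-learn LogisticRegression on a tiny made-up "spam vs. not spam" dataset; the feature values and labels are invented purely for illustration.

```python
# Minimal binary classification sketch using scikit-learn's LogisticRegression.
# The tiny "email" dataset below (word/link counts vs. spam label) is made up for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features: [number of suspicious words, number of links]; label: 1 = spam, 0 = not spam
X = [[8, 5], [7, 3], [6, 4], [9, 6], [1, 0], [0, 1], [2, 0], [1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)             # learn from the labeled training data

print(model.predict([[5, 4], [0, 0]]))  # predict classes for new, unseen emails
```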

Evaluation metrics for classification (a code sketch follows this list):

  • Accuracy: Proportion of correctly classified instances.
  • Precision: Proportion of true positive outcomes among all positive predictions.
  • Recall: Proportion of true positives among all actual positives.
  • F1-Score: Harmonic mean of precision and recall.
  • Confusion Matrix: Summarizes counts of true positives, true negatives, false positives, and false negatives.
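
The sketch below shows how these metrics can be computed with scikit-learn; the true and predicted labels are hypothetical and serve only to illustrate the calls.

```python
# Computing common classification metrics with scikit-learn.
# y_true and y_pred are hypothetical labels, used only to demonstrate the metric functions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
```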

Clustering is an unsupervised learning technique used to group a set of objects or data points into clusters, where objects within the same cluster are more similar to each other than to objects in other clusters. Unlike classification, clustering is used when the data is unlabeled, aiming to uncover inherent structures or patterns. It is versatile and can be applied to various data types, such as numerical, categorical, text, and image data.

Core Components (a code sketch follows this list):

  1. Data Point:
    Each data point has a set of features (e.g., customer attributes like age, income, and spending habits).
  2. Similarity Measure:
    Measures like Euclidean distance are used to determine how close data points are to one another.
  3. Clustering Algorithms:
    • K-Means: Partitions data into K clusters.
    • Hierarchical Clustering: Builds a hierarchy of clusters.
    • DBSCAN: Groups closely packed points and identifies outliers.
    • Gaussian Mixture Model (GMM): Assumes data is generated from several Gaussian distributions, each representing a cluster.
  4. Cluster Assignment:
    Data points are assigned to clusters based on the algorithm’s criteria.
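
A minimal sketch tying these components together, assuming made-up customer features (age, income, spending score): scikit-learn's KMeans partitions the points into K clusters and assigns each point to the centroid with the smallest Euclidean distance.

```python
# Minimal K-Means sketch: each row is a data point with features
# [age, annual income (k$), spending score]; the values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [25, 30, 80], [27, 32, 75], [23, 28, 85],   # younger, lower income, higher spending
    [45, 90, 20], [50, 95, 15], [48, 88, 25],   # older, higher income, lower spending
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster assignments:", kmeans.labels_)    # one cluster index per data point
print("Cluster centroids:\n", kmeans.cluster_centers_)

# Assignment criterion: each point belongs to the centroid at the smallest Euclidean distance.
distances = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
print("Nearest centroid per point:", distances.argmin(axis=1))
```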


Evaluation Metrics: Clustering quality is commonly assessed with the silhouette score, the Davies–Bouldin index, and the within-cluster sum of squares (WCSS).
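
As an illustrative sketch, these metrics can be computed with scikit-learn on the output of a clustering run; the two-blob data below is randomly generated just to exercise the metric calls.

```python
# Clustering evaluation sketch: silhouette score, Davies-Bouldin index, and WCSS.
# The data is random and only illustrates the metric calls.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Silhouette score    :", silhouette_score(X, kmeans.labels_))      # higher is better
print("Davies-Bouldin index:", davies_bouldin_score(X, kmeans.labels_))  # lower is better
print("WCSS (inertia)      :", kmeans.inertia_)  # within-cluster sum of squares
```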

Applications: Customer segmentation, market analysis, image segmentation, document clustering, anomaly detection, and recommendation systems.

Challenges: Handling the curse of dimensionality, noisy data, and the subjective nature of defining similarity metrics.