Explaining algorithms draws on storytelling principles: the spectrum from author-driven to reader-driven narratives, and structures such as the Martini glass, the interactive slideshow, and the drill-down story.
Dimensionality reduction is used in various domains, including document categorization, protein disorder prediction, drug discovery, and machine learning model debugging.
The disadvantages of dimensionality reduction include that the semantics of individual dimensions are hard to preserve, the resulting embeddings are hard to understand and interpret, and the projection error is not visible, which can inspire false confidence.
t-Distributed Stochastic Neighbor Embedding (t-SNE) produces highly clustered, visually striking embeddings, captures local structure well, and is non-linear; however, it may lose global structure in favor of preserving local distances, is computationally more expensive than linear methods, requires setting hyperparameters that influence the quality of the embedding, and is non-deterministic.
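As a minimal sketch of these trade-offs (assuming scikit-learn and its digits toy dataset, both chosen here only for illustration): `perplexity` is the hyperparameter that most shapes the result, and pinning `random_state` makes the otherwise non-deterministic run repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

# Non-linear and non-deterministic: fixing random_state makes the run repeatable,
# and perplexity (roughly, the expected neighborhood size) shapes the clusters.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)  # (1797, 2) array, ready for a scatter plot

print(embedding.shape)
```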
Dimensionality reduction techniques include linear approaches such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), and non-linear approaches such as t-SNE, Uniform Manifold Approximation and Projection (UMAP), and Self-Organizing Maps (SOM).
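As a rough sketch (assuming scikit-learn for PCA, MDS, and t-SNE, and the separate umap-learn package for UMAP; SOMs need another library and are omitted here), these reducers all expose a similar fit_transform interface, which makes it easy to swap them on the same data:

```python
import umap  # from the umap-learn package
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]  # subsample so the slower iterative methods finish quickly

reducers = {
    "PCA (linear)": PCA(n_components=2),
    "MDS (linear)": MDS(n_components=2, random_state=0),
    "t-SNE (non-linear)": TSNE(n_components=2, random_state=0),
    "UMAP (non-linear)": umap.UMAP(n_components=2, random_state=0),
}

for name, reducer in reducers.items():
    emb = reducer.fit_transform(X)
    print(name, emb.shape)  # each yields an (n_samples, 2) embedding
```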
The cons of PCA include that its linear projection limits the structure it can capture, and its embeddings may not separate clusters as distinctly as those of other algorithms.
Embeddings can be useful, but patterns in them should be interpreted with care: hyperparameters really matter, cluster sizes in a t-SNE plot mean nothing, distances between clusters might not mean anything, and random noise doesn't always look random.
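A small sketch of the "random noise doesn't always look random" caveat (assuming scikit-learn and matplotlib, with arbitrary dimensions and perplexity values chosen for illustration): projecting pure uniform noise with t-SNE at several perplexities can produce apparent clusters that are artifacts of the hyperparameter choice, not structure in the data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
noise = rng.random((300, 50))  # 300 points of pure uniform noise in 50 dimensions

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [2, 30, 100]):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(noise)
    ax.scatter(emb[:, 0], emb[:, 1], s=5)
    ax.set_title(f"perplexity={perplexity}")  # low values show spurious "clusters"
plt.show()
```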
The pros of PCA are that it is relatively computationally cheap, the fitted embedding model can be saved and reused to project new data points into the reduced space, and the result can be used to cluster data.
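A minimal sketch of the reuse property (assuming scikit-learn and an arbitrary train/new split of the digits data): the PCA model fitted on one batch can project previously unseen points into the same reduced space, something t-SNE's fit_transform-only interface does not offer.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_train, X_new = X[:1500], X[1500:]  # pretend the tail arrives later

pca = PCA(n_components=2)
emb_train = pca.fit_transform(X_train)  # cheap: essentially one eigendecomposition

# The fitted model can be kept (e.g. pickled) and reused to project new points
# into the same 2-D space, so old and new embeddings stay directly comparable.
emb_new = pca.transform(X_new)
print(emb_train.shape, emb_new.shape)
```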
The disadvantages of UMAP include the need to set hyperparameters that influence the quality of the embedding, and the fact that the algorithm is non-deterministic.
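As a brief sketch (assuming the umap-learn package and the same digits data; the parameter values are illustrative defaults, not recommendations): n_neighbors and min_dist are the hyperparameters that most influence the embedding, and fixing random_state trades some speed for reproducibility of the otherwise non-deterministic result.

```python
import umap  # from the umap-learn package
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# n_neighbors balances local vs. global structure; min_dist controls how tightly
# points are packed. random_state pins the stochastic optimization for repeat runs.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (1797, 2)
```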