# Dimensionality Reduction: Comprehensive Implementation and Analysis
A comprehensive implementation and analysis of dimensionality reduction techniques including PCA, t-SNE, UMAP, and Autoencoders. This repository demonstrates the theory, implementation, and evaluation of these methods on standard datasets.
## Overview
Dimensionality reduction is crucial in machine learning for:
- Data Visualization: Projecting high-dimensional data to 2D/3D for human interpretation
- Computational Efficiency: Reducing feature space for faster processing
- Noise Reduction: Eliminating redundant or noisy features
- Storage Optimization: Compressing data while preserving essential information
This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.
## Methods Implemented
### 1. Principal Component Analysis (PCA)
- Type: Linear dimensionality reduction
- Key Feature: Finds directions of maximum variance
- Best For: Data with linear structure, feature compression
- Results:
- Iris: 97.5% accuracy retention with 2 components
- Digits: 52.4% accuracy retention with 2 components
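
A minimal sketch of the PCA workflow (using scikit-learn; the notebook's exact preprocessing and parameters may differ):

```python
# Project the Iris data onto its first two principal components (illustrative only).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)             # shape: (150, 2)
```

The fitted `pca` object can also project previously unseen samples via `pca.transform(...)`, which is what makes PCA usable beyond visualization.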
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Type: Non-linear manifold learning
- Key Feature: Preserves local neighborhood structure
- Best For: Data visualization, clustering analysis
- Results:
- Iris: 105.0% accuracy retention
- Digits: 100.4% accuracy retention
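
A minimal t-SNE sketch (scikit-learn; the perplexity and other settings shown here are illustrative assumptions, not necessarily the notebook's values):

```python
# Embed the 64-dimensional Digits features in 2-D with t-SNE.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
# X_2d is an embedding of the fitted data only; t-SNE offers no transform() for new samples.
```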
### 3. UMAP (Uniform Manifold Approximation and Projection)
- Type: Non-linear manifold learning
- Key Feature: Preserves both local and global structure
- Best For: Balanced visualization, scalable to large datasets
- Results:
- Iris: 102.5% accuracy retention
- Digits: 99.2% accuracy retention
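
A minimal UMAP sketch (umap-learn; the `n_neighbors` and `min_dist` values are illustrative):

```python
# Embed the Digits features in 2-D with UMAP.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)
# Unlike t-SNE, the fitted reducer can also embed unseen samples via reducer.transform(...).
```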
### 4. Autoencoder (Neural Network)
- Type: Non-linear neural network approach
- Key Feature: Learns optimal encoding through reconstruction
- Best For: Complex non-linear relationships, customizable architectures
- Architecture: Input → 128 → 64 → Encoding → 64 → 128 → Output
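
A sketch of that symmetric architecture in PyTorch (layer widths follow the README; the activation choices and training details are assumptions):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, encoding_dim: int):
        super().__init__()
        # Encoder: Input -> 128 -> 64 -> Encoding
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        # Decoder mirrors the encoder: Encoding -> 64 -> 128 -> Output
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # low-dimensional code used as the reduced representation
        return self.decoder(z)   # reconstruction of the input

# Example: compress the 64-dimensional Digits features to a 2-D code.
model = Autoencoder(input_dim=64, encoding_dim=2)
loss_fn = nn.MSELoss()           # trained by minimizing reconstruction error
```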
## Project Structure

```
dimensionality-reduction/
├── implementation.ipynb              # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log     # Detailed execution logs
├── models/                           # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                          # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                   # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                         # This file
```
## Quick Start

### Prerequisites

```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```

### Running the Analysis

1. Clone the repository:
   ```bash
   git clone https://github.com/GruheshKurra/dimensionality-reduction.git
   cd dimensionality-reduction
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the complete analysis:
   ```bash
   jupyter notebook implementation.ipynb
   ```
   Or execute the main script:
   ```bash
   python main.py
   ```
## Results Summary
### Dataset Information
- Iris Dataset: 150 samples, 4 features, 3 classes
- Digits Dataset: 1797 samples, 64 features, 10 classes
### Performance Comparison (Accuracy Retention)
| Method | Iris Dataset | Digits Dataset |
|---|---|---|
| PCA | 97.5% | 52.4% |
| t-SNE | 105.0% | 100.4% |
| UMAP | 102.5% | 99.2% |
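
Accuracy retention compares a classifier's test accuracy on the reduced features with its accuracy on the original features. A sketch of one such protocol (a k-NN classifier and train/test split are assumptions here; the notebook defines the exact setup):

```python
# Compute accuracy retention for PCA on Iris (illustrative protocol).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

baseline = KNeighborsClassifier().fit(X_tr, y_tr)
acc_full = accuracy_score(y_te, baseline.predict(X_te))

pca = PCA(n_components=2).fit(X_tr)
reduced = KNeighborsClassifier().fit(pca.transform(X_tr), y_tr)
acc_reduced = accuracy_score(y_te, reduced.predict(pca.transform(X_te)))

print(f"Accuracy retention: {100 * acc_reduced / acc_full:.1f}%")
```

Values above 100% simply mean the classifier performed better on the embedding than on the raw features, which can happen when the reduction sharpens class separation.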
### Key Insights
- PCA works well for low-dimensional data (Iris) but struggles with high-dimensional complex patterns (Digits)
- t-SNE excels at preserving local structure, sometimes even improving classification performance
- UMAP provides excellent balance between local and global structure preservation
- Autoencoders offer flexibility but require careful tuning
## Detailed Analysis

### PCA Explained Variance
- Iris: First 2 components explain 95.8% of variance
- Digits: First 2 components explain only 21.6% of variance
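
The cumulative explained-variance curve behind `pca_explained_variance.png` can be computed along these lines (an illustrative sketch; the notebook's preprocessing may change the exact numbers):

```python
# Cumulative explained variance for the Digits dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                                    # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"First 2 components: {cumulative[1]:.1%} of total variance")
```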
### Method Characteristics
| Aspect | PCA | t-SNE | UMAP | Autoencoder |
|---|---|---|---|---|
| Linearity | Linear | Non-linear | Non-linear | Non-linear |
| Speed | Fast | Slow | Medium | Medium |
| Deterministic | Yes | No | Yes* | Yes* |
| New Data | ✓ | ✗ | ✓ | ✓ |
| Interpretability | High | Low | Medium | Low |
*With fixed random seed
## Educational Content
The implementation.ipynb notebook includes:
- Theory Explanation: Mathematical foundations and intuitive explanations
- Step-by-step Implementation: Detailed code with comprehensive comments
- Visual Comparisons: Side-by-side plots showing method differences
- Performance Evaluation: Classification accuracy retention analysis
- Best Practices: When to use each method and parameter selection
## Technical Details

### Dependencies

- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations
### Key Features
- Comprehensive Logging: Detailed execution logs for reproducibility
- Model Persistence: Save and load trained models
- Evaluation Framework: Systematic performance comparison
- Visualization Suite: Publication-quality plots
- Structured Results: JSON summary for further analysis
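
The artifacts under `models/` can be produced and reloaded roughly as follows (using `joblib` for the scikit-learn/UMAP objects and `torch.save` for the autoencoder weights is an assumption about the serialization used):

```python
import joblib
import torch
import torch.nn as nn
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Fit and persist a scikit-learn model (stored as .pkl).
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
joblib.dump(pca, "pca_iris.pkl")
pca_restored = joblib.load("pca_iris.pkl")

# Persist PyTorch weights (stored as .pth); the architecture must be rebuilt before loading.
net = nn.Sequential(nn.Linear(4, 2), nn.Linear(2, 4))
torch.save(net.state_dict(), "autoencoder_iris.pth")
net.load_state_dict(torch.load("autoencoder_iris.pth"))
```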
## Learning Outcomes
After working through this project, you will understand:
- Mathematical Foundations: How each method works mathematically
- Implementation Details: How to implement these methods from scratch
- Performance Trade-offs: When to use each method
- Evaluation Strategies: How to assess dimensionality reduction quality
- Practical Applications: Real-world use cases and considerations
## Contributing
Contributions are welcome! Please feel free to:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Links
- GitHub Repository: dimensionality-reduction
- Hugging Face Space: karthik-2905/dimensionality-reduction
- Documentation: Implementation Notebook
## Contact
For questions or feedback, please:
- Open an issue on GitHub
- Contact the maintainer: Karthik
**Note:** This is an educational project designed to demonstrate dimensionality reduction techniques. The implementations prioritize clarity and understanding over performance optimization.