
Dimensionality Reduction: Comprehensive Implementation and Analysis

Python 3.8+ · License: MIT · GitHub · Hugging Face

A comprehensive implementation and analysis of dimensionality reduction techniques including PCA, t-SNE, UMAP, and Autoencoders. This repository demonstrates the theory, implementation, and evaluation of these methods on standard datasets.

🎯 Overview

Dimensionality reduction is crucial in machine learning for:

  • Data Visualization: Projecting high-dimensional data to 2D/3D for human interpretation
  • Computational Efficiency: Reducing feature space for faster processing
  • Noise Reduction: Eliminating redundant or noisy features
  • Storage Optimization: Compressing data while preserving essential information

This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.

📊 Methods Implemented

1. Principal Component Analysis (PCA)

  • Type: Linear dimensionality reduction
  • Key Feature: Finds directions of maximum variance
  • Best For: Data with linear structure, feature compression
  • Results:
    • Iris: 97.5% accuracy retention with 2 components
    • Digits: 52.4% accuracy retention with 2 components
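
A minimal sketch of the PCA workflow with scikit-learn (standardization and the component count here are illustrative choices, not necessarily the notebook's exact settings):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the 2 orthogonal directions of maximum variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_.sum())  # variance kept by 2 components
```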

2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

  • Type: Non-linear manifold learning
  • Key Feature: Preserves local neighborhood structure
  • Best For: Data visualization, clustering analysis
  • Results:
    • Iris: 105.0% accuracy retention (above 100% means the 2-D embedding separated the classes better than the original features)
    • Digits: 100.4% accuracy retention
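
A minimal sketch with scikit-learn's TSNE (the perplexity and seed values are illustrative, not necessarily the notebook's settings):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# t-SNE embeds this dataset only: there is no transform() for
# unseen samples, so all points must be embedded in one call.
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```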

3. UMAP (Uniform Manifold Approximation and Projection)

  • Type: Non-linear manifold learning
  • Key Feature: Preserves both local and global structure
  • Best For: Balanced visualization, scalable to large datasets
  • Results:
    • Iris: 102.5% accuracy retention
    • Digits: 99.2% accuracy retention
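
A minimal sketch with umap-learn (n_neighbors and min_dist are simply the library defaults, written out explicitly):

```python
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                    random_state=42)
X_2d = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can embed new samples.
X_new_2d = reducer.transform(X[:10])
```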

4. Autoencoder (Neural Network)

  • Type: Non-linear neural network approach
  • Key Feature: Learns optimal encoding through reconstruction
  • Best For: Complex non-linear relationships, customizable architectures
  • Architecture: Input → 128 → 64 → Encoding → 64 → 128 → Output
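
A minimal PyTorch sketch of this architecture (the layer sizes follow the description above; the ReLU activations and MSE reconstruction loss are assumptions):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, encoding_dim: int = 2):
        super().__init__()
        # Input → 128 → 64 → Encoding
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        # Encoding → 64 → 128 → Output
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(input_dim=64)   # 64 features = the Digits dataset
x = torch.randn(8, 64)              # dummy batch
loss = nn.MSELoss()(model(x), x)    # reconstruction objective
loss.backward()
```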

πŸ—‚οΈ Project Structure

dimensionality-reduction/
├── implementation.ipynb          # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log  # Detailed execution logs
├── models/                       # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                      # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/               # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                     # This file

🚀 Quick Start

Prerequisites

pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision

Running the Analysis

  1. Clone the repository:

    git clone https://github.com/GruheshKurra/dimensionality-reduction.git
    cd dimensionality-reduction
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Run the complete analysis:

    jupyter notebook implementation.ipynb
    

    Or execute the main script:

    python main.py
    

📈 Results Summary

Dataset Information

  • Iris Dataset: 150 samples, 4 features, 3 classes
  • Digits Dataset: 1797 samples, 64 features, 10 classes

Performance Comparison (Accuracy Retention)

| Method | Iris Dataset | Digits Dataset |
| ------ | ------------ | -------------- |
| PCA    | 97.5%        | 52.4%          |
| t-SNE  | 105.0%       | 100.4%         |
| UMAP   | 102.5%       | 99.2%          |
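
"Accuracy retention" is most naturally read, given the values above 100%, as the ratio of a classifier's accuracy on the reduced features to its accuracy on the original features. A sketch under that assumption (the k-NN classifier and split are illustrative; the notebook's evaluation may differ):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

def accuracy(train_feats, test_feats):
    clf = KNeighborsClassifier().fit(train_feats, y_tr)
    return clf.score(test_feats, y_te)

baseline = accuracy(X_tr, X_te)     # accuracy on the original features

pca = PCA(n_components=2).fit(X_tr)
reduced = accuracy(pca.transform(X_tr), pca.transform(X_te))

print(f"accuracy retention: {reduced / baseline:.1%}")
```

Dividing by the baseline is what lets retention exceed 100% when an embedding genuinely helps the classifier.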

Key Insights

  • PCA works well for low-dimensional data (Iris) but struggles with high-dimensional complex patterns (Digits)
  • t-SNE excels at preserving local structure, sometimes even improving classification performance
  • UMAP provides excellent balance between local and global structure preservation
  • Autoencoders offer flexibility but require careful tuning

πŸ” Detailed Analysis

PCA Explained Variance

  • Iris: First 2 components explain 95.8% of variance
  • Digits: First 2 components explain only 21.6% of variance
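
Both figures can be read directly off scikit-learn's explained_variance_ratio_, for example:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

# Cumulative variance explained by the first two components.
# (Standardizing the features first changes the exact figure.)
print(pca.explained_variance_ratio_[:2].sum())
```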

Method Characteristics

| Aspect           | PCA    | t-SNE      | UMAP       | Autoencoder |
| ---------------- | ------ | ---------- | ---------- | ----------- |
| Linearity        | Linear | Non-linear | Non-linear | Non-linear  |
| Speed            | Fast   | Slow       | Medium     | Medium      |
| Deterministic    | Yes    | No         | Yes*       | Yes*        |
| New Data         | ✅     | ❌         | ✅         | ✅          |
| Interpretability | High   | Low        | Medium     | Low         |

*With fixed random seed
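
In practice the asterisk amounts to fixing the relevant seeds, for example:

```python
import torch
import umap
from sklearn.manifold import TSNE

torch.manual_seed(42)                                 # autoencoder weight init
tsne = TSNE(n_components=2, random_state=42)          # repeatable for same data/params
reducer = umap.UMAP(n_components=2, random_state=42)  # deterministic UMAP
```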

📖 Educational Content

The implementation.ipynb notebook includes:

  1. Theory Explanation: Mathematical foundations and intuitive explanations
  2. Step-by-step Implementation: Detailed code with comprehensive comments
  3. Visual Comparisons: Side-by-side plots showing method differences
  4. Performance Evaluation: Classification accuracy retention analysis
  5. Best Practices: When to use each method and parameter selection

πŸ› οΈ Technical Details

Dependencies

  • numpy: Numerical computing
  • pandas: Data manipulation
  • scikit-learn: Machine learning algorithms
  • matplotlib, seaborn: Data visualization
  • umap-learn: UMAP implementation
  • torch: Neural network autoencoder
  • plotly: Interactive visualizations

Key Features

  • Comprehensive Logging: Detailed execution logs for reproducibility
  • Model Persistence: Save and load trained models
  • Evaluation Framework: Systematic performance comparison
  • Visualization Suite: Publication-quality plots
  • Structured Results: JSON summary for further analysis
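
Given the .pkl/.pth split in the project tree, model persistence presumably follows the standard patterns sketched below (the notebook's exact save/load code may differ):

```python
import os
import pickle
import torch
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

os.makedirs("models", exist_ok=True)

# scikit-learn / umap-learn models are pickled to .pkl files.
pca = PCA(n_components=2).fit(load_iris(return_X_y=True)[0])
with open("models/pca_iris.pkl", "wb") as f:
    pickle.dump(pca, f)
with open("models/pca_iris.pkl", "rb") as f:
    pca = pickle.load(f)

# PyTorch models are saved as .pth state dicts.
net = torch.nn.Linear(4, 2)  # stand-in for the autoencoder
torch.save(net.state_dict(), "models/autoencoder_iris.pth")
net.load_state_dict(torch.load("models/autoencoder_iris.pth"))
```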

🎓 Learning Outcomes

After working through this project, you will understand:

  1. Mathematical Foundations: How each method works mathematically
  2. Implementation Details: How to implement these methods from scratch
  3. Performance Trade-offs: When to use each method
  4. Evaluation Strategies: How to assess dimensionality reduction quality
  5. Practical Applications: Real-world use cases and considerations

🤝 Contributing

Contributions are welcome! Please feel free to:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Contact

For questions or feedback, please:

  • Open an issue on GitHub
  • Contact the maintainer: Karthik

Note: This is an educational project designed to demonstrate dimensionality reduction techniques. The implementations prioritize clarity and understanding over performance optimization.
