Commit 6880cd9
Parent(s): none (initial commit)
first commit to AI repo

Files changed:
- README.md +196 -0
- api/__init__.py +1 -0
- api/__pycache__/__init__.cpython-312.pyc +0 -0
- api/__pycache__/main.cpython-312.pyc +0 -0
- api/__pycache__/pdf_processor.cpython-312.pyc +0 -0
- api/__pycache__/summarizer.cpython-312.pyc +0 -0
- api/__pycache__/utils.cpython-312.pyc +0 -0
- api/main.py +224 -0
- api/pdf_processor.py +172 -0
- api/summarizer.py +226 -0
- api/utils.py +124 -0
- app.py +314 -0
- requirements.txt +12 -0
- start.bat +32 -0
- start.py +135 -0
- start.sh +28 -0
README.md
ADDED
@@ -0,0 +1,196 @@
# Book Summarizer AI

An intelligent web application that extracts text from PDF books and generates comprehensive summaries using state-of-the-art AI models.

## Features

- **PDF Text Extraction**: Advanced PDF processing with multiple extraction methods
- **AI-Powered Summarization**: Uses transformer models (BART, T5) for high-quality summaries
- **Beautiful Web Interface**: Modern UI built with Streamlit
- **FastAPI Backend**: Scalable and fast API for processing
- **Configurable Settings**: Adjust summary length, chunk size, and AI models
- **Text Analysis**: Detailed statistics about book content
- **Download Summaries**: Save summaries as text files

## Quick Start

### Option 1: Automated Setup (Recommended)

**Windows:**
```bash
# Double-click start.bat or run:
start.bat
```

**Unix/Linux/Mac:**
```bash
# Make the script executable and run it:
chmod +x start.sh
./start.sh
```

### Option 2: Manual Setup

1. **Install dependencies:**
```bash
pip install -r requirements.txt
```

2. **Download NLTK data:**
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

3. **Start the FastAPI backend:**
```bash
uvicorn api.main:app --reload --port 8000
```

4. **Start the Streamlit frontend:**
```bash
streamlit run app.py
```

5. **Open your browser:**
   - Frontend: http://localhost:8501
   - API Docs: http://localhost:8000/docs

## Usage

1. **Upload PDF**: Select a PDF book file (max 50MB)
2. **Configure Settings**: Choose an AI model and summary parameters
3. **Generate Summary**: Click "Generate Summary" and wait for processing
4. **Download Result**: Save your AI-generated summary

## Technology Stack

### Frontend
- **Streamlit**: Modern web interface
- **Custom CSS**: Beautiful styling and responsive design

### Backend
- **FastAPI**: High-performance API framework
- **Uvicorn**: ASGI server for FastAPI

### AI & ML
- **Hugging Face Transformers**: State-of-the-art NLP models
- **PyTorch**: Deep learning framework
- **BART/T5 Models**: Pre-trained summarization models

### PDF Processing
- **PyPDF2**: PDF text extraction
- **pdfplumber**: Advanced PDF processing
- **NLTK**: Natural language processing

## Project Structure

```
book-summarizer/
├── app.py               # Streamlit frontend
├── start.py             # Automated startup script
├── start.bat            # Windows startup script
├── start.sh             # Unix/Linux/Mac startup script
├── api/
│   ├── __init__.py      # API package
│   ├── main.py          # FastAPI backend
│   ├── pdf_processor.py # PDF text extraction
│   ├── summarizer.py    # AI summarization logic
│   └── utils.py         # Utility functions
├── requirements.txt     # Python dependencies
└── README.md            # Project documentation
```

## Configuration

### AI Models
- **facebook/bart-large-cnn**: Best quality, slower processing
- **t5-small**: Faster processing, good quality
- **facebook/bart-base**: Balanced performance

### Summary Settings
- **Max Length**: 50-500 words (default: 150)
- **Min Length**: 10-200 words (default: 50)
- **Chunk Size**: 500-2000 characters (default: 1000)
- **Overlap**: 50-200 characters (default: 100)
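The interaction between **Chunk Size** and **Overlap** can be illustrated in a few lines: each chunk repeats the tail of the previous one so sentences cut at a boundary keep some context. This is only a minimal sketch of character-based overlapping chunking — the project's actual `chunk_text` helper lives in `api/utils.py` and is not shown in this commit, so `chunk_with_overlap` below is an illustration, not the real implementation.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of roughly chunk_size characters, where each
    chunk begins with the last `overlap` characters of the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

With the defaults (1000/100), a 2500-character text yields three chunks starting at offsets 0, 900, and 1800, and each consecutive pair shares 100 characters.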

## API Endpoints

- `GET /` - API information
- `GET /health` - Health check
- `POST /upload-pdf` - Validate a PDF file
- `POST /extract-text` - Extract text from a PDF
- `POST /summarize` - Generate a book summary
- `GET /models` - List available AI models
- `POST /change-model` - Switch the AI model

## Requirements

- **Python**: 3.8 or higher
- **Memory**: At least 4GB RAM (8GB recommended)
- **Storage**: 2GB free space for models
- **Internet**: Required for first-time model download

## Troubleshooting

### Common Issues

1. **"Module not found" errors:**
```bash
pip install -r requirements.txt
```

2. **NLTK data missing:**
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

3. **API connection failed:**
   - Ensure FastAPI is running on port 8000
   - Check firewall settings
   - Verify no other service is using the port

4. **Large PDFs process slowly:**
   - Reduce the chunk size in advanced settings
   - Use a faster model (t5-small)
   - Ensure sufficient RAM

5. **Model download issues:**
   - Check your internet connection
   - Clear the Hugging Face cache: `rm -rf ~/.cache/huggingface`

### Performance Tips

- **GPU Acceleration**: Install CUDA for faster processing
- **Model Selection**: Use smaller models for faster results
- **Chunk Size**: Smaller chunks process faster but may lose context
- **Memory**: Close other applications to free up RAM

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is open source and available under the MIT License.

## Acknowledgments

- Hugging Face for transformer models
- Streamlit for the web framework
- FastAPI for the backend framework
- The open-source community for various libraries

## Support

For issues, questions, or feature requests:
1. Check the troubleshooting section
2. Review the API documentation at `/docs`
3. Open an issue on GitHub

---

**Happy summarizing!**
api/__init__.py
ADDED
@@ -0,0 +1 @@
# API package for Book Summarizer

api/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (145 Bytes)

api/__pycache__/main.cpython-312.pyc
ADDED
Binary file (9.14 kB)

api/__pycache__/pdf_processor.cpython-312.pyc
ADDED
Binary file (6.86 kB)

api/__pycache__/summarizer.cpython-312.pyc
ADDED
Binary file (8.62 kB)

api/__pycache__/utils.cpython-312.pyc
ADDED
Binary file (3.88 kB)
api/main.py
ADDED
@@ -0,0 +1,224 @@
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Dict, Any, Optional
import logging

from .pdf_processor import PDFProcessor
from .summarizer import BookSummarizer

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI app
app = FastAPI(
    title="Book Summarizer API",
    description="AI-powered book summarization service",
    version="1.0.0"
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, specify your frontend URL
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize components
pdf_processor = PDFProcessor()
summarizer = BookSummarizer()

# Pydantic models
class SummaryRequest(BaseModel):
    max_length: int = 150
    min_length: int = 50
    chunk_size: int = 1000
    overlap: int = 100
    model_name: Optional[str] = None

class SummaryResponse(BaseModel):
    success: bool
    summary: str
    statistics: Dict[str, Any]
    message: str

@app.on_event("startup")
async def startup_event():
    """Initialize components on startup."""
    logger.info("Starting Book Summarizer API...")
    try:
        # Load the summarization model
        summarizer.load_model()
        logger.info("API startup completed successfully")
    except Exception as e:
        logger.error(f"Error during startup: {str(e)}")

@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "message": "Book Summarizer API",
        "version": "1.0.0",
        "status": "running"
    }

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": summarizer.summarizer is not None
    }

@app.post("/upload-pdf")
async def upload_pdf(file: UploadFile = File(...)):
    """Upload and validate a PDF file."""
    try:
        # Check file type
        if not file.filename.lower().endswith('.pdf'):
            raise HTTPException(status_code=400, detail="Only PDF files are supported")

        # Read file content
        content = await file.read()

        # Validate PDF
        validation_result = pdf_processor.validate_pdf(content)
        if not validation_result['valid']:
            raise HTTPException(status_code=400, detail=validation_result['message'])

        # Extract metadata
        metadata = pdf_processor.get_pdf_metadata(content)

        return {
            "success": True,
            "filename": file.filename,
            "size_mb": validation_result['size_mb'],
            "pages": validation_result['pages'],
            "metadata": metadata,
            "message": "PDF uploaded and validated successfully"
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error uploading PDF: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")

@app.post("/extract-text")
async def extract_text(file: UploadFile = File(...)):
    """Extract text from an uploaded PDF."""
    try:
        # Read file content
        content = await file.read()

        # Extract text
        result = pdf_processor.extract_text_from_pdf(content)

        if not result['success']:
            raise HTTPException(status_code=400, detail=result['message'])

        return {
            "success": True,
            "text_length": len(result['text']),
            "statistics": result['statistics'],
            "pages": result['pages'],
            "message": result['message']
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error extracting text: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error extracting text: {str(e)}")

@app.post("/summarize")
async def summarize_book(
    file: UploadFile = File(...),
    request: SummaryRequest = SummaryRequest()
):
    """Summarize a book from an uploaded PDF."""
    try:
        # Read file content
        content = await file.read()

        # Extract text
        extraction_result = pdf_processor.extract_text_from_pdf(content)
        if not extraction_result['success']:
            raise HTTPException(status_code=400, detail=extraction_result['message'])

        # Change model if specified
        if request.model_name:
            summarizer.change_model(request.model_name)

        # Summarize the book
        summary_result = summarizer.summarize_book(
            text=extraction_result['text'],
            chunk_size=request.chunk_size,
            overlap=request.overlap,
            max_length=request.max_length,
            min_length=request.min_length
        )

        if not summary_result['success']:
            raise HTTPException(status_code=500, detail=summary_result.get('error', 'Summarization failed'))

        return {
            "success": True,
            "summary": summary_result['summary'],
            "statistics": summary_result['statistics'],
            "original_statistics": extraction_result['statistics'],
            "message": "Book summarized successfully"
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error summarizing book: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error summarizing book: {str(e)}")

@app.get("/models")
async def get_available_models():
    """Get the list of available summarization models."""
    try:
        models = summarizer.get_available_models()
        return {
            "success": True,
            "models": models,
            "current_model": summarizer.model_name
        }
    except Exception as e:
        logger.error(f"Error getting models: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error getting models: {str(e)}")

@app.post("/change-model")
async def change_model(model_name: str):
    """Change the summarization model."""
    try:
        summarizer.change_model(model_name)
        summarizer.load_model()

        return {
            "success": True,
            "message": f"Model changed to {model_name}",
            "current_model": model_name
        }
    except Exception as e:
        logger.error(f"Error changing model: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error changing model: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
api/pdf_processor.py
ADDED
@@ -0,0 +1,172 @@
import PyPDF2
import pdfplumber
import io
from typing import Dict, Any
import logging
from .utils import clean_text, get_text_statistics

logger = logging.getLogger(__name__)

class PDFProcessor:
    """Handles PDF text extraction and processing."""

    def __init__(self):
        self.supported_formats = ['.pdf']

    def extract_text_from_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
        """
        Extract text from PDF file bytes.

        Args:
            pdf_file: PDF file as bytes

        Returns:
            Dictionary containing the extracted text and metadata
        """
        try:
            # Try pdfplumber first (better for complex layouts)
            text = self._extract_with_pdfplumber(pdf_file)

            if not text or len(text.strip()) < 100:
                # Fall back to PyPDF2
                text = self._extract_with_pypdf2(pdf_file)

            if not text:
                raise ValueError("Could not extract text from PDF")

            # Clean the extracted text
            cleaned_text = clean_text(text)

            # Get text statistics
            stats = get_text_statistics(cleaned_text)

            return {
                'success': True,
                'text': cleaned_text,
                'statistics': stats,
                'pages': self._get_page_count(pdf_file),
                'message': 'Text extracted successfully'
            }

        except Exception as e:
            logger.error(f"Error extracting text from PDF: {str(e)}")
            return {
                'success': False,
                'text': '',
                'statistics': {},
                'pages': 0,
                'message': f'Error extracting text: {str(e)}'
            }

    def _extract_with_pdfplumber(self, pdf_file: bytes) -> str:
        """Extract text using pdfplumber (better for complex layouts)."""
        text_parts = []

        try:
            with pdfplumber.open(io.BytesIO(pdf_file)) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)

            return '\n'.join(text_parts)
        except Exception as e:
            logger.warning(f"pdfplumber extraction failed: {str(e)}")
            return ""

    def _extract_with_pypdf2(self, pdf_file: bytes) -> str:
        """Extract text using PyPDF2 (fallback method)."""
        text_parts = []

        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))

            for page in pdf_reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text_parts.append(page_text)

            return '\n'.join(text_parts)
        except Exception as e:
            logger.warning(f"PyPDF2 extraction failed: {str(e)}")
            return ""

    def _get_page_count(self, pdf_file: bytes) -> int:
        """Get the number of pages in the PDF."""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
            return len(pdf_reader.pages)
        except Exception:
            return 0

    def get_pdf_metadata(self, pdf_file: bytes) -> Dict[str, Any]:
        """Extract metadata from a PDF file."""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
            metadata = pdf_reader.metadata

            return {
                'title': metadata.get('/Title', 'Unknown'),
                'author': metadata.get('/Author', 'Unknown'),
                'subject': metadata.get('/Subject', ''),
                'creator': metadata.get('/Creator', ''),
                'producer': metadata.get('/Producer', ''),
                'pages': len(pdf_reader.pages)
            }
        except Exception as e:
            logger.error(f"Error extracting PDF metadata: {str(e)}")
            return {
                'title': 'Unknown',
                'author': 'Unknown',
                'subject': '',
                'creator': '',
                'producer': '',
                'pages': 0
            }

    def validate_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
        """Validate the PDF file and check whether it can be processed."""
        try:
            # Check file size
            file_size = len(pdf_file)
            max_size = 50 * 1024 * 1024  # 50MB limit

            if file_size > max_size:
                return {
                    'valid': False,
                    'message': f'File too large. Maximum size is 50MB, got {file_size / (1024*1024):.1f}MB'
                }

            # Try to read the PDF
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))

            if len(pdf_reader.pages) == 0:
                return {
                    'valid': False,
                    'message': 'PDF appears to be empty or corrupted'
                }

            return {
                'valid': True,
                'message': 'PDF is valid',
                'pages': len(pdf_reader.pages),
                'size_mb': file_size / (1024 * 1024)
            }

        except Exception as e:
            return {
                'valid': False,
                'message': f'Invalid PDF file: {str(e)}'
            }
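The strategy in `extract_text_from_pdf` — try pdfplumber first, then fall back to PyPDF2 when the result is empty or under 100 characters — is an instance of a general try-extractors-in-order pattern. A minimal self-contained sketch of that pattern; the `primary`/`fallback` functions below are hypothetical stand-ins, not the project's real extractors:

```python
from typing import Callable, List

def extract_with_fallback(data: bytes,
                          extractors: List[Callable[[bytes], str]],
                          min_chars: int = 100) -> str:
    """Try each extractor in order; accept the first result that is at
    least min_chars long after stripping. If none qualifies, return the
    longest partial result rather than nothing."""
    best = ""
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # a failing extractor just means "try the next one"
        if text and len(text.strip()) >= min_chars:
            return text
        if len(text or "") > len(best):
            best = text
    return best

# Hypothetical extractors standing in for pdfplumber / PyPDF2:
def primary(data: bytes) -> str:
    raise RuntimeError("layout too complex")

def fallback(data: bytes) -> str:
    return data.decode("utf-8", errors="ignore")
```

Returning the longest partial result on total failure is one possible design choice; the project instead raises `ValueError` when no extractor yields any text.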
api/summarizer.py
ADDED
@@ -0,0 +1,226 @@
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from typing import Dict, Any
import torch
import logging
from .utils import chunk_text

logger = logging.getLogger(__name__)

class BookSummarizer:
    """Handles AI-powered text summarization using transformer models."""

    def __init__(self, model_name: str = "facebook/bart-large-cnn"):
        """
        Initialize the summarizer with a specific model.

        Args:
            model_name: Hugging Face model name for summarization
        """
        self.model_name = model_name
        self.summarizer = None
        self.tokenizer = None
        self.model = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        logger.info(f"Initializing summarizer with model: {model_name}")
        logger.info(f"Using device: {self.device}")

    def load_model(self):
        """Load the summarization model and tokenizer."""
        try:
            logger.info("Loading summarization model...")

            # Load tokenizer and model
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)

            # Move the model to the appropriate device
            self.model.to(self.device)

            # Create the pipeline
            self.summarizer = pipeline(
                "summarization",
                model=self.model,
                tokenizer=self.tokenizer,
                device=0 if self.device == "cuda" else -1
            )

            logger.info("Model loaded successfully")

        except Exception as e:
            logger.error(f"Error loading model: {str(e)}")
            raise

    def summarize_text(self, text: str, max_length: int = 150, min_length: int = 50,
                       do_sample: bool = False) -> Dict[str, Any]:
        """
        Summarize a single text chunk.

        Args:
            text: Text to summarize
            max_length: Maximum length of the summary
            min_length: Minimum length of the summary
            do_sample: Whether to use sampling for generation

        Returns:
            Dictionary containing the summary and metadata
        """
        try:
            if not self.summarizer:
                self.load_model()

            # Return very short texts unchanged
            if len(text.split()) < 50:
                return {
                    'success': True,
                    'summary': text,
                    'original_length': len(text.split()),
                    'summary_length': len(text.split()),
                    'compression_ratio': 1.0
                }

            # Generate the summary
            summary_result = self.summarizer(
                text,
                max_length=max_length,
                min_length=min_length,
                do_sample=do_sample,
                truncation=True
            )

            summary = summary_result[0]['summary_text']

            # Calculate the compression ratio
            original_words = len(text.split())
            summary_words = len(summary.split())
            compression_ratio = summary_words / original_words if original_words > 0 else 0

            return {
                'success': True,
                'summary': summary,
                'original_length': original_words,
                'summary_length': summary_words,
                'compression_ratio': compression_ratio
            }

        except Exception as e:
            logger.error(f"Error summarizing text: {str(e)}")
            return {
                'success': False,
                'summary': '',
                'error': str(e)
            }

    def summarize_book(self, text: str, chunk_size: int = 1000, overlap: int = 100,
                       max_length: int = 150, min_length: int = 50) -> Dict[str, Any]:
        """
        Summarize a complete book by processing it in chunks.

        Args:
            text: Complete book text
            chunk_size: Size of each text chunk
            overlap: Overlap between chunks
            max_length: Maximum length of each summary
            min_length: Minimum length of each summary

        Returns:
            Dictionary containing the complete summary and metadata
        """
        try:
            logger.info("Starting book summarization...")

            # Split the text into chunks
            chunks = chunk_text(text, chunk_size, overlap)
            logger.info(f"Split text into {len(chunks)} chunks")

            # Summarize each chunk
            chunk_summaries = []
            total_original_words = 0
            total_summary_words = 0

            for i, chunk in enumerate(chunks):
                logger.info(f"Processing chunk {i+1}/{len(chunks)}")

                result = self.summarize_text(chunk, max_length, min_length)

                if result['success']:
                    chunk_summaries.append(result['summary'])
                    total_original_words += result['original_length']
                    total_summary_words += result['summary_length']
                else:
                    logger.warning(f"Failed to summarize chunk {i+1}: {result.get('error', 'Unknown error')}")
                    # Fall back to a truncated excerpt of the original chunk
                    chunk_summaries.append(chunk[:200] + "...")

            # Combine all chunk summaries
            combined_summary = " ".join(chunk_summaries)

            # Create a final summary if the combined summary is still too long
            if len(combined_summary.split()) > 500:
                logger.info("Creating final summary from combined summaries...")
                final_result = self.summarize_text(combined_summary, max_length=300, min_length=100)
                if final_result['success']:
                    combined_summary = final_result['summary']

            # Calculate overall statistics
            overall_compression = total_summary_words / total_original_words if total_original_words > 0 else 0

            return {
                'success': True,
                'summary': combined_summary,
                'statistics': {
                    'total_chunks': len(chunks),
                    'total_original_words': total_original_words,
                    'total_summary_words': total_summary_words,
                    'overall_compression_ratio': overall_compression,
                    'final_summary_length': len(combined_summary.split())
                },
                'chunk_summaries': chunk_summaries
            }

        except Exception as e:
| 186 |
+
logger.error(f"Error in book summarization: {str(e)}")
|
| 187 |
+
return {
|
| 188 |
+
'success': False,
|
| 189 |
+
'summary': '',
|
| 190 |
+
'error': str(e)
|
| 191 |
+
}
|
| 192 |
+
|
| 193 |
+
def get_available_models(self) -> List[Dict[str, Union[str, int]]]:
|
| 194 |
+
"""
|
| 195 |
+
Get list of available summarization models.
|
| 196 |
+
"""
|
| 197 |
+
return [
|
| 198 |
+
{
|
| 199 |
+
'name': 'facebook/bart-large-cnn',
|
| 200 |
+
'description': 'BART model fine-tuned on CNN news articles (recommended)',
|
| 201 |
+
'max_length': 1024
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
'name': 't5-small',
|
| 205 |
+
'description': 'Small T5 model, faster but less accurate',
|
| 206 |
+
'max_length': 512
|
| 207 |
+
},
|
| 208 |
+
{
|
| 209 |
+
'name': 'facebook/bart-base',
|
| 210 |
+
'description': 'Base BART model, balanced performance',
|
| 211 |
+
'max_length': 1024
|
| 212 |
+
}
|
| 213 |
+
]
|
| 214 |
+
|
| 215 |
+
def change_model(self, model_name: str):
|
| 216 |
+
"""
|
| 217 |
+
Change the summarization model.
|
| 218 |
+
|
| 219 |
+
Args:
|
| 220 |
+
model_name: New model name to use
|
| 221 |
+
"""
|
| 222 |
+
self.model_name = model_name
|
| 223 |
+
self.summarizer = None
|
| 224 |
+
self.tokenizer = None
|
| 225 |
+
self.model = None
|
| 226 |
+
logger.info(f"Model changed to: {model_name}")
|
api/utils.py
ADDED
@@ -0,0 +1,124 @@
import re
import nltk
from typing import List, Dict, Any
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_text(text: str) -> str:
    """Clean and preprocess text extracted from a PDF."""
    # Remove extra whitespace and normalize
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\n+', '\n', text)
    text = text.strip()

    # Remove common PDF artifacts
    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\{\}]', '', text)

    return text

def chunk_text(text: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """
    Split text into overlapping chunks for processing.

    Args:
        text: Input text to chunk
        max_chunk_size: Maximum size of each chunk
        overlap: Number of characters to overlap between chunks

    Returns:
        List of text chunks
    """
    if len(text) <= max_chunk_size:
        return [text]

    chunks = []
    start = 0

    while start < len(text):
        end = start + max_chunk_size

        # Try to break at sentence boundaries
        if end < len(text):
            sentence_endings = ['.', '!', '?']
            for ending in sentence_endings:
                last_ending = text.rfind(ending, start, end)
                if last_ending > start + max_chunk_size * 0.8:  # Only break if at least 80% through the chunk
                    end = last_ending + 1
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Move the start position forward, keeping the overlap
        start = end - overlap
        if start >= len(text):
            break

    return chunks

def extract_chapters(text: str) -> Dict[str, str]:
    """Attempt to extract chapters from the text."""
    chapters = {}

    # Common chapter heading patterns
    chapter_patterns = [
        r'Chapter\s+(\d+|[IVXLC]+)',
        r'CHAPTER\s+(\d+|[IVXLC]+)',
        r'(\d+)\.\s+[A-Z]',
        r'[IVXLC]+\.\s+[A-Z]'
    ]

    lines = text.split('\n')
    current_chapter = "Introduction"
    current_content = []

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check whether this line is a chapter header
        is_chapter_header = False
        for pattern in chapter_patterns:
            if re.match(pattern, line, re.IGNORECASE):
                # Save the previous chapter
                if current_content:
                    chapters[current_chapter] = '\n'.join(current_content)

                current_chapter = line
                current_content = []
                is_chapter_header = True
                break

        if not is_chapter_header:
            current_content.append(line)

    # Save the last chapter
    if current_content:
        chapters[current_chapter] = '\n'.join(current_content)

    return chapters

def get_text_statistics(text: str) -> Dict[str, Any]:
    """Get basic statistics about the text."""
    words = text.split()
    sentences = nltk.sent_tokenize(text)

    return {
        'total_characters': len(text),
        'total_words': len(words),
        'total_sentences': len(sentences),
        'average_words_per_sentence': len(words) / len(sentences) if sentences else 0,
        'estimated_reading_time_minutes': len(words) / 200  # assumes ~200 words per minute
    }
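To see how the sliding window in `chunk_text` behaves, here is a stripped-down sketch of its core loop. The sentence-boundary search is omitted, so this illustrates only the overlap mechanics, not the exact function:

```python
# Simplified copy of chunk_text's windowing: fixed-size chunks that
# step forward by (max_chunk_size - overlap) characters each iteration.
def chunk_text_simple(text, max_chunk_size=10, overlap=3):
    if len(text) <= max_chunk_size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        end = start + max_chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end - overlap  # re-read the last `overlap` characters
        if start >= len(text):
            break
    return chunks

chunks = chunk_text_simple("abcdefghijklmnop", max_chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnop', 'op']
```

Each chunk repeats the tail of the previous one ("hij" above), which gives the summarizer some shared context across chunk boundaries.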
app.py
ADDED
@@ -0,0 +1,314 @@
import streamlit as st
import requests

# Page configuration
st.set_page_config(
    page_title="Book Summarizer AI",
    page_icon="📚",
    layout="wide",
    initial_sidebar_state="expanded"
)

# API configuration
API_BASE_URL = "http://localhost:8000"

def main():
    # Custom CSS for better styling
    st.markdown("""
    <style>
    .main-header {
        font-size: 3rem;
        font-weight: bold;
        text-align: center;
        color: #1f77b4;
        margin-bottom: 2rem;
    }
    .sub-header {
        font-size: 1.5rem;
        color: #666;
        text-align: center;
        margin-bottom: 2rem;
    }
    .success-box {
        background-color: #d4edda;
        border: 1px solid #c3e6cb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    .error-box {
        background-color: #f8d7da;
        border: 1px solid #f5c6cb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    .info-box {
        background-color: #d1ecf1;
        border: 1px solid #bee5eb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    </style>
    """, unsafe_allow_html=True)

    # Header
    st.markdown('<h1 class="main-header">📚 Book Summarizer AI</h1>', unsafe_allow_html=True)
    st.markdown('<p class="sub-header">Transform your PDF books into intelligent summaries using AI</p>', unsafe_allow_html=True)

    # Sidebar
    with st.sidebar:
        st.header("⚙️ Settings")

        # Model selection
        st.subheader("AI Model")
        try:
            models_response = requests.get(f"{API_BASE_URL}/models")
            if models_response.status_code == 200:
                models_data = models_response.json()
                models = models_data.get('models', [])
                current_model = models_data.get('current_model', '')

                model_names = [model['name'] for model in models]
                selected_model = st.selectbox(
                    "Choose AI Model",
                    model_names,
                    index=model_names.index(current_model) if current_model in model_names else 0
                )

                # Show model description
                selected_model_info = next((m for m in models if m['name'] == selected_model), None)
                if selected_model_info:
                    st.info(f"**{selected_model_info['description']}**")
            else:
                st.error("Failed to load models")
                selected_model = "facebook/bart-large-cnn"
        except requests.exceptions.RequestException as e:
            st.error(f"Error loading models: {str(e)}")
            selected_model = "facebook/bart-large-cnn"

        # Summary settings
        st.subheader("Summary Settings")
        max_length = st.slider("Maximum Summary Length", 50, 500, 150, help="Maximum number of words in the summary")
        min_length = st.slider("Minimum Summary Length", 10, 200, 50, help="Minimum number of words in the summary")

        # Advanced settings
        with st.expander("Advanced Settings"):
            chunk_size = st.slider("Chunk Size", 500, 2000, 1000, help="Size of text chunks for processing")
            overlap = st.slider("Chunk Overlap", 50, 200, 100, help="Overlap between text chunks")

        # API status
        st.subheader("API Status")
        try:
            health_response = requests.get(f"{API_BASE_URL}/health")
            if health_response.status_code == 200:
                st.success("✅ API Connected")
            else:
                st.error("❌ API Error")
        except requests.exceptions.RequestException:
            st.error("❌ API Unavailable")

    # Main content
    tab1, tab2, tab3 = st.tabs(["📖 Summarize Book", "📊 Text Analysis", "ℹ️ About"])

    with tab1:
        st.header("📖 Book Summarization")

        # File upload
        uploaded_file = st.file_uploader(
            "Choose a PDF book file",
            type=['pdf'],
            help="Upload a PDF file (max 50MB)"
        )

        if uploaded_file is not None:
            # File info
            file_size = len(uploaded_file.getvalue()) / (1024 * 1024)  # MB
            st.info(f"📄 **File:** {uploaded_file.name} ({file_size:.1f} MB)")

            # Validate file
            if st.button("🔍 Validate PDF", type="secondary"):
                with st.spinner("Validating PDF..."):
                    try:
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        response = requests.post(f"{API_BASE_URL}/upload-pdf", files=files)

                        if response.status_code == 200:
                            data = response.json()
                            st.success(f"✅ {data['message']}")

                            # Display metadata
                            metadata = data.get('metadata', {})
                            col1, col2, col3 = st.columns(3)
                            with col1:
                                st.metric("Pages", data['pages'])
                            with col2:
                                st.metric("Size", f"{data['size_mb']:.1f} MB")
                            with col3:
                                st.metric("Title", metadata.get('title', 'Unknown'))
                        else:
                            st.error(f"❌ Validation failed: {response.json().get('detail', 'Unknown error')}")
                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")

        # Summarize button
        if st.button("📝 Generate Summary", type="primary"):
            if uploaded_file is not None:
                with st.spinner("Processing your book..."):
                    try:
                        # Prepare request
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        data = {
                            "max_length": max_length,
                            "min_length": min_length,
                            "chunk_size": chunk_size,
                            "overlap": overlap,
                            "model_name": selected_model
                        }

                        # Send request
                        response = requests.post(f"{API_BASE_URL}/summarize", files=files, data=data)

                        if response.status_code == 200:
                            result = response.json()

                            # Display success message
                            st.success("✅ Summary generated successfully!")

                            # Display statistics
                            col1, col2, col3, col4 = st.columns(4)
                            stats = result.get('statistics', {})
                            orig_stats = result.get('original_statistics', {})

                            with col1:
                                st.metric("Original Words", f"{orig_stats.get('total_words', 0):,}")
                            with col2:
                                st.metric("Summary Words", f"{stats.get('final_summary_length', 0):,}")
                            with col3:
                                compression = stats.get('overall_compression_ratio', 0)
                                st.metric("Compression", f"{compression:.1%}")
                            with col4:
                                st.metric("Chunks Processed", stats.get('total_chunks', 0))

                            # Display summary
                            st.subheader("📝 Generated Summary")
                            summary = result.get('summary', '')
                            st.text_area(
                                "Summary",
                                value=summary,
                                height=400,
                                disabled=True
                            )

                            # Download button
                            st.download_button(
                                label="📥 Download Summary",
                                data=summary.encode('utf-8'),
                                file_name=f"{uploaded_file.name.replace('.pdf', '')}_summary.txt",
                                mime="text/plain"
                            )
                        else:
                            error_msg = response.json().get('detail', 'Unknown error')
                            st.error(f"❌ Summarization failed: {error_msg}")

                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")

    with tab2:
        st.header("📊 Text Analysis")

        if uploaded_file is not None:
            if st.button("📊 Analyze Text"):
                with st.spinner("Analyzing text..."):
                    try:
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        response = requests.post(f"{API_BASE_URL}/extract-text", files=files)

                        if response.status_code == 200:
                            data = response.json()
                            stats = data.get('statistics', {})

                            # Display statistics
                            col1, col2, col3, col4 = st.columns(4)

                            with col1:
                                st.metric("Total Words", f"{stats.get('total_words', 0):,}")
                            with col2:
                                st.metric("Total Sentences", f"{stats.get('total_sentences', 0):,}")
                            with col3:
                                st.metric("Avg Words/Sentence", f"{stats.get('average_words_per_sentence', 0):.1f}")
                            with col4:
                                st.metric("Reading Time", f"{stats.get('estimated_reading_time_minutes', 0):.1f} min")

                            # Text preview (reuse the response instead of calling the API a second time)
                            st.subheader("📄 Text Preview")
                            full_text = data.get('text', '')
                            preview_text = full_text[:1000] + "..." if len(full_text) > 1000 else full_text
                            st.text_area("First 1000 characters:", value=preview_text, height=200, disabled=True)
                        else:
                            st.error(f"❌ Analysis failed: {response.json().get('detail', 'Unknown error')}")
                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")
        else:
            st.info("📄 Please upload a PDF file to analyze its text.")

    with tab3:
        st.header("ℹ️ About")

        st.markdown("""
        ## 🤖 Book Summarizer AI

        This application uses advanced AI models to automatically summarize PDF books.
        It processes the text in chunks and generates comprehensive summaries while
        maintaining the key information and context.

        ### ✨ Features

        - **PDF Text Extraction**: Advanced PDF processing with fallback methods
        - **AI Summarization**: State-of-the-art transformer models
        - **Configurable Settings**: Adjust summary length and processing parameters
        - **Multiple Models**: Choose from different AI models for various use cases
        - **Text Analysis**: Detailed statistics about the book content

        ### 🛠️ Technology Stack

        - **Frontend**: Streamlit
        - **Backend**: FastAPI
        - **AI Models**: Hugging Face Transformers (BART, T5)
        - **PDF Processing**: PyPDF2, pdfplumber
        - **Text Processing**: NLTK

        ### 🔄 How It Works

        1. **Upload**: Select a PDF book file (max 50MB)
        2. **Extract**: The system extracts and cleans text from the PDF
        3. **Chunk**: Large texts are split into manageable chunks
        4. **Summarize**: AI models process each chunk and generate summaries
        5. **Combine**: Individual summaries are combined into a final summary
        6. **Download**: Get your summary in text format

        ### 🚀 Getting Started

        1. Make sure the API server is running (`uvicorn api.main:app --reload`)
        2. Upload a PDF book file
        3. Configure your preferred settings
        4. Click "Generate Summary" and wait for processing
        5. Download your AI-generated summary

        ### 📞 Support

        For issues or questions, please check the API documentation at `/docs`
        when the server is running.
        """)

if __name__ == "__main__":
    main()
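The frontend's calls can be reproduced from any HTTP client. The sketch below assembles the same multipart form fields that `app.py` sends to `/summarize` (the `build_summarize_request` helper is hypothetical); the actual POST is left commented out because it requires the backend to be running:

```python
API_BASE_URL = "http://localhost:8000"

def build_summarize_request(pdf_name: str, pdf_bytes: bytes,
                            max_length: int = 150, min_length: int = 50,
                            chunk_size: int = 1000, overlap: int = 100,
                            model_name: str = "facebook/bart-large-cnn"):
    # Multipart file part plus form fields, matching the names app.py uses.
    files = {"file": (pdf_name, pdf_bytes, "application/pdf")}
    data = {
        "max_length": max_length,
        "min_length": min_length,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "model_name": model_name,
    }
    return files, data

files, data = build_summarize_request("book.pdf", b"%PDF-1.4 ...")
# With the API up, send it with requests:
# import requests
# response = requests.post(f"{API_BASE_URL}/summarize", files=files, data=data)
# print(response.json()["summary"])
```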
requirements.txt
ADDED
@@ -0,0 +1,12 @@
streamlit==1.28.1
fastapi==0.104.1
uvicorn==0.24.0
python-multipart==0.0.6
PyPDF2==3.0.1
pdfplumber==0.10.3
transformers==4.35.2
torch>=2.2.0
nltk==3.8.1
requests==2.31.0
python-dotenv==1.0.0
pydantic==2.5.0
start.bat
ADDED
@@ -0,0 +1,32 @@
@echo off
echo 📚 Book Summarizer AI - Windows Startup
echo ======================================

echo.
echo 🔧 Checking Python installation...
python --version >nul 2>&1
if errorlevel 1 (
    echo ❌ Python is not installed or not in PATH
    echo Please install Python from https://python.org
    pause
    exit /b 1
)

echo ✅ Python found

echo.
echo 📦 Installing dependencies...
pip install -r requirements.txt
if errorlevel 1 (
    echo ❌ Failed to install dependencies
    pause
    exit /b 1
)

echo ✅ Dependencies installed

echo.
echo 🚀 Starting Book Summarizer AI...
python start.py

pause
start.py
ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Startup script for Book Summarizer AI.
This script helps you start both the FastAPI backend and the Streamlit frontend.
"""

import subprocess
import sys
import time

import requests

def check_dependencies():
    """Check if required packages are installed."""
    required_packages = [
        'streamlit', 'fastapi', 'uvicorn', 'transformers',
        'torch', 'PyPDF2', 'pdfplumber', 'nltk'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
        except ImportError:
            missing_packages.append(package)

    if missing_packages:
        print("❌ Missing required packages:")
        for package in missing_packages:
            print(f"  - {package}")
        print("\n📦 Install them with: pip install -r requirements.txt")
        return False

    print("✅ All dependencies are installed")
    return True

def download_nltk_data():
    """Download required NLTK data."""
    try:
        import nltk
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        print("✅ NLTK data downloaded")
    except Exception as e:
        print(f"⚠️ Warning: Could not download NLTK data: {e}")

def check_api_health():
    """Check if the API is running and healthy."""
    try:
        response = requests.get("http://localhost:8000/health", timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def start_api():
    """Start the FastAPI backend."""
    print("🚀 Starting FastAPI backend...")

    # Check if the API is already running
    if check_api_health():
        print("✅ API is already running")
        return True

    try:
        # Start the API server in the background
        subprocess.Popen([
            sys.executable, "-m", "uvicorn",
            "api.main:app",
            "--reload",
            "--port", "8000",
            "--host", "0.0.0.0"
        ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        # Wait for the API to start
        print("⏳ Waiting for API to start...")
        for _ in range(30):  # Wait up to 30 seconds
            time.sleep(1)
            if check_api_health():
                print("✅ API started successfully")
                return True

        print("❌ API failed to start within 30 seconds")
        return False

    except Exception as e:
        print(f"❌ Error starting API: {e}")
        return False

def start_frontend():
    """Start the Streamlit frontend."""
    print("🚀 Starting Streamlit frontend...")

    try:
        # Start Streamlit (blocks until the user stops it)
        subprocess.run([
            sys.executable, "-m", "streamlit", "run", "app.py",
            "--server.port", "8501",
            "--server.address", "0.0.0.0"
        ])
    except KeyboardInterrupt:
        print("\n👋 Shutting down...")
    except Exception as e:
        print(f"❌ Error starting frontend: {e}")

def main():
    """Main startup function."""
    print("📚 Book Summarizer AI - Startup")
    print("=" * 40)

    # Check dependencies
    if not check_dependencies():
        sys.exit(1)

    # Download NLTK data
    download_nltk_data()

    print("\n🔧 Starting services...")

    # Start API
    if not start_api():
        print("❌ Failed to start API. Please check the logs.")
        sys.exit(1)

    print("\n🎉 Ready! Opening the application...")
    print("🌐 Frontend: http://localhost:8501")
    print("🔗 API: http://localhost:8000")
    print("📄 API Docs: http://localhost:8000/docs")
    print("\n💡 Press Ctrl+C to stop the application")

    # Start frontend
    start_frontend()

if __name__ == "__main__":
    main()
start.sh
ADDED
@@ -0,0 +1,28 @@
#!/bin/bash

echo "📚 Book Summarizer AI - Unix/Linux/Mac Startup"
echo "=============================================="

echo ""
echo "🔧 Checking Python installation..."
if ! command -v python3 &> /dev/null; then
    echo "❌ Python 3 is not installed or not in PATH"
    echo "Please install Python 3 from https://python.org"
    exit 1
fi

echo "✅ Python 3 found"

echo ""
echo "📦 Installing dependencies..."
pip3 install -r requirements.txt
if [ $? -ne 0 ]; then
    echo "❌ Failed to install dependencies"
    exit 1
fi

echo "✅ Dependencies installed"

echo ""
echo "🚀 Starting Book Summarizer AI..."
python3 start.py