Commit 6880cd9
Parent(s): none (initial commit)
first commit to AI repo

Files changed:
- README.md +196 -0
- api/__init__.py +1 -0
- api/__pycache__/__init__.cpython-312.pyc +0 -0
- api/__pycache__/main.cpython-312.pyc +0 -0
- api/__pycache__/pdf_processor.cpython-312.pyc +0 -0
- api/__pycache__/summarizer.cpython-312.pyc +0 -0
- api/__pycache__/utils.cpython-312.pyc +0 -0
- api/main.py +224 -0
- api/pdf_processor.py +172 -0
- api/summarizer.py +226 -0
- api/utils.py +124 -0
- app.py +314 -0
- requirements.txt +12 -0
- start.bat +32 -0
- start.py +135 -0
- start.sh +28 -0
README.md
ADDED
@@ -0,0 +1,196 @@
# Book Summarizer AI

An intelligent web application that extracts text from PDF books and generates comprehensive summaries using state-of-the-art AI models.

## Features

- **PDF Text Extraction**: Advanced PDF processing with multiple extraction methods
- **AI-Powered Summarization**: Uses transformer models (BART, T5) for high-quality summaries
- **Beautiful Web Interface**: Modern UI built with Streamlit
- **FastAPI Backend**: Scalable and fast API for processing
- **Configurable Settings**: Adjust summary length, chunk size, and AI models
- **Text Analysis**: Detailed statistics about book content
- **Download Summaries**: Save summaries as text files

## Quick Start

### Option 1: Automated Setup (Recommended)

**Windows:**
```bash
# Double-click start.bat or run:
start.bat
```

**Unix/Linux/Mac:**
```bash
# Make the script executable and run it:
chmod +x start.sh
./start.sh
```

### Option 2: Manual Setup

1. **Install dependencies:**
```bash
pip install -r requirements.txt
```

2. **Download NLTK data:**
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

3. **Start the FastAPI backend:**
```bash
uvicorn api.main:app --reload --port 8000
```

4. **Start the Streamlit frontend:**
```bash
streamlit run app.py
```

5. **Open your browser:**
   - Frontend: http://localhost:8501
   - API Docs: http://localhost:8000/docs

## Usage

1. **Upload PDF**: Select a PDF book file (max 50MB)
2. **Configure Settings**: Choose an AI model and summary parameters
3. **Generate Summary**: Click "Generate Summary" and wait for processing
4. **Download Result**: Save your AI-generated summary

## Technology Stack

### Frontend
- **Streamlit**: Modern web interface
- **Custom CSS**: Beautiful styling and responsive design

### Backend
- **FastAPI**: High-performance API framework
- **Uvicorn**: ASGI server for FastAPI

### AI & ML
- **Hugging Face Transformers**: State-of-the-art NLP models
- **PyTorch**: Deep learning framework
- **BART/T5 Models**: Pre-trained summarization models

### PDF Processing
- **PyPDF2**: PDF text extraction
- **pdfplumber**: Advanced PDF processing
- **NLTK**: Natural language processing

## Project Structure

```
book-summarizer/
├── app.py               # Streamlit frontend
├── start.py             # Automated startup script
├── start.bat            # Windows startup script
├── start.sh             # Unix/Linux/Mac startup script
├── api/
│   ├── __init__.py      # API package
│   ├── main.py          # FastAPI backend
│   ├── pdf_processor.py # PDF text extraction
│   ├── summarizer.py    # AI summarization logic
│   └── utils.py         # Utility functions
├── requirements.txt     # Python dependencies
└── README.md            # Project documentation
```

## Configuration

### AI Models
- **facebook/bart-large-cnn**: Best quality, slower processing
- **t5-small**: Faster processing, good quality
- **facebook/bart-base**: Balanced performance

### Summary Settings
- **Max Length**: 50-500 words (default: 150)
- **Min Length**: 10-200 words (default: 50)
- **Chunk Size**: 500-2000 characters (default: 1000)
- **Overlap**: 50-200 characters (default: 100)
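The interaction between **Chunk Size** and **Overlap** can be illustrated in a few lines: each chunk repeats the tail of the previous one so sentences cut at a boundary keep some context. This is only a minimal sketch of character-based overlapping chunking — the project's actual `chunk_text` helper lives in `api/utils.py` and is not shown in this commit, so `chunk_with_overlap` below is an illustration, not the real implementation.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of roughly chunk_size characters, where each
    chunk begins with the last `overlap` characters of the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

With the defaults (1000/100), a 2500-character text yields three chunks starting at offsets 0, 900, and 1800, and each consecutive pair shares 100 characters.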

## API Endpoints

- `GET /` - API information
- `GET /health` - Health check
- `POST /upload-pdf` - Validate a PDF file
- `POST /extract-text` - Extract text from a PDF
- `POST /summarize` - Generate a book summary
- `GET /models` - List available AI models
- `POST /change-model` - Switch the AI model

## Requirements

- **Python**: 3.8 or higher
- **Memory**: At least 4GB RAM (8GB recommended)
- **Storage**: 2GB free space for models
- **Internet**: Required for first-time model download

## Troubleshooting

### Common Issues

1. **"Module not found" errors:**
```bash
pip install -r requirements.txt
```

2. **NLTK data missing:**
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

3. **API connection failed:**
   - Ensure FastAPI is running on port 8000
   - Check firewall settings
   - Verify no other service is using the port

4. **Large PDFs process slowly:**
   - Reduce the chunk size in advanced settings
   - Use a faster model (t5-small)
   - Ensure sufficient RAM

5. **Model download issues:**
   - Check your internet connection
   - Clear the Hugging Face cache: `rm -rf ~/.cache/huggingface`

### Performance Tips

- **GPU Acceleration**: Install CUDA for faster processing
- **Model Selection**: Use smaller models for faster results
- **Chunk Size**: Smaller chunks process faster but may lose context
- **Memory**: Close other applications to free up RAM

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is open source and available under the MIT License.

## Acknowledgments

- Hugging Face for transformer models
- Streamlit for the web framework
- FastAPI for the backend framework
- The open-source community for various libraries

## Support

For issues, questions, or feature requests:
1. Check the troubleshooting section
2. Review the API documentation at `/docs`
3. Open an issue on GitHub

---

**Happy summarizing!**
api/__init__.py
ADDED
@@ -0,0 +1 @@
# API package for Book Summarizer

api/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (145 Bytes)

api/__pycache__/main.cpython-312.pyc
ADDED
Binary file (9.14 kB)

api/__pycache__/pdf_processor.cpython-312.pyc
ADDED
Binary file (6.86 kB)

api/__pycache__/summarizer.cpython-312.pyc
ADDED
Binary file (8.62 kB)

api/__pycache__/utils.cpython-312.pyc
ADDED
Binary file (3.88 kB)
api/main.py
ADDED
@@ -0,0 +1,224 @@
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Dict, Any, Optional
import logging

from .pdf_processor import PDFProcessor
from .summarizer import BookSummarizer

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI app
app = FastAPI(
    title="Book Summarizer API",
    description="AI-powered book summarization service",
    version="1.0.0"
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, specify your frontend URL
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize components
pdf_processor = PDFProcessor()
summarizer = BookSummarizer()

# Pydantic models
class SummaryRequest(BaseModel):
    max_length: int = 150
    min_length: int = 50
    chunk_size: int = 1000
    overlap: int = 100
    model_name: Optional[str] = None

class SummaryResponse(BaseModel):
    success: bool
    summary: str
    statistics: Dict[str, Any]
    message: str

@app.on_event("startup")
async def startup_event():
    """Initialize components on startup."""
    logger.info("Starting Book Summarizer API...")
    try:
        # Load the summarization model
        summarizer.load_model()
        logger.info("API startup completed successfully")
    except Exception as e:
        logger.error(f"Error during startup: {str(e)}")

@app.get("/")
async def root():
    """Root endpoint."""
    return {
        "message": "Book Summarizer API",
        "version": "1.0.0",
        "status": "running"
    }

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": summarizer.summarizer is not None
    }

@app.post("/upload-pdf")
async def upload_pdf(file: UploadFile = File(...)):
    """Upload and validate a PDF file."""
    try:
        # Check file type
        if not file.filename.lower().endswith('.pdf'):
            raise HTTPException(status_code=400, detail="Only PDF files are supported")

        # Read file content
        content = await file.read()

        # Validate PDF
        validation_result = pdf_processor.validate_pdf(content)
        if not validation_result['valid']:
            raise HTTPException(status_code=400, detail=validation_result['message'])

        # Extract metadata
        metadata = pdf_processor.get_pdf_metadata(content)

        return {
            "success": True,
            "filename": file.filename,
            "size_mb": validation_result['size_mb'],
            "pages": validation_result['pages'],
            "metadata": metadata,
            "message": "PDF uploaded and validated successfully"
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error uploading PDF: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")

@app.post("/extract-text")
async def extract_text(file: UploadFile = File(...)):
    """Extract text from an uploaded PDF."""
    try:
        # Read file content
        content = await file.read()

        # Extract text
        result = pdf_processor.extract_text_from_pdf(content)

        if not result['success']:
            raise HTTPException(status_code=400, detail=result['message'])

        return {
            "success": True,
            "text_length": len(result['text']),
            "statistics": result['statistics'],
            "pages": result['pages'],
            "message": result['message']
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error extracting text: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error extracting text: {str(e)}")

@app.post("/summarize")
async def summarize_book(
    file: UploadFile = File(...),
    request: SummaryRequest = SummaryRequest()
):
    """Summarize a book from an uploaded PDF."""
    try:
        # Read file content
        content = await file.read()

        # Extract text
        extraction_result = pdf_processor.extract_text_from_pdf(content)
        if not extraction_result['success']:
            raise HTTPException(status_code=400, detail=extraction_result['message'])

        # Change model if specified
        if request.model_name:
            summarizer.change_model(request.model_name)

        # Summarize the book
        summary_result = summarizer.summarize_book(
            text=extraction_result['text'],
            chunk_size=request.chunk_size,
            overlap=request.overlap,
            max_length=request.max_length,
            min_length=request.min_length
        )

        if not summary_result['success']:
            raise HTTPException(status_code=500, detail=summary_result.get('error', 'Summarization failed'))

        return {
            "success": True,
            "summary": summary_result['summary'],
            "statistics": summary_result['statistics'],
            "original_statistics": extraction_result['statistics'],
            "message": "Book summarized successfully"
        }

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error summarizing book: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error summarizing book: {str(e)}")

@app.get("/models")
async def get_available_models():
    """Get the list of available summarization models."""
    try:
        models = summarizer.get_available_models()
        return {
            "success": True,
            "models": models,
            "current_model": summarizer.model_name
        }
    except Exception as e:
        logger.error(f"Error getting models: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error getting models: {str(e)}")

@app.post("/change-model")
async def change_model(model_name: str):
    """Change the summarization model."""
    try:
        summarizer.change_model(model_name)
        summarizer.load_model()

        return {
            "success": True,
            "message": f"Model changed to {model_name}",
            "current_model": model_name
        }
    except Exception as e:
        logger.error(f"Error changing model: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error changing model: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
api/pdf_processor.py
ADDED
@@ -0,0 +1,172 @@
import PyPDF2
import pdfplumber
import io
from typing import Dict, Any
import logging
from .utils import clean_text, get_text_statistics

logger = logging.getLogger(__name__)

class PDFProcessor:
    """Handles PDF text extraction and processing."""

    def __init__(self):
        self.supported_formats = ['.pdf']

    def extract_text_from_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
        """
        Extract text from PDF file bytes.

        Args:
            pdf_file: PDF file as bytes

        Returns:
            Dictionary containing the extracted text and metadata
        """
        try:
            # Try pdfplumber first (better for complex layouts)
            text = self._extract_with_pdfplumber(pdf_file)

            if not text or len(text.strip()) < 100:
                # Fall back to PyPDF2
                text = self._extract_with_pypdf2(pdf_file)

            if not text:
                raise ValueError("Could not extract text from PDF")

            # Clean the extracted text
            cleaned_text = clean_text(text)

            # Get text statistics
            stats = get_text_statistics(cleaned_text)

            return {
                'success': True,
                'text': cleaned_text,
                'statistics': stats,
                'pages': self._get_page_count(pdf_file),
                'message': 'Text extracted successfully'
            }

        except Exception as e:
            logger.error(f"Error extracting text from PDF: {str(e)}")
            return {
                'success': False,
                'text': '',
                'statistics': {},
                'pages': 0,
                'message': f'Error extracting text: {str(e)}'
            }

    def _extract_with_pdfplumber(self, pdf_file: bytes) -> str:
        """Extract text using pdfplumber (better for complex layouts)."""
        text_parts = []

        try:
            with pdfplumber.open(io.BytesIO(pdf_file)) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)

            return '\n'.join(text_parts)
        except Exception as e:
            logger.warning(f"pdfplumber extraction failed: {str(e)}")
            return ""

    def _extract_with_pypdf2(self, pdf_file: bytes) -> str:
        """Extract text using PyPDF2 (fallback method)."""
        text_parts = []

        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))

            for page in pdf_reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text_parts.append(page_text)

            return '\n'.join(text_parts)
        except Exception as e:
            logger.warning(f"PyPDF2 extraction failed: {str(e)}")
            return ""

    def _get_page_count(self, pdf_file: bytes) -> int:
        """Get the number of pages in the PDF."""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
            return len(pdf_reader.pages)
        except Exception:
            return 0

    def get_pdf_metadata(self, pdf_file: bytes) -> Dict[str, Any]:
        """Extract metadata from a PDF file."""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
            metadata = pdf_reader.metadata

            return {
                'title': metadata.get('/Title', 'Unknown'),
                'author': metadata.get('/Author', 'Unknown'),
                'subject': metadata.get('/Subject', ''),
                'creator': metadata.get('/Creator', ''),
                'producer': metadata.get('/Producer', ''),
                'pages': len(pdf_reader.pages)
            }
        except Exception as e:
            logger.error(f"Error extracting PDF metadata: {str(e)}")
            return {
                'title': 'Unknown',
                'author': 'Unknown',
                'subject': '',
                'creator': '',
                'producer': '',
                'pages': 0
            }

    def validate_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
        """Validate the PDF file and check whether it can be processed."""
        try:
            # Check file size
            file_size = len(pdf_file)
            max_size = 50 * 1024 * 1024  # 50MB limit

            if file_size > max_size:
                return {
                    'valid': False,
                    'message': f'File too large. Maximum size is 50MB, got {file_size / (1024*1024):.1f}MB'
                }

            # Try to read the PDF
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))

            if len(pdf_reader.pages) == 0:
                return {
                    'valid': False,
                    'message': 'PDF appears to be empty or corrupted'
                }

            return {
                'valid': True,
                'message': 'PDF is valid',
                'pages': len(pdf_reader.pages),
                'size_mb': file_size / (1024 * 1024)
            }

        except Exception as e:
            return {
                'valid': False,
                'message': f'Invalid PDF file: {str(e)}'
            }
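The strategy in `extract_text_from_pdf` — try pdfplumber first, then fall back to PyPDF2 when the result is empty or under 100 characters — is an instance of a general try-extractors-in-order pattern. A minimal self-contained sketch of that pattern; the `primary`/`fallback` functions below are hypothetical stand-ins, not the project's real extractors:

```python
from typing import Callable, List

def extract_with_fallback(data: bytes,
                          extractors: List[Callable[[bytes], str]],
                          min_chars: int = 100) -> str:
    """Try each extractor in order; accept the first result that is at
    least min_chars long after stripping. If none qualifies, return the
    longest partial result rather than nothing."""
    best = ""
    for extract in extractors:
        try:
            text = extract(data)
        except Exception:
            continue  # a failing extractor just means "try the next one"
        if text and len(text.strip()) >= min_chars:
            return text
        if len(text or "") > len(best):
            best = text
    return best

# Hypothetical extractors standing in for pdfplumber / PyPDF2:
def primary(data: bytes) -> str:
    raise RuntimeError("layout too complex")

def fallback(data: bytes) -> str:
    return data.decode("utf-8", errors="ignore")
```

Returning the longest partial result on total failure is one possible design choice; the project instead raises `ValueError` when no extractor yields any text.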
api/summarizer.py
ADDED
@@ -0,0 +1,226 @@
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from typing import Dict, Any
import torch
import logging
from .utils import chunk_text

logger = logging.getLogger(__name__)

class BookSummarizer:
    """Handles AI-powered text summarization using transformer models."""

    def __init__(self, model_name: str = "facebook/bart-large-cnn"):
        """
        Initialize the summarizer with a specific model.

        Args:
            model_name: Hugging Face model name for summarization
        """
        self.model_name = model_name
        self.summarizer = None
        self.tokenizer = None
        self.model = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        logger.info(f"Initializing summarizer with model: {model_name}")
        logger.info(f"Using device: {self.device}")

    def load_model(self):
        """Load the summarization model and tokenizer."""
        try:
            logger.info("Loading summarization model...")

            # Load tokenizer and model
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)

            # Move the model to the appropriate device
            self.model.to(self.device)

            # Create the pipeline
            self.summarizer = pipeline(
                "summarization",
                model=self.model,
                tokenizer=self.tokenizer,
                device=0 if self.device == "cuda" else -1
            )

            logger.info("Model loaded successfully")

        except Exception as e:
            logger.error(f"Error loading model: {str(e)}")
            raise

    def summarize_text(self, text: str, max_length: int = 150, min_length: int = 50,
                       do_sample: bool = False) -> Dict[str, Any]:
        """
        Summarize a single text chunk.

        Args:
            text: Text to summarize
            max_length: Maximum length of the summary
            min_length: Minimum length of the summary
            do_sample: Whether to use sampling for generation

        Returns:
            Dictionary containing the summary and metadata
        """
        try:
            if not self.summarizer:
                self.load_model()

            # Return very short texts unchanged
            if len(text.split()) < 50:
                return {
                    'success': True,
                    'summary': text,
                    'original_length': len(text.split()),
                    'summary_length': len(text.split()),
                    'compression_ratio': 1.0
                }

            # Generate the summary
            summary_result = self.summarizer(
                text,
                max_length=max_length,
                min_length=min_length,
                do_sample=do_sample,
                truncation=True
            )

            summary = summary_result[0]['summary_text']

            # Calculate the compression ratio
            original_words = len(text.split())
            summary_words = len(summary.split())
            compression_ratio = summary_words / original_words if original_words > 0 else 0

            return {
                'success': True,
                'summary': summary,
                'original_length': original_words,
                'summary_length': summary_words,
                'compression_ratio': compression_ratio
            }

        except Exception as e:
            logger.error(f"Error summarizing text: {str(e)}")
            return {
                'success': False,
                'summary': '',
                'error': str(e)
            }

    def summarize_book(self, text: str, chunk_size: int = 1000, overlap: int = 100,
                       max_length: int = 150, min_length: int = 50) -> Dict[str, Any]:
        """
        Summarize a complete book by processing it in chunks.

        Args:
            text: Complete book text
            chunk_size: Size of each text chunk
            overlap: Overlap between chunks
            max_length: Maximum length of each summary
            min_length: Minimum length of each summary

        Returns:
            Dictionary containing the complete summary and metadata
        """
        try:
            logger.info("Starting book summarization...")

            # Split the text into chunks
            chunks = chunk_text(text, chunk_size, overlap)
            logger.info(f"Split text into {len(chunks)} chunks")

            # Summarize each chunk
            chunk_summaries = []
            total_original_words = 0
            total_summary_words = 0

            for i, chunk in enumerate(chunks):
                logger.info(f"Processing chunk {i+1}/{len(chunks)}")

                result = self.summarize_text(chunk, max_length, min_length)

                if result['success']:
                    chunk_summaries.append(result['summary'])
                    total_original_words += result['original_length']
                    total_summary_words += result['summary_length']
                else:
                    logger.warning(f"Failed to summarize chunk {i+1}: {result.get('error', 'Unknown error')}")
                    # Fall back to a truncated excerpt of the original chunk
                    chunk_summaries.append(chunk[:200] + "...")

            # Combine all chunk summaries
            combined_summary = " ".join(chunk_summaries)

            # Create a final summary if the combined summary is still too long
            if len(combined_summary.split()) > 500:
                logger.info("Creating final summary from combined summaries...")
                final_result = self.summarize_text(combined_summary, max_length=300, min_length=100)
                if final_result['success']:
                    combined_summary = final_result['summary']

            # Calculate overall statistics
            overall_compression = total_summary_words / total_original_words if total_original_words > 0 else 0

            return {
                'success': True,
                'summary': combined_summary,
                'statistics': {
                    'total_chunks': len(chunks),
                    'total_original_words': total_original_words,
                    'total_summary_words': total_summary_words,
                    'overall_compression_ratio': overall_compression,
                    'final_summary_length': len(combined_summary.split())
                },
                'chunk_summaries': chunk_summaries
            }

        except Exception as e:
| 186 |
+
logger.error(f"Error in book summarization: {str(e)}")
|
| 187 |
+
return {
|
| 188 |
+
'success': False,
|
| 189 |
+
'summary': '',
|
| 190 |
+
'error': str(e)
|
| 191 |
+
}
|
| 192 |
+
|
| 193 |
+
def get_available_models(self) -> List[Dict[str, Union[str, int]]]:
|
| 194 |
+
"""
|
| 195 |
+
Get list of available summarization models.
|
| 196 |
+
"""
|
| 197 |
+
return [
|
| 198 |
+
{
|
| 199 |
+
'name': 'facebook/bart-large-cnn',
|
| 200 |
+
'description': 'BART model fine-tuned on CNN news articles (recommended)',
|
| 201 |
+
'max_length': 1024
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
'name': 't5-small',
|
| 205 |
+
'description': 'Small T5 model, faster but less accurate',
|
| 206 |
+
'max_length': 512
|
| 207 |
+
},
|
| 208 |
+
{
|
| 209 |
+
'name': 'facebook/bart-base',
|
| 210 |
+
'description': 'Base BART model, balanced performance',
|
| 211 |
+
'max_length': 1024
|
| 212 |
+
}
|
| 213 |
+
]
|
| 214 |
+
|
| 215 |
+
def change_model(self, model_name: str):
|
| 216 |
+
"""
|
| 217 |
+
Change the summarization model.
|
| 218 |
+
|
| 219 |
+
Args:
|
| 220 |
+
model_name: New model name to use
|
| 221 |
+
"""
|
| 222 |
+
self.model_name = model_name
|
| 223 |
+
self.summarizer = None
|
| 224 |
+
self.tokenizer = None
|
| 225 |
+
self.model = None
|
| 226 |
+
logger.info(f"Model changed to: {model_name}")
|
api/utils.py
ADDED
@@ -0,0 +1,124 @@
import re
import nltk
from typing import List, Dict, Any
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_text(text: str) -> str:
    """Clean and preprocess text extracted from a PDF."""
    # Remove extra whitespace and normalize
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\n+', '\n', text)
    text = text.strip()

    # Remove common PDF artifacts
    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\{\}]', '', text)

    return text

def chunk_text(text: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """
    Split text into overlapping chunks for processing.

    Args:
        text: Input text to chunk
        max_chunk_size: Maximum size of each chunk
        overlap: Number of characters to overlap between chunks

    Returns:
        List of text chunks
    """
    if len(text) <= max_chunk_size:
        return [text]

    chunks = []
    start = 0

    while start < len(text):
        end = start + max_chunk_size

        # Try to break at sentence boundaries
        if end < len(text):
            sentence_endings = ['.', '!', '?']
            for ending in sentence_endings:
                last_ending = text.rfind(ending, start, end)
                if last_ending > start + max_chunk_size * 0.8:  # Only break if at least 80% through the chunk
                    end = last_ending + 1
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Move the start position forward, keeping the overlap
        start = end - overlap
        if start >= len(text):
            break

    return chunks

def extract_chapters(text: str) -> Dict[str, str]:
    """Attempt to extract chapters from the text."""
    chapters = {}

    # Common chapter heading patterns
    chapter_patterns = [
        r'Chapter\s+(\d+|[IVXLC]+)',
        r'CHAPTER\s+(\d+|[IVXLC]+)',
        r'(\d+)\.\s+[A-Z]',
        r'[IVXLC]+\.\s+[A-Z]'
    ]

    lines = text.split('\n')
    current_chapter = "Introduction"
    current_content = []

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check whether this line is a chapter header
        is_chapter_header = False
        for pattern in chapter_patterns:
            if re.match(pattern, line, re.IGNORECASE):
                # Save the previous chapter
                if current_content:
                    chapters[current_chapter] = '\n'.join(current_content)

                current_chapter = line
                current_content = []
                is_chapter_header = True
                break

        if not is_chapter_header:
            current_content.append(line)

    # Save the last chapter
    if current_content:
        chapters[current_chapter] = '\n'.join(current_content)

    return chapters

def get_text_statistics(text: str) -> Dict[str, Any]:
    """Get basic statistics about the text."""
    words = text.split()
    sentences = nltk.sent_tokenize(text)

    return {
        'total_characters': len(text),
        'total_words': len(words),
        'total_sentences': len(sentences),
        'average_words_per_sentence': len(words) / len(sentences) if sentences else 0,
        'estimated_reading_time_minutes': len(words) / 200  # assumes ~200 words per minute
    }
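To see how the sliding window in `chunk_text` behaves, here is a stripped-down sketch of its core loop. The sentence-boundary search is omitted, so this illustrates only the overlap mechanics, not the exact function:

```python
# Simplified copy of chunk_text's windowing: fixed-size chunks that
# step forward by (max_chunk_size - overlap) characters each iteration.
def chunk_text_simple(text, max_chunk_size=10, overlap=3):
    if len(text) <= max_chunk_size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        end = start + max_chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end - overlap  # re-read the last `overlap` characters
        if start >= len(text):
            break
    return chunks

chunks = chunk_text_simple("abcdefghijklmnop", max_chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnop', 'op']
```

Each chunk repeats the tail of the previous one ("hij" above), which gives the summarizer some shared context across chunk boundaries.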
app.py
ADDED
@@ -0,0 +1,314 @@
import streamlit as st
import requests

# Page configuration
st.set_page_config(
    page_title="Book Summarizer AI",
    page_icon="📚",
    layout="wide",
    initial_sidebar_state="expanded"
)

# API configuration
API_BASE_URL = "http://localhost:8000"

def main():
    # Custom CSS for better styling
    st.markdown("""
    <style>
    .main-header {
        font-size: 3rem;
        font-weight: bold;
        text-align: center;
        color: #1f77b4;
        margin-bottom: 2rem;
    }
    .sub-header {
        font-size: 1.5rem;
        color: #666;
        text-align: center;
        margin-bottom: 2rem;
    }
    .success-box {
        background-color: #d4edda;
        border: 1px solid #c3e6cb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    .error-box {
        background-color: #f8d7da;
        border: 1px solid #f5c6cb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    .info-box {
        background-color: #d1ecf1;
        border: 1px solid #bee5eb;
        border-radius: 5px;
        padding: 1rem;
        margin: 1rem 0;
    }
    </style>
    """, unsafe_allow_html=True)

    # Header
    st.markdown('<h1 class="main-header">📚 Book Summarizer AI</h1>', unsafe_allow_html=True)
    st.markdown('<p class="sub-header">Transform your PDF books into intelligent summaries using AI</p>', unsafe_allow_html=True)

    # Sidebar
    with st.sidebar:
        st.header("⚙️ Settings")

        # Model selection
        st.subheader("AI Model")
        try:
            models_response = requests.get(f"{API_BASE_URL}/models")
            if models_response.status_code == 200:
                models_data = models_response.json()
                models = models_data.get('models', [])
                current_model = models_data.get('current_model', '')

                model_names = [model['name'] for model in models]
                selected_model = st.selectbox(
                    "Choose AI Model",
                    model_names,
                    index=model_names.index(current_model) if current_model in model_names else 0
                )

                # Show model description
                selected_model_info = next((m for m in models if m['name'] == selected_model), None)
                if selected_model_info:
                    st.info(f"**{selected_model_info['description']}**")
            else:
                st.error("Failed to load models")
                selected_model = "facebook/bart-large-cnn"
        except requests.exceptions.RequestException as e:
            st.error(f"Error loading models: {str(e)}")
            selected_model = "facebook/bart-large-cnn"

        # Summary settings
        st.subheader("Summary Settings")
        max_length = st.slider("Maximum Summary Length", 50, 500, 150, help="Maximum number of words in the summary")
        min_length = st.slider("Minimum Summary Length", 10, 200, 50, help="Minimum number of words in the summary")

        # Advanced settings
        with st.expander("Advanced Settings"):
            chunk_size = st.slider("Chunk Size", 500, 2000, 1000, help="Size of text chunks for processing")
            overlap = st.slider("Chunk Overlap", 50, 200, 100, help="Overlap between text chunks")

        # API status
        st.subheader("API Status")
        try:
            health_response = requests.get(f"{API_BASE_URL}/health")
            if health_response.status_code == 200:
                st.success("✅ API Connected")
            else:
                st.error("❌ API Error")
        except requests.exceptions.RequestException:
            st.error("❌ API Unavailable")

    # Main content
    tab1, tab2, tab3 = st.tabs(["📖 Summarize Book", "📊 Text Analysis", "ℹ️ About"])

    with tab1:
        st.header("📖 Book Summarization")

        # File upload
        uploaded_file = st.file_uploader(
            "Choose a PDF book file",
            type=['pdf'],
            help="Upload a PDF file (max 50MB)"
        )

        if uploaded_file is not None:
            # File info
            file_size = len(uploaded_file.getvalue()) / (1024 * 1024)  # MB
            st.info(f"📄 **File:** {uploaded_file.name} ({file_size:.1f} MB)")

            # Validate file
            if st.button("🔍 Validate PDF", type="secondary"):
                with st.spinner("Validating PDF..."):
                    try:
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        response = requests.post(f"{API_BASE_URL}/upload-pdf", files=files)

                        if response.status_code == 200:
                            data = response.json()
                            st.success(f"✅ {data['message']}")

                            # Display metadata
                            metadata = data.get('metadata', {})
                            col1, col2, col3 = st.columns(3)
                            with col1:
                                st.metric("Pages", data['pages'])
                            with col2:
                                st.metric("Size", f"{data['size_mb']:.1f} MB")
                            with col3:
                                st.metric("Title", metadata.get('title', 'Unknown'))
                        else:
                            st.error(f"❌ Validation failed: {response.json().get('detail', 'Unknown error')}")
                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")

        # Summarize button
        if st.button("📝 Generate Summary", type="primary"):
            if uploaded_file is not None:
                with st.spinner("Processing your book..."):
                    try:
                        # Prepare request
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        data = {
                            "max_length": max_length,
                            "min_length": min_length,
                            "chunk_size": chunk_size,
                            "overlap": overlap,
                            "model_name": selected_model
                        }

                        # Send request
                        response = requests.post(f"{API_BASE_URL}/summarize", files=files, data=data)

                        if response.status_code == 200:
                            result = response.json()

                            # Display success message
                            st.success("✅ Summary generated successfully!")

                            # Display statistics
                            col1, col2, col3, col4 = st.columns(4)
                            stats = result.get('statistics', {})
                            orig_stats = result.get('original_statistics', {})

                            with col1:
                                st.metric("Original Words", f"{orig_stats.get('total_words', 0):,}")
                            with col2:
                                st.metric("Summary Words", f"{stats.get('final_summary_length', 0):,}")
                            with col3:
                                compression = stats.get('overall_compression_ratio', 0)
                                st.metric("Compression", f"{compression:.1%}")
                            with col4:
                                st.metric("Chunks Processed", stats.get('total_chunks', 0))

                            # Display summary
                            st.subheader("📝 Generated Summary")
                            summary = result.get('summary', '')
                            st.text_area(
                                "Summary",
                                value=summary,
                                height=400,
                                disabled=True
                            )

                            # Download button
                            st.download_button(
                                label="📥 Download Summary",
                                data=summary.encode('utf-8'),
                                file_name=f"{uploaded_file.name.replace('.pdf', '')}_summary.txt",
                                mime="text/plain"
                            )
                        else:
                            error_msg = response.json().get('detail', 'Unknown error')
                            st.error(f"❌ Summarization failed: {error_msg}")

                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")

    with tab2:
        st.header("📊 Text Analysis")

        if uploaded_file is not None:
            if st.button("📊 Analyze Text"):
                with st.spinner("Analyzing text..."):
                    try:
                        files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
                        response = requests.post(f"{API_BASE_URL}/extract-text", files=files)

                        if response.status_code == 200:
                            data = response.json()
                            stats = data.get('statistics', {})

                            # Display statistics
                            col1, col2, col3, col4 = st.columns(4)

                            with col1:
                                st.metric("Total Words", f"{stats.get('total_words', 0):,}")
                            with col2:
                                st.metric("Total Sentences", f"{stats.get('total_sentences', 0):,}")
                            with col3:
                                st.metric("Avg Words/Sentence", f"{stats.get('average_words_per_sentence', 0):.1f}")
                            with col4:
                                st.metric("Reading Time", f"{stats.get('estimated_reading_time_minutes', 0):.1f} min")

                            # Text preview (reuse the response instead of calling the API a second time)
                            st.subheader("📄 Text Preview")
                            full_text = data.get('text', '')
                            preview_text = full_text[:1000] + "..." if len(full_text) > 1000 else full_text
                            st.text_area("First 1000 characters:", value=preview_text, height=200, disabled=True)
                        else:
                            st.error(f"❌ Analysis failed: {response.json().get('detail', 'Unknown error')}")
                    except Exception as e:
                        st.error(f"❌ Error: {str(e)}")
        else:
            st.info("📄 Please upload a PDF file to analyze its text.")

    with tab3:
        st.header("ℹ️ About")

        st.markdown("""
        ## 🤖 Book Summarizer AI

        This application uses advanced AI models to automatically summarize PDF books.
        It processes the text in chunks and generates comprehensive summaries while
        maintaining the key information and context.

        ### ✨ Features

        - **PDF Text Extraction**: Advanced PDF processing with fallback methods
        - **AI Summarization**: State-of-the-art transformer models
        - **Configurable Settings**: Adjust summary length and processing parameters
        - **Multiple Models**: Choose from different AI models for various use cases
        - **Text Analysis**: Detailed statistics about the book content

        ### 🛠️ Technology Stack

        - **Frontend**: Streamlit
        - **Backend**: FastAPI
        - **AI Models**: Hugging Face Transformers (BART, T5)
        - **PDF Processing**: PyPDF2, pdfplumber
        - **Text Processing**: NLTK

        ### 🔄 How It Works

        1. **Upload**: Select a PDF book file (max 50MB)
        2. **Extract**: The system extracts and cleans text from the PDF
        3. **Chunk**: Large texts are split into manageable chunks
        4. **Summarize**: AI models process each chunk and generate summaries
        5. **Combine**: Individual summaries are combined into a final summary
        6. **Download**: Get your summary in text format

        ### 🚀 Getting Started

        1. Make sure the API server is running (`uvicorn api.main:app --reload`)
        2. Upload a PDF book file
        3. Configure your preferred settings
        4. Click "Generate Summary" and wait for processing
        5. Download your AI-generated summary

        ### 📞 Support

        For issues or questions, please check the API documentation at `/docs`
        when the server is running.
        """)

if __name__ == "__main__":
    main()
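The frontend's calls can be reproduced from any HTTP client. The sketch below assembles the same multipart form fields that `app.py` sends to `/summarize` (the `build_summarize_request` helper is hypothetical); the actual POST is left commented out because it requires the backend to be running:

```python
API_BASE_URL = "http://localhost:8000"

def build_summarize_request(pdf_name: str, pdf_bytes: bytes,
                            max_length: int = 150, min_length: int = 50,
                            chunk_size: int = 1000, overlap: int = 100,
                            model_name: str = "facebook/bart-large-cnn"):
    # Multipart file part plus form fields, matching the names app.py uses.
    files = {"file": (pdf_name, pdf_bytes, "application/pdf")}
    data = {
        "max_length": max_length,
        "min_length": min_length,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "model_name": model_name,
    }
    return files, data

files, data = build_summarize_request("book.pdf", b"%PDF-1.4 ...")
# With the API up, send it with requests:
# import requests
# response = requests.post(f"{API_BASE_URL}/summarize", files=files, data=data)
# print(response.json()["summary"])
```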
requirements.txt
ADDED
@@ -0,0 +1,12 @@
streamlit==1.28.1
fastapi==0.104.1
uvicorn==0.24.0
python-multipart==0.0.6
PyPDF2==3.0.1
pdfplumber==0.10.3
transformers==4.35.2
torch>=2.2.0
nltk==3.8.1
requests==2.31.0
python-dotenv==1.0.0
pydantic==2.5.0
start.bat
ADDED
@@ -0,0 +1,32 @@
@echo off
echo 📚 Book Summarizer AI - Windows Startup
echo ======================================

echo.
echo 🔧 Checking Python installation...
python --version >nul 2>&1
if errorlevel 1 (
    echo ❌ Python is not installed or not in PATH
    echo Please install Python from https://python.org
    pause
    exit /b 1
)

echo ✅ Python found

echo.
echo 📦 Installing dependencies...
pip install -r requirements.txt
if errorlevel 1 (
    echo ❌ Failed to install dependencies
    pause
    exit /b 1
)

echo ✅ Dependencies installed

echo.
echo 🚀 Starting Book Summarizer AI...
python start.py

pause
start.py
ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Startup script for Book Summarizer AI.
This script helps you start both the FastAPI backend and the Streamlit frontend.
"""

import subprocess
import sys
import time

import requests

def check_dependencies():
    """Check if required packages are installed."""
    required_packages = [
        'streamlit', 'fastapi', 'uvicorn', 'transformers',
        'torch', 'PyPDF2', 'pdfplumber', 'nltk'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
        except ImportError:
            missing_packages.append(package)

    if missing_packages:
        print("❌ Missing required packages:")
        for package in missing_packages:
            print(f"  - {package}")
        print("\n📦 Install them with: pip install -r requirements.txt")
        return False

    print("✅ All dependencies are installed")
    return True

def download_nltk_data():
    """Download required NLTK data."""
    try:
        import nltk
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        print("✅ NLTK data downloaded")
    except Exception as e:
        print(f"⚠️ Warning: Could not download NLTK data: {e}")

def check_api_health():
    """Check if the API is running and healthy."""
    try:
        response = requests.get("http://localhost:8000/health", timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def start_api():
    """Start the FastAPI backend."""
    print("🚀 Starting FastAPI backend...")

    # Check if the API is already running
    if check_api_health():
        print("✅ API is already running")
        return True

    try:
        # Start the API server in the background
        subprocess.Popen([
            sys.executable, "-m", "uvicorn",
            "api.main:app",
            "--reload",
            "--port", "8000",
            "--host", "0.0.0.0"
        ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        # Wait for the API to start
        print("⏳ Waiting for API to start...")
        for _ in range(30):  # Wait up to 30 seconds
            time.sleep(1)
            if check_api_health():
                print("✅ API started successfully")
                return True

        print("❌ API failed to start within 30 seconds")
        return False

    except Exception as e:
        print(f"❌ Error starting API: {e}")
        return False

def start_frontend():
    """Start the Streamlit frontend."""
    print("🚀 Starting Streamlit frontend...")

    try:
        # Start Streamlit (blocks until the user stops it)
        subprocess.run([
            sys.executable, "-m", "streamlit", "run", "app.py",
            "--server.port", "8501",
            "--server.address", "0.0.0.0"
        ])
    except KeyboardInterrupt:
        print("\n👋 Shutting down...")
    except Exception as e:
        print(f"❌ Error starting frontend: {e}")

def main():
    """Main startup function."""
    print("📚 Book Summarizer AI - Startup")
    print("=" * 40)

    # Check dependencies
    if not check_dependencies():
        sys.exit(1)

    # Download NLTK data
    download_nltk_data()

    print("\n🔧 Starting services...")

    # Start API
    if not start_api():
        print("❌ Failed to start API. Please check the logs.")
        sys.exit(1)

    print("\n🎉 Ready! Opening the application...")
    print("🌐 Frontend: http://localhost:8501")
    print("🔗 API: http://localhost:8000")
    print("📄 API Docs: http://localhost:8000/docs")
    print("\n💡 Press Ctrl+C to stop the application")

    # Start frontend
    start_frontend()

if __name__ == "__main__":
    main()
start.sh
ADDED
@@ -0,0 +1,28 @@
#!/bin/bash

echo "📚 Book Summarizer AI - Unix/Linux/Mac Startup"
echo "=============================================="

echo ""
echo "🔧 Checking Python installation..."
if ! command -v python3 &> /dev/null; then
    echo "❌ Python 3 is not installed or not in PATH"
    echo "Please install Python 3 from https://python.org"
    exit 1
fi

echo "✅ Python 3 found"

echo ""
echo "📦 Installing dependencies..."
pip3 install -r requirements.txt
if [ $? -ne 0 ]; then
    echo "❌ Failed to install dependencies"
    exit 1
fi

echo "✅ Dependencies installed"

echo ""
echo "🚀 Starting Book Summarizer AI..."
python3 start.py