ND06-25 committed
Commit 6880cd9 · 0 Parent(s)

first commit to AI repo
README.md ADDED
@@ -0,0 +1,196 @@
+ # 📚 Book Summarizer AI
+
+ An intelligent web application that extracts text from PDF books and generates comprehensive summaries using state-of-the-art AI models.
+
+ ## ✨ Features
+
+ - 📚 **PDF Text Extraction**: Advanced PDF processing with multiple extraction methods
+ - 🤖 **AI-Powered Summarization**: Uses transformer models (BART, T5) for high-quality summaries
+ - 🌐 **Beautiful Web Interface**: Modern UI built with Streamlit
+ - ⚡ **FastAPI Backend**: Scalable and fast API for processing
+ - 📝 **Configurable Settings**: Adjust summary length, chunk size, and AI models
+ - 📊 **Text Analysis**: Detailed statistics about book content
+ - 💾 **Download Summaries**: Save summaries as text files
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Automated Setup (Recommended)
+
+ **Windows:**
+ ```bash
+ # Double-click start.bat or run:
+ start.bat
+ ```
+
+ **Unix/Linux/Mac:**
+ ```bash
+ # Make the script executable and run it:
+ chmod +x start.sh
+ ./start.sh
+ ```
+
+ ### Option 2: Manual Setup
+
+ 1. **Install dependencies:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. **Download NLTK data:**
+ ```bash
+ python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
+ ```
+
+ 3. **Start the FastAPI backend:**
+ ```bash
+ uvicorn api.main:app --reload --port 8000
+ ```
+
+ 4. **Start the Streamlit frontend:**
+ ```bash
+ streamlit run app.py
+ ```
+
+ 5. **Open your browser:**
+ - Frontend: http://localhost:8501
+ - API Docs: http://localhost:8000/docs
+
+ ## 📖 Usage
+
+ 1. **Upload PDF**: Select a PDF book file (max 50MB)
+ 2. **Configure Settings**: Choose the AI model and summary parameters
+ 3. **Generate Summary**: Click "Generate Summary" and wait for processing
+ 4. **Download Result**: Save your AI-generated summary
+
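+ If you prefer to script this workflow instead of using the web UI, here is a minimal sketch using `requests` (the same client library the frontend uses); `book.pdf` is a placeholder path and the backend is assumed to be running locally:
+
+ ```python
+ import requests
+
+ API = "http://localhost:8000"
+
+ # Upload the PDF and request a summary in one call
+ with open("book.pdf", "rb") as f:
+     files = {"file": ("book.pdf", f, "application/pdf")}
+     data = {"max_length": 150, "min_length": 50, "chunk_size": 1000, "overlap": 100}
+     response = requests.post(f"{API}/summarize", files=files, data=data)
+
+ result = response.json()
+ print(result["statistics"])
+ print(result["summary"])
+ ```
+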
+ ## 🛠️ Technology Stack
+
+ ### Frontend
+ - **Streamlit**: Modern web interface
+ - **Custom CSS**: Beautiful styling and responsive design
+
+ ### Backend
+ - **FastAPI**: High-performance API framework
+ - **Uvicorn**: ASGI server for FastAPI
+
+ ### AI & ML
+ - **Hugging Face Transformers**: State-of-the-art NLP models
+ - **PyTorch**: Deep learning framework
+ - **BART/T5 Models**: Pre-trained summarization models
+
+ ### PDF Processing
+ - **PyPDF2**: PDF text extraction
+ - **pdfplumber**: Advanced PDF processing
+ - **NLTK**: Natural language processing
+
+ ## 📁 Project Structure
+
+ ```
+ book-summarizer/
+ ├── app.py               # Streamlit frontend
+ ├── start.py             # Automated startup script
+ ├── start.bat            # Windows startup script
+ ├── start.sh             # Unix/Linux/Mac startup script
+ ├── api/
+ │   ├── __init__.py      # API package
+ │   ├── main.py          # FastAPI backend
+ │   ├── pdf_processor.py # PDF text extraction
+ │   ├── summarizer.py    # AI summarization logic
+ │   └── utils.py         # Utility functions
+ ├── requirements.txt     # Python dependencies
+ └── README.md            # Project documentation
+ ```
+
+ ## ⚙️ Configuration
+
+ ### AI Models
+ - **facebook/bart-large-cnn**: Best quality, slower processing
+ - **t5-small**: Faster processing, good quality
+ - **facebook/bart-base**: Balanced performance
+
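+ You can also switch models at runtime through the API. A sketch, assuming the default port (the model name travels as a query parameter):
+
+ ```python
+ import requests
+
+ # Ask the backend to load a different summarization model
+ resp = requests.post(
+     "http://localhost:8000/change-model",
+     params={"model_name": "t5-small"},
+ )
+ print(resp.json())
+ ```
+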
+ ### Summary Settings
+ - **Max Length**: 50-500 words (default: 150)
+ - **Min Length**: 10-200 words (default: 50)
+ - **Chunk Size**: 500-2000 characters (default: 1000)
+ - **Overlap**: 50-200 characters (default: 100)
+
+ ## 🔧 API Endpoints
+
+ - `GET /` - API information
+ - `GET /health` - Health check
+ - `POST /upload-pdf` - Validate PDF file
+ - `POST /extract-text` - Extract text from PDF
+ - `POST /summarize` - Generate book summary
+ - `GET /models` - List available AI models
+ - `POST /change-model` - Switch AI model
+
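+ A quick way to probe the service once it is up; a sketch, assuming the default port:
+
+ ```python
+ import requests
+
+ API = "http://localhost:8000"
+
+ # Health check: also reports whether the summarization model finished loading
+ print(requests.get(f"{API}/health").json())
+
+ # List the models the backend can switch between
+ print(requests.get(f"{API}/models").json()["models"])
+ ```
+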
+ ## 📋 Requirements
+
+ - **Python**: 3.8 or higher
+ - **Memory**: At least 4GB RAM (8GB recommended)
+ - **Storage**: 2GB free space for models
+ - **Internet**: Required for first-time model download
+
+ ## 🐛 Troubleshooting
+
+ ### Common Issues
+
+ 1. **"Module not found" errors:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. **NLTK data missing:**
+ ```bash
+ python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
+ ```
+
+ 3. **API connection failed:**
+ - Ensure FastAPI is running on port 8000
+ - Check firewall settings
+ - Verify no other service is using the port
+
+ 4. **Large PDF processing slow:**
+ - Reduce the chunk size in the advanced settings
+ - Use a faster model (t5-small)
+ - Ensure sufficient RAM
+
+ 5. **Model download issues:**
+ - Check your internet connection
+ - Clear the Hugging Face cache: `rm -rf ~/.cache/huggingface`
+
+ ### Performance Tips
+
+ - **GPU Acceleration**: Install CUDA for faster processing (see the check below)
+ - **Model Selection**: Use smaller models for faster results
+ - **Chunk Size**: Smaller chunks = faster processing but may lose context
+ - **Memory**: Close other applications to free up RAM
+
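+ To confirm whether the transformer pipeline will actually run on the GPU, you can check PyTorch directly; this mirrors the device selection in `api/summarizer.py`:
+
+ ```python
+ import torch
+
+ # The summarizer picks "cuda" when available and falls back to CPU otherwise
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ print(f"Summarization will run on: {device}")
+ ```
+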
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
+
+ ## 📄 License
+
+ This project is open source and available under the MIT License.
+
+ ## 🙏 Acknowledgments
+
+ - Hugging Face for transformer models
+ - Streamlit for the web framework
+ - FastAPI for the backend framework
+ - The open-source community for various libraries
+
+ ## 📞 Support
+
+ For issues, questions, or feature requests:
+ 1. Check the troubleshooting section
+ 2. Review the API documentation at `/docs`
+ 3. Open an issue on GitHub
+
+ ---
+
+ **Happy summarizing! 📚✨**
api/__init__.py ADDED
@@ -0,0 +1 @@
+ # API package for Book Summarizer
api/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (145 Bytes)
api/__pycache__/main.cpython-312.pyc ADDED
Binary file (9.14 kB)
api/__pycache__/pdf_processor.cpython-312.pyc ADDED
Binary file (6.86 kB)
api/__pycache__/summarizer.cpython-312.pyc ADDED
Binary file (8.62 kB)
api/__pycache__/utils.cpython-312.pyc ADDED
Binary file (3.88 kB)
api/main.py ADDED
@@ -0,0 +1,224 @@
+ from fastapi import FastAPI, File, Form, UploadFile, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from typing import Dict, Any, Optional
+ import logging
+ from .pdf_processor import PDFProcessor
+ from .summarizer import BookSummarizer
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Initialize FastAPI app
+ app = FastAPI(
+     title="Book Summarizer API",
+     description="AI-powered book summarization service",
+     version="1.0.0"
+ )
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],  # In production, specify your frontend URL
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Initialize components
+ pdf_processor = PDFProcessor()
+ summarizer = BookSummarizer()
+
+ # Pydantic models (SummaryRequest documents the /summarize form fields)
+ class SummaryRequest(BaseModel):
+     max_length: int = 150
+     min_length: int = 50
+     chunk_size: int = 1000
+     overlap: int = 100
+     model_name: Optional[str] = None
+
+ class SummaryResponse(BaseModel):
+     success: bool
+     summary: str
+     statistics: Dict[str, Any]
+     message: str
+
+ @app.on_event("startup")
+ async def startup_event():
+     """Initialize components on startup."""
+     logger.info("Starting Book Summarizer API...")
+     try:
+         # Load the summarization model
+         summarizer.load_model()
+         logger.info("API startup completed successfully")
+     except Exception as e:
+         logger.error(f"Error during startup: {str(e)}")
+
+ @app.get("/")
+ async def root():
+     """Root endpoint."""
+     return {
+         "message": "Book Summarizer API",
+         "version": "1.0.0",
+         "status": "running"
+     }
+
+ @app.get("/health")
+ async def health_check():
+     """Health check endpoint."""
+     return {
+         "status": "healthy",
+         "model_loaded": summarizer.summarizer is not None
+     }
+
+ @app.post("/upload-pdf")
+ async def upload_pdf(file: UploadFile = File(...)):
+     """Upload and validate a PDF file."""
+     try:
+         # Check file type
+         if not file.filename.lower().endswith('.pdf'):
+             raise HTTPException(status_code=400, detail="Only PDF files are supported")
+
+         # Read file content
+         content = await file.read()
+
+         # Validate PDF
+         validation_result = pdf_processor.validate_pdf(content)
+         if not validation_result['valid']:
+             raise HTTPException(status_code=400, detail=validation_result['message'])
+
+         # Extract metadata
+         metadata = pdf_processor.get_pdf_metadata(content)
+
+         return {
+             "success": True,
+             "filename": file.filename,
+             "size_mb": validation_result['size_mb'],
+             "pages": validation_result['pages'],
+             "metadata": metadata,
+             "message": "PDF uploaded and validated successfully"
+         }
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         logger.error(f"Error uploading PDF: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")
+
+ @app.post("/extract-text")
+ async def extract_text(file: UploadFile = File(...)):
+     """Extract text from an uploaded PDF."""
+     try:
+         # Read file content
+         content = await file.read()
+
+         # Extract text
+         result = pdf_processor.extract_text_from_pdf(content)
+
+         if not result['success']:
+             raise HTTPException(status_code=400, detail=result['message'])
+
+         return {
+             "success": True,
+             "text_length": len(result['text']),
+             "preview": result['text'][:1000],
+             "statistics": result['statistics'],
+             "pages": result['pages'],
+             "message": result['message']
+         }
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         logger.error(f"Error extracting text: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error extracting text: {str(e)}")
+
+ @app.post("/summarize")
+ async def summarize_book(
+     file: UploadFile = File(...),
+     # Settings arrive as multipart form fields so they can ride along with
+     # the file upload (a JSON body cannot be mixed with File(...) here).
+     max_length: int = Form(150),
+     min_length: int = Form(50),
+     chunk_size: int = Form(1000),
+     overlap: int = Form(100),
+     model_name: Optional[str] = Form(None)
+ ):
+     """Summarize a book from an uploaded PDF."""
+     try:
+         # Read file content
+         content = await file.read()
+
+         # Extract text
+         extraction_result = pdf_processor.extract_text_from_pdf(content)
+         if not extraction_result['success']:
+             raise HTTPException(status_code=400, detail=extraction_result['message'])
+
+         # Change model if a different one was requested
+         # (avoid resetting the pipeline when the current model is requested)
+         if model_name and model_name != summarizer.model_name:
+             summarizer.change_model(model_name)
+
+         # Summarize the book
+         summary_result = summarizer.summarize_book(
+             text=extraction_result['text'],
+             chunk_size=chunk_size,
+             overlap=overlap,
+             max_length=max_length,
+             min_length=min_length
+         )
+
+         if not summary_result['success']:
+             raise HTTPException(status_code=500, detail=summary_result.get('error', 'Summarization failed'))
+
+         return {
+             "success": True,
+             "summary": summary_result['summary'],
+             "statistics": summary_result['statistics'],
+             "original_statistics": extraction_result['statistics'],
+             "message": "Book summarized successfully"
+         }
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         logger.error(f"Error summarizing book: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error summarizing book: {str(e)}")
+
+ @app.get("/models")
+ async def get_available_models():
+     """Get the list of available summarization models."""
+     try:
+         models = summarizer.get_available_models()
+         return {
+             "success": True,
+             "models": models,
+             "current_model": summarizer.model_name
+         }
+     except Exception as e:
+         logger.error(f"Error getting models: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error getting models: {str(e)}")
+
+ @app.post("/change-model")
+ async def change_model(model_name: str):
+     """Change the summarization model (model_name is a query parameter)."""
+     try:
+         summarizer.change_model(model_name)
+         summarizer.load_model()
+
+         return {
+             "success": True,
+             "message": f"Model changed to {model_name}",
+             "current_model": model_name
+         }
+     except Exception as e:
+         logger.error(f"Error changing model: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error changing model: {str(e)}")
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
api/pdf_processor.py ADDED
@@ -0,0 +1,172 @@
+ import PyPDF2
+ import pdfplumber
+ import io
+ from typing import Dict, Any
+ import logging
+ from .utils import clean_text, get_text_statistics
+
+ logger = logging.getLogger(__name__)
+
+ class PDFProcessor:
+     """Handles PDF text extraction and processing."""
+
+     def __init__(self):
+         self.supported_formats = ['.pdf']
+
+     def extract_text_from_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
+         """
+         Extract text from PDF file bytes.
+
+         Args:
+             pdf_file: PDF file as bytes
+
+         Returns:
+             Dictionary containing the extracted text and metadata
+         """
+         try:
+             # Try pdfplumber first (better for complex layouts)
+             text = self._extract_with_pdfplumber(pdf_file)
+
+             if not text or len(text.strip()) < 100:
+                 # Fall back to PyPDF2
+                 text = self._extract_with_pypdf2(pdf_file)
+
+             if not text:
+                 raise ValueError("Could not extract text from PDF")
+
+             # Clean the extracted text
+             cleaned_text = clean_text(text)
+
+             # Get text statistics
+             stats = get_text_statistics(cleaned_text)
+
+             return {
+                 'success': True,
+                 'text': cleaned_text,
+                 'statistics': stats,
+                 'pages': self._get_page_count(pdf_file),
+                 'message': 'Text extracted successfully'
+             }
+
+         except Exception as e:
+             logger.error(f"Error extracting text from PDF: {str(e)}")
+             return {
+                 'success': False,
+                 'text': '',
+                 'statistics': {},
+                 'pages': 0,
+                 'message': f'Error extracting text: {str(e)}'
+             }
+
+     def _extract_with_pdfplumber(self, pdf_file: bytes) -> str:
+         """Extract text using pdfplumber (better for complex layouts)."""
+         text_parts = []
+
+         try:
+             with pdfplumber.open(io.BytesIO(pdf_file)) as pdf:
+                 for page in pdf.pages:
+                     page_text = page.extract_text()
+                     if page_text:
+                         text_parts.append(page_text)
+
+             return '\n'.join(text_parts)
+         except Exception as e:
+             logger.warning(f"pdfplumber extraction failed: {str(e)}")
+             return ""
+
+     def _extract_with_pypdf2(self, pdf_file: bytes) -> str:
+         """Extract text using PyPDF2 (fallback method)."""
+         text_parts = []
+
+         try:
+             pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
+
+             for page in pdf_reader.pages:
+                 page_text = page.extract_text()
+                 if page_text:
+                     text_parts.append(page_text)
+
+             return '\n'.join(text_parts)
+         except Exception as e:
+             logger.warning(f"PyPDF2 extraction failed: {str(e)}")
+             return ""
+
+     def _get_page_count(self, pdf_file: bytes) -> int:
+         """Get the number of pages in the PDF."""
+         try:
+             pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
+             return len(pdf_reader.pages)
+         except Exception:
+             return 0
+
+     def get_pdf_metadata(self, pdf_file: bytes) -> Dict[str, Any]:
+         """Extract metadata from the PDF file."""
+         try:
+             pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
+             metadata = pdf_reader.metadata or {}
+
+             return {
+                 'title': metadata.get('/Title', 'Unknown'),
+                 'author': metadata.get('/Author', 'Unknown'),
+                 'subject': metadata.get('/Subject', ''),
+                 'creator': metadata.get('/Creator', ''),
+                 'producer': metadata.get('/Producer', ''),
+                 'pages': len(pdf_reader.pages)
+             }
+         except Exception as e:
+             logger.error(f"Error extracting PDF metadata: {str(e)}")
+             return {
+                 'title': 'Unknown',
+                 'author': 'Unknown',
+                 'subject': '',
+                 'creator': '',
+                 'producer': '',
+                 'pages': 0
+             }
+
+     def validate_pdf(self, pdf_file: bytes) -> Dict[str, Any]:
+         """Validate the PDF file and check whether it can be processed."""
+         try:
+             # Check file size
+             file_size = len(pdf_file)
+             max_size = 50 * 1024 * 1024  # 50MB limit
+
+             if file_size > max_size:
+                 return {
+                     'valid': False,
+                     'message': f'File too large. Maximum size is 50MB, got {file_size / (1024*1024):.1f}MB'
+                 }
+
+             # Try to read the PDF
+             pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file))
+
+             if len(pdf_reader.pages) == 0:
+                 return {
+                     'valid': False,
+                     'message': 'PDF appears to be empty or corrupted'
+                 }
+
+             return {
+                 'valid': True,
+                 'message': 'PDF is valid',
+                 'pages': len(pdf_reader.pages),
+                 'size_mb': file_size / (1024 * 1024)
+             }
+
+         except Exception as e:
+             return {
+                 'valid': False,
+                 'message': f'Invalid PDF file: {str(e)}'
+             }
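+
+ if __name__ == "__main__":
+     # Minimal usage sketch (not part of the API). "sample.pdf" is a
+     # placeholder path; run with `python -m api.pdf_processor` so the
+     # relative import of .utils resolves.
+     with open("sample.pdf", "rb") as f:
+         pdf_bytes = f.read()
+     processor = PDFProcessor()
+     print(processor.validate_pdf(pdf_bytes))
+     print(processor.get_pdf_metadata(pdf_bytes))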
api/summarizer.py ADDED
@@ -0,0 +1,226 @@
+ from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
+ from typing import List, Dict, Any, Union
+ import torch
+ import logging
+ from .utils import chunk_text
+
+ logger = logging.getLogger(__name__)
+
+ class BookSummarizer:
+     """Handles AI-powered text summarization using transformer models."""
+
+     def __init__(self, model_name: str = "facebook/bart-large-cnn"):
+         """
+         Initialize the summarizer with a specific model.
+
+         Args:
+             model_name: Hugging Face model name for summarization
+         """
+         self.model_name = model_name
+         self.summarizer = None
+         self.tokenizer = None
+         self.model = None
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+
+         logger.info(f"Initializing summarizer with model: {model_name}")
+         logger.info(f"Using device: {self.device}")
+
+     def load_model(self):
+         """Load the summarization model and tokenizer."""
+         try:
+             logger.info("Loading summarization model...")
+
+             # Load tokenizer and model
+             self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+             self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)
+
+             # Move the model to the appropriate device
+             self.model.to(self.device)
+
+             # Create pipeline
+             self.summarizer = pipeline(
+                 "summarization",
+                 model=self.model,
+                 tokenizer=self.tokenizer,
+                 device=0 if self.device == "cuda" else -1
+             )
+
+             logger.info("Model loaded successfully")
+
+         except Exception as e:
+             logger.error(f"Error loading model: {str(e)}")
+             raise
+
+     def summarize_text(self, text: str, max_length: int = 150, min_length: int = 50,
+                        do_sample: bool = False) -> Dict[str, Any]:
+         """
+         Summarize a single text chunk.
+
+         Args:
+             text: Text to summarize
+             max_length: Maximum length of the summary
+             min_length: Minimum length of the summary
+             do_sample: Whether to use sampling for generation
+
+         Returns:
+             Dictionary containing the summary and metadata
+         """
+         try:
+             if not self.summarizer:
+                 self.load_model()
+
+             # Return short texts unchanged; there is nothing to compress
+             if len(text.split()) < 50:
+                 return {
+                     'success': True,
+                     'summary': text,
+                     'original_length': len(text.split()),
+                     'summary_length': len(text.split()),
+                     'compression_ratio': 1.0
+                 }
+
+             # Generate summary
+             summary_result = self.summarizer(
+                 text,
+                 max_length=max_length,
+                 min_length=min_length,
+                 do_sample=do_sample,
+                 truncation=True
+             )
+
+             summary = summary_result[0]['summary_text']
+
+             # Calculate compression ratio
+             original_words = len(text.split())
+             summary_words = len(summary.split())
+             compression_ratio = summary_words / original_words if original_words > 0 else 0
+
+             return {
+                 'success': True,
+                 'summary': summary,
+                 'original_length': original_words,
+                 'summary_length': summary_words,
+                 'compression_ratio': compression_ratio
+             }
+
+         except Exception as e:
+             logger.error(f"Error summarizing text: {str(e)}")
+             return {
+                 'success': False,
+                 'summary': '',
+                 'error': str(e)
+             }
+
+     def summarize_book(self, text: str, chunk_size: int = 1000, overlap: int = 100,
+                        max_length: int = 150, min_length: int = 50) -> Dict[str, Any]:
+         """
+         Summarize a complete book by processing it in chunks.
+
+         Args:
+             text: Complete book text
+             chunk_size: Size of each text chunk
+             overlap: Overlap between chunks
+             max_length: Maximum length of each summary
+             min_length: Minimum length of each summary
+
+         Returns:
+             Dictionary containing the complete summary and metadata
+         """
+         try:
+             logger.info("Starting book summarization...")
+
+             # Split text into chunks
+             chunks = chunk_text(text, chunk_size, overlap)
+             logger.info(f"Split text into {len(chunks)} chunks")
+
+             # Summarize each chunk
+             chunk_summaries = []
+             total_original_words = 0
+             total_summary_words = 0
+
+             for i, chunk in enumerate(chunks):
+                 logger.info(f"Processing chunk {i+1}/{len(chunks)}")
+
+                 result = self.summarize_text(chunk, max_length, min_length)
+
+                 if result['success']:
+                     chunk_summaries.append(result['summary'])
+                     total_original_words += result['original_length']
+                     total_summary_words += result['summary_length']
+                 else:
+                     logger.warning(f"Failed to summarize chunk {i+1}: {result.get('error', 'Unknown error')}")
+                     # Fall back to the start of the original chunk
+                     chunk_summaries.append(chunk[:200] + "...")
+
+             # Combine all chunk summaries
+             combined_summary = " ".join(chunk_summaries)
+
+             # Create a final summary if the combined summary is still too long
+             if len(combined_summary.split()) > 500:
+                 logger.info("Creating final summary from combined summaries...")
+                 final_result = self.summarize_text(combined_summary, max_length=300, min_length=100)
+                 if final_result['success']:
+                     combined_summary = final_result['summary']
+
+             # Calculate overall statistics
+             overall_compression = total_summary_words / total_original_words if total_original_words > 0 else 0
+
+             return {
+                 'success': True,
+                 'summary': combined_summary,
+                 'statistics': {
+                     'total_chunks': len(chunks),
+                     'total_original_words': total_original_words,
+                     'total_summary_words': total_summary_words,
+                     'overall_compression_ratio': overall_compression,
+                     'final_summary_length': len(combined_summary.split())
+                 },
+                 'chunk_summaries': chunk_summaries
+             }
+
+         except Exception as e:
+             logger.error(f"Error in book summarization: {str(e)}")
+             return {
+                 'success': False,
+                 'summary': '',
+                 'error': str(e)
+             }
+
+     def get_available_models(self) -> List[Dict[str, Union[str, int]]]:
+         """Get the list of available summarization models."""
+         return [
+             {
+                 'name': 'facebook/bart-large-cnn',
+                 'description': 'BART model fine-tuned on CNN news articles (recommended)',
+                 'max_length': 1024
+             },
+             {
+                 'name': 't5-small',
+                 'description': 'Small T5 model, faster but less accurate',
+                 'max_length': 512
+             },
+             {
+                 'name': 'facebook/bart-base',
+                 'description': 'Base BART model, balanced performance',
+                 'max_length': 1024
+             }
+         ]
+
+     def change_model(self, model_name: str):
+         """
+         Change the summarization model. The new model is loaded lazily on
+         the next summarization call (or via load_model()).
+
+         Args:
+             model_name: New model name to use
+         """
+         self.model_name = model_name
+         self.summarizer = None
+         self.tokenizer = None
+         self.model = None
+         logger.info(f"Model changed to: {model_name}")
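+
+ if __name__ == "__main__":
+     # Minimal usage sketch (not part of the API). Downloads t5-small on
+     # first run; run with `python -m api.summarizer` so the relative
+     # import of .utils resolves.
+     demo = BookSummarizer("t5-small")
+     demo.load_model()
+     sample = ("Natural language processing studies how computers can "
+               "analyze, understand and generate human language. ") * 20
+     print(demo.summarize_text(sample, max_length=60, min_length=20))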
api/utils.py ADDED
@@ -0,0 +1,124 @@
+ import re
+ import nltk
+ from typing import List, Dict, Any
+ import logging
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def clean_text(text: str) -> str:
+     """Clean and preprocess text extracted from a PDF."""
+     # Collapse extra whitespace and normalize newlines
+     text = re.sub(r'\s+', ' ', text)
+     text = re.sub(r'\n+', '\n', text)
+     text = text.strip()
+
+     # Remove common PDF artifacts, keeping ordinary punctuation and quotes
+     text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\{\}\'\"]', '', text)
+
+     return text
+
+ def chunk_text(text: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[str]:
+     """
+     Split text into overlapping chunks for processing.
+
+     Args:
+         text: Input text to chunk
+         max_chunk_size: Maximum size of each chunk in characters
+         overlap: Number of characters to overlap between chunks
+
+     Returns:
+         List of text chunks
+     """
+     if len(text) <= max_chunk_size:
+         return [text]
+
+     chunks = []
+     start = 0
+
+     while start < len(text):
+         end = start + max_chunk_size
+
+         # Try to break at a sentence boundary
+         if end < len(text):
+             sentence_endings = ['.', '!', '?']
+             for ending in sentence_endings:
+                 last_ending = text.rfind(ending, start, end)
+                 # Only break early if we are at least 80% through the chunk
+                 if last_ending > start + max_chunk_size * 0.8:
+                     end = last_ending + 1
+                     break
+
+         chunk = text[start:end].strip()
+         if chunk:
+             chunks.append(chunk)
+
+         # Move the start position back by the overlap
+         start = end - overlap
+         if start >= len(text):
+             break
+
+     return chunks
+
+ def extract_chapters(text: str) -> Dict[str, str]:
+     """Attempt to split the text into chapters by matching common headings."""
+     chapters = {}
+
+     # Common chapter heading patterns
+     chapter_patterns = [
+         r'Chapter\s+(\d+|[IVXLC]+)',
+         r'CHAPTER\s+(\d+|[IVXLC]+)',
+         r'(\d+)\.\s+[A-Z]',
+         r'[IVXLC]+\.\s+[A-Z]'
+     ]
+
+     lines = text.split('\n')
+     current_chapter = "Introduction"
+     current_content = []
+
+     for line in lines:
+         line = line.strip()
+         if not line:
+             continue
+
+         # Check whether this line is a chapter heading
+         is_chapter_header = False
+         for pattern in chapter_patterns:
+             if re.match(pattern, line, re.IGNORECASE):
+                 # Save the previous chapter
+                 if current_content:
+                     chapters[current_chapter] = '\n'.join(current_content)
+
+                 current_chapter = line
+                 current_content = []
+                 is_chapter_header = True
+                 break
+
+         if not is_chapter_header:
+             current_content.append(line)
+
+     # Save the last chapter
+     if current_content:
+         chapters[current_chapter] = '\n'.join(current_content)
+
+     return chapters
+
+ def get_text_statistics(text: str) -> Dict[str, Any]:
+     """Get basic statistics about the text."""
+     words = text.split()
+     sentences = nltk.sent_tokenize(text)
+
+     return {
+         'total_characters': len(text),
+         'total_words': len(words),
+         'total_sentences': len(sentences),
+         'average_words_per_sentence': len(words) / len(sentences) if sentences else 0,
+         'estimated_reading_time_minutes': len(words) / 200  # ~200 words per minute
+     }
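+
+ if __name__ == "__main__":
+     # Minimal self-test sketch; get_text_statistics assumes the NLTK
+     # 'punkt' data has been downloaded.
+     sample = "First sentence here. Second one follows! Does a third ask a question? " * 30
+     pieces = chunk_text(sample, max_chunk_size=500, overlap=50)
+     print(f"{len(pieces)} chunks, first starts: {pieces[0][:40]!r}")
+     print(get_text_statistics(sample))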
app.py ADDED
@@ -0,0 +1,314 @@
+ import streamlit as st
+ import requests
+
+ # Page configuration
+ st.set_page_config(
+     page_title="Book Summarizer AI",
+     page_icon="📚",
+     layout="wide",
+     initial_sidebar_state="expanded"
+ )
+
+ # API configuration
+ API_BASE_URL = "http://localhost:8000"
+
+ def main():
+     # Custom CSS for better styling
+     st.markdown("""
+     <style>
+     .main-header {
+         font-size: 3rem;
+         font-weight: bold;
+         text-align: center;
+         color: #1f77b4;
+         margin-bottom: 2rem;
+     }
+     .sub-header {
+         font-size: 1.5rem;
+         color: #666;
+         text-align: center;
+         margin-bottom: 2rem;
+     }
+     .success-box {
+         background-color: #d4edda;
+         border: 1px solid #c3e6cb;
+         border-radius: 5px;
+         padding: 1rem;
+         margin: 1rem 0;
+     }
+     .error-box {
+         background-color: #f8d7da;
+         border: 1px solid #f5c6cb;
+         border-radius: 5px;
+         padding: 1rem;
+         margin: 1rem 0;
+     }
+     .info-box {
+         background-color: #d1ecf1;
+         border: 1px solid #bee5eb;
+         border-radius: 5px;
+         padding: 1rem;
+         margin: 1rem 0;
+     }
+     </style>
+     """, unsafe_allow_html=True)
+
+     # Header
+     st.markdown('<h1 class="main-header">📚 Book Summarizer AI</h1>', unsafe_allow_html=True)
+     st.markdown('<p class="sub-header">Transform your PDF books into intelligent summaries using AI</p>', unsafe_allow_html=True)
+
+     # Sidebar
+     with st.sidebar:
+         st.header("⚙️ Settings")
+
+         # Model selection
+         st.subheader("AI Model")
+         try:
+             models_response = requests.get(f"{API_BASE_URL}/models")
+             if models_response.status_code == 200:
+                 models_data = models_response.json()
+                 models = models_data.get('models', [])
+                 current_model = models_data.get('current_model', '')
+
+                 model_names = [model['name'] for model in models]
+                 selected_model = st.selectbox(
+                     "Choose AI Model",
+                     model_names,
+                     index=model_names.index(current_model) if current_model in model_names else 0
+                 )
+
+                 # Show the model description
+                 selected_model_info = next((m for m in models if m['name'] == selected_model), None)
+                 if selected_model_info:
+                     st.info(f"**{selected_model_info['description']}**")
+             else:
+                 st.error("Failed to load models")
+                 selected_model = "facebook/bart-large-cnn"
+         except Exception as e:
+             st.error(f"Error loading models: {str(e)}")
+             selected_model = "facebook/bart-large-cnn"
+
+         # Summary settings
+         st.subheader("Summary Settings")
+         max_length = st.slider("Maximum Summary Length", 50, 500, 150, help="Maximum number of words in the summary")
+         min_length = st.slider("Minimum Summary Length", 10, 200, 50, help="Minimum number of words in the summary")
+
+         # Advanced settings
+         with st.expander("Advanced Settings"):
+             chunk_size = st.slider("Chunk Size", 500, 2000, 1000, help="Size of text chunks for processing")
+             overlap = st.slider("Chunk Overlap", 50, 200, 100, help="Overlap between text chunks")
+
+         # API status
+         st.subheader("API Status")
+         try:
+             health_response = requests.get(f"{API_BASE_URL}/health")
+             if health_response.status_code == 200:
+                 st.success("✅ API Connected")
+             else:
+                 st.error("❌ API Error")
+         except requests.RequestException:
+             st.error("❌ API Unavailable")
+
+     # Main content
+     tab1, tab2, tab3 = st.tabs(["📖 Summarize Book", "📊 Text Analysis", "ℹ️ About"])
+
+     with tab1:
+         st.header("📖 Book Summarization")
+
+         # File upload
+         uploaded_file = st.file_uploader(
+             "Choose a PDF book file",
+             type=['pdf'],
+             help="Upload a PDF file (max 50MB)"
+         )
+
+         if uploaded_file is not None:
+             # File info
+             file_size = len(uploaded_file.getvalue()) / (1024 * 1024)  # MB
+             st.info(f"📄 **File:** {uploaded_file.name} ({file_size:.1f} MB)")
+
+             # Validate file
+             if st.button("🔍 Validate PDF", type="secondary"):
+                 with st.spinner("Validating PDF..."):
+                     try:
+                         # Send the real filename so the backend's .pdf check passes
+                         files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
+                         response = requests.post(f"{API_BASE_URL}/upload-pdf", files=files)
+
+                         if response.status_code == 200:
+                             data = response.json()
+                             st.success(f"✅ {data['message']}")
+
+                             # Display metadata
+                             metadata = data.get('metadata', {})
+                             col1, col2, col3 = st.columns(3)
+                             with col1:
+                                 st.metric("Pages", data['pages'])
+                             with col2:
+                                 st.metric("Size", f"{data['size_mb']:.1f} MB")
+                             with col3:
+                                 st.metric("Title", metadata.get('title', 'Unknown'))
+                         else:
+                             st.error(f"❌ Validation failed: {response.json().get('detail', 'Unknown error')}")
+                     except Exception as e:
+                         st.error(f"❌ Error: {str(e)}")
+
+         # Summarize button
+         if st.button("🚀 Generate Summary", type="primary"):
+             if uploaded_file is not None:
+                 with st.spinner("Processing your book..."):
+                     try:
+                         # Prepare the request
+                         files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
+                         data = {
+                             "max_length": max_length,
+                             "min_length": min_length,
+                             "chunk_size": chunk_size,
+                             "overlap": overlap,
+                             "model_name": selected_model
+                         }
+
+                         # Send the request
+                         response = requests.post(f"{API_BASE_URL}/summarize", files=files, data=data)
+
+                         if response.status_code == 200:
+                             result = response.json()
+
+                             # Display success message
+                             st.success("✅ Summary generated successfully!")
+
+                             # Display statistics
+                             col1, col2, col3, col4 = st.columns(4)
+                             stats = result.get('statistics', {})
+                             orig_stats = result.get('original_statistics', {})
+
+                             with col1:
+                                 st.metric("Original Words", f"{orig_stats.get('total_words', 0):,}")
+                             with col2:
+                                 st.metric("Summary Words", f"{stats.get('final_summary_length', 0):,}")
+                             with col3:
+                                 compression = stats.get('overall_compression_ratio', 0)
+                                 st.metric("Compression", f"{compression:.1%}")
+                             with col4:
+                                 st.metric("Chunks Processed", stats.get('total_chunks', 0))
+
+                             # Display the summary
+                             st.subheader("📝 Generated Summary")
+                             summary = result.get('summary', '')
+                             st.text_area(
+                                 "Summary",
+                                 value=summary,
+                                 height=400,
+                                 disabled=True
+                             )
+
+                             # Download button
+                             summary_bytes = summary.encode('utf-8')
+                             st.download_button(
+                                 label="📥 Download Summary",
+                                 data=summary_bytes,
+                                 file_name=f"{uploaded_file.name.replace('.pdf', '')}_summary.txt",
+                                 mime="text/plain"
+                             )
+
+                         else:
+                             error_msg = response.json().get('detail', 'Unknown error')
+                             st.error(f"❌ Summarization failed: {error_msg}")
+
+                     except Exception as e:
+                         st.error(f"❌ Error: {str(e)}")
+             else:
+                 st.warning("📄 Please upload a PDF file first.")
+
+     with tab2:
+         st.header("📊 Text Analysis")
+
+         if uploaded_file is not None:
+             if st.button("📊 Analyze Text"):
+                 with st.spinner("Analyzing text..."):
+                     try:
+                         files = {"file": (uploaded_file.name, uploaded_file.getvalue(), "application/pdf")}
+                         response = requests.post(f"{API_BASE_URL}/extract-text", files=files)
+
+                         if response.status_code == 200:
+                             data = response.json()
+                             stats = data.get('statistics', {})
+
+                             # Display statistics
+                             col1, col2, col3, col4 = st.columns(4)
+
+                             with col1:
+                                 st.metric("Total Words", f"{stats.get('total_words', 0):,}")
+                             with col2:
+                                 st.metric("Total Sentences", f"{stats.get('total_sentences', 0):,}")
+                             with col3:
+                                 st.metric("Avg Words/Sentence", f"{stats.get('average_words_per_sentence', 0):.1f}")
+                             with col4:
+                                 st.metric("Reading Time", f"{stats.get('estimated_reading_time_minutes', 0):.1f} min")
+
+                             # Text preview (the preview field comes from the same
+                             # response, so no second extract-text request is needed)
+                             st.subheader("📄 Text Preview")
+                             preview_text = data.get('preview', '')
+                             if data.get('text_length', 0) > 1000:
+                                 preview_text += "..."
+                             st.text_area("First 1000 characters:", value=preview_text, height=200, disabled=True)
+                         else:
+                             st.error(f"❌ Analysis failed: {response.json().get('detail', 'Unknown error')}")
+                     except Exception as e:
+                         st.error(f"❌ Error: {str(e)}")
+         else:
+             st.info("📄 Please upload a PDF file to analyze its text.")
+
+     with tab3:
+         st.header("ℹ️ About")
+
+         st.markdown("""
+         ## 🤖 Book Summarizer AI
+
+         This application uses advanced AI models to automatically summarize PDF books.
+         It processes the text in chunks and generates comprehensive summaries while
+         maintaining the key information and context.
+
+         ### ✨ Features
+
+         - **PDF Text Extraction**: Advanced PDF processing with fallback methods
+         - **AI Summarization**: State-of-the-art transformer models
+         - **Configurable Settings**: Adjust summary length and processing parameters
+         - **Multiple Models**: Choose from different AI models for various use cases
+         - **Text Analysis**: Detailed statistics about the book content
+
+         ### 🛠️ Technology Stack
+
+         - **Frontend**: Streamlit
+         - **Backend**: FastAPI
+         - **AI Models**: Hugging Face Transformers (BART, T5)
+         - **PDF Processing**: PyPDF2, pdfplumber
+         - **Text Processing**: NLTK
+
+         ### 📋 How It Works
+
+         1. **Upload**: Select a PDF book file (max 50MB)
+         2. **Extract**: The system extracts and cleans text from the PDF
+         3. **Chunk**: Large texts are split into manageable chunks
+         4. **Summarize**: AI models process each chunk and generate summaries
+         5. **Combine**: Individual summaries are combined into a final summary
+         6. **Download**: Get your summary in text format
+
+         ### 🚀 Getting Started
+
+         1. Make sure the API server is running (`uvicorn api.main:app --reload`)
+         2. Upload a PDF book file
+         3. Configure your preferred settings
+         4. Click "Generate Summary" and wait for processing
+         5. Download your AI-generated summary
+
+         ### 📞 Support
+
+         For issues or questions, please check the API documentation at `/docs`
+         when the server is running.
+         """)
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ streamlit==1.28.1
+ fastapi==0.104.1
+ uvicorn==0.24.0
+ python-multipart==0.0.6
+ PyPDF2==3.0.1
+ pdfplumber==0.10.3
+ transformers==4.35.2
+ torch>=2.2.0
+ nltk==3.8.1
+ requests==2.31.0
+ python-dotenv==1.0.0
+ pydantic==2.5.0
start.bat ADDED
@@ -0,0 +1,32 @@
+ @echo off
+ echo 📚 Book Summarizer AI - Windows Startup
+ echo ======================================
+
+ echo.
+ echo 🔧 Checking Python installation...
+ python --version >nul 2>&1
+ if errorlevel 1 (
+     echo ❌ Python is not installed or not in PATH
+     echo Please install Python from https://python.org
+     pause
+     exit /b 1
+ )
+
+ echo ✅ Python found
+
+ echo.
+ echo 📦 Installing dependencies...
+ pip install -r requirements.txt
+ if errorlevel 1 (
+     echo ❌ Failed to install dependencies
+     pause
+     exit /b 1
+ )
+
+ echo ✅ Dependencies installed
+
+ echo.
+ echo 🚀 Starting Book Summarizer AI...
+ python start.py
+
+ pause
start.py ADDED
@@ -0,0 +1,135 @@
+ #!/usr/bin/env python3
+ """
+ Startup script for Book Summarizer AI
+ This script helps you start both the FastAPI backend and the Streamlit frontend.
+ """
+
+ import subprocess
+ import sys
+ import time
+ import requests
+
+ def check_dependencies():
+     """Check whether the required packages are installed."""
+     required_packages = [
+         'streamlit', 'fastapi', 'uvicorn', 'transformers',
+         'torch', 'PyPDF2', 'pdfplumber', 'nltk'
+     ]
+
+     missing_packages = []
+     for package in required_packages:
+         try:
+             __import__(package)
+         except ImportError:
+             missing_packages.append(package)
+
+     if missing_packages:
+         print("❌ Missing required packages:")
+         for package in missing_packages:
+             print(f"  - {package}")
+         print("\n📦 Install them with: pip install -r requirements.txt")
+         return False
+
+     print("✅ All dependencies are installed")
+     return True
+
+ def download_nltk_data():
+     """Download the required NLTK data."""
+     try:
+         import nltk
+         nltk.download('punkt', quiet=True)
+         nltk.download('stopwords', quiet=True)
+         print("✅ NLTK data downloaded")
+     except Exception as e:
+         print(f"⚠️ Warning: Could not download NLTK data: {e}")
+
+ def check_api_health():
+     """Check whether the API is running and healthy."""
+     try:
+         response = requests.get("http://localhost:8000/health", timeout=5)
+         return response.status_code == 200
+     except requests.RequestException:
+         return False
+
+ def start_api():
+     """Start the FastAPI backend."""
+     print("🚀 Starting FastAPI backend...")
+
+     # Check if the API is already running
+     if check_api_health():
+         print("✅ API is already running")
+         return True
+
+     try:
+         # Start the API server; discard its output since it is never read
+         # (an unread PIPE would eventually fill up and block the server)
+         subprocess.Popen([
+             sys.executable, "-m", "uvicorn",
+             "api.main:app",
+             "--reload",
+             "--port", "8000",
+             "--host", "0.0.0.0"
+         ], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+         # Wait for the API to start
+         print("⏳ Waiting for API to start...")
+         for _ in range(30):  # Wait up to 30 seconds
+             time.sleep(1)
+             if check_api_health():
+                 print("✅ API started successfully")
+                 return True
+
+         print("❌ API failed to start within 30 seconds")
+         return False
+
+     except Exception as e:
+         print(f"❌ Error starting API: {e}")
+         return False
+
+ def start_frontend():
+     """Start the Streamlit frontend."""
+     print("🌐 Starting Streamlit frontend...")
+
+     try:
+         # Start Streamlit
+         subprocess.run([
+             sys.executable, "-m", "streamlit", "run", "app.py",
+             "--server.port", "8501",
+             "--server.address", "0.0.0.0"
+         ])
+     except KeyboardInterrupt:
+         print("\n👋 Shutting down...")
+     except Exception as e:
+         print(f"❌ Error starting frontend: {e}")
+
+ def main():
+     """Main startup function."""
+     print("📚 Book Summarizer AI - Startup")
+     print("=" * 40)
+
+     # Check dependencies
+     if not check_dependencies():
+         sys.exit(1)
+
+     # Download NLTK data
+     download_nltk_data()
+
+     print("\n🔧 Starting services...")
+
+     # Start the API
+     if not start_api():
+         print("❌ Failed to start the API. Please check the logs.")
+         sys.exit(1)
+
+     print("\n🎉 Ready! Opening the application...")
+     print("📖 Frontend: http://localhost:8501")
+     print("🔌 API: http://localhost:8000")
+     print("📚 API Docs: http://localhost:8000/docs")
+     print("\n💡 Press Ctrl+C to stop the application")
+
+     # Start the frontend
+     start_frontend()
+
+ if __name__ == "__main__":
+     main()
start.sh ADDED
@@ -0,0 +1,28 @@
+ #!/bin/bash
+
+ echo "📚 Book Summarizer AI - Unix/Linux/Mac Startup"
+ echo "=============================================="
+
+ echo ""
+ echo "🔧 Checking Python installation..."
+ if ! command -v python3 &> /dev/null; then
+     echo "❌ Python 3 is not installed or not in PATH"
+     echo "Please install Python 3 from https://python.org"
+     exit 1
+ fi
+
+ echo "✅ Python 3 found"
+
+ echo ""
+ echo "📦 Installing dependencies..."
+ pip3 install -r requirements.txt
+ if [ $? -ne 0 ]; then
+     echo "❌ Failed to install dependencies"
+     exit 1
+ fi
+
+ echo "✅ Dependencies installed"
+
+ echo ""
+ echo "🚀 Starting Book Summarizer AI..."
+ python3 start.py