Spaces:
Sleeping
Sleeping
| # Voice-to-AI Workflow Documentation | |
| ## 🎤➡️🤖 Complete Voice-to-AI Pipeline | |
| ### Current Workflow: | |
| ``` | |
| 1. 🎤 User speaks into microphone/uploads audio file | |
| ↓ | |
| 2. 🔄 Audio gets processed by Whisper-tiny model | |
| ↓ | |
| 3. 📝 Speech is transcribed to English text | |
| ↓ | |
| 4. 🧠 Text is sent to your main model: "model/Whisper-psychology-gemma-3-1b" | |
| ↓ | |
| 5. 🔍 FAISS searches relevant documents for context | |
| ↓ | |
| 6. 💬 Main model generates psychological response | |
| ↓ | |
| 7. 📺 Response is displayed in chat | |
| ↓ | |
| 8. 🔊 (Optional) Response can be converted to speech via TTS | |
| ``` | |
| ### Technical Implementation: | |
| #### Step 1-3: Speech-to-Text | |
| ```python | |
| # Audio processing with Whisper-tiny | |
| transcribed_text = transcribe_audio( | |
| audio_bytes, | |
| st.session_state.whisper_model, # whisper-tiny model | |
| st.session_state.whisper_processor | |
| ) | |
| ``` | |
| #### Step 4-6: AI Processing | |
| ```python | |
| # Main model processing | |
| answer, sources, metadata = process_medical_query( | |
| transcribed_text, # Your speech as text | |
| st.session_state.faiss_index, # Document search | |
| st.session_state.embedding_model, | |
| st.session_state.optimal_docs, | |
| st.session_state.model, # YOUR MAIN MODEL HERE | |
| st.session_state.tokenizer, # model/Whisper-psychology-gemma-3-1b | |
| **generation_params | |
| ) | |
| ``` | |
| #### Step 7-8: Response Display | |
| ```python | |
| # Add to chat and optionally convert to speech | |
| st.session_state.messages.append({ | |
| "role": "assistant", | |
| "content": answer, # Response from your main model | |
| "sources": sources, | |
| "metadata": metadata | |
| }) | |
| ``` | |
| ### Models Used: | |
| 1. **Speech-to-Text**: `stt-model/whisper-tiny/` | |
| - Converts your voice to English text | |
| - Language: English only (forced) | |
| 2. **Main AI Model**: `model/Whisper-psychology-gemma-3-1b/` ⭐ **YOUR MODEL** | |
| - Processes the transcribed text | |
| - Generates psychological responses | |
| - Uses RAG with FAISS for context | |
| 3. **Text-to-Speech**: `tts-model/Kokoro-82M/` | |
| - Converts AI response back to speech | |
| - Currently uses placeholder implementation | |
| 4. **Document Search**: `faiss_index/` | |
| - Provides context for better responses | |
| ### Usage: | |
| 1. **Click the microphone button** 🎤 | |
| 2. **Speak your mental health question** | |
| 3. **Click "🔄 Transcribe Audio"** | |
| 4. **Watch the complete pipeline work automatically:** | |
| - Your speech → Text | |
| - Text → Your AI model | |
| - AI response → Chat | |
| - Optional: Response → Speech | |
| ### What happens when you transcribe: | |
| ✅ **Immediate automatic processing** - No manual steps needed! | |
| ✅ **Your speech text goes directly to your main model** | |
| ✅ **Full psychiatric AI response is generated** | |
| ✅ **Complete conversation appears in chat** | |
| ✅ **Optional TTS for audio response** | |
| The system now automatically sends your transcribed speech to your `model/Whisper-psychology-gemma-3-1b` model and gets a full AI response without any additional steps! | |