# Audio Features Documentation - Whisper AI-Psychiatric

## Overview

The Whisper AI-Psychiatric application now includes speech-to-text and text-to-speech capabilities, so users can ask questions by voice and listen to the AI's responses.

## Features Added

### 🎤 Speech-to-Text (STT)

- **Model**: Whisper-tiny (located in `stt-model/whisper-tiny/`)
- **Functionality**: Converts user voice input to text for chat interaction
- **Input Methods**:
  - Real-time audio recording (using the microphone)
  - Audio file upload (supports WAV, MP3, M4A, FLAC)

### 🔊 Text-to-Speech (TTS)

- **Model**: Kokoro-82M (located in `tts-model/Kokoro-82M/`)
- **Functionality**: Converts AI responses to speech audio
- **Features**:
  - Adjustable speech speed (0.5x to 2.0x)
  - Auto-play option for responses
  - Manual play button for each response

## Installation Requirements

### Required Packages

Run one of the following to install the audio processing packages:

**Option 1: Using batch file (Windows)**

```bash
install_audio_packages.bat
```

**Option 2: Using PowerShell (Windows)**

```powershell
.\install_audio_packages.ps1
```

**Option 3: Manual installation**

```bash
# Quote each requirement so the shell does not treat ">" as output redirection
pip install "librosa>=0.10.0"
pip install "soundfile>=0.12.0"
pip install "audio-recorder-streamlit>=0.0.8"
pip install "scipy>=1.9.0"
```

### Updated requirements.txt

The requirements.txt file has been updated to include:

- `librosa>=0.10.0` - Audio processing library
- `soundfile>=0.12.0` - Audio file I/O
- `audio-recorder-streamlit>=0.0.8` - Streamlit audio recording component
- `scipy>=1.9.0` - Scientific computing (audio processing support)

## Usage Guide

### Using Speech-to-Text

1. **Real-time Recording**:
   - Click the microphone icon in the "Voice Input" section
   - Speak your question clearly
   - Click "Stop" when finished
   - Click "🔄 Transcribe Audio" to convert speech to text
   - The transcribed text is sent to the chat automatically
2. **File Upload**:
   - If the microphone recorder is not available, use the file uploader
   - Upload an audio file (WAV, MP3, M4A, FLAC)
   - Click "🔄 Transcribe Uploaded Audio"
   - The transcribed text is then processed as a chat message

### Using Text-to-Speech

1. **Enable/Disable TTS**:
   - Use the "Enable Text-to-Speech" checkbox in the sidebar
   - Adjust the "Audio Speed" slider (0.5x to 2.0x normal speed)
2. **Playing Responses**:
   - Each AI response has a "🔊 Play" button
   - Click it to generate and play the audio version of the response
   - The audio plays automatically once it has been generated

## Technical Implementation

### Speech-to-Text Pipeline

1. Audio input is captured or uploaded
2. Audio is processed using librosa (resampled to 16 kHz)
3. The Whisper model processes the audio features
4. The generated transcription is added to the chat

### Text-to-Speech Pipeline

1. The AI response text is processed
2. The Kokoro-82M model generates speech audio
3. Audio is served through an HTML5 audio player
4. Speed adjustment and auto-play are supported

## Sidebar Features

### Model Status Indicators

- ✅ Whisper AI Model Loaded
- ✅ FAISS Index Loaded
- ✅ Speech-to-Text Loaded

### Audio Settings

- **Enable Text-to-Speech**: Toggle TTS functionality
- **Audio Speed**: Adjust playback speed (0.5x to 2.0x)

### Voice Input Tips

- Speak clearly and distinctly
- Minimize background noise
- Keep recordings under 30 seconds for best results
- Use a good-quality microphone

## Troubleshooting

### Common Issues

1. **Microphone Not Working**:
   - Check browser permissions for microphone access
   - Use the file upload option as a fallback
   - Ensure audio-recorder-streamlit is properly installed
2. **Audio Quality Issues**:
   - Use a quiet environment
   - Speak clearly and at a normal pace
   - Check microphone quality
3. **TTS Not Working**:
   - Verify that the Kokoro-82M model is in the correct directory
   - Check audio player compatibility in the browser
   - Ensure scipy and the audio libraries are installed
4. **Import Errors**:
   - Run the installation scripts
   - Manually install missing packages
   - Check that the virtual environment is activated

### Model Paths

Ensure the following model directories exist:

- Speech-to-Text: `stt-model/whisper-tiny/`
- Text-to-Speech: `tts-model/Kokoro-82M/`
- Main AI Model: `model/Whisper-psychology-gemma-3-1b/`

## Browser Compatibility

### Recommended Browsers

- Chrome (best support for audio features)
- Firefox
- Edge
- Safari (may have limited microphone support)

### Required Permissions

- Microphone access for voice recording
- Audio playback for TTS responses

## Future Enhancements

### Planned Features

- Voice activity detection for hands-free operation
- Multiple voice options for TTS
- Real-time streaming transcription
- Noise cancellation for better STT accuracy
- Custom wake words for voice activation

### Performance Optimizations

- Model quantization for faster inference
- Audio preprocessing optimization
- Caching for frequently used TTS phrases
- Background audio processing

## Support

For issues or questions:

1. Check the troubleshooting section above
2. Verify all dependencies are installed
3. Test with simple audio files first
4. Check the browser console for error messages

## Version Information

- **Version**: 2.0 (Audio Features)
- **Added**: Speech-to-Text and Text-to-Speech capabilities
- **Base Version**: 1.0 (Text-only chat interface)
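
The directory check described under "Model Paths" in the Troubleshooting section can be automated. The following is a minimal sketch, not part of the application: the function name `find_missing_models` and the `project_root` parameter are hypothetical, and the sketch assumes the model directories are relative to the folder the app is started from.

```python
from pathlib import Path

# Directory names copied from the "Model Paths" section of this document.
MODEL_DIRS = {
    "Speech-to-Text": "stt-model/whisper-tiny",
    "Text-to-Speech": "tts-model/Kokoro-82M",
    "Main AI Model": "model/Whisper-psychology-gemma-3-1b",
}


def find_missing_models(project_root="."):
    """Return {label: path} for every expected model directory that is absent.

    Assumption: the directories live directly under `project_root`.
    """
    root = Path(project_root)
    missing = {}
    for label, rel_path in MODEL_DIRS.items():
        candidate = root / rel_path
        if not candidate.is_dir():
            missing[label] = str(candidate)
    return missing


if __name__ == "__main__":
    missing = find_missing_models()
    if not missing:
        print("All model directories found.")
    for label, path in missing.items():
        print(f"Missing {label} model directory: {path}")
```

Running this from the project folder before starting the app lists any directory that still needs to be downloaded or moved into place.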
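
The "served through HTML5 audio player" step of the Text-to-Speech pipeline can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the application's actual code: `clamp_speed`, `sine_wav_bytes` and `audio_player_html` are hypothetical helpers, and a generated sine tone stands in for real Kokoro-82M output. How the speed value is passed to the TTS model is not shown here.

```python
import base64
import io
import math
import struct
import wave


def clamp_speed(speed, low=0.5, high=2.0):
    """Keep a requested speed inside the 0.5x to 2.0x range of the sidebar slider."""
    return max(low, min(high, float(speed)))


def sine_wav_bytes(seconds=0.1, rate=16000, freq=440.0):
    """Build a short mono 16-bit WAV in memory (placeholder for TTS output)."""
    frames = bytearray()
    for n in range(int(seconds * rate)):
        sample = int(32767 * 0.3 * math.sin(2 * math.pi * freq * n / rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


def audio_player_html(wav_bytes, autoplay=False):
    """Wrap WAV bytes in an HTML5 <audio> element using a base64 data URI."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    autoplay_attr = " autoplay" if autoplay else ""
    return (
        f"<audio controls{autoplay_attr}>"
        f'<source src="data:audio/wav;base64,{encoded}" type="audio/wav">'
        "</audio>"
    )


if __name__ == "__main__":
    html = audio_player_html(sine_wav_bytes(), autoplay=True)
    print(html[:60] + "...")
```

In a Streamlit app, a string like this could be rendered with `st.markdown(html, unsafe_allow_html=True)`. Note that some browsers block `autoplay` until the user has interacted with the page, which is one reason a manual play button is still useful.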