1️⃣: SPEECH RECORDER ELEMENT
==========================
📋 ELEMENT DESCRIPTION
--------------------------------
SPEECH RECORDER is a visual element allowing you to record voice in WAV, OGG, PCM, WEBM or MP3 format on all desktop devices and browsers. On iOS, browser policy restrictions limit recording to the Safari browser. After recording, the element stores the file in the app's storage and returns the file URL.
🔧 STEP-BY-STEP SETUP
--------------------------------
1) Drag and drop the visual element SPEECH RECORDER in your app.
2) Select the SPEECH RECORDER element and, in the APPEARANCE section, configure the following fields:
FIELDS:
- ENABLE AUTO-BINDING PARENT ELEMENT'S THING: If selected, SPEECH RECORDER updates the parent element's thing (evaluating to a FILE) once the recording is ready.
- MAX FILE SIZE: Limits the file size of the recording (Megabytes).
- FILE UPLOAD ENABLED: Must be set to yes.
- CHANNELS: Select the number of channels to record.
- FORMAT: Output format of the recording. Valid values are WAV | OGG | PCM | WEBM | MP3.
- BACKGROUND WHEN OFF: Recorder background color when recording is off.
- BACKGROUND WHEN ON: Recorder background color when recording is on.
- RECORDER WHEN OFF: Recorder color when recording is off.
- RECORDER WHEN ON: Recorder color when recording is on.
3) Integrate using these SPEECH RECORDER features:
EVENTS:
- RECORD CAPTURED: Triggered when the recording has been captured.
- RECORD ENCOUNTERED ERROR: Triggered when the recording encounters an error. The "ERROR MESSAGE" is then exposed as an element STATE.
EXPOSED STATES:
Use any element capable of showing/processing the data of interest (such as a Group with a Text field) stored in the following states of the SPEECH RECORDER element:
- DURATION: Duration of the recording.
- RECORDING: Returns yes while recording.
- FILE SIZE: Size of the recording in bytes.
- SAVING: Returns yes while recording is being saved to the app's storage.
- PAUSED: Returns yes while paused.
- RECORDING FILE: URL of the recording file, saved to the app's storage.
- ERROR MESSAGE: Contains the error message upon the "RECORD ENCOUNTERED ERROR" event.
ELEMENT ACTIONS - TRIGGERED IN WORKFLOW:
- START - STOP
- PAUSE - RESUME
- CANCEL RECORDING
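To illustrate how the workflow actions relate to the exposed RECORDING / PAUSED / SAVING states, here is a minimal sketch. This is our own illustrative model, not the plugin's actual implementation; the "idle" and "saving" state names are assumptions.

```python
# Illustrative sketch of how SPEECH RECORDER workflow actions move the
# element between states. State/transition names are our assumptions.
TRANSITIONS = {
    ("idle", "START"): "recording",
    ("recording", "PAUSE"): "paused",
    ("paused", "RESUME"): "recording",
    ("recording", "STOP"): "saving",    # saving then fires RECORD CAPTURED
    ("paused", "STOP"): "saving",
    ("recording", "CANCEL RECORDING"): "idle",
    ("paused", "CANCEL RECORDING"): "idle",
}

def apply_action(state: str, action: str) -> str:
    """Return the next recorder state; unmatched pairs are no-ops."""
    return TRANSITIONS.get((state, action), state)
```

For example, triggering PAUSE while not recording leaves the element unchanged, which is why pairing actions with the RECORDING and PAUSED states in 'Only when' conditions is good practice.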
2️⃣: TRANSCRIBE SPEECH (SYNC) (FRONT-END DESKTOP & NATIVE MOBILE)
==================================
📋 ELEMENT DESCRIPTION
--------------------------------
GOOGLE CLOUD - SPEECH TO TEXT (FRONT-END DESKTOP & NATIVE MOBILE) provides the TRANSCRIBE SPEECH (SYNC) (FRONT-END DESKTOP & NATIVE MOBILE) action to convert speech from an audio file (1-min max) to text. The front-end element is suitable for applications where reactivity is desired, such as, but not limited to, mobile applications.
🔧 STEP-BY-STEP SETUP
--------------------------------
ℹ️ Steps 0) and 1) can be performed automatically by logging into your Google Cloud Console, opening the Cloud Shell (top right corner of the page), pasting the following command and pressing Enter:
wget -q https://storage.googleapis.com/bubblegcpdemo/demo-assets/wiseable-gcp-speech-sync-only.py && python3 wiseable-gcp-speech-sync-only.py
0) Set up a project from the Google Cloud Console:
https://cloud.google.com/speech-to-text/docs/libraries#setting_up_authentication
- Create or select a project
- Enable the SPEECH-TO-TEXT API for that project
- Create a service account
- Download a private key as JSON.
1) Open the private key JSON file with a text editor, copy/paste the following parameters from your file to the Plugin settings:
- CLIENT_EMAIL
- PROJECT_ID
- PRIVATE_KEY, including the -----BEGIN PRIVATE KEY-----\\n prefix and \\n-----END PRIVATE KEY-----\\n suffix.
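A common pitfall at this step is pasting the PRIVATE_KEY with its \n escapes mangled. As an illustrative sketch (the helper name is ours, not part of the plugin), this is the kind of normalization and sanity check you can apply to the value copied from the JSON file:

```python
def normalize_private_key(raw: str) -> str:
    """Turn the literal '\\n' escape sequences stored in the
    service-account JSON file into real newlines and verify
    that the PEM delimiters are intact."""
    key = raw.replace("\\n", "\n").strip()
    if not key.startswith("-----BEGIN PRIVATE KEY-----"):
        raise ValueError("missing BEGIN PRIVATE KEY delimiter")
    if not key.endswith("-----END PRIVATE KEY-----"):
        raise ValueError("missing END PRIVATE KEY delimiter")
    return key
```

If this check fails on your value, re-copy the key directly from the JSON file rather than from an intermediate editor that may re-escape it.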
2) Register on plugins.wiseable.io. Create a new Credential which associates your BUBBLE APP URL, GCP PROJECT_ID, CLIENT_EMAIL & PRIVATE_KEY.
The registration service will generate your PUBLIC ACCESS KEY. This key serves as a secure proxy for your real API key. It allows your application to communicate with the service without exposing your real API key. Since this PUBLIC ACCESS KEY is explicitly tied to your registered BUBBLE APP URL, it can only be used from that domain, ensuring that even if the key is publicly visible, it remains safe and cannot be misused by unauthorized sources.
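To illustrate why the PUBLIC ACCESS KEY is safe to expose, here is a minimal sketch of a domain-tied key check. The function name and logic are illustrative assumptions, not the registration service's actual code:

```python
from urllib.parse import urlparse

def origin_allowed(registered_app_url: str, request_origin: str) -> bool:
    """A request presenting the PUBLIC ACCESS KEY is honored only when
    the browser's Origin matches the BUBBLE APP URL registered with
    the credential, so a leaked key is useless from other domains."""
    return (urlparse(registered_app_url).hostname or "") == \
           (urlparse(request_origin).hostname or "")
```

Because browsers set the Origin header automatically and it cannot be forged from another site's page, this kind of check is what lets the key sit in client-side code without exposing your real API credentials.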
3) Enter in the PLUGIN SETTINGS your PUBLIC ACCESS KEY (used for front-end element only).
4) Add the GOOGLE CLOUD - SPEECH TO TEXT (FRONT-END DESKTOP & NATIVE MOBILE) element to the page on which the speech recognition feature must be integrated. Set its RESULT DATA TYPE field (the returned type) to "RESULT (TRANSCRIPTION)".
5) Integrate the logic into your application using the following GOOGLE CLOUD - SPEECH TO TEXT (FRONT-END DESKTOP & NATIVE MOBILE) element's states and actions:
FIELDS:
- RESULT DATA TYPE: Returned type, must always be set to "RESULT (TRANSCRIPTION)".
EVENTS:
- SUCCESS: Event triggered upon success.
- ERROR: Event triggered upon error.
EXPOSED STATES:
Use any element capable of showing/processing the data of interest (such as a Group with a Text field) stored in the following states of the GOOGLE CLOUD - SPEECH TO TEXT (FRONT-END DESKTOP & NATIVE MOBILE) element:
- RESULTS: Populated upon SUCCESS event. Returns the operation progress rate, done status and the list of transcriptions. For each, it returns the transcription, list of words, with related timestamps, confidence rate, and the audio channels (if applicable).
- ERROR MESSAGE: Populated upon ERROR event.
- IS PROCESSING: Set to true when processing is in progress, false otherwise.
ELEMENT ACTIONS - TRIGGERED IN WORKFLOW:
- TRANSCRIBE SPEECH (SYNC) (FRONT-END DESKTOP & NATIVE MOBILE): Converts speech from an audio file to text. Populates the RESULTS state upon completion.
Inputs Fields:
- AUDIO: audio file (1-min max) URL from the Bubble.io file uploader, a protocol-relative URL (//server/path/file.ext), or an HTTPS URL (https://server/path/file.ext). The file must be accessible through the HTTPS protocol.
- LANGUAGE CODE: Language identification tag (BCP-47 code).
- AUDIO ENCODING: Valid values are ENCODING_UNSPECIFIED, LINEAR16, FLAC, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE & MP3 (for Google Beta Access users). For best results, the audio source should use a lossless encoding (FLAC or LINEAR16).
- SAMPLE RATE (HZ): Sample rate in Hertz of the audio data. Valid values are: 8000-48000. 16000 is optimal. This field is ignored for FLAC and WAV audio files, but is required for all other audio formats.
- # OF CHANNELS: ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. Valid values for OGG_OPUS are '1'-'254'. Valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is only 1. If 0 or omitted, defaults to one channel (mono).
- SPEAKER DIARIZATION: If checked, will enable speaker diarization.
- MIN SPEAKER COUNT: Minimum number of speakers in the conversation. If not set, the default value is 2.
- MAX SPEAKER COUNT: Maximum number of speakers in the conversation. If not set, the default value is 6.
- PROFANITY FILTER: If checked, will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***".
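For reference, the input fields above map directly onto Google's Speech-to-Text v1 RecognitionConfig. Here is a sketch of that mapping; the helper function itself is ours, while the JSON field names are Google's:

```python
def build_recognition_config(language_code, encoding, sample_rate_hz=0,
                             channels=0, diarization=False,
                             min_speakers=2, max_speakers=6,
                             profanity_filter=False):
    """Map the plugin's input fields onto a Speech-to-Text v1
    RecognitionConfig JSON body."""
    config = {"languageCode": language_code, "encoding": encoding}
    if sample_rate_hz:                 # ignored for FLAC/WAV file headers
        config["sampleRateHertz"] = sample_rate_hz
    if channels:                       # set for MULTI-CHANNEL recognition only
        config["audioChannelCount"] = channels
        config["enableSeparateRecognitionPerChannel"] = True
    if diarization:
        config["diarizationConfig"] = {
            "enableSpeakerDiarization": True,
            "minSpeakerCount": min_speakers,
            "maxSpeakerCount": max_speakers,
        }
    if profanity_filter:
        config["profanityFilter"] = True
    return config
```

The sync endpoint (speech:recognize) takes this config together with an audio source, and the async endpoint (speech:longrunningrecognize) uses the same config shape; the exact request the plugin sends is internal to it.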
3️⃣: TRANSCRIBE SPEECH (SYNC) (BACK-END)
==================================
📋 ACTION DESCRIPTION
--------------------------------
TRANSCRIBE SPEECH from an audio file (1-min max) to return a list of transcriptions. For each, it returns the transcription, list of words, with related timestamps, confidence rate, and the audio channels (if applicable).
Operates in synchronous request mode, useful for small files and time-sensitive applications.
🔧 STEP-BY-STEP SETUP
--------------------------------
ℹ️ Steps 0) and 1) can be performed automatically by logging into your Google Cloud Console, opening the Cloud Shell (top right corner of the page), pasting the following command and pressing Enter:
wget -q https://storage.googleapis.com/bubblegcpdemo/demo-assets/wiseable-gcp-speech-sync-only.py && python3 wiseable-gcp-speech-sync-only.py
0) Set up a project from the Google Cloud Console:
https://cloud.google.com/speech-to-text/docs/libraries#setting_up_authentication
- Create or select a project
- Enable the SPEECH-TO-TEXT API for that project
- Create a service account
- Download a private key as JSON.
1) Open the private key JSON file with a text editor, copy/paste the following parameters from your file to the Plugin settings:
- CLIENT_EMAIL
- PROJECT_ID
- PRIVATE_KEY, including the -----BEGIN PRIVATE KEY-----\\n prefix and \\n-----END PRIVATE KEY-----\\n suffix.
2) To automatically configure the audio recognition settings for the "TRANSCRIBE SPEECH" action, install the "AUDIO FILE ANALYZER" plugin.
Set up the "GET AUDIO FILE METADATA" action in the workflow, by referring to the plugin documentation.
3) Set up the action "TRANSCRIBE SPEECH (SYNC) (BACK-END)" in the workflow.
Inputs Fields:
- AUDIO: audio file (1-min max) URL from the Bubble.io file uploader, a protocol-relative URL (//server/path/file.ext), or an HTTPS URL (https://server/path/file.ext). The file must be accessible through the HTTPS protocol.
- LANGUAGE CODE: Language identification tag (BCP-47 code).
- AUDIO ENCODING: Valid values are ENCODING_UNSPECIFIED, LINEAR16, FLAC, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE & MP3 (for Google Beta Access users). For best results, the audio source should use a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, MP3 and WEBM_OPUS.
- SAMPLE RATE (HZ): Sample rate in Hertz of the audio data. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling). This field is ignored for FLAC and WAV audio files, but is required for all other audio formats.
- # OF CHANNELS: ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. Valid values for OGG_OPUS are '1'-'254'. Valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is only 1. If 0 or omitted, defaults to one channel (mono). Note: We only recognize the first channel by default.
- SPEAKER DIARIZATION: If checked, will enable speaker diarization.
- MIN SPEAKER COUNT: Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2.
- MAX SPEAKER COUNT: Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6.
- PROFANITY FILTER: If checked, will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***". If set to false or omitted, profanities won't be filtered out.
- RESULT DATA TYPE: Returned type, must always be set to "RESULT (TRANSCRIPTION)".
Output Fields:
- RESULTS: Returns the operation progress rate, done status and the list of transcriptions. For each, it returns the transcription, list of words, with related timestamps, confidence rate, and the audio channels (if applicable).
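The RESULTS output mirrors the structure of Google's recognize response body. As a minimal sketch of consuming such a result (the helper is illustrative; the payload shape follows Google's v1 response, where each result carries a confidence-ranked list of alternatives):

```python
def extract_transcripts(response: dict):
    """Return (transcript, confidence) for the top alternative of each
    result in a Speech-to-Text v1 recognize response body."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives") or []
        if alternatives:
            top = alternatives[0]
            pairs.append((top.get("transcript", ""),
                          top.get("confidence", 0.0)))
    return pairs
```

In Bubble workflows the same navigation is done with the RESULTS state's nested fields rather than code; the sketch only shows which parts of the structure carry the text and confidence.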
4️⃣: START & GET TRANSCRIBE SPEECH (ASYNC)
================================
📋 ACTION DESCRIPTION
--------------------------------
TRANSCRIBE SPEECH from an audio file to return a list of transcriptions. For each, it returns the transcription, list of words, with related timestamps, confidence rate, and the audio channels (if applicable).
Operates in asynchronous request mode, useful for large files and time-insensitive applications.
🔧 STEP-BY-STEP SETUP
--------------------------------
ℹ️ If you do not have a Google Cloud Storage bucket yet, please refer first to the instructions of the "GOOGLE STORAGE DROPZONE & UTILITIES" plugin (https://bubble.io/plugin/google-storage-dropzone--utilities-1616855011494x235332313714262000) to set up your bucket. Then follow the instructions below.
Steps 0) and 1) can be performed automatically by logging into your Google Cloud Console, opening the Cloud Shell (top right corner of the page), pasting the following command and pressing Enter:
wget -q https://storage.googleapis.com/bubblegcpdemo/demo-assets/wiseable-gcp-speech.py && python3 wiseable-gcp-speech.py
0) Set up a project from the Google Cloud Console:
https://cloud.google.com/speech-to-text/docs/libraries#setting_up_authentication
- Create or select a project
- Enable the SPEECH-TO-TEXT API for that project
- Create a service account
- Download a private key as JSON.
1) Open the private key JSON file with a text editor, copy/paste the following parameters from your file to the Plugin settings:
- CLIENT_EMAIL
- PROJECT_ID
- PRIVATE_KEY: Format is -----BEGIN PRIVATE KEY-----(.*)-----END PRIVATE KEY-----
2) To automatically configure the audio recognition settings for the "START TRANSCRIBE SPEECH OPERATION (ASYNC)" action, install the "AUDIO FILE ANALYZER" plugin.
Set up the "GET AUDIO FILE METADATA" action in the workflow, by referring to the plugin documentation.
a) Make sure that "Google Speech-to-Text Compatibility" is checked.
3) Set up in your workflow an action returning the BUCKET and KEY of the file to analyze.
a) If you do not already have such action, install the plugin "GOOGLE CLOUD STORAGE UTILITIES"
b) Create a GOOGLE STORAGE BUCKET that will be used to store the file to analyze:
https://cloud.google.com/storage/docs/creating-buckets
c) Set up the "SAVE FILE TO GOOGLE CLOUD STORAGE" action in the workflow.
Inputs Fields:
- FILE URL TO STORE: File URL from the Bubble.io uploader, a protocol-relative URL (//server/file.ext), or an HTTPS file URL (https://server/file.ext). The file must be accessible through the HTTPS protocol.
- BUCKET NAME: Bucket Name to which the file will be saved.
- FILE NAME: Path & File Name to save. The format must be [path/]filename.ext.
Example 1: path1/path2/filename.ext.
Example 2: filename.ext if the file is at the root of the bucket.
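For orientation, the BUCKET NAME and FILE NAME above combine into a gs:// object URI, which is the form of audio source the async transcription reads. A sketch of that composition (the helper name is ours):

```python
def gcs_uri(bucket_name: str, file_name: str) -> str:
    """Combine BUCKET NAME and [path/]filename.ext into the gs:// URI
    that the async transcription operation will read from."""
    return f"gs://{bucket_name}/{file_name.lstrip('/')}"
```

This is why FILE NAME must not start with a slash: the bucket root is implied by the URI scheme itself.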
4) Set up the action "START TRANSCRIBE SPEECH OPERATION (ASYNC)" in the workflow.
Inputs Fields:
- BUCKET NAME: Bucket Name from which the file will be read.
- FILE NAME: Path & Name of the file to analyze. The format must be [path/]filename.ext.
Example 1: path1/path2/filename.ext.
Example 2: filename.ext if the file is at the root of the bucket.
- LANGUAGE CODE: Language identification tag (BCP-47 code).
Example: en-US
- AUDIO ENCODING: Valid values are ENCODING_UNSPECIFIED, LINEAR16, FLAC, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE & MP3 (for Google Beta Access users). For best results, the audio source should use a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, MP3 and WEBM_OPUS.
- SAMPLE RATE (HZ): Sample rate in Hertz of the audio data. Valid values are: 8000-48000. 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling). This field is ignored for FLAC and WAV audio files, but is required for all other audio formats.
- # OF CHANNELS: ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. Valid values for OGG_OPUS are '1'-'254'. Valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is only 1. If 0 or omitted, defaults to one channel (mono). Note: We only recognize the first channel by default.
- SPEAKER DIARIZATION: If checked, will enable speaker diarization.
- MIN SPEAKER COUNT: Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2.
- MAX SPEAKER COUNT: Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6.
- PROFANITY FILTER: If set to true, will attempt to filter out profanities, replacing all but the initial character in each filtered word with asterisks, e.g. "f***". If set to false or omitted, profanities won't be filtered out.
Output Fields:
- OPERATION NAME: ID of the operation, to be reused in the "GET TRANSCRIBE SPEECH RESULT (ASYNC)".
5) Set up the action "GET TRANSCRIBE SPEECH RESULT (ASYNC)" in a recurring workflow ('Do every x seconds') to poll the operation's completion status on a regular basis.
Configure this recurring workflow to retrieve the results once the operation's DONE status is 'YES', using an 'Only when' event condition.
Inputs Fields:
- OPERATION NAME: ID of the operation to poll, returned by "START TRANSCRIBE SPEECH OPERATION (ASYNC)" action.
- RESULT DATA TYPE: Returned type, must always be set to "RESULT (TRANSCRIPTION)".
Output Fields:
- RESULTS: Returns the operation progress rate, done status and the list of transcriptions. For each, it returns the transcription, list of words, with related timestamps, confidence rate, and the audio channels (if applicable).
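The START/GET pair above follows Google's long-running operations protocol: starting the operation returns a name, and each poll on that name reports progress until done is true, at which point the response carries the results. A sketch of interpreting one polled operation body (the payload shape follows Google's v1 API; the helper is illustrative):

```python
def operation_status(op: dict):
    """Return (done, progress_percent, results) from a polled
    longrunningrecognize operation body."""
    done = bool(op.get("done"))
    progress = op.get("metadata", {}).get("progressPercent", 0)
    results = op.get("response", {}).get("results", []) if done else []
    return done, progress, results
```

In the Bubble workflow, the 'Only when' condition on DONE plays the role of the `if done` branch here, and the recurring workflow replaces the polling loop.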
🔍 IMPLEMENTATION EXAMPLE
======================
Explore the demo editor for complete implementation examples.
ℹ️ ADDITIONAL INFORMATION
======================
Supported audio formats:
https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig#audioencoding
Supported languages:
https://cloud.google.com/speech-to-text/docs/languages
Service limits:
https://cloud.google.com/speech-to-text/quotas
⚠️ TROUBLESHOOTING
================
- Check "Server logs" in Bubble Editor Logs tab
- Enable "Plugin server side output" and "Plugin client side output" in Advanced settings
- For front-end actions: Open browser console (F12) for detailed error logs
- Always implement ERROR event handling and check ERROR MESSAGE state
- Server logs documentation:
https://manual.bubble.io/core-resources/bubbles-interface/logs-tab#server-logs
⚡ PERFORMANCE CONSIDERATIONS
===========================
GENERAL
-------------
Back-end synchronous actions are limited to 30 seconds of audio processing
Front-end actions have no duration limitations
⏱️ BACK-END COLD START
-----------------------------------------------
Server-side actions initialize a virtual machine on first execution, causing an initial delay
Subsequent calls benefit from caching and execute faster
Workaround: trigger a dummy action at page load to pre-warm the execution environment
❓ QUESTIONS?
===========
Contact [email protected] for support or feature requests.