Cloud Blog: Build and refine your audio generation end-to-end with Gemini 1.5 Pro

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-build-a-podcast-with-gemini-1-5-pro/
Source: Cloud Blog
Title: Build and refine your audio generation end-to-end with Gemini 1.5 Pro

Feedly Summary: Generative AI is giving people new ways to experience audio content, from podcasts to audio summaries. For example, users are embracing NotebookLM’s recent Audio Overview feature, which turns documents into audio conversations. With one click, two AI hosts start up a lively “deep dive” discussion based on the sources you provide. They summarize your material, make connections between topics, and discuss back and forth. 
While NotebookLM offers incredible benefits for making sense of complex information, some users want more control over generating unique audio experiences – for example, creating their own podcasts. Podcasts are an increasingly popular medium for creators, business leaders, and users to listen to what interests them. Today, we’ll share how Gemini 1.5 Pro and the Text-to-Speech API on Google Cloud can help you create conversations with diverse voices and generate podcast scripts with custom prompts.

The approach: Expand your reach with diverse audio formats
A great podcast starts with accessible audio content. Gemini’s multimodal capabilities, combined with our high-fidelity Text-to-Speech API, offer 380+ voices across 50+ languages and custom voice creation. This unlocks new ways for users to experience content and expand their reach through diverse audio formats.
This approach also helps content creators reach a wider audience and streamline the content creation process, including:

Expanded reach: Connect with an audience segment that prefers audio content.

Increased engagement: Foster deeper connections with listeners through personalized audio.

Content repurposing: Maximize the value of existing written content by transforming it into a new format, reaching a wider audience without starting from scratch.

Let’s take a look at how. 
The architecture: Gemini 1.5 Pro and Text-to-Speech 
Our audio overview creation architecture uses two powerful services from Google Cloud:

Gemini 1.5 Pro: This advanced generative AI model excels at understanding and generating human-like text. We’ll use Gemini 1.5 Pro to:

Generate engaging scripts: Feed your source content to Gemini 1.5 Pro, and it can generate compelling conversational scripts, complete with introductions, transitions, and calls to action.

Adapt content for audio: Gemini 1.5 Pro can optimize written content for the audio format, ensuring a natural flow and an engaging listening experience. It can also adjust the tone and style to suit formats such as podcasts.

Text-to-Speech API: This API converts text into natural-sounding speech, giving a voice to your scripts. You can choose from various voices and languages to match your brand and target audience.

How to create an engaging podcast yourself, step-by-step 

Content preparation: Prepare your source content. Ensure it’s well-structured and edited for clarity. Consider dividing longer posts into multiple episodes for optimal listening duration.

Gemini 1.5 Pro integration: Use Gemini 1.5 Pro to generate a conversational script from your source content. Experiment with prompts to fine-tune the output, achieving the desired style and tone. Example prompt: “Generate an engaging audio overview script from this podcast, including an introduction, transitions, and a call to action. Target audience is technical developers, engineers, and cloud architects."
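To give a sense of what this step looks like in code, here is a minimal sketch using the Vertex AI SDK. The bucket URI and prompt wording are illustrative placeholders to replace with your own; the fuller functions used in this post follow in the next steps.

```python
# Minimal sketch: ask Gemini 1.5 Pro on Vertex AI for a conversational script.
# The bucket URI and prompt below are illustrative placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="<your-project-id>", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-002")

source_doc = Part.from_uri(
    mime_type="application/pdf",
    uri="gs://your-bucket/your-source-content.pdf",
)

prompt = (
    "Generate an engaging audio overview script from this content, including an "
    "introduction, transitions, and a call to action. Target audience is technical "
    "developers, engineers, and cloud architects."
)

response = model.generate_content([source_doc, prompt])
print(response.text)
```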

Section extraction: For complex or lengthy source documents, you might use Gemini 1.5 Pro to extract key sections and subsections as JSON, enabling a more structured approach to script generation.

A Python function that powers our podcast creation process can be as simple as the following:

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part


def extract_sections_and_subsections(document1: Part, project="<your-project-id>", location="us-central1") -> str:
    """
    Extracts hierarchical sections and subsections from a Google Cloud blog post
    provided as a PDF document.

    This function uses the Gemini 1.5 Pro language model to analyze the structure
    of a blog post and identify its key sections and subsections. The extracted
    information is returned in JSON format for easy parsing and use in
    various applications.

    This is particularly useful for:

    * **Large documents:** Breaking down content into manageable chunks for
      efficient processing and analysis.
    * **Podcast creation:** Generating multi-episode series where each episode
      focuses on a specific section of the blog post.

    Args:
        document1 (Part): A Part object representing the PDF document,
            typically obtained using `Part.from_uri()`. For example:
                document1 = Part.from_uri(
                    mime_type="application/pdf",
                    uri="gs://your-bucket/your-pdf.pdf"
                )
        project: The ID of your Google Cloud project. Defaults to "<your-project-id>".
        location: The region of your Google Cloud project. Defaults to "us-central1".

    Returns:
        str: A JSON string representing the extracted sections and subsections.
            Returns an empty string if there are issues with processing or
            the model output.
    """

    vertexai.init(project=project, location=location)  # Initialize Vertex AI
    model = GenerativeModel("gemini-1.5-pro-002")

    prompt = """Analyze the following blog post and extract its sections and subsections. Represent this information in JSON format using the following structure:
    [
      {
        "section": "Section Title",
        "subsections": [
          "Subsection 1",
          "Subsection 2",
          // ...
        ]
      },
      // ... more sections
    ]"""

    try:
        responses = model.generate_content(
            ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
            generation_config=generation_config,  # Assumes generation_config is defined elsewhere
            safety_settings=safety_settings,      # Assumes safety_settings is defined elsewhere
            stream=True,  # Stream results for better performance with large documents
        )

        response_text = ""
        for response in responses:
            response_text += response.text

        return response_text

    except Exception as e:
        print(f"Error during section extraction: {e}")
        return ""
```
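As a usage sketch, continuing with the same imports and assuming generation_config and safety_settings are defined and your source PDF lives in a bucket you own, you might call the function and parse its output like this:

```python
import json

# Hypothetical source document; replace the bucket and file name with your own.
document1 = Part.from_uri(
    mime_type="application/pdf",
    uri="gs://your-bucket/your-blog-post.pdf",
)

sections_json = extract_sections_and_subsections(document1, project="<your-project-id>")

# The prompt asks for raw JSON; if the model wraps it in Markdown code fences,
# strip them before parsing.
sections = json.loads(sections_json) if sections_json else []

for entry in sections:
    print(entry["section"], "->", entry["subsections"])
```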

Then, use Gemini 1.5 Pro to generate the podcast script for each section. Again, provide clear instructions in your prompts, specifying target audience, desired tone, and approximate episode length.
For each section and subsection, you can use a function like the one below to generate a script:

```python
def generate_podcast_content(section, subsection, document1: Part, targetaudience, guestname, hostname, project="<your-project-id>", location="us-central1") -> str:
    """Generates a podcast dialogue in JSON format from a blog post subsection.

    This function uses the Gemini model in Vertex AI to create a conversation
    between a host and a guest, covering the specified subsection content. It uses
    a provided PDF as source material and outputs the dialogue in JSON.

    Args:
        section: The blog post's main section (e.g., "Introduction").
        subsection: The specific subsection (e.g., "Benefits of Gemini 1.5").
        document1: A `Part` object representing the source PDF (created using
            `Part.from_uri(mime_type="application/pdf", uri="gs://your-bucket/your-pdf.pdf")`).
        targetaudience: The intended audience for the podcast.
        guestname: The name of the podcast guest.
        hostname: The name of the podcast host.
        project: Your Google Cloud project ID.
        location: Your Google Cloud project location.

    Returns:
        A JSON string representing the generated podcast dialogue.
    """
    print(f"Processing section: {section} and subsection: {subsection}")

    # The JSON structure requested below mirrors the multiSpeakerMarkup format
    # that the Text-to-Speech API expects in the next step.
    prompt = f"""Create a podcast dialogue in JSON format based on a provided subsection of a Google Cloud blog post (found in the attached PDF).
The dialogue should be a lively back-and-forth between a host (R) and a guest (S), presented as a series of turns.
The host should guide the conversation by asking questions, while the guest provides informative and accessible answers.
The script must fully cover all points within the given subsection.
Use clear explanations and relatable analogies.
Maintain a consistently positive and enthusiastic tone (e.g., "Movies, I love them. They're like time machines...").
Include only one introductory host greeting (e.g., "Welcome to our next episode..."). No music, sound effects, or production directions.

JSON structure:
{{
  "multiSpeakerMarkup": {{
    "turns": [
      {{"text": "Dialogue line", "speaker": "R"}},  // R for host, S for guest
      // ... more turns
    ]
  }}
}}

Input Data:
Section: "{section}"
Subsections to cover in the podcast: "{subsection}"
Target Audience: "{targetaudience}"
Guest name: "{guestname}"
Host name: "{hostname}"
"""

    vertexai.init(project=project, location=location)
    model = GenerativeModel("gemini-1.5-pro-002")

    responses = model.generate_content(
        ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
        generation_config=generation_config,  # Assumes generation_config is defined already
        safety_settings=safety_settings,      # Assumes safety_settings is defined already
        stream=True,
    )

    response_text = ""
    for response in responses:
        response_text += response.text

    return response_text
```
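Continuing the sketch above, a hypothetical driver loop over the previously extracted sections might look like the following; the audience, host, and guest names are placeholders:

```python
import json

# Iterate over the sections extracted earlier and generate one dialogue per
# subsection. The names and audience below are illustrative placeholders.
episode_dialogues = []
for entry in sections:
    for subsection in entry["subsections"]:
        dialogue_json = generate_podcast_content(
            section=entry["section"],
            subsection=subsection,
            document1=document1,
            targetaudience="technical developers, engineers, and cloud architects",
            guestname="Alex",
            hostname="Sam",
        )
        # Parse the model output; strip Markdown code fences first if present.
        episode_dialogues.append(json.loads(dialogue_json))
```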

Next, feed the script generated by Gemini to the Text-to-Speech API. Choose a voice and language appropriate for your target audience and content.
A function like the one below can generate human-quality audio from text. For this, we can use the advanced Text-to-Speech API in Google Cloud.

```python
from googleapiclient.discovery import build


def generate_audio_from_text(input_json):
    """Generates audio using the Google Text-to-Speech API.

    Args:
        input_json: A dictionary containing the 'multiSpeakerMarkup' for the TTS API.
            This is the structure generated by Gemini 1.5 Pro in the
            generate_podcast_content() function above.

    Returns:
        The base64-encoded audio data (MP3 format) if successful, None otherwise.
    """

    try:
        # Build the Text-to-Speech service
        service = build('texttospeech', 'v1beta1')

        # Prepare synthesis input
        synthesis_input = {
            'multiSpeakerMarkup': input_json['multiSpeakerMarkup']
        }

        # Configure voice and audio settings
        voice = {
            'languageCode': 'en-US',
            'name': 'en-US-Studio-MultiSpeaker'
        }

        audio_config = {
            'audioEncoding': 'MP3',
            'pitch': 0,
            'speakingRate': 0,
            'effectsProfileId': ['small-bluetooth-speaker-class-device']
        }

        # Make the API request
        response = service.text().synthesize(
            body={
                'input': synthesis_input,
                'voice': voice,
                'audioConfig': audio_config
            }
        ).execute()

        # Extract and return the base64-encoded audio content
        audio_content = response['audioContent']
        return audio_content

    except Exception as e:
        print(f"Error: {e}")
        return None
```
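Tying the pieces together, a short sketch of synthesizing each generated dialogue could look like this; it assumes each parsed dialogue contains the multiSpeakerMarkup key requested in the Gemini prompt:

```python
# Synthesize audio for each dialogue produced in the previous step. Each call
# returns a base64-encoded MP3 string (or None on failure).
episode_audio = []
for dialogue in episode_dialogues:
    audio_b64 = generate_audio_from_text(dialogue)
    if audio_b64 is not None:
        episode_audio.append(audio_b64)
```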

Finally, to store audio content already encoded as base64 MP3 data in Google Cloud Storage, you can use the google-cloud-storage Python library. This allows you to decode the base64 string and upload the resulting bytes directly to a designated bucket, specifying the content type as ‘audio/mp3’.
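A minimal sketch of that upload step, assuming a bucket you own and the base64 strings collected above, could look like this:

```python
import base64
from google.cloud import storage


def upload_audio_to_gcs(audio_b64: str, bucket_name: str, blob_name: str) -> str:
    """Decodes base64 MP3 audio and uploads it to a Cloud Storage bucket."""
    audio_bytes = base64.b64decode(audio_b64)

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_string(audio_bytes, content_type="audio/mp3")

    return f"gs://{bucket_name}/{blob_name}"


# Hypothetical usage: upload each synthesized episode segment.
for i, audio_b64 in enumerate(episode_audio):
    uri = upload_audio_to_gcs(audio_b64, "your-bucket", f"podcast/episode_{i:02d}.mp3")
    print(f"Uploaded {uri}")
```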
Hear it for yourself
While the Text-to-Speech API produces high-quality audio, you can further enhance your audio conversation with background music, sound effects, and professional editing tools. Hear it for yourself – download the audio conversation I created from this blog using Gemini 1.5 Pro and the Text-to-Speech API.
To start creating for yourself, explore our full suite of audio generation features in Google Cloud services, such as the Text-to-Speech API and Gemini models, using the free tier. We recommend experimenting with different modalities like text and image prompts to experience Gemini’s potential for content creation.

AI Summary and Description: Yes

Summary: The text discusses the integration and benefits of generative AI technologies for creating audio content, specifically focusing on Google Cloud’s Gemini 1.5 Pro and Text-to-Speech API. It highlights new audio experiences for users, including podcast creation and personalized audio engagement, which are significant for content creators and business leaders.

Detailed Description:
The content provides a detailed overview of how generative AI, particularly Google’s technologies, facilitates the creation of engaging audio content in various formats, emphasizing podcasts and audio summaries. The insights are particularly relevant for professionals in AI, cloud computing, and content creation. Here are the key points:

– **Innovative Use of Generative AI**: Generative AI models like Gemini 1.5 Pro enable users to transform complex documents into engaging audio, enhancing comprehension and accessibility.

– **Podcast Creation**: The text illustrates how users can create their own podcasts using customizable scripts generated by the AI, catering to an audience that prefers audio formats. This includes:
  – **Script Generation**: Gemini 1.5 Pro can generate compelling conversational scripts tailored to specific styles and tones.
  – **Adaptation of Content**: The AI optimizes written content for audio formats, ensuring a seamless listening experience.

– **Text-to-Speech API**: Incorporating Google Cloud’s Text-to-Speech API allows for natural-sounding audio production from the generated scripts, further enhancing audience engagement.

– **Content Creation Workflow**: The text outlines a step-by-step approach to creating audio content, from preparation and AI integration to dialogue generation and audio output. This includes:
  – **Example Python Functions**: Example Python functions demonstrate how to extract document sections, generate podcast dialogue in a structured format, and synthesize audio with the Text-to-Speech API.
  – **Audio Enhancement**: Users can further enhance audio with background music and sound effects for professional quality.

– **Broader Implications for Content Creators**: The ability to repurpose existing written content into audio formats maximizes its value, allowing creators to connect with wider audiences:
  – **Engagement through Personalized Audio**: Personalized audio experiences foster deeper connections with listeners.
  – **Reach**: Audio content expands the audience base, attracting those who prefer auditory learning.

– **Integration with Google Cloud**: The seamless integration of these technologies gives security and compliance professionals a practical illustration of how AI can be leveraged in cloud environments for innovative solutions.

This content highlights the evolving landscape of audio content creation through AI, underscoring the importance of tools that enhance user experiences and streamline the production process, which is highly relevant in today’s digital-first environment.