Advanced Prompting with Gemini
Grounding, multimodal input, context caching, Gems (custom personas), and getting reliable outputs.
What You'll Learn
- Use grounding to get factual, up-to-date responses with source citations
- Master multimodal prompting with combined text, image, and video inputs
- Understand context caching for cost-effective repeated analysis of large documents
- Create and customize Gems for recurring specialized tasks
- Apply advanced prompting patterns optimized for Gemini's architecture
Grounding: Factual Responses with Sources
Grounding is Gemini's ability to verify its responses against real-time web data and Google Search results. When grounding is enabled, Gemini does not just generate text from its training data; it searches for current information and includes source citations in its response.
This addresses the biggest concern with AI tools: accuracy. A grounded response includes links to the sources it drew from, letting you verify claims and dig deeper into topics.
How grounding works:
In the Gemini web interface, grounding happens automatically for queries that benefit from current information. You will see a "Google it" chip or source links at the bottom of responses that used web search.
In the API, you can explicitly enable grounding with Google Search as a tool. The model decides when to search based on the query, and the response includes grounding metadata with source URLs and confidence scores.
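To make this concrete, here is a minimal sketch of what a grounded request body looks like when you enable the Google Search tool. The field names follow the v1beta REST API shape at the time of writing (model names and tool field names do change between versions, so treat this as illustrative and check the current API reference):

```python
import json

# Sketch of a Gemini API request body with grounding enabled via the
# Google Search tool. Declaring the tool lets the model decide for itself
# when a query warrants a search.
request_body = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"text": "What is the current pricing for Salesforce "
                         "Enterprise edition? Include the source."}
            ],
        }
    ],
    # The empty object enables the tool with default settings.
    "tools": [{"google_search": {}}],
}

print(json.dumps(request_body, indent=2))
```

A grounded response then carries grounding metadata alongside the text: source URLs and the supporting search snippets, which is where the citations in the web interface come from.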
When grounding matters most:
- Current events and recent developments
- Pricing and product information that changes frequently
- Statistics, market data, and research findings
- Technical documentation for specific software versions
- Regulatory and compliance information
When grounding is less necessary:
- Creative writing and brainstorming
- General knowledge questions about established topics
- Code generation (unless checking current API documentation)
- Personal analysis and opinion-forming
Prompting for grounded responses:
Be explicit when you want current information: "What is the current pricing for Salesforce Enterprise edition? Include the source." Adding "include sources" or "cite your references" signals Gemini to emphasize grounding.
Grounding makes Gemini particularly trustworthy for research tasks where accuracy matters more than creativity.
Verify Critical Information
Even with grounding, always verify critical business decisions against primary sources. Grounding dramatically reduces hallucination but does not eliminate it entirely. Use the source links Gemini provides to confirm key facts, especially for pricing, legal, and compliance information.
Multimodal Input: Text, Images, and Video
Gemini is natively multimodal, built from the ground up to process text, images, audio, and video with the same model. This is not a text model with vision bolted on; the multimodal understanding is core to how Gemini works.
Image analysis capabilities:
- Read and extract text from photos, screenshots, and documents
- Understand charts, graphs, and data visualizations
- Identify objects, scenes, and visual patterns
- Compare multiple images side by side
- Analyze UI screenshots for design feedback
- Process handwritten notes and whiteboard photos
Video understanding:
Gemini can analyze video content, particularly through its YouTube integration. Upload a video or reference a YouTube URL, and Gemini can:
- Summarize the video content
- Answer specific questions about what happens in the video
- Extract key points and timestamps
- Identify visual elements and on-screen text
Audio processing:
Gemini can transcribe and analyze audio content. Upload audio files or reference recordings for:
- Transcription with speaker identification
- Summary of conversations or meetings
- Translation of spoken content
- Analysis of audio content (podcast summaries, lecture notes)
Combining modalities effectively:
The power of native multimodality emerges when you combine inputs: "Here is a screenshot of our dashboard [image] and the CSV export of the underlying data [file]. The numbers do not match. Can you identify the discrepancy?" Gemini processes both inputs with full understanding of each.
Another powerful pattern: "Here is a video of our product demo [video]. Here is the script we wrote [text]. How closely does the actual demo follow the script? What was improvised?"
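In the API, a combined-modality prompt is just a single request whose `parts` mix text and media. The sketch below builds such a payload for the dashboard-vs-CSV example; the image bytes, filename, and CSV values are placeholders, and the `inline_data` field shape follows the REST API as documented at the time of writing:

```python
import base64
import json

# Stand-in for real screenshot bytes; in practice you would read a file.
fake_png_bytes = b"\x89PNG\r\n\x1a\n placeholder"

# One request, three parts: instruction text, an inline image, and the
# CSV data as a second text part. The model receives all of them together.
request_body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "Here is a screenshot of our dashboard and the CSV "
                     "export of the underlying data. The numbers do not "
                     "match. Can you identify the discrepancy?"},
            {"inline_data": {
                "mime_type": "image/png",
                "data": base64.b64encode(fake_png_bytes).decode("ascii"),
            }},
            {"text": "CSV export:\nregion,revenue\nEMEA,1200\nAPAC,950"},
        ],
    }],
}

print(json.dumps(request_body)[:80])
```

The ordering of parts matters less than their co-presence: the model sees the image and both text parts as one prompt, which is what makes cross-modal comparison questions like this work.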
Quick Test: Multimodal Input
Take a photo of a chart, graph, or whiteboard with your phone.
1. Upload it to Gemini and ask: "What does this show? Summarize the key data points."
2. Then upload a screenshot of a webpage and ask Gemini to extract all the text and organize it into bullet points.
Notice how naturally Gemini handles visual input alongside your text instructions.
Context Caching: Efficient Large-Document Processing
Context caching is a Gemini API feature that solves a specific cost problem: when you need to ask many questions about the same large document or dataset.
Without caching, every API call sends the full context (your document + the question) and you pay for all those input tokens every time. If you have a 500-page document and ask 20 questions about it, you are paying for 500 pages of input tokens 20 times.
With context caching, you upload the document once, Gemini caches it, and subsequent questions reference the cache. You pay full price for the first call and a significantly reduced rate for follow-up calls that use the cached context.
When context caching makes sense:
- Analyzing a large codebase with multiple questions
- Conducting extended research on a large document set
- Building applications where multiple users query the same knowledge base
- Processing a long document with dozens of specific extraction requests
How to use it (API):
- Create a cached content object with your documents
- Reference the cache ID in subsequent API calls
- The cache has a TTL (time to live) that you set based on how long you need it
- Pay reduced rates for the cached portion of input tokens
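The two-step flow can be sketched as a pair of request bodies. The field names (`ttl`, `cachedContent`) and the cache resource name format below follow the v1beta REST API as documented at the time of writing; the model name and document text are placeholders:

```python
# Step 1: create a cached content object holding the large document,
# bound to a specific model, with a TTL of one hour.
create_cache_body = {
    "model": "models/gemini-1.5-pro",
    "contents": [{
        "role": "user",
        "parts": [{"text": "<full text of the 500-page document here>"}],
    }],
    "ttl": "3600s",
}

# Step 2: the create call returns a resource name; reference it in every
# follow-up question instead of resending the document.
cache_name = "cachedContents/example-id"  # hypothetical value from step 1
followup_body = {
    "contents": [{
        "role": "user",
        "parts": [{"text": "Summarize section 3 of the document."}],
    }],
    "cachedContent": cache_name,
}
```

Each follow-up call pays the cached rate for the document tokens and the standard rate only for the short question itself.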
Cost impact:
Cached input tokens are charged at roughly 25% of the standard rate. For a workflow that asks 20 questions about a 100K-token document, caching reduces the input cost by approximately 75%. At scale, this is a significant saving.
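The arithmetic behind that figure is worth seeing once. Assuming a 100K-token document, 20 questions, and cached tokens billed at 25% of the standard input rate (the exact discount varies by model and pricing tier, and this sketch ignores any separate cache-storage charge):

```python
DOC_TOKENS = 100_000   # size of the cached document
QUESTIONS = 20         # number of queries against it
CACHED_RATE = 0.25     # cached tokens as a fraction of the standard price

# Without caching: full input price for the document on every call.
uncached_cost_units = DOC_TOKENS * QUESTIONS

# With caching: full price once, then the discounted rate 19 more times.
cached_cost_units = DOC_TOKENS + DOC_TOKENS * CACHED_RATE * (QUESTIONS - 1)

savings = 1 - cached_cost_units / uncached_cost_units
print(f"input-token savings: {savings:.0%}")
```

With these numbers the savings come out just over 71%, approaching the 75% ceiling as the question count grows.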
Context caching is an API-only feature. In the Gemini web interface, context persistence is handled automatically within a conversation. The API feature matters for developers building applications that need to process the same large context repeatedly.
When to Cache
Use context caching when: (1) your input context is over 32K tokens, (2) you plan to make more than 3 queries against it, and (3) the queries will happen within a few hours. Below these thresholds, the caching overhead is not worth it. Above them, the savings compound quickly.
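Those three thresholds can be captured in a small decision helper. The cutoffs below restate the rule of thumb from this section, not official guidance, and "a few hours" is interpreted loosely as six:

```python
def should_cache(context_tokens: int, planned_queries: int,
                 window_hours: float) -> bool:
    """Rule of thumb: cache only when all three thresholds are met."""
    return (context_tokens > 32_000      # context large enough to matter
            and planned_queries > 3      # enough reuse to amortize setup
            and window_hours <= 6)       # queries land within "a few hours"

print(should_cache(100_000, 20, 2))   # large doc, many queries: worth caching
print(should_cache(10_000, 20, 2))    # small context: not worth it
```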
Gems: Your Custom AI Personalities
Gems are Gemini's version of custom AI personalities (similar to ChatGPT's Custom GPTs). Available on Gemini Advanced, Gems let you create specialized assistants for recurring tasks.
Creating a Gem:
- Go to the Gems section in Gemini
- Click "New Gem"
- Define the Gem's name, personality, and instructions
- Optionally upload knowledge files for the Gem to reference
- Test and refine
What makes a good Gem:
- Specific purpose: "Email Tone Adjuster" is better than "Writing Helper." Narrow scope produces better results.
- Detailed instructions: Include the Gem's expertise, tone, format preferences, and workflow. The more specific, the more consistent.
- Example outputs: Show the Gem what good output looks like. Include 2-3 examples of the exact format and style you want.
- Constraints: What should the Gem avoid? What boundaries should it respect?
Practical Gem examples:
- Meeting Prep Gem: Accesses your calendar, researches attendees, prepares talking points and agenda items
- Social Media Writer Gem: Follows your brand voice, generates platform-specific content, suggests hashtags and posting times
- Code Reviewer Gem: Checks for your team's specific coding standards, security patterns, and performance anti-patterns
- Customer Response Gem: Uses your company's FAQ and tone guidelines to draft responses to customer inquiries
Gems save time not by being faster at individual tasks, but by eliminating the setup. Instead of writing a new system prompt every time you need marketing copy, you open your Marketing Writer Gem and start immediately. Over weeks and months, this compounds significantly.
Try This Yourself
Create your first Gem right now. Go to the Gems section in Gemini Advanced, click "New Gem," and build an "Email Tone Checker." Set the instructions to: "You review emails before they are sent. For each email, rate the tone (professional, casual, aggressive, passive), suggest improvements, and rewrite any sentences that could be misinterpreted." Test it with your last three sent emails.
Advanced Prompting for Gemini
Gemini's architecture and training give it specific strengths that respond well to targeted prompting techniques.
Structured output requests. Gemini excels at generating structured data. Ask for JSON, tables, lists with specific formats, or any structured schema. "Return the analysis as JSON with fields: category (string), confidence (number 0-1), reasoning (string), action_items (array of strings)." Gemini follows schema instructions with high fidelity.
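In the API you can go further than prose schema instructions and enforce the shape directly. `response_mime_type` and `response_schema` are real generation-config fields for structured output, though the OpenAPI-style schema syntax shown here should be checked against the current docs; the parsed response below is a hypothetical example, not real model output:

```python
import json

# Generation config requesting JSON that conforms to the schema from the
# prompt above: category, confidence, reasoning, action_items.
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string"},
            "confidence": {"type": "number"},
            "reasoning": {"type": "string"},
            "action_items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["category", "confidence", "reasoning", "action_items"],
    },
}

# A hypothetical model response, parsed and validated locally.
raw = ('{"category": "billing", "confidence": 0.92, '
       '"reasoning": "Invoice terms mentioned.", '
       '"action_items": ["escalate to finance"]}')
parsed = json.loads(raw)
assert set(parsed) == set(generation_config["response_schema"]["required"])
```

Enforcing the schema at the config level is more reliable than prompt-only instructions: the output is guaranteed parseable, so downstream code can consume it without defensive cleanup.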
Multi-turn refinement. Gemini maintains strong coherence across long conversations. Use this for iterative refinement: generate a first draft, then refine specific aspects over multiple turns. "Now adjust the tone of section 2" or "Add quantitative evidence to support point 3." Each refinement builds on the full conversation context.
Grounding-aware prompts. Combine your knowledge with web research: "Based on what I know about our product [provide context] and the current market conditions [let Gemini search], write a competitive positioning document." This hybrid approach produces output that is both specific to your situation and grounded in current reality.
Multimodal chain prompts. Use previous outputs as inputs for subsequent tasks: "Analyze this chart [image]. Now create a text summary. Now turn that summary into an email to the executive team. Now translate the email into Spanish." Each step builds on the previous one using different modalities.
Google-integrated prompts. Reference your Google data naturally: "Look at my last three expense reports in Drive and identify spending categories that increased by more than 20%." The integration turns Gemini into a personal analyst with access to your actual data.
The meta-pattern: Gemini's greatest prompting advantage is its access to your data and the web simultaneously. Prompts that combine personal context with grounded research produce output that no other AI tool can match.
Build Your First Gem
Identify a task you do weekly that follows a consistent pattern. Create a Gem for it: define the role, write detailed instructions, include 2 examples of ideal output, and specify the format. Use the Gem for the next 3 instances of that task. After the third use, refine the instructions based on what you learned about what works and what needs adjustment.
Core Insights
- Grounding connects Gemini responses to real-time web search with source citations, dramatically reducing hallucination for factual and current-information queries
- Native multimodality means text, images, audio, and video are processed by the same model, enabling powerful combined-input workflows that other tools handle less naturally
- Context caching (API) reduces costs by ~75% for workflows that query the same large document set repeatedly, critical for applications at scale
- Gems create custom AI personalities for recurring tasks: define the role, instructions, and examples once, then use consistently across all instances of that task
- The most powerful Gemini prompts combine personal Google data with grounded web research, producing output that is both specific to your situation and current