Advanced Prompting with Gemini
Grounding, multimodal input, context caching, Gems (custom personas), and getting reliable outputs.
What You'll Learn
- Use grounding to get factual, up-to-date responses with source citations
- Master multimodal prompting with combined text, image, and video inputs
- Understand context caching for cost-effective repeated analysis of large documents
- Create and customize Gems for recurring specialized tasks
- Apply advanced prompting patterns optimized for Gemini's architecture
Grounding: Factual Responses with Sources
Grounding is Gemini's ability to verify its responses against real-time web data and Google Search results. When grounding is enabled, Gemini does not just generate text from its training data; it searches for current information and includes source citations in its response.
This addresses the biggest concern with AI tools: accuracy. A grounded response includes links to the sources it drew from, letting you verify claims and dig deeper into topics.
How grounding works:
In the Gemini web interface, grounding happens automatically for queries that benefit from current information. You will see a "Google it" chip or source links at the bottom of responses that used web search.
In the API, you can explicitly enable grounding with Google Search as a tool. The model decides when to search based on the query, and the response includes grounding metadata with source URLs and confidence scores.
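To make this concrete, here is a minimal sketch of what a grounded request body looks like when you enable the Google Search tool. The field names follow the v1beta REST API shape at the time of writing (model names and tool field names do change between versions, so treat this as illustrative and check the current API reference):

```python
import json

# Sketch of a Gemini API request body with grounding enabled via the
# Google Search tool. Declaring the tool lets the model decide for itself
# when a query warrants a search.
request_body = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"text": "What is the current pricing for Salesforce "
                         "Enterprise edition? Include the source."}
            ],
        }
    ],
    # The empty object enables the tool with default settings.
    "tools": [{"google_search": {}}],
}

print(json.dumps(request_body, indent=2))
```

A grounded response then carries grounding metadata alongside the text: source URLs and the supporting search snippets, which is where the citations in the web interface come from.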
When grounding matters most:
- Current events and recent developments
- Pricing and product information that changes frequently
- Statistics, market data, and research findings
- Technical documentation for specific software versions
- Regulatory and compliance information
When grounding is less necessary:
- Creative writing and brainstorming
- General knowledge questions about established topics
- Code generation (unless checking current API documentation)
- Personal analysis and opinion-forming
Prompting for grounded responses:
Be explicit when you want current information: "What is the current pricing for Salesforce Enterprise edition? Include the source." Adding "include sources" or "cite your references" signals Gemini to emphasize grounding.
Grounding makes Gemini particularly trustworthy for research tasks where accuracy matters more than creativity.
Verify Critical Information
Even with grounding, always verify critical business decisions against primary sources. Grounding dramatically reduces hallucination but does not eliminate it entirely. Use the source links Gemini provides to confirm key facts, especially for pricing, legal, and compliance information.
Multimodal Input: Text, Images, and Video
Gemini is natively multimodal, built from the ground up to process text, images, audio, and video with the same model. This is not a text model with vision bolted on; the multimodal understanding is core to how Gemini works.
Image analysis capabilities:
- Read and extract text from photos, screenshots, and documents
- Understand charts, graphs, and data visualizations
- Identify objects, scenes, and visual patterns
- Compare multiple images side by side
- Analyze UI screenshots for design feedback
- Process handwritten notes and whiteboard photos
Video understanding:
Gemini can analyze video content, particularly through its YouTube integration. Upload a video or reference a YouTube URL, and Gemini can:
- Summarize the video content
- Answer specific questions about what happens in the video
- Extract key points and timestamps
- Identify visual elements and on-screen text
Audio processing:
Gemini can transcribe and analyze audio content. Upload audio files or reference recordings for:
- Transcription with speaker identification
- Summary of conversations or meetings
- Translation of spoken content
- Analysis of audio content (podcast summaries, lecture notes)
Combining modalities effectively:
The power of native multimodality emerges when you combine inputs: "Here is a screenshot of our dashboard [image] and the CSV export of the underlying data [file]. The numbers do not match. Can you identify the discrepancy?" Gemini processes both inputs with full understanding of each.
Another powerful pattern: "Here is a video of our product demo [video]. Here is the script we wrote [text]. How closely does the actual demo follow the script? What was improvised?"
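In the API, a combined-modality prompt is just a single request whose `parts` mix text and media. The sketch below builds such a payload for the dashboard-vs-CSV example; the image bytes, filename, and CSV values are placeholders, and the `inline_data` field shape follows the REST API as documented at the time of writing:

```python
import base64
import json

# Stand-in for real screenshot bytes; in practice you would read a file.
fake_png_bytes = b"\x89PNG\r\n\x1a\n placeholder"

# One request, three parts: instruction text, an inline image, and the
# CSV data as a second text part. The model receives all of them together.
request_body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "Here is a screenshot of our dashboard and the CSV "
                     "export of the underlying data. The numbers do not "
                     "match. Can you identify the discrepancy?"},
            {"inline_data": {
                "mime_type": "image/png",
                "data": base64.b64encode(fake_png_bytes).decode("ascii"),
            }},
            {"text": "CSV export:\nregion,revenue\nEMEA,1200\nAPAC,950"},
        ],
    }],
}

print(json.dumps(request_body)[:80])
```

The ordering of parts matters less than their co-presence: the model sees the image and both text parts as one prompt, which is what makes cross-modal comparison questions like this work.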
Quick Test: Multimodal Input
Take a photo of a chart, graph, or whiteboard with your phone.
1. Upload it to Gemini and ask: "What does this show? Summarize the key data points."
2. Then upload a screenshot of a webpage and ask Gemini to extract all the text and organize it into bullet points.
Notice how naturally Gemini handles visual input alongside your text instructions.
Context Caching: Efficient Large-Document Processing
Context caching is a Gemini API feature that solves a specific cost problem: when you need to ask many questions about the same large document or dataset.
Without caching, every API call sends the full context (your document + the question) and you pay for all those input tokens every time. If you have a 500-page document and ask 20 questions about it, you are paying for 500 pages of input tokens 20 times.
With context caching, you upload the document once, Gemini caches it, and subsequent questions reference the cache. You pay full price for the first call and a significantly reduced rate for follow-up calls that use the cached context.
When context caching makes sense:
- Analyzing a large codebase with multiple questions
- Conducting extended research on a large document set
- Building applications where multiple users query the same knowledge base
- Processing a long document with dozens of specific extraction requests
How to use it (API):
- Create a cached content object with your documents
- Reference the cache ID in subsequent API calls
- The cache has a TTL (time to live) that you set based on how long you need it
- Pay reduced rates for the cached portion of input tokens
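The two-step flow can be sketched as a pair of request bodies. The field names (`ttl`, `cachedContent`) and the cache resource name format below follow the v1beta REST API as documented at the time of writing; the model name and document text are placeholders:

```python
# Step 1: create a cached content object holding the large document,
# bound to a specific model, with a TTL of one hour.
create_cache_body = {
    "model": "models/gemini-1.5-pro",
    "contents": [{
        "role": "user",
        "parts": [{"text": "<full text of the 500-page document here>"}],
    }],
    "ttl": "3600s",
}

# Step 2: the create call returns a resource name; reference it in every
# follow-up question instead of resending the document.
cache_name = "cachedContents/example-id"  # hypothetical value from step 1
followup_body = {
    "contents": [{
        "role": "user",
        "parts": [{"text": "Summarize section 3 of the document."}],
    }],
    "cachedContent": cache_name,
}
```

Each follow-up call pays the cached rate for the document tokens and the standard rate only for the short question itself.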
Cost impact:
Cached input tokens are charged at roughly 25% of the standard rate. For a workflow that asks 20 questions about a 100K-token document, caching reduces the input cost by approximately 75%. At scale, this is a significant saving.
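The arithmetic behind that figure is worth seeing once. Assuming a 100K-token document, 20 questions, and cached tokens billed at 25% of the standard input rate (the exact discount varies by model and pricing tier, and this sketch ignores any separate cache-storage charge):

```python
DOC_TOKENS = 100_000   # size of the cached document
QUESTIONS = 20         # number of queries against it
CACHED_RATE = 0.25     # cached tokens as a fraction of the standard price

# Without caching: full input price for the document on every call.
uncached_cost_units = DOC_TOKENS * QUESTIONS

# With caching: full price once, then the discounted rate 19 more times.
cached_cost_units = DOC_TOKENS + DOC_TOKENS * CACHED_RATE * (QUESTIONS - 1)

savings = 1 - cached_cost_units / uncached_cost_units
print(f"input-token savings: {savings:.0%}")
```

With these numbers the savings come out just over 71%, approaching the 75% ceiling as the question count grows.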
Context caching is an API-only feature. In the Gemini web interface, context persistence is handled automatically within a conversation. The API feature matters for developers building applications that need to process the same large context repeatedly.
When to Cache
Use context caching when: (1) your input context is over 32K tokens, (2) you plan to make more than 3 queries against it, and (3) the queries will happen within a few hours. Below these thresholds, the caching overhead is not worth it. Above them, the savings compound quickly.
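Those three thresholds can be captured in a small decision helper. The cutoffs below restate the rule of thumb from this section, not official guidance, and "a few hours" is interpreted loosely as six:

```python
def should_cache(context_tokens: int, planned_queries: int,
                 window_hours: float) -> bool:
    """Rule of thumb: cache only when all three thresholds are met."""
    return (context_tokens > 32_000      # context large enough to matter
            and planned_queries > 3      # enough reuse to amortize setup
            and window_hours <= 6)       # queries land within "a few hours"

print(should_cache(100_000, 20, 2))   # large doc, many queries: worth caching
print(should_cache(10_000, 20, 2))    # small context: not worth it
```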
Gems: Your Custom AI Personalities
Gems are Gemini's version of custom AI personalities (similar to ChatGPT's Custom GPTs). Available on Gemini Advanced, Gems let you create specialized assistants for recurring tasks.
Creating a Gem:
- Go to the Gems section in Gemini
- Click "New Gem"
- Define the Gem's name, personality, and instructions
- Optionally upload knowledge files for the Gem to reference
- Test and refine
What makes a good Gem:
- Specific purpose: "Email Tone Adjuster" is better than "Writing Helper." Narrow scope produces better results.
- Detailed instructions: Include the Gem's expertise, tone, format preferences, and workflow. The more specific, the more consistent.
- Example outputs: Show the Gem what good output looks like. Include 2-3 examples of the exact format and style you want.
- Constraints: What should the Gem avoid? What boundaries should it respect?
Practical Gem examples:
- Meeting Prep Gem: Accesses your calendar, researches attendees, prepares talking points and agenda items
- Social Media Writer Gem: Follows your brand voice, generates platform-specific content, suggests hashtags and posting times
- Code Reviewer Gem: Checks for your team's specific coding standards, security patterns, and performance anti-patterns
- Customer Response Gem: Uses your company's FAQ and tone guidelines to draft responses to customer inquiries
Gems save time not by being faster at individual tasks, but by eliminating the setup. Instead of writing a new system prompt every time you need marketing copy, you open your Marketing Writer Gem and start immediately. Over weeks and months, this compounds significantly.
Try This Yourself
Create your first Gem right now. Go to the Gems section in Gemini Advanced, click "New Gem," and build an "Email Tone Checker." Set the instructions to: "You review emails before they are sent. For each email, rate the tone (professional, casual, aggressive, passive), suggest improvements, and rewrite any sentences that could be misinterpreted." Test it with your last three sent emails.
Advanced Prompting for Gemini
Gemini's architecture and training give it specific strengths that respond well to targeted prompting techniques.
Structured output requests. Gemini excels at generating structured data. Ask for JSON, tables, lists with specific formats, or any structured schema. "Return the analysis as JSON with fields: category (string), confidence (number 0-1), reasoning (string), action_items (array of strings)." Gemini follows schema instructions with high fidelity.
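In the API you can go further than prose schema instructions and enforce the shape directly. `response_mime_type` and `response_schema` are real generation-config fields for structured output, though the OpenAPI-style schema syntax shown here should be checked against the current docs; the parsed response below is a hypothetical example, not real model output:

```python
import json

# Generation config requesting JSON that conforms to the schema from the
# prompt above: category, confidence, reasoning, action_items.
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string"},
            "confidence": {"type": "number"},
            "reasoning": {"type": "string"},
            "action_items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["category", "confidence", "reasoning", "action_items"],
    },
}

# A hypothetical model response, parsed and validated locally.
raw = ('{"category": "billing", "confidence": 0.92, '
       '"reasoning": "Invoice terms mentioned.", '
       '"action_items": ["escalate to finance"]}')
parsed = json.loads(raw)
assert set(parsed) == set(generation_config["response_schema"]["required"])
```

Enforcing the schema at the config level is more reliable than prompt-only instructions: the output is guaranteed parseable, so downstream code can consume it without defensive cleanup.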
Multi-turn refinement. Gemini maintains strong coherence across long conversations. Use this for iterative refinement: generate a first draft, then refine specific aspects over multiple turns. "Now adjust the tone of section 2" or "Add quantitative evidence to support point 3." Each refinement builds on the full conversation context.
Grounding-aware prompts. Combine your knowledge with web research: "Based on what I know about our product [provide context] and the current market conditions [let Gemini search], write a competitive positioning document." This hybrid approach produces output that is both specific to your situation and grounded in current reality.
Multimodal chain prompts. Use previous outputs as inputs for subsequent tasks: "Analyze this chart [image]. Now create a text summary. Now turn that summary into an email to the executive team. Now translate the email into Spanish." Each step builds on the previous one using different modalities.
Google-integrated prompts. Reference your Google data naturally: "Look at my last three expense reports in Drive and identify spending categories that increased by more than 20%." The integration turns Gemini into a personal analyst with access to your actual data.
The meta-pattern: Gemini's greatest prompting advantage is its access to your data and the web simultaneously. Prompts that combine personal context with grounded research produce output that no other AI tool can match.
Build Your First Gem
Identify a task you do weekly that follows a consistent pattern. Create a Gem for it: define the role, write detailed instructions, include 2 examples of ideal output, and specify the format. Use the Gem for the next 3 instances of that task. After the third use, refine the instructions based on what you learned about what works and what needs adjustment.
Core Insights
- Grounding connects Gemini responses to real-time web search with source citations, dramatically reducing hallucination for factual and current-information queries
- Native multimodality means text, images, audio, and video are processed by the same model, enabling powerful combined-input workflows that other tools handle less naturally
- Context caching (API) reduces costs by ~75% for workflows that query the same large document set repeatedly, critical for applications at scale
- Gems create custom AI personalities for recurring tasks: define the role, instructions, and examples once, then use consistently across all instances of that task
- The most powerful Gemini prompts combine personal Google data with grounded web research, producing output that is both specific to your situation and current