AI Script Generation

Use GPT-4o to generate professional narration scripts with prompt engineering and structured JSON output.

15 min read

What You'll Learn

Design effective prompts that generate professional-quality tutorial narration scripts from slide descriptions
Use OpenAI structured output to get reliable JSON responses with per-slide narration and timing estimates
Control narration tone, pacing, and vocabulary for different audiences from technical to non-technical
Iterate and refine AI-generated scripts through systematic prompt adjustments
Build a reusable script generation function that adapts to any tutorial topic

Prompt Engineering for Tutorial Scripts

The narration script is the backbone of your tutorial video. A well-crafted prompt turns slide descriptions into natural, engaging voiceover text that sounds like an expert walking a colleague through the workflow. A poor prompt produces generic, surface-level narration that adds no value over just reading the screenshots.

The prompt structure for tutorial scripts has four key components: role definition, audience specification, content rules, and output format. The role tells GPT-4o to act as a specific type of narrator. For IT tutorials, "a senior engineer showing a colleague something useful" produces better results than "a narrator" because it establishes the right level of technical depth and conversational tone.

Audience specification prevents the AI from either over-explaining basics or assuming too much knowledge. For an MSP audience, specify "IT admins and MSPs who use NinjaOne RMM" rather than just "IT professionals." The more specific the audience, the more targeted the language.

Content rules are negative constraints that prevent common AI narration failures. Explicitly ban phrases like "as you can see," "here we have," and "in this slide" because they waste time and sound robotic. Ban em dashes if your TTS engine stumbles on them. Require specific sentence lengths (2-4 sentences per slide, 10-18 seconds of speaking time) to ensure consistent pacing.

The most important prompt element is the slide context. Each slide description should include not just what the screenshot shows, but why it matters. "Monthly Schedule trigger configured for the 1st of every month at 9:00 AM" is better than "the schedule node settings" because it gives GPT-4o concrete details to weave into the narration.

Negative Constraints Improve Output Quality

Telling the AI what NOT to say is often more effective than telling it what to say. Ban filler phrases, ban UI navigation instructions, ban "as you can see" - the remaining output will be tighter and more professional.

Structured JSON Output

The script generation pipeline requires structured output, not free-form text. Each slide needs a narration string and a duration estimate, packaged in a JSON array that the TTS step can iterate over. OpenAI offers multiple ways to enforce JSON output, and choosing the right method prevents parsing failures downstream.

The simplest approach is to end your prompt with "Return ONLY a JSON array, no other text" and then strip any markdown code fences from the response. This works reliably with GPT-4o but occasionally produces extra commentary. To handle edge cases, check if the response starts with triple backticks and strip the first and last lines before parsing.

For more reliable enforcement, use the response_format parameter with type: "json_object". This forces the model to return valid JSON in every response. However, you still need to specify the exact schema in your prompt so the model knows what fields to include.

The ideal schema for tutorial scripts is an array of objects, each with slide (number), narration (string), and duration_seconds (number). The duration estimate helps the pipeline allocate display time per slide, though the actual TTS audio duration may differ slightly. Always use the measured audio duration for final video assembly rather than the estimated duration.

Temperature setting matters for script generation. A temperature of 0.7 produces natural variation in sentence structure and word choice while maintaining accuracy. Lower temperatures (0.3-0.5) produce more predictable but occasionally monotonous narration. Higher temperatures (0.9+) introduce creative phrasing but risk factual drift from your slide descriptions.

Fine-Tuning Narration Style

Narration style determines whether your tutorial feels like a polished production or an AI reading a manual. Small adjustments to your prompt produce dramatically different output, and finding the right style for your audience takes deliberate experimentation.

For technical audiences (developers, IT admins, MSPs), the best narration style is "explaining to a peer." This means using proper technical terminology without defining basic concepts, referencing specific tools and protocols by name, and keeping sentences direct. Avoid the "teacher" voice that over-explains or uses rhetorical questions.

For non-technical audiences (business users, clients, executives), shift to "translating complexity." Use analogies, avoid acronyms, and focus on outcomes rather than implementation details. Instead of "the OAuth2 credential authenticates the API call," say "the workflow securely connects to NinjaOne using your saved login credentials."

Pacing control happens at the prompt level. Specify "each slide gets 10-18 seconds of narration" and the AI will calibrate sentence count accordingly. For complex slides with dense configuration, allow up to 20 seconds. For simple overview slides, keep it to 8-12 seconds. The first slide should always be a hook ("Tired of manually creating patching tickets every month?") and the last slide should wrap up with the benefit.

Tone markers in your prompt act as style anchors. Adding "professional but approachable" produces different output than "technical and precise" or "casual and energetic." Test three different tone markers on the same slide set and compare the outputs to find what matches your brand voice.

Quick Test: A/B Test Your Narration Tone

Step 1: Take your existing storyboard and generate a script using the tone marker "professional peer."

Step 2: Generate a second script for the same storyboard using "friendly teacher."

Step 3: Generate a third using "technical expert."

Step 4: Compare the three outputs side by side and pick the one that best matches your brand voice.

Iterating and Editing Scripts

AI-generated scripts rarely land perfectly on the first attempt. Building an efficient review and iteration workflow prevents you from spending more time editing than it would take to write from scratch.

The fastest review method is to read the script out loud. If any sentence feels awkward to say, it will sound worse through TTS. Look for tongue-twisters, overly long sentences, and phrases that require unusual emphasis. Replace them with simpler alternatives.

Common issues to watch for include: the AI referencing visual elements it cannot see ("the green toggle on the left"), inventing features or settings that do not exist in the screenshot, and using marketing language when you asked for technical narration. These are easy to catch during a 2-minute read-through.

For systematic iteration, save your tutorial-script.json output and compare it across runs. If the same slide consistently produces weak narration, the issue is your context description rather than the prompt. Add more specific detail to that slide's context and regenerate.

Once you have a script you are happy with, save it as a reference template. When generating scripts for similar workflows, include an example of good narration in your prompt: "Here is an example of the narration style I want:" followed by 2-3 sample slides from your best output. This few-shot approach produces more consistent results than relying solely on style descriptions.

For production workflows, consider building a small validation step that checks script length per slide (flag anything under 5 seconds or over 25 seconds), catches banned phrases, and verifies the total video duration stays within your target range.

Core Insights

The four components of an effective script prompt are role definition, audience specification, content rules (especially negative constraints), and output format.
Structured JSON output with slide number, narration text, and duration estimate enables reliable downstream processing by the TTS and video assembly steps.
Temperature 0.7 balances natural language variation with factual accuracy for tutorial narration - lower for precision, higher for creativity.
Reading scripts aloud is the fastest way to catch awkward phrasing that will sound worse through text-to-speech.
Few-shot prompting with example narration from your best previous outputs produces more consistent style than style descriptions alone.

Screenshot Automation & Content Capture

Text-to-Speech Production