Custom Avatar Creation
Build Instant Avatars from short clips, commission Studio Avatars for premium quality, and follow recording requirements for best results.
What You'll Learn
- Record a qualifying video for Instant Avatar creation and understand the technical requirements
- Submit an Instant Avatar for training and manage the approval and generation process
- Understand the differences between Instant Avatar and Studio Avatar tiers
- Build a custom voice clone using HeyGen's voice training feature
- Set up and test a complete custom avatar with matched voice for production use
Understanding the Avatar Tiers
HeyGen offers three distinct custom avatar types, each with different quality levels, creation requirements, and plan restrictions. Understanding the difference before you start saves you from recording a video and then discovering you are on the wrong plan.
Instant Avatar is the entry-level custom avatar. You record a 2-5 minute consent video following HeyGen's guidelines, submit it for review, and within a few hours you have a digital twin that can speak any script you provide. Instant Avatars are available on Creator plans and above. The realism is solid for most business purposes, though gesture range is somewhat limited compared to the Studio tier.
Studio Avatar is the premium tier. It requires a structured recording session - typically 10-30 minutes of footage with specific gestures, expressions, and head movements captured. HeyGen's team reviews and trains the model, which takes 3-7 business days. Studio Avatars produce significantly more lifelike movement and are recommended for high-visibility content like customer-facing demos, product launch videos, or executive communications.
Instant Avatar Lite is a recent addition that allows avatar creation from a shorter video clip (under 2 minutes). Quality is below that of a full Instant Avatar, but it is a useful way to quickly test what a custom avatar looks like before committing to the full recording.
Custom voice is a separate but related feature. HeyGen can clone your voice from a 2-5 minute audio sample. The resulting voice clone is tied to your account and can be used with any avatar - not just your custom one. This is useful for narration consistency: even if you use a stock avatar for some videos, the voice remains recognizably yours.
Plan availability matters here. Instant Avatar is gated to Creator and above; Studio Avatar is Enterprise only. Voice cloning is available on paid plans with quota limits that vary by tier.
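The plan gating above can be written down as a small lookup, which is handy if you manage several accounts. This is only a sketch: the plan ordering and the "free" tier name are assumptions made for illustration (the text names only Creator and Enterprise), and nothing here calls a HeyGen API.

```python
# Hypothetical plan ladder for illustration; real accounts may have more tiers.
PLAN_RANK = {"free": 0, "creator": 1, "enterprise": 2}

# Minimum plan per feature, taken from the availability notes above.
MIN_PLAN = {
    "instant_avatar": "creator",    # Creator and above
    "studio_avatar": "enterprise",  # Enterprise only
    "voice_clone": "creator",       # "paid plans", modeled here as the lowest paid tier
}


def is_available(feature: str, plan: str) -> bool:
    """Return True if the given plan meets the minimum plan for the feature."""
    return PLAN_RANK[plan] >= PLAN_RANK[MIN_PLAN[feature]]


print(is_available("studio_avatar", "creator"))
print(is_available("instant_avatar", "creator"))
```

Voice clone quotas vary by tier and are not modeled here.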
Recording Your Instant Avatar Consent Video
The consent video is the foundation of your Instant Avatar. HeyGen is strict about recording quality because the model trains directly on this footage. A poorly lit, low-resolution recording will produce a low-quality avatar. There are no second chances on the same submission - if it fails review, you start over.
Technical requirements:
- Resolution: 1080p minimum. 4K preferred.
- Frame rate: 30fps or higher
- Lighting: even, front-facing light source with no harsh shadows on the face
- Background: plain, single-color wall works best. Avoid busy patterns.
- Camera distance: chest-up framing. Head should occupy roughly 50% of the vertical frame.
- Stability: use a tripod or phone mount. Handheld footage often fails review.
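The measurable requirements above (resolution, frame rate, and the 2-5 minute length) can be checked before upload. The sketch below is a hypothetical pre-flight check, not a HeyGen tool: the Recording type and check_recording function are illustrative names, and it assumes you have already read the values from the file's properties or a tool such as ffprobe. Lighting, background, framing, and stability still need a visual review.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Recording:
    """Basic properties of a recorded clip (illustrative type, not a HeyGen object)."""
    width: int
    height: int
    fps: float
    duration_s: float


def check_recording(rec: Recording) -> list[str]:
    """Return a list of problems; an empty list means the measurable specs are met."""
    problems = []
    # Assumption: "1080p minimum" means the shorter side is at least 1080 pixels,
    # so portrait phone footage (1080x1920) also qualifies.
    if min(rec.width, rec.height) < 1080:
        problems.append("resolution below 1080p")
    if rec.fps < 30:
        problems.append("frame rate below 30fps")
    # The full recording should run 2-5 minutes (120-300 seconds).
    if not 120 <= rec.duration_s <= 300:
        problems.append("duration outside 2-5 minutes")
    return problems


print(check_recording(Recording(width=1280, height=720, fps=24, duration_s=60)))
```

An empty result only means the numbers are in range; it says nothing about whether the footage will pass review.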
What to say in the consent video: HeyGen provides a required consent script you must read aloud. This is a legal requirement - you are authorizing HeyGen to train a model on your likeness. Read it clearly and naturally. Do not rush through it.
After the consent portion, keep recording natural speech until the whole video, consent portion included, runs 2-5 minutes. Talk about anything - your role, a recent project, your industry. The goal is to provide diverse speech samples. Vary your sentence length, use natural pauses, and avoid a monotone delivery.
Common rejection reasons to avoid:
- Visibly reading from notes or a script placed off-camera (eyes dart to the side repeatedly)
- Wearing hats, glasses with reflective lenses, or accessories that obstruct the face
- Recording in a room with strong backlighting (window behind you)
- Audio quality issues - use an external microphone if your built-in mic is poor
- Moving out of frame or significant camera shake
After submission, the review process typically takes 2-8 hours during business hours. You will receive an email notification when your avatar is ready. The first test generation is the critical check - have the avatar speak a few sentences and evaluate lip sync quality and natural movement before using it in production content.
Quick Test: Audit Your Recording Setup Before Submitting
Do a 60-second test recording using your phone or webcam before the real consent video, then review the playback:
- Is the lighting even on your face?
- Is there noticeable camera shake?
- Is your face properly framed, with your head occupying about 50% of the vertical space?
Fix any issues before the actual submission - first-attempt approval depends on getting these basics right.
Building and Testing Your Custom Avatar
Once your Instant Avatar is approved, the next step is a systematic quality evaluation before using it in published content. Rushing this phase is a common mistake that results in substandard videos going live.
Avatar quality evaluation checklist:
Write three test scripts with different characteristics, generate a video from each, and compare the results:
- A fast-paced script with short sentences (stress-tests lip sync timing)
- A slow, deliberate script with long pauses (checks natural stillness and resting expression)
- A script with technical or unusual words (identifies pronunciation edge cases)
For each test, evaluate:
- Lip sync accuracy: do the lip movements match the audio?
- Eye movement: does blinking look natural? Is eye contact maintained?
- Head movement: does the avatar nod and shift naturally, or does it look stiff?
- Hand and shoulder movement (if visible): does it look natural?
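The three scripts and four criteria above form a small grid of twelve checks, and it is easy to lose track of one. A minimal sketch for recording the results follows. All names are illustrative, the pass/fail judgments are entered by a human reviewer, and None marks a criterion that does not apply (for example, hands not visible in frame).

```python
from __future__ import annotations

SCRIPTS = ["fast", "slow", "technical"]
CRITERIA = ["lip_sync", "eye_movement", "head_movement", "hand_shoulder"]


def failing_checks(results: dict) -> list[tuple[str, str]]:
    """Return (script, criterion) pairs that failed or were never evaluated."""
    failures = []
    for script in SCRIPTS:
        for criterion in CRITERIA:
            # A missing entry defaults to False so no check is skipped silently.
            value = results.get(script, {}).get(criterion, False)
            if value is None:  # explicitly marked as not applicable
                continue
            if not value:
                failures.append((script, criterion))
    return failures


def ready_for_production(results: dict) -> bool:
    """The avatar is ready only when every applicable check has passed."""
    return not failing_checks(results)


# Example review: everything passes except lip sync on the fast script.
review = {s: {c: True for c in CRITERIA} for s in SCRIPTS}
review["fast"]["lip_sync"] = False
print(failing_checks(review))
print(ready_for_production(review))
```

Treating an unrecorded check as a failure is a deliberate choice here: it forces every script to be reviewed against every criterion before the avatar is used in published content.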
Voice cloning setup: Navigate to your Avatar settings and select Voice. If you want to use a cloned voice, record 2-5 minutes of clear, expressive speech for the voice training sample. Avoid background noise, music, or any audio that is not your voice. HeyGen's voice model needs clean input to produce a high-quality clone.
After training, test the voice clone with the same three test scripts you used for the avatar. Voice and avatar are generated somewhat independently, so issues in one do not imply issues in the other.
Production configuration: Once quality is confirmed, set your avatar and voice as defaults in your account settings. Every new video project then starts with your custom avatar pre-selected, which saves several clicks per project and adds up quickly in a high-volume workflow.
Studio Avatar and Advanced Creation Options
For teams that need the highest quality digital twin representation, the Studio Avatar tier is the answer. The additional realism comes from a longer, more structured recording session that gives the model more expressive material to learn from.
Studio Avatar recording requirements: The full recording session runs 10-30 minutes and captures specific gesture sets the HeyGen team requests. Common items include: nodding slowly and quickly, shaking head, raising hands, pointing forward, gesturing while speaking, sitting still for 30 seconds, and reading several paragraphs at different speeds.
Studio Avatar training is handled by HeyGen's team, not automated. This means:
- Turnaround is 3-7 business days
- A human reviews the footage for quality before training begins
- If something is substandard, HeyGen may request a re-record for specific segments
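When planning around a launch date, the 3-7 business day turnaround can be turned into a calendar window. The sketch below is a rough estimate under stated assumptions: it counts Monday to Friday as business days, ignores public holidays, and does not account for any re-record request. The function names are illustrative.

```python
from __future__ import annotations

from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance from start by the given number of Monday-Friday days."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def studio_turnaround(submitted: date) -> tuple[date, date]:
    """Return the (earliest, latest) expected delivery dates for a 3-7 business day window."""
    return add_business_days(submitted, 3), add_business_days(submitted, 7)


earliest, latest = studio_turnaround(date(2024, 1, 1))  # submitted on a Monday
print(earliest, latest)
```

A re-record request restarts part of the process, so treat the later date as a best case for scheduling rather than a guarantee.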
Studio Avatars are available exclusively on Enterprise plans, so for most individual creators and small teams, Instant Avatar is the appropriate option.
Interactive Avatars represent a different branch of the avatar product. Where standard and custom avatars are for pre-recorded video generation, Interactive Avatars are real-time AI agents that can listen and respond in conversation. They are used for virtual reception desks, interactive training simulations, customer support bots, and live event hosts.
Interactive Avatar setup requires the HeyGen Streaming API and integration with a language model backend to handle conversational logic. This is covered in Module 6 alongside other API-driven workflows.
Key decision framework:
- Need a quick custom presenter for regular content - use Instant Avatar
- Need the most realistic digital twin for high-stakes content - use Studio Avatar (Enterprise)
- Need a real-time conversational agent - use Interactive Avatar with the Streaming API
- Need an animated still image - use Photo Avatar (any paid plan, no training required)
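The framework above can also be read as a short series of questions. The sketch below encodes it as a function. The parameter names and the order in which the questions are asked are assumptions for illustration, and the fallback to Instant Avatar reflects the "quick custom presenter for regular content" case.

```python
def recommend_avatar(real_time: bool = False,
                     still_image: bool = False,
                     high_stakes: bool = False) -> str:
    """Map a content need to the avatar type suggested by the decision framework."""
    if real_time:
        # Real-time conversational agent
        return "Interactive Avatar (Streaming API)"
    if still_image:
        # Animated still image, no training required
        return "Photo Avatar"
    if high_stakes:
        # Most realistic digital twin, Enterprise plans only
        return "Studio Avatar (Enterprise)"
    # Default: quick custom presenter for regular content
    return "Instant Avatar"


print(recommend_avatar(high_stakes=True))
print(recommend_avatar())
```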
Pro Tip
Use the same script for your avatar test video and your voice clone training recording. This gives you a direct comparison between the AI-generated voice clone and your natural voice on identical content, making it immediately obvious whether the clone is accurate enough for production use.
Core Insights
- HeyGen offers four avatar creation paths - Instant, Studio, Photo, and Interactive - each suited to different quality requirements, budgets, and use cases.
- Instant Avatar quality is largely determined by recording conditions: 1080p minimum, even front-facing light, plain background, and a stable camera prevent the most common rejection reasons.
- Systematic quality evaluation with three test script types (fast, slow, and technical) before publishing keeps poor content from going live.
- Voice cloning and avatar training are independent processes - both require clean, isolated recordings and both should be tested with the same scripts for accurate side-by-side comparison.
- Studio Avatars require an Enterprise plan and 3-7 business days of training time, making them better suited to high-visibility recurring content than to rapid iteration.