The Real Cost of Generative Video: 63 Videos, IEEE Study

May 20, 2026 · 4 min read

If you’ve spent any time playing with generative video models, you know the pitch: every new release promises sharper visuals, longer clips, better prompt adherence. What nobody talks about is the part that actually decides whether you can ship: how much does it cost, and is the quality you’re paying for the quality you’re getting?

I spent the last few months trying to answer that. The result is a paper I just published in IEEE Access: “Cost-Efficiency Metrics: Evaluating Computational and Resource Efficiency in Generative AI Video Models.” Here’s the short version.

The setup

I benchmarked five generative video models head-to-head:

HunyuanVideo-1.5 (open source, heavy)
CogVideoX-2b, CogVideoX-5b, CogVideoX1.5-5B (open source, lighter)
Google Veo 3.1 (cloud API)

63 generated videos. 32 human evaluators rating them blind across 10 dimensions. ANOVA, Tukey HSD, Cohen’s d — the full statistical workup. And critically, a cost figure attached to every single video.
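For readers who have not used effect sizes before, here is a minimal sketch of Cohen's d for two groups of ratings. It assumes the pooled-standard-deviation form; which variant the paper uses is not stated in this post, and the function name and sample ratings are illustrative placeholders, not study data.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two ratings")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Placeholder 1-5 ratings for two hypothetical models.
    model_a = [4, 5, 4, 5]
    model_b = [2, 3, 2, 3]
    print(f"Cohen's d: {cohens_d(model_a, model_b):.2f}")
```

A common rule of thumb treats d around 0.2 as small, 0.5 as medium, and 0.8 or more as large, which is why an effect size is usually reported next to the ANOVA p-value rather than instead of it.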

What I found

1. The cheapest model is 20× more cost-effective than the cloud API. CogVideoX-2b runs at $0.085 per video. Google Veo 3.1 runs at $0.75. That’s an 8.8× cost difference for the raw output, and roughly 20× when you factor in quality-per-dollar.

2. The most expensive model produces the best human-rated quality. HunyuanVideo-1.5 scored 4.53/5.0 for visual quality from human raters — the highest in the study. But it costs $1.45 per video and needs 39 GB of VRAM. Enterprise hardware, enterprise prices.
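The cost figures in findings 1 and 2 can be put side by side with a few lines of arithmetic. This is a rough sketch: the helper names are hypothetical, and `rating_per_dollar` (mean human rating divided by cost) is only one simple way to express quality-per-dollar, not necessarily the metric defined in the paper.

```python
# Per-video costs in USD, as quoted in the findings above.
COST_PER_VIDEO = {
    "CogVideoX-2b": 0.085,
    "Google Veo 3.1": 0.75,
    "HunyuanVideo-1.5": 1.45,
}


def cost_ratio(model, baseline):
    """How many times more expensive `model` is than `baseline`, per video."""
    return COST_PER_VIDEO[model] / COST_PER_VIDEO[baseline]


def rating_per_dollar(mean_rating, cost_usd):
    """Assumed quality-per-dollar proxy: mean human rating divided by cost."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return mean_rating / cost_usd


if __name__ == "__main__":
    veo_vs_cog = cost_ratio("Google Veo 3.1", "CogVideoX-2b")
    hunyuan_vs_cog = cost_ratio("HunyuanVideo-1.5", "CogVideoX-2b")
    print(f"Veo 3.1 vs CogVideoX-2b: {veo_vs_cog:.1f}x")           # 8.8x
    print(f"HunyuanVideo-1.5 vs CogVideoX-2b: {hunyuan_vs_cog:.1f}x")  # 17.1x
    # HunyuanVideo-1.5: 4.53/5.0 visual quality at $1.45 per video.
    print(f"HunyuanVideo-1.5 rating per dollar: {rating_per_dollar(4.53, 1.45):.2f}")
```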

3. The cloud API wins on usability, not quality. Google Veo 3.1 generated videos 4.2× faster than the fastest local model and scored highest on ease-of-use (4.66/5.0). When users were asked to rank their preferred model, 43.8% picked Veo — tied with HunyuanVideo. Only 12.5% picked any CogVideoX variant, despite its dominant cost-efficiency. People don’t pay for efficiency. They pay for quality or convenience.

4. The most uncomfortable finding: automated quality metrics actively lie to you.

This one surprised me. CogVideoX1.5-5B scored the highest CLIP score of any model in the study (32.40) — CLIP being the most-used automated metric for text-to-video alignment. It also scored the lowest human visual quality rating (2.72/5.0).

The correlation between CLIP scores and human ratings: r = −0.87, p < 0.001. That’s a strong negative correlation. The metric the industry uses to claim “our model is better” was, in my sample, predicting the opposite of what humans actually saw.

Frame consistency turned out to be a much better predictor of perceived quality (r = 0.73 with artifact severity).
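To check whether an automated metric tracks human judgment on your own outputs, a Pearson correlation is the usual first step. The sketch below implements it with the standard library only. The function name and the sample numbers are placeholders made up for illustration; they are not the study's data and are chosen only to show what a strongly negative r looks like.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: one automated score and one human rating per video.
    metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    human_ratings = [5.0, 4.5, 3.0, 2.5, 1.0]
    r = pearson_r(metric_scores, human_ratings)
    print(f"r = {r:.2f}")  # negative: the metric rises as human ratings fall
```

If r comes out negative or near zero on your own samples, the metric should not be the one you optimize or report.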

Why this matters

If you’re an engineering team deciding whether to build on a local model or a cloud API, the math depends entirely on volume:

High-volume production (thousands of videos/month): local models break even within 6–12 months and give you 82× less variance in inference time. Critical for SLAs.
Low-volume / prototyping (50 videos/month or fewer): cloud APIs are cheaper in absolute terms, even at a 784% per-video premium.
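The volume trade-off above can be sketched as a simple break-even calculation. The per-video costs are the ones quoted in this post; everything else is an assumption. In particular, the upfront hardware figure is a hypothetical placeholder, and the sketch treats the $0.085 local cost as a purely marginal cost, which this post does not confirm.

```python
import math

CLOUD_COST_PER_VIDEO = 0.75   # Google Veo 3.1, USD (from the post)
LOCAL_COST_PER_VIDEO = 0.085  # CogVideoX-2b, USD (from the post)


def breakeven_months(upfront_usd, videos_per_month,
                     cloud_cost=CLOUD_COST_PER_VIDEO,
                     local_cost=LOCAL_COST_PER_VIDEO):
    """Months until upfront local spend is repaid by per-video savings."""
    monthly_saving = videos_per_month * (cloud_cost - local_cost)
    if monthly_saving <= 0:
        return math.inf  # local never pays for itself
    return upfront_usd / monthly_saving


if __name__ == "__main__":
    hypothetical_upfront = 10_000  # assumed hardware/setup cost, not from the paper
    for volume in (50, 2_000):
        months = breakeven_months(hypothetical_upfront, volume)
        print(f"{volume} videos/month: break-even after {months:.1f} months")
```

Under these assumed numbers, 2,000 videos a month pays back in well under a year, while 50 videos a month takes decades. Plug in your own hardware quote and volume before drawing conclusions.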

And if you’re a researcher: stop trusting CLIP scores alone. The field needs composite evaluation frameworks that integrate temporal consistency and human perception. Automated metrics built on static image-caption pairs cannot capture motion.

Read the paper, fork the code

📄 Paper (open access): IEEE Xplore

📄 DOI: 10.1109/ACCESS.2026.3695013

💻 All code, data, and 63 generated videos: GitHub repository

Everything is reproducible. If you want to run the benchmark on a new model, the harness is there.

What’s next

This paper is a foundation, not a finish line. The next questions I want to tackle:

Does this hold over longer videos (15–60s) where temporal degradation matters more?
Can we build a quality metric that actually correlates with human perception? (VMAF looks promising — r = 0.81 vs. CLIP’s 0.31.)
Do user preferences shift over weeks of exposure, or are first-impression ratings stable?

If you’re working on any of this, I’d love to hear from you. Reach me on LinkedIn.

Roan Weigert is a DevRel AI Engineer at GMI Cloud based in San Francisco. His research focuses on practical, production-grade evaluation of generative AI systems.
