Google has recently published test results comparing VP8 to H.264, which created somewhat of a stir on the x264-devel mailing list. I thought I would add what I found two years ago as a result of my thesis work. These results are not fully up-to-date, but they are still interesting.
A Basic Comparison
My thesis involved comparing H.264 and VP8 using some new methods I developed from scratch, and others that I applied from other fields. The basic way that video codecs are compared today is by measuring bitrate versus average quality of the encoded video, as seen in Google's results. I found this an insufficient and possibly inaccurate way to compare videos, which are composed of still frames.
Instead, I decided to compare results frame-by-frame, and then statistically analyze the results using box plots and Student's t-test for statistical significance testing, to see if there was actually any difference. Here is one example:
This is a comparison of H.264 Baseline, VP8 at the "good" deadline, and the JM reference encoder for several well-known reference videos. As this is a box plot, it shows the minimum PSNR value (the bottom of the whisker), the 1st quartile (the start of the box), the median (the line in the box), the average (the single point, usually a plus, x, or asterisk), the 3rd quartile (the top of the box), and the maximum. All encoders were configured to maximize PSNR, with settings as equivalent as possible. This particular graph is for two-pass encoding at 150 kbps.
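The summary statistics behind such a box plot are simple to compute. Here is a minimal sketch of how per-frame PSNR and the box-plot numbers could be derived; `frame_psnr` and `boxplot_stats` are my own illustrative helpers, not code from the thesis:

```python
import numpy as np

def frame_psnr(ref, enc, max_val=255.0):
    """PSNR for a single pair of frames (8-bit greyscale arrays)."""
    mse = np.mean((ref.astype(np.float64) - enc.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def boxplot_stats(values):
    """The five-number summary plus the mean, as drawn in the box plots."""
    v = np.asarray(values, dtype=np.float64)
    return {
        "min": v.min(),
        "q1": np.percentile(v, 25),
        "median": np.median(v),
        "mean": v.mean(),
        "q3": np.percentile(v, 75),
        "max": v.max(),
    }
```

Feeding `boxplot_stats` the list of per-frame PSNR values for one encoded video yields exactly the quantities plotted above.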
VP8 does well in this graph, outperforming H.264 Baseline (using the x264 encoder) in most videos. Videos that have low movement (such as Akiyo) do particularly well, where VP8's least well encoded frame is better than H.264's average. At low resolution and bitrate, when compared to H.264 Baseline, VP8 does well.
An important aspect of comparing video codecs is to set the encoder settings correctly. I tried to match the settings between these three encoders as closely as I could. One important part of using x264 properly was to disable the psy-rd optimizations that can hurt PSNR or SSIM scores, but, allegedly, improve subjective viewing performance.
When comparing VP8 to H.264 High profile, the story is different. Note: I'm switching from PSNR to SSIM here. This is to show that I used both in my thesis and not to misrepresent the results. VP8 performed similarly in the PSNR results, and my full thesis contains both graphs and analyses.
This compares H.264 High profile, VP8's Best deadline, and the JM reference encoder on the same videos, with otherwise similar settings, measured in SSIM rather than PSNR. In this one, H.264 outperforms VP8 in nearly every video by a substantial margin. For these tests, I also compared them using Student's t-test, so I'm not just eyeballing this one: the null hypothesis was rejected soundly.
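A two-sample t-test of this kind is a one-liner with scipy. The SSIM values below are made-up placeholders, not data from the thesis; the point is only to show the shape of the test:

```python
from scipy import stats

# Hypothetical per-frame SSIM values for two encoders (not the thesis data).
ssim_h264 = [0.962, 0.958, 0.971, 0.965, 0.960, 0.968, 0.963, 0.959]
ssim_vp8  = [0.941, 0.938, 0.949, 0.944, 0.940, 0.946, 0.942, 0.937]

# Two-sample t-test: the null hypothesis is that the mean SSIM is the same.
t_stat, p_value = stats.ttest_ind(ssim_h264, ssim_vp8)
if p_value < 0.05:
    print("reject the null hypothesis: the encoders differ")
else:
    print("no significant difference")
```

"Rejecting the null hypothesis soundly" means the p-value came out far below the significance threshold.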
More interesting graphs
Other graphs I produced were RD curves, for which I then computed p-values using Student's t-test. Here is the graph of the RD curve measured in PSNR for Stockholm, which is a 720p resolution video:
There is seemingly little difference between Baseline and VP8's Good deadline, but a larger difference between High profile and VP8's Best deadline. But was that difference statistically significant? Yes! The p-values computed for High versus Best, for both PSNR and SSIM, met the statistical significance threshold set earlier.
Comparing frame-by-frame statistics and rate distortion curves is good, but what if there are differences and interesting behavior that these coarse measures hide? I generated some graphs that plotted frame-by-frame results. These are the SSIM values for five videos stitched together: BlueSky, Tractor, Riverbed, Pedestrian Area, and Rush Hour, which can be found on xiph.org. The vertical lines in the graph mark the scene changes.
VP8, in this example, does rather well against both H.264 High profile and Baseline profile at the end, does alright in the middle, and does poorly during the zoom featured in Tractor. I found this surprising, as VP8 did poorly in a one-on-one comparison on Rush Hour alone. It turned out that the way the RD optimizer worked in VP8 (at the time) was that it tended to allocate too much bandwidth to the end of videos, even in two-pass mode. This explains the dramatic performance boost. I imagine that they have fixed this by now.
This is one reason not to blindly accept frame-by-frame PSNR or SSIM values. They do not tell the full story, especially when the graph doesn't include a key (see the last slide). I feel that these graphs are more likely to mislead than to elucidate.
An important aspect of video coding is the compression of key frames, so that their presence takes the minimum amount of bandwidth necessary for high-quality reference pictures and accurate inter-prediction. For this test, every frame was set to be a key frame, the bitrate was controlled by the quantization factor, and the output bitrate was plotted.
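For x264, such an all-intra sweep can be scripted by forcing a GOP length of one and fixing the quantizer. This is a sketch of how the command lines could be built, not the actual scripts from the thesis; the filenames are placeholders:

```python
def x264_allintra_cmd(qp, src="input.y4m", out="out.264"):
    """Build an x264 command line that encodes every frame as a key frame
    at a fixed quantizer (filenames are placeholders)."""
    return [
        "x264",
        "--qp", str(qp),   # fixed quantizer: rate is controlled by QP alone
        "--keyint", "1",   # GOP length 1 => every frame is a key frame
        "--output", out,
        src,
    ]

def qp_sweep(qps):
    """One command per quantizer value in the sweep."""
    return [x264_allintra_cmd(qp) for qp in qps]

# To actually run an encode (requires x264 on PATH):
# import subprocess
# subprocess.run(x264_allintra_cmd(30), check=True)
```

Measuring the resulting file size and SSIM at each QP gives one point per quantizer on the graph below.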
This graph is of H.264 Baseline compared to VP8's Good deadline. The error bars represent the standard deviation and the point is the average value for SSIM.
VP8 does very well in this comparison, outperforming H.264 Baseline on average in every case. But was it statistically significant? No. The standard deviation was high enough, and the difference low enough, that we can't reject the null hypothesis (that they are similar in quality).
To determine the statistical significance of the PSNR and SSIM results, Student's t-test and Welch's t-test were used. First, the standard deviations were tested to determine whether or not the variances were equal. If the resulting p-value was less than the chosen significance level α, the null hypothesis that the variances were equal was rejected. If the variances were unequal, Welch's t-test was used on the average PSNR or SSIM values corresponding to those standard deviations; otherwise, Student's t-test, which assumes equal variances, was used.
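This decision procedure maps directly onto scipy, since `ttest_ind` implements both tests via its `equal_var` flag. In the sketch below I use Levene's test as a stand-in for the equality-of-variances check; the exact variance test used in the thesis may differ:

```python
from scipy import stats

def compare_means(a, b, alpha=0.05):
    """Pick Student's or Welch's t-test based on a variance-equality check.

    Levene's test stands in here for the equality-of-variances test
    described in the text; the thesis's exact choice may differ.
    """
    _, var_p = stats.levene(a, b)
    equal_var = var_p >= alpha  # fail to reject => treat variances as equal
    # equal_var=True  -> Student's t-test (assumes equal variances)
    # equal_var=False -> Welch's t-test  (does not)
    _, p = stats.ttest_ind(a, b, equal_var=equal_var)
    return p, equal_var
```

The returned p-value is then compared against the same α to decide whether the difference in average PSNR or SSIM is significant.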
But what about H.264 High profile, which includes the 8-by-8 DCT that should surely improve its intra-coding performance?
These are nearly identical. Obviously both t-tests found no significant difference between these.
VP8 is a powerful modern video codec that is suitable for individuals and organizations seeking a patent-free alternative to H.264. Its quality on medium-resolution web videos is comparable with H.264, and it excels at low-resolution, low-bitrate videos, where it outperforms H.264 Baseline in quality at the same bitrate.
It underperforms on higher-resolution video, such as HD video, due to its simpler segmentation scheme, which reduces the effectiveness of its adaptive quantization and adaptive loop filter selection. VP8's entropy coder is approximately as efficient as CABAC but somewhat simpler, partly because it does not need to adapt after every bit. VP8's intra prediction is sophisticated and performs as well as H.264 High profile on intra prediction tests.
The main reason for its strong performance at lower resolutions is VP8's equivalent of H.264's flexible macroblock ordering (FMO). VP8 can assign each macroblock one of four segment identification numbers and encode the macroblocks within each numbered segment similarly. These segments do not need to be contiguous, unlike H.264's slices without FMO. This offers superior quality at lower resolutions, where the limited number of segments is not an impediment. At higher resolutions, four segments seriously limit the compression possible with this method.
For VP9, it would be a significant improvement to allow more segments.
This post is really a small subset of the testing and results I gathered for my thesis. If you'd like to read more about this, you can read my thesis. It has a lot more detail, including a detailed description of VP8, my encoding parameters, statistical methodology and more. I'm also interested in what you think of my research, so please contact me with any comments.