Expert Tests InnerFidelity's Headphone Measurement Repeatability and Reproducibility Page 2

Procedures/Results & Discussion
Right off the bat, I wanted to see just how repeatable the measurements from Tyll's gear were. But before I go into that, I want to note the difference between precision (repeatability) and accuracy. I think it's best described by the image above. As you can see, one can be "precise" but not accurate. Being accurate means being both precise and "hitting the bull's eye."

Back to the testing ...

First, we wanted to characterize Tyll's gear (dummy head, isolation chamber, microphones, measurement equipment) on its own to determine just how much variability the equipment alone was causing in the measurements. Tyll simply placed the HD800s on the dummy head and took 5 successive measurements over a short span of time. To note, the headphones were left on the head without any changes, and the 5 measurements were done in short order to eliminate any potential variability from changes in room temperature. The results were quite impressive, to say the least.

Below are the five frequency response graphs of this experiment:


I then determined the standard deviation at each frequency and multiplied that value by 2 and plotted the 2x standard deviation (dB) by frequency (Hz). The reason I used 2x again was that statistically, if we measured the same pair of headphones 100 times, we would expect 95 of them to fall within +/- 2 standard deviations (or sigmas). As well, this graph would show us the areas in the frequency ranges that the measurement system had difficulties measuring.
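To make the calculation concrete, here is a minimal Python sketch of the per-frequency 2x standard deviation. It is not the spreadsheet or tool actually used for this article; the function name `two_sigma_by_frequency`, the sample data, and the choice of the sample (n-1) standard deviation are all assumptions for illustration.

```python
from statistics import mean, stdev


def two_sigma_by_frequency(runs):
    """Return 2x the sample standard deviation at each frequency point.

    `runs` is a list of repeated measurements; each run is a list of dB
    values taken at the same frequency points, in the same order.
    """
    # zip(*runs) groups together the readings taken at the same frequency.
    return [2 * stdev(levels) for levels in zip(*runs)]


# Made-up illustrative data: 5 runs, 3 frequency points each (dB).
runs = [
    [90.00, 88.05, 85.10],
    [90.05, 88.00, 85.00],
    [89.95, 88.10, 85.05],
    [90.00, 87.95, 85.15],
    [90.10, 88.00, 85.00],
]

two_sigma = two_sigma_by_frequency(runs)
average_two_sigma = mean(two_sigma)  # one summary number for the whole curve
```

Plotting `two_sigma` against frequency would give the kind of graph described above, and `average_two_sigma` corresponds to the single summary figure quoted for the whole range.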

The results here also looked very good with an average 2x the standard deviation of roughly 0.1dB.


We can use this data to extrapolate that, should we measure the exact same pair of headphones successively 1000 times (and discount any placement or temperature variations), we could expect 950 of them to measure within +/- 0.1dB. This shows that the measurement tools are very precise. (I am using the word "tools" in place of system, as the system includes the effects of changing headphone position on the head, temperature variations, etc.)
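The "95 out of 100" figure comes from the normal distribution, so it assumes the run-to-run measurement error is roughly normally distributed. A quick sketch using Python's standard library shows where the number comes from (the helper name `coverage_within` is made up for this example):

```python
from statistics import NormalDist


def coverage_within(k_sigma):
    """Fraction of a normal distribution lying within +/- k_sigma standard deviations."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k_sigma) - standard_normal.cdf(-k_sigma)


fraction = coverage_within(2)
expected_in_1000 = round(1000 * fraction)
```

The exact coverage for +/- 2 sigma is a little over 95%, which is why the text rounds to about 950 out of 1000.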

Now how does this precision relate to accuracy? Well the dummy head is composed of the dimensions/density of the "average" human head. So, comparing this method to say me using my Radio Shack SPL meter, I've got to believe that Tyll's system is far more accurate in determining how my ears will hear it.

Now the fun begins….

For the next set of experiments, Tyll performed the exact same experiment as above, BUT this time he removed the headphones and placed them back on the head between measurements. The results were very different in terms of precision. For this data set, we have a total of 7 runs. While leaving the headphones on the head yielded remarkably repeatable results, the action of removing and placing the headphones back on the head made substantial changes.

Here is the frequency response graph from this run of 5 successive measurements, this time taking the headphones off and placing them back on the head between each one:


As we can see, the precision up to about 2kHz is still quite good; however, the data does start to diverge after this point. Comparing this to the first run (when the headphones were not removed from the head), we can see quite a bit more variability in the measurements in the higher frequencies.

Performing a similar analysis on the standard deviation vs. frequency, we see that the deleterious effects on the precision at higher frequencies are more pronounced:


The bass to mid bass/lower mids regions (Regions 1 and 2) still show little effect of precision loss due to headphone placement. Region 3 (1300Hz – 3400Hz) shows a shifting upwards of the variability. Once we approach about 3.4kHz the variability does really begin to become a strong function of headphone placement. The resulting 2x standard deviation is now 1.8dB. Again, this would mean, should we measure the same pair of HD800s 1000 times (and place the headphones on the head in between each measurement) we could expect 950 measurements would be within 1.8dB.

However, once we reach higher frequencies (>8.4kHz), the variability really begins to become strongly affected. The average standard deviation in this region is now almost 5dB. So yet again, variations of +/-5dB are to be expected in this region. The maximum variability is higher still, at approximately 18dB. So it appears that the slightest change in headphone placement on the head can significantly change the high frequency resonances in the dummy head and drastically reduce the measurement system's ability to precisely measure higher frequencies.

Finally, we performed the exact same testing procedure that Tyll has outlined. This method uses an average of 5 specific locations on the dummy head (slightly forward, back, up, down, and centered) to "average" out the above variability.

The results showed that this spatial averaging method does noticeably smooth out the variances seen, as the two graphs below illustrate:


To confirm, each graph above compares the result of five runs where each run is the averaged result of the 5 different headphone positions. We can see that the variations between runs have been reduced by a good margin; but they still are quite a bit larger than in the tests where the headphones were not repositioned on the dummy head.
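For readers who want to see what "averaging the 5 positions" looks like in practice, here is a minimal sketch. It assumes the averaging is a plain arithmetic mean of the dB values at each frequency, which the text does not actually specify; the function name `spatial_average` and the numbers are invented for illustration.

```python
from statistics import mean


def spatial_average(position_curves):
    """Average several frequency response curves point by point.

    `position_curves` holds one curve (a list of dB values) per headphone
    position; all curves use the same frequency points.
    """
    return [mean(levels) for levels in zip(*position_curves)]


# Made-up readings at 3 frequency points for the 5 positions (dB).
positions = {
    "centered": [90.0, 85.0, 78.0],
    "forward":  [90.1, 85.5, 74.0],
    "back":     [89.9, 84.6, 81.0],
    "up":       [90.0, 85.2, 76.5],
    "down":     [90.0, 84.8, 79.5],
}

one_run = spatial_average(list(positions.values()))
```

Each of the five runs compared in the graphs would be one such averaged curve, built from its own set of five placements.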

The 2X standard deviation vs. frequency graphs show this as well:


(Please note scale change between this and previous standard deviation graph.)

So now headphone measurements in Region 3 that vary within 1.33dB should be considered equivalent (but please note that we did see variations of up to almost 2.5dB). Furthermore, in Region #4, we did see variations of +/- 3.5dB (with maximum variations of 7.4dB).
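As a rule of thumb, that means a difference between two headphones only counts if it exceeds the repeatability band for that region. A small sketch of that check follows; the dictionary keys and the function name `is_meaningful_difference` are hypothetical, and the thresholds are simply the 2x standard deviation figures quoted above.

```python
# 2x standard deviation per region, taken from the figures in the text (dB).
TWO_SIGMA_DB = {"region_3": 1.33, "region_4": 3.5}


def is_meaningful_difference(level_a_db, level_b_db, region):
    """True if two readings differ by more than the repeatability band for that region."""
    return abs(level_a_db - level_b_db) > TWO_SIGMA_DB[region]


# A 1 dB gap in Region 3 is within the noise of repositioning the headphones...
same_in_region_3 = not is_meaningful_difference(90.0, 91.0, "region_3")
# ...while a 2 dB gap in the same region exceeds it.
different_in_region_3 = is_meaningful_difference(90.0, 92.0, "region_3")
# In Region 4 even a 3 dB gap falls inside the band.
same_in_region_4 = not is_meaningful_difference(90.0, 93.0, "region_4")
```

This is a deliberately simple threshold test that mirrors the wording above, not a formal statistical comparison of two sets of measurements.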

What does all this data mean? First off, I learned that measuring headphones (unlike measuring conference room table lengths) is pretty darn difficult. However, the gear used by Tyll does very much seem up to the task of giving us readings that are precise. And because of the dummy head that Tyll uses, it does appear to give a more accurate reading of what's going on in the average human head than other methodologies.

Between 10Hz and 3.4kHz, the precision of the measurements seems to be relatively unaffected (with 2x standard deviations less than 1dB). So one could with confidence use this information to compare, say, two different headphone models in the bass and mids.

But measurements in the treble region appear to be strongly influenced by the position of the headphones on the head. Even the smallest variances in placements alter the resonance artifacts (peaks and valleys) in the higher frequencies that can make it appear that 2 different headphones measure differently (by up to 7dB). Even when measuring the same pair of HD800s we found variability in the higher frequencies (> 8.4kHz) that averaged approximately 3.5dB with peak variances of up to 7.4dB.

Finally, based on the data, it does very much appear that from 10Hz to 3.4kHz the measurements are very precise and not very dependent on headphone positioning. So comparisons between different headphones and models are quite relevant statistically. From 3.4kHz to about 8.5kHz, the precision is still good, but can vary (so please take this into account). The most problematic area is the treble (frequencies > 8.5kHz), where even the slightest variations in headphone placement can have a drastic negative effect on the repeatability. So please use caution when trying to make comparisons in this region of the frequency range.

Next steps? I'm not sure, but the challenge is there for Tyll (or anyone else for that matter) to work on a measurement system that still maintains a level of accuracy (i.e. dummy head), but improves upon the precision at higher frequencies as even the slightest variances in placement can have such profound differences on the measured results. I'm sure that using this type of methodology to question what's being measured should help in that endeavor.

I would like to thank Tyll for his great amount of work in pulling together this data and actually running the experiments. As well, I've got to say, in the few telephone conversations I've had with Tyll, I've learned more than all the years of independent learning that I undertook since I got into this hobby. He's a great guy with a great depth of knowledge and really just knows his stuff.

Editor's Note: Aw geez, thanks, Peter. And thanks for the terrific analysis. Let me complicate it a little further ...

First, the reader should know that the peaks and valleys in the high frequency region arise from resonances: between the driver and ear; within the concha ridge of the ear; and in the ear canal itself. While the driver might be putting out a completely flat frequency response in the treble, all the resonances will make it appear that it's not. Basically, the ear's not hearing the driver as much as it's hearing all the resonances the driver is exciting in the coupler (the combined headphone/ear acoustic system).

Because these resonant cavities are very small, very small positional changes of the headphones on the head significantly shift the resonant frequencies of the acoustic coupling between the headphone and ear. So the changes in amplitude that are being measured for this study are primarily occurring from the shifting in frequency of the resonant peaks and nulls.

One way to rid ourselves of this pesky problem might be to apply some smoothing to the frequency response curves. Because the resonances typically create adjacent peak and null features, by applying a smoothing filter one will be able to somewhat average the peaks with the nulls to arrive at a mean response. This might indicate an approximation of how much energy the driver is emitting into the coupler. (Peter and I have already chatted about this as a further avenue for exploration.)

Another miserable reality is that one headphone may couple with the head in a completely different way than another. Peter's current study shows us the repeatability when measuring an HD 800, a headphone that's fairly positionally insensitive. If I were to perform the same test with a Beyerdynamic DT1350, which tends to be fairly positionally sensitive, we might see a completely different analysis from Peter. In that case, variations reported in bass response due to the changing seal would be much higher.

The point is: If Peter's article cautions us to take headphone measurements with more than just a grain of salt, my experience tells me we need to unscrew the top of the salt shaker.

Nonetheless, it's the best objective measure we've got, and you can rest assured that I will continue to try to improve my skills and methods as I perform these measurements.

Thanks for the article, Peter. I look forward to producing more data for your future number crunching sessions.


MacedonianHero's picture

See Tyll, from your final "Editor's Note", I've learned yet more about the intricacies of how sound travels and resonates in the human ear. It was a fun endeavor.

Hopefully others will not just use measurements, but also begin to trust their ears too. I think both are needed to truly evaluate gear.


Baka1969's picture

Great job Peter. It was fun going through it with you.

bluemonkeyflyer's picture

Well done explanation of the vagaries of headphone measurements, Macedonian Hero. Many thanks!


ultrabike's picture

EDIT: removed all my (unnecessary) comments.

Cool stuff Tyll and Peter... I can say that I did not know how sensitive headphones were to positioning prior to this.


Draygonn's picture

Well written, informative, and interesting. I finally found out what Six Sigma means!

MacedonianHero's picture

Thanks for the kind feedback.

The first page says it all about where 6 sigma came from. But there are many tools in the Six Sigma toolbox beyond that and Gage R&Rs.

Glad to illuminate the community on a subject I'm quite passionate about.


Maxvla's picture

Thanks Peter and Tyll for doing this. I had always suspected treble response above 10KHz on these types of graphs could not be blindly trusted. I'd love to see the results of the smoothing you mentioned, Tyll.

firev1's picture

that Tyll's measurement technique is being checked yet again; such a test of not only the headphones but also the measuring equipment makes for an interesting read. Cool that Macedonian Hero is a Lean Six Sigma Black Belt (for those that don't know, that means he is great at quality control/management).

Jazz Casual's picture

and read Tyll's headphone measurements with interest. : )

Frank I's picture

Nice job Peter. Very well done and a very good read. I enjoyed it thoroughly.

schalliol's picture

Amazing info, and it's great to see the collaborative nature of getting to the bottom of this.

svyr's picture

what about IEMs,not FS or on ear HP?

Shahrose's picture

Enjoyed the read. Nice job.

Amclaussen's picture

Recently I bought a set of Shure SRH-940 headphones that I found quite good overall for the price. Just after trying them at home, I instantly found they were notably sensitive to placement compared to my old Sennheisers, so I had to be careful not to judge their sound "signature" and jump to conclusions before finding the (then) elusive "sweet spot" placement on my ears. After three months of relaxed listening, I still enjoy them a lot, but now I'm careful to check they are "correctly" positioned and adjusted (headband and earcup rotations) so they "sound" at their best. This has taught me that sometimes one cannot simply reach a valid opinion on a certain model, because it happens to require more detailed or careful listening. As the first comment (5:15 pm) says: "Begin to trust your ears too". Several days later, I visited another store and carefully auditioned them with a Lehman Audio Black Cube Linear headphone amplifier, and found enough difference and improvement to decide to spend more than twice the 940's price... and I'm not as wealthy as I wish! Maybe the differences need to be analyzed and explained too, but I trusted my ears and continue to enjoy them more with the amplifier than with no amplification.

Now, the subject of placement (or more properly: insertion or coupling) of IEMs... I also own the Shure 535, and still cannot get the same precise sound every time due to their (in my ears) large variability. I am using silicone moulds made by a local hearing specialist because, for me at least, NONE of the supplied sleeves provided me with a satisfactory sound signature, degree of isolation, bass seal or the necessary comfort. I find them quite difficult to "set and forget", and much more variable than my old Shure E-1, which were very different in these respects: those old ones were so small and light that I was able to insert them far enough that the yellow foam sleeves could properly support them, get them perfectly sealed, and make them comfortable enough that I really forgot I was wearing them. (BTW, my best fit was with the earphone body upside down, that is, the LEFT one in the RIGHT ear and vice versa, with the cord over the ear.) In contrast, I find the 535s too bulky, heavy and cumbersome, to a degree that I miss the performance of the older ones, even though they had a more limited frequency response. Can somebody throw some light on this subject as applied to IEM placement?

Mkubota1's picture

...and I'm continually impressed by Tyll's efforts and transparency. Keep it up!

kongmw's picture

And hats off to Tyll. While the analysis confirms that Tyll's measurement scheme is solid at bass and mid range levels, it also reveals the uncertainty up in the treble region. It takes a man to post such an honest review about his own system's possible shortcomings.

purrin's picture

It's important not to jump to conclusions on the precision of Tyll's measurements in the treble. They may in fact be better than what is presented here.

This is related to what Tyll mentioned in his Editor's Note: certain types of measurement phenomena, the extreme peaks and dips, are artifacts of the measurement system. There are two issues here which need to be considered when interpreting the results:

1) Whether the extreme peaks and dips are erroneous data that should be discarded for purposes of determining precision. It is not an uncommon practice for pollsters (or other data gatherers) to discard what is obviously nonsensical data. In my experience with measurements, the extreme dips are always very suspect. I could go more in-depth into why this occurs, but that would be another subject.

2) Whether minor frequency shifts of peaks and dips should unnecessarily "punish" the precision of the system because the evaluation method used is one-dimensional, i.e. only changes in amplitude per specific frequency, but not frequency shifts, are taken into account.

For example, say measurement #1 has a peak of 7dB at 10kHz. Then measurement #2 has a peak of 7dB at 10.5kHz instead. The shifting of frequencies is not uncommon because of placement, or even ambient temperature/pressure, or voice coil temperature differences.

So to make a very simplified illustration: would it then be fair to say the measurement system is 5dB off at 10kHz AND 5dB off at 10.5kHz; or would it be more fair to say that the measurement system shifts the peak at 10kHz by only about 1/14 of an octave?

Just some food for thought.

As Tyll mentioned, maybe the analysis should be run on the data 1/3 or 1/6 octave smoothed to mitigate the effects of the two issues mentioned above. We would get more meaningful results. I would certainly be interested in seeing precision of the measurement system when the FR data is smoothed.

Which actually leads to a good argument that FR graphs should have at least some level of smoothing when presented for wide public consumption.

Tyll Hertsens's picture

Thanks Purrin, good observations and exactly my thoughts regarding smoothing once measurements start making it out to wider audiences.

MacedonianHero's picture

Smoothing is definitely worth trying out. Looking at the raw data; particularly the 2X Standard Deviation vs. frequency response (Regions 4 and 5), you can see that it's not just a "few peaks" causing it to rise, but rather a trend that is consistent across the frequency range in the treble region.

But then again, this isn't true "raw data" either as it's already smoothed somewhat as we averaged out the 5 headphone positions for each run.

That said, I'm keen on seeing the effects of different smoothing methodologies.

purrin's picture

Don't disagree with the trend of going up the band being less precise. This behavior is obvious from the get-go to those who have some experience taking the headphone measurements. We usually see the funkiest crap past 10kHz.

However, it's still sort of misleading to say the standard deviation in the treble region is 5dB, which is one heck of a lot, basically almost meaning unreliable. From a standard deviation vs. frequency graph POV, this statement is true. However, humans don't hear this way. And even a simple glance at the FR graphs for each measurement doesn't scream "all over the place, i.e. +/- 5dB." Again, we need to account for a second axis (allowing for minor frequency shifts in peaks).

Ideally the best way to measure precision with these graphs would be a 2D vector based system that identifies similar looking curves within a close enough threshold in 2D space, and then measures the deltas (both frequency and amplitude) of matching points of those curves among the FR plots.
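A much simpler stand-in for that idea, as a hedged Python sketch: instead of matching whole curves, it matches only the single highest peak in each measurement and reports how far it moved in frequency and in level. The function names and the five-point curves are invented for illustration; the numbers echo the 7dB peak at 10kHz versus 10.5kHz example from earlier in this thread.

```python
def find_peak(frequencies_hz, levels_db):
    """Return (frequency, level) of the highest point on one curve."""
    index = max(range(len(levels_db)), key=lambda i: levels_db[i])
    return frequencies_hz[index], levels_db[index]


def peak_shift(frequencies_hz, curve_a_db, curve_b_db):
    """Return (frequency delta in Hz, amplitude delta in dB) between the two peaks."""
    freq_a, level_a = find_peak(frequencies_hz, curve_a_db)
    freq_b, level_b = find_peak(frequencies_hz, curve_b_db)
    return freq_b - freq_a, level_b - level_a


# Made-up curves: the same 7 dB peak, moved from 10 kHz to 10.5 kHz.
frequencies_hz = [9000, 9500, 10000, 10500, 11000]
measurement_1 = [0.0, 2.0, 7.0, 2.0, 0.0]
measurement_2 = [0.0, 0.0, 2.0, 7.0, 2.0]

# One-dimensional view: amplitude difference at each fixed frequency.
amplitude_only = [abs(a - b) for a, b in zip(measurement_1, measurement_2)]

# Two-dimensional view: how far the peak itself moved.
shift_hz, shift_db = peak_shift(frequencies_hz, measurement_1, measurement_2)
```

With these made-up numbers the amplitude-only view reports a 5dB disagreement at both 10kHz and 10.5kHz, while the peak-matching view reports the same 7dB peak moved by 500Hz. A real version would need to match many peaks and nulls per curve, which this sketch does not attempt.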

Short of that, I'd like to see a re-crunching of the data with 1/6 and 1/3 octave smoothing using a rectangular function to reduce the influence of artifacts and take into account the frequency shifting phenomenon.
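For anyone who wants to try this at home, here is a rough sketch of fractional-octave smoothing with a rectangular (boxcar) window. It assumes the dB values are averaged directly and with equal weight, which is only one of several possible conventions; the function name `octave_smooth` and the sample curve are made up.

```python
from statistics import mean


def octave_smooth(frequencies_hz, levels_db, fraction=3):
    """Smooth a curve with a rectangular window 1/fraction octave wide.

    For each frequency point, every point within half the window width on
    either side (measured as a frequency ratio) is averaged with equal weight.
    """
    half_width_ratio = 2 ** (1 / (2 * fraction))
    smoothed = []
    for centre in frequencies_hz:
        low, high = centre / half_width_ratio, centre * half_width_ratio
        # The centre point always falls inside its own window, so it is never empty.
        window = [
            level
            for frequency, level in zip(frequencies_hz, levels_db)
            if low <= frequency <= high
        ]
        smoothed.append(mean(window))
    return smoothed


# Made-up treble data with one sharp peak (dB).
frequencies_hz = [9000, 9500, 10000, 10500, 11000]
levels_db = [0.0, 2.0, 7.0, 2.0, 0.0]

third_octave = octave_smooth(frequencies_hz, levels_db, fraction=3)
sixth_octave = octave_smooth(frequencies_hz, levels_db, fraction=6)
```

Running the 2X standard deviation analysis on curves smoothed this way, rather than on the unsmoothed ones, would be the re-crunching suggested above.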

ultrabike's picture

According to the source below, the smoothing should be 0.2 octave:

You guys may (may not) find these papers interesting (I know I do):

MacedonianHero's picture

After Tyll's natural smoothing (by taking the average of 5 different dummy head positions), it's more like an average 2X standard deviation of approximately 3.4dB from 8.4kHz and up. This is not bad IMO. But some newer to the hobby may look at two different headphone models and conclude that a difference of that size is meaningful, when statistically it's not. It's the average of the 5 different headphone positions that Tyll publishes here, not the "raw data" so to speak. Then you raise a very good question: is that what the human hears? Can they hear that?

I agree that other smoothing exercises should be looked at as a means to see how this can be reduced further in the treble region. I am also wondering what can be done physically to the setup to do this at the outset. Any ideas?

Currawong's picture

Thanks MH for the analysis. It's great to see everyone working together on getting more useful data for people to use in what is a complex subject.

Tyll: For positioning consistency, have you thought of doing something like sticking a couple of small pen lasers on the walls of the chamber pointing towards the middle of the dummy ears so that you can align either side more precisely with the head, or are there marks on it already that you can use?

Tyll Hertsens's picture

There are marks on the head already. The real problem is that from headphone to headphone you really don't know where the center should be.

ultrabike's picture

If you make a measurement of the headphone and then take the headphone off and on again, placing it as much as possible in the same spot as it was before, do you still get significant variations in the measurements? or are the variations due to the fact that you measure purposely on the 5 different locations?

If you were to take another set of 5 measurements and go through the usual process, how much variation do you get in the final measurements for the same can?

Tyll Hertsens's picture

I'm not sure if I get your question, but it sounds like you're asking me for exactly what's in the article.

ultrabike's picture

I guess I got kind of confused when you said "The real problem is that from headphone to headphone you really don't know where the center should be."

Based on your article and your reaction, the real problem is that you just can't get a consistent measurement at high frequencies EVEN if you knew "where the center should be."

Reticuli's picture

I think it's interesting the range that Tyll's measurements are most consistent in is the range used by I also wonder how much this relates to actual headphone listening and if perhaps we are mostly affected by this same range. That would mean we are most sensitive to just minor differences within the middle frequency spectrum: response, decay & distortion, transients, etc. It could also explain the occasional inferior and superior headphone listening moments with the exact same pair of headphones, associated equipment and source material, while not as important perhaps as a glass of wine, medications (some enhancing it, others deleterious), the amount of sound exposed to in recent days, or how much sleep we got the night before, still significant nonetheless.