Big Sound 2015: The Structured Listening Experience

The Big Sound 2015 Structured Listening Experience.
As many of you know, and possibly find a little frustrating, I tend to straddle the objectivist/subjectivist fence. Some, I'm sure, will see this as a wishy-washy position and would like me to man up and take a stand. To others, it might just seem confusing and contradictory. Well, I'm going to try to clear all that up a bit today by revealing the structured listening tests I've developed for folks visiting Big Sound 2015, and my motives behind them.

Big Sound 2015 Goals
The primary goal of Big Sound 2015 is to put four brand new $1000+ headphones into the context of the existing market for headphones at this price. These headphones are: HiFiMAN HE-1000; Enigmacoustics Dharma; Mr. Speakers Ether; and Audio Zenith PMx2. There are lots of $1000+ headphones out there that might have qualified for comparison here on price, but few that I consider actual contenders. So, the pre-existing headphones in this category that will be used for comparison will be: Sennheiser HD 800; Audeze LCD-3 and LCD-X; Stax SR-009 and SR-007; JPS Labs Abyss AB1266. And a couple of odds and ends for reference: HD 650/600, and HE-500.

The second goal of the test will be to see if these new cans pair particularly well with certain amplifiers. To that end I've gathered a rather wide variety of amplifiers to play around with.

To accomplish these goals, I need to end up with a set of data inputs from the listeners that is statistically reducible to some meaningful numbers. I can't let people in to play willy-nilly for eight hours and then hand me some notes; I need a somewhat controlled procedure. (Don't worry, there will be plenty of play time for those attending.) And so, I've developed a highly structured listening session that lasts about two hours; all Big Sound attendees will perform this procedure at the beginning of their visit.

This test does two things: it will get everyone's critical listening skills tuned up for further listening; and, more importantly, it will teach people a new way of performing blind tests that I find far more sensitive than the way most people do it. There are a LOT of null results in blind listening tests, and I'm convinced it's because people are hanging out in their left brain when they should be listening with their right brain...but we'll get to that in a moment. First I'm going to describe the procedure.

The Big Sound 2015 Structured Listening Procedure

Worksheet for the Big Sound 2015 Structured Listening Procedure.
The first part of the procedure will be four blind listening tests, each done twice with two different headphones. In each test the goal is simply to identify the differing sources correctly. There will be preparatory time to play with the switching and become familiar with the various sounds prior to the measured portion of the test. If there are three choices in the test, you can get none right, one right, or all three right. If there are two choices, you either pick right or wrong on the first shot. The columns of numbers below each test title are where we mark the score. We'll go through each of the tests one by one, with the justification for each:

Comparing Bakoon, Simaudio, and Teton
These are all single-ended output (or will be used single-ended) headphone amps. Each is VERY different from the others. The Bakoon is a current-source output with a very high output impedance that will color the sound depending on the severity of the headphone's impedance swings. The Simaudio is a built-like-a-brick-shithouse powerhouse of a solid-state amp that drives anything and sounds like a fat pipe with gain; it's an insensitive beast and does its job no matter what you hook up to it. The TTVJ Teton is an exquisitely designed OTL tube amp with a lovely and gentle tube sound; its high output impedance will pair well with the HD 800 and should sound pretty lovely.

With the HD 800, the Bakoon is unmistakable due to the tonal changes; the Simaudio is just clear and clean; and the Teton exhibits some lush tube characteristics. Point is, for folks' first test, this one is going to be easy peasy.

With the HE-1000, however, its dead-flat impedance response will hardly interact at all with the Bakoon's high output impedance; the Simaudio isn't going to care; and the Teton will be a little underdamped by the low impedance of the HE-1000 and may become a little looser in the bass. So, this test will be a little harder.
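For the technically inclined, here's a rough back-of-the-envelope sketch of that impedance interaction. The numbers are ballpark assumptions on my part, not measurements from the rig: roughly 300 Ohms nominal rising to about 600 Ohms at the HD 800's bass resonance, a flat ~35 Ohms for the HE-1000, and a plain high-impedance voltage source standing in for the Bakoon's current-source output.

```python
# Ballpark sketch of how source impedance tilts a headphone's response.
# All impedance values below are assumptions for illustration only.
import math

def level_db(z_load, z_out):
    # Voltage-divider level at the headphone, in dB, relative to an ideal source.
    return 20 * math.log10(z_load / (z_load + z_out))

Z_OUT = 1000  # assumed very high source impedance, standing in for a current-source output

for name, z_nominal, z_peak in [("HD 800", 300, 600), ("HE-1000", 35, 35)]:
    tilt = level_db(z_peak, Z_OUT) - level_db(z_nominal, Z_OUT)
    print(f"{name}: roughly {tilt:.1f} dB response bump at the impedance peak")
```

With these assumed numbers you get a bump of about 4dB at the HD 800's impedance peak and essentially zero on the HE-1000, which is exactly why the first run of this test should be easy and the second run harder.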

Now, I picked the HD 800 and HE-1000 for most of these tests as I believe the HD 800 is still the king of resolution, and it's a great headphone; and the HE-1000 for not much other reason than that I've found it among the most pleasant of headphones to hear. But for the second test I'm switching things up a bit, just so folks don't get too settled into one way of hearing.

Black Widow, Burson Conductor Virtuoso, and Simaudio 430HA
The second test will have people trying to pick blind from three solid-state amps using the Enigmacoustics Dharma and the Mr. Speakers Ether. Amps in the test will be the Eddie Current Black Widow, Simaudio 430HA, and Burson Conductor Virtuoso. Each will be driven by the Yggdrasil DAC in unbalanced mode. This test will be quite a bit more difficult than the first, as the amps will behave more similarly. It's important to note here that I'm not trying to get at preferences folks might have with the gear; I'm simply verifying that people are hearing differences reliably.

Antelope vs. Yggdrasil
In this test we go back to the HD 800 audio microscope and the pleasing sound of the HE-1000. Currently there's quite a bit of chatter about the superiority (or not) of delta-sigma DACs vs. R2R DACs. I hear a lot of comments like, "OMG there's a world of difference, I'll never own another ESS DAC (a very popular delta-sigma DAC) again. The R2R DAC is just so clearly better." This test is in the mix in order to see just how strong the perceived differences between these DACs really are. Both the Yggy and the Antelope Platinum are very well regarded implementations of these two types of DACs. If, in the end, we see a very strong ability for folks to differentiate between these two DACs, then there may be something to the argument that they sound significantly different and that one might be strongly preferable over the other. But if we see significant difficulty for people in identifying the two DACs, we'll know that both methods produce similar results and preferences will be based on subtleties that should be carefully considered.

Cardas vs. Nordost
In this case, I'll be using a single source DAC and headphone amp that have two unbalanced outs and ins, and will be switching between two different high-end cables on the input of the headphone amp. Like the DAC test above, this test will try to tease out whether or not people can reliably hear the difference between cables, and what the statistical strength of that perceived difference is. Again, we're not looking for preference; we're looking for the ability to differentiate. Should we be able to show a statistically significant ability to differentiate, we should be able to put to rest some of the objectivist claims that cables simply don't make a difference.
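For the stats-minded, here's a minimal sketch of the sort of arithmetic involved in judging whether listeners beat chance. It treats every trial as an independent guess checked with a one-sided binomial test, which is a simplification of the real sessions, and the trial counts shown are made up purely for illustration.

```python
# Minimal sketch: probability of scoring at least this well by pure guessing.
# The trial counts below are hypothetical, not actual Big Sound results.
from math import comb

def p_value(successes, trials, p_chance):
    # One-sided binomial tail: chance of doing this well or better by luck alone.
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(successes, trials + 1))

print(p_value(14, 16, 0.5))    # two-way test (DACs or cables): ~0.002, hard to call luck
print(p_value(10, 16, 1 / 3))  # first guesses in a three-amp test: ~0.016
```

The lower the number, the less plausible it is that the listeners were simply guessing; p-values that stay large would be the null results blind tests are so famous for.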

EXTREMELY IMPORTANT! How to perform well in blind tests.
Over the last twenty years, I've been developing my personal approach to blind testing so that I can be a good critical listener. It really started at HeadRoom when our engineer would design our amplifier modules. At each step along the way, whether testing various topologies or deciding which brand of caps or op-amps to buy, we would blind test up to four modules at a time in a little jig purpose-built for this type of testing. Suffice it to say, sometimes the differences were quite subtle and difficult to identify...until I stumbled on what I call the subjective evaluative listening mode.

Most of us, when doing a blind test, try to listen like a measurement tool. Is the bass a little punchier? Is the treble slightly more resolving? Am I hearing a little hardness in the mid-range that might indicate distortion? Etc., etc. Now, there's value to this approach, but it's very difficult: mostly because acoustic memory is so poor, but also because brains really aren't Audio Precision testers, and these changes can be subtle enough to be very hard to keep track of. But there is an alternative.

Imagine you park your car in the alley off Main St. You get out and start walking toward the corner and you hear a drum and guitar. INSTANTLY, you know it's live. Before there's time for any judgment your brain tells you, "Hey! There's some dudes playing on Main St. Let's go have a listen." The left brain hasn't analyzed the sound and compared it to the characteristics of live vs. recorded music. No, that creative, intuitive right brain just knew in an instant what's up. This subjective right brain, brilliantly intuitive and observant, rarely gets the chance to participate in blind tests because the left brain is saying, "Hey! This is a test and an evaluation and that's my damned job." Well, not so fast, left brain; the right brain can be every bit the genius you are...and maybe more.

Objective vs. Subjective Blind Testing
First, we need to start with the understanding that the goal is differentiation, not analysis. We're trying to reliably differentiate A and B, not necessarily analyze the differences. (But we will get to a bit of that.)

In objective blind testing we are actively focused on the music and on identifying characteristics of the sound. With subjective listening we don't care about the sound itself at all, but about the effect it has on our right-brained being. With subjective listening you simply relax, sit back, and enjoy the music in as non-judgmental a way as possible. And then, like a little church mouse, you bring out your inner observer and note how you're feeling about the music. Nothing about the music itself, but rather things like "I feel smooth and comfy," "it makes my heart leap," "I'm bored," "I'm annoyed," etc. In my experience, this method is far more sensitive than objective blind testing, but few practice it. (Though I will say that over the years I have met a number of industry people who do it too, so it's not my invention per se.)

I will be spending a little time training folks on this method for the listening sessions, in hopes of increasing their skills a bit and getting better results. As they go through the four blind tests at the beginning of the procedure, I hope folks will be able to switch over to this new way of listening and be more successful with the more subtle tests.

Evaluating Headphones
After the four blind tests, having sharpened up our listening skills, we'll proceed to a more open-ended session of trying out all of these world-class headphones. The basic idea is to pick a headphone, play around with it on various amps, and come to a conclusion about how much you like it and which amps sounded particularly good with it. Participants will be asked to draw a line from each headphone to three of the amps they liked with it, and to rank their top three favorite headphones. As long as we're on the road to getting that done, participants will also be able to reconfigure DACs and amps to their hearts' content, possibly configuring an end-game sound for themselves...though that won't be recorded as part of the official Big Sound results. They are certainly welcome to add that narrative to any written text they'd like to submit to me for publishing on InnerFidelity. I hope many of you avail yourselves of that opportunity.

Data Reduction
When all is said and done (currently Sept 17 is the last visitor day), I will reduce all the data and publish the results. No names will be associated with the acquired data; all individual scores will remain confidential. I will ask folks to write their names on the sheets, but that's because I'll have a few of the headphone manufacturers up here, and I'll have to throw out their headphone preference ratings.
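To give a sense of what that reduction might look like for the headphone rankings, here's a sketch of a simple Borda-style points tally. This is just one plausible approach rather than a commitment to a final method, and the ballots shown are made up for illustration.

```python
# One plausible reduction of top-three headphone rankings: a Borda-style tally.
# The ballots here are invented examples, not real Big Sound data.
from collections import Counter

POINTS = {1: 3, 2: 2, 3: 1}  # assumed weighting: 3 points for a first-place vote, etc.

def tally(ballots):
    # Each ballot is a (first, second, third) tuple of headphone names.
    scores = Counter()
    for ballot in ballots:
        for rank, headphone in enumerate(ballot, start=1):
            scores[headphone] += POINTS[rank]
    return scores.most_common()

ballots = [("HE-1000", "HD 800", "Ether"),
           ("SR-009", "HE-1000", "HD 800")]
print(tally(ballots))  # [('HE-1000', 5), ('HD 800', 3), ('SR-009', 3), ('Ether', 1)]
```

Whatever the exact method, the point is the same: individual sheets stay confidential, and only the aggregated numbers get published.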

During the course of the testing I will not be revealing any of my personal preferences, so as not to bias results. After the data reduction I will begin a series of articles on my impressions and where I think this gear fits in the pantheon of state-of-the-art headphones. I suspect that will run into early October.

During the course of testing I will be posting short articles with a video of each participant, giving them a chance to talk about their experience for the day. And they are, of course, invited to write a longer article about their experience, which I'll gladly publish as a stand-alone piece on InnerFidelity. (Just be forewarned: if you'd like to do this, I'll need two or three photos you take for the piece.)

Alrighty, there you have it. The system is currently up and running, and I expect JK-47 this Wednesday, Jonathan M. on Thursday, and Bob Katz coming in Friday night for a weekend of fun! Doors open at 9AM, guys; don't be late, the room gets hot by the end of the day!

COMMENTS
Three Toes of Fury's picture

Outstanding write-up, Tyll....love the plan of attack and the different methods of gathering quantifiable data. Welcome back, dude, and best of luck to you and your rag-tag-band-o-testers. Can't wait to see the results, thoughts, and suggestions.

Peace .n. Living in Stereo

3ToF

Seth195208's picture

.

tony's picture

or Professor Floyd Toole doing a bit of ghost writing ???

Geez, I love it.

This is the most impressive thing I've come across in Headphone discussions, yet.

Probably all of us are doing this sort of thing in our own amateur way of buying ( based on best guesses and peer enthusiasms ).

I know this whole project will Sum well, we may even be reading a prediction, scribbled between the above lines.

At my Esoteric Audio ( back in the 1980s ) we did these kinds of tests, using the "man on the street", for our Product Line selections and later for Customer's education in making useful decisions.

I realize ( am sure ) that setting up Big Sound 2015 is one hell of a lot of work. ( is that what knocked you into a hospital bed? )

But, it's work that "sticks" insomuch that it creates a Lab capable of repeatable, duplicatable, verifiable data sets and testing regimens.

Assembling all this seems a natural progression, especially for our Tyll and InnerFidelity.

ABSOLUTELY, I'll smile at the Adverts knowing that this Science is being well funded and I'll thank any and all of em for their support. ( in a way they'll like ).

I've purchased a wide range of headphone gear; the gear I have today survived on its merits; it's all gear that Tyll wrote about and accurately described.

My own impulse buys [ from meet listenings ] are mostly sold-off, although I have an eye for a nice Tube amp i.e. Valhalla 2 and/or Bottlehead Crack double-Uber!

I'm thrilled to be reading this report and delighted to not be reading a note from JA about the unexpected & sad loss of our man.

And it's comforting to know that the Headphone Industry has enough integrity to support Quality Journalism at a time when I was feeling a Glossy Mag. focus on $30,000 vinyl record players.

Phew, now all I have to worry about is Donald Trump getting into the White House!

Tony in Michigan

ps. oh, almost forgot, dear Tyll, can you send me a handful of the Hospital-grade Happy Pills? That is, if you have any left over.

John Grandberg's picture

Hey, I don't see the HeadAmp GS-X mk2 or the Violectric V281 on that worksheet. Any particular reason?

Also, if the Burson makes it in the "amp" field, why not the Hugo TT or the Antelope? The Hugo TT doesn't have analog inputs so maybe that explains its absence, but the Antelope does have analog ins. Not that we need an even MORE complex arrangement, but it does seem like a question owners of those devices would have: should I bother with an external amp?

Keep up the good work though, looks like quite a fun process!

tony's picture

All good points, but how to do all Amps?

Katz sayz he's bringing his creation so maybe there will be a little bit of Anarchy up there in Snow Land.

Personally, I wanna know "why" that Ape ( in the Ad ) likes the Bifrost/Lyr2 combo ( he could've had the GS-X or the V281 if he wanted, couldn't he ?) and I wanna know if he's gonna get the Bifrost Multi upgrade.

Tony in Michigan

ps. I'll betcha the super expensive Antelope isn't in the running for most folks ( it's over $10,000 isn't it ? ), with a $5,000 Atomic Clock, phew. But, I'm anxious to learn if the Yggy can keep up with it.

I'dve liked to see the MSB Analog, too.

Tyll Hertsens's picture
Must be my IV antibiotic haze showing through. I'll add the Violectric, and I did get in touch with Peter, who's sending the GS-X; it will likely be here tomorrow. Yeah, I just ran out of room to include the Hugos, though I did try to contact them numerous times early on. Didn't get word back. So, I'll fix the worksheet today. Thanks for keeping track for me, John.
Jazz Casual's picture

Welcome back Tyll. I trust that you're on the mend and not overdoing it. Looking forward to seeing what this methodical process reveals.

ab_ba's picture

Yay Tyll! It’s great to see you’re back at it, and with gusto. I hope you’re feeling good too.

This testing procedure sounds thrilling, and a great way to straddle the objective/subjective divide: let’s listen subjectively, but then be objective about our judgements. Awesome.

One big difference between “left brain” and “right brain” thinking (whatever the actual neural substrate turns out to be) is that left-brain judgements are usually accompanied with high confidence, even if they are wrong. Left brain is like my boss: always confident, sometimes right. With right-brain judgements, people feel like they are just guessing, but they’re still doing better than chance. It’s hard to quantify confidence, but maybe for funsies afterward ask people how much confidence they had in their judgements. I would not be surprised if with DACs and cables, confidence goes way down but accuracy stays fairly high.

tashlin's picture

Wow - really impressed with the level of thought that has gone into this exercise. It should provide some high quality, reliable conclusions and comparative data (over and above the usual measurement data) that is very often lacking in audio-related reviews (i.e. in all publications/boards - not a criticism of InnerFidelity)

The Violectric V281 and the Hugo TT were included in your original list for Big Sound 2015 but don't seem to have been included in these blind tests. As John Grandberg asked above, I'm just wondering if there is a particular reason for this? (I am considering purchasing both to upgrade my Audio-GD NFB10, but am holding off until I find out more about how they compare to the alternatives, and had therefore been eagerly anticipating the results of this project!)

Also very interested to hear how the HE-1000 compares to a properly driven HE-6 (which I believe the V281 is capable of doing).

tashlin's picture

This is the only way I have ever been able to successfully tell the difference between different file types in blind tests, i.e. 320kbps MP3 vs FLAC. If I think about it too much I get hopelessly confused and never do much better than 50% in an AB (or ABX). If I just sit back and let the music wash over me, it's still not perfect, but I tend to get that 'hairs standing up on the back of the neck' feeling (and other emotional clichés) with FLAC files much more than I ever get with MP3s, no matter how high the bit rate!

castleofargh's picture

as long as it's blind and has statistical significance, you could get your results by licking the driver and I'll be fine with that.^_^

what you talk about is more about focus than test methodology. no rule says what we can or can't focus on while doing a blind test, so go ahead. if we focus on a crappy cue, the stats will reveal we did it wrong. and only when we fail in all cases can we start thinking about what the null result means for us.

I do have a small objective checklist going in sync with parts of my test tracks like anybody else, because I know the passage is good at revealing a particular thing. but TBH most of the time I just close my eyes and focus on a point on my body (I like the back of my head somehow), trying to think about nothing. that's how I get most differences TBH, then I try an ABX to see if the stats confirm my cue or if I was full of crap (often the case). and only then do I attempt to decide which one I preferred.

Tyll Hertsens's picture
...that sounds about right to me.
veggieboy2001's picture

...I was wondering what it would be like to put some generic cables in there with the Cardas & the Nordost...would people be able to hear the differences at all??

Just a thought.

And a big Congrats to Hans030390.... Have a great trip Mate! I look forward to your (& everyone else's) impressions!!

moshen's picture

Really love what's being done here. The objectivist/subjectivist methods here are going to set the benchmark for everyone else in terms of testing.

It'd be great to throw the cheap O2 amp in there. As it's been well measured and is a popular budget amp, it'd be great to know where it stands.

Long time listener's picture

One thing that might help listeners accurately pinpoint differences in sound is making sure that they are listening to music that is familiar to them--not to the person conducting the experiment. If you have to process unfamiliar sounds and textures at the same time that you're trying to process headphone/amp differences, it may be too much.

I keep returning to certain pieces that I've found to be very revealing of, say, upper-midrange hardness, or low bass solidity, or treble smoothness. I always turn to them to get a quick idea of whether a new headphone meets my particular requirements. I suppose every serious headphone audiophile does the same thing.

Impulse's picture

This might be one of the few articles ever about TotL gear that I'm really looking forward to... Don't get me wrong, I usually read that kinda thing regardless, but the methodology and comprehensiveness on display here is really gonna set it apart.

money4me247's picture

I think it would be even better if you didn't tell them what options they are testing for (aka, don't tell them whether they are listening to different amps or cables or DACs, and definitely don't tell them the brand names of the gear they are using).

This way you remove biases for certain things, e.g. some people who don't believe in cable differences may not try as hard to identify differences when doing a cable test, etc.

There should also be a control trial where the correct answer is that both presentations are the same, to see if listeners correctly identify same-vs-same as well as same-vs-different.

That type of same-same control will allow verification that the perceived differences are due to the different gear, and that people aren't incorrectly guessing "different" when there is no change.

Tonmeister's picture

Kudos to Tyll for tackling such a challenging and controversial topic, and for taking a more rigorous approach to subjective evaluation.

I have some questions and suggestions that may help improve the sensitivity and validity of the tests and their conclusions.

1. For the first four listening tests, will there be any level matching done to prevent listeners from identifying headphone amplifiers, cables, or D/A converters by level differences? Will you control the absolute playback level in addition to the relative level?

2. Will the presentation order of the test objects in the four tests be randomized to prevent listeners from identifying test objects based on order? Will this be done using test software?

3. Will all listeners complete all four tests or will subjects be assigned to different tests? My concern is the length of the four tests (2 hours?) and how fatigue may be a factor. Also, you might consider randomizing the order in which the four tests are completed, if this is practical.

4. How will listeners identify which test object is which during the testing phase? It seems like this approach involves a lot more cognitive load than an A/B or ABX test, where listeners simply say whether two test objects are the same or different, or whether X is A or B. Will listeners be given any feedback on their responses?

5. What about program selection? The music selections will probably interact with headphone/amp identifications and preferences, so this variable should be well controlled as well.

6. In your second test, where listeners rank their top three choices (headphone + amp combinations) from a large selection of headphone/amp choices, it appears that the tests are sighted. How will you know their preferences aren't influenced by brand, price, expectation bias, etc.? Of course, this requires control of all of the same nuisance variables (level matching, presentation order randomization, etc.).

Cheers
Sean Olive
Director Acoustic Research
Harman International

Tyll Hertsens's picture
1) Yes, I do pretty stringent level matching.

2) Test object order is randomized. I have patch cables from the amps and extensions to the switch box. So, between tests, I pull all the cables apart and scramble them up in my hands and then reconnect. So, I don't even know which is which.

3) Tough one; fatigue can be a problem for sure. I think listeners will sort themselves out in a fairly obvious manner. Yesterday's participant worked hard on the first test (this was his first blind listening experience) and actually did pretty well. Once we got to the second test, it was significantly harder, and it became obvious in not too many trials that he'd reached his limit, so we called it a day and moved on to playing with the various headphones. While I'm very interested in the blind test results, I'm really more interested in the person having the blind experience going in: it provides a no-bullshit referent for the kinds of differences we're really talking about here, and for the nature of both critical listening and the experience of being presented music. A good warm-up exercise for further subjective listening.

4) Once the test starts all guesses are scored. When listening to three amps, they come to a conclusion about which one they're listening to, and then turn off that amp. If the music goes away, they're right. If it doesn't, they're marked for a miss, and get two more guesses.

5) I have a series of 10-20 second looping clips of various types. One grouping is for wideband response---music with information in all octaves---and generally I'll let the listener select one of those clips.

6) Yup. No controls on that one, just people's opinions. But, for what it's worth, yesterday's participant, after having such a hard time telling amps apart, said it didn't bother him a bit to flit around the room listening to the various headphones driven by all sorts of different amps. He knew that as long as the amps were well made they'd only provide modest differences, and that the differences between the headphones themselves were the biggest factor.

I do think, however, that we will be getting some seriously good listeners through here, and they will have things to say about the amps.

Say, I'd love to see Todd up here. Any chance of that?

Tonmeister's picture

Thanks for your quick responses.

1. Glad to hear you're doing level matching. It should be straightforward for amps and cables, as their frequency responses are generally similar if well designed. Matching levels of headphones can be more challenging, even ignoring leakage effects that vary among listeners depending on fit.

2. Excellent. You are going to be one exhausted tester doing all that cable swapping. No wonder you are losing weight as someone pointed out :)

3. There has been some research by Soren Bech showing that listener performance in loudspeaker tests gets worse after 30-40 minutes. As the task gets harder and the audible differences get smaller, I'm sure performance declines even sooner.

4. That makes sense.

5. Good. Short loops work best, as they let listeners compare test objects with the same signal content. The new ITU-R BS.1534-2 (see section 5.1) actually says that test loops should be 10-12 s long and no longer. http://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-2-201406-I!!PDF...

6. I think informal tests are fine as long as the limitations of the test are noted when drawing conclusions.

Don't know if Todd can attend but I can ask him... It would be fun to participate.

Charles Hansen's picture

Hi Tyll,

Loved your write-up and the ambitious plan! That "rest" must have supercharged you! Super cool stuff... I just want to warn you about a massive red herring in your proposed "R-2R" versus "Delta-Sigma" DAC test.

There are about a bazillion different variables between these two DAC boxes besides just their DAC chips:

1) For starters, no R-2R DAC chip has a built-in digital filter. A few high-end delta-sigma DAC chips allow their built-in digital filter to be bypassed, but there are only a handful of products in the world that do this. You will be comparing digital filters as much as or more than DAC converter technology.

2) There are another half-dozen features in the ESS DAC chip that can be either used or bypassed besides the digital filter itself. Depending on the model, there is typically a volume control (easily bypassed), an IIR low-pass filter (rarely bypassed), a "Jitter Eliminator" (rarely bypassed), and a bunch of different register settings that all affect the sound quality.

3) The analog circuitry of each DAC is completely different.

4) The circuitry that recovers a digital signal from the source and feeds it to the chip is completely different.

5) The power supplies for both DACs are completely different.

6) The master clock oscillators are completely different.

7) The parts quality and brands of parts are completely different.

8) And on and on and on and on, all the way down to the feet used.

When we upgraded the Ayre QB-9 to the QB-9 DSD, we made various changes to the product, ranging from changing resistor values to replacing the Burr-Brown delta-sigma DAC chip with the ESS DAC chip. (In both cases we used the identical external digital filter.) Even keeping everything else the same, there were clear differences in sound quality between the B-B and ESS delta-sigma DACs.

The bottom line is that when you hear a difference, it will likely have almost nothing to do with the particular converter technology used, so don't be too hasty in drawing conclusions when there are so many variables involved. The cable tests will have much greater validity, as while there are clearly many, many differences between the two cables selected, they are basically unmodifiable "black boxes" to anybody except the manufacturers themselves.

Cheers,
Charles Hansen
