Clubhouse’s First Mover Disadvantage

When “Social” Audio Stopped Being Enough

Evan Kirkham
8 min read · May 4, 2021

The growth is slowing, the retention is slipping, and when I open Clubhouse (after a two-month detox), I’m left choosing between rooms titled “How to PITCH VCs and Angel Investors,” “How to Moderate Like a Pro on Clubhouse,” “TALK O’ TUESDAY,” and “We all about the money over here” (whatever that means).

Eeny, meeny, miny, moe — a random selection. There it is, that quick dopamine hit that comes from live audio immediately piping through your earbuds. It’s a good feeling, but it fades. What next?

Two ways to play this game: listen or be heard. I choose the former. I could probably learn a thing or two about “how to PITCH VCs,” and the moderator does have 10k+ followers, a souped-up profile picture, and what appears to be a professionally built bio chock-full of emojis, fancy paragraph breaks, and a Community cell phone number. This guy is legit.

Five minutes go by and I haven’t heard from the moderator yet. Instead, some first-time-founder is droning on and on, promising that he’ll “get to his question [as soon as he finishes his point].” He never gets to the question.

New strategy: I’m getting on stage. I figure that I’ve been at the fundraising game for a few months, have picked up a few do’s and don’ts, and could probably benefit from exposure to the audience members still riding the pine. I raise my hand. I wait… And wait…

When I get to the front of the imaginary (left-to-right, up-to-down) queue, my heart starts racing.

Now I’m unmuted and talking. But wait, I’m suddenly out of content. I’m floundering. Is the moderator still there? Is anyone listening? Wait, am I the “first-time-founder?”

As I’m desperately looking for a lifeline, I catch a glimpse of the room title: “How to PITCH VCs and Angel Investors.” Not helpful… I’ve run out of generic content.

My follower count doesn’t move, my ego is bruised, and I’ve learned nothing. I force-close the app. When I go to delete the app I realize that they’ve changed the app icon again. Damn that’s cool. I make sure notifications are turned off and move the app into a folder titled “miscellaneous.” Maybe next quarter.

I’m not alone in my experience. And no new strategy or subsequent round of financing will solve the problem Clubhouse is experiencing.

“Social Audio” suffers from three inescapable limitations that can only be remedied by reimagining the creator and consumer’s relationship to the audio content.

Social Audio: Urgent, Shallow, Prepared

Contextual Audio: Evergreen, Deep, Improvised

Urgency vs. Evergreen

There is probably nothing cooler than jumping into a room with Elon Musk at the perfect moment and hearing him implore the audience to buy Dogecoin or to boycott Robinhood. You were there, you heard it live! But this same “get-it-while-it’s-hot” feature is also a major bug. The reality is, you almost never get-it-while-it’s-hot. During a two-hour discussion, the room might only be “hot” (excuse the over-extended analogy) for all of five minutes. These moments are easily and consistently missed. Maybe you tuned in late. Maybe you dropped out early. Maybe you didn’t even know the room existed. Maybe you grew tired of the “first-time-founder’s” ramblings and muted the audio temporarily. Maybe the room never got hot. Maybe you just wasted two hours of your life.

This is the urgency problem. It’s a naturally occurring characteristic of synchronous (not “synchronized”) audio, and it cannot be solved by adding the ability to record the room, as has been tried by CH and many of its clones. Recording the room destroys the magic of synchronous content and produces only sub-par asynchronous content.

When rooms are recorded, creators stop inviting guests on stage (or are paralyzingly cautious about guest appearances) as they become increasingly concerned with the content they’ll ship to 100,000 people after the room closes, rather than engaging with the 1,000 people currently in the room. And still, the recorded product is necessarily worse than a studio-recorded podcast. The only reason you put up with “first-time-founder” in the live room was because you had a chance of getting live content while it was hot. Once recorded, the content loses all of its heat. If you have to listen to “first-time-founder’s” recorded diatribe, you’re out.

The way to solve the urgency problem while maintaining the magic of synchronous audio is to imbue the experience with context (some shared experience synchronized with but extending beyond the creator’s audio) so that no matter when you join the conversation you know what the speaker is talking about, why they’re talking about it, and how it relates to you and your opinion about the shared experience.

The content is always hot, or at least significantly warmer, even while you’re waiting for the “Dogecoin moment.”

Compare Clubhouse (Urgent) to Twitch (Evergreen). On Clubhouse, you join a room and have no context for where the conversation has been, where it is going, or when the gold nugget is going to drop. Meanwhile, on Twitch, it doesn’t matter when you join the stream: you’re immediately presented with shared context (the game being played), and you understand what the speaker is talking about (the game you’re watching), where the conversation is going (as you watch the game unfold), and when the big moment is going to happen (2 minutes left in the NBA 2K matchup). You’re not waiting, you’re enjoying.

Moreover, when coupled with some externally established context, the content schedule becomes far more predictable (When is Oprah hopping on CH? vs. When is Monday Night Football?). Even better, the hot moments are even hotter: if you weren’t present and listening while the audio and the video were synchronized, you actually missed the moment. Plain and simple. The audio and the video might have been recorded, but so long as they are from separate sources, the synchronization is lost.

Contextual Audio is not a live audio podcast replacement; it’s a truly shared one-time experience.

Prepared vs. Improvised

Let’s go back to my recent experience in the “How to Pitch VCs” room. Why did I run out of content close-to-immediately? Well, it’s not only because I’m a rookie founder, it’s also because I had nothing to go off of. What I had to say was largely rehearsed. There was nothing for me to react to. And, when I floundered around for something to say, there it was again, the room title — some generic topic, nothing helpful.

This is perhaps the most significant difference between Social Audio and Contextual Audio platforms. The preparation routine for Social Audio creators (room moderators) is very much like that of podcast hosts. Creators decide on a topic that they can speak about for 2 or more hours every week, settle on a few subtopics to talk about that same day, and spend the first 15 or 20 minutes talking through a script. After that, the creator fires up the questions-queue and hopes that somebody in the audience has something insightful to say that will spark a subsequent conversation. While the barriers to creation are substantially lower than in podcasting (no post-production, somewhat free-flowing conversation, etc.), they are actually quite high when compared to Contextual Audio.

Contextual Audio requires no preparation because, as explained below, the creator can simply react to what they’re seeing. It’s the difference between delivering a pre-rehearsed joke and making fun of something you see on TV. Maybe the rehearsed joke is more polished, but it may or may not be as funny, and, for a number of people, it’s frightening to deliver. For this reason, Contextual Audio platforms promote substantially more content creation.

In an industry where content-creation is key, lowering the creator’s barrier to entry positions Contextual Audio platforms favorably against Social Audio platforms.

Unlayered vs. Layered Content

Social Audio platforms produce “unlayered” content. What you hear is what you get — end of story. Contextual audio platforms produce “layered” content. What you hear is augmented by what you see. The former is a shallower experience, while the latter provides for a significantly deeper and richer connection between the creator and the consumer.

Imagine for a moment that you are in a Clubhouse room and you hear the moderator emote: “Woah! I can’t believe that.” Immediately, the entire audience is left wondering, “can’t believe what?” The emotion is lost because the context is hidden.

Conversely, imagine that the creator and the consumer are both watching an NBA Finals game (in their respective homes), and the creator yells “Woah! I can’t believe that.” Because of the shared visual context (the TV set), the consumer has a point of reference (LeBron’s huge dunk), fully understands the creator’s emotion, and can begin to form her own opinions about the shared experience. Gold nugget!

But this layered experience doesn’t necessarily have to exist on your TV (external context). It may very well be presented in-app (internal context). For instance, imagine that at the same moment you hear the creator say “Woah! I can’t believe that,” you see LeBron James dunk the basketball and his stat line update on your phone. How much more dynamic is that than being left wondering what the moderator meant when he said “Woah!?”

Social Audio is unlayered. Contextual Audio is layered. What you hear is only part of what you get from the experience.

Still, the layering of content under the Contextual Audio model means that the creator is not the sole source of entertainment. Even if the creator is excruciatingly boring, self-aggrandizing, or even dead silent, the consumer is still partially entertained by the internal/external context — the game is still unfolding in front of the consumer’s eyes.

This model has been exceedingly successful on platforms like Twitch where, even if the streamer is sub-par, the consumer is still engaged because of contextual complements (game visuals, chat, leaderboards, etc.).

Concluding Thoughts

Every live audio platform will have to decide whether to challenge Clubhouse and its clones for a share of the Social Audio space or to explore a live audio approach founded on shared context.

For many existing platforms, this transition will be exceedingly difficult, if not impossible. Technology aside, the transition will be difficult, at least in part, because users have deeply ingrained preconceptions about the subject matter offerings of particular platforms. For example, LinkedIn is and will always be seen as a platform for professional networking. Its live audio offering will be viewed as related. Similarly, Clubhouse has been typecast as a platform for self-help, self-promotion, marketing, culture, crypto, and investing. So on and so forth. These preconceptions severely limit a platform’s ability to evolve from a Social Audio platform into a Contextual Audio platform because not every content category is compatible with Contextual Audio. For example, beyond the creator’s audio, what could be shared in real time between the creator and consumer related to professional networking?

Contextual Audio platforms are more or less robust depending on the content category.

Social Audio platforms are interested in breadth of content: talk about whatever you want, whenever you want, with whoever you want, for as long as you want. Clubhouse, Twitter, Facebook, and Spotify are all competing to become the YouTube of audio. They don’t seem to be interested in going deep on any particular vertical.

The question is, who will become the Twitch of live audio — carving out a massive vertical and coupling the audio with shared context?

The future is not just social, it’s contextual.
