Yes, yes, we have heard this one before.
Remember those tiny "Echo" devices collecting dust in your rooms?
Alexa governing our lives?
Siri’s ‘intelligence’ wiping out the very word ‘know’ from your brain by solving problems you didn’t know you had.
Well, didn’t happen, did it?
I digress here.
I recently attended a local meetup focused on LLMs and the audio models that are making waves. The intro was spent going down memory lane, back to the Amazon Echo, Siri, and other things that relied on voice commands in our early days.
We went over localization (we're in Poland, after all, and working with Polish is a problem in itself) and the common things we simply couldn't do before with the tech we had on hand back then.
But watching those demos, I couldn't stop thinking that this time it is actually going to work.
I will embed the video below; watch it if you happen to have some time, you will get the gist of it.
When those guys, who were running a live demo, spun up an audio agent to make an appointment with a fake dental service - well, it got us all.
The task was to understand Polish, ask questions, and finish the call with a visit booked at a local dentist. We had some good chuckles here and there, but overall, it got the job done.
An agent backed by a language model, working 24/7 in a very hard Slavic language for which we didn't have that many datasets. The outcome was surprisingly good, and we are going to see many of these in 2025.
p.s. I have heard of a small startup from Poland that does that for the hospitality industry. Check them out here.
In this post, I’d like to share a couple of things I have heard over the last 2-3 weeks related to audio:
A friend of mine mentioned a company from Germany called Bragi. I hadn't heard of it, so I did some research. Apparently, they are working on custom firmware for the chips used by headphone brands.
Presumably, it comes with some AI capabilities and their own emerging ecosystem of apps.
Imagine built-in abilities to hook your application up to your users' headphones and have them interact via voice commands. "Not exactly something new, is it?" you might say. Probably, but the idea is to have that experience burned into the headphones, with a direct link to the hardware. Think head movements, voice commands, and gestures that could be used to 'open' and manage apps.
On top of that, custom firmware on Qualcomm chips can mean a better approach to dynamic noise reduction (via anomaly detection and so on).
Their website lists Bose as their partner, which can mean a lot of interesting things.
The ‘new’ audio interface of ChatGPT is awesome. I was told it is pretty expensive (the API usage, not the UI thing half of us use), but there are startups out there trying to conquer that niche on their own.
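To get a feel for why "pretty expensive" matters here, a back-of-the-envelope sketch helps. The rates and call volumes below are made-up placeholders, not real pricing; the point is just how quickly per-minute billing compounds for an always-on audio agent.

```python
def monthly_audio_cost(calls_per_day: int,
                       avg_call_minutes: float,
                       rate_per_minute: float,
                       days: int = 30) -> float:
    """Rough monthly spend for an audio agent billed per minute of conversation."""
    return calls_per_day * avg_call_minutes * rate_per_minute * days

# 200 calls a day, 4 minutes each, at a hypothetical $0.10 per minute:
cost = monthly_audio_cost(200, 4.0, 0.10)
print(f"${cost:,.2f} per month")  # → $2,400.00 per month
```

At hypothetical numbers like these, a single mid-sized deployment already runs into thousands of dollars a month, which is exactly the margin those startups are chasing.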
This space will grow; there will be gaming and NSFW audio experiences, customer service agents, and a whole lot of scams. Unavoidable these days. But very exciting to see.
This will eat up lots of compute and cause all those companies to build even more nuclear reactors. This means:
More lobbying for "AI" in general by companies like Amazon and Google, who will need long-term commitments to make the play worth it.
Proxy conflicts and direct confrontations like the one we saw in Kazakhstan (evil tongues say it was all about control over uranium in KZ). But who am I to say.
The price of uranium will go up even more, and probably some 'supporting' industries with it; so if you are a player, it's time to place those bets on uranium. Russia will probably become a 'friend' soon, as they have lots of it.
Lots of other things I can't see yet.
Sorry, to some of you this might sound irrelevant, but this kind of modeling is something I can talk about for hours. You never know what will come out of that scenario planning.
p.s. Poland would benefit greatly from smaller modular reactors given our current threats and the absence of ANY nuclear plants whatsoever.
Probably the most popular LLM product of today is NotebookLM from Google.
You can watch Lenny’s podcast with the PM responsible for its launch here. I can't say it is useful; my general complaint about Lenny's podcast is that the episodes, while long, are not that useful. Because of the status of his guests and their employers, they can't share much, which leaves us with watery goo no one can digest. But that's a topic for another day.
Across the globe, corporate bigheads are scrambling to put something similar in place everywhere. Even when it makes no sense, the consensus is that AI makes products more sellable. Like with blockchain a couple of years back, only this time it is slightly more useful. If done right, that is.
NotebookLM allows you to interact with written files by turning them into audio experiences that don’t sound dry or boring. They have managed to convert those dusty papers into synthesized, dialogue-like audio. For us, this means:
The volume of content consumed will go up, and it will leak from every digital orifice there is. You can't make people read more, but you can make them listen; the battle for your ears is on. Companies will start converting their boring blogs into full-fledged podcasts. Some will even be good.
It might be a better solution for those of us unfortunate enough to have problems with sight. This one is quite fascinating, as these people were previously reliant on others recording for them. I haven't explored that space, so correct me if I am wrong.
Even more compute.
By the way, Facebook recently released an alternative, and that one is actually 'free' to use. I haven't tried it myself yet.
The last example in this journal of mine is the radio station from Kraków whose management went rogue. They basically tried to get rid of their contractors by creating 'AI personas' that would 'host' their shows.
It backfired, and we heard angry responses from the management, who blamed the media for spreading fake news. Among other things. Apparently, the experiment was put on hold shortly after.
Which sheds some light on important points we will have to tackle quite soon:
Immense volumes of audio and video will be produced, and it will be hard to identify the content that actually says something of value. Generic content generated by LLMs and voiced over will be widespread. Unlike text, where you can skim an article to tell what it is about, with audio we will need some kind of analyzer, though I don't know what it would look like.
The Overton window will shift for everything audio-video. You can make it play or voice pretty much anything you want, and the content will be shaped by the many people behind those voices.
There is no accountability, no responsibility there. Previously, the people behind those voices cared about their branding and their own image; there was some intent to stay within accepted boundaries. With AI voices, there will be no accountability, as you can make 'them' say whatever the new political line is. There is no human filter in it, not at the volume we will soon see.
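The "analyzer" mentioned in the first point above is anyone's guess, but a crude first pass could work on transcripts rather than raw audio: transcribe the clip with whatever speech-to-text you have, then score the text for the repetitiveness and low lexical variety that generic filler tends to show. A minimal sketch, with the scoring heuristic and threshold picked arbitrarily for illustration:

```python
def genericness_score(transcript: str) -> float:
    """Crude 0..1 score: 1 - (unique words / total words).
    Higher values suggest repetitive, low-information text."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)

def looks_generic(transcript: str, threshold: float = 0.5) -> bool:
    # The threshold is arbitrary; a real analyzer would be trained, not hand-tuned.
    return genericness_score(transcript) >= threshold

print(looks_generic("great great great content truly great content"))  # → True
print(looks_generic("a quick brown fox jumps over the lazy dog"))      # → False
```

A real system would obviously need far more signal than word repetition, but the shape (transcribe, then score the text) is probably where this starts.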
Overall, we are moving towards interesting things during our lifetimes with lots of adoption required.
And the majority will stumble in awe.