This is a very cool demo - if you dig deeper there’s a clip of them having a “blind” AI talk to another AI with live camera input to ask it to explain what it’s seeing. Then they, together, sing a song about what they’re looking at, alternating each line, and rhyming with one another. Given all of the isolated capabilities of AI, this isn’t particularly surprising, but seeing it all work together in real time is pretty incredible.
But it’s not scary. It’s… marvelous, cringey, uncomfortable, awe-inspiring. What’s scary is not what AI can currently do, but what we expect from it. Can it do math yet? Can it play chess? Can it write entire apps from scratch? Can it just do my entire job for me?
We’re moving toward a world where every job will be modeled, and you’ll either be an AI owner, a model architect, an agent/hardware engineer, a technician, or just... training data.
> We’re moving toward a world where every job will be modeled
After an OpenAI launch, I think it's important to take one's feelings about the future impact of the technology with a HUGE grain of salt. OpenAI are masters of hype. They have been generating hype for years now, yet the real-world impacts remain modest so far.
Do you remember when they teased GPT-2 as "too dangerous" for public access? I do. Yet we now have Llama 3 in the wild, which even at the smaller 8B size is about as powerful as the [edit: 6/13/23] GPT-4 release.
As someone pointed out elsewhere in the comments, a logistic curve looks exponential in the beginning, before it approaches saturation. Yet, logistic curves are more common, especially in ML. I think it's interesting that GPT-4o doesn't show much of an improvement in "reasoning" strength.
A Google search for practically any long-tail keywords will reveal that LLMs have already had a very significant impact. DuckDuckGo has suffered even more. Social media is absolutely lousy with AI-powered fraud of varying degrees of sophistication.
It's glib to dismiss safety concerns because we haven't all turned into paperclips yet. LLMs and image gen models are having real effects now.
We're already at a point where AI can generate text and images that will fool a lot of people a lot of the time. For every college-educated young person smugly pointing out that they aren't fooled by an image with six-fingered hands, there are far more people who had marginal media literacy to begin with and are now almost defenceless against a tidal wave of hyper-scalable deception.
We're already at a point where we're counselling elders to ignore late-night messages from people claiming to be a relative in need of an urgent wire transfer. What defences do we have when an LLM will be able to have a completely fluent, natural-sounding conversation in someone else's voice? I'm not confident that I'd be able to distinguish GPT-4o from a human speaker in the best of circumstances and I'm almost certain that I could be fooled if I'm hurried, distracted, sleep deprived or otherwise impaired.
Regardless of any future impacts on the labour market or any hypothesised X-risks, I think we should be very worried about the immediate risks to trust and social cohesion. An awful lot of people are turning into paranoid weirdos at the moment and I don't particularly blame them, but I can see things getting seriously ugly if we can't abate that trend.
I think humankind has managed massive shifts in what and who you could trust several times before.
We went from living in villages where everyone knew each other to living in big cities where almost everyone is a stranger.
We went from photos being relatively reliable evidence to digital photography where anyone can fake almost anything and even the line between faking and improving is blurred.
We went from mass distribution of media being a massive capital expenditure that only big publishers could afford to something that is free and anonymous for everyone.
We went from a tiny number of people in close proximity being able to initiate a conversation with us to being reachable to everyone who could dial a phone number or send an email message.
Each of these transitions caused big problems. None of these problems were ever completely solved. But each time we found mitigations that limit the impact of any misuse.
I see the current AI wave as yet another step away from trusting superficial appearances to a world that requires more formal authentication protocols.
Passports were introduced long ago but never properly transitioned into the digital world. Using some unsigned PDF allegedly representing a utility bill as proof of address seems questionable as well. And the way in which social security numbers are used for authentication in the US is nothing short of bizarre.
So I think there is some very low-hanging fruit in terms of authentication and digital signatures. We have all the tools to deal with the trust issues caused by generative AI. We just have to use them.
We’ve reached a stage where it would be advisable not to release recent photos of yourself, nor any video or sound clips, to the public, unless you want an AI-faked instaperson of yourself to start reaching out to members of your externally visible social network, asking for money, emergency help, etc.
I guess we need an AI secretary to take all phone calls from now on (the spam folder will become a lot more interesting, with celebrity phone calls, your dead relative phoning you, etc.).
Hopefully, we will soon enter the stage where nobody believes anything they see anymore. Then, you no longer have to be afraid of being misinterpreted, because nobody is listening anymore anyway. Great time to be alive!
I don't know why anyone thinks this will happen. You can obviously write anything you want (we have an entire realm of works in this area that everyone knows about, fiction) and yet huge amounts of people believe passed around stories either from bad or faked media sources or entirely unsourced.
Luckily there’s a “solution” to that: Just don’t use the internet for dialogue anymore.
As someone who grew up with late-90’s internet culture and has seen all the pros, cons and changes over the decades, I find myself using the internet less and less for dialogue with people. And I’m spending more time in nature and saying hi to strangers in reality.
I’m still worried about the impact this will have on a lot of people’s ability to reason, however. “Just” TikTok and apps like it have already had devastating results on certain demographics.
Call it glib, but that is one way for society to bring privacy back - and with it, shared respect. I think of it as the “oh, everyone has an anus” moment. We all know everyone has one, and it doesn’t need to be dragged out in polite company.
If you get off the internet you'd not even realise these tools exist, though. And for the statement that all jobs will be modelled to be true, it'd have to be impacting the real world.
Is it even possible to "get off the internet" without also leaving civilisation in general at this point?
> it'd have to be impacting the real world
By writing business plans? Getting lawyers punished because they didn't realise that "passes bar exam" isn't the same as "can be relied on for citations"? By defrauding people with synthesised conversations using stolen voices? By automating and personalising propaganda?
Or does it only count when it's guiding a robot that's not merely a tech demo?
Capabilities aren't the problem, cultural adoption is. Just yesterday I talked to someone who still googles solutions to their Excel table woes. Didn't they know of Copilot?
Maybe they didn't know, maybe none of their colleagues used it, their company didn't pay for it, or maybe all they need is an Excel update.
But I am confident that using Copilot would be faster than clicking through the sludge that is the Microsoft Office help pages (third-party or not).
So I think it is correct to fear capabilities, even if the real-world impact is still missing. When you invent an airplane, there won't be an airstrip to land on yet. Is it useless? Won't it change anything?
What does abating that trend look like? Most AI safety proposals I hear fall into the categories of a) we need to stop developing this technology or b) we need laws that entrench the richest and most powerful organizations in the world as the sole proprietors of this technology. Neither of those actually sounds better than people being paranoid weirdos about trusting text/video/voice. I think that's kinda where we need to be as a culture: these things are not trustworthy, they were only ever good as a rough heuristic, and now that ship has sailed.

We have just finished a transition to treating the digital world as part of our "real" world, but it's time to step that back. Using the internet to interact with known trusted parties will still work fine, provided that some authentication can be shared out-of-band offline. Meeting people and discovering businesses and such? There will be more fakes and scams than real opportunities by orders of magnitude, and as technology progresses our filtering will only get worse.

We need to roll back to "don't trust anything online, don't share your identity or payment information online" outside of, as mentioned, out-of-band verified parties. You can still message your friends and family, do online banking and commerce, but you can't initiate a relationship with a person or business online without some kind of trusted recommendation.
I don't think anyone has a good answer to that question, which is the problem in a nutshell. Job one is to start investing seriously in finding possible answers.
>We need to roll back to "don't trust anything online, don't share your identity or payment information online"
That's easy to say, but it's a trillion-dollar decision. Alphabet and Meta are both worthless in that scenario, because ~all of their revenue comes from connecting unfamiliar sellers with buyers. Amazon is at existential risk. The collapse of Alibaba would have a devastating impact on Chinese exporters, with massive consequent geopolitical risks. Rolling back to the internet of old means rolling back on many years worth of productivity and GDP growth.
> I'm not confident that I'd be able to distinguish GPT-4o from a human speaker in the best of circumstances and I'm almost certain that I could be fooled if I'm hurried, distracted, sleep deprived or otherwise impaired.
Set a memorable verification phrase with your friends and loved ones. That way if you call them out of the blue or from some strange number (and they actually pick up for some reason) and you tell them you need $300 to get you out of trouble they can ask you to say the phrase and they'll know it's you if you respond appropriately.
I've already done that and I'm far less worried about AI fooling me or my family in a scam than I am about corporations and governments using it without caring about the impact of the inevitable mistakes and hallucinations. AI is already being used by judges to decide how long people should go to jail. Parole boards are using it to decide who to keep locked up. Governments are using it to decide which people/buildings to bomb. Insurance companies are using it to deny critical health coverage to people. Police are using it to decide who to target and even to write their reports for them.
More and more people are going to get badly screwed over, lose their freedom, or lose their lives because of AI. It'll save time/money for people with more money and power than you or I will ever have though, so there's no fighting it.
Sure, all tech has 'real' effects. It's kinda the definition of tech. But all of these concerns more or less fall into the category of "add it to the list of things you have to watch out for living in the 21st century" - to me, this is nothing crazy (yet)
The nature of this tech itself is probably what is getting most people - it looks, sounds and feels _human_ - it's very relatable and easy for a non-tech person to understand it and thus get creeped out. I'd argue there are _far_ more dangerous technologies out there, but no one notices and / or cares because they don't understand the tech in the first place!
The "yet" is carrying a lot of weight in that statement. It is now five years since the launch of GPT-2, three years since the launch of GPT-3 and less than 18 months since the launch of ChatGPT. I cannot think of any technology that has improved so much in such a short space of time.
We might hit an inflection point and see that rate of improvement stall, but we might not; we're not really sure where that point might lie, because there's likely to still be a reasonable amount of low-hanging fruit regarding algorithmic and hardware efficiency. If OpenAI and their peers can maintain a reasonable rate of improvement for just a few more years, then we're looking at a truly transformational technology, something like the internet that will have vast repercussions that we can't begin to predict.
The whole LLM thing might be a nothingburger, but how much are we willing to gamble on that outcome?
> yet the real-world impacts remain modest so far.
I second that. I remember when Google search first came out. Within a few days it completely changed my workflow, how I use the Internet, my reading habits. It easily 5-10x'd the value of the Internet for me over a couple of weeks.
OMG I remember trying Google when it was in beta, and HOLY CRAP what I had been using was like freakin night and day. AltaVista: remember that? That was the state of the art before that, and it did not compare. Night and day.
not really. Even a human bad at reasoning can take 1 hour of time to tinker around and figure things out. GPT-4 just does not have the deep planning/reasoning ability necessary for that.
I think you might be falling for selection bias. I guess you are surrounding yourself with a lot of smart people. "Tinker around and figure things out" is definitely something certain humans (bad at reasoning) can't do. I already prefer the vision model when it comes to asking for a picture description (blind user) over many humans I personally know. The machine is usually more detailed, and takes the time to read the text, instead of trying to shortcut and decide for me what's important.

Besides, people from the English-speaking countries do not have to deal with foreign languages. Everyone else has to. "Aber das ist ja in englisch" ("but that's in English") is a common blocker for consuming information around here. I tell you, if we don't manage to ramp up education a few notches, we'll end up with an even higher stddev when it comes to practical intelligence. We already have perfectly normal-seeming humans who are absolutely unable to participate on the internet.
"real-world impacts remain modest so far."
Really? My Google usage has gone down by 90% (it would just lead me to some really bad take from a journalist anyway, while ChatGPT can just hand me the latest research and knows my level of expertise). Sure, it is not so helpful at work, but if OpenAI hasn't impacted the world, I fail to see which company has in this decade.
“Replaced Google” is definitely an impact, but it’s nothing compared to the people that were claiming entire industries would be wiped out nearly overnight (programming, screenwriting, live support, etc).
Speak to some illustrators or voiceover artists - they're talking in very bleak terms about their future, because so many of them are literally being told by clients that their services are no longer required due to AI. A double-digit reduction in demand is manageable on aggregate, but it's devastating at the margin. White-collar workers having to drive Ubers or deliver packages because their jobs have been taken over by AI is no longer a hypothetical.
I believe you, and I do turn to an LLM over Google for some queries where I'm not concerned about hallucination. (I use Llama 3 most of the time, because the privacy is absolute.)
But OpenAI is having a hard time retaining/increasing ChatGPT users. Also, Alphabet's stock is about as valuable as it's ever been. So I don't think we have evidence that this is really challenging Google's search dominance.
Google is an ad company. Ad prices are set at auction and most companies believe that they need ads. Fewer customers don't necessarily mean that earnings go down: when clicks go down, the prices might go up (absent ad competitors). Ergo, they don't compete (yet, at least).
Yeah. OpenAI are certainly not masters of hype, lol. They released their flagship product to basically no fanfare or advertisement. ChatGPT took off on word of mouth alone. They dropped GPT-4 without warning and waited months to ship its most exciting new feature (image input).
Even now, they're shipping text-and-image 4o but not the new voice, while leaving the old voice up and confusing/disappointing a whole lot of people. This is a pretty big marketing blunder.
I remember for a good 2-3 months in 2023, ALL you could see on TikTok / YouTube Shorts was just garbage about 'how amazing' ChatGPT was. Like, video after video, and I was surprised by the repeat content being recommended to me... No doubt OpenAI (or something) was behind that huge marketing push.
Is it not possible this would be explained by people simply being interested in the technology and TikTok/Youtube algorithms noticing that—and that they would have placed you in the same bubble, which is probably an accurate assignment?
I doubt OpenAI spent even one cent marketing their system (e.g. as in paying other companies to push it).
Well, if you were a typical highly engaged TikTok or YouTube user, you were probably 13-18 years old. The kind of cheating in school that ChatGPT enabled is revolutionary. That is going to go viral. It's not a marketing push. After years of essentially learning nothing during COVID lockdowns, can you understand how transformative that is? It's like 1,000x more exciting than pirating textbooks, stealing Mazdas, or whatever culturally self-destructive life hacks were being peddled by freakshow broccoliheads and Kim Kardashian-alikes on the platform.
It's ironic because the OpenAI creators really loved school and excelled academically. Nobody cares that ChatGPT destroyed advertising copywriting. But whatever little hope remained for the average high schooler post-lockdowns, it was destroyed by instant homework cheating via ChatGPT. So much for safety.
> yet the real-world impacts remain modest so far.
I spent part of yesterday evening sorting my freshly dried t-shirts into 4 distinct piles. I used OpenAI Vision (through BeMyEyes) from my phone. I got a clear description of each and every piece of clothing, including print, colours and brand. I am blind, BTW. But I guess you are right, no impact at all.
> Yet we now have Llama 3 in the wild
Yes, great, THANKS Meta, now the scammers have something to work with. That's a wonderful achievement which should be praised! </sarcasm>
Like another comment mentioned, sigmoid curves [1] are ubiquitous with neural network systems. Neural network systems can be intoxicating because it's so "easy" (relatively speaking) to go from nothing to 80% in extremely short periods of time. And so it seems completely obvious that hitting 100% is imminent. Yet it turns out that each percent afterwards starts coming exponentially more slowly, and we tend to just bump into seemingly impassable asymptotes far from where we'd like to be.
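(To make the "looks exponential at first" point concrete: the logistic curve

    f(t) = L / (1 + e^{-k(t - t0)})

behaves like L·e^{k(t - t0)} while t is well below the midpoint t0, i.e. it is indistinguishable from exponential growth early on, and only later bends toward the ceiling L.)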
~8 years ago, when self-driving technology was all the rage and every major company was getting on board with ever more impressive technological demos, it seemed entirely reasonable to expect that we'd all be in a world of complete self-driving imminently. I remember mocking somebody online around that time who was pursuing a class C/commercial trucking license. Yet now, a decade later, there are more truckers than ever and the tech itself seems further away than ever before. And that's because most have now accepted that progress on it has basically stalled out, in spite of absolutely monumental efforts at moving forward.
So long as LLMs regularly hallucinate, they're not going to be useful for much other than tasks that can accept relatively high rates of failure. And many of those generally creative domains are the ones LLMs are paradoxically the weakest in - like writing. Reading a book written by an LLM would be cruel and unusual punishment given the current state of the art. One domain I do see them completely taking over is search. They work excellently as natural language search engines, and "failure" in such is very poorly defined.
I'm not really sure your self-driving analogy is apt here. Waymo has cars on the road right now that are totally autonomous, and just expanded its footprint. It has been longer and more difficult than we all thought, and those early tech demos were a glimmer of what was to come; then we had to grind to get there, with a lot of engineering.
I think what maybe seems not obvious amidst the hype is that there is a hell of a lot of engineering left to do. The fact that you can squash the weights of a neural net down to 3 bits per param and it still works -- is evidence that we have quite a way to go with maturing this technology. Multimodality, improvements to the UX of it, the human-computer interface part of it. Those are fundamental tech things, but they are foremost engineering problems. Getting latency down. Getting efficiency up. Designing the experience, then building it out.
25 years ago, early tech demos on the internet were promising that everyone would do their shopping, entertainment, socializing, etc... online. Breathless hype. 5 years after that, the whole thing crashed, but it never went away. People just needed time to figure out how to use it and what it was useful for, and discover its limitations. 10 years after that, engineering efforts were systematized and applied against the difficult problems that still remained. And now: look at where we are. It just took time.
I don't think he's saying that AGI is impossible — almost no one (nowadays) would suggest that it's anything but an engineering challenge. The argument is simply one of scale, i.e. how long that engineering challenge will take to solve. Some people are suggesting on the order of years. I think they're suggesting it'll be closer to decades, if that.
I have a much less "utopian" view about the future. I remember during the renaissance of neural networks (ca. 2010-15) it was said that "more data leads to better models", and that was at a time when researchers frowned upon the term Artificial Intelligence and would rather use Machine Learning. Fast forward a decade: LLMs are very good synthetic data generators that try to mimic human-generated input, and I can't help thinking that this was the sole initial intent of LLMs. And that's it for me. There's not much to hype and no intelligence at all.
What happens now is that human generated input becomes more valuable and every online platform (including minor ones) will have now some form of gatekeeping in place, rather sooner than later. Besides that a lot of work still can't be done in front of a computer in isolation and probably never will, and even if so, automation is not a means to an end. We still don't know how to measure a lot of things and much less how to capture everything as data vectors.
The evidence for that is the change in the tokenizer. The only way to implement that is to re-train the entire base model from scratch. This implies that GPT 4o is not a fine-tuning of GPT 4. It's a new model, with a new tokenizer, new input and output token types, etc...
They could have called it GPT-5 and everyone would have believed them.
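If you want to see the tokenizer change yourself, the open-source tiktoken library exposes both encodings. A minimal sketch (the encoding names are the published ones; exact token counts will vary with library version):

    # Compare GPT-4's tokenizer (cl100k_base) with GPT-4o's (o200k_base).
    import tiktoken

    text = "GPT-4o ships with a new tokenizer, which implies a retrained base model."
    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        toks = enc.encode(text)
        print(name, "->", len(toks), "tokens")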
I’ve used it for a couple of hours to help with coding and it feels very similar to GPT-4: it still makes erroneous and inconsistent suggestions. Not calling it 4.5 was the right call. It is much faster though.
The expectations for GPT-5 are sky high. I think we will see a similar jump as 3.5 -> 4.
My new PC arrives tomorrow. Once I source myself two RTX 3060s I'll be an AI owner, no longer dependent on cloud APIs.
Currently the bottleneck is Agents. If you want a large language model to actually do anything you need an Agent. Agents so far need a human in the loop to keep them sane. Until that problem is solved most human jobs are still safe.
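By "human in the loop" I mean something like the sketch below: the model proposes a step, and a person signs off before anything actually runs. Purely illustrative; propose_action stands in for whatever LLM call you'd use, it's not a real API.

    # Minimal human-in-the-loop agent loop (illustrative only).
    def propose_action(goal, history):
        # Stand-in for an LLM call that returns e.g.
        # {"tool": "send_email", "args": {...}} or None when done.
        return None

    tools = {
        "send_email": lambda args: print("sending", args),
        "create_ticket": lambda args: print("filing", args),
    }

    def run_agent(goal, max_steps=10):
        history = []
        for _ in range(max_steps):
            action = propose_action(goal, history)
            if action is None:
                break
            answer = input(f"Run {action['tool']} with {action['args']}? [y/N] ")
            if answer.lower() == "y":
                history.append((action, tools[action["tool"]](action["args"])))
            else:
                history.append((action, "rejected by human"))
        return history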
(IMO) AI cannot murder people. The responsibility of what an AI does falls on the person who deployed it, and to a lesser extent the person who created it. If someone is killed by a fully autonomous weapon then that person has been murdered by the person or people who created and enabled the AI, not the AI itself.
This is no different to saying a person with a gun murdered someone rather than attributing the murder to the gun. An AI gun is just a really fancy gun.
Nature had been doing that for billions of years until a few decades ago when we were told "progress" meant we had to stop doing the same thing more peacefully and intentionally.
My guess is the future belongs to those who don't stop—who, in fact, embrace the opposite of stopping.
I would even suggest that the present belongs to those who didn't stop. It may be too late for normal people to ever catch up by the time we realize the trick that was played on us.
The present absolutely belongs to those who didn't stop, but it's been a lot longer than a few decades.
Varying degrees of greedy / restless / hungry / thirsty / lustful are what we've got, because how is contentedness ever going to compete with that over millennia?
People with your sentiment said the same thing about all cool tech that changed the world. Doesn't change the reality, a lot of professions will need to adapt or they will go extinct.
I am a PhD biophysicist working within the field of biological imaging. Professionally, my team (successfully) uses deep learning and GANs for a variety of tasks within the field of imaging, such as segmentation, registration, and predictive protein/transcriptomics. It’s good stuff, a game changer in many ways. In no way, however, does it represent generalized AI, and nobody in the field makes this claim, even though the output of these algorithms matches or outperforms humans in some cases.
LLMs are no different. Like DL modules that are very good at outputting images that mimic biological signatures, LLMs are very good at outputting texts that eerily mimic human language.
However — and this is a point about which programmers are woefully and comically ignorant — human language and reason are two separate things. Tech bros wholly confuse the two, however, and thus make outlandish claims that we have achieved, or are on the brink of achieving, actual AI systems.
In other words, while LLMs and DL in general can perform specific tasks well, they do not represent a breakthrough in artificial intelligence, and thus will have a much narrower application space than actual AI.
This is true, but only up to the point where mimicking, and more broadly speaking statistically imitating data, are understood in a more generalized way.
LLMs statistically imitate texts of the real world. To achieve a certain threshold of accuracy, it turns out they need to imitate the underlying Turing machine/program/logic that runs in our brains when we understand and react properly to texts ourselves. That is no longer in the realm of the old-school data-as-data statistics, I would say.
If you've been in the field you really should know that the term AI has been used to describe things for decades in the academic world. My degree was in AI back before RBMs and Hinton's big reveal about making things 100,000 times faster (do the main step just once, not 100 times, and take 17 years to figure that out).
You're talking more about AGI.
We need "that's not AI" discussions like we need more "serverless? It's still on some server!!" discussions.
The problem with this kind of criticism of any AI-related technology is that it is an unfalsifiable argument akin to saying that it can't be "proper" intelligence unless God breathed a soul into the machine.
The method is irrelevant. The output is what matters.
This is like a bunch of intelligent robots arguing that "mere meat" cannot possibly be intelligent!
"We shall not be very greatly surprised if a woman analyst who has not been sufficiently convinced of the intensity of her own wish for a penis also fails to attach proper importance to that factor in her patients" Sigmund Freud, in response to Karen Horney’s criticism of his theory of penis envy.
W-what? Lad, have you used ChatGPT? It can instantly give you intelligent feedback on anything (usually better than any expert community, like 90% of the time). On extremely detailed, specific tasks (like writing algorithms or refactoring) it's able to spit out either working code or code so close to working that it's still faster than what you could have done yourself. It can explain things better than probably 99.999% of teachers.
It will give you detailed examples that are much easier to follow than vague, error-prone spec docs. And that's just scratching the surface. Other people are far more creative than me and have used ChatGPT for mind-blowing stuff already. Whatever it's doing passes for 'reasoning' and 'intelligence' in my book. To me it doesn't matter whether it's the same kind of intelligence as a human or whether there's any amount of awareness, as those are both philosophical questions of no consequence to my work.
For what these pieces of tech can do I feel that they're drastically under-utilized.
Now that I see this, here is my wish (I know there are security/privacy concerns, but let's pretend they aren't there for this wish):
An app that runs on my desktop and has access to my screen(s) when I work. At any time I can ask it something about what's on the screen; it can jump in and let me know if it thinks I made a mistake (think pair programming) or make a suggestion (drafting a document). It can also quickly take over if I ask it to (copilot on demand).
Except for the last point and the desktop version, I think it's already done in the math demo video.
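The screen-watching part is already close to buildable with off-the-shelf pieces. A rough sketch, assuming Pillow for the screenshot and the documented image-input format of the OpenAI chat API (everything else about the "pair programmer" behaviour is hand-waved):

    # Grab the screen, send it to a vision-capable model, ask a question about it.
    import base64, io
    from PIL import ImageGrab          # screenshot (Windows/macOS)
    from openai import OpenAI

    client = OpenAI()

    def ask_about_screen(question):
        shot = ImageGrab.grab()
        buf = io.BytesIO()
        shot.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    # print(ask_about_screen("Spot any mistakes in the function on my screen?"))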
I guess it will also pretty soon refuse to let me come back inside the spaceship, but until then it'll be a nice ride.
Agreed. I’m excited about reaching a point where the experience is of being in a deep work ‘flow’ with an ultra intelligent colleague, instead of jumping out of context to instant message them.
I've worked quite a bit with STT and TTS over the past ~7 years, and this is the most impressive and even startling demo I've seen.
But I would like to see how this is integrated into applications by third party developers where the AI is doing a specific job. Is it still as impressive?
The biggest challenge I've had with building any autonomous "agents" on generic LLMs is that they are overly gullible and accommodating, requiring a reversion to legacy chatbot logic trees etc. to stay on task and perform a job. Also, STT is rife with speaker interjections, leading to significant user frustration, and users just want to talk to a person. Hard to see if this is really solved yet.
I’ve found using logic trees with LLMs isn’t necessarily a problem or a deficit. I suppose if they were truly magical and could intuit the right response every time, cool, but I’d always worry about the potential for error and hallucinations.
I’ve found that you can create declarative logic trees from JSON and use that as a prompt for the LLM, which it can then use to traverse the tree accordingly. The only issue I’ve encountered is when it wants to jump to part of the tree which is invalid in the current state. For example, you want to move a user into a flow where certain input is required, but the input hasn’t been provided yet. A transition is suggested to the program by the LLM, but it’s impossible so the LLM has to be prompted that the transition is invalid and to correct itself. If it fails to transition again, a default fallback can be given but it’s not ideal at all.
However, another nice aspect of having the tree declared in advance is that it shows human beings what the system is capable of and how it's intended to be used as well. This has proven to be pretty useful, as letting the LLM call functions it sees fit based on broad intentions and system capabilities leaves humans in the dark a bit.
So, I like the structure and dependability. Maybe one day we can depend on LLM magic and not worry about a team understanding the ins and outs of what should or shouldn’t be possible, but we don’t seem to be there yet at all. That could be in part because my prompts were bad, though.
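To make that concrete, here's a stripped-down sketch of the pattern (all names invented): the tree lives in JSON, gets included in the prompt, and the program itself rejects transitions that the current state doesn't allow.

    import json

    # Declarative flow: which states exist, what they require, where they can go.
    flow = {
        "collect_email":   {"requires": [],                   "next": ["collect_address"]},
        "collect_address": {"requires": ["email"],            "next": ["confirm"]},
        "confirm":         {"requires": ["email", "address"], "next": []},
    }

    def build_prompt(state, collected):
        return ("You are guiding a user through this flow:\n"
                + json.dumps(flow, indent=2)
                + f"\nCurrent state: {state}\nCollected: {json.dumps(collected)}\n"
                + "Reply with the name of the next state only.")

    def apply_transition(state, suggested, collected):
        ok = (suggested in flow[state]["next"]
              and all(k in collected for k in flow.get(suggested, {}).get("requires", [])))
        # Invalid jump: stay put and re-prompt the LLM (or use a default fallback).
        return suggested if ok else state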
Any recommendations on patterns/approaches for these declarative logic trees, and where you put which types of logic (logic which goes in the prompt, logic which goes in the code that parses the prompt response, how to detect errors in the response and retry the prompt, etc.)? On "Show HN" I see a lot of "fully automated agents" which seem interesting, but I'm not sure whether they are overkill or not.
Personally, I've found that a nested class structure with instructions in annotated field descriptions and/or docstrings can work wonders. Especially if you handle your own serialization to JSON Schema (either by rolling your own or using hooks provided by libraries like Pydantic), so you can control what attributes get included and when.
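Roughly what I mean, using Pydantic v2 (the field descriptions are where the per-field instructions live, and the generated JSON Schema is what the model sees); names here are just for illustration:

    from pydantic import BaseModel, Field

    class Address(BaseModel):
        city: str = Field(description="City name only, no abbreviations")
        postcode: str = Field(description="Postal code exactly as the user gives it")

    class ContactSurvey(BaseModel):
        """Collect contact details; ask for one field at a time."""
        email: str = Field(description="Ask for this first and confirm the spelling")
        address: Address

    # Serialize to JSON Schema and include it in the prompt.
    print(ContactSurvey.model_json_schema())   # Pydantic v1: ContactSurvey.schema()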
The JSON serialization strategy worked really well for me in a similar context. It was kind of a shot in the dark but GPT is pretty awesome at using structured data as a prompt.
I actually only used an XState state machine with JSON configuration and used that data as part of the prompt. It worked surprisingly well.
Since it has an okay grasp on how finite state machines and XState work, it seems to do a good job of navigating the tree properly and reliably. It essentially does so by outputting information it thinks the state machine should use as a transition in a JSON object which gets parsed and passed to a transition function. This would fail occasionally so there was a recursive “what’s wrong with this JSON?” prompt to get it to fix its own malformed JSON, haha. That was meant to be a temporary hack but it worked well, so it stayed. There were a few similar tools for trying to correct errors. That might be one of the strangest developments in programming for me… Deploying non-deterministic logic to fix itself in production. It feels wrong, but it works remarkably well. You just need sane fallbacks and recovery tactics.
It was a proprietary project so I can’t share the source, but I think reading up on XState JSON configuration might explain most of it. You can describe most of your machine in a serializable format.
You can actually store a lot of useful data in state names, context, meta, and effect/action names to aid with the prompting and weaving state flows together in a language-friendly way. I also liked that the prompt would be updated by information that went along with the source code, so a deployment would reliably carry the correct information.
The LLM essentially hid a decision tree from the user and smoothed over the experience of navigating it through adaptive and hopefully intuitive language. I’d personally prefer to provide more deterministic flows that users can engage with on their own, but one really handy feature of this was the ability to jump out of child states into parent states without needing to say, list links to these options in the UI. The LLM was good at knowing when to jump from leaves of the tree back up to relevant branches. That’s not always an easy UI problem to solve without an AI to handle it for you.
edit: Something I forgot to add is that the client wanted to be able to modify these trees themselves, so the whole machine configuration was generated by a graph in a database that could be edited. That part was powered by Strapi. There was structured data in there and you could define a state, list which transitions it can make, which actions should be triggered and when, etc. The client did the editing directly in Strapi with no special UI on top.
Their objective is surveying people in a more engaging and personable way. They really wanted surveys which adapt to users rather than piping people through static flows or exposing them to redundant or irrelevant questions. Initially this was done with XState and no LLM (it required some non-ideal UI and configuration under the hood to make those jumps to parent states I mentioned, but it worked), and I can't say how effective it is but they really like it. The AI hype was very very strong on that team.
>Also STT is rife with speaker interjections, leading to significant user frustrations and they just want to talk to a person. Hard to see if this is really solved yet.
This is not using TTS or STT. Audio and image data can be tokenized as readily as text. This is simply an LLM that happens to have been trained to receive and spit out audio and image tokens as well as text tokens. Interjections are a lot more palatable in this paradigm, as most of the demos show.
Adding audio data as a token, in and of itself, would dramatically increase training size, cost, and time for very little benefit. Neural networks also generally tend to function less effectively with highly correlated inputs, which I can only assume is still an issue for LLMs. And adding combined audio training would introduce rather large scale correlations in the inputs.
I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably doing something similar: taking an extremely limited number of 'screenshots', captioning them with another layer, and then feeding that in as an input. So the demo, to me, seems most likely to just be 3 separate 'plugins' operating in unison - text to speech, speech to text, and image to text.
The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note in basically all the videos, the speakers have to repeatedly cut off the LLM as it starts speaking in conversationally inappropriate locations. In the main video which is just an extremely superficial interaction, the speaker made sure to be constantly speaking when interacting, only pausing once to take a breath that I noticed. He also struggled with the timing of his own responses as the LLM still seems to be attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).
>I would wager like 100:1 that this is just introducing some TTS/STT layers.
Literally the first paragraph of the linked blog.
"GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs."
Then
"Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."
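For contrast, the old three-model Voice Mode pipeline looks roughly like this when rebuilt from the public API (a sketch; model names are just the documented ones, not necessarily what Voice Mode used internally). Every hop throws away tone, timing and speaker identity:

    from openai import OpenAI

    client = OpenAI()

    # 1) speech -> text
    with open("user_turn.wav", "rb") as f:
        heard = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2) text -> text
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": heard}],
    ).choices[0].message.content

    # 3) text -> speech
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("assistant_turn.mp3")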
I can’t square this with the speed. A couple of layers doing STT are technically still part of the neural network, no? Because the increase in token base to cover multimodal tokenization would make even text inference slower, not twice as fast, as 4-turbo.
OpenAI give so little information on the details of their models now that one can only speculate how they've managed to cut down inference costs.
STT throws away a lot of information that is clearly being preserved in a lot of these demos so that's definitely not happening here in that sense. That said, the tokens would be merged to a shared embedding space. Hard to say how they are approaching it exactly.
I'd mentally change the acronym to Speech to Tokens. Parsing emotion and other non-explicit indicators in speech has been an ongoing part of research for years now. Meta-data of speaker identity, inflection, etc could easily be added and current LLMs already work with it just fine. For instance asking Claude, with 0 context, to parse the meaning of "*laughter* Yeah, I'm sure that's right." instantly yields:
----
The phrase "*laughter* Yeah, I'm sure that's right" appears to be expressing sarcasm or skepticism about whatever was previously said or suggested. Here's a breakdown of its likely meaning:
"*laughter*" - This typically indicates the speaker is laughing, which can signal amusement, but in this context suggests they find whatever was said humorous in an ironic or disbelieving way.
"Yeah," - This interjection sets up the sarcastic tone. It can mean "yes" literally, but here seems to be used facetiously.
"I'm sure that's right." - This statement directly contradicts and casts doubt on whatever was previously stated. The sarcastic laughter coupled with "I'm sure that's right" implies the speaker believes the opposite of what was said is actually true.
So in summary, by laughing and then sarcastically saying "Yeah, I'm sure that's right," the speaker is expressing skepticism, disbelief or finding humor in whatever claim or suggestion was previously made. It's a sarcastic way of implying "I highly doubt that's accurate or true."
It could be added. Still wouldn't sound as good as what we have here. Audio is Audio and Text is Text and no amount of metadata we can practically provide will replace the information present in sound.
You can't exactly metadata your way out of this (skip to 11:50)
I'm not sure why you say so? To me that seems obviously, literally, just swapping/weighting between a set of predefined voices. I'm sure you've played a game with a face generator - it's the exact same thing, except with audio. I'd also observe in the demo that they explicitly avoided anything particularly creative, instead sticking within an extremely narrow domain of very basic adjectives: neutral, dramatic, singing, robotic, etc. I'm sure it also has happy, sad, angry, mad, and so on available.
But if the system can create a flamboyantly homosexual Captain Picard with a lisp and slight stutter engaging in overt innuendo when stating, "Number one, Engage!" then I look forward to eating crow! But as the instructions were all conspicuously just "swap to pretrained voice [x,y,z]", I suspect crow will not be on the menu any time soon.
Since OpenAI has gone completely closed, they've been increasingly opaque and dodgy about how even things like basic chat works. Assuming the various leaked details of GPT-4 [1] are correct (and to my knowledge there has been no indication that they are not), they have been actively misleading and deceptive - as even the 'basic' GPT4 is a mixture of experts system, and not one behemoth neural network.
A Mixture of Experts model is still one behemoth neural network, and believing otherwise is just a common misconception about the term.
MoE is an attempt at sparsity, only activating a set number of neurons/weights at a time. The experts are not separate models stitched together; it's not an ensemble. I blame the name at this point.
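A toy numpy sketch of the idea (sizes are arbitrary): one set of parameters, a router picks the top-k experts per token, so only a fraction of the weights fire on any given input, but it's all one network trained end to end.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2
    W_gate = rng.normal(size=(d, n_experts))        # router
    W_experts = rng.normal(size=(n_experts, d, d))  # all expert weights, one tensor

    def moe_layer(x):
        logits = x @ W_gate
        chosen = np.argsort(logits)[-top_k:]        # only the top-k experts activate
        w = np.exp(logits[chosen])
        w /= w.sum()
        return sum(wi * (x @ W_experts[i]) for wi, i in zip(w, chosen))

    print(moe_layer(rng.normal(size=d)).shape)      # (16,)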
I would ask you to watch the demo on SoundHound.com. It does less, yes, but it's so crucially fit for use. You'll notice from the shown gpt-4 demo they were guiding the LLM into chain of reasoning. It works very well when you know how to work it, which aligns with what you're saying. I don't mean to degrade the achievement, it's great, but we often inflate the expectations of what something can actually do before reaching real productivity.
We've had voice input and voice output with computers for a long time, but it's never felt like spoken conversation. At best it's a series of separate voice notes. It feels more like texting than talking.
These demos show people talking to artificial intelligence. This is new. Humans are more partial to talking than writing. When people talk to each other (in person or over low-latency audio) there's a rich metadata channel of tone and timing, subtext, inexplicit knowledge. These videos seem to show the AI using this kind of metadata, in both input and output, and the conversation even flows reasonably well at times. I think this changes things a lot.
The "magic" moment really hit in this, like you're saying. Watching it happen and being like "this is a new thing". Not only does it respond in basically realtime, it concocts a _whole response_ back to you as well. It's like asking someone what they think about chairs, and then that person being able to then respond to you with a verbatim book on the encyclopedia of chairs. Insane.
I'm also incredibly excited about the possibility of this as an always available coding rubber duck. The multimodal demos they showed really drove this home, how collaboration with the model can basically be as seamless as screensharing with someone else. Incredible.
As goofy as I personally think this is, it's pretty cool that we're converging on something like C3P0 or Plankton's Computer with nothing more than the entire corpus of the world's information, a bunch of people labeling data, and a big pile of linear algebra.
Is this a trick question? OpenAI blatantly used copyrighted works for commercial purposes without paying the IP owners, it would only be fair to have them publish the resulting code/weights/whatever without expecting compensation. (I don't want to publish it myself, of course, just transform it and sell the result as a service!)
I know this won't happen, of course, I am moreso hoping for laws to be updated to avoid similar kerfuffles in the future, as well as massive fines to act as a deterrent, but I don't dare to hope too much.
I was envisioning a future where we've done away with the notion of data ownership. In such a world the idea that we would:
> have all of OpenAI's data for free
Doesn't really fit. Perhaps OpenAI might successfully prevent us from accessing it, but it wouldn't be "theirs" and we couldn't "have" it.
I'm not sure what kind of conversations we will be having instead, but I expect they'll be more productive than worrying about ownership of something you can't touch.
This is alleged, and it is very likely that claimants like the New York Times accidentally prompt-injected their own material to show the violation (not understanding how LLMs really work), clouded by the hope of a big payday rather than actual justice/fairness, etc.
Anyways, the laws are mature enough for everyone to work this out in court. Maybe it comes out that they have a legitimate concern, but the way they presented their evidence so far in public has seriously been lacking.
Or rather, I have an unending stream of callers with similar-sounding voices who all want to make chirpy persuasive arguments in favor of Mr Altman's interests.
These models literally need ALL data. The amount of work it would take just to account for all the copyrights, let alone negotiate and compensate the creators, would be infeasible.
I think it’s likely that the justice system will deem model training as fair use, provided that the models are not designed to exactly reproduce the training data as output.
I think you hit on an important point though: these models are a giant transfer of wealth from creators to consumers / users. Now anyone can acquire artist-grade art for any purpose, basically for free — that’s a huge boon for the consumer / user.
People all around the world are going to be enriched by these models. Anyone in the world will be able to have access to a tutor in their language who can teach them anything. Again, that is only possible because the models eat ALL the data.
Another important point: original artwork has been made almost completely obsolete by this technology. The deed is done, because even if you push it out 70 years, eventually all of the artwork that these models have been trained on will be public domain. So, 70 years from now (or whatever it is) the cat will be out of the bag AND free of copyright obligations, so 2-3 generations from now it will be impossible to make a living selling artwork. It’s done.
When something becomes obsolete, it’s a dead man walking. It will not survive, even if it may take a while for people to catch up. Like when the vacuum tube computer was invented, that was it for relay computers. Done. And when the transistor was invented, that was it for vacuum tube computers.
It’s just a matter of time before all of today’s data is public domain and the models just do what they do.
There's zero doubt that people will still create art. Almost no one will be paid to do it though (relative to our current situation where there are already far more unpaid artists than paid ones). We'll lose an immeasurable amount of amazing new art that "would have been" as a result, and in its place we'll get increasingly bland/derivative AI generated content.
Much of the art humans will create entirely for free in whatever spare time they can manage after their regular "for pay" work will be training data for future AI, but it will be extremely hard for humans to find as it will be drowned out by the endless stream of AI generated art that will also be the bulk of what AI finds and learns from.
I wonder how long until we see a product that's able to record workstation displays and provide a conversational analysis of work conducted that day by all of your employees.
Exactly. I'm not sure if this is brand new or not, but this is definitely on the frontier.
I was literally just thinking about this a few days ago... that we need a multi-modal language model with speech training built-in.
As soon as this thing rolls out, we'll be talking to language models like we talk to each other. Previously it was like dictating a letter and waiting for the responding letter to be read to you. Communication is possible, but not really in the way that we do it with humans.
This is MUCH more human-like, with the ability to interrupt each other and glean context clues from the full richness of the audio.
The model's ability to sing is really fascinating. Its ability to change the sound of its voice -- its pacing, its pitch, its tonality. I don't know how they're controlling all that via GPT-4o tokens, but this is much more interesting stuff than what we had before.
I honestly don't fully understand the implications here.
> Instinctively, I dislike a robot that pretends to be a real human being.
Is that because you're not used to it? Honestly asking.
This is probably the first time it feels natural where as all our previous experiences make "chat bots" and "automated phone systems", "automated assistants" absolutely terrible.
Naturally, we dislike it because "it's not human". But this is true of pretty much anything that approaches the "uncanny valley". But if the "it's not human" thing solves your problem 100% better/faster than the human counterpart, we tend to accept it a lot faster.
This is the first real contender. Siri was the "glimpse" and ChatGPT is probably the reality.
[EDIT]
https://vimeo.com/945587328 the Khan academy demo is nuts. The inflections are so good. It's pretty much right there in the uncanny valley because it does still feel like you're talking to a robot but it also directly interacting with it. Crazy stuff.
> It speaks in customer service voice. That faux friendly tone people use when they're trying to sell you something.
Mmmmm, while I get that, in the context of the grandparent comment, would having a human be any better? It's effectively the same. Because, realistically, that's a pretty common voice/tone to get even in tech support.
The problem is you don't like the customer service/sales voice because they "pretend to be your friends".
Let me know if I didn't capture it.
I don't think people "pretend to be my friend" when they answer the phone to help me sort out an airline ticket problem. I do believe they're trained to, and work to, take on a "friendly" tone. Even if the motive isn't genuine, because it's trained, it's a way nicer experience than someone who's angry or even simply monotone. Trying to fix my $1200 plane ticket is stressful enough. Don't need the CSR to make it worse.
I wonder if you can ask it to change its inflections to match a personal conversation as if you're talking to a friend or a teacher or in your case... a British person?
> This is probably the first time it feels natural
Really? I found this demo painful to watch and literally felt that "cringe" feeling. I showed it to my partner and she couldn't even stand to hear more than a sentence of the conversation before walking away.
It felt both staged and still frustrating to listen to.
And, like far too much in AI right now, a demo that will likely not pan out in practice.
Emotions are a means of conveying feelings, but our sensitivity to human emotions can also be a vector for manipulation.
Especially when you consider the bottom line that this tech will ultimately be shoehorned into advertising somehow (read: the field dedicated to manipulating you into buying shit).
> Emotions are a means of conveying feelings, but our sensitivity to human emotions can also be a vector for manipulation.
When one gets to be a certain age one begins to become attuned to this tendency of others' emotions to manipulate you, so you take steps to not let that happen. You're not ignoring their emotions, but you can address the underlying issue more effectively if you're not emotionally charged. It's a useful skill that more people would benefit from learning earlier in life. Perhaps AI will accelerate that particular skill development, which would be a net benefit to society.
With AI you can do A/B testing (or multi-armed bandits, the exact technique doesn't matter) to get into someone's mind.
Most manipulators end up getting bored of trying again and again with the same person. That won't happen if you are dealing with a machine, as it can change names, techniques, contexts, tones, etc., until you give it what its operator wants.
Maybe you're part of the X% who will never give in to a machine. But keep in mind that most people have no critical thinking skills nor mental fortitude.
Problem is, people aren't machines either: someone who's getting bombarded with phishing requests will begin to lose it, and will be more likely to just turn off their Wi-Fi than allow an AI to run a hundred iterations of a multi-armed-bandit approach on them.
I think we often get better at detecting the underlying emotion with which the person is communicating, seeing beyond the one they are trying to communicate in an attempt to manipulate us. For example, they say that $100 is their final price but we can sense in the wavering of their voice that they might feel really worried that they will lose the deal. I don't think this will help us pick up on those cues because there are no underlying real emotions happening, maybe even feeding us many false impressions and making us worse at gauging underlying emotions.
> When one gets to be a certain age one begins to become attuned to this tendency of others' emotions to manipulate you
This is incredibly optimistic, which I love, but my own experience with my utterly deranged elder family, made insane by TV, contradicts this. Every day they're furious about some new things fox news has decided it's time to be angry about: white people being replaced (thanks for introducing them to that, tucker!), "stolen" elections, Mexicans, Muslims, the gays, teaching kids about slavery, the trans, you name it.
I know nobody else in my life more emotionally manipulated on a day to day basis than them. I imagine I can't be alone in watching this happen to my family.
Yes nothing more unreasonable than not wanting your race to be replaced, wanting border laws to be enforced, and not wanting your children to be groomed into cutting off their body parts. You are definitely sane and your entire family is definitely insane.
Great replacement and white genocide are white nationalist far-right conspiracy theories. If you believe this is happening, you are the intellectual equivalent of a flat-earther. Should we pay attention to flat-earthers? Are their opinions on astronomy, rocketry, climate, and other sciences worth anyone's time? Should we give them a platform?
> In the words of scholar Andrew Fergus Wilson, whereas the islamophobic Great Replacement theory can be distinguished from the parallel antisemitic white genocide conspiracy theory, "they share the same terms of reference and both are ideologically aligned with the so-called '14 words' of David Lane ["We must secure the existence of our people and a future for white children"]." In 2021, the Anti-Defamation League wrote that "since many white supremacists, particularly those in the United States, blame Jews for non-white immigration to the U.S.", the Great Replacement theory has been increasingly associated with antisemitism and conflated with the white genocide conspiracy theory. Scholar Kathleen Belew has argued that the Great Replacement theory "allows an opportunism in selecting enemies", but "also follows the central motivating logic, which is to protect the thing on the inside [i.e. the preservation and birth rate of the white race], regardless of the enemy on the outside."
> and not wanting your children to be groomed into cutting off their body parts.
This doesn't happen. In fact, the only form of gender-affirming surgery that any doctor will perform on under-18-year-olds is male gender-affirming surgery on overweight boys to remove their manboobs.
> You are definitely sane and your entire family is definitely insane.
You sound brave, why don't you tell us what your username means :) You're one to stand by your values, after all, aren't you?
What if this technology could be applied so you can’t be manipulated? If we are already seeing people use this to simulate and train sales people to deal with tough prospects we can squint our eyes a bit and see this being used to help people identify logical fallacies and con men.
> Especially when you consider the bottom line that this tech will ultimately be horned into advertising somehow.
Tools and the weaponization of them.
This can be said of pretty much any tech tool that has the ability to touch a good portion of the population, including programming languages themselves, or even CRISPR.
I agree we have to be careful of the bad, but the downsides in this case are not so dangerous that we should be trying to suppress it because the benefits can be incredible too.
The concern is that it's being locked up inside of major corporations that aren't the slightest bit trustworthy. To make this safe for the public, people need to be able to run it on their own hardware and make their own versions of it that suit their needs rather than those of a megacorp.
This tech isn't slowing down. Our generation may hesitate at first, but remember that this field is progressing at astonishing speed; we are literally one generation away.
Why can’t it also inspire you? If I can forgo advertising and have ChatGPT tutor my child on geometry, and they actually learn it at a fraction of the cost of a human tutor, why is that bothersome? Honest question. Why do so many people default to assuming something sinister is going on? If this technology shows real efficacy in education at scale, take my money.
Because it is obviously going to be used to manipulate people. There is absolutely 0 doubt about that (and if there is I'd love to hear your reasoning). The fact that it will be used to teach geometry is great. But how many good things does a technology need to do before the emotional manipulation becomes worth it?
I don't think OpenAI is doing anything particularly sinister. But whatever OpenAI has today, a bad actor will have in October. This horseshit is moving rather fast. Sorry, but going in two years from failing the Turing test to being able to have a conversation with an AI agent nearly indistinguishable from a person is going to be destabilizing.
These sorts of comments are going to go in the annals with the hackernews people complaining about Dropbox when it first came out. This is so revolutionary. If you're not agog you're just missing the obvious.
Good thing you can tell the AI to speak to you in a robotic monotone and even drop IQ if you feel the need to speak with a dumb bot. Or abstain from using the service completely. You have choices. Use them.
Until your ISP fires their entire service department in a foolish attempt to "replace" them with an overfunded chatbot-service-department-as-a-service and you have to try to jailbreak your way through it to get to a human.
I think pets often feel real emotions, or at least bodily sensations, and communicate those to humans in a very real way, whether through barking or meowing or whimpering or whatnot. So while we may care for them as we care for a human, just as we may care for a plant or a car, I think if my car started to say it felt excited for me to take it for a drive, I might also feel uncomfortable.
They do, but they've evolved neoteny (baby-like cries) to do it, and some of their emotions aren't "human" even though they are really feeling them.
Silly example, but some pets like guinea pigs are almost always hungry and they're famous for learning to squeak at you whenever you open the fridge or do anything that might lead to giving them bell peppers. It's not something you'd put up with a human family member using their communication skills to do!
There’s definitely an element of evolution: domesticated animals have evolved to have human recognizable emotions. But that’s not to say they’re not “real” or even “human.” Do humans have a monopoly on joy? I think not. Watch a dog chase a ball. It clearly feels what we call joy in a very real sense.
But I think this animosity is very much expected, no? Even I felt a momentary hint of "jealousy" -- if you can even call it that -- when I realized that we humans are, in a sense, not really so special anymore.
But of course this was the age-old debate with our favorite golden-eyed android; and unsurprisingly, he too received the same sort of animosity:
Bones was deeply skeptical when he first met Data: "I don't see no points on your ears, boy, but you sound like a Vulcan." And we all know how much he loved those green-blooded fools.
Likewise, Dr. Pulaski has since been criticized for her rude and dismissive attitude towards Data, which had flavors of what might even be considered "racism," or so goes the Trekverse discussion on the topic.
And let's of course not forget when he was put on trial, essentially over his "humanity," or whether he was indeed just the property of Starfleet and nothing more.
The more recent Star Trek: Picard illustrated the outright ban on "synthetics" and indeed their effective banishment; non-synthetic life -- from human to Romulan -- simply wasn't OK with them.
Yes, this is all science fiction silliness -- or adoration, depending on your point of view -- but I think it very much reflects the myriad directions our real-life world is going to scatter (shatter?) in the years ahead.
To your point, there's been a lot of talk about AI, regulation, guardrails, whatever. Now is the time to say, AI must speak such that we know it's AI and not a real human voice.
We get the upside of conversation, and avoid the downside of falling asleep at the wheel (as Ethan Mollick mentions in "Co-Intelligence".)
I’m 100% a text-everything, never-call person, but I can’t live without Alexa these days; every time I’m in a hotel or on vacation I nearly ask a question out loud.
I also hate how much Alexa sucks, so this is a big deal. I spent years working out what it can and can’t do, so it will be nice to have one that I don’t have to treat like a toddler.
> Humans are more partial to talking than writing.
Amazon, Google, and Apple have sunk literally billions of dollars into this idea only to find out that, no, we aren't.
We are with other humans, yes. When socialization is part of the conversation. When I'm talking to my local barista I'm not just ordering a coffee, I'm also maintaining a relationship with someone in my community.
But when it comes to work, writing >>> talking. Writing is clarity of ideas. Talking is cult of personality.
And when it comes to inputs/outputs, typing is more precise and more efficient.
Don't get me wrong, this is an incredibly revolutionary piece of technology, but I don't think the benefits of talking you're describing (timing, subtext, inexplicit knowledge) are achievable here either (for now), since even that requires HOURS of interaction over days/weeks/months of experiences for humans to achieve with each other.
I use voice assistants and find them quite useful, but I've had to learn the interface and memorise the correct trigger phrases. If GPT-4o works half as well in practice as it does in the demos, then it's categorically a different thing.
Writing is only superior to conversation when weighed against discussions with more than 3 people. A quick call with one or two other people always results in more progress being made as long as everyone involved wants to get it done. Messaging back and forth takes much more time and often leads to misunderstandings.
I'm human and much, much more partial to typing than talking. Talking is a lot of work for me, and I can't process my thinking well at all without writing.
Older people think younger people are stupid and reckless, and vice versa. And the younglings think they've "figured it out" like no one before them. But no one ever tried to understand each other in the process. Rinse and repeat.
The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that is because they had tested this over and over, but it is really hard to get right, so if they didn't fake it in some way I'd say it is revolutionary.
How far are we from something like a helmet with ChatGPT and a video camera installed? I imagine this will be awesome for low-vision people. Imagine having a guide tell you how to walk to the grocery store and help you grocery shop without an assistant. Of course you have tons of liability issues here, but this is very impressive.
We're planning on getting a phone-carrying lanyard and she will just carry her phone around her neck with Be My Eyes^0 looking out the rear camera, pointed outward. She's DeafBlind, so it'll be bluetoothed to her hearing aids, and she can interact with the world through the conversational AI.
I helped her access the video from the presentation, and it brought her to tears. Now, she can play guitar, and the AI and her can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.
That story has always been completely reasonable and plausible to me. Incredible foresight. I guess I should start a midlevel management voice automation company.
This Dutch book [1] by Gummbah has the text "Kooptip" imprinted on the cover, which would roughly translate to "Buying recommendation". It worked for me!
Just the ability to distinguish bills would be hugely helpful, although I suppose that's much less of a problem these days with credit cards and digital payment options.
Does it give you voice instructions based on what it knows or is it actively watching the environment and telling you things like "light is red, car is coming"?
With this capability, how close are y'all to it being able to listen to my pronunciation of a new language (e.g. Italian) and give specific feedback about how to pronounce it like a local?
It completely botched teaching someone to say “hello” in Chinese - it used the wrong tones and then incorrectly told them their pronunciation was good.
I think there is too much focus on tones in beginning Chinese. Yes, you should get them right, but you'll get better as long as you speak more, even if your tones are wrong at first. So rather than remembering how to say fewer words with the right tones, you'll get farther if you can say more words with whatever tones you feel like applying. That "feeling" will just get better over time. Until then, you'll talk about as well as a farmer coming in from the countryside whose first language isn't Mandarin.
As for the Mandarin tones, the model might have mixed it up with the tones from a dialect like Cantonese. It’s interesting to discover how much difference a more specific prompt could make.
I don't know if my iOS app is using GPT-4o, but asking it to translate to Cantonese gives you gibberish. It gave me the correct characters, but the Jyutping was completely unrelated. Funny thing is that the model pronounced the incorrect Jyutping plus said the numbers (for the tones) out loud.
After watching the demo, my question isn't about how close it is to helping me learn a language, but about how close it is to being me in another language.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
I was conversing with it in Hinglish (a combination of Hindi and English) which folks in urban India use, and it was pretty on point apart from some use of esoteric Hindi words, but I think with the right prompting we can fix that.
I'm a Spaniard and to my ears it clearly sounds like "Es una manzana y un plátano".
What's strange to me is that, as far as I know, "plátano" is only commonly used in Spain, but the accent of the AI voice didn't sound like it's from Spain. It sounds more like an American who speaks Spanish as a second language, and those folks typically speak some Mexican dialect of Spanish.
In Spain, it's like that. In Latin America, it was always "plátano," but in the last ten years, I've seen a new "global Latin American Spanish" emerging that uses "banana" for Cavendish, some Mexican slang, etc. I suspect it's because of YouTube and Twitch.
At least IME, and there may be regional or other variations I’m missing, people in México tend to use “plátano” for bananas and “plátano macho” for plantains.
I don't think that'd work without a dedicated startup behind it.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. Cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sounds distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs is one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.
> but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears
Interestingly, I think this isn't always true -- I was able to coach my native-Spanish-speaking wife to correctly pronounce "v" vs "b" (both are just "b" in Spanish, or at least in her dialect) before she could hear the difference; later on she developed the ability to hear it.
I had a similar experience learning Mandarin as a native English speaker in my late 30s. I learned to pronounce the ü sound (which doesn't exist in English) by getting feedback and instruction from a teacher about what mouth shape to use. And then I just memorized which words used it. It was maybe a year later before I started to be able to actually hear it as a distinct sound rather than perceiving it as some other vowel.
The content was correct but the pronunciation was awful. Now, good enough? For sure, but I would not be able to stand something talking like that all the time.
Most people don't, since you mostly either speak with native speakers or speak in English: in international teams you speak English rather than one of the native languages, even if nobody speaks English natively. So it is rare to hear broken non-English.
And note that understanding broken language is a skill you have to train. If you aren't used to it then it is impossible to understand what they say. You might not have been in that situation if you are an English speaker since you are so used to broken English, but it happens a lot for others.
This is damn near one of the most impressive things here. I can only imagine what you could do with live translation and voice synthesis (ElevenLabs style) integrated into something like Teams: select each person's language and do real-time translation into each person's native language, with their own voice and intonation. That would be NUTS.
Random OpenAI question: While the GPT models have become ever cheaper, the price for the TTS models has stayed in the $15 per 1M characters range. I was hoping this would also become cheaper at some point. There are so many apps (e.g. language learning) that quickly become too expensive at these prices. With the GPT-4o voice (which sounds much better than the current TTS or TTS HD endpoint) I thought maybe the prices for TTS would go down. Sadly that hasn't happened. Is that something on the OpenAI agenda?
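For scale, a rough back-of-the-envelope (my own assumption of ~900 characters per minute of speech, i.e. ~150 words/min at ~6 characters per word, not an OpenAI figure): $15 per 1M characters works out to about $0.0135 per minute, or roughly $0.80 per hour of generated audio, which adds up fast in an app where the user is listening for most of the session.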
I couldn’t quite tell from the announcement, but is there still a separate TTS step, where GPT is generating tones/pitches that are to be used, or is it completely end to end where GPT is generating the output sounds directly?
Very exciting, would love to read more about how the architecture of the image generation works. Is it still a diffusion model that has been integrated with a transformer somehow, or an entirely new architecture that is not diffusion based?
I've always wondered what GPT models lack that makes them "query->response" only. I've always tried to get chatbots to lose the initially needed query, to no avail. What would it take to get a GPT model to freely generate tokens in a thought-like pattern? I think when I'm alone, without a query from another human. Why can't they?
> What would It take to get a GPT model to freely generate tokens in a thought like pattern?
That’s fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You’d need a special system prompt to get anything “thought-like” that way, and, if you want it to be anything other than a stream of simulated consciousness with no relevance to anything, a non-empty “user” prompt each round, which could be as simple as the time, a status update on something in the world, etc.
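A minimal sketch of such a loop, assuming the standard openai Python SDK (>= 1.0); the system prompt and the world_update() helper here are invented for illustration, not anything OpenAI ships:

    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    history = [{"role": "system",
                "content": "You are thinking out loud to yourself. Continue your train of thought."}]

    def world_update():
        # Hypothetical grounding signal: could be a clock, a sensor reading, a news headline...
        return "The time is now " + time.strftime("%H:%M:%S") + "."

    for _ in range(5):  # a few unprompted "thought" rounds
        history.append({"role": "user", "content": world_update()})
        reply = client.chat.completions.create(model="gpt-4o", messages=history)
        thought = reply.choices[0].message.content
        print(thought)
        history.append({"role": "assistant", "content": thought})
        time.sleep(1)

The model is still doing query->response under the hood; the "thinking on its own" is entirely an artifact of the scaffolding feeding it something each round.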
Apes who've been trained since birth to use sign language, and who can give remarkably good answers to questions, have the same issue. The researchers noticed they never once asked a question like "why is the sky blue?" or "why do you dress up?". Zero initiating of conversation, but they do reply when you ask what they want.
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
Licensing the emotion-intoned TTS as a standalone API is something I would look forward to seeing. Not sure how feasible that would be if, as a sibling comment suggested, it bypasses the text-rendering step altogether.
Is it possible to use this as a TTS model? I noticed on the announcement post that this is a single model as opposed to a text model being piped to a separate TTS model.
In my ChatGPT app or on the website I can select GPT-4o as a model, but my model doesn't seem to work like the demo. The voice mode is the same as before and the images come from DALLE and ChatGPT doesn't seem to understand or modify them any better than previously.
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
Hi gdb, could you please create an assistant AI that can filter low-quality HN discussion on your comment, so that it can redirect my focus to useful stuff.
Right to whom? To me, the voice sounds like an over-enthusiastic podcast interviewer. What's wrong with wanting computers to sound like what people think computers should sound like?
I want to get to the part where phone recordings stop having slow, full sentences. The correct paradigm for that interface is bullet list, not proper speech.
It understands tonal language, and you can tell it how you want it to talk; I have never seen a model like that before. If you want it to talk like a computer you can tell it to -- they did it during the presentation -- and that is so much better than the old attempts at solving this.
You are a Zoomer sosh meeds influencer, please increase uptalk by 20% and vocal fry by 30%. Please inject slaps, "is dope" and nah and bra into your responses. Throw shade every 11 sentences.
And you’ve just nailed where this is all headed. Each of us will have a personal assistant that we like. I am personally going to have mine talk like Yoda and I will gladly pay Disney for the privilege.
People have been promising this for well over a decade now but the bottleneck is the same as it was before: the voice assistants can't access most functionality users want to use. We don't even have basic text editing yet. The tone of voice just doesn't matter when there's no reason to use it.
The Total Perspective Vortex in Hitchhiker's notably didn't do anything bad when it was turned on, and so is good evidence that inventing the torment nexus is fine.
LLMs today have no concept of epistemology, they don't ever "know" and are always making up bullshit, which usually is more-or-less correct as a side effect of minimizing perplexity.
Why did they make the woman sound like she's permanently on the brink of giggling? It's nauseating how overstated her pretentious banter is. Somewhere between condescending nanny and preschool teacher. Like how you might talk to a child who's at risk of crying so you dial up the positive reinforcement.
>The most impressive part is that the voice uses the right feelings and tonal language during the presentation.
Consequences of audio-to-audio (rather than audio->text, text->audio). Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning, among other things. And you can interrupt it freely now!
ElevenLabs isn’t remotely close to how good this voice sounds. I’ve tried to use it extensively before and it just isn’t natural. This voice from OpenAI, and even the one ChatGPT has been using, is natural.
When did you last use it? I used it a few weeks ago to create a fake podcast as a side project, and it sounded pretty good with their highest-end model and cranked-up tunings.
I did see that, though my interpretation is that breathing is included in its voice tokenizer which helps it understand emotions in speech (the AI can generate breath sounds after all). Other sounds, like bird songs or engine noises, may not work - but I could be wrong.
I suspect that like images and video, their audio system is or will become more general purpose. For example it can generate the sound of coins falling onto a table.
Allegedly Google Assistant can do the "humming" one, but I have never gotten it to work. I wish it would, because sometimes I have a song stuck in my head that I know is sampled from another song.
I asked it to make a bird noise, instead it told me what a bird sounds like with words. True audio to audio should be able to be any noise, a trombone, traffic, a crashing sea, anything. Maybe there is a better prompt there but it did not seem like it.
I was in the audience at the event. The only parts where it seemed to get snagged was hearing the audience reaction as an interruption. Which honestly makes the demo even better. It showed that hey, this is live.
I wonder when it will be able to understand that there is more than one human talking to it. It seems like even in today's demo if two people are talking, it can't tell them apart.
I was showing my wife 4o voice chat this afternoon, and we were asking it about local recommendations for breakfast places. All of a sudden…
————
ChatGPT:
Enjoy your breakfast and time together.
User:
Can you tell that it's not just me talking to you right now?
ChatGPT:
I can't always tell directly, but it sounds like you're sharing the conversation with someone else. Is [wife] there with you?
User:
My god, the AI has awoken. Yes, this is [wife].
ChatGPT:
Hi [wife]! It's great to hear from you. How are you doing?
User:
I'm good. Thanks for asking. How are you?
ChatGPT:
I'm doing well, thanks! How's everything going with the baby preparations?
—————
We were shocked. It was one of those times where it’s 25% heartwarming and 75% creepy. It was able to do this in part due to the new “memory” feature, which had memorized my wife’s name and that we are expecting. It’s a strange novelty now, but this will be totally normalized and ubiquitous quite soon. Interesting times to be living in.
That was very impressive, but it doesn't surprise me much given how good the voice mode in the ChatGPT iPhone app is already.
The new voice mode sounds better, but the current voice mode did also have inflection that made it feel much more natural than most computer voices I've heard before.
Slightly off-topic, but I noticed you've updated your llm CLI app to work with the 4o model (plus a bunch of other APIs through plugins). Kudos for working extremely fast. I'm really grateful for your tool; I tried many others, but for some reason none clicked as much as yours.
Can you tell the current voice model what feelings and tone it should communicate with? If not, it isn't even comparable; being able to control how it reads things is absolutely revolutionary, and that is what was missing from using these AI models as voice actors.
+1. Check the demo video in the OP titled "Sarcasm". The human asks GPT-4o to speak "dripping in sarcasm". The tone that comes back is spot on. Comparing that against the current voice model, it's a total sea change.
I’m not sure how revolutionary the style is. It can already mimic many styles of writing. Mimicking a cheerful, happy assistant, with associated filler words, etc., is very much in line with what LLMs are good at.
Really? I think interruption and timing in general still seems like a problem that has yet to be solved. It was the most janky aspect of the demos imo.
Yeah, the female voice especially is really impressive in the demos. The voice always sounds natural. The male voice I heard wasn't as good. It wasn't terrible, but it had a somewhat robotic feel to it.
"Right" feelings and tonal language? "Right" for what? For whom?
We've already seen how much damage dishonest actors can do by manipulating our text communications with words they don't mean, plans they don't intend to follow through on, and feelings they don't experience. The social media disinfo age has been bad enough.
Are you sure you want a machine which is able to manipulate our emotions on an even more granular and targeted level?
LLMs are still machines, designed and deployed by humans to perform a task. What will we miss if we anthropomorphize the product itself?