
sammyyyyyyy · 2026-01-09T13:39:07 1767965947

You should try it! I wouldn’t say it’s the best, far from that. But also wouldn’t say it’s terrible. If you have a 5090, then yes, you can run much more powerful models in real time. Chatterbox is a great model though

iLoveOncall · 2026-01-09T14:00:39 1767967239

> But also wouldn’t say it’s terrible.

But you included 3 samples on your GitHub video and they all sound extremely robotic and have very bad artifacts?

sammyyyyyyy · 2026-01-09T02:02:05 1767924125

Also, I didn’t want to use known voices as the example, so I ended up using generic ones from the datasets

sammyyyyyyy · 2026-01-09T01:59:14 1767923954

I should have posted the reference audio used with the examples. Honestly it doesn’t sound so different from them. Voice cloning can be from a cartoon too, doesn’t have to be from a human being

nemomarx · 2026-01-09T02:10:00 1767924600

A before / after with the reference and output seems useful to me, and maybe a range from more generic to more recognizable / celebrity voice samples so people can kinda see how it tackles different ones?

(Prominent politician or actor or somebody with a distinct speaking tone?)

Gathering6678 · 2026-01-09T02:50:04 1767927004

That is probably a good idea. I was so confused listening to the example.

sammyyyyyyy · 2026-01-09T00:53:01 1767919981

As I said, some reference voices can lead to bad voice quality. But if it sounds that bad, the reference voice probably isn’t the cause. Would love to dig into it if you want

codefreakxff · 2026-01-09T03:33:53 1767929633

I agree with the comment above. I have not logged into hacker news in _years_ but did so today just to weigh in here. If people are saying that the audio sounds great, then there is definitely something going on with a subset of users where we are only hearing garbled words with a LOT of distortion. This does not sound like natural speech to me at all. It sounds more like a warped cassette tape. And I do not mean to slight your work at all. I am actually incredibly puzzled here to understand why my perception of this is so radically different from others!

guerrilla · 2026-01-09T03:48:46 1767930526

Thank you for commenting. I wonder if this could be another situation like "the dress" (2015) or maybe something is wrong with our codecs...

Mashimo · 2026-01-09T06:48:02 1767941282

No, nothing wrong with your codecs. It sounds shitty. But given the small size and speed it's still impressive.

It's like saying .kkrieger looks like a bad game, which it does, but then again .kkrieger is only 96kb or whatever.

guerrilla · 2026-01-09T07:44:21 1767944661

How big are TTS models like this usually?

.kkrieger looks like an amazing game for the mid-90s. It's incomprehensible that it's only 96kb.

Mashimo · 2026-01-09T09:28:03 1767950883

Here is an overview: https://www.inferless.com/learn/comparing-different-text-to-...

Also keep in mind the processing time. The article above used an NVIDIA L4 with 24 GB of VRAM. Sopro claims a 7.5-second processing time on CPU for 30 seconds of audio!

If you want to get real good quality TTS, you should check out elevenlabs.io

Different tools for different goals.
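
For scale, the claimed numbers work out to a real-time factor (processing time divided by audio duration) of 0.25. A quick sketch of the arithmetic, using only the figures quoted above; the function name is made up for illustration and is not part of Sopro:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures as claimed in the thread: 7.5 s of CPU time for 30 s of audio.
rtf = real_time_factor(7.5, 30.0)
speedup = 1 / rtf
print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
# prints "RTF = 0.25, about 4x faster than real time"
```

This says nothing about quality, only speed, and the 7.5-second figure is the project's own claim rather than something measured here.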

guerrilla · 2026-01-09T00:54:41 1767920081

I mean I'm talking about the mp4. How could people possibly be worried about scammers after listening to that?

sammyyyyyyy · 2026-01-09T00:56:49 1767920209

I didn’t specifically cherry-pick those examples. You can try it for yourself anyway. But thanks for the feedback

guerrilla · 2026-01-09T01:26:13 1767921973

No shade on you. It's definitely impressive. I just didn't understand people's reactions.

jrmg · 2026-01-09T11:36:41 1767958601

It sounds like someone using an electrolarynx to me.

sammyyyyyyy · 2026-01-08T22:52:21 1767912741

No, it doesn’t.

sammyyyyyyy · 2026-01-08T22:49:19 1767912559

Yes, you are right. However, there are many upsides to this kind of technology. For example, it can restore the voices of people who were affected by numerous diseases

jacquesm · 2026-01-08T23:12:25 1767913945

Ok, that's an interesting angle, I had not thought of that, but of course you'd still need a good sample of them from before that happened. Thank you for the explanation.

sammyyyyyyy · 2026-01-08T22:32:37 1767911557

Thanks! When (and if) you do that, send me a PM!

sammyyyyyyy · 2026-01-08T21:59:52 1767909592

Yeah, we are not quite there, but I’m sure we are not far either

sammyyyyyyy · 2026-01-08T21:58:42 1767909522

This is my side “hobby”. And compute is quite expensive. But if the community’s response is good, I will definitely think about it! Btw, Chatterbox is a great model and inspiration

littlestymaar · 2026-01-09T05:21:10 1767936070

Very cool work, especially for a hobby project.

Do you have any plans to publish a blog post on how you did that? What training data and how much? Your training and ablation methodology, etc.

bicepjai · 2026-01-08T23:40:52 1767915652

Thanks. Can you share details about the compute economics you dealt with?

sammyyyyyyy · 2026-01-08T23:43:12 1767915792

Yeah sure. The training was about 250 dollars, which is quite low by today’s standards. And I spent a bit more on ablations and research

bicepjai · 2026-01-11T23:29:38 1768174178

I was on a similar path and saw my bills going over 1000 dollars as my interest in doing research and ablations grew. Then I decided to get one Blackwell Pro 6000 and I'm trying things with that :) If you have suggestions on how to manage metrics let us know. Currently trying langfuse since it's a one-click install on coolify

btbuildem · 2026-01-12T15:57:43 1768233463

Is that something that could be done on a local setup? E.g., 2x RTX 3090?

sammyyyyyyy · 2026-01-08T21:39:35 1767908375

Cool! Yeah the voice quality really depends on the reference audio. Also mess with the parameters. All the feedback is welcome