I've shipped all three in real projects: a Python script that emailed me a daily MP3 digest with gTTS, a side-panel reader on the browser's Web Speech API, and a couple of things wired into hosted neural TTS. They get lumped together as "text-to-speech," but under the hood they're three different animals — one is a wrapper around a Google endpoint, one is whatever voices your operating system happens to ship, and one is a GPU somewhere generating a waveform from scratch. Pick the wrong one and you'll spend an afternoon fighting a problem the other tool doesn't even have. Here's the honest breakdown, the gotchas I hit, and a simple rule for choosing.
What each one actually is
This is the part most comparisons skip, and it decides everything.
gTTS (gtts) is a small Python library that does not contain a TTS engine. It's a thin client that builds a request to the same unofficial endpoint Google Translate uses for its little speaker icon, and hands you back an MP3. That's the whole trick. So "gTTS quality" is really "Google Translate's TTS quality," and "gTTS reliability" is really "is that undocumented endpoint up today." You get a file — genuinely useful — but you're a guest on someone else's API.
Browser TTS is the Web Speech API, specifically window.speechSynthesis and SpeechSynthesisUtterance. The browser doesn't synthesize anything itself either — it calls out to whatever TTS voices your operating system provides (and, on Chrome, sometimes Google's online voices). So the exact same line of JavaScript produces a different voice on macOS, Windows, Android, and iOS, and on a stripped-down Linux box it may produce nothing at all. It runs client-side, for free, with zero network round-trip for the local voices.
Neural TTS is the modern deep-learning kind — a model that generates a brand-new waveform rather than playing back clips or relying on the OS. This is the family behind the voices that make you do a double-take. It almost always runs as a hosted API (you send text, you get audio) because the good models are too heavy to run casually on a laptop. I wrote up how neural voices actually work separately if you want the mel-spectrogram-and-vocoder version.
Three different machines. Now the tradeoffs.
gTTS: great for scripts, quietly fragile
I reach for gTTS when I want an audio file out of a Python script and I don't want to think hard. Two lines and you've got an MP3:
from gtts import gTTS
# Synthesize the sentence and write the result to done.mp3
gTTS("Build finished. Twelve tests passed.", lang="en").save("done.mp3")
For cron jobs, build notifications, quick voice prompts, or turning a scraped article into something to hear on a walk, it's perfect. It speaks many languages, the output is clean for plain prose, and you own the file.
The honest cons, all of which bit me:
- It needs the internet, every time. There's no local fallback. No connection, or the endpoint hiccups, and your script throws. I've had overnight jobs die at 3 a.m. because the request timed out.
- It's an unofficial endpoint with no SLA. Google can rate-limit you, change the response shape, or break it, and you have no recourse. Hammer it in a loop and you'll see 429-style failures; people routinely add time.sleep() between calls or batch text into bigger chunks.
- No SSML, basically no control. You can nudge with slow=True and pick a TLD for accent flavor, but you can't set pitch, fine-grained rate, emphasis, or pauses. One pace, one way.
- The voice is fine, not 2026-impressive. It's clearly synthetic: serviceable for a notification or a rough listen, behind a good neural voice if you'll sit with it for an hour. (Long inputs also get split into multiple requests and stitched, which can add odd seams.)
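The retry-and-sleep habit from the list above can be sketched in a few lines. This is a minimal version under my own assumptions, not anything gTTS ships: with_retries and save_mp3 are hypothetical names, the attempt count and delays are arbitrary, and the broad except is there only because I'm not pinning down which exception the library raises on a throttled request.

```python
import random
import time


def with_retries(task, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, backing off between failed attempts."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:  # broad on purpose for this sketch; narrow it in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus a little jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


def save_mp3(text, path, lang="en"):
    """Hypothetical helper: one gTTS save wrapped in the retry loop."""
    # Imported lazily so with_retries can be used and tested without gTTS installed.
    from gtts import gTTS

    return with_retries(lambda: gTTS(text, lang=lang).save(path))
```

Calling save_mp3("Build finished.", "done.mp3") then behaves like the two-liner, except that a transient failure costs a short wait instead of a dead job. It still won't help if the endpoint is down for hours.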
Don't use gTTS when: you need offline, you're calling it thousands of times (you'll get throttled), you need real prosody control, or it's user-facing and downtime is unacceptable. Fantastic personal/automation tool; shaky production dependency.
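If throttling is the problem, batching is the other lever: fewer, larger calls instead of one call per sentence. A rough sketch, assuming a hypothetical batch_text helper and an arbitrary 1000-character budget (neither comes from gTTS, and the library may still split long text internally as noted above):

```python
def batch_text(pieces, max_chars=1000):
    """Join short strings into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue  # skip empty or whitespace-only pieces
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars or not current:
            # Fits in the current chunk, or starts a new one.
            # A single oversized piece is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then become one save call, with a pause between them, so a hundred headlines turn into a handful of requests rather than a hundred.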
Browser TTS: free, instant, gloriously inconsistent
The Web Speech API is the one I love for the right job and warn people about for the wrong one. The pitch is unbeatable: zero cost, no API key, no server, and for local voices no network — it speaks the instant you call it.
// Build an utterance, adjust the delivery, then queue it for speaking
const u = new SpeechSynthesisUtterance("Hello from the browser.");
u.rate = 1.1; // slightly faster than the default of 1
u.pitch = 1; // the default pitch
speechSynthesis.speak(u);
And unlike gTTS, you get knobs: rate, pitch, volume, plus onboundary events to highlight words as they're spoken. That's the synchronized read-along effect, free, no audio file required.
Now the gotchas, in the order they'll trip you:
- getVoices() returns [] on first load because the voice list populates asynchronously. The single most common Web Speech bug. The list isn't ready when the page loads, so your first call grabs nothing. Wait for the voiceschanged event before picking a voice. Miss this and it "randomly doesn't work" on refresh.
- Voices are entirely OS-dependent. You hear whatever the user's OS installed. The premium macOS voices aren't on Windows; Android and iOS each have their own set. You can't guarantee a specific voice for every user; you can only ask for a lang and hope something decent answers.
- It's flaky in practice. Chrome notoriously stops long utterances after ~15 seconds unless you keep it alive (the classic pause()/resume() ping), and behavior differs across browsers. Long-form reading needs babysitting.
- Quality is all over the map. Some bundled OS voices are genuinely good now; others still sound like a 2009 GPS.
- It's online for the good Chrome voices. The best-sounding ones are network-backed, so "no network needed" only holds for the basic local set.
Don't use browser TTS when: you need the same voice and quality for every user, you need a downloadable MP3 (the API speaks, it doesn't easily export a file), you need rock-solid long-form playback, or you're targeting locked-down environments where voices may be absent. It's brilliant for a quick "read this page aloud" feature and frustrating as a guaranteed-consistent product surface.
Neural TTS: the quality tier, at a price
When the voice has to be good — content people will sit with, anything that represents you — neural TTS is the answer. The leap from gTTS or an average OS voice to a strong neural voice is the difference between "that's a robot reading" and "wait, is this synthetic?" It also brings real SSML/control, consistent output across every platform (it's your server, not the user's OS), and downloadable audio.
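To make "real SSML/control" concrete, here is a small sketch that builds the kind of markup a hosted engine typically accepts: a speaking rate via prosody and an explicit pause via break. The to_ssml helper and its defaults are my own illustration, and which elements and values a given provider honors is something to check in its docs.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(sentences, rate="95%", pause_ms=400):
    """Wrap sentences in SSML with one speaking rate and a pause after each."""
    parts = []
    for sentence in sentences:
        # Escape &, < and > so plain text can't break the markup.
        parts.append(f'{escape(sentence)}<break time="{int(pause_ms)}ms"/>')
    body = "".join(parts)
    return f"<speak><prosody rate={quoteattr(rate)}>{body}</prosody></speak>"
```

The resulting string is what you'd send in place of plain text when the API has an SSML input mode. This is exactly the pause and rate control that gTTS has no way to express.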
The honest costs:
- It's not free and not local. You're paying per character or per minute, and you're sending text to a third party — which matters if the text is sensitive.
- There's latency. Generation takes real compute, so for long documents you're waiting, or streaming chunks.
- Pricing adds up fast at scale. As 2026 reference points to verify before you commit: ElevenLabs sits around $5/mo (Starter) to $99/mo (Pro) with character caps per tier; Google Cloud and Amazon Polly bill per character (roughly $4–$16 per million characters depending on the voice class, with a monthly free allowance). Read a whole book through a premium neural API and the bill is real.
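The per-character math is simple enough to run before you commit. A back-of-the-envelope sketch using the $4 to $16 per million range quoted above; the estimate_cost name, the 50-million-character monthly volume, and the zero default for the free allowance are all assumptions to replace with your own numbers:

```python
def estimate_cost(characters, usd_per_million, free_characters=0):
    """Rough bill: characters beyond the free allowance times the per-million rate."""
    billable = max(0, characters - free_characters)
    return billable / 1_000_000 * usd_per_million


monthly_chars = 50_000_000  # hypothetical product volume, not a measured figure
low = estimate_cost(monthly_chars, usd_per_million=4)
high = estimate_cost(monthly_chars, usd_per_million=16)
print(f"${low:.2f} to ${high:.2f} per month")
```

Pass your provider's actual allowance as free_characters to see how much of a small workload it absorbs.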
The honest decision rule
After all of it, here's what I actually do:
- Personal script that needs an audio file, plain prose, offline not required → gTTS. Fastest path to an MP3 in Python. Just add retries and a sleep so the endpoint doesn't slap you.
- A "read this aloud" button on a web page, free, instant, consistency-across-users doesn't matter → browser TTS. Just handle voiceschanged and keep long utterances alive.
- Voice represents your product, must sound great, must be identical everywhere → neural TTS, and budget for it.
- You're the end user and you just want your reading read to you well, everywhere, without writing any of this → don't build it at all. Use a finished reader.
That last bullet is the one most people in a "gTTS vs browser TTS" rabbit hole actually need. If your real goal isn't building TTS but using it — to get through PDFs, ebooks, docs, and long articles — wiring up any of these yourself is the slow path.
Where CastReader fits
CastReader is the finished version of that last option, and it's free to use — any text read aloud in a natural voice, no signup. It pairs natural neural voices with the read-along highlighting you'd otherwise hand-build on onboundary events, and crucially it fetches your content where it lives instead of making you paste text into a box. If you want more, CastReader Pro adds premium ultra-realistic voices, more listening hours, and AI document analysis.
In practice it reads the things developers and researchers actually pile up: a Kindle book in the browser, a long Claude or ChatGPT thread, a Gemini answer, Google Docs and Notion pages, an arXiv paper, and ordinary web articles. You can turn a PDF into an audiobook or an EPUB into audio, and when I'm coding I'll read VS Code aloud so the voice handles the comments and prose while my eyes stay on the code. It runs as a Chrome/Edge extension, a Mac app, and iOS/Android apps, so it follows you off your desk.
If you're weighing the paid leaders instead, we keep honest side-by-sides: a Speechify alternative and a NaturalReader alternative.
You can install the extension from the Chrome Web Store, grab the apps on the App Store and Google Play, or run the Mac app. Questions or edge cases? Email support@castreader.ai — a real person answers.
Frequently asked questions
Is gTTS free?
Yes — the library is free, open-source, and needs no API key. But it depends on an unofficial Google endpoint, so it can rate-limit or break without notice, and it needs the internet every time. Fine for personal scripts; not something to build a paid product on.
Why does my Web Speech API code work sometimes and fail on refresh?
Almost always the getVoices() timing bug. The voice list loads asynchronously and is empty on the first call right after page load. Wait for the voiceschanged event (or poll until getVoices() returns a non-empty array) before selecting a voice, and the "random" failures stop.
Can gTTS or browser TTS run offline?
Not really. gTTS always needs the network — no local engine at all. Browser TTS can use offline OS voices, but the best-sounding Chrome voices are online, so without a connection you only get the basic local set.
Which sounds the most human?
Neural TTS, clearly. gTTS is plainly synthetic, OS voices range from decent to dated, and a strong neural voice is the one that makes you forget you're listening to a machine — at the cost of being metered and cloud-based.
I don't want to build any of this — what should I use?
A finished reader. If your goal is to listen to your PDFs, ebooks, and docs rather than to build a TTS pipeline, use a free reader like CastReader — neural voices and read-along highlighting, nothing to wire up.
The bottom line
gTTS, browser TTS, and neural TTS aren't competitors so much as three tools for three jobs: gTTS when a Python script needs an MP3 and an unofficial, online-only dependency is fine; the Web Speech API for a free, instant "read aloud" button where per-user consistency doesn't matter; neural TTS when the voice has to be great and identical for everyone, paid. And if you only ever wanted to listen and not to build, skip the code and let a free reader do it.