Reviving Flaubert - a chatbot from his letters

For the past few months I’ve been chatting with Gustave Flaubert. Or rather: with the voice of his letters.

What started as a small “prompt battle” game evolved into a focused experiment: take thousands of Flaubert’s letters, vectorize them, and let you talk to “him” in a grounded way. You can try it here: flaubert.samu.space.

# Why a Flaubert bot?

Letters are honest and real. They carry the stuff that novels keep hidden: impatience, gossip, deadlines, health, craft. If you want to learn how a writer thought, read their correspondence. A bot that answers in that register can be useful for students, writers, and the just-curious.

# How it works (short version)

  • Ingestion: I collected and cleaned a large set of Flaubert’s letters.
  • Vectorization: The text is embedded and stored in a local persistent vector DB (ChromaDB).
  • Retrieval: On each question, the system retrieves the most relevant letter fragments.
  • Grounded generation: The LLM answers in Flaubert’s voice, constrained by retrieved passages.
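
To make that concrete, here is a minimal sketch of the ingestion and vectorization steps. The embedding model and the letter fields are placeholders of mine for illustration, not the project’s actual code:

```python
# Sketch: embed cleaned letters and store them in a local ChromaDB collection.
# The embedding model and the letter fields below are assumptions, not the project's code.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
client = chromadb.PersistentClient(path="whoami_data/gustave_flaubert")
collection = client.get_or_create_collection("flaubert_letters")

letters = [
    {"id": "letter-0001", "text": "Cleaned letter text goes here.",
     "recipient": "Louise Colet", "date": "1852-04-24"},
    # ...thousands more cleaned letters
]

collection.add(
    ids=[l["id"] for l in letters],
    documents=[l["text"] for l in letters],
    embeddings=model.encode([l["text"] for l in letters]).tolist(),
    metadatas=[{"recipient": l["recipient"], "date": l["date"]} for l in letters],
)
```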

Under the hood, this is a simple retrieval-augmented generation (RAG) stack. It uses a pre-filled store (whoami_data/gustave_flaubert) with a flaubert_letters collection and a small wrapper that keeps query embeddings aligned with the embeddings used to build the index. Keeping it local makes iteration fast and cheap.
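
At query time, that wrapper amounts to embedding the question with the same model used for the index, so the vectors are comparable, and asking Chroma for the nearest fragments. Roughly, continuing the sketch above (names are illustrative):

```python
# Sketch: retrieve the letter fragments most relevant to a question.
# Reuses the same `model` and `collection` as the ingestion sketch so
# query vectors align with the index.
def retrieve(question: str, k: int = 5) -> list[str]:
    query_vec = model.encode([question]).tolist()
    result = collection.query(query_embeddings=query_vec, n_results=k)
    return result["documents"][0]  # the k best-matching letter fragments

passages = retrieve("How do you work? Describe your writing routine.")
```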

# Voice mode (Chrome only)

There is an optional voice mode powered by ElevenLabs. In Chrome on desktop, you can enable a speaker icon to have Flaubert’s replies read out loud in a deep, resonant voice that fits 19th-century prose. It starts reading as the text streams in, so you can listen without waiting for the full response. If audio stalls, a quick refresh usually fixes it.
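
One way to drive this kind of playback is ElevenLabs’ streaming text-to-speech endpoint. The sketch below is only a rough illustration of that API call under my own assumptions (placeholder voice ID, assumed model), not how the site actually wires audio into the page, which happens in the browser as the reply streams:

```python
# Sketch: stream ElevenLabs text-to-speech audio for a chunk of reply text.
# VOICE_ID and the model name are placeholders; the real app plays audio in
# the browser instead of writing a file.
import os
import requests

VOICE_ID = "placeholder-voice-id"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

response = requests.post(
    url,
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"], "accept": "audio/mpeg"},
    json={"text": "Reply text goes here.", "model_id": "eleven_multilingual_v2"},
    stream=True,
)
response.raise_for_status()
with open("reply.mp3", "wb") as out:
    for chunk in response.iter_content(chunk_size=4096):
        out.write(chunk)
```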

This feature currently works best in Chrome. Other browsers may work, but support is experimental.

# From game to conversation

The project began as a guessing game where you had to identify a historical figure from clues. That was fun, but the most interesting experience was lingering in a single mind. So I removed the win-condition mechanics and built a straight chat mode. No guesses, no counters, just talk.

This change also simplified the prompts and reduced failure modes. Once the system knows it is always “Flaubert in letters,” it can focus on retrieval quality and concise answers.
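
In practice, “always Flaubert in letters” boils down to a fixed system prompt plus whatever retrieval returns. Something in this spirit, with the wording and message format as my own sketch rather than the project’s exact prompt:

```python
# Sketch: build a grounded prompt from retrieved passages and a fixed persona.
# The prompt wording and message format are assumptions for illustration.
SYSTEM_PROMPT = (
    "You are Gustave Flaubert, answering as in your letters. "
    "Ground every answer in the excerpts provided. "
    "If the excerpts do not cover the question, say so briefly."
)

def build_messages(question: str, passages: list[str]) -> list[dict]:
    context = "\n\n---\n\n".join(passages)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Excerpts from your letters:\n{context}\n\nQuestion: {question}"},
    ]
```

With the persona fixed, there is only one prompt to maintain, and most failures can be traced back to retrieval or the corpus rather than to prompt wording.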

# What I learned

  • Chunking matters: Letters are uneven. Paragraph-aware chunks worked better than fixed token windows.
  • Metadata helps: Dates and recipients are strong retrieval signals. Queries about travel vs. health vs. process benefit from light filtering.
  • Less prompt, more data: The best improvements came from cleaning, deduplicating, and adding context rather than crafting clever prompts.
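
To illustrate the first two points, paragraph-aware chunking and a light metadata filter can be as small as the sketch below; the size threshold and field names are assumptions on my part:

```python
# Sketch: group whole paragraphs into chunks, then filter retrieval by metadata.
# The ~1200-character threshold and the metadata fields are assumptions.
def chunk_letter(text: str, max_chars: int = 1200) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):  # keep paragraphs intact
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Light filtering at query time, e.g. restrict to one correspondent.
filtered = collection.query(
    query_embeddings=model.encode(["How is the novel going?"]).tolist(),
    n_results=5,
    where={"recipient": "Louise Colet"},
)
```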

# Limitations (honest ones)

  • It’s not a biography oracle. If a topic isn’t in the letters, the bot won’t know.
  • The “voice” is bounded by what retrieval surfaces; sometimes that’s dry or repetitive.
  • Translations vary in tone. Source diversity leaks into answers.

# Try it

If you ask about work habits, travel, health, money, or editorial fights, you’ll usually get rich answers. If you ask about plot specifics or third-party criticism, results are mixed, and that’s ok. The bot reflects its corpus.

Give it a spin: flaubert.samu.space.

Written on October 9, 2025

If you notice anything wrong with this post (factual error, rude tone, bad grammar, typo, etc.), and you feel like giving feedback, please do so by contacting me at hello@samu.space. Thank you!