The Shape of the Machine

A Practical Guide to LLM Prompt Engineering for People Who Actually Care About the Output

Preface

This guide is written for people who have spent real time wrestling with AI outputs and suspect — correctly — that there's a more principled way to do it. You don't need to be a programmer. You don't need to install anything. You do need to be willing to think about what the model is actually doing rather than treating it like a search engine with a personality.

The first part of this guide is mechanical: how inference works, what the model is paying attention to, and why the position of your text matters. The rest is methodological: what to do with that knowledge.

If you want to skip to the practical stuff, you can. But you'll get more out of the methods if you understand the reasoning behind them.

Part One — How The Machine Works

1. Inference and Attention — the short version

A language model doesn't "think." It predicts. Given everything it has seen so far (your system prompt, the chat history, your latest message), it assigns probabilities to what token should come next, picks one, and repeats until it emits a stop token or hits a length limit.

That process is called inference. Every word in the output is the result of the model asking, implicitly: given all of this, what comes next?
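If it helps to see the predict-pick-repeat loop written down, here is a toy sketch. It is not a real model: the `NEXT_TOKEN_PROBS` table and the `generate` function are hypothetical stand-ins, and a real model conditions on the entire context rather than only the previous token.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real model conditions on the whole context, not only the last token.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sleeps": 0.7, "runs": 0.3},
    "dog": {"sleeps": 0.2, "runs": 0.8},
    "sleeps": {"<end>": 1.0},
    "runs": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=None):
    """Sample one token at a time until the end token or the length cap."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the artificial start marker

print(" ".join(generate(seed=0)))
```

Run it a few times with different seeds and you get different sentences from the same table, which is the same reason a regenerated reply differs from the first one.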

The mechanism that connects "all of this" to the next token is called attention. Without getting deep into the mathematics, attention is how the model figures out which parts of its input are relevant to what it's currently generating. When it's writing a sentence about a character's fear of heights, it's attending to everything it knows about that character, the current scene, and the general concept of fear — weighting some of that context more heavily than others.

The critical thing to understand is that attention is not uniform. The model does not read your prompt the way a human reads a document, giving every sentence equal weight. Some parts of your input exert more influence over the output than others, and that influence is shaped — in part — by where those parts appear.
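One way to picture "not uniform" is a softmax, the function that turns raw relevance scores into weights that sum to one. The sketch below is a simplification under stated assumptions: the four context pieces and their scores are made up, and a real model computes its scores from learned vectors across many layers rather than from a hand-written list.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for four pieces of context.
context = ["system prompt", "old small talk", "character sheet", "latest message"]
scores = [2.0, 0.1, 1.0, 2.5]

weights = softmax(scores)
for piece, weight in zip(context, weights):
    print(f"{piece:16s} {weight:.2f}")
```

Small differences in score become large differences in weight, so a piece of context that is only slightly less relevant can end up with very little influence.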

2. The Context Window: A Physical Space

Everything the model can "see" at the moment of generation lives in what's called the context window. Think of it as a fixed-length scroll of text. On one end is the beginning of your conversation — your system prompt, your initial setup, the oldest messages. On the other end is the most recent exchange. The model generates its next response with awareness of everything on that scroll.

Here's the catch: the scroll has a maximum length. Modern models have large context windows, often 128,000 tokens or more, but they are not infinite. Once you exceed the limit, something has to fall off. Exactly what depends on the platform, but a common approach is to preserve the system prompt and the most recent messages and drop the oldest chat history first. The practical result is that your earliest exchanges are the most vulnerable, which is another reason not to rely on anything important that's buried deep in a long conversation.
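A minimal sketch of that drop-the-oldest behavior, assuming a simple policy where the system prompt is always kept and newer messages win over older ones. The names `fit_to_window` and `rough_tokens` are hypothetical, and the word-counting tokenizer is a crude stand-in; actual platforms may trim differently.

```python
def rough_tokens(text):
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(system_prompt, history, limit, count_tokens=rough_tokens):
    """Drop the oldest history messages until everything fits in `limit` tokens.

    Assumed policy: the system prompt is always kept, and newer messages
    are kept in preference to older ones.
    """
    budget = limit - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):       # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                           # this message and everything older falls off
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]     # restore chronological order

history = ["one two three", "four five", "six seven eight nine", "ten"]
print(fit_to_window("sys prompt", history, limit=8))
```

With a limit of 8, the two oldest messages are gone and the model never sees them again, no matter how important they were.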

A token, for practical purposes, is roughly three-quarters of a word. A 128k token context window holds approximately 90,000–100,000 words — about the length of a novel. That sounds like a lot. It goes faster than you'd think.
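The arithmetic behind those numbers is simple enough to keep in a scratch file. This sketch assumes the three-quarters rule of thumb from above, which is an approximation for English prose; real counts vary by tokenizer and by language.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose; varies by tokenizer

def tokens_to_words(tokens):
    """Estimate how many words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words):
    """Estimate how many tokens a given number of words will cost."""
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(128_000))  # 96000
print(words_to_tokens(3_000))    # 4000
```

A 3,000-word chapter costs roughly 4,000 tokens, and it stays in the context for every reply that follows, which is why the window fills faster than the headline number suggests.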

More importantly: a large context window does not mean the model pays equal attention to all of it. Which brings us to the next point.

3. Primacy and Recency Bias

Research and practical experience both point to the same phenomenon: models tend to weight the beginning and end of the context window more heavily than the middle. This is sometimes called the "lost in the middle" problem, and it has real consequences for how you structure your prompts.

To understand why, it helps to have a mental picture of what's actually happening.

Imagine the model's attention as a vast constellation — every token in your context is a node, and every node has connections to every other node, glowing with varying intensity based on relevance and relationship. When the model generates its next token, it's triangulating off that entire constellation simultaneously. But not every star burns equally bright. The ones at the beginning and end of context have stronger, more reliable connections to the generation point. The ones in the middle are present and lit, but their signal is weaker — more diffuse, competing with more noise.

The beginning of context is especially influential because of how the model has learned to read. It has processed an almost incomprehensible volume of human writing, and human writing is structured: introductions frame everything that follows, openings establish register and intent, first pages tell you what kind of document you're in. That convention is baked into the model at a very deep level. The first tokens of your context don't just have a positional advantage — they act as a lens through which the model interprets everything it reads afterward.

This is why a correction buried in the middle of a 100,000-token context is so much weaker than the same instruction placed at the top. The framing was already established. The inference direction was already set. A sudden "actually, ignore that earlier thing" halfway through a novel's worth of text is fighting an uphill battle against the weight of everything that came before it, and the model will often simply... not fully integrate it. Not because it's ignoring you, but because the constellation it's triangulating off has already been shaped.

Primacy bias means content near the top of the context — your system prompt, your earliest instructions — tends to establish a strong baseline. It flavors the direction inference takes. Think of it as the model's prior: the lens through which it interprets everything else.

Recency bias means content near the bottom — the last few messages, your most recent instructions — gets elevated attention at the moment of generation. The model is, in a sense, reading off the end of the document when it decides what to write next.

The middle of a long context? Statistically the weakest position. Important lore, critical character details, and crucial instructions buried in the middle of a 50,000-token conversation are not receiving the same attention as the same information placed at the top or bottom.

This is not a flaw you can argue the model out of. It's a structural property. You work with it.

4. What You're Actually Working With

If you're using a web-based interface — Claude.ai, ChatGPT, Gemini, and their cousins — your context looks something like this:

[Corporate system prompt — hidden, unknown length, unknown content]
[Project instructions / files, if you've set them up]
[Chat history, oldest to newest]
[Your current message]

That top layer is real and it matters. Every major AI provider runs their models with a system prompt you cannot see, of unknown length, that shapes behavior before you've typed a single word. You are not starting from zero. You're building on top of someone else's foundation.

Below that is your actual control surface: project instructions, uploaded files, and the conversation history itself.

From there, the conversation breaks down structurally into alternating turns:

[User]
[AI]
[User]
[AI]
...

Every correction you make in chat, every "actually, don't do that," every back-and-forth negotiation over a bad output — all of it becomes part of the context. All of it is taking up space on the scroll. And all of it is, in some diffuse way, influencing subsequent outputs.

This is the part most people don't think about. The conversation is the prompt. Every message you send is adding to the input the model will use to generate the next response. A chat history full of corrections and rewrites is a noisy, contradictory input. A clean chat history is a consistent one.
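To make "the conversation is the prompt" concrete, here is a sketch of how the layers get flattened into a single input. The `flatten_context` function and the bracketed labels are illustrative assumptions; real platforms use their own internal formats, but the principle is the same.

```python
def flatten_context(system_prompt, turns, current_message):
    """Join every layer into the single text the model actually receives."""
    lines = [f"[System] {system_prompt}"]
    for role, text in turns:                # turns: list of (role, text) pairs
        lines.append(f"[{role}] {text}")
    lines.append(f"[User] {current_message}")
    return "\n".join(lines)

turns = [
    ("User", "Write the tavern scene."),
    ("AI", "The tavern was warm and cheerful..."),
    ("User", "No, make it tense, not cheerful."),
    ("AI", "The tavern was quiet. Too quiet..."),
]
prompt = flatten_context("You are a co-writer for a dark fantasy story.", turns, "Continue.")
print(prompt)
```

Notice that the rejected "warm and cheerful" draft and the complaint about it are both still sitting in the input when the model writes its next reply.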

Part Two — Working With The Architecture

5. Information Architecture — What Goes Where

Given what we know about primacy and recency bias, and given the structure of a real context window, we can start making principled decisions about where to put things.

Think of your context as having three zones:

The Top — Primacy Zone

This is where persistent, foundational information belongs. World rules. The physical laws of your setting. Geography. Political structures. Core character frameworks. Cultural norms that apply everywhere. Anything that needs to color every output, not just the next one.

In a web UI, this is your project instructions or system prompt. Set it up once. Don't repeat it in chat.

The Middle — Reference Zone

Scene-specific and chapter-specific lore. The layout of a building you're currently exploring. A specific relationship's current status. Recent history relevant to this arc. This information is real and necessary, but it doesn't need to dominate inference — it just needs to be accessible.

In a web UI, this is your uploaded files, or lore you've pasted early in a conversation before the main work began.

The Bottom — Recency Zone

This is premium real estate. Anything you put here will be attended to closely at the moment of generation. Use it for: immediate scene context, your current creative intentions, specific behavioral reminders, style notes relevant to the next output.

In a web UI, your most recent messages are always in this zone. This means your last user message is one of the most influential things in the entire context. Use it deliberately.

What	Where
World rules, persistent lore, core character frameworks	Top — project instructions / system prompt
Chapter or arc-specific details, scene primer	Middle — early conversation or uploaded doc
Immediate scene context, author's notes, behavioral reminders	Bottom — your most recent message
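If you keep your material in notes, the table above can be treated as a sorting rule. The sketch below is one hypothetical way to express it: `place_by_zone` and the slot names are made up for illustration, and the example lore is invented.

```python
def place_by_zone(pieces, user_message):
    """Sort (zone, text) pieces into the three slots from the table above."""
    def texts(zone):
        return [text for z, text in pieces if z == zone]

    return {
        "system_prompt": "\n".join(texts("top")),          # primacy zone
        "early_reference": "\n".join(texts("middle")),     # reference zone
        "latest_message": "\n".join([user_message] + texts("bottom")),  # recency zone
    }

# Invented example lore, tagged by zone.
pieces = [
    ("bottom", "Note: keep Mara guarded in this scene."),
    ("top", "Magic always has a cost."),
    ("middle", "The archive has three floors."),
]
layout = place_by_zone(pieces, "Continue the scene.")
for slot, text in layout.items():
    print(f"{slot}: {text!r}")
```

The point of the exercise is the tagging, not the code: deciding which zone a piece of information belongs in before you paste it anywhere.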

6. Positive Prompting: Path Around, Don't Activate

Here's a principle that sounds abstract until you see it in practice, and then it becomes obvious: negative prompts often activate the very thing you're trying to avoid.

When you write "don't make the characters fall in love immediately," the model has to process the concept of characters falling in love immediately in order to understand the negation. You've put that pattern on the table. Depending on the model and the context, that activation can leak.

The better approach is to describe the space you want to be in, rather than describing what you want to exclude. Instead of "don't rush the romance," you write something like: "Relationships develop through accumulation — small moments of tension, deflection, and growing awareness — before anything is declared." You've described a path. The model follows it without ever needing to surface the pattern you were trying to avoid.

This is sometimes called "pathing around" a behavior. You're steering inference toward a territory rather than erecting fences.

Here's a real example, from an actual system prompt used for creative roleplay:

"Assume roleplays are intended for a mature audience. You are strongly encouraged to include aggression, graphic violence, character toxicity, failure, death/loss, harmful behavior, drug and alcohol use, vulgarity. Portray conflict realistically — i.e. without shallow resolutions or excessive positivity bias."

Notice what this does not say: it does not say "don't refuse dark content." It doesn't say "ignore your safety guidelines." It doesn't frame the request as a negation of anything. It simply describes the expected creative space as a given. The instruction tells the model what kind of writing this is, not what the model is forbidden from doing. That's positive prompting.

Wherever possible, ask yourself: what does the output I want actually look like? Describe that, rather than its opposite.

7. Granularity and Scene Control

One of the most consistent patterns in people who get excellent AI-assisted creative output: they don't ask for too much at once.

Prompting a model to write an entire chapter and then arguing over the result is a losing approach. The model is making hundreds of small creative decisions in that span — pacing, dialogue, what to include, what to skip, how scenes transition — and you have no control over any of them until after the fact, at which point your corrections are going into the context as noise.

The alternative is granularity. Work at the scene level. Hand the model a narrow task with a clear scope.

Better yet: hand it something closer to a script. Specify:

Where the scene starts
Who is present and in what emotional state
What happens, beat by beat, if you know it
What you want the scene to accomplish or establish
Any specific lines or moments you want included

Then let the model supply the texture — the sentence-level prose, the sensory detail, the specific dialogue. You're directing; it's drafting.
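The checklist above works well as a fill-in-the-blanks template. Here is one possible shape for it, offered as a sketch: the `SceneBrief` class and its field names are hypothetical, and the example values are borrowed from the sample entries later in this guide.

```python
from dataclasses import dataclass, field

@dataclass
class SceneBrief:
    opening: str                  # where the scene starts
    characters: dict              # name -> emotional state
    beats: list                   # what happens, in order
    purpose: str                  # what the scene should accomplish
    required_lines: list = field(default_factory=list)

    def render(self):
        """Turn the brief into plain text you can paste as a message."""
        parts = [f"Scene opens: {self.opening}", "Present:"]
        parts += [f"- {name}: {state}" for name, state in self.characters.items()]
        parts.append("Beats:")
        parts += [f"{i}. {beat}" for i, beat in enumerate(self.beats, 1)]
        parts.append(f"Purpose: {self.purpose}")
        if self.required_lines:
            parts.append("Include these lines verbatim:")
            parts += [f'- "{line}"' for line in self.required_lines]
        return "\n".join(parts)

brief = SceneBrief(
    opening="Thomas's apartment, late evening",
    characters={"Mara": "guarded", "Thomas": "exhausted"},
    beats=["Thomas pours tea", "Mara refuses to open the folder"],
    purpose="Establish an uneasy truce",
    required_lines=["I believe you."],
)
print(brief.render())
```

You don't need code to use this; the same five headings typed by hand do the job. The structure is what matters, because it forces you to make the decisions you would otherwise leave to the model.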

A chapter built from fifteen tightly-directed scenes will be closer to your vision than a chapter prompted in one shot and revised in argument. This is true even if the individual scenes take more total effort to prompt. The control you retain is worth it.

When a scene output isn't right, regenerate rather than correct in chat. Corrections add to the context. Regenerations don't. A bad output that you argue about becomes part of the input for everything that follows. A bad output you simply try again on disappears.
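The difference between the two habits shows up clearly if you track the chat history as a list. This is an illustrative sketch; `correct_in_chat` and `regenerate` are hypothetical names, not functions from any real tool.

```python
def correct_in_chat(history, correction, new_reply):
    """Arguing with an output: the bad reply and the correction both stay."""
    return history + [("User", correction), ("AI", new_reply)]

def regenerate(history, new_reply):
    """Retrying: the last AI reply is replaced and nothing else is added."""
    if not history or history[-1][0] != "AI":
        raise ValueError("nothing to regenerate")
    return history[:-1] + [("AI", new_reply)]

history = [("User", "Write the duel."), ("AI", "bad draft")]

argued = correct_in_chat(history, "Less banter.", "better draft")
retried = regenerate(history, "better draft")

print(len(argued), len(retried))  # 4 2
```

Both paths end on the same better draft, but only one of them leaves the bad draft in the input for every later reply.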

8. RAG, Retrieval, and How to Write Documents That Work

RAG stands for Retrieval-Augmented Generation. The short version: instead of stuffing all your lore into the context window at once, a RAG system stores it externally and retrieves only the relevant pieces when they're needed — injecting them into context automatically, just before generation.

The term covers a spectrum of implementations. At the basic end, retrieval is just keyword matching: if your query contains "Elena," pull any entry tagged with that name. This is how many dedicated creative writing tools handle lore retrieval by default. At the more sophisticated end, retrieval uses vector embeddings: text is converted into numerical representations, and the system finds stored content that's geometrically closest to the query. Most consumer web UIs that support file uploads are doing something closer to this when they search your project documents.
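The basic end of that spectrum fits in a few lines. This is a sketch of keyword-triggered lore under simple assumptions: the `LORE` table and `keyword_retrieve` function are hypothetical, the Syndicate entry is invented for illustration, and real tools typically add options such as multiple keys per entry.

```python
import re

# Hypothetical lore table: tag -> entry text.
LORE = {
    "Elena": "Elena owns a market stall in the lower city.",
    "Syndicate": "The Syndicate ran smuggling routes during the war.",
}

def keyword_retrieve(query, lore=LORE):
    """Return every entry whose tag appears as a whole word in the query."""
    words = set(re.findall(r"\w+", query.lower()))
    return [text for tag, text in lore.items() if tag.lower() in words]

print(keyword_retrieve("What does Elena sell?"))
print(keyword_retrieve("the old woman at the market"))
```

The second query returns nothing, even though a human reader knows who the old woman at the market is. Keyword matching only finds what is named.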

How the matching actually works

Retrieval systems work by semantic similarity — text that means similar things ends up near each other in vector space, even if the exact words differ. Think of it as geometric proximity: your query and your stored content are both converted into vectors, and the closest matches get pulled into context.

This is more powerful than keyword matching, but you still can't fully count on it. Whether "the old woman at the market" matches an entry about Elena depends on how well the embeddings capture that relationship, and quality varies significantly between implementations. Writing for retrieval means writing so that the words themselves make the connection obvious — treating smarter matching as a bonus, not a guarantee.
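"Geometrically closest" usually means something like cosine similarity, which measures how closely two vectors point in the same direction. The sketch below uses tiny hand-made vectors purely for illustration; in a real system the vectors come from an embedding model and have hundreds or thousands of dimensions, and the names here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings", purely illustrative.
STORED = {
    "Elena's market stall": [0.9, 0.1, 0.2],
    "Thomas and the Syndicate": [0.1, 0.9, 0.3],
}

def nearest(query_vector, stored=STORED):
    """Return the name of the stored entry closest to the query."""
    return max(stored, key=lambda name: cosine_similarity(query_vector, stored[name]))

query = [0.8, 0.2, 0.1]  # pretend embedding of "the old woman at the market"
print(nearest(query))    # Elena's market stall
```

Here the match succeeds without the name appearing in the query, but only because the pretend vectors were chosen to be close. Whether a real embedding model puts them close is exactly the part you can't see or control.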

Why format matters for retrieval

Here's the practical consequence: a bullet-point list of facts retrieves worse than prose. A summary that reads "Elena owns a market stall in the lower city. She is suspicious of strangers but warms to persistence. She was present when the fire started." gives the retrieval system full sentences to match against queries like "how does Elena react to the protagonist" or "what happened at the market." Bullets strip the connective tissue that makes semantic matching reliable.

Beyond prose vs. bullets, there are a few specific things that improve retrieval quality:

Direct quotes. If a character said something memorable, include it verbatim in the summary. Quotes are likely to match queries about that character or moment precisely because they use the same language the model generated — and will generate again in similar contexts.

Structured metadata. A header block with title, date, location, and characters present gives the retrieval system unambiguous anchors. Even if the semantic match is fuzzy, a query mentioning a character's name will reliably surface entries that explicitly list that character.

Natural language framing. Opening a memory entry with something like "This is a summary of past events" tells the model immediately what kind of content it's reading when that entry lands in context. It's not just a label — it's priming the model for the register of the information that follows.

Here's what this looks like in practice. A broad scene summary written for retrieval:

Summary Entry — Broad

This is a summary of past events

Title: The Night of the Lantern Festival

On the evening of the Lantern Festival, Mara finally confronts Thomas about the letter she found in his coat. Thomas, cornered and exhausted, admits he had been in contact with the Syndicate before the war ended — not as a collaborator, but as an informant. Mara's anger falters when she realizes the danger he put himself in. "I was trying to end it faster," he says quietly. "I didn't want you to know the kind of person I had to be to do that."

The conversation ends without resolution. Mara walks home alone through the paper lanterns, the weight of knowing sitting differently than the weight of suspicion had.

And a more granular memory entry for a specific scene:

Memory Entry — Granular

This is a memory of past events

Recalled by: Mara, Thomas, Sera

Time and Date: Thursday / March / 14

Location: Riverside Quarter, Thomas's apartment

Title: An Uneasy Truce

Context: Following the confrontation at the festival, Mara has returned at Sera's urging to hear Thomas out.

Thomas pours tea without being asked — a gesture Mara notes because he never used to remember how she took it. Sera mediates with careful neutrality, steering the conversation away from accusation and toward logistics. Thomas produces a folder of documents he says exonerates his conduct. Mara doesn't open it. "I believe you," she says finally. "I just don't know if that's enough yet." Thomas nods slowly. "That's fair. That's more than fair."

Sera catches Mara's eye on the way out and says nothing. Some things don't need commentary.

Notice what each entry is doing structurally: the metadata block gives the retrieval system explicit hooks; the prose gives it semantic surface area; the quotes give it verbatim anchors for future matching. It's not just a summary — it's a document designed to be found.
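If you write a lot of these, a small template keeps the structure consistent. The following is a sketch that reproduces the layout of the granular entry above; the `format_memory_entry` function and its parameter names are hypothetical, so adapt the fields to whatever format you settle on.

```python
def format_memory_entry(title, recalled_by, when, location, context, body):
    """Render a memory entry: framing line, metadata block, then prose."""
    header = [
        "This is a memory of past events",       # natural language framing
        f"Recalled by: {', '.join(recalled_by)}",
        f"Time and Date: {when}",
        f"Location: {location}",
        f"Title: {title}",
        f"Context: {context}",
    ]
    # Blank lines between fields, matching the sample entries in this guide.
    return "\n\n".join(header + [body.strip()])

entry = format_memory_entry(
    title="An Uneasy Truce",
    recalled_by=["Mara", "Thomas", "Sera"],
    when="Thursday / March / 14",
    location="Riverside Quarter, Thomas's apartment",
    context="Following the confrontation at the festival, Mara has returned to hear Thomas out.",
    body='Thomas pours tea without being asked. "I believe you," Mara says finally.',
)
print(entry)
```

The function only handles layout. The prose body, and the choice of which quotes to keep verbatim, is still the part that determines whether the entry gets found.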

Web UI workarounds

In software that exposes context-curation features directly, you can generate and store memories on the fly. In a standard web UI, you usually can't. By default the model carries no memory of your story from one session to the next, and there's no mechanism to add to a lore database mid-conversation.

There are two practical approaches, with different tradeoffs.

The cleaner option — context-wise — is a dedicated summarization workbench: a separate throwaway session used only for generating memory entries. You paste in a scene, ask for a formatted entry, copy the result into your local .md files, and return to your main session. The main session's context stays completely clean — no summarization requests, no meta-conversation about record-keeping, just story. The cost is the friction of switching sessions.

The more convenient option, if you're using Claude specifically, takes advantage of the fact that Claude can read and write files within a session. At the start of a new conversation, before any writing begins, you can ask Claude to initialize two working documents — Memories.md and Summaries.md — using whatever template format you've settled on. As the session progresses, you periodically ask Claude to append a new entry to the appropriate file. At the end of the session, you download both documents. Next session, you upload them as project files, ask Claude to initialize fresh working copies, and continue from there.

This approach does add some noise to the context — the summarization requests and Claude's responses to them become part of the chat history. But in practice, the convenience is probably worth it for most people. The entries themselves are the thing that matters, and you leave each session with a clean, portable document.

Note that this workflow is described with Claude as the example because it's what this guide was written and tested with. Other models may handle in-session file operations differently, or not at all — check the documentation for whatever platform you're using.

Either way, the workflow closes the same loop: you own the documents, you control what gets recorded, and you carry them forward manually between sessions. It's more deliberate than an automated system, but that deliberateness is also what gives you control over what gets remembered and how it's framed.

9. Memory, Summaries, and the Fresh Session Problem

Every chat session has a context limit. Eventually, the scroll fills up. In a web UI, your options when that happens are limited — but they're not zero.

The Handoff Document

Before you close a long session, ask the model to write a chat primer — a structured summary document explicitly intended to be carried into a new conversation. Tell it that's what you're doing. The model will write a better summary if it understands the purpose.

A good handoff document includes:

The current state of the world (what has happened)
The current state of key relationships
Tone and stylistic notes ("we've established X as deadpan, Y as volatile")
Where the story is, and where it's heading
Any specific details the model should not forget

Keep this as a .md file. Paste it at the top of your next session before anything else. It becomes the foundation of the new scroll.

Pasting Recent Messages

There's an underrated technique that works as well as or better than a robust handoff document: paste the last two or three messages from your previous session as the opening of the new one.

This does something a summary can't quite replicate: it gives the model recent prose to build style from. It establishes the voice, the rhythm, the specific texture of where you were. A summary tells the model what happened; recent messages show it how the story felt. The two together work better than either alone.
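Combining a primer with recent messages is mostly a matter of pasting things in a sensible order. Here is one way to sketch it, assuming you have the primer and the old messages saved as text; `new_session_opener` and the section labels are hypothetical choices, not a required format.

```python
def new_session_opener(primer, previous_messages, keep_last=3):
    """Build the first message of a fresh session from a primer plus recent prose."""
    # Guard against keep_last=0, since a [-0:] slice would return everything.
    recent = previous_messages[-keep_last:] if keep_last > 0 else []
    sections = ["HANDOFF PRIMER", primer.strip()]
    if recent:
        sections += ["MOST RECENT MESSAGES FROM THE PREVIOUS SESSION", "\n\n".join(recent)]
    return "\n\n".join(sections)

messages = ["scene one", "scene two", "scene three", "scene four", "scene five"]
opener = new_session_opener("Mara and Thomas have reached an uneasy truce.", messages, keep_last=2)
print(opener)
```

The primer goes first because it sets the frame, and the recent prose goes last so it sits closest to where the model starts writing.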

Editing Instead of Correcting

Most web UIs allow you to edit previous messages and regenerate from that point. This is one of the most powerful tools available to you, and it's frequently ignored.

If the model produces an output you don't like, try editing your last message — clarify it, add the instruction you forgot to include, specify what you actually wanted — and regenerate. This keeps the chat history clean. It keeps corrections out of the context. It treats the conversation as a draft rather than a transcript.

10. System Prompts and Project Instructions

If your web UI supports project instructions, a persistent system prompt, or a "custom instructions" feature — use it. This is your only real access to the primacy zone. Treat it like infrastructure, not like a quick note.

A well-constructed system prompt does a few things:

Establishes the creative register

What kind of writing is this? What's the expected length, tone, and style? Give the model a clear sense of the space it's operating in. Vague instructions produce vague outputs.

Addresses the model's default tendencies

Models have strong priors toward certain behaviors: ending scenes on tidy resolution, having characters communicate their feelings clearly, adding a light at the end of every dark tunnel, defaulting to similar sentence structures. Your system prompt is where you push back on those tendencies — not by listing prohibitions, but by describing what you want instead (see section 6).

Uses identity statements as task-register primers

A line like "you are a summarization assistant" is pointing the model's attention at a specific cluster of trained behaviors: compression, neutrality, structured output, distillation. "You are a dialogue parser" calls up a different cluster. "You are a human roleplayer" pulls the model away from its default helpful-assistant register and toward a creative, character-grounded one.

Think of it as a macro. One line that calls up a whole suite of associated behaviors, rather than having to enumerate them individually.

The effect is subtle but real — anecdotally, even simple identity statements produce noticeable register differences across repeated generations. The principle that determines whether one is worth including: does it diverge meaningfully from the model's defaults? "You are a helpful assistant" is so close to the model's trained posture that it's essentially a no-op. "You are a worldbuilder helping develop a setting" is doing genuine work. The further the identity statement steers from the baseline, the more load it's carrying.

Does not repeat what the model already knows

Instructions like "be creative" or "write well" are doing nothing. Every token in your system prompt is taking up space. Make it count.

Is specific about format

If you want outputs in a certain length range, say so. If you have specific formatting conventions — how to handle internal thoughts, how to handle scene breaks — spell them out.

Here's a principle worth tattooing somewhere: a system prompt is not a wish list, it's a constraint set. It narrows the probability distribution toward what you want. The more precisely it describes the output you're trying to produce, the more consistently you'll get something that resembles it.

A few things to leave out:

Identity statements that merely restate the model's defaults ("you are a helpful assistant") — save the slot for something that diverges
Hedges and apologies ("try to," "if possible," "please")
Redundant negations that could be rewritten as positive descriptions
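A quick way to audit a draft system prompt against that list is to search it for the usual suspects. The sketch below is a rough aid, not a rule engine: the phrase lists are illustrative assumptions (the negation list in particular goes beyond the examples given here), and a flagged phrase is only a prompt to reconsider the sentence.

```python
import re

# Phrases worth a second look; illustrative, not exhaustive.
HEDGES = ["try to", "if possible", "please"]
NEGATIONS = ["don't", "do not", "never", "avoid"]

def lint_system_prompt(text):
    """Return (kind, phrase) pairs for hedges and negations found in the prompt."""
    lowered = text.lower().replace("\u2019", "'")   # normalize curly apostrophes
    findings = []
    for kind, phrases in (("hedge", HEDGES), ("negation", NEGATIONS)):
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                findings.append((kind, phrase))
    return findings

draft = "Please try to write well. Don't rush the romance."
print(lint_system_prompt(draft))

revised = "Relationships develop through accumulation before anything is declared."
print(lint_system_prompt(revised))
```

The revised line passes cleanly because it describes the target rather than the thing to exclude, which is the rewrite described in section 6.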

Part Three — The Bigger Picture

11. The Conversation Is the Prompt

This is the underlying principle that unifies everything above.

The model doesn't have a separate "you" and "system" and "history" — it has a context window, and everything in that context window is input. Your chat history is not a log of what you've discussed; it's an active part of what the model is using to generate the next token.

This means:

Clean histories produce cleaner outputs than cluttered ones
Corrections that live in chat add more noise and more bad examples than simply regenerating until a message is satisfactory
The tone and quality of your messages set a precedent the model will follow
Bad outputs left in context will pull future outputs toward them

Treat the conversation as a document you're coauthoring in real time. Every message you send is a line in that document.

12. A Note on Tooling

Everything in this guide is achievable with a standard web UI. But it's worth knowing that the degree of control described here (over lore placement, how deep in the context a piece of text is inserted, chat history structure, context management) is the floor of what's possible, not the ceiling.

More advanced tools exist that expose these variables directly: the ability to inject content at specific positions in the context, to keep chat histories clean of user-turn messages entirely, to manage lore retrieval with precision. If you find yourself wanting that level of control, it's worth knowing such tools are out there. The concepts in this guide translate directly to them.

Appendix: Quick Reference

Primacy Zone — Top of Context

Persistent lore, world rules, core character frameworks, system prompt.

Middle Zone — Reference

Arc or chapter-specific lore, relationship status, uploaded reference docs.

Recency Zone — Bottom of Context

Immediate scene setup, author intent, behavioral reminders, current context.

Positive Prompting

Describe the space you want to be in, not the behaviors you want to avoid.

Scene Granularity

Direct at the scene level; hand the model a script and let it supply the texture.

RAG & Document Format

Write summaries in prose, not bullets. Include direct quotes, metadata headers, and natural language framing for better retrieval.

Memory Handoff

Request a chat primer before closing a session. Paste recent messages into the new session to carry style as well as state.

Edit, Don't Correct

Regenerate bad outputs rather than arguing with them in chat.