Hey everybody. I’m just getting into LLMs. Total noob. I started using llama-server’s web interface, but I’m experimenting with a frontend called SillyTavern. It looks much more powerful, but there’s still a lot I don’t understand about it, and some design choices I found confusing.
I’m trying the Harbinger-24B model to act as a D&D-style DM, and to run one party character while I control another. I tried several general-purpose models too, but I felt the purpose-built Harbinger adventure model was noticeably superior for this.
I’ll write a little about my experience with it, and then some thoughts about LLMs and D&D. (Or D&D-ish. I’m not fussy about the exact thing, I just want that flavour of experience).
General Experience
I’ve run two scenarios. My first try was a 4/10 for my personal satisfaction, and the second was an 8/10. I made no changes to the prompts or anything in between, so the difference is all down to the story the model settled into. I’m trying not to give the model any story details, so it makes everything up and I won’t know about it in advance. The first story the model invented was so-so. The second was surprisingly fun. It had historical intrigue, a tie-in to a dark family secret from the ancestors of the AI-controlled char, and the dungeon-diving mattered to the overarching story. Solid marks.
My suggestion for others trying this is, if you don’t get a story you like out of the model, try a few more times. You might land something much better.
The Good
Harbinger provided a nice mixture of combat and non-combat. I enjoy combat, but I also like solving mysteries and advancing the plot by talking to NPCs or finding a book in the town library, as long as it feels meaningful.
It writes fairly nice descriptions of areas you encounter, and thoughts for the AI-run character.
It seems to know D&D spells and abilities. It lets you use them in creative but very reasonable ways you could do in a pen and paper game, but can’t do in a standard CRPG engine. It might let you get away with too much, so you have to keep yourself honest.
The Bad
You may have to try multiple times until the RNG gives you a nice story. You could also inject a story in the base prompt, but I want the LLM to act as a DM for me, where I’m going in completely blind. Also, in my first 4/10 game, the LLM forced really bad “main character syndrome” on me. The whole thing was about me, me, me, I’m special! I found that off-putting, but the second 8/10 attempt wasn’t like that at all.
As an LLM, it’s loosey-goosey about things like inventory, spells, rules, and character progression.
I had a difficult time giving the model OOC instructions. OOC tended to be “heard” by other characters.
Thoughts about fantasy-adventure RP and LLMs
I feel like the LLM is very good at providing descriptions, situations, and locations. It’s also very good at understanding how you’re trying to be creative with abilities and items, and it lets you solve problems in creative ways. It’s more satisfying than a normal CRPG engine in this way.
As an LLM though, it lets you steer things in ways you shouldn’t be able to in an RPG with fixed rules: it won’t reliably disallow a spell you don’t know, or remember how many feet of rope you’re carrying. I enjoy the character leveling and crunchy stats part of pen-and-paper or CRPGs, and I haven’t found a good way to get the LLM to do that without just handling everything manually and whacking it into the context.
That leads me to think that using an LLM for creativity inside a non-LLM framework to enforce rules, stats, spells, inventory, and abilities might be phenomenal. Maybe AI-dungeon does that? Never tried, and anyway I want local. A hybrid system like that might be scriptable somehow, but I’m too much of a noob to know.
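That kind of hybrid can be sketched even at my noob level. Below is a toy illustration (all names and numbers invented): dice and inventory live in plain Python, and the LLM would only be asked to narrate an outcome the rules layer already decided.

```python
import random

class RulesEngine:
    """Out-of-band rules layer: this code decides outcomes,
    and the LLM is only asked to narrate them."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.inventory = {"rope_ft": 50, "torch": 3}  # example starting gear

    def skill_check(self, modifier, dc):
        roll = self.rng.randint(1, 20)  # d20 + modifier vs difficulty class
        return roll, roll + modifier >= dc

    def use_item(self, item, qty=1):
        # The model never gets to hand-wave an item the sheet doesn't have.
        if self.inventory.get(item, 0) < qty:
            return False
        self.inventory[item] -= qty
        return True

def describe_attempt(engine, action, modifier, dc):
    roll, ok = engine.skill_check(modifier, dc)
    outcome = "succeeds" if ok else "fails"
    # In a real setup this string becomes the prompt sent to the LLM,
    # which narrates the result but cannot change it.
    return f"[rolled {roll} vs DC {dc}] The attempt to {action} {outcome}."
```

The same pattern could extend to spell slots, HP, and XP; the point is just that the crunchy state lives outside the context window.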
Late to the post, but look into SGLang, OP!
In a nutshell, it’s a framework for letting LLMs “fill in blanks” instead of generating entire replies, so you could script in rules as part of the responses as structure for it to grab onto. It’s all locally runnable (with the right hardware, unfortunately).
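To make the “fill in blanks” idea concrete, here’s a toy, library-free illustration (this is just the concept, not SGLang’s actual API): the script owns the reply’s structure, and the model only fills constrained slots.

```python
import re

def fill_template(template, gen):
    """Fill each {slot:regex} hole using gen(slot_name), rejecting
    any output that violates the slot's constraint."""
    def fill(match):
        name, pattern = match.group(1), match.group(2)
        candidate = gen(name)
        if not re.fullmatch(pattern, candidate):
            raise ValueError(f"slot {name!r}: {candidate!r} fails /{pattern}/")
        return candidate
    return re.sub(r"\{(\w+):([^}]+)\}", fill, template)

# Stub "model": a real setup would make one constrained LLM call per slot.
answers = {"damage": "7", "effect": "the goblin staggers back"}
reply = fill_template(
    "You hit for {damage:[0-9]+} damage; {effect:[a-z ]+}.",
    answers.__getitem__,
)
```

With a real constrained-decoding backend, the rules text becomes fixed scaffolding and the model can’t wander off it.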
Also, there are some newer, less sycophantic DM specific models. I can look around if you want.
Will do, thanks for the tip. Your description does sound like a good fit for the idea. As long as it supports network inference between machines with heterogeneous cards, it would work for what I have in mind.
Also, another suggestion would be to be careful with your sampling. Use a low temperature and high MinP for queries involving rules, higher temperature (+ samplers like DRY) when you’re trying to tease out interesting ideas.
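Against llama-server’s /completion endpoint, that split might look like the sketch below. The parameter names are llama.cpp’s; the values are only starting points to tune, and as far as I know fields sent with the request override the server’s startup defaults.

```python
# Two sampler profiles for llama-server's /completion endpoint.
RULES_PROFILE = {
    "temperature": 0.2,    # near-deterministic for rules questions
    "min_p": 0.1,          # prune unlikely tokens aggressively
    "dry_multiplier": 0.0, # DRY off: rules answers legitimately repeat
}

CREATIVE_PROFILE = {
    "temperature": 0.9,    # more variety for narration and ideas
    "min_p": 0.05,
    "dry_multiplier": 0.8, # discourage looping prose
}

def build_request(prompt, profile):
    return {"prompt": prompt, "n_predict": 256, **profile}
```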
I would even suggest an alt frontend like mikupad that exposes token probabilities, so you can go to any point in the reply and look through every “idea” the LLM had internally (and regen from that point if you wish). It’s also good for debugging sampling issues when you get an incorrect answer (sometimes the LLM gets it right internally, but bad sampling parameters choose a bad answer).
Ah, great idea about the low temp for rules and high for creativity. I guess I can easily change it in the front end, although I also set the temp when I start the server, and I’m not sure which one takes priority. Hopefully the frontend does, so I can tweak it easily.
Also your post just got me thinking about the DRY sampler, which I’m using, but might be causing troubles for cases where the model legit should repeat itself, like an !inventory or !spells command. I might try to either disable it or add a custom break token, like the ! mark.
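If I try the custom-breaker route, I think the request settings would look something like this (the first four breakers are llama.cpp’s DRY defaults, if I’m reading things right; the “!” is my addition):

```python
# llama-server DRY settings: sequence breakers reset the repetition
# matcher, so text following "!" (e.g. an "!inventory" listing) should
# not get penalized for legitimately repeating itself.
dry_settings = {
    "dry_multiplier": 0.8,
    "dry_sequence_breakers": ["\n", ":", "\"", "*", "!"],
}
```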
I think ST can show token probabilities, so I’ll try that too, thanks. I have so much to learn! I really should try other frontends though. ST is powerful in a lot of ways like dynamic management of the context, but there are other things I don’t like as much. It attaches a lot of info to a character that I don’t feel should be a property of a character. And all my D&D scenarios so far have been just me + 1 AI char, because even though ST has a “group chat” feature, I feel like it’s cumbersome and kind of annoying. It feels like the frontend was first designed around one AI char only, and then something got glued on to work around that limitation.
Oh, one more thing: I saw you mention context management.
Mistral (24B) models are really bad at long context, but that’s not true of every model. I find that Qwen 32B and Gemma 27B are solid at 32K (which is a huge body of text), and (with the right backend settings) you can easily run either at 64K with very minimal VRAM overhead.
Specifically, run Gemma with the latest llama.cpp server (which will automatically use sliding window attention as of, like, yesterday), or Qwen (and most other models) with exllamav2 or exllamav3, which quantize the KV cache down to Q4 very efficiently.
This way you don’t need to manage context: you can feed the LLM the whole adventure so it doesn’t forget anything, and streaming responses will be instant since the context is always cached.
Oh, one thing about ST specifically: its default sampling presets are catastrophic last I checked. Like, they’re designed for ancient models, and while I have nothing against the UI it is kinda from a different era.
For Gemma and Qwen, I’ve been using like 0.2-0.7 temp, at least 0.05 MinP, 1.01 rep penalty (not something insane like 1.1) and maybe 0.3-ish dry, though like you said dry/xtc can really mess up some tasks.
As long as it supports network inference between machines with heterogeneous cards, it would work for what I have in mind.
It probably doesn’t, heh, especially with non-Nvidia cards. But the middle layer may work with a generic OpenAI-compatible backend like the llama.cpp server.
Thanks for sharing your nice project, ThreeJawedChuck!
I feel like a little bit of prompt engineering would go a long way.
To explain, a model’s base personality tends to be aligned into the “AI chat assistant” archetype. Models are encouraged to be positive yes-men whose goal is to assist the user and please them with pleasantries in the process.
They do not need to be this way, though. With a system prompt you can directly instruct the model to alter its personality, or tell it exactly how to structure things. In this context, tell it something like:
"You are a dungeon master with the primary goal of weaving an interesting and coherent story in the ‘dungeons and dragons’ universe. Your secondary goal is ensuring game rules are generally followed correctly.
You are not a yes-man. You are dominant and in control of the situation. You may argue with and challenge users as needed when negotiating game actions.
Your players want a believable and grounded setting without falling into the tropes of main character syndrome or becoming Mary Sues. Make sure that their adventures remain grounded and the world their characters live in remains largely indifferent to their existence.”
This eats into a little bit of context but should change things up a little.
You may make the model more creative and outlandish or more rigid and predictable by adjusting sampler settings.
Consider finding a PDF or an EPUB of an old DND manual, converting it to text, and putting it into your engine’s RAG system so it can directly reference DND rules.
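Before the converted manual can be indexed, it has to be split into chunks. A naive character-based splitter with overlap could look like the sketch below (the sizes are guesses to tune for your embedder):

```python
def chunk_text(text, max_chars=1200, overlap=200):
    """Split a converted rulebook into overlapping chunks for a RAG index.
    Overlap keeps a rule that straddles a boundary findable from either side."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```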
Be wary of context limits. No matter what model makers tell you, 16-32K is a reasonable limit to expect when it comes to models keeping coherent track of things. A good idea is to keep a text file of important information you don’t want the model to forget, and give it a refresher on the relevant context when it starts getting a little confused about who did what.
Chain-of-thought reasoning models may also give an edge when it comes to thinking deeper about the story and how it’s put together interaction-wise. But as a downside, they take some extra time and compute to think about things.
I never tried SillyTavern, but I know it’s meant for roleplaying with character cards. I always recommend Kobold since it’s what I know best, but there’s more than one way to do things.
Thanks for your comments and thoughts! I appreciate hearing from more experienced people.
I feel like a little bit of prompt engineering would go a long way.
Yah, probably so. I tried to write a system prompt to steer the model toward what I wanted, but it’s going to take a lot more refinement and experimenting to dial it in. I like your idea of asking it to be unforgiving about rules. I hadn’t put anything like that in.
That’s a great idea about putting a D&D manual, or at least the important parts, into a RAG system. I haven’t tried RAG yet, but it’s on my list of things to learn. I know what it is, I just haven’t tried it yet.
I’ve for sure seen that the quality of output starts to decline at around 16K context, even on models that claim to support 128K. Also, the system prompt seems most effective when there are only, let’s say, 4K context tokens so far. As the context grows, the model becomes less and less inclined to follow the system prompt. I’ve been guessing this is because any given piece of a larger context becomes more diluted, but I don’t really know.
For those reasons, I’m trying to use summarization to keep the context size under control, but I haven’t found a good approach yet. SillyTavern has an auto summary injecting system, but either I’m misunderstanding it, or I don’t like how it works, and I end up doing it manually.
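What I end up doing manually amounts to something like this sketch (the `summarize` step would be an LLM call, or me writing the recap by hand):

```python
def trim_context(messages, summarize, keep_recent=8, max_messages=24):
    """Once the chat log exceeds max_messages, fold everything except
    the newest turns into one running 'story so far' message."""
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))
    return [{"role": "system", "content": f"Story so far: {summary}"}] + recent
```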
I tried a few CoT models, but not since I moved to ST as a front end. I was using them with the standard llama-server web interface, which is a rather simple affair. My problem was that the thinking output seemed to spam up the context, leaving me much less ctx space for my own use. Each think block was like 500-800 tokens. It looks like ST might have an ability to only keep the most recent think block in the context, so I need to do more experimenting. The other problem I had was that the thinking could just take a lot of time.
I’ve had a good amount of fun doing role play with LLMs. I think that’s one of the nicer things you can do with them. From my experience I’d say they aren’t even close to being able to do the maths on something like the D20 system. But they have a good grasp of what the framework is about. They know how to do high fantasy, or science fiction. They’re creative, can make up scenarios and characters, and can write dialogue and, to some degree, narration. LLMs know all the common plot tropes, and it occasionally makes me laugh how they love to push for sudden plot twists, or enhance the storyline with silly things.
And they do other things than just D&D. You can also instruct them to write dialogue. Or be a 90s computer text adventure.
SillyTavern is a good choice. Personally, I’ve homed in on KoboldCPP, because SillyTavern tends to confuse me with its bazillion options. And I like the very basic “story mode” of KoboldCPP, at least for creative storywriting. It’s pretty much down to a large text area / sheet of paper, where I can edit and switch things around easily, and I also see exactly where the instruct-mode special tokens get placed, etc. But that’s more personal preference… You do you.
What I also experienced is a wide variety with the models and how good they are at story writing. You definitely need to find and pick the right one. Some are good at it, write good narration. Some can’t maintain the pacing but always push for a quick wrap up within a few paragraphs, or they don’t like to describe the surroundings. Some don’t properly abide by the character descriptions and have their own ideas, or they just get confused and lost in meaningless side stories. The prompt also affects this a lot. But it’s also down to the specific model in use.
Or be a 90s computer text adventure
Zork on steroids!
Thank you for posting your experience.
I’ve always thought LLMs would be perfect for being a DM or even a self-hosted “Choose your own adventure” story.
The prospect of having an “unlimited” adventure sounded awesome. In practice, and in my experience - it wasn’t that fun.
I thoroughly enjoy stretching my creativity and “thinking outside the box”, but it’s no fun when the LLM simply says “yeah, sure, whatever.” I guess “stretching my creativity” is really “pushing my creativity against the boundaries of what is allowed/makes sense”… but without any sort of resistance, I got bored.
but it’s no fun when the LLM simply says “yeah, sure whatever.” I
I hear ya. LLMs tend to heavily tilt toward what the user wants, which is not ideal for an RPG.
Have you tried any of the specialized RPG models? The one I’m using now has, at least twice so far, put me into a situation where I felt my party (2 chars, me and the AI) were going to die unless we ran away. We just finished a very difficult fight, used everything at our disposal, and sustained several serious injuries in the process. Then an even more powerful foe appeared, and it really felt that was going to be the end unless we ran. Would it really have killed us? I can’t say, but I did get a genuine sense of that. It might help that in the system prompt, I had put this:
The story should be genuinely dangerous and frightening, but survivable if we use our wits.
I have the feeling the generalist models are much more tilted in the “yeah, sure, whatever” direction. I tried at least one RPG-focused model (Dan’s dangerous winds, or something like that) which was downright brutal, and would kill me off right away with no opportunity for me to do anything about it. That wasn’t fun for the opposite reason. But like you say, it’s also not fun to have no risk and no boundaries to test one’s mettle. The sweet spot can be elusive.
I’m thinking that a non-LLM rules system around an LLM for descriptive purposes could really help here too, to enforce a kind of rigor on the experience.
AIDungeon has been around for ages. There’s an app, there’s a website.
What was your setup for this experiment?
I’m wondering if koboldcpp + sillytavern would be suitable here.
What was your setup for this experiment?
I’m using llama.cpp + sillytavern. I’m very much in learning mode with ST however, so I’m confident I could be using it in a more effective manner than I know how to at the moment. It seems like koboldcpp + ST ought to be similar to what I’m doing.