Hey everybody. I’m just getting into LLMs. Total noob. I started using llama-server’s web interface, but I’m experimenting with a frontend called SillyTavern. It looks much more powerful, but there’s still a lot I don’t understand about it, and some design choices I found confusing.

I’m trying the Harbinger-24B model to act as a D&D-style DM, and to run one party character while I control another. I tried several general-purpose models too, but I felt that Harbinger, a purpose-built adventure model, was noticeably superior for this.

I’ll write a little about my experience with it, and then some thoughts about LLMs and D&D. (Or D&D-ish; I’m not fussy about the exact thing, I just want that flavour of experience.)

General Experience

I’ve run two scenarios. My first try was a 4/10 for personal satisfaction, and the second was an 8/10. I made no changes to the prompts or settings between runs, so the difference is all down to the story the model settled into. I’m trying not to give the model any story details, so it makes everything up and I don’t know about it in advance. The first story the model invented was so-so. The second was surprisingly fun: it had historical intrigue, a tie-in to a dark family secret from the ancestors of the AI-controlled character, and dungeon-diving that mattered to the overarching story. Solid marks.

My suggestion for others trying this: if you don’t get a story you like out of the model, try a few more times. You might land something much better.

The Good

Harbinger provided a nice mixture of combat and non-combat. I enjoy combat, but I also like solving mysteries and advancing the plot by talking to NPCs or finding a book in the town library, as long as it feels meaningful.

It writes fairly nice descriptions of the areas you encounter, and of the AI-run character’s thoughts.

It seems to know D&D spells and abilities, and lets you use them in creative but very reasonable ways that you could in a pen-and-paper game but can’t in a standard CRPG engine. It might let you get away with too much, though, so you have to keep yourself honest.

The Bad

You may have to try multiple times until the RNG gives you a nice story. You could also inject a story into the base prompt, but I want the LLM to act as a DM for me, where I’m going in completely blind. Also, in my first (4/10) game, the LLM forced really bad “main character syndrome” on me. The whole thing was about me, me, me, I’m special! I found that off-putting, but the second (8/10) attempt wasn’t like that at all.

As an LLM, it’s loosey-goosey about things like inventory, spells, rules, and character progression.

I had a difficult time giving the model OOC (out-of-character) instructions; OOC comments tended to be “heard” by the other characters.

Thoughts about fantasy-adventure RP and LLMs

I feel like the LLM is very good at providing descriptions, situations, and locations. It’s also very good at understanding how you’re trying to be creative with abilities and items, and it lets you solve problems in creative ways. It’s more satisfying than a normal CRPG engine in this way.

As an LLM though, it lets you steer things in ways you shouldn’t be able to in an RPG with fixed rules: it won’t reliably disallow a spell you don’t know, or remember how many feet of rope you’re carrying. I enjoy the character leveling and crunchy-stats part of pen-and-paper or CRPGs, and I haven’t found a good way to get the LLM to do that without just handling everything manually and whacking it into the context.

That leads me to think that using an LLM for creativity inside a non-LLM framework that enforces rules, stats, spells, inventory, and abilities might be phenomenal. Maybe AI Dungeon does that? I’ve never tried it, and anyway I want to stay local. A hybrid system like that might be scriptable somehow, but I’m too much of a noob to know.
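To make the idea concrete, here’s a minimal sketch of what such a hybrid could look like: a plain rules layer that owns the hard state (inventory, known spells, dice) and refuses impossible actions before any text ever reaches the LLM, which would only handle narration. All class and method names here are invented for illustration, and the actual LLM call is left out entirely.

```python
import random

# Hypothetical sketch of a non-LLM rules layer. The LLM would only narrate
# actions this engine has already validated; it never owns the game state.
class RulesEngine:
    def __init__(self, inventory, known_spells):
        self.inventory = dict(inventory)      # item -> quantity
        self.known_spells = set(known_spells)

    def cast(self, spell):
        """Refuse spells the character hasn't learned."""
        if spell not in self.known_spells:
            return f"[RULES] You don't know {spell}. Action refused."
        return f"[OK] {spell} is cast."       # validated action goes into the LLM prompt

    def use_item(self, item, qty=1):
        """Decrement inventory, refusing impossible uses."""
        if self.inventory.get(item, 0) < qty:
            return f"[RULES] You don't have {qty}x {item}."
        self.inventory[item] -= qty
        return f"[OK] Used {qty}x {item} ({self.inventory[item]} left)."

    def roll_check(self, dc, modifier=0):
        """A classic d20 check the LLM can't fudge."""
        return random.randint(1, 20) + modifier >= dc

engine = RulesEngine({"rope (ft)": 50}, {"Magic Missile"})
print(engine.cast("Fireball"))          # refused: spell not known
print(engine.use_item("rope (ft)", 30)) # 20 ft of rope remain
```

The point of the design is that the engine’s refusals and results get appended to the context, so the LLM narrates around hard facts instead of inventing them.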

  • brucethemoose@lemmy.world · 21 hours ago

    Also, another suggestion would be to be careful with your sampling. Use a low temperature and high MinP for queries involving rules, and a higher temperature (plus samplers like DRY) when you’re trying to tease out interesting ideas.
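As a sketch of that two-mode approach against a local llama.cpp server: the `/completion` endpoint accepts sampling fields per request, so you can keep two parameter sets and pick one per query. The specific values below are assumptions, not recommendations.

```python
# Hypothetical sketch: per-request sampler presets for llama.cpp's /completion
# API. Field names (temperature, min_p, dry_multiplier) follow llama.cpp;
# the values themselves are illustrative guesses.
RULES_SAMPLING = {            # near-deterministic: rules, inventory queries
    "temperature": 0.2,
    "min_p": 0.1,
}
CREATIVE_SAMPLING = {         # looser: plot hooks, descriptions
    "temperature": 1.0,
    "min_p": 0.05,
    "dry_multiplier": 0.8,    # enable the DRY anti-repetition sampler
}

def build_request(prompt, mode):
    """Assemble a /completion payload with the preset for this query type."""
    params = RULES_SAMPLING if mode == "rules" else CREATIVE_SAMPLING
    return {"prompt": prompt, "n_predict": 256, **params}

req = build_request("What is my AC with mage armor?", "rules")
```

You’d then POST that dict as JSON to the running server; per-request fields override the server’s launch-time defaults.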

    I would even suggest an alt frontend like mikupad that exposes token probabilities, so you can go to any point in the reply and look through every “idea” the LLM had internally (and regenerate from that point if you wish). It’s also good for debugging sampling issues when you get an incorrect answer (sometimes the LLM gets it right internally, but bad sampling parameters pick a bad token).

    • ThreeJawedChuck@sh.itjust.works (OP) · 20 hours ago

      Ah, great idea about the low temp for rules and high for creativity. I guess I can easily change it in the frontend, although I also set the temperature when I start the server, and I’m not sure which one takes priority. Hopefully the frontend does, so I can tweak it easily.

      Also, your post just got me thinking about the DRY sampler, which I’m using but which might be causing trouble in cases where the model legitimately should repeat itself, like an !inventory or !spells command. I might try either disabling it or adding a custom sequence breaker, like the ! character.
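For what it’s worth, llama.cpp’s `/completion` API exposes that knob as `dry_sequence_breakers`; a sketch of extending the defaults with `!` (the default breaker list shown here is my understanding of llama.cpp’s, so treat it as an assumption):

```python
# Hypothetical sketch: tell the DRY sampler to reset at "!" so repeated
# "!inventory"-style commands aren't penalized as repetition.
DEFAULT_BREAKERS = ["\n", ":", "\"", "*"]   # assumed llama.cpp defaults

payload = {
    "prompt": "!inventory",
    "dry_multiplier": 0.8,                  # DRY enabled
    "dry_sequence_breakers": DEFAULT_BREAKERS + ["!"],
}
```

Sequences after a breaker character start a fresh repetition window, which is exactly what a command prefix needs.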

      I think ST can show token probabilities, so I’ll try that too, thanks. I have so much to learn! I really should try other frontends, though. ST is powerful in a lot of ways, like dynamic management of the context, but there are other things I don’t like as much. It attaches a lot of info to a character that I don’t feel should be a property of a character. And all my D&D scenarios so far have been just me plus one AI character, because even though ST has a “group chat” feature, I find it cumbersome and kind of annoying. It feels like the frontend was designed around a single AI character first, and group chat got glued on to work around that limitation.

      • brucethemoose@lemmy.world · 18 hours ago

        One more thing: I saw you mention context management.

        Mistral (24B) models are really bad at long context, but that’s not true of every model. I find that Qwen 32B and Gemma 27B are solid at 32K (which is a huge body of text), and with the right backend settings you can easily run either at 64K with very minimal VRAM overhead.

        Specifically, run Gemma with the latest llama.cpp server (which automatically uses sliding-window attention as of, like, yesterday), or run Qwen (and most other models) with exllamav2 or exllamav3, which can quantize the KV cache down to Q4 very efficiently.
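For reference, llama.cpp can also quantize its KV cache via `--cache-type-k`/`--cache-type-v` (short `-ctk`/`-ctv`). A sketch of a long-context launch along those lines; the model filename and exact values are placeholders:

```shell
# Sketch (assumed values): 64K context with a Q4-quantized KV cache on
# llama-server. Quantized KV cache in llama.cpp requires flash attention.
llama-server \
  -m gemma-3-27b-it-Q4_K_M.gguf \
  -c 65536 \
  -ctk q4_0 -ctv q4_0 \
  --flash-attn
```

The Q4 cache trades a little quality for a large VRAM saving, which is what makes 64K contexts cheap.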

        This way you don’t need to manage context: you can feed the LLM the whole adventure so it never forgets anything, and streaming responses will be instant since the prompt is always cached.

        • ThreeJawedChuck@sh.itjust.works (OP) · 8 minutes ago

          Mistral (24B) models are really bad at long context, but this is not always the case. I find that Qwen 32B and Gemma 27B are solid at 32K

          It looks like the Harbinger RPG model I’m using (from Latitude Games) is based on Mistral 24B, so maybe it inherits that limitation? I like it in other ways: it was trained on RPG games, which seems to help for my use case. I did try some general-purpose / vanilla models and felt they were not as good at D&D-type scenarios.

          It looks like Latitude also has a 70B Wayfarer model. Maybe it would do better at bigger contexts. I have several networked machines with 40GB of VRAM between them, and I can just squeeze an IQ4_XS quant of a 70B model into that unholy assembly if I run a 24000-token context (measured before the SWA patch, so maybe more now). I will try it! The drawback is speed: 70B models are slow on my setup, about 8 t/s with an empty context.
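Back-of-envelope arithmetic supports that tight fit, assuming IQ4_XS averages roughly 4.25 bits per weight (an approximation, not an exact figure):

```python
# Rough estimate of 70B IQ4_XS weight size; 4.25 bits/weight is an
# assumed average for this quant format.
params = 70e9
bits_per_weight = 4.25
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # leaves only a few GB of 40 GB for KV cache
```

That leaves only a few GB for the KV cache and activations, which is why the usable context is so constrained.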

      • brucethemoose@lemmy.world · 19 hours ago

        Oh, one thing about ST specifically: its default sampling presets were catastrophic last I checked. They’re designed for ancient models, and while I have nothing against the UI, it is kinda from a different era.

        For Gemma and Qwen, I’ve been using roughly 0.2–0.7 temperature, at least 0.05 MinP, 1.01 repetition penalty (not something insane like 1.1), and maybe 0.3-ish DRY, though like you said, DRY/XTC can really mess up some tasks.