How much GPU do I need to run a 90B model?
🇦🇺𝕄𝕦𝕟𝕥𝕖𝕕𝕔𝕣𝕠𝕔𝕕𝕚𝕝𝕖@lemm.ee to LocalLLaMA@sh.itjust.works · 11 days ago · 16 comments
Sylovik@lemmy.world · 11 days ago
For LLMs you should look at AirLLM. I don't think there are convenient integrations with local chat tools yet, but an issue has already been opened on Ollama's tracker.
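For anyone curious what AirLLM looks like in practice, here's a minimal sketch roughly following the project's README (the model name is illustrative, and the exact API may differ between versions):

```python
from airllm import AutoModel

# AirLLM loads one transformer layer at a time, so a 70B+ model can run
# on a few GB of VRAM (at the cost of speed).
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

input_tokens = model.tokenizer(
    ["What is the capital of the United States?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```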
🇦🇺𝕄𝕦𝕟𝕥𝕖𝕕𝕔𝕣𝕠𝕔𝕕𝕚𝕝𝕖@lemm.ee (OP) · 11 days ago
That looks like exactly the sort of thing I want. Is there an existing solution to get it to behave like an Ollama instance? (I have a bunch of services pointed at an Ollama instance running in Docker.)
Sylovik@lemmy.world · 10 days ago
You could try Harbor. The description claims it provides an OpenAI-compatible API.
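If whatever you end up running exposes an OpenAI-compatible endpoint, repointing your services is mostly a matter of swapping the base URL. A sketch with the official openai Python client (the port, model name, and URL are assumptions about your setup; Ollama itself also serves an OpenAI-compatible /v1 endpoint):

```python
from openai import OpenAI

# base_url is an assumption: point it at whichever local server exposes
# the OpenAI-compatible API. The api_key is unused locally but required.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

resp = client.chat.completions.create(
    model="llama3",  # hypothetical model name; use whatever you have loaded
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```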
red@lemmy.zip · 10 days ago
This is useless; llama.cpp already does what AirLLM does (offloading to CPU), and it's actually faster. So just use Ollama.
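For reference, partial CPU/GPU split in llama.cpp is a single parameter. A minimal sketch with the llama-cpp-python bindings (the model path and layer count are placeholders; tune n_gpu_layers to your VRAM):

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to VRAM;
# the rest stay in system RAM on the CPU. -1 would offload everything.
llm = Llama(
    model_path="./models/llama-90b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,  # placeholder; raise until you run out of VRAM
)

out = llm("Q: How much GPU do I need for a 90B model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Ollama does the same split automatically, which is why for most setups it's simpler than a layer-streaming approach like AirLLM.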