{"id":657,"date":"2026-03-04T21:18:29","date_gmt":"2026-03-05T05:18:29","guid":{"rendered":"https:\/\/nramkumar.org\/tech\/?p=657"},"modified":"2026-03-07T20:18:51","modified_gmt":"2026-03-08T04:18:51","slug":"hosting-models-locally-for-openclaw","status":"publish","type":"post","link":"https:\/\/nramkumar.org\/tech\/blog\/2026\/03\/04\/hosting-models-locally-for-openclaw\/","title":{"rendered":"Hosting models locally for OpenClaw"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">As I am exploring OpenClaw to set up a personal assistant for myself, I wanted to host some LLM capability locally. My hardware is a 3060 GPU with 12 GB VRAM. While this cannot host a model good enough to be the main backing AI for OpenClaw, it can still host several very capable models that handle well-scoped tasks handed off by the larger model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I wanted to run <code>qwen3:8b-q8_0<\/code> as one of the models. While this ran fine, it was surprisingly slow at relatively simple tasks such as summarizing small amounts of text or extracting information from them. It turns out that ollama enables thinking mode for these reasoning models by default, and in my measurements that made requests 10-30x slower on my workloads. The fix is easy: set <code>\"think\": false<\/code> in your ollama request.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another optimization is quantizing the KV cache. This reduces the memory the cache needs per token of context, so you can increase the context length and process longer input text. 
You can do this by setting the <code>OLLAMA_KV_CACHE_TYPE=q8_0<\/code> environment variable for your ollama service, for example with an <code>Environment=\"OLLAMA_KV_CACHE_TYPE=q8_0\"<\/code> line in its systemd unit. Per the ollama documentation, KV cache quantization only takes effect when Flash Attention is enabled (<code>OLLAMA_FLASH_ATTENTION=1<\/code>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With both of these optimizations, I find that local models act as very effective, focused helpers for the larger LLM when given well-scoped tasks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I am exploring OpenClaw to set up a personal assistant for myself, I wanted to host some LLM capability locally. My hardware is a 3060 GPU with 12 GB VRAM. While this cannot host a good model to be the main backing AI for OpenClaw, it can still host several very capable models that can&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-657","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/posts\/657","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/comments?post=657"}],"version-history":[{"count":3,"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/posts\/657\/revisions"}],"predecessor-version":[{"id":661,"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/posts\/657\/revisions\/661"}],"wp:attachment":[{"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/media?parent=657"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nramkumar.org\/tech\/wp-json\/wp\/v2\/categories?post=657"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nram
kumar.org\/tech\/wp-json\/wp\/v2\/tags?post=657"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}