28 February 2024

Use a free, local LLM instead of GitHub Copilot

There was a time when Google and StackOverflow replaced the need to buy books to look up documentation. Now we have entered the era of LLMs (Large Language Models) replacing Google and StackOverflow for coding. We've also entered the era of frameworks like SkyPilot assisting with batch jobs on the cloud. Choosing an LLM is like choosing an ice cream at a buffet of ice creams. To ease my programming, I decided to try a few VS Code extensions similar to GitHub Copilot. The objective was to find extensions that worked locally without sending any of my data to an external server, and were free to use.

CAUTION: VS Code extensions aren't vetted by anyone for safety, so install one only if you really trust its creator. Even for the ones mentioned below, check first.

There's a leaderboard that tracks how well models can write code.

Instruct model versus base model

  • Base models: These aren't designed to answer questions; they are meant to provide completions. If you prompt one with "What is the capital of India", it may respond with something like "?indiacapitalisdelhi#code ends here". Notice how it outputs the question mark: it simply generates a continuation based on the data it was trained on. Base models are used to generate code while you type.
  • Instruct models: Instruction-tuned models are designed to answer questions. Asking one for the capital of India would return something like "The capital of India is New Delhi. It serves as the center of government for the country and...". Instruct models are used for question-and-answer interactions with the LLM; the sketch after this list shows the difference using Ollama's API.
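Once Ollama (covered in the next section) is running, you can see the difference for yourself with two quick API calls. This is only a sketch: the model tags are examples, and the exact output will vary.

```sh
# Base model: treats the prompt as text to continue, not a question to answer.
# codellama:7b-code is used here purely as an example of a base (completion) model.
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-code",
  "prompt": "What is the capital of India",
  "stream": false
}'

# Instruct model: treats the prompt as a question and answers it.
# codellama:7b-instruct is used here purely as an example of an instruct model.
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-instruct",
  "prompt": "What is the capital of India?",
  "stream": false
}'
```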

 

Using Ollama for the local LLM instead of OpenAI's API

Download and install Ollama using `curl -fsSL https://ollama.com/install.sh | sh`. It installs to `/usr/local/bin` and creates a systemd service. You can also build Ollama yourself to take advantage of ROCm on AMD integrated GPUs. Downloaded models are stored in `/usr/share/ollama/.ollama/models/blobs`, and you can change that directory by setting the `OLLAMA_MODELS` environment variable. The API is served on localhost at port 11434. The commands below cover starting, stopping and managing the service and models; a quick API check and one way of relocating the models directory follow after the list:

  • `systemctl list-unit-files | grep ollama` will show you whether the service is installed and enabled.
  • `systemctl status ollama` will show you more info about the process.
  • `sudo systemctl stop ollama.service` will stop the process.
  • `sudo systemctl disable ollama.service` will disable auto-startup.
  • `sudo systemctl start ollama.service` will start the process.
  • `sudo systemctl enable ollama.service` will enable auto-startup.
  • `ollama list` will show a list of the downloaded models.
  • `ollama pull <modelname>` will download any model available in Ollama's model library. Remember to use a smaller model to get faster responses; this is especially important when you're running on CPU only.
  • `ollama rm <modelname>` will delete a downloaded model.
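Two things worth doing after the install, sketched below under the assumption of the default systemd setup: checking that the API answers on port 11434, and pointing `OLLAMA_MODELS` at a different directory via a systemd override (the `/data/ollama-models` path is just a placeholder).

```sh
# Confirm the API is listening on the default port; this returns the
# downloaded models as JSON.
curl http://localhost:11434/api/tags

# Store models in a different directory by overriding the unit's environment.
# The ollama user needs read/write access to the new directory.
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_MODELS=/data/ollama-models"
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```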

 

Twinny 

So far, Twinny has been the only local-LLM extension that has worked well; it is nicely designed and supports multiple models. You can open Twinny's sidebar by pressing Ctrl+Shift+t. The Alt+\ key combo generates code completions, and Ctrl+Shift+/ stops the generation. The code completions are called FIM (Fill In the Middle). Do take note that code completions need "base models"; don't use "instruct models" for completions. To turn the grey suggested code into actual code, press Tab.

Here, the stable-code model is used for fill-in-middle code suggestions
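Under the hood, a FIM request hands the model the code before and after the cursor and asks it to fill the gap. Below is a rough sketch of what that looks like against Ollama directly; the `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` tokens are the ones stable-code is generally documented to use, but other base models use different FIM tokens, so treat this as illustrative.

```sh
# Ask the stable-code base model to fill in a function body, FIM-style.
# "raw": true skips Ollama's prompt template so the FIM tokens pass through as-is.
curl http://localhost:11434/api/generate -d '{
  "model": "stable-code",
  "prompt": "<fim_prefix>def add(a, b):\n    <fim_suffix>\n\nprint(add(2, 3))<fim_middle>",
  "raw": true,
  "stream": false
}'
```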

You can also use the Twinny sidebar to chat with it, similar to how you chat with ChatGPT. For this, you need to use "instruct models". You can even highlight code in the main editor and ask Twinny to explain it or to generate code based on the highlighted code; the model specified under Chat is used for this. The chat sometimes misunderstands the user or produces incomplete output, depending on how the model is prompted, which model is used, and how much processing power is available. I'll update this section if I find out more.

For this chat, the `codellama:latest` model was used.
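Twinny's chat talks to the same local Ollama server; the equivalent bare API call goes to the `/api/chat` endpoint with an instruct-capable model. A minimal sketch, using `codellama:latest` since that's what the chat above used:

```sh
# Chat-style request to the local Ollama server using an instruct-capable model.
curl http://localhost:11434/api/chat -d '{
  "model": "codellama:latest",
  "messages": [
    {"role": "user", "content": "Explain what this Python line does: squares = [x*x for x in range(10)]"}
  ],
  "stream": false
}'
```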

 

There is also a way to use models from Hugging Face; one possible route is sketched below.
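One such route, assuming that's what is meant here, is to download a GGUF file from Hugging Face and import it into Ollama with a Modelfile; the file name below is a placeholder for whatever you downloaded.

```sh
# Import a GGUF model downloaded from Hugging Face into Ollama.
# my-model.Q4_K_M.gguf is a placeholder file name.
echo 'FROM ./my-model.Q4_K_M.gguf' > Modelfile
ollama create my-hf-model -f Modelfile
ollama run my-hf-model
```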


Continue

The other almost-good extension I found was Continue. It can connect to one of Continue's servers, or you can configure it to work with a local LLM. Continue sends telemetry data to its server, so you need to go into the settings to disable that.

Continue didn't account for the entire code when adding new code

As shown in the screenshot above, I selected a function and asked Continue to add an error message for when a file was not found. It did a poor job of adding the new code, even though the entire file was barely 30 lines. This showed me that Continue was poorly programmed. There is, of course, a sidebar where you can type prompts and have code explained or generated.

It connects to this server:

Running `nslookup node-proxy-server-blue-l6vsfbzhba-uw.a.run.app` showed the server it connected to


Continue still has many bugs that need to be resolved. Also, I couldn't fully trust it to run entirely locally, so I uninstalled it.


Other local LLM extensions

  • Wingman-AI: It's good, but it didn't support the `stable-code` model, so I uninstalled it.
  • Backseat Pilot: Poorly documented on how to use it. Meant to be used with llama-cpp-python. Uninstalled it immediately.
  • Open Copilot: Needs Cody and llama.cpp. Uninstalled it.
  • Local Pilot: Requires a GitHub Copilot account.
  • Llama Coder: Didn't generate anything. No interface to work with.
  • Tabby: Needs a Tabby server running in Docker. Works on CPU and CUDA.
  • Your Copilot: Doesn't work out-of-the-box. Meant for the OpenAI API; giving it Ollama's URLs didn't work.
  • Ollama Autocoder: No interface and not straightforward to use. Autocompletion didn't work with the default space-and-pause or Ctrl+Space.
  • Wingman: Uses LM Studio or OpenAI APIs, which can't be used at work without permission, and it likely sends telemetry. It has a nice set of prompt UIs for mode, prompt and placeholders, and caters to general programming, creative writing and technical writing. The AppImage didn't work on Linux.
  • Ollama Copilot: Incompatible with the latest VS Code version.
  • Refact.ai: The local version needs Docker with an NVIDIA GPU.


If you don't want to pay for the OpenAI API but still want to use ChatGPT 3.5's free chat, someone created Headless ChatGPT.