Introduction
Artificial Intelligence, and Large Language Models (LLMs) in particular, is in high demand; since OpenAI released ChatGPT, interest has grown multi-fold. Since 2023, powerful LLMs can also be run on local machines. Local LLMs offer advantages in data privacy and security, and they can be enriched with enterprise-specific data using Retrieval Augmented Generation (RAG). Several tools make it relatively easy to obtain, run, and manage such models locally; a few examples are Ollama, LangChain, and LocalAI.
Semantic Kernel is an SDK from Microsoft that integrates Large Language Models (LLMs) such as OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel also supports plugins that can be chained together to integrate with other tools such as Ollama.
This post describes how to use Ollama to run a model locally and how to communicate with it over its REST API from the Semantic Kernel SDK.
Ollama
To set up Ollama, follow the installation and setup instructions from the Ollama website. Ollama runs as a service, exposing a REST API on a localhost port. Once installed, you can invoke ollama run to talk to a model; the model is downloaded and cached the first time it is requested.
For this post we will use the Phi-3 model, so run ollama run phi3. This downloads the phi3 model if it is not already present and then shows a prompt, from which you can start chatting with the model.
Why Semantic Kernel?
Ollama can be integrated with any application via its REST API, so why use the Semantic Kernel SDK? It provides a simplified integration of AI capabilities into existing applications, lowering the barrier to entry for new developers and supporting the ability to fine-tune models. It also supports multiple languages such as C#, Python, and Java.
Using Ollama
Install Ollama by following the instructions here. Ollama exposes a set of REST APIs (see the documentation here) that provide a range of functions, such as getting a response for a prompt and getting a chat response; for these operations it supports both streaming and non-streaming responses. The first step is to download/pull the model using ollama run phi3. This pulls the model if required and sets it up locally, finally showing a prompt where the user can interact with the model.
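Before wrapping the API in a class, a quick way to confirm the service is reachable is to call the generate endpoint directly. The snippet below is a minimal sketch, assuming Ollama is listening on its default port 11434 and the phi3 model has already been pulled:

```csharp
// Minimal sketch: call Ollama's /api/generate endpoint directly.
// Assumes Ollama is running on http://localhost:11434 and phi3 has been pulled.
using System.Net.Http;
using System.Text;
using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var body = JsonSerializer.Serialize(new { model = "phi3", prompt = "Why is the sky blue?", stream = false });

var response = await http.PostAsync("/api/generate",
    new StringContent(body, Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();

// The non-streaming response is a single JSON object; "response" holds the generated text.
using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
Console.WriteLine(doc.RootElement.GetProperty("response").GetString());
```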
The Ollama API can now be wrapped for convenient access from .NET code. Below is the gateway class.
```csharp
using System.Runtime.CompilerServices;
using System.Text;
using System.Text.Json;
using System.Text.Json.Serialization;

// Thin gateway over the Ollama REST API.
public class OllamaApiClient
{
    private HttpClient _client = new();

    public Configuration Config { get; }

    public interface IResponseStreamer<T>
    {
        void Stream(T stream);
    }

    public class ChatMessage
    {
        [JsonPropertyName("role")]
        public string Role { get; set; }

        [JsonPropertyName("content")]
        public string Content { get; set; }
    }

    public class ChatResponse
    {
        [JsonPropertyName("model")]
        public string Model { get; set; }

        [JsonPropertyName("created_at")]
        public string CreatedAt { get; set; }

        [JsonPropertyName("response")]
        public string Response { get; set; }

        [JsonPropertyName("message")]
        public ChatMessage? Message { get; set; }

        [JsonPropertyName("messages")]
        public List<ChatMessage> Messages { get; set; }

        [JsonPropertyName("embedding")]
        public List<double> Embeddings { get; set; }

        [JsonPropertyName("done")]
        public bool Done { get; set; }
    }

    public class ChatRequest
    {
        [JsonPropertyName("model")]
        public string Model { get; set; }

        [JsonPropertyName("prompt")]
        [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
        public string Prompt { get; set; }

        [JsonPropertyName("format")]
        [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
        public string Format { get; set; }

        [JsonPropertyName("messages")]
        [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
        public IList<ChatMessage> Messages { get; set; }

        [JsonPropertyName("stream")]
        public bool Stream { get; set; } = false;
    }

    public class Configuration
    {
        public Uri Uri { get; set; }
        public string Model { get; set; }
    }

    public OllamaApiClient(string uriString, string defaultModel = "")
        : this(new Uri(uriString), defaultModel)
    {
    }

    public OllamaApiClient(Uri uri, string defaultModel = "")
        : this(new Configuration { Uri = uri, Model = defaultModel })
    {
    }

    public OllamaApiClient(Configuration config)
        : this(new HttpClient() { BaseAddress = config.Uri }, config.Model)
    {
        Config = config;
    }

    public OllamaApiClient(HttpClient client, string defaultModel = "")
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
        _client.Timeout = TimeSpan.FromMinutes(10);
        (Config ??= new Configuration()).Model = defaultModel;
    }

    public async Task<ChatResponse> GetEmbeddingsAsync(ChatRequest message, CancellationToken token)
    {
        message.Model = this.Config.Model;
        return await PostAsync<ChatRequest, ChatResponse>("/api/embeddings", message, token);
    }

    public async Task<ChatResponse> GetResponseForChatAsync(ChatRequest message, CancellationToken token)
    {
        message.Model = this.Config.Model;
        return await PostAsync<ChatRequest, ChatResponse>("/api/chat", message, token);
    }

    public async Task<ChatResponse> GetResponseForPromptAsync(ChatRequest message, CancellationToken token)
    {
        message.Model = this.Config.Model;
        return await PostAsync<ChatRequest, ChatResponse>("/api/generate", message, token);
    }

    public async IAsyncEnumerable<ChatResponse> GetStreamForPromptAsync(ChatRequest message, [EnumeratorCancellation] CancellationToken token)
    {
        message.Model = this.Config.Model;
        message.Stream = true;
        await foreach (ChatResponse resp in StreamPostAsync<ChatRequest, ChatResponse>("/api/generate", message, token))
        {
            yield return resp;
        }
    }

    public async IAsyncEnumerable<ChatResponse> GetStreamForChatAsync(ChatRequest message, [EnumeratorCancellation] CancellationToken token)
    {
        message.Model = this.Config.Model;
        message.Stream = true;
        await foreach (ChatResponse resp in StreamPostAsync<ChatRequest, ChatResponse>("/api/chat", message, token))
        {
            yield return resp;
        }
    }

    private async Task<TResponse> GetAsync<TResponse>(string endpoint, CancellationToken cancellationToken)
    {
        var response = await _client.GetAsync(endpoint, cancellationToken);
        response.EnsureSuccessStatusCode();
        var responseBody = await response.Content.ReadAsStringAsync(cancellationToken);
        return JsonSerializer.Deserialize<TResponse>(responseBody);
    }

    private async Task PostAsync<TRequest>(string endpoint, TRequest request, CancellationToken cancellationToken)
    {
        var content = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");
        var response = await _client.PostAsync(endpoint, content, cancellationToken);
        response.EnsureSuccessStatusCode();
    }

    private async IAsyncEnumerable<TResponse> StreamPostAsync<TRequest, TResponse>(string endpoint, TRequest request, [EnumeratorCancellation] CancellationToken cancellationToken)
    {
        var content = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");
        // Read headers first so each newline-delimited JSON chunk can be yielded as it arrives,
        // instead of buffering the full response.
        using var requestMessage = new HttpRequestMessage(HttpMethod.Post, endpoint) { Content = content };
        using var response = await _client.SendAsync(requestMessage, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
        response.EnsureSuccessStatusCode();
        using Stream stream = await response.Content.ReadAsStreamAsync(cancellationToken);
        using StreamReader reader = new StreamReader(stream);
        while (!reader.EndOfStream)
        {
            var jsonString = await reader.ReadLineAsync(cancellationToken);
            if (string.IsNullOrWhiteSpace(jsonString))
            {
                continue;
            }
            yield return JsonSerializer.Deserialize<TResponse>(jsonString);
        }
    }

    private async Task<TResponse> PostAsync<TRequest, TResponse>(string endpoint, TRequest request, CancellationToken cancellationToken)
    {
        var content = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");
        var response = await _client.PostAsync(endpoint, content, cancellationToken);
        response.EnsureSuccessStatusCode();
        var responseBody = await response.Content.ReadAsStringAsync(cancellationToken);
        return JsonSerializer.Deserialize<TResponse>(responseBody);
    }
}
```
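As an aside, the gateway class can also be used on its own. The following is a minimal usage sketch, assuming Ollama's default endpoint and the phi3 model (the prompts are purely illustrative):

```csharp
// Hypothetical standalone usage of OllamaApiClient, assuming the class above is in scope.
var ollama = new OllamaApiClient("http://localhost:11434", "phi3");

// Non-streaming prompt completion via /api/generate.
var reply = await ollama.GetResponseForPromptAsync(
    new OllamaApiClient.ChatRequest { Prompt = "Summarize Semantic Kernel in one sentence." },
    CancellationToken.None);
Console.WriteLine(reply.Response);

// Streaming chat via /api/chat; chunks are written as they are generated.
var chat = new OllamaApiClient.ChatRequest
{
    Messages = new List<OllamaApiClient.ChatMessage>
    {
        new() { Role = "user", Content = "What is Ollama?" }
    }
};
await foreach (var chunk in ollama.GetStreamForChatAsync(chat, CancellationToken.None))
{
    Console.Write(chunk.Message?.Content);
}
```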
With this class in place, it can now be integrated with Semantic Kernel.
Integrating with Semantic Kernel
The Semantic Kernel SDK operates on a plug-in system, where developers can use pre-built plugins or create their own. These plugins consist of prompts that the AI model should respond to, as well as functions that can complete specialized tasks. Accordingly, it provides interfaces for Chat Completion (https://learn.microsoft.com/en-us/dotnet/api/microsoft.semantickernel.chatcompletion.ichatcompletionservice?view=semantic-kernel-dotnet) and Text Generation tasks, which can be used to integrate with external implementations like Ollama.
Below are implementations of these interfaces that use the Ollama API.
```csharp
using System.Runtime.CompilerServices;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.TextGeneration;

// ITextGenerationService implementation backed by the Ollama /api/generate endpoint.
public class TextGenerationService : ITextGenerationService
{
    public string ModelApiEndPoint { get; set; }
    public string ModelName { get; set; }

    public IReadOnlyDictionary<string, object?> Attributes => throw new NotImplementedException();

    public async Task<IReadOnlyList<TextContent>> GetTextContentsAsync(string prompt, PromptExecutionSettings? executionSettings = null, Kernel? kernel = null, CancellationToken cancellationToken = default)
    {
        var client = new OllamaApiClient(ModelApiEndPoint, ModelName);
        var req = new OllamaApiClient.ChatRequest
        {
            Model = ModelName,
            Prompt = prompt,
        };
        OllamaApiClient.ChatResponse resp = await client.GetResponseForPromptAsync(req, cancellationToken);
        return new List<TextContent>() { new TextContent(resp.Response) };
    }

    public async IAsyncEnumerable<StreamingTextContent> GetStreamingTextContentsAsync(string prompt, PromptExecutionSettings? executionSettings = null, Kernel? kernel = null, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var ollama = new OllamaApiClient(ModelApiEndPoint, ModelName);
        var req = new OllamaApiClient.ChatRequest
        {
            Prompt = prompt,
            Stream = true
        };
        await foreach (OllamaApiClient.ChatResponse resp in ollama.GetStreamForPromptAsync(req, cancellationToken))
        {
            yield return new StreamingTextContent(text: resp.Response);
        }
    }
}
```
```csharp
using System.Runtime.CompilerServices;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// IChatCompletionService implementation backed by the Ollama /api/chat endpoint.
public class OllamaChatCompletionService : IChatCompletionService
{
    public string ModelApiEndPoint { get; set; }
    public string ModelName { get; set; }

    public IReadOnlyDictionary<string, object?> Attributes => throw new NotImplementedException();

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(ChatHistory chatHistory, PromptExecutionSettings? executionSettings = null, Kernel? kernel = null, CancellationToken cancellationToken = default)
    {
        var client = new OllamaApiClient(ModelApiEndPoint, ModelName);
        var req = new OllamaApiClient.ChatRequest
        {
            Model = ModelName,
            Messages = new List<OllamaApiClient.ChatMessage>()
        };
        // Map the Semantic Kernel chat history onto Ollama chat messages.
        foreach (var history in chatHistory)
        {
            req.Messages.Add(new OllamaApiClient.ChatMessage
            {
                Role = history.Role.ToString(),
                Content = history.Content
            });
        }
        OllamaApiClient.ChatResponse resp = await client.GetResponseForChatAsync(req, cancellationToken);
        // Ollama replies with role "assistant" (or "system"); map it onto the corresponding AuthorRole.
        AuthorRole role = resp.Message.Role.Equals("system", StringComparison.InvariantCultureIgnoreCase) ? AuthorRole.System
            : resp.Message.Role.Equals("assistant", StringComparison.InvariantCultureIgnoreCase) ? AuthorRole.Assistant
            : AuthorRole.User;
        List<ChatMessageContent> content = new();
        content.Add(new(role: role, content: resp.Message.Content));
        return content;
    }

    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(ChatHistory chatHistory, PromptExecutionSettings? executionSettings = null, Kernel? kernel = null, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var client = new OllamaApiClient(ModelApiEndPoint, ModelName);
        var req = new OllamaApiClient.ChatRequest
        {
            Model = ModelName,
            Messages = new List<OllamaApiClient.ChatMessage>()
        };
        foreach (var history in chatHistory)
        {
            req.Messages.Add(new OllamaApiClient.ChatMessage
            {
                Role = history.Role.ToString(),
                Content = history.Content
            });
        }
        // Stream chunks using the caller-supplied cancellation token rather than creating a new one.
        await foreach (OllamaApiClient.ChatResponse resp in client.GetStreamForChatAsync(req, cancellationToken))
        {
            AuthorRole role = resp.Message.Role.Equals("system", StringComparison.InvariantCultureIgnoreCase) ? AuthorRole.System
                : resp.Message.Role.Equals("assistant", StringComparison.InvariantCultureIgnoreCase) ? AuthorRole.Assistant
                : AuthorRole.User;
            yield return new(role: role, content: resp.Message.Content ?? string.Empty);
        }
    }
}
```
The above implementations are for demonstration purposes only; further optimization is certainly possible.
After this, it is time to use them as a client of the Semantic Kernel SDK. Below is the test case for the chat completion service.
```csharp
[Fact]
public async Task TestChatGenerationviaSK()
{
    var ollamachat = ServiceProvider.GetChatCompletionService();

    // Semantic Kernel builder
    var builder = Kernel.CreateBuilder();
    builder.Services.AddKeyedSingleton<IChatCompletionService>("ollamaChat", ollamachat);
    // builder.Services.AddKeyedSingleton<ITextGenerationService>("ollamaText", ollamaText);
    var kernel = builder.Build();

    // Chat generation
    var chatGen = kernel.GetRequiredService<IChatCompletionService>();
    ChatHistory chat = new("You are an AI assistant that helps people find information.");
    chat.AddUserMessage("What is Sixth Sense?");
    var answer = await chatGen.GetChatMessageContentAsync(chat);

    Assert.NotNull(answer);
    Assert.NotEmpty(answer.Content!);
    System.Diagnostics.Debug.WriteLine(answer.Content!);
}
```
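For completeness, a similar test can exercise the text generation path. The sketch below is hypothetical: it constructs the TextGenerationService defined above directly, and the endpoint, model name, and prompt are illustrative assumptions.

```csharp
[Fact]
public async Task TestTextGenerationviaSK()
{
    // Hypothetical setup; endpoint and model are assumptions for illustration.
    var ollamaText = new TextGenerationService
    {
        ModelApiEndPoint = "http://localhost:11434",
        ModelName = "phi3"
    };

    var builder = Kernel.CreateBuilder();
    builder.Services.AddKeyedSingleton<ITextGenerationService>("ollamaText", ollamaText);
    var kernel = builder.Build();

    // Text generation via the registered service.
    var textGen = kernel.GetRequiredService<ITextGenerationService>();
    var results = await textGen.GetTextContentsAsync("Explain Retrieval Augmented Generation in one sentence.");

    Assert.NotEmpty(results);
    Assert.NotEmpty(results[0].Text!);
}
```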
Full Source code of this post is available here.
Summary
Local AI combined with Retrieval Augmented Generation is a powerful combination that anyone can get started with, without needing subscriptions and while preserving data privacy. The next step is to use RAG to augment results with enterprise/private data.
Happy Coding !!
Helpful Links