Small Language Models + local GPT + Unix-style pipelines!

Posted by Venkatesh Subramanian on February 04, 2024 · 9 mins read

Motivation for small language models: In my previous Generative AI posts we explored large language models like OpenAI's ChatGPT, which have more than 100 billion parameters.
In contrast, small language models typically have fewer than 20 billion parameters. They require less memory and compute, making them ideal for running on edge devices or local systems. These models are also trained on smaller datasets, so they are more resource-efficient in the training phase as well, before deployment for inference. While they may not capture the same level of complexity and nuanced understanding as larger models, they are still effective for tasks such as text generation, sentiment analysis, and simple language understanding.

Two key benefits:

  • Run anytime, anywhere - Developers can easily run these on their local laptops without the need for high-end GPUs. Plus, no need to charge your credit card for cloud-based models like OpenAI's!
  • Keep data in-house - With small language models, you can set up and run everything on your internal servers, preserving the privacy of sensitive data used with the models, all within your team's governance.

What are some examples of SLMs? Some examples of these smaller language models are Llama2, Mistral, and Microsoft Phi.

Llama2 is a family of pre-trained and fine-tuned models from Meta ranging from 7B to 70B parameters in size.
Mistral is a 7B dense transformer model from Mistral AI.
Phi is a suite of small language models from Microsoft ranging from 1.3B to 2.7B parameters, making it leaner than even the Llama2 and Mistral 7B models above.

Where can you get them, and what are the typical sizes in storage and RAM? These models can be accessed from the Hugging Face model hub. The Llama2 7B model's original size is 13 GB; however, we can use a "quantized" (I will explain this in a bit..) 4-bit version that is only 3.9 GB in size. Mistral 7B is very similar to Llama2 7B, given that they have the same number of parameters.
The original Microsoft Phi-2 model is about 5 GB, and its optimized 4-bit quantized version is about 2.9 GB. Note that the storage size and the memory needed to load the model are roughly equivalent as of now.
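If you prefer a programmatic alternative to clicking through the portal, the huggingface_hub library can fetch a single GGUF file directly. This is a minimal sketch; the repo id and filename below are illustrative, so substitute the quantized variant you actually want.

    from huggingface_hub import hf_hub_download

    # Illustrative repo and filename -- pick the quantized GGUF variant you want.
    model_path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-GGUF",
        filename="llama-2-7b.Q4_K_M.gguf",   # ~3.9 GB 4-bit quant
        local_dir="./models",
    )
    print(model_path)  # local path to the downloaded .gguf file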

What is Quantization and the GGUF format?
Quantization is a technique to reduce the computational and memory cost of model inference by decreasing the precision with which the weights and activations of the model are represented. Fewer bits mean less storage, less energy use, and faster computation. Of course, the tradeoff can be reduced accuracy due to the information loss caused by lower precision.
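A quick back-of-the-envelope calculation shows where the size numbers above come from. This is only a rough sketch: it assumes roughly 4.5 effective bits per weight for the 4-bit quant (the format stores some scaling metadata) and ignores embedding tables and file overhead.

    # Rough model size: number of parameters x bits per weight, converted to GiB.
    def approx_size_gib(num_params: float, bits_per_weight: float) -> float:
        return num_params * bits_per_weight / 8 / (1024 ** 3)

    print(f"7B at fp16    : {approx_size_gib(7e9, 16):.1f} GiB")   # ~13 GiB, the original download
    print(f"7B at ~4.5 bit: {approx_size_gib(7e9, 4.5):.1f} GiB")  # ~3.7 GiB, close to the 3.9 GB quant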

GGUF, or the GPT-Generated Unified Format, is a binary format for fast loading and saving of models, released in August 2023 by AI community contributors including Georgi Gerganov. It is designed so that all the information needed to load a model is contained in the file. It is also extensible, so new information can be added to models with backward compatibility. Note that the GGUF tooling (llama.cpp) is implemented in C++. So if you are writing code in Python, you will need a Python binding for the C++ code to bridge the two languages. "llama-cpp-python" is in fact a Python binding for llama.cpp.
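As a quick sanity check that the binding and the GGUF file work, you can call the model directly through llama-cpp-python before adding any orchestration on top. A minimal sketch; the model path is a placeholder for whichever GGUF file you downloaded.

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Placeholder path -- point it at your downloaded GGUF file.
    llm = Llama(model_path="./models/mistral-7b.Q4_K_M.gguf", n_ctx=2048)

    # Plain completion call, no framework involved.
    result = llm("Q: What is quantization in one sentence? A:", max_tokens=64, stop=["Q:"])
    print(result["choices"][0]["text"])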

There are also some more advanced model architectures such as Mixture of Experts (MoE), where a model is composed of several expert subnetworks and a router chooses which experts handle each input. Mixtral from Mistral AI is an example of this. We will not go into the details of MoE in this article; I will cover it in a future deep dive!

Bringing it all together with Unix style pipelines

  • Step 1: Go to the Hugging Face portal, search for the model of your choice, and download it to your local system.
  • Step 2: Import the necessary llama-cpp-python library and load the model into memory.
      from llama_cpp import Llama
      # Point model_path at the GGUF file you downloaded in Step 1
      llm = Llama(model_path="./models/<your model binary path>")
    
  • Step 3: Use LangChain Expression Language (LCEL) to do the Unix-pipeline magic as below!
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate

    prompt = ChatPromptTemplate.from_template("Tell me about the film {movie}")
    output_parser = StrOutputParser()

    # llm2 is another model instance loaded as in Step 2 (I reuse this chain with different models below)
    chain = prompt | llm2 | output_parser

    chain.invoke({"movie": "Jurassic Park"})

Note how the parameterized prompt is piped into an LLM object, and the output coming from the model is then piped into an output parser. This is a very clean and succinct way to replace several dozen lines of code with the pipes-and-filters pattern in a single line.
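One caveat: depending on your LangChain version, a raw llama_cpp.Llama object may not compose cleanly with the | operator, since LCEL expects Runnables or LangChain LLM wrappers on either side of the pipe. If you run into that, LangChain's LlamaCpp class (which itself uses llama-cpp-python under the hood) drops straight into the chain. A minimal sketch, with an assumed model path:

    from langchain_community.llms import LlamaCpp
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate

    # LlamaCpp is LangChain's wrapper around llama-cpp-python, so it is pipe-friendly.
    llm2 = LlamaCpp(model_path="./models/<your model binary path>", n_ctx=2048)

    prompt = ChatPromptTemplate.from_template("Tell me about the film {movie}")
    chain = prompt | llm2 | StrOutputParser()
    print(chain.invoke({"movie": "Jurassic Park"}))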

The Llama2 7B 4-bit quant model only completed my chat sentences and did not answer direct questions. And sometimes it gave me answers in German or Dutch instead of English, even when I explicitly instructed it to use English!

Next I tried the Mistral 7B 4-bit quant model.
Output for the above prompt from the Mistral 7B model:
. живело 1947 gånger\nThe 180-page script for Steven Spielberg\'s 1993 movie "Jurassic Park" is displayed at a Christie\'s auction in New York on November 2, 2015. A copy of the original screenplay from the 1993 film Jurassic Park is estimated to sell for upwards of $6 million when it goes under the hammer next month, Christies said today. In addition to being one of the most popular films in history, “Jurassic Park” also stands as one of Steven Spielberg’s best-received and highest grossing efforts (in North America, at least), so much so that it has spawned several sequels, a theme park ride and multiple video games. 2021-04-08 · The dinosaur-filled island is being built by Universal Parks & Resorts for the Jurassic World: Isla Nublar attraction in the new Epic Universe theme park at the Orlando resort. 2019-03-27 · The film\'s success also led to a sequ

As you can see it abruptly ends mid-sentence.
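The cut-off is most likely the generation hitting the wrapper's default token limit rather than anything specific to Mistral. If you are using the LlamaCpp wrapper, raising max_tokens should let it finish; the values below are illustrative, not tuned.

    from langchain_community.llms import LlamaCpp

    # Allow longer completions; values are illustrative.
    llm2 = LlamaCpp(
        model_path="./models/<your model binary path>",
        n_ctx=4096,        # context window in tokens
        max_tokens=512,    # cap on newly generated tokens per call
    )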

Next I asked Phi-2 to write a short essay on India, as follows.

    prompt = ChatPromptTemplate.from_template("Write a short essay on {place} maximum 100 words")
    output_parser = StrOutputParser()
    # llm3 is the Phi-2 4-bit quant model, loaded the same way as in Step 2
    chain = prompt | llm3 | output_parser
    chain.invoke({"place": "India"})

The output came out as follows:
\nAnswer: India is the second-most populous country in the world, with a population of approximately 1.3 billion people as per the 2021 estimates. This makes it one of the most diverse and culturally rich nations, with more than 2,000 languages and dialects spoken across its 29 states and eight union territories.\n\nThe country has a complex political structure, with a federal system that allows for greater autonomy at the state level. The President is the head of state, while the Prime Minister heads the government. India also has a multi-party democratic political system, where elections are held regularly to select representatives from various parties to form the central and state governments.\n\nIndia's economy is one of the fastest growing in the world, with a GDP that has been steadily increasing over the years. It is the seventh largest country by land area and the second most populous, making it an important player on the global stage. The country is known for its diverse industries such as manufacturing, agriculture, information technology, and tourism.\n

This is crisp and reasonably good output.

Conclusion: In my experiment, the quantized Phi-2 model's responses were far better than those of both the Mistral and Llama2 7B models, even though Phi-2 has fewer parameters. Also note that the Hugging Face site has model cards detailing the known capabilities and limitations of every model. This is all the more important for smaller models, as they typically specialize in a few areas at the cost of cutting corners elsewhere, by design. Larger language models, in contrast, try to cater to multiple styles of prompts simultaneously. So careful testing and validation of these models is key before deciding to deploy them into production. We can expect both types of models to coexist, so users can leverage each type for its strengths.

