Running ChatGLM-6B Dialogue Language Model Successfully on Windows 10: Detailed Process

Publish: 2023-04-08 | Modify: 2023-04-08

ChatGLM-6B is an open-source conversational language model based on the General Language Model (GLM) architecture, supporting both Chinese and English. The model is optimized with techniques similar to those used for ChatGPT: it was trained on roughly 1T tokens of Chinese-English bilingual data, has about 6.2 billion parameters, and is further refined with supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback.

ChatGLM-6B is jointly developed by the KEG Lab at Tsinghua University and Zhipu AI. Thanks to model quantization, users can deploy it locally on consumer-grade GPUs (a minimum of about 6GB of VRAM at the INT4 quantization level).

ChatGLM-6B can be understood as a locally deployable lightweight version of ChatGPT.

After multiple attempts, xiaoz finally successfully ran the ChatGLM-6B conversational language model on Windows 10. This article records and shares the entire process.

Prerequisites

This article is aimed at readers interested in artificial intelligence and assumes a certain level of programming and computer knowledge. If you are already familiar with the Python programming language, it will be easier to follow.

Hardware & Software Preparation

ChatGLM-6B has certain requirements for both hardware and software. Here is xiaoz's hardware information:

  • CPU: AMD 3600
  • Memory: DDR4 16GB
  • GPU: RTX 3050 (8GB)

Software environment:

  • Operating system: Windows 10 (other operating systems should also work)
  • Git tool installation
  • Python installation (version 3.10)
  • NVIDIA driver installation

This article assumes a certain level of programming and computer knowledge, so it does not cover how to install or use the tools listed above; if you are not familiar with them, the rest of this guide will be hard to follow. A quick way to verify the environment is shown below.
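
As a quick sanity check (not part of the original setup steps), you can confirm from Python that the interpreter version and the Git/NVIDIA tooling are in place; the expected values in the comments are assumptions based on the environment above:

import shutil
import sys

print(sys.version)                 # expect a 3.10.x interpreter
print(shutil.which("git"))         # path to git, or None if Git is not installed
print(shutil.which("nvidia-smi"))  # available once the NVIDIA driver is installed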

Deploying ChatGLM-6B

ChatGLM-6B is open-sourced on GitHub: https://github.com/THUDM/ChatGLM-6B

First, you need to clone the code using Git:

git clone https://github.com/THUDM/ChatGLM-6B.git

Next, xiaoz configured pip to use the Aliyun mirror source so that the various Python dependencies install smoothly. The commands are:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip config set install.trusted-host mirrors.aliyun.com

Then, go to the ChatGLM-6B directory and install Python dependencies using the following command:

pip install -r requirements.txt
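
One optional way to confirm the installation worked (a small check of my own, not from the original steps) is to import the two packages this walkthrough relies on most and print their versions:

# transformers and torch are the key dependencies for running ChatGLM-6B
import torch
import transformers

print(transformers.__version__)
print(torch.__version__)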

Next, download the model. Because xiaoz's GPU is relatively weak, the 4-bit quantized model was chosen. It is recommended to download the model in advance, since letting the built-in Python script download it at runtime is slow and prone to failure.

The model weights are hosted on the Hugging Face Hub. To fetch the 4-bit quantized version, xiaoz executed the following command:

git clone -b int4 https://huggingface.co/THUDM/chatglm-6b.git

The pytorch_model.bin file is quite large. If the git command is slow or fails, you can try manually downloading pytorch_model.bin and placing it in the local repository directory.

Note: downloading only the .bin file is not enough; the accompanying .json, .py, and other files from the repository must sit in the same directory. It is recommended to clone the entire repository with Git first, then manually download the .bin file into that same folder.
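
If plain git cloning of the weights keeps failing, one alternative (not covered in the original steps) is the huggingface_hub library, which downloads the whole repository including the large files; the local_dir below is just an example path:

# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the int4 revision of THUDM/chatglm-6b into a local folder
snapshot_download(
    repo_id="THUDM/chatglm-6b",
    revision="int4",
    local_dir="D:/apps/ChatGLM-6B/model/int4/chatglm-6b",  # example path, adjust as needed
)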

Running ChatGLM-6B

Enter the Python interactive terminal and run the ChatGLM-6B model with the following code:

# Import dependencies
from transformers import AutoTokenizer, AutoModel

# Path to the model cloned from the Hugging Face Hub (use forward slashes on Windows)
mypath = "D:/apps/ChatGLM-6B/model/int4/chatglm-6b"

# Load the tokenizer and the quantized model in half precision on the GPU
tokenizer = AutoTokenizer.from_pretrained(mypath, trust_remote_code=True)
model = AutoModel.from_pretrained(mypath, trust_remote_code=True).half().cuda()
model = model.eval()

# First round of conversation
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
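
Since model.chat returns the updated conversation history, a multi-turn conversation just feeds that history back into the next call. A minimal continuation of the session above (the follow-up prompt is only an example):

# Second round: pass the history from the previous call so the model keeps context
response, history = model.chat(tokenizer, "Please introduce yourself in one sentence.", history=history)
print(response)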

Running it did not go as smoothly as I had imagined: I hit the error "Torch not compiled with CUDA enabled," which I solved by referring to this issue.

The first step of the fix is to check whether the installed PyTorch can actually use CUDA:

python -c "import torch; print(torch.cuda.is_available())"

If it returns False, the installed PyTorch build does not support CUDA. In that case, I executed the following command to install a CUDA-enabled build:

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 -f https://download.pytorch.org/whl/cu118/torch_stable.html

After that, there were no more errors. However, everyone's hardware and software differ, so the errors you run into may vary; be flexible and adapt accordingly.

Testing ChatGLM-6B in the Command Line

Running the 4-bit quantized model is still fairly resource-intensive: the 8GB of VRAM on the RTX 3050 fills up easily, and responses feel slow.

[Screenshot: ChatGLM-6B CLI test]
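
To get a rough idea of how much VRAM the model is consuming while it answers (a side note, not part of the original walkthrough), PyTorch's own counters can be printed from the same Python session:

import torch

# Memory currently allocated by PyTorch tensors, and the peak so far, in GB
print(torch.cuda.memory_allocated() / 1024**3)
print(torch.cuda.max_memory_allocated() / 1024**3)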

The official repository also provides a web demo and an API for running the model. I ran into some errors with the web demo that I have not resolved yet; the command-line approach described above works fine.

Additional Notes

To check the PyTorch version you have installed, you can enter the following code in the Python interactive environment:

import torch
print(torch.__version__)

If the result ends in +cpu, that build does not support CUDA; refer to the fix for the "Torch not compiled with CUDA enabled" error described above.

Check the PyTorch version again after reinstalling: if it ends in +cuxxx, GPU support is enabled. In other words, the CPU-only +cpu build cannot be used here; only a +cu build with GPU support will work.
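
A slightly fuller check combines the version string with the CUDA availability test from earlier (the device name printed at the end depends on your GPU):

import torch

print(torch.__version__)           # a CUDA build ends in +cuxxx, e.g. 2.0.0+cu118
print(torch.cuda.is_available())   # should be True after installing the +cu118 build
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the GPU PyTorch will use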

Personal Experience

I had a few short conversations with ChatGLM-6B, and the results feel quite good. Although it is not on ChatGPT's level overall, ChatGLM-6B is open-sourced by a domestic team and runs on consumer-grade GPUs, which deserves real credit. I hope the team keeps improving it and narrows the gap with ChatGPT.

ChatGLM-6B GitHub repository: https://github.com/THUDM/ChatGLM-6B

