A short interim report.
Found this one: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF Supposedly it's 8 experts of 7B each with only a couple of them active per token, which sounds great.
MoE support seems to be present in my build of llama.cpp:
PS E:\llamacpp\llama-b6565-bin-win-cuda-12.4-x64> ./llama-server.exe --help|findstr moe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650, compute capability 7.5, VMM: yes
The following devices will have suboptimal performance due to a lack of tensor cores:
  Device 0: NVIDIA GeForce GTX 1650
Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.
load_backend: loaded CUDA backend from E:\llamacpp\llama-b6565-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\llamacpp\llama-b6565-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\llamacpp\llama-b6565-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
--cpu-moe, -cmoe                keep all Mixture of Experts (MoE) weights in the CPU
--n-cpu-moe, -ncmoe N           keep the Mixture of Experts (MoE) weights of the first N layers in the
--cpu-moe-draft, -cmoed         keep all Mixture of Experts (MoE) weights in the CPU for the draft
--n-cpu-moe-draft, -ncmoed N    keep the Mixture of Experts (MoE) weights of the first N layers in the
gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2,
gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2,
I launch it like this: ./llama-server.exe -m ../model/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --threads 12 --gpu-layers 12 --n-cpu-moe 20
But I hit a wall:
print_info: PAD token        = 0 ''
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 ''
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: missing tensor 'blk.0.ffn_down_exps.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '../model/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf', try reducing --n-gpu-layers if you're running out of VRAM
srv    load_model: failed to load model, '../model/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf'
srv  operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
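My guess, and it is only a guess, is that this old quant stores the expert weights in the pre-fused per-expert layout, so the loader cannot find blk.0.ffn_down_exps.weight. A quick way to see what the file actually contains is the gguf Python package from the llama.cpp repo; a minimal sketch, with my local path hard-coded:

# Dump the FFN tensor names of the first block to see whether the file uses
# the fused expert layout (blk.N.ffn_*_exps.weight) that current llama.cpp
# expects, or an older per-expert layout. Requires: pip install gguf
from gguf import GGUFReader

reader = GGUFReader("../model/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf")
for tensor in reader.tensors:
    if tensor.name.startswith("blk.0.") and "ffn" in tensor.name:
        print(tensor.name)
# If no blk.0.ffn_*_exps.weight tensors show up, the quant predates the
# current MoE layout and probably needs a re-converted (newer) GGUF.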
I'll try again at home, maybe it will take off on the 3060.
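If it does load there, the plan is to smoke-test it with a minimal request. A sketch of that, assuming llama-server's default OpenAI-compatible endpoint on 127.0.0.1:8080; the prompt and parameters are just placeholders of mine:

# Minimal smoke test for llama-server once the model loads.
# Assumptions: default bind address 127.0.0.1:8080 and the standard
# OpenAI-compatible /v1/chat/completions route.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])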
And how is it going for you? Has anyone managed to get it running and taste the fruits, especially on modest hardware like mine? After all, the split into active and inactive weights ought to give a decent speedup, and it should even work on a CPU; a rough estimate of why is below.
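A back-of-the-envelope estimate, using the commonly quoted Mixtral figures (around 46.7B parameters in total and around 12.9B active per token with top-2 routing; these come from the model's public description, not from my own measurements):

# Back-of-the-envelope: how much of Mixtral 8x7B is touched per token.
total_params = 46.7e9   # commonly quoted total parameter count (assumption)
active_params = 12.9e9  # commonly quoted active parameters per token (assumption)
print(f"active per token: {active_params / total_params:.0%}")  # roughly 28%
# Only about a quarter of the weights are used for any given token, which is
# why parking the expert FFNs in system RAM (--cpu-moe / --n-cpu-moe) should
# hurt much less than the raw 47B size suggests.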