Introducing the Open-Source Project: Native-LLM-for-Android
Project repository: Native-LLM-for-Android. Written in 2025. The project supports running large language models (LLMs) on Android devices; the supported models include:
DeepSeek-R1-Distill-Qwen: 1.5B
Qwen2.5-Instruct: 0.5B, 1.5B
Qwen2/2.5VL: 2B, 3B
MiniCPM-DPO/SFT: 1B, 2.7B
Gemma2-it: 2B
Phi3.5-mini-instruct: 3.8B
Llama-3.2-Instruct: 1B
Native-LLM-for-Android offers two main reference points: 1. exporting an LLM to an ONNX model, and 2. running the LLM on Android. This post focuses on exporting an LLM to ONNX for inference (locally modifying an existing LLM and then exporting it), using the MiniCPM model as the test case. The project also contains a series of model quantization scripts that are worth studying.
1. Model Runtime Performance
The fastest model is Llama3.2-1B-Instruct q8f32, reaching 25 tokens per second on a Nubia Z50 (Android 13, 8_Gen2 CPU); next is Distill-Qwen-1.5B q8f32, reaching 22 tokens per second on a Xiaomi-14T-Pro (HyperOS 2, MediaTek_9300+ CPU).
2. Core Features
This post focuses on exporting the LLM to ONNX so it can run outside the torch environment, so the Android runtime code is not discussed here.
2.1 Tokenizer
The tokenizer has two jobs:
1. split the input text into sub-word pieces and convert them into tokens;
2. convert the tokens produced by the model back into text.
Note that different LLMs use different tokenization rules, and the corresponding token encodings also differ.
ONNX models are usually run with the Tokenizer from the Transformers library, which cannot be decoupled from the torch environment, so a standalone implementation is needed.
The Tokenizer in the Native-LLM-for-Android project relies on the implementation from the mnn-llm repository.
The corresponding code (linked in the original post) is pure C++ and has no dependency on the torch environment.
In addition, under each model's Android-onnx code path there is a corresponding C++ Tokenizer implementation. A toy sketch of the two tokenizer roles follows below.
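To make the two roles concrete, here is a toy sketch (my own illustration, not project code; the vocabulary and the whitespace split are made up) of what any standalone tokenizer must provide: encoding text into token ids and decoding ids back into text. Real tokenizers apply BPE/SentencePiece merge rules instead of a simple lookup.

# Toy stand-in for the two tokenizer roles; vocabulary and splitting rule are illustrative only.
vocab = {"<s>": 1, "</s>": 2, "你好": 3, "世界": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    # Real tokenizers use BPE/SentencePiece merges; a whitespace lookup is only a placeholder.
    return [vocab[piece] for piece in text.split() if piece in vocab]

def decode(ids):
    return "".join(id_to_token[i] for i in ids if i in id_to_token)

ids = encode("你好 世界")          # -> [3, 4]
print(ids, decode(ids))            # -> [3, 4] 你好世界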
2.2 Exporting the ONNX Model
In the Export_ONNX directory of the Native-LLM-for-Android project, each model has its own export script.
For the Gemma model, for example, steps A-B-C are executed in turn to export three models, and the final export script also contains ONNX inference code.
The QwenVL export is more involved: the modeling_qwen2_vl.py file in the transformers library has to be rewritten and overridden, and the single model is split into five models. Model A is the main body of the ViT, model E is the main body of the LLM, and models B, C and D are slicing/indexing operations exported as separate models. If exporting model E reports an error, refer to the fix referenced in the original post. A generic sketch of the export mechanics is given below.
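For readers unfamiliar with how these export scripts work, here is a minimal, self-contained sketch of the torch.onnx.export mechanics they build on. The TinyBlock module, tensor shapes, input/output names and file name are illustrative assumptions, not the project's real graph or interface.

import torch

class TinyBlock(torch.nn.Module):
    # Toy stand-in for one exported model part (e.g. an LLM body with a cache-like extra input).
    def __init__(self, hidden=64):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)

    def forward(self, hidden_states, past_state):
        out = self.proj(hidden_states)
        new_state = past_state + out       # toy stand-in for a KV-cache update
        return out, new_state

model = TinyBlock().eval()
dummy_hidden = torch.zeros(1, 8, 64)
dummy_state = torch.zeros(1, 8, 64)
torch.onnx.export(
    model,
    (dummy_hidden, dummy_state),
    "tiny_block.onnx",                     # hypothetical output path
    input_names=["hidden_states", "past_state"],
    output_names=["output", "new_state"],
    do_constant_folding=True,
    opset_version=17,
)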
If the model export fails with the error
RuntimeError: The serialized model is larger than the 2GiB limit imposed by the protobuf library. Therefore the output file must be a file path, so that the ONNX external data can be written to the same directory. Please specify the output file name.
then try downgrading torch to 2.4.1:
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url
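A quick sanity check (my own suggestion, not part of the original post) before re-running the export is to confirm that the downgraded versions are the ones actually being imported:

import torch, torchvision, torchaudio
print(torch.__version__, torchvision.__version__, torchaudio.__version__)   # expect 2.4.1 / 0.19.1 / 2.4.1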
2.3 ONNX Model Quantization
For background on ONNX model quantization, refer to the documentation cited in the original post. In short, ONNX quantization can be either dynamic or static: dynamic quantization computes the scale and zero point on the fly from the input data at inference time, while static quantization computes the scale and zero point offline using a calibration dataset. In general, dynamic quantization is recommended for RNN and Transformer-based models, and static quantization for CNN models.
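For contrast with the dynamic-quantization scripts this project uses, here is a hedged sketch of the static path using onnxruntime's quantize_static. The model paths, the input name "input_ids" and the random calibration samples are placeholders; a real calibration set would consist of representative inputs.

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class ListCalibrationReader(CalibrationDataReader):
    # Feeds one {input_name: ndarray} dict per call; returns None when exhausted.
    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        return next(self._iter, None)

# Placeholder calibration data; replace with representative inputs for your own model.
samples = [{"input_ids": np.random.randint(0, 1000, size=(1, 128), dtype=np.int64)} for _ in range(16)]

quantize_static(
    model_input="model_fp32.onnx",           # placeholder path
    model_output="model_int8_static.onnx",   # placeholder path
    calibration_data_reader=ListCalibrationReader(samples),
    per_channel=True,
    weight_type=QuantType.QInt8,
)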
The Native-LLM-for-Android-main\Do_Quantize\Dynamic_Quant directory contains quantization scripts for several configurations (shown in a figure in the original post).
The q8_f16 quantization code is shown below; note that quantizing large models hinges on the key parameter is_large_model.
import os
import gc
import glob
import sys
import onnx
import torch
import subprocess
import onnx.version_converter
from onnxsim import simplify
from onnxslim import slim
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.transformers.optimizer import optimize_model
from transformers import AutoModelForCausalLM
# Path Setting
original_folder_path = r"C:\Users\Downloads\Model_ONNX"                          # The original folder.
quanted_folder_path = r"C:\Users\Downloads\Model_ONNX_Optimized"                 # The optimized folder.
model_path = os.path.join(original_folder_path, "Model.onnx")                    # The original fp32 model path.
quanted_model_path = os.path.join(quanted_folder_path, "Model_Optimized.onnx")   # The optimized model stored path.
download_path = r'C:\Users\Downloads\Qwen2-1.5B-Instruct'                        # Set the folder path where the whole LLM project was downloaded, otherwise set "NONE".
use_gpu = True                                                                   # If true, the transformers.optimizer will keep the FP16 processes.
provider = 'CPUExecutionProvider'                                                # ['CPUExecutionProvider', 'CUDAExecutionProvider']
use_low_memory_mode_in_Android = False                                           # If you need to use low memory mode on Android, please set it to True.
# Preprocess: this also costs a lot of memory, so you can skip this command and go straight to quantizing.
# Calling subprocess may fail with a permission error on Windows. (optional process)
subprocess.run([f'python -m onnxruntime.quantization.preprocess --auto_merge --all_tensors_to_one_file --input {model_path} --output {quanted_folder_path}'], shell=True)
# Start Quantize
quantize_dynamic(
    model_input=model_path,
    model_output=quanted_model_path,
    per_channel=True,                          # True for model accuracy but costs a lot of time during the quantization process.
    reduce_range=False,                        # True for some x86_64 platforms.
    weight_type=QuantType.QInt8,               # It is recommended to use uint8 + Symmetric False.
    extra_options={
        'ActivationSymmetric': False,          # True for inference speed. False may keep more accuracy.
        'WeightSymmetric': False,              # True for inference speed. False may keep more accuracy.
        'EnableSubgraph': True,                # True for more quant.
        'ForceQuantizeNoInputCheck': False,    # True for more quant.
        'MatMulConstBOnly': True               # False for more quant. Sometimes the inference speed may get worse.
    },
    nodes_to_exclude=None,                     # Specify node names to exclude from quantization. Example: nodes_to_exclude={'/Gather'}
    use_external_data_format=True              # Save the model in two parts.
)

model_size_bytes = sys.getsizeof(onnx.load(model_path).SerializeToString())
model_size_gb = model_size_bytes * 9.31322575e-10   # 1 / (1024 * 1024 * 1024)
if model_size_gb > 2.0:
    is_large_model = True
else:
    is_large_model = True if use_low_memory_mode_in_Android else False
# ONNX Model Optimizer
slim(
    model=quanted_model_path,
    output_model=quanted_model_path,
    no_shape_infer=False,          # True for more optimization but may cause errors.
    skip_fusion_patterns=False,
    no_constant_folding=False,
    save_as_external_data=is_large_model,
    verbose=False
)

if download_path == "NONE":
    num_heads = 0                  # default
    hidden_size = 0                # default
else:
    if ('vl' in download_path.lower()) & ('qwen' in download_path.lower()):
        if "2.5" in download_path or "3b" in download_path.lower():
            from transformers import Qwen2_5_VLForConditionalGeneration
            model = Qwen2_5_VLForConditionalGeneration.from_pretrained(download_path, torch_dtype=torch.float16, device_map='cpu', trust_remote_code=True, low_cpu_mem_usage=True).eval()
        else:
            from transformers import Qwen2VLForConditionalGeneration
            model = Qwen2VLForConditionalGeneration.from_pretrained(download_path, torch_dtype=torch.float16, device_map='cpu', trust_remote_code=True, low_cpu_mem_usage=True).eval()
    else:
        model = AutoModelForCausalLM.from_pretrained(download_path, torch_dtype=torch.float16, device_map='cpu', trust_remote_code=True, low_cpu_mem_usage=True).eval()
    num_heads = model.config.num_attention_heads
    hidden_size = model.config.hidden_size
    del model
    gc.collect()
# transformers.optimizer
model = optimize_model(
    quanted_model_path,
    use_gpu=use_gpu,
    opt_level=2,
    num_heads=num_heads,
    hidden_size=hidden_size,
    provider=provider,
    verbose=False,
    model_type='bert'
)
model.convert_float_to_float16(
    keep_io_types=True,
    force_fp16_initializers=True,
    use_symbolic_shape_infer=True,     # True for more optimization but may cause errors.
    op_block_list=['DynamicQuantizeLinear', 'DequantizeLinear', 'DynamicQuantizeMatMul', 'Range', 'MatMulIntegerToFloat']
)
model.save_model_to_file(quanted_model_path, use_external_data_format=is_large_model)
del model
gc.collect()
# onnxslim 2nd
slim(
    model=quanted_model_path,
    output_model=quanted_model_path,
    no_shape_infer=False,          # True for more optimization but may cause errors.
    skip_fusion_patterns=False,
    no_constant_folding=False,
    save_as_external_data=is_large_model,
    verbose=False
)
# Upgrade the Opset version. (optional process)
model = onnx.load(quanted_model_path)
model = onnx.version_converter.convert_version(model, 21)
onnx.save(model, quanted_model_path, save_as_external_data=is_large_model)

if is_large_model:
    pattern = os.path.join(quanted_folder_path, '*.data')
    files_to_delete = glob.glob(pattern)
    for file_path in files_to_delete:
        try:
            os.remove(file_path)
        except Exception as e:
            print(f"Error deleting {file_path}: {e}")
It is not recommended to convert an FP16 ONNX model to the ORT format because this process adds a Cast operation to convert the FP16 process back to FP32.
3. Exporting the MiniCPM Model and Running ONNX Inference
3.1 Download the Model
The MiniCPM-2B-dpo-fp16 model can be downloaded with the modelscope library:
pip install modelscope

from modelscope import snapshot_download
model_dir = snapshot_download('OpenBMB/MiniCPM-2B-dpo-fp16', cache_dir=".cache_dir")
3.2 Export the ONNX Model
Here the MiniCPM-2B-split export is used as the example.
First, open a command prompt and change into the directory
F:\Native-LLM-for-Android-main\Export_ONNX\MiniCPM\MiniCPM-2B-split
Then create two directories, model_a and model_b, to store the two ONNX models, and modify the export code as shown in the original post (a minimal sketch of the directory step follows).
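The directory-preparation step can be done like this (my own helper snippet, not from the project):

import os

for d in ("model_a", "model_b"):
    os.makedirs(d, exist_ok=True)   # output folders for the two exported ONNX parts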
Finally, run the following from the command line:
python .\MiniCPM_Export.py
This exports the model to ONNX and runs an inference test.
Here the inference speed of this code turns out to be only 0.375 token/s, which is extremely slow.
Exporting as a single model instead and running the same inference test gives roughly a 6x speedup, which suggests that data transfer between the split models also accounts for a large share of the time.
3.3 Running the ONNX Model Standalone
The ONNX model can be run with the code below, but it still cannot break free of the transformers library unless a tokenizer is hand-written to split the text and map tokens back to text.

import numpy as np
import onnxruntime
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

path = 'F:\DMTcache_dir\OpenBMB\MiniCPM-2B-dpo-fp16'   # Set the folder path where the whole MiniCPM project was downloaded.
onnx_model_A = r'F:\Native-LLM-for-Android-main\Export_ONNX\MiniCPM\MiniCPM-2B-single\model_q8_f16\MiniCPM_part_A_Optimized.onnx'   # Assign the path where the exported MiniCPM_part_A is stored.
# Run the exported model by ONNX Runtime
query = "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?"
max_seq_len = 1024               # Please modify the same variable, declared in the modified modeling_minicpm.py on line 1008, at the same time.
num_heads = 36
head_dim = 64
num_key_value_heads = 36
num_layers = 40
hidden_size = 2304
max_single_chat_length = 341     # An adjustable value, but it must be less than max_seq_len.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# ONNX Runtime settings
session_opts = onnxruntime.SessionOptions()
session_opts.log_severity_level = 3                    # Error level; an adjustable value.
session_opts.inter_op_num_threads = 0                  # Run different nodes with num_threads. Set 0 for auto.
session_opts.intra_op_num_threads = 0                  # Under each node, execute the operators with num_threads. Set 0 for auto.
session_opts.enable_cpu_mem_arena = True               # True for execution speed; False for less memory usage.
session_opts.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
session_opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
session_opts.add_session_config_entry("session.intra_op.allow_spinning", "1")
session_opts.add_session_config_entry("session.inter_op.allow_spinning", "1")

ort_session_A = onnxruntime.InferenceSession(onnx_model_A, sess_options=session_opts, providers=['CPUExecutionProvider'])
in_name_A = ort_session_A.get_inputs()
out_name_A = ort_session_A.get_outputs()
in_name_A0 = in_name_A[0].name
in_name_A1 = in_name_A[1].name
in_name_A2 = in_name_A[2].name
in_name_A3 = in_name_A[3].name
in_name_A4 = in_name_A[4].name
in_name_A5 = in_name_A[5].name
out_name_A0 = out_name_A[0].name
out_name_A1 = out_name_A[1].name
out_name_A2 = out_name_A[2].name
# Pre-process inputs
prompt = tokenizer.apply_chat_template([{"role": 'user', "content": query}], tokenize=False, add_generation_prompt=False)
token = tokenizer(prompt, return_tensors='pt')['input_ids']
ids_len = token.shape[1] + np.zeros(1, dtype=np.int64)
input_ids = np.zeros(max_seq_len, dtype=np.int32)
input_ids[:ids_len[0]] = token[0, :]
attention_mask = np.array([-65504.0], dtype=np.float32)
history_len = np.zeros(1, dtype=np.int64)
past_key_states_A = np.zeros((num_layers, num_key_value_heads, max_seq_len, head_dim), dtype=np.float16)
past_values_states_A = past_key_states_A
num_decode = 0
print('\nTest Question: ' + query + "\n\nMiniCPM Answering:\n")
# Start to run LLM
start_time = time.time()
while history_len < max_single_chat_length:
    token_id, past_key_states_A, past_values_states_A = ort_session_A.run(
        [out_name_A0, out_name_A1, out_name_A2],
        {in_name_A0: input_ids,
         in_name_A1: attention_mask,
         in_name_A2: past_key_states_A,
         in_name_A3: past_values_states_A,
         in_name_A4: history_len,
         in_name_A5: ids_len})
    if token_id == 2:  # the stop_id in MiniCPM is "2"
        break
    else:
        history_len[0] += ids_len[0]
        ids_len[0] = 1
        num_decode += 1
        attention_mask[0] = 0.0
        input_ids[0] = token_id
        print(tokenizer.decode(token_id), end="", flush=True)
end_time = time.time()
print(f"\n\nDecode: {(num_decode / (end_time - start_time)):.3f} token/s")