Shanzhi-M1


🌟 Model Overview

Shanzhi-M1 is a medical LLM alignment framework developed by Shanghai Mingpin Medical Data Technology Co., Ltd. Built on the Qwen3-32B base model, it addresses core pain points of existing medical LLMs (misalignment with clinical cognition, poor adaptation to dynamic standards, high reward training costs) via three innovative designs. The framework integrates authoritative medical standards into the full training pipeline, enabling medical AI to transition from "technically feasible" to "medically trustworthy."

Core Innovations

  1. 3D Medical Standard System ("Dimensions-Scenarios-Disciplines"): Embeds domain standards (e.g., accuracy, compliance, empathy) into a structured matrix to guide data generation, SFT, and RL, resolving the disconnect between static evaluation and dynamic clinical needs.
  2. Independent Multi-Dimensional Reward Model: Scores decomposed medical evaluation criteria independently (instead of collapsing them into a single scalar), replacing high-cost real-time rubric scoring with internalized rewards; this improves consistency and reduces expert labor by over 90%.
  3. Geometric Projection Reference Constraints: Translates medical cognitive logic (e.g., "a medium answer should score between a good and a poor one") into mathematical regularization, ensuring scoring gradients align with clinical reasoning and enabling training on large-scale synthetic data (see the sketch below).
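
A minimal sketch of how such an ordering constraint could be implemented, assuming a hinge-style penalty on the reward model's scalar scores; the function name, margin value, and exact formulation are illustrative assumptions, not the published regularizer:

import torch

def geometric_order_penalty(s_good, s_mid, s_poor, margin=0.1):
    """Hypothetical ordering regularizer: zero once the clinical ordering
    good > medium > poor holds (with a margin), positive otherwise."""
    upper = torch.relu(s_mid - s_good + margin)  # medium must not outscore good
    lower = torch.relu(s_poor - s_mid + margin)  # poor must not outscore medium
    return (upper + lower).mean()

Because the penalty is differentiable, it can be added directly to the reward model's training loss (see Technical Features below).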

Core Highlights

  • ๐Ÿ† Top-Performing Open-Source Medical Model: Achieves 62.7 on HealthBench (Full) and 44.7 on HealthBench (Hard) โ€” outperforming all open-source models and most closed-source counterparts (e.g., OpenAI O3, Gemini 2.5 Pro).
  • ๐Ÿฉบ Clinical Scenario Excellence: Leads in 5 core medical scenarios (Emergency Referrals: 74.3, Communication: 69.6, Context Awareness: 52.4, etc.).
  • ๐Ÿ’ฐ Cost-Efficient: Compresses expert annotation labor to <1/10 of traditional methods while maintaining clinical effectiveness.
  • ๐Ÿ”ง Standard-Extensible: Supports dynamic updates to multi-source medical guidelines (regional, disciplinary, scenario-specific).

📊 Performance Metrics

HealthBench

[Figures: HealthBench Full and Hard score comparisons against open- and closed-source models]

Scenario-Specific Performance

| Clinical Scenario | Score | Performance Note |
| --- | --- | --- |
| Emergency Referrals | 74.3 | Highest among all tested models; prioritizes risk timeliness |
| Medical Communication | 69.6 | Excels in patient adherence guidance |
| Context Seeking | 58.5 | Strong at proactive clinical information collection |
| Global Health | 59.2 | Adapts to diverse regional medical standards |
| Context Awareness | 52.4 | Maintains consistency across multi-turn clinical conversations |

🔧 Technical Features

1. 3D Medical Standard Matrix Construction

  • Variables:
    • L (Core Dimensions): e.g., Information Content Quality, Clinical Reasoning, Compliance.
    • M (Scenarios): e.g., Chronic Disease Management, Pediatric Consultation, Emergency Triage.
    • N (Disciplines): e.g., Internal Medicine, Surgery, Pharmacy, Nursing.
  • Matrix Role: Guides structured data generation (questions, multi-quality answers, rubrics) to ensure coverage of all clinical scenarios; a coverage-enumeration sketch follows.
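
As a toy illustration (using only the example entries named above, a small subset of the real taxonomy), full coverage can be checked by enumerating every (dimension, scenario, discipline) cell:

from itertools import product

# Illustrative subsets only; the production matrix is larger
DIMENSIONS = ["Information Content Quality", "Clinical Reasoning", "Compliance"]
SCENARIOS = ["Chronic Disease Management", "Pediatric Consultation", "Emergency Triage"]
DISCIPLINES = ["Internal Medicine", "Surgery", "Pharmacy", "Nursing"]

cells = list(product(DIMENSIONS, SCENARIOS, DISCIPLINES))
print(f"{len(cells)} cells to cover")  # 3 x 3 x 4 = 36 in this toy subset

# Each cell seeds one generation spec: a question, good/medium/poor
# answers, and a rubric tied to the cell's dimension
for dim, scen, disc in cells:
    spec = {"dimension": dim, "scenario": scen, "discipline": disc}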

2. Training Pipeline

  1. SFT Cold-Start: Fine-tunes the base model on "full-dimensional optimal samples" to instill medical logic, safety norms, and professional expression.
  2. Reward Model (RM) Training:
    • Input: a 5-tuple (q, a_b(i,q), a_r(i,q), a_w(i,q), Desc_i), i.e., the question, its good/medium/poor reference answers for dimension i, and that dimension's description.
    • Loss: Combines Bradley-Terry (BT) loss (pairwise quality comparison) with Geometric Constraint (GC) loss (keeps score order aligned with medical logic); a combined-loss sketch follows this list.
  3. Reinforcement Learning (RL): Uses GRPO with RM scores as rewards to optimize the SFT-initialized model, enabling continuous alignment with medical standards.
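
A minimal sketch of how the two RM loss terms might combine, assuming one scalar score per quality level and reusing the hinge-style ordering penalty sketched under Core Innovations; gc_weight and margin are illustrative hyperparameters, not published values:

import torch
import torch.nn.functional as F

def rm_loss(s_good, s_mid, s_poor, gc_weight=1.0, margin=0.1):
    """Hypothetical combined RM objective for one dimension."""
    # BT term: -log sigmoid(s_better - s_worse) over adjacent quality pairs
    bt = -(F.logsigmoid(s_good - s_mid) + F.logsigmoid(s_mid - s_poor)).mean()
    # GC term: hinge penalties that vanish once good > medium > poor holds
    gc = (torch.relu(s_mid - s_good + margin)
          + torch.relu(s_poor - s_mid + margin)).mean()
    return bt + gc_weight * gc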

3. Efficiency Optimization

  • Synthetic Data Support: Geometric constraints reduce reliance on scarce high-quality expert annotations.
  • Deployment Compatibility: Supports 4-bit quantization for low-resource environments (e.g., a single RTX 4090); see the loading sketch below.
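
A minimal 4-bit loading sketch via Transformers and bitsandbytes; the NF4 quantization settings are common community defaults, not configurations confirmed by this card:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed default, not card-specified
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mingpinDZJ/Shanzhi-M1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mingpinDZJ/Shanzhi-M1", trust_remote_code=True)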

โš™๏ธ Quick Start

1. Install Dependencies

pip install transformers torch vllm "sglang>=0.4.6.post1"

2. Load Model & Run Inference

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer from the Hugging Face Hub
model_name = "mingpinDZJ/Shanzhi-M1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto"
)

# Example: Medical query (pediatric medication safety)
prompt = "A 3-year-old child (pediatric scenario) is taking acetaminophen for fever. Can they also take compound cold medicine? Please explain the risks and recommendations."

# Format chat input (supports multi-turn conversations)
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Qwen3's flag for exposing the reasoning trace
)

# Generate response
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    temperature=0.1,  # Low temperature for clinical accuracy
    top_p=0.95
)

# Parse output (separate reasoning process from final answer)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
    # Locate the reasoning-end token (</think> in the Qwen3 chat template)
    think_end_id = tokenizer.convert_tokens_to_ids("</think>")
    reason_end_idx = len(output_ids) - output_ids[::-1].index(think_end_id)
    reasoning = tokenizer.decode(output_ids[:reason_end_idx], skip_special_tokens=True)
    answer = tokenizer.decode(output_ids[reason_end_idx:], skip_special_tokens=True)
except ValueError:
    reasoning = "No explicit reasoning logged."
    answer = tokenizer.decode(output_ids, skip_special_tokens=True)

# Print results
print(f"Clinical Reasoning: {reasoning}")
print(f"Final Answer: {answer}")

3. Efficient Deployment (SGLang/vLLM)

SGLang Server (Supports MTP Inference)

python -m sglang.launch_server \
--model-path mingpinDZJ/Shanzhi-M1 \
--reasoning-parser qwen3 \
--mem-fraction-static 0.9

vLLM Server (High Throughput)

vllm serve mingpinDZJ/Shanzhi-M1 \
--reasoning-parser qwen3 \
--tensor-parallel-size 1 \
--dtype bfloat16
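
Both servers expose an OpenAI-compatible endpoint, so a minimal client call looks like the sketch below (ports are the servers' defaults: 8000 for vLLM, 30000 for SGLang; the prompt is illustrative):

from openai import OpenAI

# Point at the local vLLM server; swap the port for SGLang
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mingpinDZJ/Shanzhi-M1",
    messages=[{"role": "user", "content": "What are red-flag symptoms in a febrile 3-year-old?"}],
    temperature=0.1,
    max_tokens=512,
)
print(resp.choices[0].message.content)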

โš ๏ธ Usage Notices

  1. Medical Disclaimer: This model is for research, medical education, and clinical decision support only; it cannot replace professional diagnosis, treatment, or medical advice.
  2. Intended Use Cases:
    • Medical student training (case simulation, knowledge verification).
    • Healthcare provider decision support (second opinion, guideline alignment).
    • Public health education (general health consultation).
  3. Safety Guidelines:
    • Always validate model outputs against authoritative medical guidelines (e.g., WHO, UpToDate).
    • Use under the supervision of licensed medical professionals in clinical settings.
  4. Data Integrity: Training data is fully independent of the HealthBench evaluation set to avoid overfitting.

📄 License

Licensed under the Apache License 2.0. Permitted for:

  • Non-commercial research and education.
  • Commercial use with proper attribution and compliance with medical regulations (e.g., HIPAA, GDPR for patient data).

๐Ÿค Acknowledgements

  • Base Model: Qwen3-32B (by Alibaba Cloud) for strong general language capabilities.
  • Evaluation Benchmark: HealthBench (Arora et al., 2025) for clinically grounded performance testing.
  • Open-Source Tools: Hugging Face Transformers, vLLM, SGLang for model deployment and inference.
  • Expert Contribution: Board-certified physicians for defining 3D medical standards and validating rubrics.

📞 Contact Us


**Bridging AI and Clinical Practice: Making Trustworthy Medical AI Accessible**