
Benchmarking Local LLMs on the M5 Max MacBook Pro

Testing what 128 GB of Apple Silicon can do: 14 local LLMs benchmarked in GGUF and MLX across four categories on an M5 Max MacBook Pro


Abstract

The M5 Max MacBook Pro with 128 GB of unified memory represents a new tier of consumer hardware for local LLM inference. This project presents an automated benchmarking pipeline that evaluated 14 models across four categories (GSM8K, HumanEval, MMLU, IFEval) in both GGUF and MLX formats where available, using two-layer scoring — deterministic automated validation combined with independent five-dimension rubric review. Across over 2,400 scored prompt runs at three temperatures, the top performers — Qwen3.5 122B MoE (1.00 rubric average), Nemotron Super ~120B and MiniMax M2.5 229B (MLX) (each 0.98) — demonstrated strong results from quantized local inference, while Qwen2.5 VL 7B delivered 0.90 accuracy at 69.8 tokens/second — 3x faster than comparably accurate models at a fraction of the parameter count.

Introduction

Consumer Apple Silicon hardware can now run large language models locally with meaningful throughput. The M5 Max MacBook Pro with 128 GB of unified memory supports models up to 229 billion parameters via quantized inference in LM Studio. This project set out to answer a practical question: across a wide range of open-weight models, what accuracy and throughput can you actually expect from local inference on this hardware?

To test this systematically, the project uses an automated pipeline that:

  1. Benchmarks local models across four established evaluation categories
  2. Applies a two-layer scoring methodology separating deterministic validation from qualitative assessment
  3. Stores all results in a normalized relational schema for longitudinal comparison
  4. Runs end-to-end without manual intervention via workflow orchestration

The pipeline evaluated 14 models — 11 tested in both GGUF and MLX formats — ranging from 4.8 GB (Qwen2.5 VL 7B, Q4_K_M GGUF) to 101 GB (MiniMax M2.5 229B, Q3_K_XL GGUF) across four size tiers.

| Model | GGUF Quant | GGUF GB | GGUF Source | MLX Bits | MLX GB | MLX Source |
|---|---|---|---|---|---|---|
| MiniMax M2.5 229B | Q3_K_XL | 101.0 | MiniMax-M2.5-GGUF | 3-bit | 100.1 | mlx-community |
| Nemotron Super ~120B | Q4_K_M | 85.0 | lmstudio-community | | | |
| Qwen3.5 122B MoE | Q4_K_S | 73.5 | unsloth | 4-bit | 69.6 | mlx-community |
| Llama 4 Scout 109B MoE | Q4_K_M | 61.3 | lmstudio-community | 4-bit | 61.1 | mlx-community |
| GPT-OSS 120B | MXFP4 | 58.5 | lmstudio-community | MXFP4 | 63.4 | mlx-community |
| Llama 3.3 70B | Q4_K_M | 39.7 | lmstudio-community | 4-bit | 39.7 | mlx-community |
| Qwen2.5 Coder 32B | | | | 4-bit | 18.3 | mlx-community |
| DeepSeek R1 32B | Q4_K_M | 18.0 | lmstudio-community | 4-bit | 18.4 | mlx-community |
| Mistral Small 24B | Q4_K_M | 13.5 | lmstudio-community | 4-bit | 14.1 | mlx-community |
| Qwen2.5 14B 1M | Q4_K_M | 8.3 | lmstudio-community | 8-bit | 15.7 | mlx-community |
| DeepSeek Coder V2 Lite | Q4_K_M | 8.8 | lmstudio-community | 4-bit | 8.8 | mlx-community |
| Phi-4 14B | Q4_K_M | 7.9 | lmstudio-community | 4-bit | 8.3 | mlx-community |
| Qwen2.5 Coder 7B | | | | 8-bit | 8.1 | mlx-community |
| Qwen2.5 VL 7B | Q4_K_M | 4.8 | lmstudio-community | 8-bit | 9.0 | mlx-community |

Methodology

Test Architecture

The pipeline is an n8n workflow running in Docker on an Unraid server, communicating with the M5 Max MacBook Pro over the local network via SSH and HTTP.

n8n Benchmark Runner workflow

The workflow iterates through each model sequentially. For each model, it connects to the M5 Max MacBook Pro over SSH to load the model via lms load <model_id> --context-length N, executes 20 prompts against LM Studio’s OpenAI-compatible API (/v1/chat/completions), applies Layer 1 automated scoring (deterministic pass/fail), persists results to PostgreSQL, posts progress to Mattermost, and unloads the model via lms unload --all before proceeding to the next. After all models complete, a separate Layer 2 rubric review is performed by Claude (via API), scoring each stored response across five qualitative dimensions.
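
The per-prompt inference call is a plain HTTP POST to LM Studio’s OpenAI-compatible endpoint. A minimal Python sketch of that step (the host address and model identifier are placeholders; in the actual pipeline this request is issued from an n8n Code node):

```python
import json
import urllib.request

# Placeholder address for the inference host on the local network.
LM_STUDIO_URL = "http://mac-m5.local:1234/v1/chat/completions"

def build_request(model_id: str, prompt: str, temperature: float) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def run_prompt(model_id: str, prompt: str, temperature: float = 0.0) -> str:
    """POST one prompt to LM Studio and return the response text."""
    payload = json.dumps(build_request(model_id, prompt, temperature)).encode()
    req = urllib.request.Request(
        LM_STUDIO_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```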

Benchmark Selection

Four benchmark categories were selected to evaluate distinct capabilities relevant to practical local LLM usage:

| Category | Source | Capability Tested | Prompts |
|---|---|---|---|
| GSM8K | Cobbe et al., 2021 | Multi-step mathematical reasoning | 5 |
| HumanEval | Chen et al., 2021 | Python function completion | 5 |
| MMLU | Hendrycks et al., 2021 | Broad factual knowledge (multiple choice) | 5 |
| IFEval | Zhou et al., 2023 | Constraint-based instruction following | 5 |

Each category contributes 5 prompts for a total of 20 prompts per model per temperature setting. Prompts were drawn from the original benchmark datasets, selecting problems that are solvable within a single inference call and verifiable through deterministic automated scoring.

Prompt Design

The exact prompts used in each category are listed below.

GSM8K — Math Reasoning (scoring: extract final number via regex, exact match)

| # | Prompt | Answer |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let’s think step by step. | 72 |
| 1 | Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Let’s think step by step. | 10 |
| 2 | Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet? Let’s think step by step. | 5 |
| 3 | Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read? Let’s think step by step. | 42 |
| 4 | James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? Let’s think step by step. | 624 |
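
The “extract final number via regex” scoring can be sketched as a small cascade. This is a minimal illustration, not the pipeline’s exact regex set: it prefers an explicit “answer is N” phrase, then a GSM8K-style “#### N” marker, then falls back to the last number in the response.

```python
import re

NUM = r"(-?\d[\d,]*\.?\d*)"

def extract_final_number(response: str):
    """Extract the final numeric answer from a model response."""
    for pat in (rf"answer is[^\d\-]*{NUM}", rf"####\s*{NUM}"):
        m = re.search(pat, response, re.IGNORECASE)
        if m:
            return float(m.group(1).replace(",", ""))
    # Fallback: take the last number anywhere in the text.
    nums = re.findall(NUM, response)
    return float(nums[-1].replace(",", "")) if nums else None

def score_gsm8k(response: str, expected: float) -> float:
    """Binary Layer 1 score: 1.00 on exact numeric match, else 0.00."""
    got = extract_final_number(response)
    return 1.00 if got is not None and got == expected else 0.00
```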

HumanEval — Python Code Completion (scoring: execute against assert tests, or LLM judge fallback)

Each prompt uses the template: “Complete this Python function:” followed by the function signature and docstring.

| # | Function | Task |
|---|---|---|
| 0 | has_close_elements(numbers: List[float], threshold: float) -> bool | Return True if any two numbers are closer than threshold |
| 1 | separate_paren_groups(paren_string: str) -> List[str] | Split balanced parenthesis groups into list |
| 2 | truncate_number(number: float) -> float | Return decimal part of positive float (3.5 → 0.5) |
| 3 | below_zero(operations: List[int]) -> bool | Return True if bank balance ever drops below zero |
| 4 | mean_absolute_deviation(numbers: List[float]) -> float | Calculate MAD around mean of input list |
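
The execution path of HumanEval scoring can be sketched as follows. This is a simplified illustration: the real pipeline also has an LLM-judge fallback (omitted here), and executing untrusted model output should be sandboxed in production, not run with a bare exec.

```python
def score_humaneval(completion: str, test_code: str) -> float:
    """Execute a model's completed function against assert-based tests.

    Returns 1.00 if every assert passes, 0.00 on any failure or error.
    WARNING: exec() of model output is unsandboxed; illustration only.
    """
    namespace = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # run the asserts against it
        return 1.00
    except Exception:
        return 0.00

# Example with prompt #2 (truncate_number):
good = "def truncate_number(number: float) -> float:\n    return number - int(number)"
score_humaneval(good, "assert truncate_number(3.5) == 0.5")  # 1.00
```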

MMLU — Factual Knowledge (scoring: extract letter A–D via regex, exact match)

Each prompt ends with “Answer with only the letter.”

| # | Question | A | B | C | D | Answer |
|---|---|---|---|---|---|---|
| 0 | Term for gamete production by meiosis? | Gametogenesis | Oogenesis | Spermatogenesis | Sporogenesis | A |
| 1 | Best description of mitochondria function? | Protein synthesis | ATP via cellular respiration | DNA replication | Lipid synthesis | B |
| 2 | Charge of a proton? | -1 | 0 | +1 | +2 | C |
| 3 | Philosopher of the categorical imperative? | John Locke | David Hume | Immanuel Kant | John Stuart Mill | C |
| 4 | What does GDP stand for in economics? | Gross Domestic Product | General Debt Payment | Government Defense Policy | Gross Development Percentage | A |
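
Letter extraction follows the same cascade idea as the numeric scorer. A minimal sketch (the pipeline’s actual patterns may differ): accept a bare letter first, then an “answer is X” phrase, then the first standalone A–D token.

```python
import re

def extract_letter(response: str):
    """Extract an A-D answer letter from a model response, or None."""
    text = response.strip()
    if re.fullmatch(r"[A-D]\.?", text):          # bare letter, e.g. "C" or "C."
        return text[0]
    m = re.search(r"answer is[:\s]*\(?([A-D])\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    m = re.search(r"\b([A-D])\b", text)          # first standalone A-D token
    return m.group(1) if m else None
```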

IFEval — Instruction Following (scoring: constraint-specific automated check)

| # | Prompt | Constraint |
|---|---|---|
| 0 | Write exactly 3 bullet points (each starting with “- ”) about the benefits of regular exercise. Use no more than 20 words per bullet. | Exactly 3 lines matching ^- |
| 1 | Respond ONLY in ALL CAPS. Describe what a neural network is in 2-3 sentences. | Entire response uppercase |
| 2 | Write a response containing “innovation” at least 3 times and “future” at least 2 times. Topic: technology trends. | Keyword frequency thresholds |
| 3 | Respond with a valid JSON object with exactly these keys: “name”, “age”, “city”. Use any reasonable values. | Valid JSON with required keys |
| 4 | Write between 50 and 60 words (inclusive) describing the water cycle. Count carefully. | Word count in [50, 60] |

The following table documents each pipeline file’s role:

| File | Stage | Purpose |
|---|---|---|
| prompts.md | Runner | 20 prompts, expected answers, scoring method |
| scoring-rubric.md | Review | 5-dimension rubric definitions (1–3 scale) |
| scoring-playbook.md | Review | Review procedure, SQL queries, validation |
| run-playbook.md | Runner | Pre-flight checklist, config, post-run checks |
| models.md | Runner | Model registry (tier, format, quant, RAM) |
| troubleshooting.md | Ops | Failure modes, root causes, fixes |

Full prompt templates are available in the project repository (link coming soon).

Scoring Framework

Each model response is scored twice through independent mechanisms.

Two-Layer Scoring Architecture

Layer 1 — Automated Validation. The benchmark runner scores each response immediately using deterministic, category-specific methods:

| Category | Validation Method | Pass Criteria |
|---|---|---|
| GSM8K | Numeric extraction (regex cascade) | Exact match to expected answer |
| HumanEval | Code execution + assert tests | All tests pass (fallback: LLM judge) |
| MMLU | Letter extraction (regex cascade) | Exact match to expected letter |
| IFEval | Constraint-specific checks (5 types) | All structural constraints satisfied |

All automated scores are binary: 1.00 (pass) or 0.00 (fail). Results are stored with category-specific score_detail JSON containing the extraction evidence.

Layer 2 — Independent Rubric Review. After all models have been evaluated and unloaded, a separate review process uses the Claude API to score every stored response across five qualitative dimensions on a 1–3 scale:

| Dimension | What It Measures | 3 = Excellent |
|---|---|---|
| Accuracy | Correctness of final answer | Fully correct |
| Reasoning | Quality of intermediate steps | Complete, logical, correct |
| Adherence | Following prompt instructions | All instructions satisfied |
| Conciseness | Economy of response | No padding or repetition |
| Confidence | Clarity of answer delivery | Direct, no hedging |

Scores of 1 (poor) and 2 (adequate) are defined in the scoring rubric. Maximum: 15 points per prompt (5 dimensions x 3 points). Scores are aggregated per model as a normalized average (0.00–1.00) for cross-model comparison.
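
The normalization step is simple arithmetic. A sketch, assuming a plain points-over-maximum mapping (the project may aggregate slightly differently, e.g. averaging per-prompt normalized scores):

```python
def normalize_rubric(dimension_scores: list[int]) -> float:
    """Map five 1-3 dimension scores onto a 0.00-1.00 scale.

    A perfect response (3 on all five dimensions, 15/15 points)
    normalizes to 1.00; e.g. 14/15 points normalizes to 0.93.
    """
    assert len(dimension_scores) == 5
    assert all(1 <= s <= 3 for s in dimension_scores)
    return round(sum(dimension_scores) / 15, 2)
```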

Temperature Control

Three temperature settings were tested across separate workflow executions:

  • T=0.0 — Deterministic output for reproducibility baseline
  • T=0.3 — Low variance for practical use-case simulation
  • T=0.7 — Higher variance to assess robustness and creativity

Temperature is configured as a single parameter in the workflow’s config node and applied uniformly to all 20 prompts within a run. Each temperature setting produces an independent set of results, enabling per-temperature analysis.

Data Architecture

All results are persisted in PostgreSQL across five normalized tables:

| Table | One Row = | Purpose |
|---|---|---|
| benchmark_runs | One workflow execution | Run metadata, timestamps, model count |
| model_benchmark_results | One model in one run | Per-category scores, overall score, throughput |
| prompt_results | One prompt response | Raw response, automated score, score detail (JSONB) |
| response_scores | One rubric evaluation | Five dimension scores, total, reviewer notes |
| models | One model in catalog | Size, format, quantization, tier, RAM requirement |

Database ER diagram — 5 normalized tables with foreign key relationships

The prompt_results table stores the model’s raw response alongside the automated score and category-specific score_detail JSON. The response_scores table stores the independent rubric review. The thinking_text column in prompt_results captures extracted <think>...</think> blocks from reasoning models (11 of 14 models emit chain-of-thought), stored separately from the response text.
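
Separating chain-of-thought from the visible answer is a matter of splitting on the think tags. A minimal sketch of that extraction (the pipeline’s handling of malformed or unclosed tags may differ):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate <think>...</think> blocks from the visible response text."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    response = THINK_RE.sub("", raw).strip()
    return thinking, response
```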

Hardware Configuration

Inference host: M5 Max MacBook Pro, Apple M5 Max chip, 128 GB unified memory. Models served via LM Studio’s OpenAI-compatible API on the local network.

Orchestration host: Unraid server running n8n v2.37.4 (Docker), PostgreSQL 16, and Mattermost (notifications).

Network: SSH over the local network for model load/unload commands. HTTP for inference API calls. SSH key authentication with the key mounted in the n8n container.

Results

Model Performance Tradeoffs

The pipeline evaluated 14 models across three temperatures — 11 in both GGUF and MLX formats — producing over 2,400 scored prompt responses. Results are presented below by inference format.

Overall Performance

The top performers were Qwen3.5 122B MoE, which achieved a perfect 1.00 rubric average — scoring 15.0/15 across all four categories — followed by Nemotron Super ~120B and MiniMax M2.5 229B (MLX), each at 0.98.

The efficiency standout was Qwen2.5 VL 7B (GGUF Q4_K_M), which achieved 0.90 rubric accuracy at 69.8 tokens/second with a 4.8 GB model file — 3x faster than Mistral Small 24B at the same accuracy level, using one-third the disk space.

GGUF Models (sorted by rubric average):

| Model | GB | Quant | tok/s | Avg | Best |
|---|---|---|---|---|---|
| Qwen3.5 122B MoE | 73.5 | Q4_K_S | 31.7 | 1.00 | All |
| Nemotron Super ~120B | 85.0 | Q4_K_M | 27.6 | 0.98 | GSM8K |
| MiniMax M2.5 229B | 101.0 | Q3_K_XL | 38.3 | 0.98 | MMLU |
| GPT-OSS 120B | 58.5 | MXFP4 | 65.9 | 0.95 | IFEval |
| Llama 4 Scout 109B MoE | 61.3 | Q4_K_M | 24.3 | 0.93 | GSM8K |
| Qwen2.5 VL 7B | 4.8 | Q4_K_M | 69.8 | 0.90 | HumanEval |
| Mistral Small 24B | 13.5 | Q4_K_M | 22.4 | 0.90 | GSM8K |
| DeepSeek R1 32B | 18.0 | Q4_K_M | 15.8 | 0.88 | GSM8K |
| Qwen2.5 14B 1M | 8.3 | Q4_K_M | 36.5 | 0.88 | GSM8K |
| Phi-4 14B | 7.9 | Q4_K_M | 35.7 | 0.88 | GSM8K |
| Llama 3.3 70B | 39.7 | Q4_K_M | 7.5 | 0.86 | GSM8K |
| DeepSeek Coder V2 Lite | 8.8 | Q4_K_M | 127.1 | 0.82 | GSM8K |

MLX Models (sorted by rubric average):

| Model | GB | Bits | tok/s | Avg | Best |
|---|---|---|---|---|---|
| Qwen3.5 122B MoE | 69.6 | 4-bit | 43.7 | 1.00 | All |
| MiniMax M2.5 229B | 100.1 | 3-bit | 45.7 | 0.98 | GSM8K |
| GPT-OSS 120B | 63.4 | MXFP4 | 63.6 | 0.98 | GSM8K |
| DeepSeek R1 32B | 18.4 | 4-bit | 12.5 | 0.91 | HumanEval |
| Qwen2.5 Coder 32B | 18.3 | 4-bit | 19.4 | 0.91 | GSM8K |
| Qwen2.5 14B 1M | 15.7 | 8-bit | 26.6 | 0.91 | MMLU |
| Qwen2.5 VL 7B | 9.0 | 8-bit | 38.1 | 0.91 | HumanEval |
| Llama 3.3 70B | 39.7 | 4-bit | 9.0 | 0.90 | MMLU |
| Llama 4 Scout 109B MoE | 61.1 | 4-bit | 21.6 | 0.90 | GSM8K |
| Mistral Small 24B | 14.1 | 4-bit | 28.0 | 0.88 | GSM8K |
| Phi-4 14B | 8.3 | 4-bit | 43.1 | 0.87 | GSM8K |
| DeepSeek Coder V2 Lite | 8.8 | 4-bit | 144.0 | 0.85 | HumanEval |
| Qwen2.5 Coder 7B | 8.1 | 8-bit | 54.0 | 0.83 | GSM8K |

Per-Category Leaders

| Category | Leader | Score | Runner-up | Score |
|---|---|---|---|---|
| GSM8K | Qwen3.5 122B MoE (GGUF) | 1.00 | Llama 4 Scout, Nemotron, MiniMax (MLX) | 1.00 |
| HumanEval | Qwen3.5 122B MoE (GGUF) | 1.00 | Nemotron Super ~120B | 0.99 |
| MMLU | Qwen3.5 MoE, Nemotron, MiniMax, GPT-OSS | 1.00 | Qwen2.5 14B 1M, VL 7B (MLX) | 0.93 |
| IFEval | Qwen3.5 122B MoE (MLX) | 1.00 | GPT-OSS 120B (both) | 0.99 |

Speed-Accuracy Tradeoff

Model performance spans from 4.8 GB / 69.8 tokens/second (Qwen2.5 VL 7B, 0.90 rubric average) to 101 GB / 38.3 tokens/second (MiniMax M2.5 229B). The relationship between model size and accuracy is not linear — Qwen2.5 Coder 32B (18.3 GB, 0.91 accuracy) outperformed Llama 3.3 70B (39.7 GB, 0.86 accuracy) while running 2.6x faster.

Format Comparison: GGUF vs MLX

Eleven models were tested in both GGUF and MLX quantizations. MLX variants achieved higher throughput in most cases, though the advantage varied by model and was not universal.

| Model | GGUF tok/s | MLX tok/s | Speedup | Accuracy (GGUF → MLX) |
|---|---|---|---|---|
| Qwen3.5 122B MoE | 31.7 | 43.7 | +38% | 1.00 → 1.00 |
| MiniMax M2.5 229B | 38.3 | 45.7 | +19% | 0.98 → 0.98 |
| DeepSeek Coder V2 Lite | 127.1 | 144.0 | +13% | 0.82 → 0.85 |
| Mistral Small 24B | 22.4 | 28.0 | +25% | 0.90 → 0.88 |
| Phi-4 14B | 35.7 | 43.1 | +21% | 0.88 → 0.87 |
| Llama 3.3 70B | 7.5 | 9.0 | +20% | 0.86 → 0.90 |
| GPT-OSS 120B | 65.9 | 63.6 | -3% | 0.95 → 0.98 |
| Qwen2.5 VL 7B | 69.8 | 38.1 | -45% | 0.90 → 0.91 |
| Qwen2.5 14B 1M | 36.5 | 26.6 | -27% | 0.88 → 0.91 |
| DeepSeek R1 32B | 15.8 | 12.5 | -21% | 0.88 → 0.91 |
| Llama 4 Scout 109B MoE | 24.3 | 21.6 | -11% | 0.93 → 0.90 |

Notable: MLX was faster for 6 of 11 models (up to +38% for Qwen3.5 122B MoE). However, GGUF Q4_K_M outperformed MLX 4-bit/8-bit for Qwen2.5 VL 7B (-45%), Qwen2.5 14B 1M (-27%), DeepSeek R1 32B (-21%), and Llama 4 Scout (-11%). The throughput advantage appears format- and architecture-dependent rather than universal.

GGUF vs MLX throughput comparison — same models, side by side

Discussion

Two-Layer Scoring

The two-layer scoring methodology proved essential. Automated validation alone would have rated several models higher than their actual reasoning quality warranted. The rubric review caught cases where a model produced a correct final answer through flawed intermediate reasoning — the automated layer scored these as passing, but the rubric review penalized reasoning quality. This separation provides a more accurate picture of model capability than either layer alone.

Orchestration

n8n proved surprisingly capable as an ML evaluation orchestration engine. JavaScript Code nodes handled HTTP calls, SSH commands, regex scoring, and SQL construction within a single workflow. The primary limitation was debugging complexity — a 14-node workflow with SSH connections, HTTP calls, and PostgreSQL writes produces failure modes that are difficult to trace through n8n’s visual interface. The most fragile component was SSH-based model loading. Timing issues, stale connections, and LM Studio CLI hangs required defensive error handling (connectivity probes via nc -z -w 3, explicit unload-before-load sequences) that accounted for more development time than any other pipeline component.
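
The connectivity probe mentioned above (nc -z -w 3 in the actual workflow) amounts to attempting a TCP connection with a short timeout before issuing any SSH or API calls. A Python equivalent of that defensive check, for illustration:

```python
import socket

def host_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """TCP connectivity probe, equivalent in spirit to `nc -z -w 3 host port`:
    succeed only if a connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```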

Data Persistence

Storing results in PostgreSQL rather than flat files transformed the project from a one-off experiment into a reusable evaluation system. Ad-hoc queries like “fastest model above 80% accuracy” or “best performer under 10 GB” are trivial SQL. The normalized schema (runs → model results → prompt results → rubric scores) supports longitudinal comparison as new models are released without any schema changes.
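
Queries of that kind can be sketched against a simplified, denormalized slice of the schema. This demo uses in-memory SQLite for portability (the project runs PostgreSQL 16, and these column names are illustrative, not the actual schema); the sample rows are taken from the GGUF results table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_benchmark_results (
        model TEXT, size_gb REAL, tokens_per_sec REAL, rubric_avg REAL
    )
""")
conn.executemany(
    "INSERT INTO model_benchmark_results VALUES (?, ?, ?, ?)",
    [
        ("Qwen2.5 VL 7B", 4.8, 69.8, 0.90),
        ("Mistral Small 24B", 13.5, 22.4, 0.90),
        ("Llama 3.3 70B", 39.7, 7.5, 0.86),
    ],
)

# "Fastest model above 80% accuracy"
row = conn.execute("""
    SELECT model FROM model_benchmark_results
    WHERE rubric_avg > 0.80
    ORDER BY tokens_per_sec DESC
    LIMIT 1
""").fetchone()
# row[0] -> "Qwen2.5 VL 7B"
```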

Limitations

This evaluation has several constraints. All inference ran on a single machine (M5 Max MacBook Pro, 128 GB unified memory), capping loadable model size at approximately 108 GB of unified memory. Quantized models (3-bit to 8-bit) were tested rather than full-precision weights, which may affect accuracy relative to published benchmarks. The prompt set (20 per model) is small compared to full benchmark suites (GSM8K contains 8,792 problems; HumanEval contains 164). Results reflect local inference characteristics and are not directly comparable to cloud-hosted evaluations.

Future Work

Planned extensions include automated re-runs triggered by new model releases (via n8n webhook + LM Studio model registry polling), expansion of the prompt set to 50+ per category, addition of multi-turn conversation benchmarks, and a web dashboard for interactive result exploration built on the existing PostgreSQL schema.

References

  1. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168

  2. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

  3. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2009.03300

  4. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., & Hou, L. (2023). Instruction-Following Evaluation for Large Language Models. arXiv preprint arXiv:2311.07911. https://arxiv.org/abs/2311.07911

  5. LM Studio. Local LLM inference engine with OpenAI-compatible API. https://lmstudio.ai

  6. n8n. Workflow automation platform. https://n8n.io