As of 2026-04-22 UTC, InternVL3.5 is useful less as a single leaderboard event than as a reminder that multimodal evaluation has started to split into several contracts. A model can look strong on a composite vision-language table while still leaving open questions about resolution policy, cluster placement, GUI action reliability, and whether a benchmark actually resembles the workload a builder plans to run.[1][2]

That distinction matters for AI-China tracking because InternVL is a Shanghai AI Laboratory and OpenGVLab family with public weights, project pages, GitHub release artifacts, and Hugging Face model cards spread across the open ecosystem.[1][3][4] It is also the kind of release that gets interpreted too quickly if the reader stops at the largest model name. The public InternVL3.5 page lists sizes from 1.06B to 240.70B-A28B, with smaller dense models, MoE variants, separate vision encoder sizes, and a dynamic-resolution path that changes how much visual evidence the model consumes.[1]

The eval lesson is straightforward: InternVL3.5 should be read through boundaries, not just ranks. The project claims a broad capability gain, but the engineering signal sits in how those gains are produced and where they might fail when moved into a real product.

Image context: the cover photograph comes from ITU Pictures' record of an AI for Good session held on 2024-07-04 at the Expo Center of Shanghai during WAIC. It is not a photograph of the InternVL team. It is used as contextual AI-China conference photography, matching the article's subject: public Chinese AI research moving from lab release to ecosystem interpretation.[6]

The score is not one object

The InternVL3.5 paper says the family advances versatility, reasoning, and efficiency, reporting a gain of up to +16.0% in overall reasoning performance and a 4.05x inference speedup compared with InternVL3.[2] Those numbers are meaningful, but they should not be flattened into "InternVL3.5 is better" without preserving the eval envelope. The paper ties the improvement to a coarse-to-fine training strategy, a Visual Resolution Router, and a Decoupled Vision-Language Deployment strategy, so the claimed progress is partly architectural and operational rather than only a matter of bigger weights.[2]

The project page makes the same boundary visible from another angle. InternVL3.5 uses dynamic resolution, with a maximum of 36 tiles of 448 x 448 during training and a maximum of 128 tiles during testing.[1] That is not a cosmetic detail. It means evaluation depends on when the system spends extra visual tokens, how it chooses resolution, and which tasks reward fine-grained inspection rather than broad scene understanding.
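A rough sketch of what those tile caps imply for the visual budget. The tile side and the two caps come from the project page; the `tokens_per_tile` value of 256 is an illustrative assumption, not a figure from the source, and should be replaced with whatever the deployed checkpoint actually emits.

```python
TILE_SIDE = 448  # pixels per tile side, as stated on the project page


def tile_budget(max_tiles: int, tokens_per_tile: int = 256) -> dict:
    """Upper bounds on pixels and visual tokens for a given tile cap.

    tokens_per_tile is a hypothetical placeholder used only for illustration.
    """
    return {
        "max_tiles": max_tiles,
        "max_pixels": max_tiles * TILE_SIDE * TILE_SIDE,
        "max_visual_tokens": max_tiles * tokens_per_tile,
    }


train = tile_budget(36)    # training-time cap
test = tile_budget(128)    # test-time cap

# The test-time ceiling is roughly 3.56x the training-time ceiling.
ratio = test["max_visual_tokens"] / train["max_visual_tokens"]
```

Whatever the real per-tile token count is, the ratio between the two caps is fixed at 128/36, so an eval run at the test-time ceiling can spend well over three times the visual budget the model saw per image during training.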

This is where benchmark reading gets dangerous. If a document-understanding task, chart task, GUI task, and image-captioning task all land inside one aggregate, the aggregate can hide the routing policy. A builder testing invoice extraction, visual QA over screenshots, or robot-facing scene inspection needs to know whether the model's gain comes from better language reasoning, more visual tiles, better OCR exposure, or a runtime policy that spends compute only on hard visual cases.

Visual routing is now part of the model contract

The Visual Resolution Router (ViR) is the most important operational idea in the release because it turns resolution from a fixed input assumption into a decision. The arXiv abstract describes ViR as dynamically adjusting visual-token resolution without compromising performance, while Decoupled Vision-Language Deployment (DvD) separates the vision encoder and language model across different GPUs to balance load.[2]

For evaluation, that makes the input policy part of the tested system. A static benchmark answer may say that the model solved a chart or document question. A production eval has to ask which resolution path it used, how often high-resolution tiles were selected, whether latency moved with image complexity, and whether the same routing rule holds across scanned forms, mobile screenshots, dense slides, and product images.
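One way to make those questions measurable is to log the routing decision alongside each answer and summarize by image type. The sketch below is a minimal harness-side record, assuming the serving stack can report which resolution path was taken; the `RoutedSample` fields and the image-kind labels are hypothetical names, not part of any InternVL API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RoutedSample:
    image_kind: str    # hypothetical label, e.g. "scanned_form" or "mobile_screenshot"
    high_res: bool     # whether the high-resolution path was used for this input
    latency_ms: float  # end-to-end latency observed by the harness
    correct: bool      # task-level correctness from the benchmark grader


def summarize_by_kind(samples):
    """Group samples by image kind and report routing rate, accuracy, and latency."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample.image_kind].append(sample)
    return {
        kind: {
            "n": len(rows),
            "high_res_rate": mean(1.0 if r.high_res else 0.0 for r in rows),
            "accuracy": mean(1.0 if r.correct else 0.0 for r in rows),
            "mean_latency_ms": mean(r.latency_ms for r in rows),
        }
        for kind, rows in groups.items()
    }
```

If the high-resolution rate or latency differs sharply between, say, scanned forms and product images while accuracy stays flat, the routing rule is doing different work on each input type, and that is worth reporting next to the score.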

The training data description reinforces this point. The project page says the continued pre-training corpora include multimodal data covering image captioning, general QA, mathematics, scientific domains, charts, OCR, knowledge grounding, document understanding, multi-turn dialogue, and medical data, plus a text-only component. It gives approximately 116 million pre-training samples corresponding to about 250 billion tokens, with a text-only to multimodal ratio of about 1:2.5.[1] The SFT stage is described as about 56 million samples and 130 billion tokens, with a text-only to multimodal ratio of roughly 1:3.5.[1]
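The arithmetic behind those figures is worth spelling out. The sketch below derives average tokens per sample and the text-only share from the numbers quoted above. It assumes each ratio can be read as parts of a whole; the text here does not say whether the ratio is counted in samples or in tokens, so the share is indicative only.

```python
def tokens_per_sample(total_tokens: float, total_samples: float) -> float:
    """Average tokens per training sample."""
    return total_tokens / total_samples


def text_only_share(text_parts: float, multimodal_parts: float) -> float:
    """Fraction of the mix that is text-only, given a text:multimodal ratio."""
    return text_parts / (text_parts + multimodal_parts)


# Continued pre-training: about 116M samples, about 250B tokens, ratio about 1:2.5
pretrain_tps = tokens_per_sample(250e9, 116e6)  # about 2,155 tokens per sample
pretrain_text = text_only_share(1, 2.5)         # about 0.286

# SFT: about 56M samples, about 130B tokens, ratio about 1:3.5
sft_tps = tokens_per_sample(130e9, 56e6)        # about 2,321 tokens per sample
sft_text = text_only_share(1, 3.5)              # about 0.222
```

On this reading, the text-only share falls from under a third in continued pre-training to under a quarter in SFT, while the average sample stays in the low thousands of tokens.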

Those figures matter because they tell evaluators which task types the model has likely seen in volume. Strong OCR and chart results can come from more than one source: training exposure, visual-encoder capability, resolution policy, language-model reasoning, or post-training preference tuning. A good eval design isolates those layers instead of treating the model card as a single capability object.

Split deployment changes the benchmark question

The DvD idea matters because it turns a model family into a cluster-topology question. If the vision encoder and language model can be placed across different GPUs, then the practical eval is no longer only "how good is the model?" It becomes "what accuracy, latency, memory, and scheduling behavior does this deployment shape produce under the workload we care about?"[2]
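A toy model shows why the deployment shape changes the number being measured. It is a generic two-stage pipeline approximation, not the paper's method, and it makes no attempt to explain the reported 4.05x figure. All timings are hypothetical, and the model ignores batching, queueing, and scheduling effects.

```python
def serial_seconds_per_request(vision_s: float, language_s: float) -> float:
    """Both stages share one device pool, so each request pays for both in turn."""
    return vision_s + language_s


def pipelined_seconds_per_request(
    vision_s: float, language_s: float, transfer_s: float = 0.0
) -> float:
    """Stages on separate devices overlap across requests.

    In steady state the slowest stage, including any feature-transfer cost,
    sets the time between completed requests.
    """
    return max(vision_s, language_s, transfer_s)


def throughput_speedup(
    vision_s: float, language_s: float, transfer_s: float = 0.0
) -> float:
    """Steady-state throughput gain of the split layout over the serial one."""
    return serial_seconds_per_request(vision_s, language_s) / (
        pipelined_seconds_per_request(vision_s, language_s, transfer_s)
    )


# Hypothetical timings: 0.30 s vision, 0.50 s language, 0.05 s transfer
example_speedup = throughput_speedup(0.30, 0.50, 0.05)  # about 1.6
```

Even in this simplified form, the gain depends entirely on how the two stage times compare for a given workload. A tile-heavy document batch and a short captioning batch would yield different speedups from the same layout.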

That is especially relevant for the largest InternVL3.5 branch. The project page lists InternVL3.5-241B-A28B at 240.70B-A28B, with a 5.54B vision component and 235.09B language component; the Hugging Face model page provides the public model artifact surface for that branch.[1][4] At that scale, the model is not something to run casually on a local machine. The benchmark reader needs to separate the capability ceiling from what can realistically be deployed.
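A back-of-envelope estimate makes the scale concrete. The sketch assumes weights stored at 2 bytes per parameter (a 16-bit format), which is an assumption rather than a statement about the released checkpoint. It counts weights only and excludes KV cache, activations, and runtime overhead.

```python
BYTES_PER_PARAM_16BIT = 2  # assumption: weights held in a 16-bit format


def weight_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Weight storage in decimal gigabytes (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


total_gb = weight_gb(240.70)     # about 481.4 GB for the full model
vision_gb = weight_gb(5.54)      # about 11.1 GB for the vision component
language_gb = weight_gb(235.09)  # about 470.2 GB for the language component

# The two listed components sum to 240.63B, leaving about 0.07B
# that the figures quoted above do not break down.
unlisted_billion = 240.70 - (5.54 + 235.09)
```

Under these assumptions the language component dominates the memory footprint by a wide margin, which is the practical reason a split between a small vision stage and a large language stage is an interesting deployment question at all.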

The smaller side of the family is just as important. A 1.06B or 2.35B model gives a different eval question: what does a compact multimodal model preserve when the workload is constrained by device, cost, or latency? The same release family therefore asks two different questions. The large branch asks how close an open system can push the upper bound. The small branch asks which inspection, OCR, GUI, or assistant tasks survive under a much tighter resource budget.[1]

This is why AI-China model analysis should avoid one-note ranking. InternVL3.5 is an ecosystem artifact, not only a score. Its GitHub repository connects the family to Hugging Face and ModelScope links, older InternVL branches, vision encoders, and a public release lineage that lets builders move across versions rather than treat each model as an isolated announcement.[3]

GUI and embodied agency need their own harness

The paper also states that InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency.[2] That line deserves a separate eval boundary. GUI and embodied tasks are not just harder image questions. They combine visual recognition, instruction following, state tracking, action selection, and error recovery.

A conventional multimodal benchmark may reward a correct answer about what appears on a screen. A GUI-agent benchmark has to evaluate whether the model chooses the correct control, avoids destructive actions, handles disabled or hidden states, and recovers when the interface changes. An embodied-agent benchmark adds the physical or simulated action layer: a wrong perception can become a wrong movement, not just a wrong sentence.
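A minimal sketch of how a GUI episode could be scored on those terms. The `GuiStep` fields and the pass rule are hypothetical choices for illustration, not taken from any published GUI benchmark, and the sketch does not cover recovery after interface changes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GuiStep:
    expected_control: str             # control the grader expects at this step
    chosen_control: str               # control the model actually acted on
    chosen_enabled: bool = True       # False if the chosen control was disabled or hidden
    chosen_destructive: bool = False  # True if the chosen action is irreversible


def score_episode(steps: Iterable[GuiStep]) -> dict:
    """Score one episode on control choice and on two safety counters."""
    steps = list(steps)
    if not steps:
        raise ValueError("episode has no steps")
    correct = sum(s.chosen_control == s.expected_control for s in steps)
    wrong_destructive = sum(
        s.chosen_destructive and s.chosen_control != s.expected_control
        for s in steps
    )
    disabled_hits = sum(not s.chosen_enabled for s in steps)
    return {
        "control_accuracy": correct / len(steps),
        "wrong_destructive_actions": wrong_destructive,
        "disabled_control_actions": disabled_hits,
        # Hypothetical pass rule: every step correct and no disabled-control actions.
        "episode_passed": correct == len(steps) and disabled_hits == 0,
    }
```

Keeping the safety counters separate from accuracy matters: an episode with 75% control accuracy and one wrong destructive action is a different kind of failure from one with 75% accuracy and only harmless misclicks.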

AIbase's release coverage frames InternVL3.5 as an open-source multimodal release from Shanghai AI Laboratory and highlights its role for researchers and developers.[5] That is a fair ecosystem-level read, but the builder-level read should be narrower. Open weights and strong scores make inspection possible. They do not by themselves prove that GUI or embodied agents are ready for production autonomy.

The practical eval harness should split at least four lanes: static image QA, document or chart inspection, GUI state-action tasks, and embodied or robotics-facing perception. If a single score mixes those lanes, a buyer can mistake visual intelligence for action reliability.
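The four-lane split can be sketched as a report that always returns per-lane accuracy next to the aggregate. The lane names are hypothetical labels for the four categories above, and the input is assumed to be a list of `(lane, correct)` pairs from whatever graders a team already runs.

```python
LANES = (
    "static_image_qa",
    "document_chart",
    "gui_state_action",
    "embodied_perception",
)


def lane_report(results):
    """Return per-lane accuracy and the pooled aggregate.

    results: iterable of (lane, correct) pairs, where lane is one of LANES.
    """
    totals = {lane: [0, 0] for lane in LANES}  # lane -> [hits, count]
    for lane, correct in results:
        if lane not in totals:
            raise ValueError(f"unknown lane: {lane}")
        totals[lane][0] += int(bool(correct))
        totals[lane][1] += 1
    per_lane = {
        lane: (hits / count if count else None)
        for lane, (hits, count) in totals.items()
    }
    all_hits = sum(hits for hits, _ in totals.values())
    all_count = sum(count for _, count in totals.values())
    return {
        "per_lane": per_lane,
        "aggregate": all_hits / all_count if all_count else None,
    }
```

With synthetic results of 9/10, 9/10, 4/10, and 10/10 across the four lanes, the aggregate is 0.80 while the GUI lane sits at 0.40. That gap is exactly what a single mixed score would hide.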

What builders should watch

For builders comparing Chinese multimodal model releases, InternVL3.5 suggests a better checklist than "largest model, highest rank." First, track the resolution policy: when does the system spend more visual tokens, and what latency does that create? Second, track the deployment policy: does the model run as a monolith, or does the vision-language split require a specific multi-GPU layout? Third, track the task harness: does the benchmark test a static answer, a document workflow, a GUI action, or an embodied loop? Fourth, track the artifact path: are the weights, docs, and serving recipes visible enough for another team to reproduce the claim?[1][2][3][4]

The strongest read is not that InternVL3.5 settles the Chinese multimodal race. It does something more useful: it exposes the places where multimodal claims need to be audited. Visual routing, split deployment, and GUI agency are not footnotes. They are the new eval boundaries.

Sources

  1. InternVL project page, "InternVL3.5" release notes and model table (model sizes, dynamic resolution, training-data descriptions, SFT details).
  2. Weiyun Wang et al., "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency," arXiv:2508.18265.
  3. OpenGVLab, "InternVL" GitHub repository (release history, model-family links, Hugging Face and ModelScope artifact links).
  4. Hugging Face, "OpenGVLab/InternVL3_5-241B-A28B" model card and public model artifact page.
  5. AIbase, "Shanghai AI Lab Releases the Multimodal Large Model Shuengwan InternVL3.5" (secondary release coverage).
  6. ITU Pictures on Flickr, "AI for Good Innovate for Impact" (2024 WAIC session photograph used as this article's cover image).