Driving UI automation on all platforms with vision-based models
We are preparing the v1.0 release. It is currently published under the npm `@beta` tag.
The v1.0 docs and code are on https://midscenejs.com/ and the `main` branch.
The v0.x docs and code are on https://v0.midscenejs.com/ and the `v0` branch.
The v1.0 changelog: https://midscenejs.com/changelog
| Instruction | Video |
|---|---|
| Use JS code to drive task orchestration, collect information about Jay Chou's concert, and write it into Google Docs (By UI-TARS model) | google-doc-1080p.mp4 |
| Control Maps App on Android (By Qwen-2.5-VL model) | control-maps-app-on-android.mp4 |
| Using midscene mcp to browse the page (https://www.saucedemo.com/), perform login, add products, place orders, and finally generate test cases based on mcp execution steps and playwright example | showcase-3-mcp.mp4 |
- Describe your goals and steps, and Midscene will plan and operate the user interface for you.
- Use the JavaScript SDK or YAML to write your automation script.
- Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
- Android Automation: Use the JavaScript SDK with adb to control your local Android device.
- iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
- Any Interface Automation: Use the JavaScript SDK to control your own interface.
- Three kinds of APIs:
  - Interaction API: interact with the user interface.
  - Data Extraction API: extract data from the user interface and DOM.
  - Utility API: utility functions like `aiAssert()`, `aiLocate()`, `aiWaitFor()`.
- MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools, so upper-layer agents can inspect and operate UIs with natural language.
- Caching for Efficiency: Replay your script with cache and get the result faster.
- Debugging Experience: Midscene.js offers a visual replay report, a built-in playground, and a Chrome Extension to simplify the debugging process. These are the tools most developers truly need.
- Chrome Extension: Start in-browser experience immediately through the Chrome Extension, without writing any code.
- Android Playground: There is also a built-in Android playground to control your local Android device.
- iOS Playground: There is also a built-in iOS playground to control your local iOS device.
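As a rough illustration of how the three kinds of APIs combine in one script, here is a minimal sketch. It is written against a small hypothetical `UiAgent` interface; the method names `aiAct`, `aiQuery`, and `aiAssert` are meant to mirror Midscene's agent methods, but the signatures are simplified assumptions and should be checked against the API reference. `RecordingAgent` is a stand-in that only records calls and returns canned data, so the snippet runs without a browser or a model.

```typescript
// Hypothetical minimal interface: method names mirror Midscene's agent API,
// but these simplified signatures are assumptions, not the real typings.
interface UiAgent {
  aiAct(instruction: string): Promise<void>; // interaction
  aiQuery<T>(dataDemand: string): Promise<T>; // data extraction
  aiAssert(assertion: string): Promise<void>; // utility
}

// Illustrative stub: records each call and returns canned data.
// A real agent would drive a page or device through a vision model.
class RecordingAgent implements UiAgent {
  readonly calls: string[] = [];

  async aiAct(instruction: string): Promise<void> {
    this.calls.push(`act: ${instruction}`);
  }

  async aiQuery<T>(dataDemand: string): Promise<T> {
    this.calls.push(`query: ${dataDemand}`);
    // Canned result for the sketch only.
    return ['Sauce Labs Backpack'] as unknown as T;
  }

  async aiAssert(assertion: string): Promise<void> {
    this.calls.push(`assert: ${assertion}`);
  }
}

// Describe the goal in natural language, extract data, then verify the state.
async function addFirstProductToCart(agent: UiAgent): Promise<string[]> {
  await agent.aiAct('log in with the demo account, then add the first product to the cart');
  const names = await agent.aiQuery<string[]>('string[], names of the products in the cart');
  await agent.aiAssert('the cart badge shows 1 item');
  return names;
}

addFirstProductToCart(new RecordingAgent()).then((names) => console.log(names));
```

In a real script the stub would be replaced by an agent created from one of the integrations listed above (Puppeteer, Playwright, Bridge Mode, Android, or iOS); the flow of act, query, and assert stays the same.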
Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.
- Pure-vision localization for UI actions; the DOM extraction mode is removed.
- Works across web, mobile, desktop, and even `<canvas>` surfaces.
- Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
- DOM can still be included for data extraction and page understanding when needed.
- Strong open-source options for self-hosting.
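To make "screenshot-only localization" concrete: a vision model typically reports where an element is as a box in screenshot pixels, and the action is then dispatched at a coordinate rather than through a DOM selector, which is why the same approach can work on `<canvas>` or native surfaces. The sketch below shows that conversion under stated assumptions. The `BoundingBox` and `Point` types and the `tapPointFromBox` helper are hypothetical and are not Midscene's internal data structures.

```typescript
// Hypothetical types for illustration only.
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Convert a box located in screenshot pixels into a tap point.
// Screenshots are often captured at a device pixel ratio above 1, so the
// center is scaled back to logical pixels before the tap is dispatched.
function tapPointFromBox(box: BoundingBox, devicePixelRatio = 1): Point {
  if (devicePixelRatio <= 0) {
    throw new RangeError('devicePixelRatio must be positive');
  }
  const centerX = box.left + box.width / 2;
  const centerY = box.top + box.height / 2;
  return {
    x: Math.round(centerX / devicePixelRatio),
    y: Math.round(centerY / devicePixelRatio),
  };
}

// A 120x40 box at (200, 100) on a 2x screenshot maps to a tap at (130, 60).
const point = tapPointFromBox({ left: 200, top: 100, width: 120, height: 40 }, 2);
console.log(point);
```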
Read more about Model Strategy
- Official Website: https://midscenejs.com
- Documentation: https://midscenejs.com
- Sample Projects: https://github.com/web-infra-dev/midscene-example
- API Reference: https://midscenejs.com/api
- GitHub: https://github.com/web-infra-dev/midscene
Community projects that extend Midscene.js capabilities:
- midscene-ios - iOS Mirror automation support for Midscene
- midscene-pc - PC operation device for Windows, macOS, and Linux
- midscene-pc-docker - Docker image with Midscene-PC server pre-installed
- Midscene-Python - Python SDK for Midscene automation
- midscene-java by @Master-Frank - Java SDK for Midscene automation
- midscene-java by @alstafeev - Java SDK for Midscene automation
We would like to thank the following projects:
- Rsbuild and Rslib for the build tool.
- UI-TARS for the open-source agent model UI-TARS.
- Qwen-VL for the open-source VL model Qwen-VL.
- scrcpy and yume-chan for allowing us to control Android devices from the browser.
- appium-adb for the JavaScript bridge to adb.
- appium-webdriveragent for driving XCTest from JavaScript.
- YADB for the yadb tool which improves the performance of text input.
- Puppeteer for browser automation and control.
- Playwright for browser automation, control, and testing.
If you use Midscene.js in your research or project, please cite:
@software{Midscene.js,
author = {Xiao Zhou and Tao Yu and YiBing Lin},
title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation & Testing.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
Midscene.js is MIT licensed.
