Two weeks back (28–29.10.2024), my roommate Mo and I participated in a hackathon organized by Factory:Berlin and Tech:Berlin. It is an interesting story: Mo first dropped out of the hackathon, then came up with an idea, and then we both got so hyped during the hackathon that we made it to the finals. Our idea was to develop an app, not only for developers but for everyone, where the user describes the problem they are facing with their computer, e.g. "close the vim editor" or "add animations to my slides", and then GPT, given access to the keyboard ⌨️ and mouse 🖱️, solves the user's problem. This blog post is about our hackathon journey; it also describes how we approached the problem and shares what we learned!

We signed up for the hackathon without really knowing what we were going to develop; we were just trying to come up with ideas! Mo shared the story that his father always calls him to solve issues he is having with the computer, or simply does not know how to use certain applications, and we decided to develop an app to solve exactly this problem.

🛠️ Preparation for Hackathon

From past hackathon experiences, I learned that you need
  1. A concrete idea of what you want to develop
  2. Reliable teammates
  3. Homework and research around your idea, so that during the hackathon you can focus only on the implementation
Luckily, this time we checked all three boxes! We agreed that we did not need a third teammate, and we started our homework by Googling available solutions.
Two solutions stood out from the many we looked into:

  1. https://github.com/OpenInterpreter/open-interpreter
  2. https://github.com/AmberSahdev/Open-Interface
However, neither solution aligned with our vision.
Open-interpreter does not use vision LLMs by default (although they are now experimenting with it {ref}) and outputs system code that is difficult even for developers to understand, let alone a non-technical user! Also, trying to operate a computer without looking at the screen does not sound like an ideal approach.
Open-Interface is the closest we found to our idea: it uses vision to look at the screen, but it only uses keyboard shortcuts to perform actions {ref}, and we will explain later why relying only on keyboard shortcuts is not an optimal strategy!

Now that we were clear there was no existing solution for exactly what we wanted to develop, the next question was whether it is even possible with the current state of artificial intelligence research and development. One important distinction: if you build this system only for web browsers, it is somewhat easier because you can access the DOM, but native applications have nothing comparable. There are similar efforts, for example Pywinauto for Windows and LDTP for Linux, but we could not find anything robust for macOS.

We decided to look into academic papers, and it turned out to be an emerging field of research; we ended up skimming papers that had only been published recently, some even the previous month (e.g. SeeClick). Now we had clarity on why and what we wanted to develop and the challenges we were going to face!

🤔 Why, what, and how

We wanted to develop something
  1. That is not RAG
  2. That is easy to use and should not even require LLM API keys for the end user
  3. Whose client is lightweight and cross-platform (this turned out to be a small challenge)
  4. That we could finish in the two days of the hackathon
We decided to use GPT-4o from OpenAI. The vision capabilities of GPT models are only about a year mature. We learned that GPT is amazing at describing what is in an image, but if you ask it to locate where something is (spatial reasoning), it starts showing its weaknesses. We knew about this problem from the beginning, as it is clearly mentioned on the OpenAI website.

Solution engineering

The day of the hackathon arrived. We split the work: Mo would work on the user's client and the server, and I would start working on our AI backend.

High-level view of our solution's architecture 
From now on I will only focus on how our AI works! If you want to read more about how the client and server work, you should definitely check out the detailed blog post from Mo on his blog. We decided to focus on only one use case: "add animations to a slide with bullet points in Microsoft PowerPoint".

A Microsoft PowerPoint 2019 presentation with a slide containing bullet points

Let's start with the pseudo-code of how we decided to approach our solution.

Pseudo code:

  1. Process the user input and figure out which application can be used, by looking at the installed applications
  2. Take a screenshot of the whole screen and check whether that application is open
  3. If the action can be performed in more than one open application or window, ask the user where to perform it!
  4. Bring the application into focus (this is important, otherwise the click event will be bounced)
  5. Come up with a plan that lists all the actions necessary to solve the user's problem
  6. Execute each action using a click or a keyboard shortcut, then take a screenshot
  7. Analyse whether the previous action yielded the desired result, then move to step 6 for the next action
  8. Stop once the user's problem has been solved
We started coding our prototype in Python on macOS, knowing our client was going to be cross-platform and developed using Electron. We assumed everything we needed from the client would be supported by the Electron application. We used Jinja templates for our prompts.

Our system prompt is quite simple; the only interesting thing about it is that it lists all the applications installed on the user's computer. This allows our system to understand which apps can be used to perform the user's action and also restricts LLM hallucination so it does not come up with some random application name.

You are now the backend for a program that is controlling user's computer.
User needs your help to perform certain task and user input will be like this:

- Help me to close vim text editor
- Help me force close my web browser
- bold all the headings of my document

You are supposed to navigate to the correct application and execute these steps on user behalf and help user to
achieve desired goals! You are supposed to return keyboard shortcuts to help execute user tasks.

Please make sure your keys are valid for user's operating system! For example:
- windows key only works for Windows and Linux and does not work for MacOS
- Alt key only works for Windows and Linux and does not work for MacOS
- Command key only works for MacOS and does not work for Windows and Linux
- Option key only works for MacOS and does not work for Windows and Linux

User has following application installed:

{{installed_apps}}
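
As a rough illustration, here is a minimal sketch of how the installed applications can be collected on macOS and filled into a Jinja template like the one above (the template file name and prompt folder are illustrative, not our real layout):

# A minimal sketch of collecting installed applications on macOS and rendering the
# system prompt; the template file name and folder are illustrative assumptions.
import os

from jinja2 import Environment, FileSystemLoader

def list_installed_apps() -> list[str]:
    """Return the names of the apps found in the standard /Applications folder."""
    return sorted(
        name.removesuffix(".app")
        for name in os.listdir("/Applications")
        if name.endswith(".app")
    )

env = Environment(loader=FileSystemLoader("prompts"))
system_prompt = env.get_template("system_prompt.j2").render(
    installed_apps="\n".join(f"- {app}" for app in list_installed_apps())
)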

So the first step is to figure out whether the user's request can be solved with an installed application, and if more than one installed application could solve the problem, to ask the user which app should be used. We have a separate prompt for this, and we used Cmd+Space (Spotlight) plus the application name to bring the application into focus.
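
Bringing an app into focus via Spotlight can look roughly like this; a minimal sketch, assuming pyautogui for synthetic keyboard input (a common choice, not necessarily exactly what our client does):

# A minimal sketch of bringing an application into focus on macOS via Spotlight,
# assuming pyautogui for synthetic keyboard input.
import time

import pyautogui

def bring_to_focus(app_name: str) -> None:
    """Open Spotlight (Cmd+Space), type the application name, and press Enter."""
    pyautogui.hotkey("command", "space")
    time.sleep(0.5)                        # give Spotlight a moment to appear
    pyautogui.write(app_name, interval=0.05)
    time.sleep(0.5)
    pyautogui.press("enter")

bring_to_focus("Microsoft PowerPoint")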

Once the application is in focus, we are ready to come up with a plan. However, instead of focusing on asking the LLM to come up with a plan, we were more worried about how clicking would work! So at first, we provided a plan ourselves.

plan = [
    "Given the screen shot, User want to click on Animation tab.",
    "Given the screen shot, User want to click on the text box containing Bullet points.",
    "Given the screen shot, User want to click Appear button present in the Animation tab.",
]

You can see we started with a simple, not very descriptive plan. We wanted to see if the LLM could figure things out without much guidance from the prompt. We put each step of the plan into the following prompt:

Plan step:
{{step}}.

Given the above step of a plan you must estimate the x and y coordinates of the point to click on. The user's screen resolution is {{screen_resolution}}.
Please return your corrected response in valid json format that I can put in json.loads() without an error - this is extremely important. Do not add any leading or trailing characters.
Expected LLM Response
{
    "x": "..",
    "y": "..",
    "reason": ".."
}
"x": x coordinate, must be int
"y": y coordinate, must be int
"reason": please explain why you came up with these values of x and y

We prepared this prompt for each step and sent it to the LLM along with a screenshot of the whole screen. However, we started noticing that even GPT-4o simply could not do it, so we then gave it a shot and asked the LLM to generate keyboard shortcuts instead.

Regarding keyboard shortcuts, what we learned is that it is very hard for LLMs to figure them out, as the same application can have different shortcuts for the same action across different versions. LLMs hallucinate a lot when spitting out keyboard shortcuts: they consistently generate made-up shortcuts, or cannot distinguish shortcuts for different operating systems (Cmd is for Mac, Ctrl is for Windows). Also, many apps simply do not define a keyboard shortcut for every action! So we had to have the capability to use clicks, assuming the LLM can look at the screen and figure out where to click.
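
Executing a shortcut the LLM does return is the easy part; the hard part is that the suggested keys are so often wrong. A minimal sketch, again assuming pyautogui:

# A minimal sketch of executing an LLM-suggested shortcut, e.g. ["command", "s"],
# assuming pyautogui; key names must be pyautogui's lower-case names ("command", "ctrl", ...).
import pyautogui

def execute_shortcut(keys: list[str]) -> None:
    """Press the given keys together as one shortcut."""
    pyautogui.hotkey(*keys)

execute_shortcut(["command", "s"])  # e.g. save in the focused macOS application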

All of this was pretty daunting, and we decided to focus on figuring out clicking! However, since we had experienced that GPT-4o was not working at all when asked to spit out coordinates, we tweaked our step prompts to provide more details. Sadly, that did not work either!

Asking the LLM to generate coordinates for clicking

Our first experiment was to update our steps and add more details to guide the LLM.

plan = [
    "Given the screen shot, User want click on Animation tab which is between Transitions and Slide Show tabs.",
    "Given the screen shot, User want to click on the text box containing Bullet points. Please ignore side menu and click on the text.",
    "Given the screen shot, User want to click Appear button present in the Animation tab. This button is big green star, left to the Preview button.",
]

This did not help either.

One thing we noticed is that with small screens the coordinates are not off by big margins, but when you switch to a big screen it becomes horrible. Perhaps the LLM downscales screenshots and in the process loses valuable hints, e.g. button text, and that could be the reason.

We were super exhausted at this point and decided to get help from research papers.
Our first candidate was the SeeClick paper. It is a very extensive piece of work: they came up with their own dataset to ground the LLM in the screen, a benchmark to compare and evaluate against, and everything is completely open source. They basically fine-tuned Qwen-VL on a dataset they created from various screens, including mobile phones.
Mo deployed their model on Runpod; however, for some reason the results were again not satisfactory.
 
We switched to the next research paper, UFO. This paper has not been presented at a conference yet; it appeared as a preprint on arXiv in May 2024. It goes exactly in the direction of what we want to achieve, although at the time of writing this blog post they are only focusing on Windows OS.

The most exciting thing for us was that they had recently announced that UFO can now click anywhere on the screen. This made us curious, so we read their code and figured out that instead of asking for the exact coordinates to click, they ask the LLM to return estimated relative fractional x and y coordinates of the point to click on, ranging from 0.0 to 1.0 {ref}. This turned out to be a game-changer! To make this work, they were not sending full screenshots; instead, they were sending only a screenshot of the application window.

Sending only the application's screenshot has the benefit that the image is smaller, and there is also less noise to confuse the LLM.
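
Cropping the screenshot to the application window is simple once you know the window's bounds, which have to come from the OS (e.g. the Quartz APIs on macOS, not shown here); a minimal sketch with pyautogui, where the bounds are illustrative:

# A minimal sketch of capturing only the focused application's window.
# The (left, top, width, height) bounds are assumed to come from the OS.
import pyautogui

def capture_window(left: int = 0, top: int = 25,
                   width: int = 1440, height: int = 875,
                   path: str = "window.png") -> str:
    """Save a screenshot of just the application window and return its path."""
    pyautogui.screenshot(region=(left, top, width, height)).save(path)
    return path

capture_window()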

Once you get the fractional x and y coordinates, you only have to do basic maths to convert them into actual coordinates, and voilà, you have a vision-LLM agent that can click. Almost.

# Code excerpt from UFO: https://github.com/microsoft/UFO/blob/dd46a6acaf76716a68c7d3e792c935ece6778083/ufo/automator/ui_control/controller.py#L277

def transform_point(self, fraction_x: float, fraction_y: float) -> Tuple[int, int]:
    """
    Transform the relative coordinates to the absolute coordinates.
    :param fraction_x: The relative x coordinate.
    :param fraction_y: The relative y coordinate.
    :return: The absolute coordinates.
    """
    application_rect: RECT = self.application.rectangle()
    application_x = application_rect.left
    application_y = application_rect.top
    application_width = application_rect.width()
    application_height = application_rect.height()

    x = application_x + int(application_width * fraction_x)
    y = application_y + int(application_height * fraction_y)

    return x, y
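
Adapted to our setup, the same idea plus the actual click looks roughly like the sketch below, again assuming pyautogui; the window bounds are illustrative and would come from whatever the client reports for the focused application:

# A minimal sketch of turning the LLM's fractional answer into a real click with pyautogui.
import pyautogui

WINDOW_LEFT, WINDOW_TOP = 0, 25          # illustrative bounds of the focused window
WINDOW_WIDTH, WINDOW_HEIGHT = 1440, 875

def click_fraction(fraction_x: float, fraction_y: float) -> None:
    """Convert fractional window coordinates to screen coordinates and click there."""
    x = WINDOW_LEFT + int(WINDOW_WIDTH * fraction_x)
    y = WINDOW_TOP + int(WINDOW_HEIGHT * fraction_y)
    pyautogui.click(x, y)

click_fraction(0.42, 0.08)  # e.g. somewhere in PowerPoint's ribbon area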

Self-Ask prompting (reflection)

Fractional coordinates really improved our clicking mechanism, but it was still hit-or-miss. To make it reliable, we incorporated a mechanism we called self-ask prompting: once we got the coordinates, we executed the action, took a screenshot of the application, sent it back to the LLM, and asked whether the executed action went in the right direction!

This was the user prompt:

{{user_prompt}}

Your task was to estimate the relative fractional x and y coordinates of the point to click, ranging
from 0.0 to 1.0. The origin is the top-left corner of the application window.

So you came up with
x = {{x_coordinate}}
y = {{y_coordinate}}

and your reason was:

{{prev_reason}}

We execute click on x and y coordinate and please check what happened next in the screenshot!
Please strictly check given the screenshot, And notice if you achieve user query and if your fractional x and y co-ordinates were correct
and if not please then come up with new coordinates and correct them!
Please return your response in valid json format that I can put in json.loads() without an error - this is extremely important. Do not add any leading or trailing characters.
Expected LLM Response
{
    "correction_required": "..",
    "reason": "..",
    "x": "..",
    "y": ".."
}
"correction_required": put true if correction is required, otherwise false
"reason": please justify why you believe correction was or was not required
"x": corrected relative fractional x coordinate
"y": corrected relative fractional y coordinate

Self-asking improved the overall situation; however, we noticed it can sometimes end up in an infinite loop. Hallucination is still a big problem, as the LLM sometimes says no correction is required even when the action was clearly in the wrong direction.

Given the short time of the hackathon, we could not improve it further. We also introduced prompts for planning and used the self-ask mechanism there as well.

There were 26 teams in the hackathon. Every team demonstrated their work to the judges, and only 6 teams were selected for the finals. To our surprise, we were among them!
We also have a short video demo on YouTube, which you can watch here.

I would like to wrap up now; if you have any questions, feel free to comment. Our work is open source, so you should check out our repository, and if you like this blog post and our work, let us know by giving us a star!