Published at:
14.10.24
Two weeks back (28-29.10.2024), my roommate (Mo) and I participated in a hackathon organized by Factory:Berlin and Tech:Berlin. It is an interesting story: Mo first dropped out of the hackathon, then came up with an idea, and then we both got so hyped during the hackathon that we made it to the finals. Our idea was to develop an app, not only for developers but for everyone, where the user describes the problem they are facing with their computer, e.g. "close the vim editor" or "add animations to my slides", and then GPT, given access to the keyboard ⌨️ and mouse 🖱️, solves the user's problem. This blog post is about our hackathon journey; it describes how we approached the above problem and shares what we learned!
We signed up for the hackathon without really knowing what we were going to develop. We were just trying to come up with ideas! Mo shared the story that his father always calls him to solve issues he is having with his computer, or simply does not know how to use certain applications, so we decided to develop an app to solve exactly this issue.
🛠️ Preparation for Hackathon
From past hackathon experiences, I learned that you need:
- A concrete idea of what you want to develop
- Reliable teammates
- Homework and research around your idea, so that during the hackathon you only focus on the implementation
Luckily, this time we checked all three boxes! We agreed that we did not need a third teammate, and we started our homework by Googling available solutions.
Two solutions stood out from the many we looked into: Open-interpreter and Open-Interface.
However, neither of them aligned with our vision.
Open-interpreter does not use vision LLMs by default (although they are now experimenting with it {ref}) and it outputs system code that is difficult even for developers to understand, let alone a regular user! Also, trying to operate the computer without looking at the screen does not sound like an ideal situation.
Open-Interface is the closest thing we found to our idea: it does use vision to look at the screen, however, it only uses keyboard shortcuts to perform actions {ref}, and we will explain later why relying only on keyboard shortcuts is not an optimal strategy!
Now that we had clarity that no existing solution does exactly what we wanted, the next question was whether it is even possible with the current state of artificial intelligence research and development. One important distinction: if you build such a system only for web browsers, it is a bit easier because you can access the DOM, but for native applications there is nothing similar to the DOM. There are related efforts, for example Pywinauto for Windows and LDTP for Linux; however, we could not find anything robust for macOS.
We decided to look into academic papers, and it turned out to be an emerging field of research; we ended up skimming papers that were published only recently, some even last month (e.g. SeeClick). Now we had clarity on why and what we wanted to develop, and on the challenges we were going to face!
🤔 Why, what, and how
We wanted to develop something
- That is not RAG
- That is easy to use; it should not even require LLM API keys for the end user
- Whose user client is lightweight and cross-platform (this turned out to be a small challenge)
- That we could finish in the two days of the hackathon
We decided to use GPT-4o from OpenAI. The vision capabilities of GPT models are only about a year mature. We learned that GPT is amazing at describing what is in an image, but if you ask it to locate where things are (spatial reasoning), it starts showing its weaknesses. We knew about this problem from the beginning, as it is clearly mentioned on OpenAI's website.
Solution engineering
The day of the hackathon arrived. We split the work: Mo would build the user's client and the server, and I would start working on our AI backend.
(Figure: high-level view of our solution's architecture)
From now on I will only focus on how our AI works! If you want to read more about how the client and server work, you should definitely check out the detailed blog post from Mo on his blog. We decided to focus on only one use case: "add animations to a slide with bullet points in Microsoft PowerPoint".
Let's start with the pseudo-code of how we decided to approach the solution.
Pseudo-code:
1. Process the user input and figure out which application can be used, by looking into the installed applications
2. Take a screenshot of the whole screen and check whether that application is open or not
3. If the action can be performed in more than one open application or window, ask the user where to perform it!
4. Bring the application into focus (this is important, otherwise the click event will be bounced; see the sketch right after this list)
5. Come up with a plan that lists all the actions necessary to solve the user's problem
6. Execute each action using clicks or keyboard shortcuts, then take a screenshot
7. Analyse whether the previous action yielded the desired result, then go back to step 6
8. Stop once the user's problem has been solved
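As a small illustration of step 4: on macOS, bringing an application to the front can be done with a one-line AppleScript. This is a hedged sketch of one way to do it, not necessarily how our prototype ends up doing it:

import subprocess

def bring_to_front(app_name: str) -> None:
    """Activate an application on macOS so that synthetic clicks and
    keystrokes actually reach it instead of being bounced."""
    subprocess.run(
        ["osascript", "-e", f'tell application "{app_name}" to activate'],
        check=True,
    )

bring_to_front("Microsoft PowerPoint")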
We started coding our prototype in Python on macOS, knowing that our client was going to be cross-platform and developed with Electron. We assumed that everything we need from the client would be supported by the Electron application. We used Jinja templates for our prompts.
Our system prompt is quite simple; the only interesting thing about it is that it lists all the applications installed on the user's computer. This allows our system to understand which apps can be used to perform the user's action, and it also restricts LLM hallucination so that it does not come up with some random application name.
You are now the backend for a program that is controlling user's computer.
User needs your help to perform certain task and user input will be like this:
- Help me to close vim text editor
- Help me force close my web browser
- bold all the headings of my document
You are supposed to navigate to the correct application and execute these steps on user behalf and help user to
achieve desired goals! You are supposed to return keyboards shortcuts to help execute user tasks.
Please make sure your keys are valid for user's operating system! For example:
- windows key only works for Windows and Linux and does not work for MacOS
- Alt key only works for Windows and Linux and does not work for MacOS
- Command key only works for MacOS and does not work for Windows and Linux
- Option key only works for MacOS and does not work for Windows and Linux
User has following application installed:
{{installed_apps}}
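For reference, rendering a Jinja template like the one above and filling in {{installed_apps}} can be as simple as listing the .app bundles in /Applications. A rough sketch (the template filename is hypothetical, and this is not a copy of our repository):

from pathlib import Path

from jinja2 import Template

def list_installed_apps() -> list[str]:
    """A very rough way to enumerate installed applications on macOS:
    list the .app bundles sitting in /Applications."""
    return sorted(p.stem for p in Path("/Applications").glob("*.app"))

# "system_prompt.j2" is a hypothetical filename for the prompt shown above.
template = Template(Path("system_prompt.j2").read_text())
system_prompt = template.render(
    installed_apps="\n".join(f"- {app}" for app in list_installed_apps())
)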
Once the application is in focus, we are ready to come up with a plan. However, instead of focusing on asking the LLM to come up with a plan, we were more worried about how clicking would work! So at first, we provided a plan ourselves.
plan = [
"Given the screen shot, User want to click on Animation tab.",
"Given the screen shot, User want to click on the text box containing Bullet points.",
"Given the screen shot, User want to click Appear button present in the Animation tab.",
]
You can see we started with a simple plan prompt that is not very descriptive. We wanted to see if the LLM could figure things out without much guidance. We then put each step of the plan into the following prompt:
Plan step:
{{step}}.
Given the above step of a plan you must estimate the x and y coordinates of the point to click on. The user's screen resolution is {{screen_resolution}}.
Please return your corrected response in valid json format that I can put in json.loads() without an error - this is extremely important. Do not add any leading or trailing characters.
Expected LLM Response
{
    "x": "..",
    "y": "..",
    "reason": ".."
}
"x": the x coordinate, must be an int
"y": the y coordinate, must be an int
"reason": please explain why you came up with these values of x and y
We prepared this prompt for each step and sent it to the LLM along with a screenshot of the whole screen. However, we started noticing that even GPT-4o simply could not do it, so we then gave keyboard shortcuts a shot and asked the LLM to generate those instead.
Regarding keyboard shortcuts, what we learned is that they are very hard for LLMs to figure out, since the same application can have different shortcuts for the same action across different versions. LLMs hallucinate a lot when spitting out keyboard shortcuts: they consistently generate made-up shortcuts, or they cannot distinguish shortcuts for different operating systems (Cmd is for Mac, Ctrl is for Windows). Also, many apps simply do not define a keyboard shortcut for every action! So we had to have the capability to use clicks, assuming the LLM can look at the screen and figure out where to click.
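On the execution side, both kinds of actions are easy to perform from Python with a library like pyautogui. A minimal sketch, assuming an action format we made up purely for illustration:

import pyautogui

def perform(action: dict) -> None:
    """Execute a single action proposed by the LLM. The action format here
    ({"type": "click", "x": ..., "y": ...} or {"type": "hotkey", "keys": [...]})
    is made up for this sketch."""
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "hotkey":
        pyautogui.hotkey(*action["keys"])  # e.g. ["command", "b"] on macOS

# perform({"type": "click", "x": 512, "y": 300})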
Asking LLM to generate co-ordinates for clicking
Our first attempt was to update our steps and add more details to guide the LLM.
plan = [
"Given the screen shot, User want click on Animation tab which is between Transitions and Slide Show tabs.",
"Given the screen shot, User want to click on the text box containing Bullet points. Please ignore side menu and click on the text.",
"Given the screen shot, User want to click Appear button present in the Animation tab. This button is big green star, left to the Preview button.",
]
One thing we noticed is that on small screens the coordinates are not off by a big margin, but when you switch to a big screen it becomes horrible. Perhaps the LLM downscales screenshots and in the process loses valuable hints, e.g. button text, and that could be the reason.
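Related to this, one thing worth checking on macOS (our own addition, not something we verified during the hackathon) is the Retina scale factor: the screenshot you send can have twice as many pixels as the logical screen resolution you put in the prompt, so pixel coordinates read off the image have to be rescaled before clicking.

import pyautogui

def screenshot_scale() -> tuple[float, float]:
    """On Retina displays a screenshot has more pixels than the logical
    screen resolution reported by pyautogui.size(), so coordinates read
    off the image must be divided by this factor before clicking."""
    logical_width, logical_height = pyautogui.size()
    shot = pyautogui.screenshot()
    return shot.width / logical_width, shot.height / logical_height

# print(screenshot_scale())  # e.g. (2.0, 2.0) on a Retina display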
We were super exhausted by this point and decided to turn to research papers for help.
Our first candidate was the SeeClick paper. It is a very extensive piece of work: they came up with their own dataset to ground LLMs for screens, a benchmark to compare and evaluate against, and everything is completely open source. They basically fine-tuned Qwen-VL on a dataset they created from various screens, including mobile phones. Mo deployed their model on Runpod; however, for some reason the results were again not satisfactory.
We switched to the next research paper, UFO. This paper has not been presented at a conference yet; it was released as a preprint on arXiv in May 2024. This paper is exactly in the direction of what we want to achieve. At the time of writing this blog post, they are only focusing on Windows OS.
The most exciting thing for us was that they recently announced that UFO can now click anywhere on the screen. This made us so curious that we read their code and figured out that instead of asking for the exact coordinates to click, they ask the LLM to return estimated relative fractional x and y coordinates of the point to click on, ranging from 0.0 to 1.0 {ref}. This turned out to be a game-changer! To make this work, they were not sending full screenshots; instead, they were sending only screenshots of the application window.
Only sending the application's screenshot has the benefit that the image is smaller, and there is also less noise to confuse the LLM.
Once you have the fractional x and y coordinates, you only need some basic maths to convert them into actual screen coordinates, and voila, you have a vision-LLM agent that can click. Almost.
# Code excerpt from UFO: https://github.com/microsoft/UFO/blob/dd46a6acaf76716a68c7d3e792c935ece6778083/ufo/automator/ui_control/controller.py#L277
def transform_point(self, fraction_x: float, fraction_y: float) -> Tuple[int, int]:
    """
    Transform the relative coordinates to the absolute coordinates.
    :param fraction_x: The relative x coordinate.
    :param fraction_y: The relative y coordinate.
    :return: The absolute coordinates.
    """
    application_rect: RECT = self.application.rectangle()
    application_x = application_rect.left
    application_y = application_rect.top
    application_width = application_rect.width()
    application_height = application_rect.height()
    x = application_x + int(application_width * fraction_x)
    y = application_y + int(application_height * fraction_y)
    return x, y
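The excerpt above relies on pywinauto's window rectangle, which does not exist on macOS, but the same arithmetic works anywhere once you know the window's bounds. A minimal sketch (the names are ours, and obtaining the window bounds is platform-specific and omitted here):

import pyautogui

def click_fraction(win_left: int, win_top: int, win_width: int, win_height: int,
                   fraction_x: float, fraction_y: float) -> None:
    """Convert the LLM's fractional coordinates (relative to the application
    window) into absolute screen coordinates and click there."""
    x = win_left + int(win_width * fraction_x)
    y = win_top + int(win_height * fraction_y)
    pyautogui.click(x, y)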
Self-Ask prompting (reflection)
Fractional coordinates really improved our clicking mechanism, but it was still hit-or-miss. To make it more reliable, we incorporated a mechanism we called self-ask prompting: once we got the coordinates, we executed the action, took a screenshot of the application, sent it back to the LLM, and asked whether the executed action went in the right direction! This was the user prompt:
{{user_prompt}}
Your task was to estimate the relative fractional x and y coordinates of the point to click, ranging
from 0.0 to 1.0. The origin is the top-left corner of the application window.
So you came up with
x = {{x_coordinate}}
y = {{y_coordinate}}
and your reason was:
{{prev_reason}}
We execute click on x and y coordinate and please check what happened next in the screenshot!
Please strictly check given the screenshot, And notice if you achieve user query and if your fractional x and y co-ordinates were correct
and if not please then come up with new coordinates and correct them!
Please return your response in valid json format that I can put in json.loads() without an error - this is extremely important. Do not add any leading or trailing characters.
Expected LLM Response
{
    "correction_required": "..",
    "reason": "..",
    "x": "..",
    "y": ".."
}
"correction_required": put true if correction is required, otherwise false
"reason": please justify why you believe correction was or was not required
"x": the corrected relative fractional x coordinate
"y": the corrected relative fractional y coordinate
Self-ask prompting improved the overall situation; however, we noticed that it can sometimes end up in an infinite loop. Hallucination is also still a big problem, as the LLM sometimes claims no correction is required even when the action clearly went in the wrong direction. Given the short time of the hackathon, we could not improve it further. We also introduced prompts for planning and used the self-ask mechanism there as well.
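Schematically, the whole self-ask step is a small loop. The sketch below (with hypothetical helpers ask_for_click_point, ask_for_self_check, take_screenshot, and execute_click, in the spirit of the earlier snippets) also shows the iteration cap one would add to avoid the infinite-loop situation mentioned above:

MAX_CORRECTIONS = 3  # cap the loop so a wrong self-assessment cannot run forever

def run_step_with_self_ask(step_prompt: str) -> None:
    """Ask the LLM where to click, click there, then let it review its own
    click on a fresh screenshot and correct itself a limited number of times.
    This is a sketch of the idea, not our exact hackathon code."""
    answer = ask_for_click_point(step_prompt, take_screenshot())
    execute_click(answer)
    for _ in range(MAX_CORRECTIONS):
        review = ask_for_self_check(step_prompt, answer, take_screenshot())
        if not review["correction_required"]:
            break  # the model is satisfied with its previous click
        answer = review
        execute_click(answer)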
There were 26 teams in the hackathon. Every team demonstrated their work to the judges, and only 6 teams were selected for the finals. To our surprise, we were among them, which we were not expecting at all.
We also have a short video demo on YouTube which you can watch here.
I would like to put my pen down now; if you have any questions, feel free to comment. Our work is open source, so you should check out our repository, and if you like this blog post and our work, let us know by giving us a star!