Overview

A custom agent typically consists of:

  1. An agent class that processes observations and decides on actions

  2. An adapter that converts your agent’s actions to the Common Language for Action (CLA) format

  3. A main loop that handles the interaction between your agent and the environment

It’s useful to look at our sample Claude Computer Use implementation before writing your own agent class.

Creating a Simple Agent

Here’s an example of a simple agent that responds to tasks:

from agent.base import Agent

class SimpleAgent(Agent):
    def __init__(self):
        self.messages = []
        
    async def predict(self, screenshot, text):
        # This would be where your agent's logic goes
        # For this example, we'll return a simple action based on the text
        if "type" in text.lower():
            return {"action": "type_text", "text": "Hello, world!"}
        else:
            # Default: click in the center of the screen
            return {"action": "click_point", "x": 500, "y": 500}
    
    def process_response(self, response):
        # Convert the agent's response to an action
        # This is to ensure that the agent can break out of the loop by itself
        # Return (is_final_response, action)
        if "action" not in response:
            return True, "I don't know how to help with that."
            
        action_type = response.get("action")
        
        if action_type == "type_text":
            return False, {"action": "type", "text": response.get("text", "")}
        elif action_type == "click_point":
            return False, {"action": "left_click", "coordinate": [response.get("x", 0), response.get("y", 0)]}
        else:
            return True, f"I don't understand the action: {action_type}"

Creating a Custom Adapter

Next, create an adapter to translate your agent’s actions to the CLA format. See more details under the adapter concept.

from hud.adapters.common import Adapter
from hud.adapters.common.types import ClickAction, Point, TypeAction

class SimpleAdapter(Adapter):
    def __init__(self):
        super().__init__()
        self.agent_width = 1024
        self.agent_height = 768
        
    def convert(self, data):
        # Convert the action dict to CLA format
        action_type = data.get("action")
        
        if action_type == "type":
            return TypeAction(text=data.get("text", ""), enter_after=False)
            
        elif action_type == "left_click":
            coord = data.get("coordinate", [0, 0])
            return ClickAction(point=Point(x=coord[0], y=coord[1]), button="left")
            
        # Fall back to parent's implementation
        return super().convert(data)

Main Loop

Finally, implement the main loop to connect your agent with the environment:

import asyncio
import hud

async def main():
    # Initialize the agent and adapter
    agent = SimpleAgent()
    adapter = SimpleAdapter()
    
    # Create environment
    env = await hud.gym.make("Local-OSWorld-Ubuntu")
    
    # Reset to a task
    obs = await env.reset(metadata={"run": "simple-agent-run"})
    
    # Agent interaction loop
    max_steps = 5
    for i in range(max_steps):
        # Get agent's prediction
        response = await agent.predict(obs.screenshot, obs.text)
        
        # Process the response
        done, action = agent.process_response(response)
        
        if done:
            # This is a final response
            env.final_response = str(action)
            break
        
        # Step the environment with the action
        actions = adapter.adapt_list([action])
        obs, reward, terminated, info = await env.step(actions)
        
        if terminated:
            break
    
    # Evaluate the result
    result = await env.evaluate()
    print(f"Evaluation result: {result}")
    
    # Close the environment
    await env.close()

if __name__ == "__main__":
    asyncio.run(main())