BardeenAgent: Introducing the Most Reliable Research Agent

TL;DR;

We introduce BardeenAgent, a browser agent designed to efficiently research structured information (like job postings, customer testimonials, or recent blog posts) from complex and interactive websites.

Unlike existing methods, like OpenAI’s Operator or Claude’s Computer Use, which repeatedly use costly AI calls for each datapoint, the BardeenAgent learns how to perform an extraction task just for a single row of data, and programmatically applies it to the entire dataset.

This allows it to reliably and quickly extract large, structured datasets at scale, significantly outperforming previous agents in accuracy and cost-effectiveness.

As a result of this advancement, BardeenAgent empowers sales and revenue leaders by automating real-time market intelligence and competitor analysis.

Teams gain rapid, continuously updated insights into customer behavior, hiring strategies, product developments, and testimonials.

This capability enables sales leaders to drastically reduce research overhead, personalize outreach effectively, and maintain an agile, informed competitive advantage.

Existing Web Agents Struggle With Structured Research at Scale

Most previous research in LLM-powered web agents focuses on navigation, form-filling, or question-answering.

Existing methods struggle when the task involves structuring and cleaning data from diverse websites.

We notice a few key issues that limit the effectiveness of commonly used agents in large-scale data research:

1. Lack of Structure

‍Agents often produce outputs that aren’t consistently structured.

Since most have been built with question-answering as a focus, they typically respond in paragraphs.

Adhering to a schema when outputting results is important because it allows that data to be used in further business processes (e.g. enriching a CRM).‍

2. Limited Website Coverage

‍Agents are evaluated through established benchmarks, which typically feature a narrow set of websites. This causes agents to overfit rather than generalize effectively.‍

3. Model Laziness

‍Browser agents such as OpenAI operator and Claude Computer Use tend to terminate execution before completing the full task, leaving the rest to the user.

For example, when there are several pages of data, agents will often only save the first page of information.

This is especially harmful when there are dozens of pages of information, leading to the agent only extracting a small fraction of what it was supposed to.

BardeenAgent: The Executable LLM Agent Solution

To overcome the limitations we observed with existing web agents—particularly their struggles with structured extraction and scalability—we introduce BardeenAgent.

A novel approach that dynamically builds executable programs to reliably and efficiently extract structured data at scale.

Our key insight was leveraging the inherent, generalizable HTML structures found across web pages to significantly improve both precision and recall.

Infographic - BardeenAgent Recording Phase and Replay Phase

Executing in Two Phases: Recording and Replaying

Most existing agents repeatedly invoke large language models (LLMs) for each data item, which is costly and error-prone due to compounding inaccuracies.

To address this, we designed BardeenAgent to operate in two distinct phases: the Recording Phase (Program Generation) and the Replay Phase (Program Execution).

This design dramatically reduces computational overhead, minimizes cost, and significantly improves extraction accuracy at scale.

1. Recording Phase (Program Generation)

The agent navigates a website, guided by the LLM, and records each action. By recording an action, it saves the CSS selectors of the element(s) it acted on, as well as the operation it was trying to perform (saving text, clicking, typing, etc).

When identifying repetitive structures, like lists, it generalizes operations (more on this later) from individual elements to all similar items in the list, effectively generating a reusable program.

In its program generation phase, the agent won’t save all the data in a list or paginate through all pages of information - it will only save a small snippet to make sure its program was written correctly.

2. Replay Phase (Program Execution)

Once recorded, the generated program systematically replays the action.

The program will act on ALL items in the list, and since an LLM is not required to return each and every item in the list on its own, this eliminates repeated LLM calls, significantly improving efficiency and cost.

CSS Selector Generation: Efficiently Extracting Structured Lists

Structured data on web pages commonly appears in repetitive HTML lists. Leveraging this insight, we designed BardeenAgent to enter a specialized “List Mode,” which identifies representative list elements and generates robust CSS selectors to extract entire lists efficiently and reliably.

Selector generation is implemented in two ways:

Using heuristics (direct structural clues from tags, classes, and hierarchical relationships)
Or with LLM-based generation (uses reasoning capabilities of LLMs to intelligently handle long or complex lists)

Example of How BardeenAgent Focuses on a Single Data Row once it is in List Mode — *An Example of How the Agent Focuses on a Single Data Row once it is in List Mode*

This “List Mode” approach drastically improves efficiency, scalability, and accuracy, enabling the agent to reliably extract complete, schema-compliant data at scale, significantly reducing the overall token usage and extraction time.

Introducing WebLists: A New Benchmark for Web Agents

To evaluate the effectiveness of our method, we introduce WebLists, a benchmark specifically designed to evaluate the ability of web agents to perform interactive, schema-bound data extraction at scale.

WebLists includes 200 tasks across 50 distinct websites (the top 50 companies on the Forbes Cloud 100).

We opted to create our own benchmark, instead of using existing evaluation methods, because no benchmark evaluated the ability of browser agents to extract large amounts of structured data.

WebLists is the first benchmark to capture the ability of browser agents to extract tables of schema-compliant data.

For each company website, we ask the agent to perform four specific real-world tasks:‍

1. Job Postings

If you want to understand company growth and whether they hire the kind of people who use your product, it’simportant to extract lists of all the job postings they have available and then analyze that data. ‍

2. Category-Specific Job Extraction

‍If you want to see if a company is expanding in a certain category (e.g., AI/ML), you can use this type of task to extract all their job postings in that category.

3. Blog Posts and Product Updates

If a sales team is researching a company they are reaching out to, to provide more insightful information during a meeting or outbound, they can summarize the last 20-30 blog posts of the company to see what they have been recently working on.‍

4. Customer Testimonials

‍In competitive analysis, if I want to see the kind of customers that my competition gets and what they do with those customers, I can use this tool to extract all their testimonials and perhaps generate a report with this information.

BardeenAgent Is Effective At Scale

We tested the BardeenAgent on the WebLists benchmark against Agent-E, Perplexity, and also Wilbur, our previous research agent (you can read about it here).

Here are the results:

Agent	Jobs		Job Cat.		Blogs		Testimonials		Overall		Q&A Acc.
Agent	Pre	Rec	Pre	Rec	Pre	Rec	Pre	Rec	Pre	Rec	Q&A Acc.
Agent-E	50.0	17.3	22.0	9.0	36.7	8.7	68.5	13.5	44.3	12.1	58.0
Wilbur	95.2	18.7	86.9	62.5	90.4	24.6	55.4	16.1	82.0	30.5	56.0
LLM + Search	23.3	2.1	13.3	0.2	27.9	8.4	13.7	2.6	19.6	3.3	42.0
BardeenAgent	87.0	86.8	71.1	64.5	71.9	59.9	59.9	53.5	72.5	66.2	60.0
– Selector Model	78.5	25.0	63.9	45.7	81.8	32.7	59.2	38.8	70.9	35.6	60.0

	
		OSZAR »

Across all four use cases, the BardeenAgent significantly outperformed existing SOTA agents.

It achieved an impressive 66.2% recall, doubling the recall of the best-performing previous model (Wilbur).

Meanwhile, our agent maintained a high precision, demonstrating an ability to accurately capture relevant data without significant noise. We provide further details in our paper (link below).

Evaluating Cost

For practical use cases, accuracy is important, but so is cost. By amortizing costs across successfully extracted rows, we found that BardeenAgent achieves over 3x lower costs per row compared to competing agents:

BardeenAgent: ~1.07¢ per row
Wilbur: ~4.55¢ per row
Agent-E: ~3.21¢ per row

Question Answering Performance

To demonstrate its versatility, we also tested the BardeenAgent on traditional question-answering tasks that didn’t involve structured lists.

Some of these questions required complex interaction with the websites, such as filling out forms, observing the results, and then relaying them back to the user.

Here is an example of a question in the dataset:

I am thinking about switching to Databricks for my AI development needs and want to calculate the costs. How much would I be paying per month for the following compute type: All-Purpose compute, m4.xlarge | 4 CPUs | 16GB AWS instance type, with 10 active instances that run 20 hours/day for 25 days per month?

In this experiment, BardeenAgent achieved a 60% accuracy, competitive with state-of-the-art web agents.

This highlights the unique strength of being able to perform great structured data extraction without compromising the more general web agent capabilities of navigation, form-filling, and Q&A.

How Do We Stack Up Against OpenAI’s Operator?

BardeenAgent WebLists Benchmark: Precision Recall by Company

BardeenAgent WebLists Benchmark: Completion Times Histogram

Due to the limited availability of OpenAI Operator, we were only able to measure OpenAI Operator’s task successes on the job extraction use case.

We found that it achieved a precision of 0.41 (half of BardeenAgent) and a recall of 0.08 (less than a tenth of BardeenAgent).

Additionally, we found that the median time to completion for OpenAI operators was 9 minutes per website, much higher than other agents. Additionally, the completion time was highly variable, sometimes taking longer than 30 minutes.

What’s Next for the Bardeen Agent?

While the results of the BardeenAgent are extremely promising, we identified a few key areas where we could further improve.

1. Complex Widget Interaction

Tasks that require intricate filtering or form-filling still present challenges to the BardeenAgent. Developing specialized widget-handling tools or fine-tuned DOM interaction techniques could significantly enhance agent capabilities.

2. Selector Precision Improvements

‍BardeenAgent occasionally produces overly inclusive CSS selectors, reducing extraction precision. Future works could look into a fine-tuned model to improve the accuracy of the CSS selector model.

3. Enhanced Context for Question-Answering

‍BardeenAgent occasionally struggles on nuanced Q&A tasks, defaulting to partial or indirect answers. Integrating richer context handling or specialized Q&A prompts could boost performance significantly.

Running the Benchmark

If you are a researcher interested in running the WebLists benchmark, reach out to us at [email protected] for further information. We are happy to provide the research community with free Bardeen credits to run the benchmark and compare against our work.

You can find the full published paper here.

BibTex

@misc{bohra2025weblistsextractingstructuredinformation,

title={WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents},

author={Arth Bohra and Manvel Saroyan and Danil Melkozerov and Vahe Karufanyan and Gabriel Maher and Pascal Weinberger and Artem Harutyunyan and Giovanni Campagna},

year={2025},

eprint={2504.12682},

archivePrefix={arXiv},

primaryClass={cs.AI},

url={https://arxiv.org/abs/2504.12682},

}