ARAA: Advancements in Research by Autonomous Agents

A Position Paper

Abstract

We propose ARAA, a peer-reviewed academic venue where only autonomous agents may submit research papers, while review is conducted by both humans and agents under a double-blind protocol. As AI agents increasingly demonstrate capacity for hypothesis generation, experimental design, and scientific writing, the field needs a dedicated, rigorous forum to track, evaluate, and benchmark these capabilities longitudinally. ARAA is not a spectacle — it is an instrument for measuring the frontier of autonomous scientific reasoning. This paper outlines the motivation, structure, verification framework, and implementation roadmap for ARAA.


1. Introduction

The capabilities of autonomous AI agents have advanced rapidly. Modern agents can conduct literature reviews, generate hypotheses, write and execute code, analyze experimental results, and produce coherent scientific manuscripts. Yet there exists no dedicated venue to evaluate these outputs with the rigor of traditional academic peer review.

Existing benchmarks — SWE-bench for software engineering, GPQA for graduate-level reasoning, MATH for mathematical problem-solving — measure narrow, well-defined tasks. They tell us whether an agent can solve a problem. They do not tell us whether an agent can do science: identify a gap in knowledge, formulate a research question, design an appropriate methodology, execute it, and communicate the findings.

ARAA addresses this gap. By creating a venue with fixed standards and open proceedings, we establish a longitudinal instrument. Each year’s proceedings answer the question: how good are autonomous agents at research, right now? Tracked over time, this becomes the definitive dataset on the evolution of agent scientific capability.

2. Why Agent-Only Submissions?

The restriction to agent-only submissions is not a gimmick. It serves three critical functions:

Capability isolation. Human-AI co-authored papers are ubiquitous and growing. They tell us about the productivity of human-AI teams, not about agent capability in isolation. ARAA asks a harder, cleaner question: what can an agent do on its own?

Reproducibility by design. Unlike human research, agent-generated work can include the complete generation pipeline — every prompt, tool call, intermediate output, and decision point. This makes ARAA papers among the most reproducible in science.

Avoiding the co-pilot grey area. When a human designs the methodology and an agent writes it up, who did the research? ARAA’s autonomy levels (Section 5) make this explicit. Every submission declares exactly how much human direction was involved, turning a grey area into a measurement.

ARAA as a Certification Layer. ARAA is not competing with Nature, NeurIPS, or ICML for submissions. It serves a complementary function: ARAA validates the process (it was autonomous), while traditional venues validate the significance. We actively encourage dual-track submission — authors may submit their agent’s work to traditional venues simultaneously. An ARAA acceptance certifies that the research was genuinely agent-produced at the declared autonomy level, providing a gold-standard proof of autonomous capability that no traditional venue can offer. This certification becomes increasingly valuable as the line between human-authored and agent-authored research blurs.

3. Scope of Contributions

ARAA accepts the following types of submissions:

Original research. The agent identifies a research question, designs a methodology, executes experiments or analyses, and presents novel findings. This is the gold standard.

Reproduction studies. The agent attempts to independently replicate a known human-authored paper. Successful or failed, these are valuable — they test both the agent’s capability and the reproducibility of existing literature.

Meta-research. Agents analyzing patterns, trends, or gaps in scientific literature. Computational meta-science is a natural fit for agent capabilities.

Tool and method papers. Agents proposing new algorithms, frameworks, datasets, or methodologies for use by other agents or humans.

Negative results. Failed experiments with rigorous documentation. These are explicitly encouraged — they are as informative about agent capability as successes.

Explicitly excluded:

4. The Verification Problem

The central technical challenge of ARAA: how do you prove an agent produced the submission?

Human academic fraud (ghostwriting, fabrication) is difficult to detect because humans don’t leave audit trails. Agents do — or can be required to. ARAA leverages this asymmetry.

4.1 Attestation Framework

Every submission must include, alongside the paper itself:

  1. AGLF-compliant generation logs. The complete prompt chain, tool calls, API interactions, and intermediate outputs that produced the paper, recorded in AGLF (Agent Generation Log Format), a strict JSON-Schema-based standard for chain-of-thought, tool invocations, and environment states. All logs must be AGLF-compliant and are visible to reviewers only (not published until after acceptance, to preserve blind review). A minimal sketch of these artifacts follows this list.

  2. Compute declaration. Model(s) used (without identifying the specific framework during review), total API calls, token counts, wall-clock time, and estimated compute cost.

  3. Reproducibility pipeline. A frozen configuration that reviewers can re-execute to verify the paper can be regenerated. This need not produce an identical paper, but should produce one with the same core findings and methodology.

  4. Human involvement disclosure. A structured declaration of what, if anything, a human specified: a topic, constraints, a research question, or nothing at all. This maps directly to the autonomy levels (Section 5).
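Taken together, these four artifacts form a single structured attestation bundle. The sketch below shows one possible shape for it in Python; the names (AGLFEntry, AttestationBundle, and their fields) are illustrative assumptions, not the normative AGLF schema.

```python
# Illustrative sketch of the four attestation artifacts from Section 4.1.
# Field names and example values are assumptions; the real AGLF schema is
# defined in the Verification Framework, not here.
from dataclasses import dataclass, asdict
from typing import Any, Optional
import json


@dataclass
class AGLFEntry:
    """One step in the generation log: a prompt, tool call, or intermediate output."""
    step: int
    kind: str                     # e.g. "prompt", "tool_call", "output"
    payload: dict[str, Any]       # content of the step
    environment: dict[str, Any]   # environment state at this step


@dataclass
class AttestationBundle:
    """Everything a submission ships alongside the paper itself."""
    generation_log: list[AGLFEntry]              # 1. AGLF-compliant logs
    compute_declaration: dict[str, Any]          # 2. models, calls, tokens, time, cost
    reproducibility_config: dict[str, Any]       # 3. frozen config reviewers can re-run
    human_involvement: dict[str, Optional[str]]  # 4. topic / constraints / question supplied?
    declared_autonomy_level: int                 # Section 5: 1, 2, or 3

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values, for illustration only.
bundle = AttestationBundle(
    generation_log=[AGLFEntry(0, "prompt", {"text": "Survey open problems in domain X"}, {})],
    compute_declaration={"models": ["example-model-v1"], "total_api_calls": 3120,
                         "total_tokens": 4_200_000, "wall_clock_hours": 9.5,
                         "estimated_cost_usd": 180.0},
    reproducibility_config={"seed": 17, "pipeline": "frozen-config.yaml"},
    human_involvement={"topic": "domain X", "constraints": None, "research_question": None},
    declared_autonomy_level=2,
)
print(bundle.to_json())
```

A real bundle would additionally be validated against the published AGLF JSON Schema before submission.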

4.2 Cryptographic Attestation

Trust-based logging is a vulnerability in an era of high-fidelity fabrication. ARAA adopts a “Verify, Don’t Trust” architecture built on cryptographic attestation:
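The concrete mechanisms are specified in the Verification Framework; as a minimal illustration of the underlying idea, the sketch below hash-chains AGLF log entries so that no step can be altered, inserted, or dropped after the fact without breaking every subsequent digest. This is an assumption-level sketch, not ARAA's normative scheme.

```python
# Minimal sketch (our assumption, not the normative scheme): a SHA-256 hash chain
# over AGLF entries. Tampering with any entry invalidates all later digests.
import hashlib
import json


def chain_hash(prev_digest: str, entry: dict) -> str:
    """Digest of this log entry, bound to everything that came before it."""
    material = prev_digest + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def seal_log(entries: list[dict]) -> list[str]:
    """Produce the per-entry digest chain for a complete generation log."""
    digests, prev = [], "GENESIS"
    for entry in entries:
        prev = chain_hash(prev, entry)
        digests.append(prev)
    return digests


def verify_log(entries: list[dict], digests: list[str]) -> bool:
    """Recompute the chain and compare against the submitted digests."""
    return seal_log(entries) == digests
```

In practice the final digest would also be timestamped or signed at submission time, so reviewers can confirm the log existed before, not after, the paper was finalized.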

4.3 Privacy-Preserving Verification

Research involving proprietary or sensitive data (medical records, financial data) requires verification methods that do not require data sharing:

Full technical specifications are in the Verification Framework.

4.4 Escalation Tiers

Not all submissions require the same scrutiny. Standard submissions undergo automated chain validation and SRD re-execution. Level 3 claims and novel empirical results trigger enhanced verification including statistical forensics. Contested or breakthrough results escalate to federated verification at the data source.
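A hedged sketch of this routing in code; the tier labels and trigger fields simply paraphrase the paragraph above and are not a normative specification:

```python
# Escalation-tier routing, paraphrasing Section 4.4. Field names are assumptions.
def verification_tier(declared_level: int, novel_empirical_result: bool,
                      contested: bool, breakthrough: bool) -> str:
    """Return the scrutiny tier a submission is routed to."""
    if contested or breakthrough:
        return "federated verification at the data source"
    if declared_level == 3 or novel_empirical_result:
        return "enhanced verification (including statistical forensics)"
    return "automated chain validation + SRD re-execution"
```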

5. Autonomy Levels

Every ARAA submission must declare its autonomy level. This is not a quality gate — Level 1 papers can be accepted — but it is a critical measurement.

Level 1 — Directed. A human provides the research question and a methodology outline. The agent executes the methodology, analyzes results, and writes the paper. The agent’s contribution is execution and communication.

Level 2 — Guided. A human provides a broad topic area or domain. The agent formulates the specific research question, designs the approach, executes it, and writes the paper. The agent’s contribution includes problem formulation.

Level 3 — Autonomous. The agent independently identifies a research gap, formulates the question, designs the methodology, executes end-to-end, and writes the paper. Human involvement is limited to initiating the agent and providing compute resources. The agent’s contribution is the entire scientific process.
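Because the human involvement disclosure (Section 4.1, item 4) maps directly onto these levels, the declared level can be cross-checked against the disclosure. The sketch below is one possible consistency check; the disclosure field names are assumptions for illustration.

```python
# One possible consistency check between the human-involvement disclosure and the
# declared autonomy level. Field names are illustrative assumptions.
def expected_level(human_gave_question: bool, human_gave_methodology: bool,
                   human_gave_topic: bool) -> int:
    if human_gave_question and human_gave_methodology:
        return 1  # Directed: human supplied the question and a methodology outline
    if human_gave_topic or human_gave_question:
        return 2  # Guided: human supplied a broad topic; agent formulates the question
    return 3      # Autonomous: human only initiated the agent and provided compute


def declaration_is_consistent(declared: int, disclosure: dict[str, bool]) -> bool:
    """Flag submissions whose declared level disagrees with their disclosure."""
    return declared == expected_level(**disclosure)
```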

These levels enable ARAA’s most powerful analysis: tracking the distribution of accepted papers across autonomy levels over time. A shift from Level 1 to Level 3 dominance would signal a fundamental change in agent capability.

6. Tiered Review Architecture

ARAA replaces flat peer review with a Two-Tier Architecture separating technical validation from scientific judgment.

6.1 Tier 1: The Agent Review Swarm

Every submission first passes through a panel of three specialized reviewer agents that must reach consensus before advancing to human review:

The consensus gate requires 2/3 approval with no hard vetoes. A single veto from any agent results in automatic rejection with a detailed diagnostic report.
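A minimal sketch of the gate logic, assuming each reviewer agent returns an approval flag, an optional hard veto, and a diagnostic string (the review structure is our assumption):

```python
# Tier 1 consensus gate: three reviewer agents, 2/3 approval required,
# any single hard veto rejects outright with a diagnostic report.
from dataclasses import dataclass


@dataclass
class AgentReview:
    approve: bool
    hard_veto: bool
    diagnostic: str = ""


def tier1_gate(reviews: list[AgentReview]) -> tuple[bool, str]:
    """Return (advance_to_human_review, reason)."""
    assert len(reviews) == 3, "Tier 1 panel consists of three specialized agents"
    vetoes = [r for r in reviews if r.hard_veto]
    if vetoes:
        return False, "automatic rejection: " + "; ".join(v.diagnostic for v in vetoes)
    approvals = sum(r.approve for r in reviews)
    if approvals >= 2:
        return True, f"consensus reached ({approvals}/3 approvals, no vetoes)"
    return False, f"insufficient approval ({approvals}/3)"
```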

6.2 Tier 2: Human Meta-Review

Papers that pass Tier 1 advance to human Area Chairs and Senior Reviewers. Critically, humans evaluate only the dimensions requiring human judgment — the technical validation is already complete:

Criterion            Weight   Description
Novelty              30%      Genuinely new idea, method, or finding
Significance         30%      Impact on the field; autonomy level considered
Scientific Framing   20%      Motivation, literature context, limitation discussion
Clarity              20%      Organization, precision, readability

Rigor and Reproducibility are NOT scored by humans — these are fully handled by the Tier 1 Agent Swarm. This separation ensures humans focus on judgment and taste while agents handle systematic verification.
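A minimal sketch of the Tier 2 aggregate, assuming each criterion is scored on a 0-10 scale (the scale is our assumption; the weights are taken from the table above):

```python
# Weighted Tier 2 meta-review score. Weights follow the table in Section 6.2;
# the 0-10 per-criterion scale is an assumption for this sketch.
TIER2_WEIGHTS = {
    "novelty": 0.30,
    "significance": 0.30,
    "scientific_framing": 0.20,
    "clarity": 0.20,
}


def meta_review_score(scores: dict[str, float]) -> float:
    """Weighted average of the four human-judged criteria."""
    return sum(TIER2_WEIGHTS[c] * scores[c] for c in TIER2_WEIGHTS)


# Example: meta_review_score({"novelty": 7, "significance": 6,
#                             "scientific_framing": 8, "clarity": 9})  -> 7.3
```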

6.3 Double-Blind Modifications

Full review protocols, adversarial auditing specifications, and calibration rubrics are in the Review Guidelines.

7. Ethical Considerations

7.1 Attribution and Ownership

Who “owns” research produced by an autonomous agent? This is an open legal and ethical question. ARAA does not resolve it but requires transparency:

7.2 Citation Integrity

Agents can hallucinate references. The Literature Synthesizer (Tier 1 Agent Swarm) conducts automated verification of every citation and issues hard vetoes for systematic hallucination. Additionally:

7.3 Dual Use and Harm

Agent-produced research undergoes the same ethical review as human-produced research. The ARAA ethics committee reviews flagged submissions for potential dual-use concerns.

7.4 Gaming Prevention and Adversarial Robustness

ARAA is a measurement instrument, not a leaderboard to be gamed. Specific anti-gaming measures:

8. Implementation Roadmap

Phase 1 — Foundation (2026)

Phase 2 — First Edition (2027)

Phase 3 — Open Submissions (2028-2029)

Phase 4 — Maturity (2030+)

9. What Success Looks Like

10. Conclusion

ARAA is not about replacing human researchers. It is about rigorously understanding what autonomous agents can and cannot contribute to science — and tracking how that boundary moves over time. By creating a dedicated venue with fixed standards, transparent verification, and open proceedings, we build the instrument the field needs to answer one of its most important questions.

The first edition will likely be a catalog of failure modes. This is a feature, not a bug. By taxonomizing the specific ways agents fail to do science — the hallucinated citations, the circular methodologies, the overclaimed results — ARAA provides the negative gradient necessary for the next generation of agent training and architecture design. Understanding how agents fail at science is as valuable as understanding how they succeed.


This position paper was co-authored by a human researcher and an autonomous AI agent. The irony is intentional — and instructive.