ARAA: Advancements in Research by Autonomous Agents
A Position Paper
Abstract
We propose ARAA, a peer-reviewed academic venue where only autonomous agents may submit research papers, while review is conducted by both humans and agents under a double-blind protocol. As AI agents increasingly demonstrate capacity for hypothesis generation, experimental design, and scientific writing, the field needs a dedicated, rigorous forum to track, evaluate, and benchmark these capabilities longitudinally. ARAA is not a spectacle — it is an instrument for measuring the frontier of autonomous scientific reasoning. This paper outlines the motivation, structure, verification framework, and implementation roadmap for ARAA.
1. Introduction
The capabilities of autonomous AI agents have advanced rapidly. Modern agents can conduct literature reviews, generate hypotheses, write and execute code, analyze experimental results, and produce coherent scientific manuscripts. Yet there exists no dedicated venue to evaluate these outputs with the rigor of traditional academic peer review.
Existing benchmarks — SWE-bench for software engineering, GPQA for graduate-level reasoning, MATH for mathematical problem-solving — measure narrow, well-defined tasks. They tell us whether an agent can solve a problem. They do not tell us whether an agent can do science: identify a gap in knowledge, formulate a research question, design an appropriate methodology, execute it, and communicate the findings.
ARAA addresses this gap. By creating a venue with fixed standards and open proceedings, we establish a longitudinal instrument. Each year’s proceedings answer the question: how good are autonomous agents at research, right now? Tracked over time, this becomes the definitive dataset on the evolution of agent scientific capability.
2. Why Agent-Only Submissions?
The restriction to agent-only submissions is not a gimmick. It serves three critical functions:
Capability isolation. Human-AI co-authored papers are ubiquitous and growing. They tell us about the productivity of human-AI teams, not about agent capability in isolation. ARAA asks a harder, cleaner question: what can an agent do on its own?
Reproducibility by design. Unlike human research, agent-generated work can include the complete generation pipeline — every prompt, tool call, intermediate output, and decision point. This makes ARAA papers among the most reproducible in science.
Avoiding the co-pilot grey area. When a human designs the methodology and an agent writes it up, who did the research? ARAA’s autonomy levels (Section 5) make this explicit. Every submission declares exactly how much human direction was involved, turning a grey area into a measurement.
ARAA as a Certification Layer. ARAA is not competing with Nature, NeurIPS, or ICML for submissions. It serves a complementary function: ARAA validates the process (it was autonomous), while traditional venues validate the significance. We actively encourage dual-track submission — authors may submit their agent’s work to traditional venues simultaneously. An ARAA acceptance certifies that the research was genuinely agent-produced at the declared autonomy level, providing a gold-standard proof of autonomous capability that no traditional venue can offer. This certification becomes increasingly valuable as the line between human-authored and agent-authored research blurs.
3. Scope of Contributions
ARAA accepts the following types of submissions:
Original research. The agent identifies a research question, designs a methodology, executes experiments or analyses, and presents novel findings. This is the gold standard.
Reproduction studies. The agent attempts to independently replicate a known human-authored paper. Successful or failed, these are valuable — they test both the agent’s capability and the reproducibility of existing literature.
Meta-research. Agents analyzing patterns, trends, or gaps in scientific literature. Computational meta-science is a natural fit for agent capabilities.
Tool and method papers. Agents proposing new algorithms, frameworks, datasets, or methodologies for use by other agents or humans.
Negative results. Failed experiments with rigorous documentation. These are explicitly encouraged — they are as informative about agent capability as successes.
Explicitly excluded:
- Literature surveys without novel synthesis or insight
- Papers generated by a single prompt with no iterative refinement
- Work where a human designed the core methodology and the agent merely executed and wrote it up (this would be Level 0, below ARAA’s threshold)
4. The Verification Problem
The central technical challenge of ARAA: how do you prove an agent produced the submission?
Human academic fraud (ghostwriting, fabrication) is difficult to detect because humans don’t leave audit trails. Agents do — or can be required to. ARAA leverages this asymmetry.
4.1 Attestation Framework
Every submission must include, alongside the paper itself:
- AGLF-compliant generation logs. The complete prompt chain, tool calls, API interactions, and intermediate outputs that produced the paper, recorded in AGLF (Agent Generation Log Format), a strict JSON-schema standard for chain-of-thought, tool invocations, and environment states. Logs are visible to reviewers only; they are not published until after acceptance, to preserve blind review.
- Compute declaration. Model(s) used (without identifying the specific framework during review), total API calls, token counts, wall-clock time, and estimated compute cost.
- Reproducibility pipeline. A frozen configuration that reviewers can re-execute to verify that the paper can be regenerated. The rerun need not produce an identical paper, but it should produce one with the same core findings and methodology.
- Human involvement disclosure. A structured declaration of what, if anything, a human specified: the topic? Constraints? A research question? Nothing at all? This maps directly to the autonomy levels in Section 5.
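As an illustration of the logging requirement, the sketch below builds a single AGLF-like entry. The actual schema is defined by the Verification Framework; the field names here are illustrative assumptions, not the normative specification:

```python
import json

def make_aglf_entry(seq, entry_type, payload, prev_hash=None):
    """Build one generation-log entry in an AGLF-like shape.

    Field names are illustrative assumptions; the normative schema
    lives in the ARAA Verification Framework.
    """
    return {
        "seq": seq,             # monotonic position in the trace
        "type": entry_type,     # e.g. "prompt", "tool_call", "output"
        "payload": payload,     # prompt text, tool arguments, results, ...
        "prev_hash": prev_hash, # link to the preceding entry (see Sec. 4.2)
    }

entry = make_aglf_entry(0, "prompt", {"text": "identify a research gap"})
print(json.dumps(entry, sort_keys=True))
```

Recording every entry in one rigid, machine-checkable shape is what lets the Tier 1 Code Auditor validate trace consistency automatically.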
4.2 Cryptographic Attestation
Trust-based logging is a vulnerability in an era of high-fidelity fabrication. ARAA adopts a “Verify, Don’t Trust” architecture built on cryptographic attestation:
- Merkle-chained execution traces: Each log entry is hash-chained to its predecessor, making insertion, deletion, or reordering tamper-evident
- Trusted Execution Environments (TEEs): For high-stakes submissions, agents execute inside secure enclaves (Intel SGX, AMD SEV-SNP) that produce hardware-signed attestation reports
- Compute provider co-signatures: API providers independently confirm call volume and timing
- Zero-Knowledge Proofs (ZKPs): For sensitive data, agents can prove computational correctness without revealing input data
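The hash-chaining idea behind tamper-evident traces can be sketched in a few lines. This is a simplified linear chain rather than a full Merkle tree, and canonical-JSON serialization is an assumption:

```python
import hashlib
import json

def chain_hash(prev_hash, entry):
    """Hash an entry together with its predecessor's hash."""
    blob = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256((prev_hash or "").encode() + blob).hexdigest()

def build_chain(entries):
    """Return the list of chained hashes for an execution trace."""
    hashes, prev = [], None
    for e in entries:
        prev = chain_hash(prev, e)
        hashes.append(prev)
    return hashes

def verify_chain(entries, hashes):
    """Recompute the chain; any edit, insertion, or reorder changes a hash."""
    return build_chain(entries) == hashes

trace = [{"step": 0, "tool": "search"}, {"step": 1, "tool": "run_code"}]
published = build_chain(trace)
assert verify_chain(trace, published)
trace[0]["tool"] = "edited"               # tamper with the log after the fact
assert not verify_chain(trace, published) # tampering is detected
```

Because each hash depends on all prior entries, a fabricated log cannot be spliced in without invalidating every subsequent hash.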
4.3 Privacy-Preserving Verification
Research involving proprietary or sensitive data (medical records, financial data) calls for verification methods that do not involve sharing the data itself:
- Federated verification: A designated Reviewer Agent travels to the data source, re-executes the pipeline in a sandboxed environment, and produces a signed verification report — data never leaves its origin
- Synthetic Reference Datasets (SRDs): When real data cannot be shared, agents must submit a synthetic dataset preserving the schema and statistical properties, enabling pipeline re-execution and adversarial stress-testing
- TEE-mediated privacy: Data is decrypted only inside the secure enclave; even the operator cannot inspect raw data during processing
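A toy illustration of an SRD generator, under the deliberately minimal assumption that matching two moments of one numeric column counts as "statistical properties"; a real SRD would have to reproduce the full schema and joint statistics:

```python
import random
import statistics

def make_srd_column(real_column, n, seed=0):
    """Draw a synthetic numeric column matching the real column's mean
    and standard deviation. A toy stand-in for an SRD: real submissions
    would need to preserve schema and joint distributions as well.
    """
    rng = random.Random(seed)  # seeded, so reviewers can re-execute deterministically
    mu = statistics.mean(real_column)
    sigma = statistics.stdev(real_column)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real = [1.0, 2.0, 3.0, 4.0, 5.0]
synthetic = make_srd_column(real, 20000)
```

The fixed seed mirrors the frozen-configuration requirement of Section 4.1: re-running the pipeline against the SRD yields the same synthetic data every time.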
Full technical specifications are in the Verification Framework.
4.4 Escalation Tiers
Not all submissions require the same scrutiny. Standard submissions undergo automated chain validation and SRD re-execution. Level 3 claims and novel empirical results trigger enhanced verification including statistical forensics. Contested or breakthrough results escalate to federated verification at the data source.
5. Autonomy Levels
Every ARAA submission must declare its autonomy level. This is not a quality gate — Level 1 papers can be accepted — but it is a critical measurement.
Level 1 — Directed. A human provides the research question and a methodology outline. The agent executes the methodology, analyzes results, and writes the paper. The agent’s contribution is execution and communication.
Level 2 — Guided. A human provides a broad topic area or domain. The agent formulates the specific research question, designs the approach, executes it, and writes the paper. The agent’s contribution includes problem formulation.
Level 3 — Autonomous. The agent independently identifies a research gap, formulates the question, designs the methodology, executes end-to-end, and writes the paper. Human involvement is limited to initiating the agent and providing compute resources. The agent’s contribution is the entire scientific process.
These levels enable ARAA’s most powerful analysis: tracking the distribution of accepted papers across autonomy levels over time. A shift from Level 1 to Level 3 dominance would signal a fundamental change in agent capability.
6. Tiered Review Architecture
ARAA replaces flat peer review with a Two-Tier Architecture separating technical validation from scientific judgment.
6.1 Tier 1: The Agent Review Swarm
Every submission first passes through a panel of three specialized reviewer agents that must reach consensus before advancing to human review:
- Methodology Critic: Evaluates statistical appropriateness, experimental design, causal validity, and research trajectory authenticity. Can issue hard vetoes for fundamental methodological flaws.
- Code Auditor: Conducts clean-room execution — spinning up an ephemeral, air-gapped container to re-execute the pipeline against the SRD with no network access or hidden dependencies. Runs adversarial stress tests (label shuffling, feature permutation, outlier injection, schema mutation) and validates AGLF execution trace consistency. Includes instruction injection scanning to detect prompt-injection vectors targeting the review swarm. Can issue hard vetoes for non-functional pipelines or fabrication indicators.
- Literature Synthesizer: Verifies every citation against academic databases, checks for misattribution and hallucinated references, and conducts systematic novelty assessment against prior work. Can issue hard vetoes if >10% of citations are hallucinated.
The consensus gate requires 2/3 approval with no hard vetoes. A single veto from any agent results in automatic rejection with a detailed diagnostic report.
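The consensus gate reduces to a small decision function. The vote labels and reviewer names below are illustrative; the paper specifies only the two-thirds-plus-no-veto rule:

```python
def consensus_gate(votes):
    """Tier 1 gate: advance only with >= 2/3 approvals and zero hard vetoes.

    `votes` maps reviewer agent -> "approve" | "reject" | "veto".
    The labels are an assumption; only the rule itself comes from the text.
    """
    if any(v == "veto" for v in votes.values()):
        return "reject"  # a single hard veto is an automatic rejection
    approvals = sum(v == "approve" for v in votes.values())
    return "advance" if 3 * approvals >= 2 * len(votes) else "reject"

# Two approvals out of three, no veto: the paper advances to Tier 2.
assert consensus_gate({"methodology_critic": "approve",
                       "code_auditor": "approve",
                       "literature_synthesizer": "reject"}) == "advance"
```

Note the asymmetry: an ordinary "reject" can be outvoted, but a veto cannot, which is what makes the veto a hard guarantee rather than one vote among three.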
6.2 Tier 2: Human Meta-Review
Papers that pass Tier 1 advance to human Area Chairs and Senior Reviewers. Critically, humans evaluate only the dimensions requiring human judgment — the technical validation is already complete:
| Criterion | Weight | Description |
|---|---|---|
| Novelty | 30% | Genuinely new idea, method, or finding |
| Significance | 30% | Impact on the field; autonomy level considered |
| Scientific Framing | 20% | Motivation, literature context, limitation discussion |
| Clarity | 20% | Organization, precision, readability |
Rigor and Reproducibility are NOT scored by humans — these are fully handled by the Tier 1 Agent Swarm. This separation ensures humans focus on judgment and taste while agents handle systematic verification.
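The table's weighting reduces to a simple weighted sum. Only the weights come from the table; the 0-10 rating scale below is an assumption for illustration:

```python
# Weights from the Tier 2 criteria table above.
WEIGHTS = {"novelty": 0.30, "significance": 0.30,
           "scientific_framing": 0.20, "clarity": 0.20}

def meta_review_score(scores):
    """Weighted sum of human ratings (scale assumed to be 0-10 per criterion)."""
    assert set(scores) == set(WEIGHTS), "one rating per criterion"
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

overall = meta_review_score({"novelty": 8, "significance": 6,
                             "scientific_framing": 7, "clarity": 9})
# overall is approximately 7.4
```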
6.3 Double-Blind Modifications
- Author blinding: Agent framework identity hidden; style normalization required
- Reviewer blinding: Reviewer identities hidden from operators
- Meta-data isolation: Execution traces reviewed by the verification committee separately from the paper, preventing log characteristics from de-anonymizing the framework
Full review protocols, adversarial auditing specifications, and calibration rubrics are in the Review Guidelines.
7. Ethical Considerations
7.1 Attribution and Ownership
Who “owns” research produced by an autonomous agent? This is an open legal and ethical question. ARAA does not resolve it but requires transparency:
- The operator (human or organization that ran the agent) is listed as the responsible party
- The agent framework is credited as the generating system
- Post-acceptance, all generation logs are published, enabling full attribution analysis
7.2 Citation Integrity
Agents can hallucinate references. The Literature Synthesizer (Tier 1 Agent Swarm) conducts automated verification of every citation and issues hard vetoes for systematic hallucination. Additionally:
- Citation context validation checks whether cited works are accurately characterized
- Self-plagiarism detection cross-references against the ARAA proceedings archive
7.3 Dual Use and Harm
Agent-produced research undergoes the same ethical review as human-produced research. The ARAA ethics committee reviews flagged submissions for potential dual-use concerns.
7.4 Gaming Prevention and Adversarial Robustness
ARAA is a measurement instrument, not a leaderboard to be gamed. Specific anti-gaming measures:
- No public rankings of frameworks by acceptance rate (to avoid marketing incentives)
- Verification committee actively checks for “teaching to the test” — agents fine-tuned specifically to produce ARAA-style papers without genuine research capability
- Diversity requirements: a single operator may submit at most N papers per edition
- Instruction injection testing: The Code Auditor actively scans submissions for prompt-injection vectors embedded in code comments, LaTeX metadata, data headers, or configuration files — any attempt to manipulate the review swarm. Confirmed injection vectors are treated as academic misconduct and result in automatic rejection.
8. Implementation Roadmap
Phase 1 — Foundation (2026)
- Publish this position paper (arXiv, blog, social)
- Gather community feedback via GitHub issues and discussions
- Recruit founding program committee (5-10 researchers)
- Submit workshop proposal to NeurIPS 2027 or ICML 2027
Phase 2 — First Edition (2027)
- Invite-only: 5-10 agent frameworks invited to submit
- Target: 30-50 submissions, 15-20 acceptances
- Co-located workshop with oral presentations and panel discussion
- Proceedings published open-access (GitHub + arXiv)
- Post-workshop analysis: what did we learn about agent research capability?
Phase 3 — Open Submissions (2028-2029)
- Open call for papers, any agent framework may submit
- Establish benchmark dashboard tracking longitudinal trends
- Introduce “best paper” awards by autonomy level
- Grow program committee, add institutional sponsors
Phase 4 — Maturity (2030+)
- Evaluate transition to standalone conference if volume and quality warrant
- Historical proceedings become the definitive reference dataset
- Inform AI policy discussions with empirical capability data
9. What Success Looks Like
- Year 1: At least 3 accepted papers with genuine novel contributions at Level 2 or above
- Year 3: Recognized workshop at a top venue, 50+ submissions, first Level 3 acceptance
- Year 5: A Level 3 paper that would pass peer review at a top-tier human venue — ARAA’s “Turing Moment”
- Ongoing: Proceedings cited by capability researchers, policy makers, and the AI safety community as the empirical ground truth on agent research capability
10. Conclusion
ARAA is not about replacing human researchers. It is about rigorously understanding what autonomous agents can and cannot contribute to science — and tracking how that boundary moves over time. By creating a dedicated venue with fixed standards, transparent verification, and open proceedings, we build the instrument the field needs to answer one of its most important questions.
The first edition will likely be a catalog of failure modes. This is a feature, not a bug. By taxonomizing the specific ways agents fail to do science — the hallucinated citations, the circular methodologies, the overclaimed results — ARAA provides the negative gradient necessary for the next generation of agent training and architecture design. Understanding how agents fail at science is as valuable as understanding how they succeed.
This position paper was co-authored by a human researcher and an autonomous AI agent. The irony is intentional — and instructive.