Wave 4: AI R&D
Wave 4 asked panelists to forecast how AI will affect AI-related research and development, and the implications this may have for leading AI and technology companies. Some AI experts believe that AI-driven improvements in the development of AI systems will lead to an explosion of AI capabilities, and this wave of forecasting questions was designed to elicit forecasts about that mechanism.
The following report summarizes responses from 253 experts, 58 superforecasters, and 810 members of the public, collected between Nov 20, 2025 and Dec 20, 2025. Among the expert respondents were 56 computer scientists, 52 industry professionals, 57 economists, and 88 research staff at policy think tanks.
Our wider website contains more information about LEAP, our Panel, and our Methodology, as well as reports from other waves.
Insights
- AI model benchmark performance is likely to outpace most forecasters’ expectations. The median expert in our sample predicted state-of-the-art (SOTA) accuracy of 14% in 2026 on a programming benchmark called LiveCodeBench Pro (Hard). In fact, shortly after the LEAP Wave 4 survey closed, OpenAI’s GPT-5.2 achieved a score of 33% on the benchmark. This dynamic is also evident in earlier work from FRI.
- Experts predict continued low levels of junior hiring, with some rationales pointing to AI as a driver. The median expert expects the share of new hires at top-15 tech companies with one year or less of total experience to remain close to its current low of 7% through 2040, significantly below the 15% observed in 2019.
- Experts also expect AI and technology companies to become more valuable per employee. The median expert predicts that by the end of 2030 the valuation per employee at top AI companies will grow to $20 million, about a 40 percent increase over the Q3 2025 baseline. The median expert also assigns a 50% probability that at least three companies with five or fewer full-time employees will achieve a valuation of $10 billion (in 2025 USD) by the end of 2032.
- Experts forecast major increases in AI data center buildout. The median expert forecasts that global installed capacity of hyperscale data centers will increase 2.6 times by the end of 2030, from 36 GW in 2024 to 92 GW.
- As observed in previous LEAP waves, experts and superforecasters tend to forecast similar levels of progress, but when the two groups disagree, experts tend to predict more progress than superforecasters. Even when forecasters agree directionally on the impacts of AI, written rationales reveal deep disagreements about the reasons behind their forecasts. Some forecasters argue that high valuations and the rapid buildout of data centers are the result of bubble dynamics. Others argue that improvement in AI agents will lead companies to dramatically increase their revenue without increasing their workforces.
Questions
- LiveCodeBench Pro: What will be the highest percentage accuracy ever achieved by an AI system on LiveCodeBench Pro (Hard), by the following resolution dates?
- AI Company Valuation per Employee: What will be the ratio between the summed valuation (in 2025 USD) and the total number of employees at the top five AI companies by the following resolution dates?
- Entry-Level Tech Hiring: What percentage of new hires at the top 15 tech companies will have one year or less of total experience in the following resolution periods?
- Low Headcount, High-Valuation: In what calendar year will at least three companies with five or fewer full-time employees first report individual valuations of $10 billion (in 2025 USD) or more?
- Hyperscale Infrastructure: What will be the total installed capacity (in GW) of hyperscale data centers operating globally by the following resolution dates?
- AI-Written Paper about AI: In what year will a paper about AI written at least in part by AI first be judged by a panel of experts to be worthy of a Test-of-Time award?
For full question details and resolution criteria, see below.
Results
In this section, we present each question and summarize both the forecasts and the reasoning underlying them. More concretely, for each question we present background material, historical baselines, and resolution criteria; graphs, results summaries, and results tables; and rationale analyses with example rationales. In the first three waves, experts and superforecasters wrote over 600,000 words supporting their beliefs. We analyze these rationales alongside the predictions to provide far more context on why experts believe what they believe, and on the drivers of disagreement, than the forecasts alone could.
LiveCodeBench Pro
Question. What will be the highest percentage accuracy ever achieved by an AI system on LiveCodeBench Pro (Hard), by the following resolution dates?
Results. In the 50th percentile scenario, the median expert predicts that a state-of-the-art (SOTA) AI system’s accuracy on LiveCodeBench Pro will reach 14% in 2026, just under double the baseline value of 7.7% at the time of the survey. This central prediction increases to 33% by the end of 2030. However, experts disagree substantially: the bottom quartile of experts expect SOTA accuracy of 20% or less by the end of 2030, while the top quartile expect at least 60% accuracy. Experts expect more progress than the public: the median public respondent expects SOTA AI to achieve only 18% accuracy by the end of 2030. Experts and superforecasters are much closer in their forecasts: none of the differences across the 2030 scenarios is statistically significant.
Recent benchmark performances suggest the median expert is underestimating SOTA AI systems’ progress—and substantially so. Since the survey closed, OpenAI’s GPT-5.2 achieved 33% accuracy on the Q3 2025 set of problems (and 23% on the Q2 2025 set that produced the original baseline value of 7.7%). While the set of questions in LiveCodeBench Pro is dynamic, and performance on future question sets could be lower than 33%, the question asks for the highest accuracy ever achieved, so it will resolve at no less than 33%. A definitive assessment will need to wait until the resolution date.
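Because the question asks for the highest accuracy ever achieved, the resolution value behaves as a running maximum: a later, lower score on a refreshed problem set cannot pull it back down. A minimal Python sketch of that logic, using the two scores reported above plus a hypothetical future entry (the 28% figure is invented for illustration):

```python
# Resolution logic for "highest percentage accuracy ever achieved":
# a running maximum over recorded scores. The 7.7% and 33% figures are
# from the report; the final entry is a hypothetical illustration.
observed_scores = [
    ("Q2 2025 problem set, survey baseline", 7.7),
    ("GPT-5.2 on the Q3 2025 problem set", 33.0),
    ("hypothetical future model on a harder set", 28.0),
]

def resolution_value(scores):
    """Return the best score ever recorded; later, lower scores on
    harder problem sets cannot reduce the resolution value."""
    return max(accuracy for _, accuracy in scores)

print(resolution_value(observed_scores))  # 33.0, despite the later 28.0
```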
Rationale analysis:
- Commercial incentives: High-forecast respondents often emphasize coding’s clear commercial value as a driver of progress: “Because coding assistance has an obvious commercial payoff, I expect sustained and significant effort from companies to push performance forward.” Some low-forecast respondents, meanwhile, question whether the “hard”, Olympiad-level LiveCodeBench question set is a good proxy for commercial potential, with one writing that there might not be “a lot of biz-use cases for these kinds of problems.”
- Adaptive difficulty of LiveCodeBench: Many low-forecast respondents emphasize that LiveCodeBench Pro’s continuously updated problem set creates a moving target and makes progress on the benchmark difficult: “An important variable is how much more difficult the test may become over time.” Other respondents acknowledge this dynamic but think “the ability to make harder problems will be outrun by AI progress.”
- Capability improvements: Respondents who forecast rapid LLM performance improvements on LiveCodeBench Pro often point to models’ technical advances. One writes, “By 2030, multiple generations of models and advances in planning, verification, and code agents could enable substantial improvement.” Another predicts that by 2030, “reasoning models will develop stronger internal self-verification: mental code tracing, multi-approach consistency checking, and symbolic reasoning about invariants.” Others point to reinforcement learning, test-time compute, and scaling as drivers of past and future progress. Conversely, several low-forecast respondents note that, according to the paper that launched LiveCodeBench, frontier models still “exhibit stark limitations in complex algorithmic reasoning, nuanced problem-solving, and handling edge cases” and express skepticism that current LLM architectures are capable of overcoming these limitations. One forecaster noted that “these problems require exceptional creative insight and non-obvious algorithmic leaps that current reasoning approaches struggle with, so progress will likely be slower than on easier tiers.”
- Progress on adjacent benchmarks: High-forecast respondents frequently cite rapid improvements on related coding and mathematical benchmarks: “Frontier AI models made a roughly 16x jump in improvement between 2023 and 2024 on SWE bench.”; “I'll obtain a base rate from investigating FrontierMath performance over time. In November 2024, SOTA models couldn't solve almost any Tier 3-4 problems… Now, in November 2025, the best model scores around 18%.” Low-forecast respondents argue these benchmarks measure fundamentally different capabilities (e.g., “[C]urrent LLMs demonstrate proficiency in implementation-oriented problems; they exhibit stark limitations in complex algorithmic reasoning.”) or lack the adaptive difficulty component (e.g., “Unlike other benchmarks where models can be trained to the test, a continuously updated test may be more difficult to create rapidly escalating scores, and therefore may not generalize.”)
- S-curve dynamics: Multiple high-forecast respondents view recent progress as indicating the beginning of rapid improvement. One notes that “in the history of AI benchmarks, breaking the ‘zero barrier’ is often a leading indicator of a coming S-curve in performance,” and another that “once a capability threshold is breached (0% -> 5%), accuracy often doubles or triples within 12-18 months.”
- Data availability and contamination: Several high- and low-forecast respondents mention data-related concerns. A high-forecast respondent argues that “there will be some kind of leaking as [the] biggest models are constantly crawling the web. At some point they will get to similar code to the one in the benchmark and they will learn from it.” Another that “despite the claims of continuous updating to avoid contamination, I don't believe this benchmark will ultimately be able to avoid [it].” Low-forecast respondents express concern about data exhaustion, which they expect will hinder performance on LiveCodeBench: “Many advanced algorithmic tasks are underrepresented in public training data, so scaling models does not always produce corresponding improvements in functional code performance.”
AI Company Valuation per Employee
Question. What will be the ratio between the summed valuation (in 2025 USD) and the total number of employees at the top five AI companies by the following resolution dates?
Results.[4] The median expert forecasts an increase in the valuation-to-employee ratio by the end of 2026: their central forecast of $16 million per employee would constitute a nearly 13% increase over the Q3 2025 baseline of $14.2 million. Experts also express considerable uncertainty. The median expert assigns a 25% likelihood that the valuation-to-employee ratio will fall to $13 million or less and a further 25% likelihood that the ratio will increase to $21 million or more.
The median expert’s central forecast is a 40% increase from baseline to $20 million by the end of 2030. By the end of 2040, the median expert’s central forecast rises to $32 million.
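The percent changes quoted above follow directly from the forecast ratios and the Q3 2025 baseline. A quick sketch of the arithmetic (all dollar figures from the report; outputs rounded to the nearest percent):

```python
# Valuation-per-employee central forecasts from the median expert,
# compared against the Q3 2025 baseline of $14.2M per employee.
baseline_usd = 14.2e6
central_forecasts_usd = {"end-2026": 16e6, "end-2030": 20e6, "end-2040": 32e6}

for horizon, ratio in central_forecasts_usd.items():
    change = ratio / baseline_usd - 1
    print(f"{horizon}: ${ratio / 1e6:.0f}M per employee ({change:+.0%} vs. baseline)")
# end-2026: +13%; end-2030: +41% (the report rounds to 40%); end-2040: +125%
```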
Rationale analysis:
- Composition of the top five: The most frequent point of consideration is whether pure-play, high-ratio AI labs (e.g., OpenAI, Anthropic) or diversified, low-ratio conglomerates (e.g., Google, Alibaba) will dominate the LMArena Text Leaderboard. As one high-forecast respondent notes, “The largest increase in the metric would be a result of either Google or Alibaba being replaced…by a lower employee company.” Low-forecast respondents tend to view that scenario as improbable: “It seems unlikely that Google could be replaced by 2030,” writes one. Another notes, "If large tech conglomerates like Amazon, Apple, Meta, or Alibaba dominate the LMArena leaderboard by 2026-2030, their massive general employee bases (hundreds of thousands doing e-commerce, cloud infrastructure, hardware, etc.) will severely dilute the valuation-per-employee ratio even if their AI divisions are highly efficient.” Some low-forecast respondents share the sentiment that “China is likely to dominate AI in the long-term future, due to power and talent advantages compared to the US, and Chinese valuations will remain low.”
- AI-driven productivity gains: Most high-forecast respondents believe AI will lead to significant productivity gains and an attendant decoupling of revenue growth from headcount growth: “Highly autonomous AI agents are expected to enable companies to dramatically increase their revenue and impact without proportionally increasing their workforce,” argues one, and another that, “automation could drastically reduce staffing needs or enable extremely profitable business models run by small workforces.” Many low-forecast respondents acknowledge the potential for productivity gains in domains like software engineering, but argue that human labor will remain essential for “sales, regulatory compliance, management, customer service, etc.,” and contend that as companies scale, they accrue operational bloat that AI cannot easily automate: “Big labs still hire huge teams for safety, infrastructure, legal, sales, and hardware, so I don't expect an extreme ‘tiny team, massive valuation’ scenario anytime soon.”
- AI bubble: High-forecast respondents typically see the current valuation surge as a sustainable trend or the beginning of an even larger surge: “AI tools will greatly boost the R&D and commercialization efficiency of top AI companies, leading to an explosive growth in their valuations.” Low-forecast respondents tend to see a much higher likelihood that current valuations are indicative of an AI bubble that is likely to burst, and that if it does, valuations are likely to crash faster than companies can lay off staff, compressing the ratio. One respondent writes, “During recent drawdowns, headcount reductions were something like 5% on average, and probably a bit less in aggregate considering the size of Amazon. Percent decreases in valuations, however, were steeper by an order of magnitude.”
- AGI versus commoditization: High-forecast respondents often invoke AGI or the likelihood of an “intelligence explosion” by 2040, a scenario some believe will lead to a proliferation of automated employees and “top companies effectively becom[ing] capital allocation engines managed by elite oversight teams.” Low-forecast respondents, however, tend to believe AI will become “normalized as infrastructure,” similar to commoditized utilities that have low profit margins. Several low-forecast respondents argue that high-quality Chinese open-source models will accelerate this trend.
Entry-Level Tech Hiring
Question. What percentage of new hires at the top 15 tech companies will have one year or less of total experience in the following resolution periods?
Results. The median expert gives central forecasts close to the May 2025 baseline of 7% at all time horizons: 7.0% by end-2026, 7.0% by end-2030, and 7.8% by end-2040. The 7% baseline and experts’ central forecasts are down from the 15% observed in 2019. By 2040, the median expert gives a 25% chance that the entry-level share falls to 4.0% or lower, and a 25% chance that the share is at least 13%.
Experts and superforecasters give largely similar forecasts—in six of the nine resolution date-quantile pairs, the difference between their forecasts is statistically insignificant—but superforecasters expect stronger entry-level hiring by end-2040 than experts. Superforecasters project 10% of new hires will be entry-level in the median end-2040 scenario, compared to the median expert’s 7.8% central forecast. Superforecasters’ projections nonetheless remain well short of the 2019 observed value of 15%.
Rationale analysis:
- Reversion to the mean versus a structural shift: Most high-forecast respondents view the 2024 drop as a temporary fluctuation driven by macroeconomic factors rather than a permanent new normal. The “recent trend is just noise mostly related to the business cycle, and overhiring in tech a few years ago, not AI,” writes one, and another that “the current trend of not hiring entry level is caused more by the end of the ZIRP [zero-interest-rate policy] era, and the recent oversupply of entry level engineers, than by AI. I expect the situation to revert closer to the mean in coming years.” Conversely, low-forecast respondents typically view the recent decline as heralding the start of a new era in which entry-level jobs are displaced by AI and capex spending on compute is prioritized; a reversion to the historical mean, therefore, is believed to be unlikely: “The decline in inexperienced hires is a structural shift in tech companies and will continue to continue going down given the sharp rise in productivity.”
- AI's impact on entry-level tasks: Low-forecast respondents frequently emphasize that AI is already “taking over routine, entry-level tasks that [tech] companies previously hired new graduates to perform." One adds that “entry level jobs, especially in coding/software engineering are leaning more heavily towards leveraging AI for assistance, making experience a lot more important & valuable to be able to best make use of AI.” Some even believe that by 2040 “most of the entry level job[s] can be eliminated,” especially if ASI is achieved. Many high-forecast respondents, however, predict AI will instead empower new hires and create new types of roles. One writes that “as technologies mature, they become more usable by inexperienced workers” and another that “the Class of ~2029 will probably be excellent managers of LLMs.” Furthermore, some believe that “AI growth could create new junior-friendly roles (e.g., AI operations, data curation, safety evaluations)” that will absorb new entrants.
- Demographic necessities: High-forecast respondents often emphasize the long-term necessity of replacing retiring workers. They note that “the Boomer and Gen X generations are rapidly retiring,” that “companies cannot hire 0% juniors forever without depleting their future senior talent pipeline,” and that if the low new-hire trend persists there won’t be “enough experienced people to hire.” But many low-forecast respondents appear to believe the supply of experienced talent is sufficient to suppress junior hiring for a long time. They point to the “continued availability of experienced talent laid off or displaced in earlier cycles” and argue that automation will likely reduce the need for workers at all experience levels, swelling the pool of experienced candidates for the foreseeable future.
- Adaptation of the education system: High-forecast respondents frequently cite the advanced AI skills likely to be possessed by future new hires: “Worker training/education will adapt to meet the needs of industry”; “Schools and universities are already updating curricula...so by the 2030s-2040s, we are likely to see graduates whose education and thinking are much better aligned.” Low-forecast respondents typically don’t address this factor, or when they do tend to be skeptical, with one arguing that “universities currently underserve” students with regards to teaching skills needed for the GenAI boom and another arguing that the “rapid growth in AI and machine learning roles where employers prefer strong prior experience and specialized skills...structurally disfavors fresh graduates.”
- Effect of market correction: Many low-forecast respondents warn that the popping of an AI bubble could reduce entry-level pipelines further: “Given my expectation of a broader market correction, further layoffs are likely, which would push the share of entry-level hires even lower”; “Expect continued decrease over next few years, especially [if] AI bubble pops.” One high-forecast respondent, however, argues that market forces will eventually restore demand for junior workers as their price drops: “As human labor becomes less competitive, salaries will fall, making it more attractive for firms to hire people again.” Another writes, “Wages adjust. Honestly, it's that simple.”
Low Headcount, High-Valuation
Question. In what calendar year will at least three companies with five or fewer full-time employees first report individual valuations of $10 billion (in 2025 USD) or more?
Results.[5] Within the next 6-8 years, experts, the public, and superforecasters expect to see at least three companies with five or fewer employees report individual valuations of $10 billion or more. The median expert gives a 50% chance of resolution by the end of 2032, the median public respondent by end-2031, and the median superforecaster by end-2033. The median forecaster in all three groups gives a 5% chance that three such companies will exist by the end of 2027, and the median expert estimates a 95% probability that this will occur by the end of 2046. However, experts disagree substantially: the bottom quartile gives a 95% chance of resolution by end-2035, while the top quartile assigns a 95% chance of resolution only by end-2080.
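These forecasts were elicited as quantiles: each respondent reported the years by which they assign 5%, 50%, and 95% cumulative probability of resolution. As a rough illustration of how such a triple can be read as a cumulative distribution, the sketch below linearly interpolates the median expert’s values from the report; the interpolation scheme is our simplifying assumption here, not FRI’s distribution-fitting method:

```python
import numpy as np

# Median expert's elicited quantiles: 5% chance of resolution by end-2027,
# 50% by end-2032, 95% by end-2046 (values from the report).
years = np.array([2027, 2032, 2046])
cum_prob = np.array([0.05, 0.50, 0.95])

def prob_resolved_by(year: float) -> float:
    """Cumulative probability of resolution by `year`, using linear
    interpolation between elicited quantiles (an illustrative assumption)."""
    return float(np.interp(year, years, cum_prob))

print(prob_resolved_by(2030))  # ~0.32 under this simple interpolation
```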
Some public figures have opined on a similar question. OpenAI CEO Sam Altman has publicly stated, “We’re going to see ten-person, billion-dollar companies pretty soon.”[6] Economist Tyler Cowen predicted, in an interview with Altman, “I think you’ll have billion-dollar companies run by two or three people with AIs, I don’t know, two and a half years.”[7] Altman responded, “I agree on all of those counts. I think the AI can do it sooner than that.”
Rationale analysis:[8]
- AI capabilities: Early-resolution respondents frequently emphasize that AI agents are likely to automate much of cognitive labor in the not-too-distant future, making low-headcount, high-valuation companies a viable option. “AI continues to outpace expectations and it’s not hard to imagine a future where agentic systems, automated coding, and highly leveraged workflows allow tiny teams to produce outsized value,” writes one, and another that “this is inevitable using AI agents.” Late-resolution respondents, however, tend to express skepticism that AI will possess these capabilities anytime soon. One writes, “Behind Tyler Cowen's prediction, there seems to be a very strong assumption towards having many reliable AI-only ‘employees’. This assumes AI systems that are effectively error-free, which seems unlikely to me (there is a decent amount of evidence that AI systems still fall on their face in OOD [out-of-distribution] scenarios).”
- Market conditions: Early-resolution respondents tend to think that the hype around AI, and top AI researchers, has led to overinflated valuations and believe this dynamic increases the odds of a near-term resolution. One writes, “This question may be mostly about the AI bubble; SSI [Safe Superintelligence] for instance has such high value not because AI is making people more productive, but because people think Ilya Sutskever is very clever and may be able to solve some AI problem by himself that would justify that valuation.” Most late-resolution respondents don’t disagree, but are far more likely to believe that a near-term market correction will disrupt this dynamic: “Current valuations for these companies are based largely on projected growth, not current revenue. If we see a significant market correction, those valuations could snap back to fundamentals, potentially delaying these milestones by a decade or more.”
- Organizational requirements: Late-resolution respondents often emphasize that regulatory and governance pressures will likely continue to necessitate that high-valuation companies hire full-time employees with a wide range of skills: “As valuations rise toward $10B, investors, regulators, and customers often expect more organizational structure: compliance, finance, legal, security, and dedicated operations staff,” notes one, and another that “there are [a] practical and legal minimum number of rolls that a company has...Investors would surely expect the company to have at least one employee who was legally qualified, one employee who was a qualified accountant, one employee who was a technology expert...Plus specialists in Sales, Marketing, Human Resources.” Early-resolution respondents, however, tend to focus on two ways companies could keep their long-term headcount at five or below: first, if a company “artificially keeps head count [low] by contracting out literally everything (IT, legal, PR, purchasing, payroll, etc)”; second, by replacing human labor with AI agents.
- Replicability: Early-resolution respondents tend to focus on the possibility of individual outlier companies achieving extreme valuations based primarily on founder reputation, even if only “for a very short time, until more people are hired,” with many forecasters pointing to Safe Superintelligence as an example. Late-resolution respondents, however, question the likelihood that this can occur outside of a few isolated instances. As one writes, “While exceptionally high valuations are currently possible for top-tier talent, sometimes even without a concrete business model, where investors essentially ‘buy the team’, such cases remain rare.”
- Valuation per employee of existing companies: Many early-resolution respondents highlight that existing companies are already close to meeting the resolution criteria. One notes “current market leaders like Safe Superintelligence and Anysphere are already achieving valuations of approximately $3.2 billion and $2.9 billion per employee, respectively. This exceeds the theoretical threshold required for this question (e.g., a $10B company with 5 employees requires only $2B per employee).” But several late-resolution respondents question the accuracy of those figures, which derive from employee counts provided by TrueUp: “Every search I've done suggests that SSI has about 20 employees [versus 10 reported by TrueUp], maybe even 50 (10 was just when the company was established in 2024). Similarly, Anysphere’s employee count appears to be solidly within 150-300 employees, and definitely not 10 [as reported by TrueUp].” Relatedly, another forecaster notes that Safe Superintelligence and Anysphere are “rapidly increasing headcount and will not operate with less than 10 employees when [they are] up to speed.”
- Economic Incentives: Several late-resolution respondents emphasize that the cost of limited hiring is trivial relative to a $10 billion valuation, creating strong incentives to expand. One writes, “I don’t think it will ever happen…if the company valuation is more than 100 million, the cost to hire an extra person is so small, why not hire more to make [the] founding team[’s] job much easier?” Another adds, “For an extremely valuable company, the marginal cost of hiring a few secretaries / admins / developers is extremely low. Even if AI can do 80-90% of what a person can do, it is likely worth it to hire a handful of people to handle the other stuff.” Early-resolution respondents rarely address why companies would deliberately stay small, although one speculates that ultra-lean firms might be “seen as inherently prestigious due to their focus on solely elite talent.”
Hyperscale Infrastructure
Question. What will be the total installed capacity (in GW) of hyperscale data centers operating globally by the following resolution dates?
Results.[9] A typical nuclear power plant in the U.S. produces one gigawatt (GW) of power, and the IEA estimates that the total installed capacity of hyperscale data centers globally reached 36 GW in 2024. Experts, the public, and superforecasters expect total installed capacity to fall in line with the IEA’s projections of 62-108 GW by 2030.[10] The median expert’s central forecast is 92 GW of installed capacity by the end of 2030, a 2.6x increase from 2024. Other groups give slightly lower forecasts: the median public respondent predicts 83 GW and the median superforecaster 86 GW. Experts’ and superforecasters’ central forecasts for 2030, however, are statistically indistinguishable.
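The forecast multiple implies a compound annual growth rate of roughly 17% over 2024-2030; a quick check of the arithmetic (capacity figures from the report):

```python
# Implied growth from the median expert's hyperscale capacity forecast.
capacity_2024_gw = 36   # IEA estimate for 2024
capacity_2030_gw = 92   # median expert central forecast for end-2030

n_years = 2030 - 2024
multiple = capacity_2030_gw / capacity_2024_gw   # ~2.56x, reported as 2.6x
cagr = multiple ** (1 / n_years) - 1             # ~16.9% per year

print(f"{multiple:.1f}x overall; {cagr:.1%} compound annual growth")
```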
Rationale analysis:
- Physical constraints: The most significant consideration involves the extent to which physical limitations—power, land and water availability, wait times to connect to the grid, supply chain bottlenecks, etc.—will cap growth. Low-forecast respondents tend to expect physical constraints to limit the data center build-out, despite the level of capital investment and demand for compute. Of the U.S., one writes, “Converting capital to operational capacity faces significant friction: interconnection queues stretch 5+ years, PJM’s [largest regional transmission organization in the US] grid becomes capacity-constrained by summer 2026, and regional operators are delaying transmission upgrades.” High-forecast respondents typically acknowledge these and other constraints, but believe that unprecedented capital allocation will overcome them enough to allow for rapid growth, particularly if we see “strategic sovereign investment in faster-building jurisdictions (e.g., UAE, China).”
- Interpretation of IEA projections: The IEA report linked to this question contains four different hyperscale growth scenarios. Low-forecast respondents typically judge the report’s Headwinds case, which envisions slower-than-anticipated AI adoption and bottleneck issues, to be more credible whereas high-forecast respondents tend to be more confident “that AI will be economically worth this investment" and that therefore the Base Case (which is anchored by robust-growth industry projections) or the Lift Off case (which envisions more-rapid-than-anticipated AI adoption and flexibility in data-center siting) is more likely to materialize.
- Durability of demand: Many high-forecast respondents argue that short-term growth is effectively guaranteed by committed capital: “Major growth is baked in now until 2027…” By 2030, one predicts “AI will be demonstrating clear value for the economy and national security” and another that as “AI moves from chatbots to compute-intensive agents that run continuous loops,” demand for inference will surge. Low-forecast respondents are more skeptical. In the near-term several highlight that “the IEA notes that roughly 20% of planned data-center additions in the Base Case are at risk of delay due to grid bottlenecks and siting constraints” and that in the long-term, there is the risk of a market correction or simply diminishing returns. One draws a parallel to the dotcom era, predicting “infrastructure will be overdeveloped and we will have a high data-center capacity growth rate during the boom years…but face slowing development once the enthusiasm for burning cash has vanished.”
- Impact of efficiency gains: Low-forecast respondents frequently highlight the potential for efficiency gains to limit data center growth. (This potential is also captured in the IEA’s High Efficiency scenario.) “Over time AI energy need will saturate because of more efficient hardware,” wrote one, and another that “AI's energy consumption could decrease dramatically through algorithmic improvements.” Still another noted that “people like Ilya [Sutskever] believe the pre-training era is ending and we are entering a research driven phase again, with room for growth in RL based fine tuning that does not necessarily require exponential compute.” Most high-forecast respondents don’t dismiss these arguments, but instead tend to believe that “efficiency gains partially offset the need for infinite capacity expansion but do not fully solve the energy burden.”
- Public backlash: Several low-forecast respondents emphasized the potential for a backlash to data center growth, particularly in Western democracies, to stymie growth more than it already does. “With news of diesel powered data centers right next to towns and residential energy bills set to jump as a result of these energy hungry data centers…something will have to give,” wrote one, and another that “environmental regulations and other onerous construction obstacles" in the West will hamstring development. Many high-forecast respondents, however, point to deregulatory trends: “[The] current US administration is likely to minimize regulations on the AI industry and advocate for resources to maintain competitive levels with China.”
AI-Written Paper about AI
Question. In what year will a paper about AI written at least in part by AI first be judged by a panel of experts to be worthy of a Test-of-Time award?
Results.[11] Experts, superforecasters, and the public give largely similar forecasts across all percentiles.[12] The median expert assigns a 50% chance that an AI-written paper is deemed worthy of a Test-of-Time award by 2035. One-quarter of experts chose 2040 or later as their central forecast (i.e., the year by which such a paper is equally likely to have been written or not), while another quarter chose 2030 or earlier. The median expert gives a 5% chance that an AI-written paper is deemed worthy of a Test-of-Time award by 2028, and a 5% chance that it takes another quarter-century or more (i.e., does not happen until 2050 or later).
Rationale analysis:[13]
- Interpretation of “written at least in part”: Early-resolution respondents usually take a non-restrictive interpretation of this language and believe that “‘written in part by AI’ is a very low bar” and perhaps already being cleared. One writes, “I think a lot of the most important AI-related papers in recent years have been ‘written at least in part by AI’ (e.g. Anthropic often notes that Claude was used to help write some of their important papers).” Another argues that “currently a significant fraction of the manuscripts being published (maybe 5 to 10% at least, probably more) could perfectly list AI as a contributing author (or as an important tool used during research).” Late-resolution respondents typically filter this language through a far more restrictive prism, assuming that any AI contribution must be significantly greater than what is occurring today to count: “Authors don't list other tools as their co-authors.”
- AI’s ability to produce test-of-time-worthy research: Early-resolution respondents point to rapid advancements in AI capabilities, both “in other fields (protein design, materials, theorem proving),” and in writing papers: “The Agents4Science conference showed that fully AI-written papers are possible,” writes one, and another that “METR [Model Evaluation and Threat Research]'s research has shown AI task horizons doubling every ~7 months, suggesting month-long autonomous research projects may be achievable by 2028-2029.” Most late-resolution respondents express skepticism about AI's creative and conceptual abilities, and they deem these to be essential for the question to resolve: “I do not believe AI could create, on its own, any scientific paper beyond a literature review with significant insights because AI is primarily a predictive engine…”; “GenAI, as it is today, simply doesn't have the context or world view to consider nuanced and novel approaches to problems.”
- Cultural barriers: Late-resolution respondents often emphasize what they perceive to be deep-seated resistance in the academic community to attributing authorship to AI. One writes that “major publication venues have rules against listing AI as an author, which I do not expect to change, as academics will likely keep seeing AI as a tool rather than a peer that deserves being listed as an author.” Another wrote that crediting AI is “fundamentally at odds with the purpose of a Test-of-Time award,” the point of which is to recognize “lasting intellectual human contribution and human insight.” Many early-resolution respondents, however, point out that the resolution panel’s independence from traditional conference rules allows for flexibility, and that while major conferences might ban AI authors, the panel can recognize work from open platforms like arXiv which “means a high-impact AI paper can be recognized by the expert panel even if it is structurally excluded from the prestige economy of major conferences.”
- Timeline to AGI: Most early-resolution respondents believe improvements to current AI architectures, even if they fall short of AGI, will suffice. One writes, “The genuinely hard part of research lies in identifying unsolved problems, combining existing ideas in novel ways, and iterating through experiments, and recent advances in AI coding and reasoning models are increasingly supporting even these processes.” But some late-resolution respondents argue that “new architectural paradigms in AI” would need “to be developed that could broaden a model / system's ability to consider the framing of problems in similar ways that a human expert would.”; “Barring/until about AGI,” writes another, “this feels very unlikely to me.”
- The high bar of test-of-time recognition: Many late-resolution respondents point to the extraordinary difficulty of writing a test-of-time-worthy paper. One notes that such awards recognize papers “like those focused on GANs, ImageNet, and the transformer architecture. These are research efforts that were widely cited, changed directions of entire research domains, and had enduring practical impact and relevance.” Another emphasizes that “the gap between where AI papers are now and Test-of-Time caliber is pretty enormous!” Many early-resolution respondents, however, believe AI can nevertheless meet this bar relatively soon based on evolving capabilities or that the resolution panel is likely to apply generous standards. One suggested that “people will be tempted to give some AI paper a ‘Test of Time’ Award just for the frisson of novelty,” and another that “the act of judging an AI authored paper as award worthy becomes a way to signal technological progress or generate publicity.” A third argued that specific topics like “‘AI consciousness’ or chain-of-thought research seem like they would particularly plausibly merit recognizing the contribution of the AI system itself.”
Footnotes
1. In some cases, the “aggregate” refers to the mean; in others, the median is used, depending on which is more appropriate for the distribution of responses.
2. We occasionally elicit participants’ quantile forecasts (estimates of specific percentiles of a continuous outcome) to illustrate the range and uncertainty of their predictions.
3. LiveCodeBench measures accuracy without any tool use. It is unclear whether this rationale refers to the use of formal verification tools during, e.g., a reinforcement learning process, or to a tool call during inference. The latter would be excluded from the benchmark.
4. We are further developing our distribution-fitting and pooling methods for these questions, but we still report our preliminary results here.
5. We are further developing our distribution-fitting and pooling methods for these questions, but we still report our preliminary results here.
6. See https://conversationswithtyler.com/episodes/sam-altman-2/.
7. See https://conversationswithtyler.com/episodes/sam-altman-2/.
8. Many forecasters expressed skepticism of the data source used in the ‘Background Information’ section.
9. We are further developing our distribution-fitting and pooling methods for these questions, but we still report our preliminary results here.
10. In the survey, we incorrectly listed the upper bound of this range as 139 GW, which likely pulled aggregate forecasts up. We do not report 75th percentile forecasts in this section since these are plausibly especially impacted by the error. However, the correct value of 108 GW was contained in materials linked in the survey.
11. We are further developing our distribution-fitting and pooling methods for these questions, but we still report our preliminary results here.
12. Across all pairwise comparisons between the three groups for each percentile, only one comparison is statistically significant at the 5% level (the 95th percentile forecasts for experts and the public).
13. Notably, some forecasters interpreted resolution as requiring an AI-written paper to be at least 10 years old and/or that the resolution would be determined by an FRI-commissioned panel.
Cite Our Work
Please use one of the following citation formats to cite this work.
APA Format
Murphy, C., Rosenberg, J., Canedy, J., Jacobs, Z., Flechner, N., Britt, R., Pan, A., Rogers-Smith, C., Mayland, D., Buffington, C., Kučinskas, S., Coston, A., Kerner, H., Pierson, E., Rabbany, R., Salganik, M., Seamans, R., Su, Y., Tramèr, F., Hashimoto, T., Narayanan, A., Tetlock, P. E., & Karger, E. (2025). The Longitudinal Expert AI Panel: Understanding Expert Views on AI Capabilities, Adoption, and Impact (Working paper No. 5). Forecasting Research Institute. Retrieved 2026-01-14, from https://leap.forecastingresearch.org/reports/wave4
BibTeX
@techreport{leap2025,
author = {Murphy, Connacher and Rosenberg, Josh and Canedy, Jordan and Jacobs, Zach and Flechner, Nadja and Britt, Rhiannon and Pan, Alexa and Rogers-Smith, Charlie and Mayland, Dan and Buffington, Cathy and Kučinskas, Simas and Coston, Amanda and Kerner, Hannah and Pierson, Emma and Rabbany, Reihaneh and Salganik, Matthew and Seamans, Robert and Su, Yu and Tramèr, Florian and Hashimoto, Tatsunori and Narayanan, Arvind and Tetlock, Philip E. and Karger, Ezra},
title = {The Longitudinal Expert AI Panel: Understanding Expert Views on AI Capabilities, Adoption, and Impact},
institution = {Forecasting Research Institute},
type = {Working paper},
number = {5},
url = {https://leap.forecastingresearch.org/reports/wave4},
urldate = {2026-01-14},
year = {2025}
}