Wave 5: Security and Geopolitics

Wave 5 asked panelists to consider geopolitical and military issues related to AI. On average, respondents thought that AI capabilities and production would become significantly more multipolar, that autonomous cyber weapons would be in use by NATO states by the 2040s, and that a U.S.-China treaty addressing military AI by that date is unlikely.

First released on:
23 February 2026

The following report summarizes responses from 229 experts, 54 superforecasters, and 703 members of the public, collected between Jan 12, 2026 and Feb 2, 2026. The expert respondents comprised 47 computer scientists, 45 industry professionals, 56 economists, and 81 research staff at policy think tanks.

Our wider website contains more information about LEAP, our Panel, and our Methodology, as well as reports from other waves.

Insights

  1. Experts and superforecasters expect the performance gap between U.S. and Chinese AI models to narrow through 2030, with parity anticipated by the end of 2040. Both groups forecast that Chinese AI systems will roughly match American ones on benchmarks by the end of 2040. At the time this survey was published (January 2026), the top Chinese models lagged behind U.S. models by nine points on the Epoch Capabilities Index; at the time of writing (February 2026), the gap is six points. The median expert expects this gap to narrow to five points by the end of 2030 and to zero by the end of 2040.1

  2. Taiwan's dominance in leading-edge chip manufacturing is predicted to erode, with the U.S. and China becoming significant leading-edge chip manufacturers. Experts project Taiwan's share of global leading-edge chip production to fall from over 90% today to roughly 71% by 2030 and 46% by 2040, with the U.S. and China each capturing around 20% in 2040. Forecasters cited national security concerns as the major factor driving increased U.S. leading-edge chip manufacturing, arguing that onshore chip production could act as insurance against blockade, conflict, or disruption. Forecasters also expected China to scale up its leading-edge chip manufacturing capabilities, eventually reaching technical parity with the U.S. and Taiwan.

  3. Experts expect a NATO member state to authorize autonomous cyber weapons by 2041 and consider a U.S.-China agreement on military AI unlikely by that date. The median expert forecasts that NATO or a member state will publicly authorize fully autonomous offensive cyber operations in 2041; superforecasters predict this will occur slightly earlier, in 2038. A quarter of experts expect this policy shift to occur in 2035 or earlier. In addition, experts gave only a 16 percent probability that the U.S. and China would sign an agreement on military AI use by the end of 2030, and a 40 percent probability that such an agreement would be signed by the end of 2040. The median superforecaster agreed with these predictions, although there was wide disagreement within each group about the likelihood of such a deal.

  4. AI systems are expected to surpass top human forecasters within the next few years, but the significance of that achievement is debated. Superforecasters themselves are the most bullish group: the median superforecaster predicts AI systems will beat the superforecaster baseline on the ForecastBench benchmark by 2028, earlier than the median expert (2030) or the median member of the public (2033). However, forecasters qualitatively disagree on what this milestone would signify. Many note that AI excels at data-rich, quantitative questions (weather, sports, financial data) but struggles with geopolitical judgment, where data is sparse and context-dependent. Others caution that because ForecastBench compares AI systems against a frozen 2024 human baseline, includes many data-heavy questions, and allows multiple AI attempts, it advantages AI systems in ways that may overstate genuine forecasting superiority.

  5. Experts expect large (>$1 billion) regulatory fines on AI companies to total a cumulative $4 billion by 2030. The median expert predicts that the U.S. and/or EU will impose $1 billion in cumulative large-scale regulatory fines on AI companies by 2027, rising to $4 billion by 2030 and $10 billion by 2040. The public consistently expected higher totals, while superforecasters gave the most conservative predictions, estimating $8 billion in fines by 2040. Forecasters cite AI-driven financial fraud, disinformation campaigns, sexual victimization through "deepfakes," and copyright violations as potential triggers for fines, but were skeptical that major fines would be levied by the end of 2027, noting that few fines of this scale have been imposed so far.

Questions

  • U.S. versus China Polarity: What scores will the top-performing American AI system and top-performing Chinese AI system have on the Epoch Capabilities Index at the following resolution dates (Dec 31, 2026, Dec 31, 2030, and Dec 31, 2040)? ⬇️

  • Frontier AI Chip Manufacturing: By the end of 2030 and 2040, what share of leading-edge logic chips used to train leading AI systems will be manufactured (i.e., the share of chips physically produced within the region) in each of the following countries or regions, according to a panel of experts: China (People’s Republic of China), USA, Taiwan (Republic of China), Sum of other regions (not China, USA or Taiwan)? ⬇️

  • Autonomous Cyber Operations and Cyber Weapons: In what year (2026 to 2100) will NATO or any NATO member state publicly and explicitly authorize, in law or binding policy, the use of fully autonomous offensive cyber operations or cyber weapons? ⬇️

  • U.S. and China Military Agreement: Will the United States and China formally sign any bilateral agreement that specifically addresses the military use (i.e., weaponization or autonomous use of force) of artificial intelligence by the following resolution dates (Dec 31, 2027, Dec 31, 2030, and Dec 31, 2040)? ⬇️

  • Regulatory Fines on AI-Generated Content: What will be the sum of regulatory fines individually exceeding $1 billion (2025 USD [i.e., inflation-adjusted]) imposed in the United States or European Union on companies responsible for the development or operation of an AI system, where the cited violation explicitly focuses on AI-generated content, by the following resolution dates (Dec 31, 2027, Dec 31, 2030, and Dec 31, 2040)? ⬇️

  • ForecastBench Performance: In what year will at least one AI system achieve an overall average difficulty-adjusted Brier score below the lower bound of the 95% confidence interval of the 2024 “Superforecaster median forecast” entry’s difficulty-adjusted Brier score—0.072, as of January 9, 2026—on ForecastBench’s tournament leaderboard? ⬇️

For full question details and resolution criteria, see below.
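The ForecastBench question above resolves on a difficulty-adjusted Brier score falling below 0.072. As a point of reference, the plain (unadjusted) Brier score is simply the mean squared error between forecast probabilities and realized binary outcomes. The sketch below is illustrative only, uses made-up forecast numbers, and does not reproduce ForecastBench's difficulty adjustment:

```python
# Plain Brier score for binary-outcome forecasts: mean squared error
# between forecast probabilities and realized outcomes (0 or 1).
# Note: ForecastBench's leaderboard uses a difficulty-adjusted variant,
# which is not reproduced here.

def brier_score(probs, outcomes):
    """Mean of (p - o)^2 over all forecast/outcome pairs."""
    assert len(probs) == len(outcomes) and len(probs) > 0
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical example: four forecasts scored against realized outcomes.
probs = [0.9, 0.2, 0.7, 0.05]
outcomes = [1, 0, 1, 0]
score = brier_score(probs, outcomes)  # 0.035625

# The resolution threshold cited in the question text (as of Jan 9, 2026).
THRESHOLD = 0.072
resolves = score < THRESHOLD
```

Lower scores are better: a perfectly calibrated and confident forecaster scores 0, while always forecasting 50% scores 0.25.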

Results

In this section, we present each question and summarize the forecasts made and the reasoning behind them. More concretely, for each question we present background material, historical baselines, and resolution criteria; graphs, results summaries, and results tables; and rationale analyses with example rationales. In the first three waves, experts and superforecasters wrote over 600,000 words supporting their beliefs. We analyze these rationales alongside the predictions to provide significantly more context than the forecasts alone on why experts believe what they believe and what drives their disagreements.

U.S. versus China Polarity

Question. What scores will the top-performing American AI system and top-performing Chinese AI system have on the Epoch Capabilities Index at the following resolution dates (Dec 31, 2026, Dec 31, 2030, and Dec 31, 2040)?

Results. Both median experts and superforecasters expect the performance gap between the top U.S. and top Chinese AI systems to narrow substantially by 2040, while predicting that U.S. systems will maintain a modest advantage at least until the end of 2030.2 At the time of survey publication (January 2026), the top Chinese model lagged behind the top U.S. model by 9 points on the Epoch Capabilities Index,3 with the top-performing U.S. system scoring 154 and the top-performing Chinese system(s) scoring 145. As of the publication of this report (February 2026), the top Chinese model lags by 6 points, with a score of 148. Experts expect this gap to narrow, predicting a median score difference of 5 by 2030 and 0 by 2040, while superforecasters expect gaps of 4 in 2030 and 0 in 2040. In contrast, the median public participant anticipates much less convergence, expecting American systems to continue to outperform Chinese systems over this period, with a score difference of 5 remaining in 2040.

U.S. versus China Polarity. The figure above shows the median difference between top-performing American and Chinese AI systems on the Epoch Capabilities Index by participant group.
U.S. versus China Polarity. The figure above shows the median 50th percentile (as well as 25th and 75th percentiles when applicable) forecasts by participant group.
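Several rationales in this section reference straight-line extrapolation of benchmark trends. As a purely illustrative sketch (not the panel's methodology), one can piecewise-linearly interpolate the median expert's forecast gap between the anchor points quoted above: roughly 9 points in early 2026, 5 points at end-2030, and 0 at end-2040.

```python
# Piecewise-linear interpolation of the median expert's forecast U.S.-China
# gap on the Epoch Capabilities Index. Anchor points are the estimates quoted
# in the text, treating "end of 2030" as 2031.0 and "end of 2040" as 2041.0.
# Illustrative only: the report's figures come from panel forecasts, not a fit.

ANCHORS = [(2026.0, 9.0), (2031.0, 5.0), (2041.0, 0.0)]

def gap_at(year, anchors=ANCHORS):
    """Interpolate the forecast gap (in index points) at a fractional year."""
    anchors = sorted(anchors)
    if year <= anchors[0][0]:
        return anchors[0][1]
    if year >= anchors[-1][0]:
        return anchors[-1][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= year <= x1:
            return y0 + (y1 - y0) * (year - x0) / (x1 - x0)

print(gap_at(2036.0))  # halfway along the 2031-2041 segment -> 2.5 points
```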

Rationale analysis:

Forecasters offer a wide range of opinions regarding the likely future performance of Chinese versus U.S. AI systems on the Epoch Capabilities Index. China-progresses respondents predict Chinese systems will converge with, or potentially surpass, U.S. systems; U.S.-maintains respondents predict Chinese systems will lag persistently behind.

  • Impact of chip restrictions: A frequent point of disagreement centers on whether export controls are likely to create a hard ceiling for China. Many U.S.-maintains respondents argue that they will: “As the training process becomes more time-consuming and resource-consuming from a computational point of view, lacking access to the most advanced chips could become problematic and significantly hamper the Chinese efforts.” China-progresses respondents often concede that achieving parity with, or surpassing, the U.S. is unlikely in the short term, but emphasize that “over a longer horizon, China has the capacity to invest heavily in domestic semiconductor supply chains, specialized AI accelerators, large-scale datacenters, and the supporting power grid upgrades needed for sustained training and inference,” and that this will likely lead to major gains for China relative to the U.S. regardless of import restrictions. Others point out that “Chinese AI developers achieve sustained gains through algorithmic efficiency…rather than reliance on cutting-edge hardware,” and that this factor, paired with the recent U.S. decision to allow Nvidia to sell near-SOTA chips to China, is also likely to blunt any U.S. advantage.
  • Catch-up versus first-mover advantage: China-progresses respondents often argue that “catching up is much easier than leading,” because “even when the U.S. is first with a capability or training method, the underlying ideas tend to spread—through publications, open-source implementations, talent flows, [and] reverse engineering.” U.S.-maintains respondents contest this, with one predicting that the complexity of future models will make copying harder and another suggesting that “as long as American companies retain a slight lead, they may be able to harness their more powerful AI models to better automate AI R&D, resulting in a positive feedback loop.”
  • The Chinese political system: Many China-progresses respondents point to structural advantages in China’s ability to mobilize resources faster than the U.S. and argue that “China has longer term advantages in terms of infrastructure & energy,” with one predicting that China will “surpass [the] U.S. early because of lesser environmental regulation,” which will in turn allow for faster scaling. Others predict the drive to surpass the U.S. will “translate into [a] concentrated national effort on a few flagship projects—exactly the kind of effort that can produce occasional ‘top leaderboard’ contenders.” In contrast, U.S.-maintains respondents tend to focus on the detrimental effects of the Chinese political system, arguing that the “political requirements of model approval...[mean] they will stay a few steps behind” and that “China will plateau because of a declining economy as a result of [a] demographic precipice,” leading to political instability that will impede progress. Others question the sustainability of state funding. One asks: “The issue here is the business case for Chinese-developed AI technologies—who is going to buy them outside of China? It is hard to see enterprise customers in the United States or Europe relying on Chinese AI tech. This means government funding is going to be the critical element here. For the next few years, the PRC will probably keep pouring in funding. But beyond 2030?”
  • Odds of a technological plateau: A key difference in underlying assumptions is whether AI progress will plateau. Several China-progresses respondents assume that it will, and that “as progress slows,” a “convergence between models” is likely because “once base capability is ‘good enough’...the marginal value of being first at the frontier weakens.” Or as another puts it, “Everyone is framing this in terms of polarity and competition, but I think the more likely outcome is convergence around some commodity version, particularly given the norms of information sharing among the research community, coupled with corporate espionage, and asymptotic limits to growth.” U.S.-maintains respondents, however, typically assume growth in AI capabilities will continue apace and therefore rely on “extrapolating near straight line improvement for the duration of this question,” leading them to predict the gap will be maintained.
  • Geopolitical risk factors: Several China-progresses respondents identified scenarios where the U.S.-EU rifts could accelerate Chinese progress. One warned it “may allow more transfer of European technologies to China. I am thinking specifically of ASML [the exclusive global supplier of Extreme Ultraviolet (EUV) lithography machines]. More access could jumpstart China’s chip development.” Another warned that “if by 2040 China includes Taiwan it will (presumably) have acquired the chip making capability that is today powering U.S. AI.”

Rationale examples, China-progresses respondents:

Performance improvements appear to be flattening rather than continuing at the same rapid pace. I expect this trend to persist over the coming years. As progress slows, I also anticipate a convergence between models, meaning that American and Chinese systems will increasingly catch up to one another, given that Chinese models typically lag by only about six months to a year.

Given the reversal by the Trump administration on selling AI chips to China, I think China will pour a lot of resources into AI improvements as well as developing their own chips by reverse engineering the Nvidia chips as best they can.

If the U.S. and its allies can continue to cooperate to prevent China from accessing the latest technology, e.g., EUV [Extreme Ultraviolet] machines from Europe and specialty chemicals from Japan, the U.S. could maintain its lead in computational hardware. China will devote more resources, especially electrical power, to be able to continually chase the U.S. Herein lies a dangerous situation. Trump's aggressive actions against Europe, over cultural issues as well as Greenland, could prompt The Netherlands to break with the U.S. to sell their latest technology to China. If that happens then all bets are off.

Chinese developers are more prone to intentionally max out benchmarks at the cost of contamination.

Currently [the Chinese] often achieve similar results only months after American systems and this gap is likely to continue to close as they develop their own hardware. In the near term this may also be accelerated by American policies that have become hostile towards foreign students and academic researchers.

Rationale examples, U.S.-maintains respondents:

In the near term, I expect the U.S. to continue leading the benchmark and China to continue lagging by 7 months, as it has been the case in the past few years…As the training process becomes more time-consuming and resource-consuming from a computational point of view, lacking access to the most advanced chips could become problematic and significantly hamper the Chinese efforts. [So,] overall, China faces a greater risk of falling behind.

Continuing increases in AI capability require continued stability of the State. China may run into serious difficulties towards the end of this forecasting period.

Chinese models are about half a year behind, or circa six points. I expect this not to change much except with high capabilities growth, which will require huge datacenters and new paradigms, and where China might start showing a lag. Possible also that the AI race stops, there is an agreement by everyone to use U.S. self-improving AI, so the Chinese models would "stagnate" in the sense there aren't any new ones.

Frontier AI Chip Manufacturing

Question. By the end of 2030 and 2040, what share of leading-edge logic chips used to train leading AI systems will be manufactured (i.e., the share of chips physically produced within the region) in each of the following countries or regions, according to a panel of experts: China (People’s Republic of China), USA, Taiwan (Republic of China), Sum of other regions (not China, USA or Taiwan)?

Results. By 2040, experts, superforecasters, and the public all expect a significant redistribution of leading-edge logic chip manufacturing away from Taiwan, chiefly toward the U.S. and China. From an estimated share of over 90% in 2025, the median expert projects Taiwan to fall to a 71% share by the end of 2030 and only a 46% share by the end of 2040, with the U.S. and China capturing most of the displaced share. Superforecasters similarly anticipate a decline in Taiwan’s dominance, to 77% by the end of 2030 and 51% by the end of 2040. The median expert expects the U.S. and China to hold similar shares of leading-edge logic chip manufacturing by the end of 2040, at 21% and 23% respectively. Other regions remain marginal players in the median expert’s forecast, collectively expected to reach roughly 9% of total share by 2040.

Frontier AI Chip Manufacturing. The figure above shows the mean forecasts (± standard deviation) of the forecasted share of leading-edge chip manufacturing attributed to the United States, China, Taiwan, and Other Regions.

Rationale analysis:

Forecasters offer a wide range of opinions regarding the likelihood that Taiwan's dominance of leading-edge chip manufacturing will persist. Status-quo respondents expect modest erosion but continued Taiwanese dominance; new-paradigm respondents anticipate a major rebalancing.

  • Inertia versus national security imperatives: The most discussed factor concerns the degree to which the complexity of the Taiwanese ecosystem will act as an anchor versus the degree to which national security imperatives will drive a relocation. Status-quo respondents argue that Taiwan’s “entrenched manufacturing expertise, operational culture, and concentrated talent pool” augur against a rapid shift: “The advantage is not just capital expenditure, but accumulated operational know-how and ecosystem depth,” writes one, and another that “sunk investment in Taiwan is an extremely strong draw that will be difficult to overcome.” New-paradigm respondents, however, tend to view evolving national security imperatives as transformative: “From a US national security perspective, heavy dependence on Taiwan for the most advanced logic nodes is a single point of failure. Even if Taiwan remains the best producer, the U.S. has strong incentives to ensure some share of frontier-capable capacity exists on U.S. soil as insurance against blockade, conflict, coercion, or disruption.”

  • Probability of China invading Taiwan: Status-quo respondents typically deem the probability of China invading, or establishing administrative control over Taiwan by other means (by the end of 2030 or 2040), to be low. One writes, “I give Taiwan a 90% chance of remaining independent by 2040, so I believe they will still be the leading manufacturer.” Another adds, “The likelihood of Taiwan being annexed by China or another power is non-zero, but I do not place significant probability on this event.” Several predicted that Taiwan would prioritize maintaining their “‘silicon shield’ to incentivize the US to keep Taiwan away from the control of the CCP [Chinese Communist Party].” New-paradigm respondents, however, place substantially higher weight on Taiwan losing its independence: “I’m fairly certain that China is going to invade Taiwan in the near future,” writes one, while another concludes “the most likely scenario by 2030 is that Taiwan will be under Chinese control.” A few cite specific probabilities: “I am currently predicting about a 33% chance of China gaining control of Taiwan by 2030. And a 67% chance by 2040.” Some see political shifts as facilitating an invasion: “The U.S. commander in chief has a world view shaped by only his own moral compass which apparently divides the planet into several parts on geographic boundaries. Under such logic Taiwan would naturally fall to China…”

  • Efficacy of export controls on China: Many status-quo respondents cite restrictions on the transfer of extreme ultraviolet lithography (EUV) technology as decisive: “ASML has effectively [a] monopoly on chip manufacturing equipment, and cannot sell to China due to EU export restrictions, meaning that Chinese-based manufacturing [of leading-edge logic chips] is essentially zero and will remain so in [the] short term.” Another writes, “I expect the collective West to keep the latest technology away from China. And I do not believe China can develop all of the technology pieces required to build a new process node [compared to] the combined ecosystem of U.S./Japan/Netherlands/Germany/Korea.” New-paradigm respondents predict China will nevertheless find a way. One writes, “Somewhere in the next 5-10 years, I expect China to have practical and scalable EUV capabilities, leading to Chinese-made chips catching up with Taiwan/USA,” while another cites precedent: “China will likely be successful in developing in-country manufacturing, despite export controls, given their success with EVs, solar cells, and batteries.” A third speculates that “if the EU-U.S. relationship keeps deteriorating it might be possible [for China] to buy…the EUV machines from ASML in the future (and if not, it's just a matter of also slowly but steadily hiring former ASML engineers).”

  • Efficacy of U.S. CHIPS Act: Many status-quo respondents express skepticism: “The U.S. may look to diversify and re-shore but so far has not shown the ability to do so well,” writes one, while another cites specific challenges: “USA-sourced advanced semiconductors will hit a ceiling due to structural issues related to U.S. labor quality (by far the largest factor, particularly at the floor/operations technician-level), U.S.-located supply chain…, U.S. environmental regulation…, [and] U.S. input costs.” New-paradigm respondents express more optimism, with several pointing to the possibility that stated targets will be met: “The U.S. is plausibly a large minority producer by 2030 because Commerce and GAO both describe CHIPS-backed projects as moving the U.S. from 0% in 2022 to about 20% of global leading-edge logic by 2030, and TSMC [Taiwan Semiconductor Manufacturing Company] Arizona is already producing advanced chips.” Another echoes that sentiment, pointing to the potential for “TSMC Arizona, Intel Ohio, and Samsung Texas [to ramp] up volume production of <3nm chips by 2028-2030.”

  • Role of other regions: Status-quo respondents generally see other regions as having only a marginal impact: “Other regions realistically only include S Korea and it probably will enter this market,” writes one, while another predicts that “the other regions will not even be able to enter the competition. The EU, for example, lacks the facilities and the restrictive…legislation does not help…” New-paradigm respondents often echo this pessimism, but some treat other regions as a real, if second-tier, growth margin. One expects “South Korea…a new player, to gain a bit of market share in the long term.”

Rationale examples, status-quo respondents:

TSMC has strategic competitive advantages which have been hard to replicate. Unless TSMC fumbles badly, the odds of others catching up are low. To match TSMC's yield advantage and efficiency is very hard and needs consistent execution, favorable policy, immigration, etc. for the U.S., which will be hard. China on the other hand is behind on technology and cut off from access to lithography.

I’m treating this as an infrastructure question, which means the relevant drivers are slow and material: construction timelines and the ability to keep a whole supply chain operational at frontier tolerances.

Leading-edge manufacturing is not something you move quickly: building fabs, qualifying processes, and achieving high yields takes years, plus a deep local ecosystem of suppliers and talent.

Rationale examples, new-paradigm respondents:

There are significant economic, political, and geopolitical pressures on both China and the United States to indigenize chip production as much as possible, so while Taiwan will continue to exert dominance of chip production by 2030 (due to concentrations in the supply chain, talent, and institutional knowledge), the U.S. and China will make significant gains in expanding their share of high-quality compute production. In the U.S. this will be driven by realization of the investments made via the CHIPS Act, private sector funding going into domestic production over the past two years, and buttressed by the political momentum of both the Biden and Trump administration on improving supply chain security by driving constructions of fabs at home in partnerships with TSMC as well as partnerships with companies like Samsung and Intel. In China, this progress will be slower due to its comparative lag in chip production, but I expect SMIC [China’s Semiconductor Manufacturing International Corporation] to make significant gains here, already driven by momentum from export controls and pressures to indigenize, even if it doesn't quite catch up with the 3nm-equivalent processes.

The shift reflects classic risk-adjusted comparative advantage: as AI training becomes a strategic input with extremely high downside risk from supply disruption, firms and governments are willing to pay a higher unit cost to relocate fabrication closer to capital, customers, and security guarantees, making U.S. fabs economically viable despite higher wages and lower static efficiency than Taiwan.

I think there's about a 25-30% chance that by 2030 there will be an intelligence explosion and at least the beginnings of an industrial explosion. If the U.S. or China gets there first, they could pull ahead of Taiwan and potentially somewhat quickly…

While I suspect [a Chinese] invasion and annexation [of Taiwan] would likely lead to some damage to TSMC facilities, this would overall lead to China becoming the dominant producer of cutting-edge chips.

Autonomous Cyber Operations and Cyber Weapons

Question. In what year (2026 to 2100) will NATO or any NATO member state publicly and explicitly authorize, in law or binding policy, the use of fully autonomous offensive cyber operations or cyber weapons?

Results. The median expert forecasts that NATO or a NATO member state will publicly authorize fully autonomous offensive cyber operations or cyber weapons in 2041. This is similar to superforecasters, who predict such a shift in 2038 (a statistically insignificant difference), but significantly earlier than the public, whose central forecast is 2050. Experts exhibit substantial disagreement: the bottom quartile (i.e., those expecting earlier policy change) forecast a publicly announced policy shift in 2035 or earlier, while the top quartile forecast it will not occur until 2060 or later. This disagreement seems to center less on whether such a shift will occur than on whether it will be made public.

Experts and superforecasters, who broadly concur in their forecasts, expect a policy shift to occur sooner than the public. The differences between the median public forecast and median forecasts from both experts and superforecasters are statistically significant across nearly all scenarios—with the sole exceptions occurring in the 95th-percentile scenarios. Some of the reasons for disagreement are further discussed in the Rationale Analysis section.

Autonomous Cyber Operations and Cyber Weapons. The figure above shows the median 50th percentile (as well as 25th and 75th percentiles when applicable) forecasts by participant group.

Rationale analysis:

  • Strategic advantage: The most significant difference of opinion among forecasters centers around whether a public or secret authorization would offer a greater strategic advantage. Late-resolution respondents frequently argue that nations will likely opt for secrecy, or at least ambiguity: “As we saw with the Iranian nuclear cyber attack [i.e., Stuxnet], explicit authorization for such systems wasn't a necessary prerequisite, and the challenges and liabilities (both legally and strategically) of implementing public-facing policies such as this far outweigh any potential benefits.”; “There is a real chance that strategic ambiguity is preserved indefinitely, with autonomy acknowledged implicitly but never explicitly authorized in binding public policy.” But many early-resolution respondents believe that “declaration of these capabilities [would] offer a credible deterrence signal.” One cites power imbalances as a potential factor, writing that “a country like Estonia, with low classic military capabilities but high cyber capabilities, that also sits on the Russian border, is more likely to try to use announced cyber plans as a deterrent.”
  • Geopolitical competition: Many early-resolution respondents cite competitive pressure from adversarial states, such as “North Korea, China and Russia, who have already employed frontier technology for influence operations and infrastructure hacks,” as a likely driver of public authorization(s). One forecaster writes, “NATO member states are likely to adopt autonomous cyber operations or weapons when not doing so would appear to place it at a disadvantage relative to its enemies.” Late-resolution respondents typically don’t dispute this dynamic will play a role, but emphasize that maintaining parity or an advantage over adversaries in this domain does not require public authorization. As one writes, “I expect operational reality to significantly outpace formal law, with explicit public authorization lagging far behind real-world capabilities.”
  • Legal hurdles: Late-resolution respondents often point to legal and reputational hurdles Western democracies would face were they to attempt to publicly authorize such a policy. One argues that “'fully autonomous' and 'offensive' is a reputational third rail,” and another that “Western countries with checks and balances are unlikely to lead on this.” A third writes, “An important reason is the 'accountability gap' under International Humanitarian Law. Since autonomous systems can lack a predictable chain of command, NATO members face immense legal hurdles in ensuring these weapons comply with principles of distinction and proportionality.” Early-resolution respondents tend to focus on drivers that they believe will force nations to move past these legal hurdles. As one writes, “by 2050 the speed of cyber warfare will have definitively surpassed human reaction times, making 'human-in-the-loop' policies obsolete. States will pass laws authorizing autonomous 'counter-measures' (functionally offensive) to maintain deterrence.” Another argues that laws will adapt as the technology evolves: “I think that autonomous ‘judgments’ are going to become increasingly normalized going forward and that’s the tide that will drive saying it out loud.”
  • Erosion of norms in the U.S.: Several early-resolution respondents cite the political trajectory in the U.S. as having the potential to lead to a unilateral U.S. authorization. “In light of the U.S. disregard for norms in the current administration it is likely to change course on the current policy,” writes one, while another notes, “As we have seen from Trump’s first year in his second term[, he] is happy to publicly state many illegal things that he can, or has, done. But also claim he is allowed to do them. I can very easily see this apply to autonomous cyber weapons.” Many late-resolution respondents, however, cite the potential for the erosion of norms to instead lead to the dissolution of NATO—in which case, it would be impossible for NATO or any NATO member state to publicly authorize such a policy. One writes: “I think within a generation NATO as we know it now will not exist.”
  • Technological readiness: Early-resolution respondents often emphasize that the technology is imminent or already present, and so the capability barrier isn’t a factor: “Autonomous cyber weapons are basically already possible, and maybe already being used (not only Stuxnet, but also that recent attack Anthropic reported on). So the question mainly comes down to when NATO will approve their use.” But late-resolution respondents are typically more focused on the risks of autonomy given the current state of AI. They argue that “solving the hallucination problem is a requirement” and that “while the technology may be technically viable now or in the near future, implementing such a policy into law would create immense pressure, especially considering the potential for a skilled hacker to alter algorithms and trigger catastrophic consequences.”

Rationale examples, early-resolution respondents:

The international community can stand on ceremony, but as revisionist states begin to use this technology, they are going to shatter stigmas. Smaller states aren't going to forfeit their attempt to use this tech to meet their own interests.

The competitive dynamics of agentic systems seem to already be passing levels that would enable almost autonomous offensive cyber operations. As some rogue states (e.g. North Korea) will not hesitate to use those, these will become a normal part of the tool kit for all parties. As there is no bright line distinction between “defensive” and “offensive”, full authorization is likely - the main doubt is when that will be done publicly.

Several factors make a reversal of policy likely: erosion of norms, growing conflicts with cyber-sophisticated adversaries like Iran and China, and the potential implosion of NATO such that some countries like the U.S. can break with the current consensus. In light of the U.S. disregard for norms in the current administration, it is likely to change course on the current policy. I'm less clear on whether that would be an overt, official policy as required for the question in the case of the U.S., since it doesn’t seem to bother with legalities. But I think once they engage in this practice de facto, other countries will follow suit with formal policies.

I expect that AI advancements will mean that there are significant improvements in the robustness, precision, and efficacy of cyber agents, such that they minimize humanitarian cost, allow for high accuracy execution, and infliction of significant enough amounts of damage such that declaration of these capabilities offers a credible deterrence signal. By 2030-2032, I expect that there will be an additional concern around the adoption of such sophisticated capabilities by strategic adversaries like North Korea, China and Russia, who have already employed frontier technology for influence operations and infrastructure hacks. Hence, I anticipate that several NATO members (or NATO as a whole) will have publicly adopted such weapons for their deterrence posture by 2033 at the latest.

Rationale examples, late-resolution respondents:

Even if there is a one off deployment of such weapon, making it a binding publicly and explicitly authorized policy will definitely encounter social, institutional and legal hurdles that will be difficult to overcome.

In an alliance setting, the main incentive is to preserve strategic ambiguity: autonomy can be useful while still being politically and diplomatically costly to name, because naming it invites scrutiny about escalation dynamics and accountability, especially when something goes wrong.

The most recent U.S. policy (DoD Directive 3000.09) excludes autonomous/semi-autonomous cyberspace systems. [With the] U.S. being the most advanced country in terms of cyber capabilities, one would expect it to lead a charge in adopting an explicit policy towards such weapons. Their most recent explicit reluctance for fully autonomous offensive cyber operations or cyber weapons is an indicator that considerable time may pass before the use of such weapons become a law in any NATO member country.

Although NATO and its member states already deploy highly automated and adaptive cyber capabilities, the resolution criteria for this question set an exceptionally high threshold, requiring a public, explicit, and binding authorization using unambiguous permissive language for fully autonomous offensive cyber operations, defined as systems that select and execute offensive actions without real-time human approval. Current NATO doctrine and member-state policy consistently emphasize state ownership, political oversight, and legal accountability in cyberspace, with autonomy typically framed as decision-support or automated execution under human judgment rather than as fully autonomous authorization.

The [NATO] alliance is actually in peril right now, so I think I am forced to make a sub-forecast regarding the expected future existence of NATO. I put that currently at 30% through 2028.

U.S. and China Military Agreement

Question. Will the United States and China formally sign any bilateral agreement that specifically addresses the military use (i.e., weaponization or autonomous use of force) of artificial intelligence by the following resolution dates (Dec 31, 2027, Dec 31, 2030, and Dec 31, 2040)?

Results: The median expert assigns a 5 percent probability that the U.S. and China will formally sign a bilateral agreement on military AI usage before the end of 2027. This figure rises to 16 percent by the end of 2030 and 40 percent by end-2040. Experts’ and Superforecasters’ forecasts were statistically indistinguishable across all three resolution dates.

The median public forecaster also assigns a 5 percent probability of a formal U.S.-China agreement on military AI by end-2027. However, the interquartile range of public forecasters’ predictions skewed higher (i.e., they assigned a greater likelihood of agreement) than that of experts, leading to a statistically significant difference between the groups’ 2027 forecasts. The two groups’ 2030 and 2040 forecasts were statistically indistinguishable.

Notably, in each participant group, the interquartile range widened to about 40 percent by 2040. This indicates substantial disagreement within each respondent group about the longer-term likelihood of a formal military AI deal between the current AI heavyweights.

U.S. and China Military Agreement. The figure above shows the distribution of forecasts by participant group, illustrating the median (50th percentile) and interquartile range (25th–75th percentiles) of each forecast.

Rationale analysis:

  • The current U.S. presidential administration: Forecasters frequently point to the current U.S. presidential administration as an obstacle to an agreement being signed. “This is quite unlikely during the Trump administration,” writes one, while another notes “the current U.S. 'AI Action Plan' (July 2025) explicitly pivots away from guardrails toward ‘winning the race.’” But many forecasters, although inclined to view an agreement as “unlikely in the current political environment,” view one as “highly likely in the next one”; “I expect no change in posture towards China or military AI in the remainder of the Trump administration, but suspect that the first two years of the next administration (e.g., by the 2030 date) is a potential inflection point.”
  • Competitive pressures: Many late-resolution respondents emphasize that “in the short term, while both sides are racing to build the ‘best’ AI, I don't see either wanting to constrain itself in writing,” and stress that “AI is viewed as central to future military dominance,” especially given “Taiwan tensions, where [an] AI advantage is decisive.” One writes that “if one side is widely believed to be compounding an advantage, a signed agreement reads less like mutual risk reduction and more like a constraint on the leader’s options.” Another argues that “both sides view military AI as a core competitive advantage,” and therefore, “agreeing to constraints would be perceived domestically as conceding future battlefield advantage.” Several early-resolution respondents acknowledge this dynamic, but come to a different conclusion regarding its effect. One argues “China's technology capacity will eventually outstrip the U.S. in scale and scope, forcing the U.S. to come to the table,” and many point to the shared danger of escalation independent of human control, or a near-miss incident related to the military use of AI, as the potential basis of an agreement: “The deployment of AI-enhanced and autonomous weapons systems by nation-states, actual or fears of escalation around Taiwan, and a high-profile AI near-miss incident will drive cooperation between the two powers.”
  • Verification issues: Several late-resolution respondents stress that the intangible, dual-use nature of AI will make it hard to verify if an agreement is being violated: “Military AI systems are software-driven, dual-use, and easily repurposed. Neither country would accept inspection or transparency mechanisms capable of meaningfully verifying compliance, and without verification a formal treaty is politically infeasible.” Another echoes that point, writing “it is not obvious how an agreement like this would be verified,” in contrast to Cold War nuclear agreements that were validated via “physical inspections, plane and satellite surveillance.” Early-resolution respondents tend not to address the verification factor, or suggest that an agreement could be symbolic or intended to establish voluntary norms for responsible use rather than technically restrictive, with one noting that “confidence-building measures...are not necessarily legally binding.”
  • Trust deficit: Many late-resolution respondents point to the breakdown in trust between the U.S. and China as a major barrier. This forecaster summed up the consensus well: “The current U.S. government has been quite open in saying that any agreement that any U.S. government (including the current U.S. government) agreed [to] in the past will be set aside if/when it is convenient to do so. This doctrine applies to friends and allies (e.g. member[s] of NATO) so China would be extremely naive if it expected the U.S. to keep to any agreement that limited both sides ability to do something that might give it an advantage…I also doubt that many in the U.S.A. would trust China to keep any agreement on the military use of AI.” Early-resolution respondents acknowledge what one characterized as “a historic low in the U.S.-China relationship,” but tend to weigh other drivers more heavily.
  • Historical precedent: Several early-resolution respondents point to prior agreements between adversaries: “There is a long precedent of competing major powers signing key international agreements around arms control and non-proliferation (like the U.S. and the USSR signing the START I [Strategic Arms Reduction Treaty], INF [Intermediate-Range Nuclear Forces], CTBT [Comprehensive Nuclear Test Ban] and other treaties),” writes one. Others note that in 2024, Biden and Xi agreed that nuclear launch decisions should be left in exclusively human hands, suggesting that a nuclear AI agreement of this nature, one meeting the question’s resolution criteria, could come to fruition. Some late-resolution respondents take issue with the nuclear analogy, however, with one writing “it’s not clear that the use of AI in warfare actually inherently produces the type of catastrophic risk that nuclear weapons do…AI can be used in non-lethal warfare, for instance.”
  • Interpretation of resolution criteria: Early-resolution respondents typically view the bar for a signed agreement as easily surmountable, emphasizing that the “question language sets a low bar for resolution by allowing for a restricted, targeted agreement,” such as a limited Memorandum of Understanding (MoU). Some suggest a “reaffirmation of the Biden-Xi AI-nuclear redline” statement could evolve into a qualifying agreement. Several late-resolution respondents instead highlight the caveat that any “agreement might be multilateral,” which could obviate the need for a bilateral agreement.

Rationale examples, early-resolution respondents:

I don't think it will be expansive, but I could imagine an agreement which sets out very limited principles including disallowing autonomous nuclear weapons.

The most effective approach to AI governance will combine sector-based regulation (primarily in the commercial sphere) with confidence-building measures (CBMs) at the international level (in the military domain). CBMs are transparency-enhancing actions taken by states to reduce the likelihood of unintentional conflict resulting from miscommunication or accidents. They are not necessarily legally binding, and are often designed to be targeted, cumulative, and complementary. Over time, they establish norms for responsible use by addressing different aspects of the same challenge: reducing the risks of opacity and miscalculation associated with the introduction of new technologies. During the Cold War, transparency-enhancing initiatives such as President Eisenhower’s 1955 Open Skies proposal, the Washington–Moscow hotline, the 1972 Incidents at Sea Agreement, and the 1975 Helsinki Final Act each targeted narrow but distinct technical challenges…in November 2024, [the U.S. and China] jointly affirmed the need to maintain human control over the decision to use nuclear weapons. All that [is] to say, despite tense geopolitical competition between the PRC and U.S. currently, the most promising avenue for collaboration is on AI CBMs, which, given the sensitivity around nuclear issues between the two and still the commitment on AI x NC3 [Nuclear Command, Control and Communication], it is highly likely there will be bilateral development. The PRC does participate in international conversations on these topics, including LAWS [Lethal Autonomous Weapon Systems] and REAIM [Responsible AI in the Military Domain], but are unlikely to join a U.S.-led initiative, which means a bilateral commitment will be much more likely.

China sees herself resisting U.S. containment, and will not allow herself to limit her own development while this struggle is happening. In fact, the Chinese have seen themselves in this situation long before when things began to go south after the first round of tariffs during Trump's first term. All the events that have been happening since then have only further confirmed their thesis about U.S. containment and the necessity for self-reliance, industrial production, dominating key technologies, etc. Yet, in the past China has indeed signed declarations and treaties just to ignore them later when they go against their national interest. One example is UNCLOS [United Nations Convention on the Law of the Sea], which has been duly ignored when claiming their own right for the control of the South China Sea.

Presumption that even with a change of political parties in 2029, the status quo will remain in the first half of the next POTUS term, regardless of party/ideology. Presumption that given the possibility of autonomous software (cyberweapons), autonomous control of nuclear weapons, autonomous conventional weapons (drones, mines, etc), that it is in the mutual benefit of major powers to codify some form of international standards re. artificial intelligence. (Therefore little short-term possibility, great long-term possibility)

Rationale examples, late-resolution respondents:

Such an agreement would be very difficult to enforce and almost certainly would be broken by both parties because the stakes are too high not to. There may be some symbolic value to such an agreement but considering it would bring no strategic or safety value (because both parties would break it, and they would know that going into it), working toward the agreement would be a low priority for both.

In the short term, while both sides are racing to build the “best” AI, I don’t see either wanting to constrain itself in writing. They can get most of the benefit through ambiguity and dialogue rather than signing anything and even where there’s some alignment, trust is low and verification is weak, which makes formal commitments... unattractive.

This feels highly unlikely to happen at all, but I especially think it is unlikely to be bilateral. It could happen in a multilateral UN-setting or with the EU involved. But given how slow most things develop on the international stage (climate change negotiations, for example), 2040 feels unlikely.

The way this happens is if military AI starts to feel more like early nuclear weapons, (i.e., both sides get genuinely worried about accidents, speed, or loss of control), and then decide that some shared guardrails are better than none. There are already lots of interested actors and pressure on the sidelines, but for this to resolve we need leaders in both countries to converge on that view at the same time. The reason I don’t go higher (even by 2040) is that states can keep talking, signaling restraint, and coordinating informally without ever signing a clean bilateral agreement that clearly meets the resolution criteria. In other words: dialogue is easy [but] signing something explicit about military AI is hard.

Regulatory Fines on AI-Generated Content

Question. What will be the sum of regulatory fines individually exceeding $1 billion (2025 USD [i.e., inflation-adjusted]) imposed in the United States or European Union on companies responsible for the development or operation of an AI system, where the cited violation explicitly focuses on AI-generated content, by the following resolution dates (Dec 31, 2027, Dec 31, 2030, and Dec 31, 2040)?

Results. Domain experts estimated a median of $1 billion in cumulative fines by end-of-2027, rising to $4 billion by 2030 and $10 billion by 2040. The public expected consistently higher fines, forecasting $3.6 billion by 2027 and $11 billion by 2040. Superforecasters offered the most conservative predictions, estimating just $0.75 billion by 2027 and $8 billion by 2040. Statistical tests confirmed these group differences were significant in the near term, with experts and superforecasters both forecasting substantially lower totals than the public through 2030.

Regulatory Fines on AI-Generated Content. The figure above shows the median 50th percentile (as well as 25th and 75th percentiles when applicable) forecasts by participant group.

Rationale analysis:

  • Regulatory timeline: In general, forecasters are extremely skeptical that major fines will be levied by the end of 2027—with many pointing out that the base rate for AI-content-related fines of this magnitude being imposed is zero. But respondents are far more receptive to the idea that such fines might materialize between 2030 and 2040. High-forecast respondents tend to emphasize the potential for existing and emerging regulations to provide a sufficient basis for imposing major fines. One writes, “By 2030, I expect enforcement capacity, legal clarity, and political willingness to impose large penalties to increase, particularly in the European Union, where the AI Act provides a clear statutory basis for sanctions tied directly to AI-generated content such as deepfakes and synthetic media.” Low-forecast respondents, however, tend to emphasize not only that the laws and regulations that will limit AI content are still being developed, but also that an investigation would need to commence and conclude prior to a fine being levied: “Investigations of this type take 1 to 3 years, so no time for a big fine to be imposed by 2027,” writes one. Another adds, “It looks like the FTC investigation into Facebook [for privacy violations] and decision to [penalize] them took 16 months, so we’re already cutting things very close for Dec 2027.”
  • Likelihood of triggering events: Many high-forecast respondents expect that incidents such as “AI driven financial fraud scams at unprecedented scale, disinformation campaigns targeting major elections, [and] ubiquitous sexual victimization through deepfakes” will be common and justify major fines. Others point to the recent Anthropic settlement (which stemmed from a civil action, not a regulatory fine) as an indication that “copyright violations will be a significant area of litigation and regulation.” Low-forecast respondents often point out that these AI-generated content violations may not be sufficiently severe to warrant billion-dollar penalties, with one noting that the large fines that have been levied to date on big tech firms “are for privacy, an issue that far exceeds that of deep fakes in importance,” and another arguing that “CGI [computer generated images] have existed for decades without causing much perceived damage,” and that “as knowledge about the existence of deep-fake technology spreads, its impact on viewers and listeners [will] lessen.”
  • Attribution of liability: High-forecast respondents in general expect AI system developers and operators to be held accountable for harmful content generated by their systems. But low-forecast respondents frequently cite the dual-use nature of generative AI tools, with multiple forecasters drawing analogies to image editing software: “The AI-model creators will not be held accountable, just like e.g. photoshop is not held accountable for what can be done with the software”; “My guess is that AI providers are going to claim that their platform should not be held accountable for deep fake images/videos any more than Adobe Photoshop should be held accountable for people using it to generate fake images.” Another low-forecast respondent predicted that “mostly this will be done by modded (unlocked) open source models and by people who will not have the $1b,” making it difficult to identify and levy major fines on the perpetrators.
  • U.S. regulatory posture: High-forecast respondents often acknowledge that the anti-regulatory, pro-AI-development environment in the U.S. is likely to limit or preclude any major fines in the near term, but expect this dynamic to evolve: By 2030 “regulators in the U.S. are…likely to have more statutory tools afforded by Congress to pursue AI developers and operators,” predicts one. Many low-forecast respondents disagree with this assessment: “In the U.S., companies involved in AI generated content have gone out of their way to ingratiate themselves to the current administration specifically to forestall any regulation or regulatory enforcement,” is a common sentiment, as is the sense that “short term, regulatory capture by the big tech firms will prevent significant U.S. fines, and predatory U.S. foreign policy will prevent significant foreign fines.”
  • Number of targetable companies: A few high-forecast respondents note that several existing tech companies already have sufficient revenue to sustain billion-dollar fines, and that more AI companies are likely to emerge over time, increasing the pool of candidates on which such a fine could be levied. But they are outnumbered by the low-forecast respondents who emphasize that “there are only a handful of AI companies operating at sufficient revenue, market penetration, and jurisdictional exposure for such penalties to be credible.” As one forecaster writes, “The highest fine level [in the EU] is 7% of global annual revenue - so a company would need revenue of $14 billion or greater to be hit with a fine that meets the $1 billion level,” which “limits who can be targeted.”
  • Compliance reaction: High-forecast respondents generally don't emphasize proactive compliance, whereas low-forecast respondents frequently cite the likelihood that proactive remedial measures will prevent major violations: “Large AI companies will work to prevent this kind of behavior on their flagship models”; “I have my fines leveling off between 2030 and 2040, because presumably if large fines are being enforced, companies will be more proactive about following the rules.”
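The revenue threshold invoked in the "number of targetable companies" rationale is simple arithmetic: under a cap of 7% of global annual revenue (the figure the forecaster cites), a $1 billion fine is only arithmetically possible for companies with roughly $14.3 billion or more in annual revenue. A minimal sketch:

```python
# Back-of-envelope check of the fine-threshold arithmetic quoted above.
# Assumption: the maximum fine is capped at 7% of global annual revenue,
# as the forecaster states; actual EU AI Act fine tiers vary by violation type.

MAX_FINE_RATE = 0.07           # cap: 7% of global annual revenue
TARGET_FINE = 1_000_000_000    # $1 billion (2025 USD), the question threshold

# Minimum revenue at which a $1B fine is arithmetically possible.
min_revenue = TARGET_FINE / MAX_FINE_RATE
print(f"Minimum qualifying revenue: ${min_revenue / 1e9:.1f}B")  # ≈ $14.3B
```

This is slightly above the "$14 billion" the forecaster quotes, but the conclusion is the same: only firms in roughly the top tier of AI-industry revenue are exposed to a qualifying fine.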

Rationale examples, high-forecast respondents:

The higher percentiles reflect scenarios in which generative AI plays a central role in major democratic, security, or market disruptions, leading regulators to treat AI-generated content violations with the same severity historically applied to data protection or competition violations.

Fines for AI-generated content are about to increase quite a bit, particularly after the controversy surrounding Grok generating image-based sexual abuse, including of minors, with regulators like the European Commission, OfCom, and the Australian eSafety Commissioner stepping in for victims. While I don't see the US federal government stepping up here, many states have AI generated IBSA legislation on the books. It's bipartisan, as is the reaction to the Grok scandal.

The history of technology companies actions, and their interactions with regulators, suggests that once an industry becomes large, profitable and concentrated the platforms will be held accountable for harms and may not invest in sufficient protections to prevent harmful content. The companies may also, in a more cynical view, be viewed as soft targets and potentially used as fiscal piggy banks.

I suspect the battle between the tech companies that, not surprisingly [want] few rules, and the public that want protection, is approaching a turning point. So I would expect more controls on AI companies, backed up by bigger fines for infringements (and outright bans). I would expect that it will require ever larger fines to get the message across that voters expect protection, but by sometime in the 2030s the AI companies will have got the message that bad behaviour has consequences and large fines will become less necessary.

Rationale examples, low-forecast respondents:

It does not seem to me to be difficult for large AI companies (either developers, providers or deployers) to comply with the relevant requirement of the current EU AI Act. It is technologically easy to do, and does not materially interfere with their product (except for nefarious purposes, which are not the main intended use cases). On the other hand, it is relatively easy to detect and prosecute large scale noncompliance. Therefore, I expect that all large companies will comply.

The cited large fines are for privacy, an issue that far exceeds that of deep fakes in importance. Now because the vast majority of deep fakes are still pretty easy to tell and in the future because I am quite confident that the companies will soon figure out the ways of establishing identity and provenance of digital images produced by their AIs.

There are many high bars for this question—over one billion U.S.D. as of 2025, regulatory rather than civil, imposition of the fine on provider/operator, and specifically an AI-generated content violation.

Even with new obligations (especially in the EU), enforcement usually starts with guidance, test cases, and smaller penalties; $1B+ fines tend to come later once expectations and precedents are established. Crossing the threshold generally needs a very large firm and a violation regulators see as systemic, repeated, or seriously harmful—so early years are likely $0 most of the time.

ForecastBench Performance

Question. In what year will at least one AI system achieve an overall average difficulty-adjusted Brier score below the lower bound of the 95% confidence interval of the 2024 “Superforecaster median forecast” entry’s difficulty-adjusted Brier score—0.072, as of January 9, 2026—on ForecastBench’s tournament leaderboard?

Results: The median expert predicts that an AI system will outperform Superforecasters in forecasting by the end of 2030. The median Superforecaster is more optimistic about improvements in AI-based forecasting systems: Superforecasters expect to be overtaken by the end of 2028. The public, meanwhile, forecasts that AI systems will not clearly outperform Superforecasters until 2033.

At the time this survey was published (January 2026), a simple extrapolation of ForecastBench trends suggested that AI systems would reach Superforecaster parity by October 2026. At the time of writing (February 2026), the same projection suggests parity in December 2026, perhaps lending credibility to some respondents’ expectation that AI systems’ performance would begin to plateau.
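The "simple extrapolation" referenced here amounts to fitting a line to the best AI system's difficulty-adjusted Brier score over time and solving for the date at which the trend crosses the superforecaster target (0.072). The sketch below illustrates the mechanics with placeholder data points, not actual ForecastBench leaderboard values:

```python
from datetime import date, timedelta

TARGET = 0.072  # 2024 superforecaster median, difficulty-adjusted Brier score

# (observation date, best AI difficulty-adjusted Brier score).
# These two points are illustrative placeholders, not real leaderboard data.
observations = [
    (date(2024, 7, 1), 0.120),
    (date(2026, 1, 1), 0.090),
]

def projected_parity(obs, target):
    """Fit a line through the first and last observations and solve
    for the date at which the trend reaches the target score."""
    (d0, s0), (d1, s1) = obs[0], obs[-1]
    slope = (s1 - s0) / (d1 - d0).days  # score change per day (negative = improving)
    if slope >= 0:
        return None  # scores are not improving; no crossing to project
    return d1 + timedelta(days=round((target - s1) / slope))

print(projected_parity(observations, TARGET))  # a date in late 2026 for these inputs
```

As the late-resolution respondents quoted below note, this linear form is itself contestable: if improvement decays as scores approach an irreducible floor, the projected crossing date recedes or never arrives.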

ForecastBench Performance. The figure above shows the median 50th percentile (as well as 25th and 75th percentiles when applicable) forecasts by participant group.

Rationale analysis:

  • Linear trend extrapolation: Early-resolution respondents heavily rely on the linear projection of recent improvements in Brier scores. Many point to the language in the background information which reads, “A simple extrapolation of previous ForecastBench data suggests that LLMs may reach difficulty-adjusted Brier-score parity with superforecasters by October 2026.” Late-resolution respondents, however, challenge the validity of linear extrapolation in this domain, with several arguing that improvements will likely plateau as models bump up against irreducible uncertainty: “There probably is an absolute limit to how well one can forecast,” writes one, and another that “the inherent lower bound from aleatoric uncertainty might well be above 0.072.” One respondent argues that the trajectory “looks much more like a decaying exponential than a linear function... because ultimately the Brier score needs to plateau as it can’t go negative.”
  • Data crunching abilities versus geopolitical judgment: Many early-resolution respondents argue that forecasting plays to AI strengths: “This type of task...is one that is well-suited to LLMs given that they were built to spot patterns and predict outcomes,” is a common sentiment. In addition, many note that ForecastBench questions tend to center on data-rich domains “where AI has a distinct advantage (like weather at specific times and places).” Some late-resolution respondents emphasize that while AI typically handles data-rich questions well, “ForecastBench covers a mix of domains, including geopolitics...where structural uncertainty, regime shifts, and sparse data limit purely algorithmic forecasting advantages.” Another echoes that sentiment, writing, “For geopolitical questions—where data is sparse, hard to structure, and where intuition, contextual interpretation, and complex human interactions on a highly aggregated level matter—I think models currently perform less well.”
  • ForecastBench structure advantages LLMs: Many high-forecast respondents—and in particular, superforecasters—emphasize that the way ForecastBench is structured advantages LLMs in a way that makes it more likely that one will hit the target score. The following arguments were common: “Improvements in Superforecaster accuracy over time are not taken into account, only our accuracy as of 2024. This freezes the target. AI models will improve, but for purposes of this question Superforecasters will not.”; “AI systems are improving rapidly and will keep improving, whereas they are being compared against a 2024 performance by a group of 39 superforecasters (who in the real world would have been constantly updating their predictions for unresolved questions…).” Some superforecasters also pointed to the limited number of human forecasters assigned to each question, as well as the non-complex, data-dependent nature of many of the questions, as additional factors that could contribute to an early resolution.
  • Beating the 95% confidence interval: Many late-resolution respondents emphasize that the resolution criterion is exceptionally demanding. One states: “This [is] a high bar—not just parity with 2024 superforecasters, but having a score outside the 95% confidence interval when, at present, all AIs have a score outside the 95% confidence interval in the other direction.” Another writes, “Beating the superforecaster median on ForecastBench after difficulty adjustment is a much higher bar than ‘be competitive’. The last bit of improvement is going to be brutally hard for AI: excellent calibration, restraint (knowing when not to be confident), and robustness across lots of weird question types.” Early-resolution respondents tend to minimize the gap between current performance and the specific target, with several suggesting that luck, paired with a high volume of attempts, will likely bridge it: “Given enough evaluations, eventually one will fall under this mark by chance…[Indeed, if] multiple models are tested every 2 weeks, the chance of this happening becomes quite high, so this question is really testing when models reach near parity, rather than any improvements over superforecasters."
  • Level of AI capability gains needed: Many early-resolution respondents suspect current architectures, augmented by “better scaffolding, tool use, [and] agentic search capabilities” will be sufficient to hit the target score: “Even if algo progress stopped, you can probably finish it off with better scaffolding,” writes one. Late-resolution respondents, however, often express doubt that current paradigms will be sufficient, with one noting that LLMs are “generally tuned to people-please,” and that this characteristic may prevent them from “making as extreme forecasts as they should.” In a similar vein, another argues, "Superforecasters are outliers. AI still averages. I think whatever it [is] that creates those outliers is difficult and it isn't clear it can be learned by a machine easily.”

Rationale examples, early-resolution respondents:

I'm optimistic this happens soon because forecasting is a domain where scaling + better training data + tool use can compound quickly, and AI capability is improving at an exponential pace. Once models are optimized specifically for probabilistic prediction (calibration, aggregation, retrieval, and reasoning over structured evidence), crossing a fixed threshold like 0.072 could happen abruptly.

I was among the participating Superforecasters, and many of the questions regarded weather forecasting, sports, and other highly aleatory and unpredictable themes in which the value a Superforecaster could bring is relatively low. Additionally, time was tight, and on certain questions, I spent significantly less time than I normally would. This led me to interpret the score on ForecastBench as a measure of how well an LLM can make good predictions in general, in comparison to the public and to generalists, rather than it being intended as a specific comparison to Superforecasters. [Therefore,] I’d expect LLMs to match or surpass Superforecasters on data-heavy questions (weather, financial, sports odds) very soon, while AI will continue lagging on questions that require a judgmental approach (geopolitics, fed rates).

The amount of data being generated by Polymarket and other prediction markets is likely to aid AI in mimicking the “secret sauce” that Supers currently bring to the table.

In the last couple of months, I have done some work as a contractor with one of the large LLM companies, on a project to improve the forecasting abilities of these models…This experience makes me think that with training on a lot of questions the models will be able to review more data and do more statistical analysis than human forecasters. There will be some questions where the model has an edge because of this. There will be other questions, those that are more judgemental in nature, that the models are likely to struggle with. What I wonder though, is whether anyone gets these 'change in direction' type questions right that often. [Success on ForecastBench] may be made easier by tactical decisions that maximise model advantages (data analysis and constant updating) and minimise their disadvantages (no real understanding) by sticking to the herd on these sorts of questions. In addition, [to resolve,] the question also only asks for one model to outperform in one year. Throw enough models with enough training at the problem and along with a good strategy, this seems pretty likely to me given that existing performance (albeit on what seem to have been simple questions) is pretty close to superforecasters.

Rationale examples, late-resolution respondents:

AI forecasting is improving quickly, but it’s still not at the point where it clearly and consistently outperforms the best human forecasters. Matching humans is one thing; reliably beating them across lots of tough, messy real-world questions is a much higher hurdle. That kind of jump usually takes time, not just smarter models but better judgment about uncertainty and enough real-world testing to show the gains are real. So while it wouldn’t be shocking if AI pulls this off sooner in a lucky scenario, the more realistic expectation is that it happens gradually…

LLMs…may struggle to produce forecasts that much [better than] what the average human would forecast. I.e., the LLM is itself a version of the wisdom of the crowd that uses probability to rank the likelihood of each member of the crowd being correct.

While recent progress suggests that AI systems may reach parity with human superforecasters in the near term, this question sets a materially higher bar: beating the lower bound of the 95% confidence interval of the 2024 Superforecaster median on a difficulty-adjusted basis. Crossing this threshold requires not just average parity, but robust and sustained outperformance beyond statistical noise. Short-term improvements may be driven by better calibration, ensemble methods, or benchmark-specific optimization, making early resolution plausible in a narrow sense. However, ForecastBench covers a mix of domains, including geopolitics and other contingent outcomes, where structural uncertainty, regime shifts, and sparse data limit purely algorithmic forecasting advantages. As a result, I expect diminishing returns once AI systems approach human-level performance on these tasks.

If systems are predicted to reach parity by 2026, then I think it will likely take another 10 years for the systems to reach the lower bounds of the 95% [confidence interval]. This prediction also considers that the present highest score is outside the [upper] bounds of [the 95% confidence interval].

Footnotes

  1. We did not directly ask forecasters to predict the gap between U.S. and Chinese model performance. Instead, we compute the gap from separately elicited forecasts for each. The reported gap may differ slightly from what forecasters would report if asked directly.

  2. We analyze both (1) the median difference between American and Chinese system scores across all forecasters, and (2) the median forecast for each system separately within each forecaster group. The former reveals the expected performance gap; the latter shows anticipated improvement trajectories. Note that these aggregates need not agree: a forecaster's p50 for two quantities separately may not exactly equal their p50 for the difference between those quantities, and the difference between the median forecasts for each system does not equal the median of individual forecasters' predicted differences. We expect any such discrepancies to be small given the similarity of the quantities being forecast. Additionally, we have capped forecasts greater than an ECI score of 1000 at that threshold.
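As a minimal illustration of why these two aggregates can diverge (using made-up numbers, not actual panel forecasts), the difference of group medians need not equal the median of individual forecasters' differences:

```python
import statistics

# Hypothetical ECI forecasts from three forecasters for each system.
us_scores = [100, 110, 120]
cn_scores = [90, 105, 85]

# Difference of the two group medians.
diff_of_medians = statistics.median(us_scores) - statistics.median(cn_scores)
# 110 - 90 = 20

# Median of each forecaster's own predicted gap.
median_of_diffs = statistics.median(
    [u - c for u, c in zip(us_scores, cn_scores)]
)
# median of [10, 5, 35] = 10

print(diff_of_medians, median_of_diffs)  # 20 10
```

With real panel data the two quantities track each other closely, but as the sketch shows, they are not identical in general.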

  3. The Epoch Capabilities Index is not a fixed absolute scale: its scoring is calibrated to arbitrary reference points (e.g., Claude 3.5 Sonnet = 130, GPT-5 = 150), and values may shift retroactively as new models and benchmarks are added to the jointly fitted model. Questions will resolve to the rescaled leaderboard values at the time of resolution, so the numerical targets may shift if Epoch recalibrates the index; anticipating such fluctuations was part of the question. Our primary interest is the U.S.–China capability gap, which is relatively robust to minor rescaling, but forecasters were instructed to factor this measurement uncertainty into their estimates.

  4. In some cases, the "aggregate" refers to the mean; in others, the median is used, depending on which is more appropriate for the distribution of responses.

  5. We occasionally elicit participants' quantile forecasts (estimates of specific percentiles of a continuous outcome) to illustrate the range and uncertainty of their predictions.

Cite Our Work

Please use one of the following citation formats to cite this work.

APA Format

Murphy, C., Rosenberg, J., Canedy, J., Jacobs, Z., Flechner, N., Britt, R., Pan, A., Rogers-Smith, C., Mayland, D., Buffington, C., Kučinskas, S., Coston, A., Kerner, H., Pierson, E., Rabbany, R., Salganik, M., Seamans, R., Su, Y., Tramèr, F., Hashimoto, T., Narayanan, A., Tetlock, P. E., & Karger, E. (2025). The Longitudinal Expert AI Panel: Understanding Expert Views on AI Capabilities, Adoption, and Impact (Working paper No. 5). Forecasting Research Institute. Retrieved 2026-02-24, from https://leap.forecastingresearch.org/reports/wave5

BibTeX

@techreport{leap2025,
    author = {Murphy, Connacher and Rosenberg, Josh and Canedy, Jordan and Jacobs, Zach and Flechner, Nadja and Britt, Rhiannon and Pan, Alexa and Rogers-Smith, Charlie and Mayland, Dan and Buffington, Cathy and Kučinskas, Simas and Coston, Amanda and Kerner, Hannah and Pierson, Emma and Rabbany, Reihaneh and Salganik, Matthew and Seamans, Robert and Su, Yu and Tramèr, Florian and Hashimoto, Tatsunori and Narayanan, Arvind and Tetlock, Philip E. and Karger, Ezra},
    title = {The Longitudinal Expert AI Panel: Understanding Expert Views on AI Capabilities, Adoption, and Impact},
    institution = {Forecasting Research Institute},
    type = {Working paper},
    number = {5},
    url = {https://leap.forecastingresearch.org/reports/wave5},
    urldate = {2026-02-24},
    year = {2025}
}