How the Internet Archive Decides What to Archive: Priorities, Frequency, and Data Sources

Published: 2026-03-06

One trillion saved pages. Over 99 petabytes of data. Hundreds of crawls running simultaneously every day. Behind these numbers lies a question that everyone who works professionally with web archives eventually asks: how exactly does the Wayback Machine decide which sites to scan and how often to return to them, and why are some domains represented in the archive with thousands of snapshots while others have only a few records over ten years?
Understanding these mechanisms is critically important for anyone involved in website restoration. If you know how the system works from the inside, you can predict what you'll find in the archive and what won't be there. And you can influence the archiving of your own sites while they're still live.


A Brief History: From Alexa Internet to In-House Infrastructure
To understand how everything works now, you need to know how it all started. The history of web archiving at the Internet Archive is inseparably linked to a company called Alexa Internet.
In 1996, Brewster Kahle founded two projects simultaneously: the Internet Archive and Alexa Internet. Alexa was in the business of analyzing web traffic, and for this purpose it scanned the internet with its own crawlers. The collected data was then donated to the Internet Archive. It was a symbiotic model: Alexa got data for its analytics, and the Internet Archive received enormous volumes of web pages without needing to build its own crawling infrastructure.
In 1999, Amazon acquired Alexa for $250 million. But the partnership with the Internet Archive continued: Amazon/Alexa still scanned the internet and passed the crawls to the archive. For nearly two decades, Alexa was the primary data supplier for the Wayback Machine.
In parallel, the Internet Archive was developing its own capabilities. In 2003, Heritrix was created, an open-source crawler written in Java. Initially, Heritrix was used for relatively small targeted crawls, but starting in 2008, the Internet Archive began scaling up its own scanning. By 2010, the so-called "Worldwide Web Crawls" were launched, global crawls that systematically traverse the entire accessible internet.
The turning point came on May 1, 2022, when Amazon shut down the Alexa Internet service. The Alexa Rank, which had been the standard website popularity metric for two decades, ceased to exist. Along with it, the flow of crawls from Alexa to the Internet Archive stopped.
This was a serious challenge. The Internet Archive lost one of its largest data sources and was forced to rely entirely on its own infrastructure and partnerships. For those who work with web archives, this means one important thing: the nature and completeness of archival data before 2022 and after 2022 may differ significantly.


Where Wayback Machine Data Comes From Today
Data enters the Wayback Machine from many sources. If you hover over a dot on the Wayback Machine calendar, you'll see a "why" label showing which collection or crawl a given snapshot belongs to. Each of these crawls has its own story.
Internet Archive's own crawls. This is the primary data source. The Heritrix crawler (along with newer tools such as Brozzler for dynamic content) traverses billions of pages. "Worldwide Web Crawls" have been running continuously since 2010 and represent massive scans of the entire accessible web. A single such crawl can take months: for example, "Wide Crawl Number 13" started in January 2015 and was only completed in July 2016.
Save Page Now. Since October 2013, any user can manually save a page through the Wayback Machine interface. Just enter a URL and press the button. Only one specific page is saved, not the entire site. An important nuance: Save Page Now does not add the URL to a list for future automatic crawls. It's a one-time save.
Cloudflare Always Online. In September 2020, the Internet Archive entered into a partnership with Cloudflare. Cloudflare customers who enable the Always Online feature automatically share information about their sites with the Internet Archive. Cloudflare identifies the most popular URLs on a site (based on GET request statistics with a 200 status code over the previous five hours) and sends them to the Wayback Machine for scanning. This is a significant source: many sites using Cloudflare may have been previously unknown to the Internet Archive's crawlers.
Archive-It. This is a paid subscription service through which libraries, universities, government agencies, and other organizations commission regular scanning of specific sites or collections. Archive-It allows you to configure exactly what to scan and how frequently. The results end up in the Wayback Machine. Many government websites are preserved specifically through Archive-It, especially during administration transitions.
Archive Team. A volunteer group of enthusiasts who independently archive internet content that is at risk of disappearing. When the shutdown of a service is announced (GeoCities, Google+, Vine, Yahoo Answers), Archive Team organizes mass downloading of content before the shutdown date. A significant portion of this data is transferred to the Internet Archive.
Common Crawl. The Internet Archive imports mirrors of crawls from the Common Crawl project, which we wrote about in detail in a separate article. This is an additional source that expands coverage.
Links from Wikipedia. The Internet Archive systematically archives URLs referenced in Wikipedia articles. This makes sense: if a link is used as a source in an encyclopedia, it should remain accessible in the future.
Other sources. Grants and partnerships, for example with the Sloan Foundation, the U.S. National Archives and Records Administration (NARA), and the former Internet Memory Foundation. Each of these partners contributes their own crawl collections.


What Determines Crawling Frequency
This is the key question, and the answer isn't as straightforward as one might wish. The Internet Archive doesn't publish a crawl prioritization formula, but from observations, documentation, and staff statements, the general picture can be reconstructed.
Link connectivity. The official Internet Archive documentation states directly: "crawls tend to find sites that are well linked from other sites." The Heritrix crawler works by following links: it visits a page, finds links, and follows them. The more links point to your site from other resources, the higher the probability that the crawler will reach you. This is similar to PageRank logic: well-connected sites are discovered and scanned more frequently.
Seed list. Every global crawl starts with a set of initial URLs from which the crawler "fans out" by following links. The closer your site is to these seed URLs, the sooner and more completely it will be scanned. In previous years, seed lists were formed based on Alexa data (lists of the most visited sites). After Alexa's shutdown in 2022, these lists are formed from the Internet Archive's own data, partner data, and previously known domains.
Crawl depth. Every crawl has a depth limit: how many "clicks" from the starting page the crawler is willing to go. For large global crawls, depth is usually limited to allow covering the maximum number of domains. This means that internal pages of small sites may not make it into the archive, even if the homepage is saved.
Cloudflare and other automated sources. If your site runs through Cloudflare with Always Online enabled, its popular pages will automatically be sent for scanning. Frequency depends on the Cloudflare plan. This is one of the most reliable ways to ensure regular inclusion in the archive.
Manual requests via Save Page Now. Every save through Save Page Now creates a record in the archive. Some users and bots systematically save certain sites, creating regular snapshots.
Multiple parallel crawls. At any given moment, hundreds of different crawls are running simultaneously with different purposes and scales. A single site can end up in multiple crawls: global, thematic, regional, commissioned through Archive-It. Therefore, the scanning frequency for the same site can vary greatly: dozens of snapshots in some months, none in others.


Why Your Site Might Not Be in the Archive
Despite a trillion saved pages, much of the internet never makes it into the Wayback Machine. The main reasons for absence:
robots.txt. If a site's robots.txt file prohibits access for crawlers, Heritrix respects this. Moreover, the Wayback Machine has historically applied robots.txt retroactively: if the current robots.txt blocks the crawler, even old archived copies made before the restriction appeared are hidden. However, in recent years the Internet Archive has begun reconsidering this policy.
Dynamic content. Pages that are entirely generated by JavaScript in the browser (React, Vue, Angular SPAs) are poorly saved or not saved at all. Heritrix receives an "empty" HTML template from the server without content. The newer Brozzler crawler solves this problem by using a real browser to render pages, but its coverage is still significantly smaller than Heritrix's.
Content behind authentication. Pages accessible only after login, payment, or form submission are inaccessible to crawlers. This applies to online banking, personal accounts, and paywalled content.
The crawler simply didn't know about the site. If a site is new, has no incoming links, and isn't registered in directories, the crawler may never discover it. Before 2022, Alexa helped discover such sites through its toolbar. After Alexa's shutdown, this discovery channel stopped working.
IP or User-Agent blocking. Some sites block Internet Archive crawlers at the server level. In 2025-2026 this became especially relevant: major publishers like the New York Times, The Guardian, and Reddit began blocking archive.org_bot due to concerns that their content was being used via the Wayback Machine to train AI models.


The Scale of the Blocking Problem in 2025-2026
This deserves special attention. According to Nieman Lab, publisher blocks on archive crawlers led to an 87% drop in the volume of saved pages from news publications between May and October 2025. The New York Times completely blocked Internet Archive crawlers and added archive.org_bot to its robots.txt. The Guardian restricted access to articles, leaving only homepage and section pages available for archiving. The Financial Times blocks all external bots, including the Internet Archive.
The reason isn't the Internet Archive itself, but publishers' concerns that AI companies use the Wayback Machine as a convenient data source for model training. An analysis of the Google C4 dataset showed that web.archive.org was among the top 200 most represented domains in the training data for the T5 and LLaMA models. And in May 2023, an AI company was sending tens of thousands of requests per second to Internet Archive servers, which led to a temporary service outage.
For website restoration, this means that archived copies of major media resources will become increasingly incomplete. If a site linked to materials from blocked publications, those links may lead nowhere.


How to Influence the Archiving of Your Site
If you want your site to make it into the Wayback Machine and be scanned regularly, here's what you can do.
Make sure robots.txt doesn't block crawlers. Check that your robots.txt doesn't contain Disallow directives for User-Agent: ia_archiver or User-Agent: archive.org_bot. If you previously blocked these bots, remove the rules.
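As a quick self-check, Python's standard urllib.robotparser can tell you how archive.org_bot would interpret your rules. The robots.txt content below is a made-up example for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; in practice, fetch yoursite.com/robots.txt
robots_txt = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The Wayback Machine crawler identifies itself as archive.org_bot.
# Under these rules it may fetch the homepage but not /private/.
print(parser.can_fetch("archive.org_bot", "https://example.com/"))
print(parser.can_fetch("archive.org_bot", "https://example.com/private/x"))
```

If the second check prints False for pages you want archived, remove or narrow the corresponding Disallow rule.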
Use Save Page Now. Regularly save important pages of your site manually. This can be automated through the Save Page Now API. The service is free but has a limit of 15 requests per minute.
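A minimal sketch of automating this, assuming the public GET endpoint at web.archive.org/save/&lt;url&gt; (the authenticated SPN2 API offers more control and is not shown here). The pacing helper simply derives the delay implied by the 15-requests-per-minute limit:

```python
import time
import urllib.request

SPN_ENDPOINT = "https://web.archive.org/save/"
MAX_PER_MINUTE = 15  # documented Save Page Now rate limit

def save_request_url(page_url: str) -> str:
    """URL that asks Save Page Now to capture a single page."""
    return SPN_ENDPOINT + page_url

def pacing_interval(max_per_minute: int = MAX_PER_MINUTE) -> float:
    """Seconds to wait between requests to stay within the limit."""
    return 60.0 / max_per_minute

def archive_pages(urls):
    """Submit each URL for archiving, sleeping between requests."""
    for i, url in enumerate(urls):
        urllib.request.urlopen(save_request_url(url))  # network call
        if i < len(urls) - 1:
            time.sleep(pacing_interval())
```

Remember that each request saves only the one page submitted, so you would list every important URL explicitly (for example, from your sitemap).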
Enable Cloudflare Always Online. If your site runs through Cloudflare, activate Always Online in the settings. This will automatically ensure regular archiving of your most popular pages.
Ensure good link connectivity. The more other sites link to yours, the higher the probability that the crawler will discover it and keep returning. This works for both search engines and archives.
Add your URLs to Wikipedia. If your content can be useful as a source in Wikipedia articles, adding links will increase the chances of regular archiving. But don't abuse it: Wikipedia strictly moderates links, and spam doesn't survive there.
Consider Archive-It. If you represent an organization that is required to preserve web content (library, university, government agency), an Archive-It subscription gives you full control over scanning frequency and depth.


What Gaps in the Archive Mean
When you open a site's timeline in the Wayback Machine and see gaps (months or years without snapshots), it can mean different things.
The site simply wasn't included in a crawl during that period. Small sites with few incoming links are scanned irregularly. A gap of several months for such a site is normal.
The owner blocked archiving. If robots.txt prohibited access, no snapshots were created. And if the restriction was added later, previously made snapshots may have been hidden retroactively.
Technical problems with the site. If the site was returning 5xx errors or was unavailable at the time of the crawl, the snapshot may not be saved or may be saved with an error mark (a red dot on the Wayback Machine calendar).
The domain expired and was parked. During the period between a domain's expiration and its re-registration, it usually resolves to a parking page. That parking page may itself be saved in the archive, which is important to consider during restoration (this is exactly why the BEFORE parameter exists in Archivarix).
Change in crawling infrastructure. After Alexa's shutdown in 2022, some sites that were previously scanned regularly by Alexa and passed to the Internet Archive may have temporarily fallen out of coverage until the Internet Archive's own crawls compensated for the loss.
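Gaps like these are easy to inspect programmatically via the public Wayback CDX API, which lists every capture of a URL. A sketch under stated assumptions: the query builder targets the documented cdx/search/cdx endpoint, while find_gaps and its 90-day threshold are illustrative choices, not anything the Internet Archive prescribes:

```python
from datetime import datetime
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url, from_date=None, to_date=None):
    """Build a CDX API query listing snapshots for a URL as JSON."""
    params = {"url": url, "output": "json", "fl": "timestamp,statuscode"}
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    return CDX_ENDPOINT + "?" + urlencode(params)

def find_gaps(timestamps, min_days=90):
    """Return (start, end) date pairs where consecutive snapshots
    are more than min_days apart. Timestamps are CDX-style
    YYYYMMDD... strings; only the date part is used."""
    dates = sorted(datetime.strptime(t[:8], "%Y%m%d") for t in timestamps)
    gaps = []
    for a, b in zip(dates, dates[1:]):
        if (b - a).days > min_days:
            gaps.append((a.date().isoformat(), b.date().isoformat()))
    return gaps
```

Fetching the query URL and feeding the returned timestamps into find_gaps quickly shows whether a site's coverage is continuous or has multi-month holes worth investigating.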


The Technical Side: How Heritrix Works
For the curious: the Heritrix crawler (the name means "heiress" in archaic English) works on the following principle. It starts with a seed list of initial URLs, loads each page, extracts all links from it, queues them, and moves on to the next ones. A separate queue is maintained for each domain to avoid overloading a single server with too many requests.
Heritrix respects robots.txt and META nofollow tags. It also adapts its scanning speed: if a server responds slowly, the crawler reduces request frequency. Downloaded data is saved in WARC (Web ARChive) format, which is the ISO standard for storing web archives.
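The frontier logic described above (a seed list, link extraction, one queue per domain) can be sketched in miniature. This toy Frontier is an illustration of the idea, not Heritrix's actual data structures:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Simplified crawl frontier: one FIFO queue per host,
    round-robin dispatch so no single server is flooded."""

    def __init__(self, seeds):
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.seen = set()                 # avoid re-queueing URLs
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a discovered URL under its host, once."""
        if url not in self.seen:
            self.seen.add(url)
            self.queues[urlparse(url).netloc].append(url)

    def next_batch(self):
        """Take at most one URL per host, emulating per-domain pacing."""
        batch = []
        for host, q in list(self.queues.items()):
            if q:
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
        return batch
```

A real crawler would fetch each batch, extract links from the responses, and feed them back in via add(), repeating until the queues drain or a depth or budget limit is hit.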
For dynamic content, Brozzler was created in 2015; it uses a real browser (Chromium) to render pages before saving them. Brozzler also integrates youtube-dl for downloading media content. However, Brozzler is significantly slower and more resource-intensive than Heritrix, so it is mainly used for targeted crawls through Archive-It rather than for global scanning.
Another component is Umbra, an intermediary layer between Heritrix and a browser that allows Heritrix to "see" links generated by JavaScript without fully rendering the page in a browser.
After scanning, data is processed and indexed. As of 2026, the lag time between scanning a page and its appearance in the Wayback Machine is 3 to 10 hours.


Practical Takeaways for Website Restoration
Everything described above has direct relevance for website restoration work through Archivarix.
Popular, well-connected sites have the most complete archives. If you're restoring a major site with a good backlink profile, the archive will most likely contain many snapshots across different dates, and you'll be able to choose the most suitable one.
For small sites, the archive may be sparse. Be prepared for the possibility that the version you need may not exist. In that case, it's worth checking Common Crawl, search engine caches, and Archive.today as additional sources.
Pay attention to the crawl source. If a snapshot was made through Save Page Now (a single specific page), it may not contain the full set of resources (images, CSS, scripts) needed for a visually complete restoration.
Archives after 2022 may have a different coverage pattern. After Alexa's shutdown, some categories of sites may be represented less fully than in previous years.
The 2025-2026 blocks may create "holes" in media resource archives. If the site being restored linked to materials from major publications that have blocked the Internet Archive, those links may be lost.

The use of article materials is allowed only if the link to the source is posted: https://archivarix.com/en/blog/inside-archive-org/
