
AI Content on Restored Websites: How to Detect It and What to Do About It

Published: 2026-02-13

When you restore a website from the Web Archive, you expect to get original content once written by real people. But if the site's snapshots were captured after 2023, there's a real chance of encountering texts generated by language models. Website owners mass-replaced original content with output from ChatGPT and similar tools, often without even trying to edit it. The result: you restore a website and, instead of original articles, you get reprocessed slop from a neural network.
This isn't just a matter of quality. Google is actively fighting low-quality AI content, and a restored website with such texts risks never making it into the index. In this article, we'll look at how to identify machine-generated texts on a restored website and what to do about them.

The mass adoption of generative AI for content creation began in late 2022, following the release of ChatGPT. By mid-2023, hundreds of thousands of websites already contained texts fully or partially written by language models. The reasons varied: some owners tried to save money on copywriters, others bulked up page counts for SEO, and still others were simply experimenting.
The Wayback Machine archives everything indiscriminately, with no filtering based on how content was created. If the Internet Archive bot visited a page and saw text, it saved it. It doesn't matter whether that text was written by a human or ChatGPT.
Websites that changed ownership between 2023 and 2025 are especially problematic. A typical scenario: a new owner buys a domain with history, deletes the old content, and uploads hundreds of AI articles to build traffic quickly. If the Web Archive captured exactly these snapshots, restoration gives you not the original website but its AI version.
Another situation: the website didn't change hands, but the editorial team decided to "refresh" old articles using AI. Technically the URLs stayed the same, but the page content changed completely. And it's precisely these updated versions that may have been captured in the latest archives.

Before diving into AI text detection, it's worth trying to avoid the problem at the restoration stage. If the website existed before 2023, it makes sense to use a date limit when downloading through Archivarix.
Our system has a "BEFORE" parameter for website restoration that lets you set an upper date boundary for snapshots. By setting this limit to the end of 2022, you virtually guarantee yourself content free of AI generation. Of course, you'll lose all updates made after that date, but for many tasks this is an acceptable trade-off.
If you specifically need recent snapshots, or if the website was created in the AI era, you'll have to check the content manually.

Over two-plus years of working with restored websites, we've accumulated enough experience to identify characteristic markers of machine-generated text. None of them is absolute proof on its own, but their combination allows for confident conclusions.
The first and most obvious sign is unnatural structure. AI texts are almost always broken into neat sections with subheadings, each paragraph roughly the same length, each point logically complete. Real text doesn't look like that. Real text has long paragraphs and short ones, digressions from the topic, uneven rhythm. When you open a page and see a perfectly symmetrical structure with headings like "What Is It," "Why It Matters," "How It Works," "Conclusion," that's a reason to be suspicious.
The second sign is characteristic vocabulary. Every language model has marker words that it uses disproportionately often. For English texts, these include "delve," "crucial," "landscape," "tapestry," "multifaceted," "it's important to note," "in today's rapidly evolving." If every page on the website contains the same constructions from this set, the text is almost certainly machine-generated. (This particular check is easy to automate; see the sketch after the fifth sign below.)
The third sign is a lack of specifics. AI is good at writing "in general" but struggles with details. If an article about car repair gets by without naming specific models, tools, and parts, and instead discusses "general principles of vehicle maintenance," it's most likely generated. A real author who knows the subject writes specifically: part numbers, wrench sizes, quirks of particular engines.
The fourth sign is uniform style across the entire website. On a real website with multiple authors, texts differ in style, depth, and approach. One author writes at length and in detail, another is brief and to the point, a third likes to insert personal stories. If all 200 articles on the site are written in the same "smooth" style with not the slightest variation, that's a sign of mass generation.
The fifth sign is content that doesn't match the publication date. AI texts often contain generic statements with no temporal anchoring. If an article is dated 2024 but contains not a single reference to specific events of that year, not a single mention of trends that were current at the time, that's suspicious. A real author almost always ties their text to the context of their time.
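All five signs can be checked by eye on a handful of pages, and the vocabulary check in particular is easy to automate across a whole site. A rough Python sketch that ranks a restored site's pages by marker density; the marker list is illustrative rather than exhaustive, and the "restored_site" directory is a placeholder:

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Illustrative subset of marker phrases; extend it with what you observe.
MARKERS = [
    "delve", "crucial", "tapestry", "multifaceted",
    "it's important to note", "in today's rapidly evolving",
]

def marker_density(html: str) -> float:
    """Marker hits per 1000 words of a page's visible text."""
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    words = len(text.split()) or 1
    hits = sum(len(re.findall(re.escape(m), text)) for m in MARKERS)
    return hits * 1000 / words

# Rank the restored site's pages, most suspicious first.
scores = {
    page: marker_density(page.read_text(errors="ignore"))
    for page in Path("restored_site").rglob("*.html")
}
for page, score in sorted(scores.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{score:6.2f}  {page}")
```

A high score doesn't prove anything on its own, but it tells you which pages to read first.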

Manually checking every page on a large website is unrealistic. Fortunately, tools exist that automate this work.
Among free tools, GPTZero and ZeroGPT stand out. Both work fairly reliably with English texts, less so with Russian, but are still suitable for initial screening. Their main limitation: they analyze texts one at a time, which is inconvenient for a website with thousands of pages.
A more serious approach is using the APIs of these same services for batch processing. GPTZero provides an API through which you can check all of a website's texts automatically: extract the text content of each page, send it for analysis, and receive a probability score for AI generation.
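As an illustration, a batch check might look like the sketch below. The endpoint, header, and response fields follow GPTZero's public API documentation at the time of writing and may change, so verify them against the current docs before relying on this:

```python
import requests

API_KEY = "your-gptzero-key"  # placeholder

def check_with_gptzero(text: str) -> dict:
    """Send one page's extracted text to GPTZero, return the raw verdict."""
    resp = requests.post(
        "https://api.gptzero.me/v2/predict/text",  # per docs at time of writing
        headers={"x-api-key": API_KEY},
        json={"document": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# `pages` maps URL -> visible text extracted beforehand (e.g. with BeautifulSoup).
pages = {"/articles/engine-oil.html": "Full visible text of the page..."}
for url, text in pages.items():
    verdict = check_with_gptzero(text)
    # Per-document scores live under "documents" in the v2 response;
    # check the current API docs for the exact field names.
    print(url, verdict.get("documents"))
```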
For those who prefer to work locally, open-source detection models exist, such as RADAR (Robust AI-text Detection via Adversarial leaRning) or detectors based on Ghostbuster. They require some computational resources but let you check texts without sending data to third parties.
The perplexity method deserves a separate mention. The idea is simple: AI text usually has low perplexity, i.e. it is highly predictable, because language models generate the most probable word sequences. If a text is too "smooth" and predictable from the perspective of a statistical language model, it was probably generated. Tools like Binoculars or DetectGPT work on exactly this principle.
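This approach is easy to try locally with any open causal language model. A minimal sketch using GPT-2 via the Hugging Face transformers library; any threshold you apply to the scores is an assumption that needs calibrating on texts you know are human-written:

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# A small causal LM is enough for a rough perplexity signal.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; lower means more predictable."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return math.exp(out.loss.item())  # exp of mean cross-entropy loss

# Suspiciously low values warrant a closer manual look; what counts as
# "low" depends on the corpus, so compare against known-human pages.
print(perplexity("It is important to note that regular maintenance is crucial."))
```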
It's important to understand that no detector provides 100% accuracy. Texts written by AI and then seriously edited by a human often pass checks as "human-written." And conversely, some authors with a very formal writing style are sometimes incorrectly identified as AI. Therefore, it's better to use automated checking results as a filter for manual analysis rather than as a final verdict.

Once you've identified which pages contain AI text, the question arises: what to do with them? There's no one-size-fits-all answer; it all depends on your goals.
If the purpose of the restoration is to preserve original content (for example, for historical reference or a portfolio), AI texts need to be removed. Try to find earlier snapshots of those same pages in the Web Archive where the original content still existed. Archivarix allows you to restore individual pages from different dates, so you can "assemble" a website from the best versions across different time periods.
If you're restoring a website for subsequent use and promotion, you have three options. First: completely rewrite the AI texts. This is the most reliable path. The rewriting will of course also be done by AI, but newer and more sophisticated. You're not going to do it yourself, right? Second: substantially edit them by adding specifics, personal experience, current data, and examples. A well-edited AI text can become a perfectly decent foundation. Third: delete the pages with AI content and keep only the originals.
In any case, it's not advisable to leave obviously machine-generated texts unchanged. Since 2024, Google has been consistently demoting websites with mass low-quality AI content. The Helpful Content updates and the March 2024 core update were aimed precisely at this. A restored website with hundreds of unprocessed AI articles has minimal chances of normal indexing.

The situation isn't always black and white. On many websites, AI was used selectively: for writing meta descriptions, generating FAQ sections, creating product descriptions in catalogs, or for "extending" existing articles. In such cases, a page contains a mix of original and machine-generated text.
Identifying such fragments is harder. Automated detectors usually give an averaged score for the entire text rather than pointing to specific paragraphs. Careful manual analysis helps here: if in the middle of a lively, emotional article a block suddenly appears with perfectly structured, dry, encyclopedia-style text, that block was most likely added with the help of AI.
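One practical workaround is to score each paragraph separately instead of the page as a whole. A sketch reusing the perplexity() function from the earlier example; the 40-word minimum and the threshold value are illustrative assumptions:

```python
def flag_paragraphs(page_text: str, threshold: float = 25.0) -> list[str]:
    """Return paragraphs whose GPT-2 perplexity falls below a threshold.

    Reuses perplexity() from the earlier sketch. Very short paragraphs
    are skipped because perplexity is unstable on them.
    """
    paragraphs = [p.strip() for p in page_text.split("\n\n") if len(p.split()) > 40]
    return [p for p in paragraphs if perplexity(p) < threshold]

for suspect in flag_paragraphs(open("article.txt").read()):
    print("Possible AI insert:", suspect[:80], "...")
```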
For FAQ sections and meta descriptions, the issue is less critical. These elements are formal and templated by nature, and search engines don't judge them as strictly as the main content.

The problem of AI content in the Web Archive will only grow. By various estimates, between 20 and 40 percent of new internet content in 2026 is created with the involvement of generative models. This means the share of such texts in archives will only increase with each passing year.
For those who work with website restoration, this is a new reality that requires adaptation. The good news is that detection tools are also evolving, and identifying machine-generated text gets easier every year. The bad news is that generation models are improving too, and the line between human and machine text is gradually blurring.
In any case, checking content for AI generation should become a standard part of the website restoration process. It doesn't take much time but can save you from serious indexing problems down the road.

The use of article materials is allowed only if the link to the source is posted: https://archivarix.com/en/blog/ai-content/
