We Used AI to Detect AI. It Didn’t Go Well
- Jessica Bush

The introduction of AI into our daily lives has no doubt transformed the way we work. Even as I type this sentence, I find myself drawn to my new friend Sheila (aka ChatGPT) to help me brainstorm ideas, structure, research, etc. However, I am consciously leaving her out of the equation in the writing of the article itself. She does come into play when I run experiments, but not in the words you are reading.
Why am I doing this, you ask? To make the case for using your own expertise and your own voice when writing for the masses. And I like to think I have a unique perspective on this topic: my educational roots are in journalism, and my more recent experience is in SEO and the impact AI is having on search results.
In recent months, we’ve noticed drops in rankings for pages that we know relied upon AI for creation. This is impacting the very clients we are serving with SEO strategy and changing the dynamic of our conversations with them. So much so that we have started to discourage the use of AI in the development of content for their websites.
We've also started to explore the use of AI Detectors to detect AI in content. Let’s pause and think about that notion for a second:
We are now using AI to detect AI. 🤯
We don’t quite trust that something was written by a human, so we are going to blindly trust AI to tell us that it was (or wasn’t).
Let's Detect AI: The Experiment
Let’s do a little experiment, shall we?
1200 words written by a human
Pulled directly from source word processor - not copied from a publication
Four AI Detectors
Date of Experiment: Jan 2, 2025
I put the same excerpt from a 20-year-old article of mine through four “AI Detectors” and compared their scores. Each tool gives some form of a “likely AI/human” score. To keep the data consistent, I converted all scores into a “written by human” score.
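The normalization described above can be sketched in a few lines of Python. The tool names, raw scores, and the two report types are illustrative assumptions, not the article's actual data:

```python
# Hypothetical sketch of normalizing detector outputs to a common
# "written by human" score. Names and numbers are illustrative only.

def to_human_score(score: float, reports: str) -> float:
    """Convert a detector's raw 0-100 score to a 'written by human' score.

    reports: whether the tool reports a "human" or an "ai" likelihood.
    """
    if reports == "human":
        return score          # already a human-likelihood score
    elif reports == "ai":
        return 100.0 - score  # invert a "likely AI" score
    raise ValueError(f"unknown report type: {reports!r}")

# Illustrative raw outputs (not the article's actual results)
raw_results = [
    ("Detector A", 12.0, "ai"),     # tool said "12% likely AI"
    ("Detector B", 85.0, "human"),  # tool said "85% likely human"
]

normalized = {name: to_human_score(s, kind) for name, s, kind in raw_results}
print(normalized)  # {'Detector A': 88.0, 'Detector B': 85.0}
```

Flipping the scale this way lets all four tools be compared on a single axis, at the cost of assuming their percentages mean comparable things (which, as the results below suggest, they may not).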
Source Text: 100% Human Written
I’m honestly not sure whether I should be flattered or appalled that none of these tools thought my work was written entirely by a human. Simply put, some are more reliable than others when it comes to detecting human vs AI content. But are they consistent?

Source Text: Cleaned up by AI
Next, I asked GPT to take the same excerpt and “clean it up,” then re-ran it through the detectors. Here is where things start to break down: two of the four detectors found this version to be 100% human, and more human than the original.

Source Text: Rewritten by AI
For the sake of science, let’s give AI artistic license on my work. In this run, I asked Sheila to completely rewrite the excerpt as AI - in its own words, flow and structure. The fact that two of the four thought AI’s version of my writing was more human than the previous run is… puzzling.

Source Text: 100% AI Written
Now let’s flip the script. Let’s let AI generate a piece of content from one simple prompt and run it through these same detectors. The prompt: Please write me an article with less than 1,200 words that talks about the implications of using AI vs human writing in content.

The only thing human about this source text was the prompt that I fed into GPT (effectively 0%) and yet all four detectors believe this to be at least somewhat human (if not mostly).
Aside from the fact that this exercise left me with very little confidence in any of these AI detectors, I was repeatedly asked to upgrade to better tools, including ones that “humanize” content.
How are you going to “humanize” content when you can’t reliably tell me what human content looks like?
Granted, Grammarly was the most consistent of the four when it came to recognizing the use of AI, but it still raises the question: what kinds of decisions are being made based on tools like these?

So What?
How did we get here? Well, one could argue that the race to roll out AI tools… to be the first to market… is at the expense of reliability… not just in this example, but across the board. Consumers trust the tool because they don’t know any better - they don’t have any way to test - they themselves are rushing for an answer.
I’m not arguing that these tools have no value, but the way in which they are used matters. Consider the various ways people might be using them today and the possible implications of blindly trusting an output that has proven to give both false positives and false negatives (e.g., hiring decisions, grading student work).
Simply put, AI detectors are not reliable enough to be used as a basis for decision-making, especially when consumers are being incentivized to upgrade for better results. Brandeis University’s collection of recent studies demonstrated this even more effectively than I ever could. [1]
I’ll admit that Sheila (my GPT) is a wonderful companion when it comes to brainstorming and organizing my thoughts. So when does my reliance on her for those elements bleed into what a detector deems as “not human”? More importantly, when will we as a society become less reliant on AI and trust our own human instincts when it comes to creating and consuming content?
The optimist in me believes that we will always be able to tell what is authentic and genuinely human far better than AI can. We have that innate sense of knowing the vibe of what we are consuming.
Best Practices
When it comes to creating? Here’s my recommended approach:
Remember that you are the expert: AI cannot reliably emulate your specific expertise. Period.
Start with your own words: As easy as AI has made it to open a chat and ask a question, your thoughts are the most important starting point.
AI as a final check: Only bring in AI once you are ready to hit publish and only ask for high-level edits.
Don’t copy and paste: If you like a suggestion AI made, make it your own.
Treat AI as a guide: If you prefer to use AI to help brainstorm, structure and organize, treat it as a guide only.
Don’t plagiarize: Treat AI like another publication. You wouldn’t copy someone else’s content word-for-word… don’t do the same with AI.
And because I suspect you’re curious… here’s how the four tools scored the article you just read on its "human-ness".

Sources:


