

How To Detect Text Generated by an AI
Detecting text generated by AI is an evolving challenge, especially as AI-generated text becomes increasingly sophisticated.
However, you can approach this challenge in a few ways:
1. Machine Learning Approach:
Collect a large dataset with examples of both AI-generated and human-written texts.
Use this dataset to train a binary classification model. Depending on the size and complexity of your data, you can use neural networks, SVMs, or simpler algorithms.
Features can include word frequency, sentence length, sentence structure, use of rare words, or other statistical properties.
Test the model on new data and iterate to improve its accuracy.
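As a rough illustration of the workflow above, here is a minimal sketch using only the Python standard library. It is not a production detector: the three features, the function names, and the nearest-centroid classifier (standing in for the "simpler algorithms" mentioned above) are all assumptions chosen for brevity. A real project would more likely use a dedicated machine learning library and far more data.

```python
import re
from statistics import mean


def extract_features(text):
    """Return [mean sentence length in words, type-token ratio, mean word length]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return [0.0, 0.0, 0.0]
    mean_sentence_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)  # vocabulary variety
    mean_word_len = mean(len(w) for w in words)
    return [mean_sentence_len, type_token_ratio, mean_word_len]


def train_centroids(labeled_texts):
    """labeled_texts is a list of (text, label) pairs.

    Returns {label: average feature vector for that label}.
    """
    grouped = {}
    for text, label in labeled_texts:
        grouped.setdefault(label, []).append(extract_features(text))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest to the text's features.

    Note: the features are not rescaled here, so the one with the largest
    numeric range (sentence length) dominates the distance.
    """
    features = extract_features(text)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))
```

To evaluate it, hold back part of the labeled dataset, call `predict` on those texts, and compare the results against the known labels.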
2. Stylistic Analysis:
AI-generated text may have certain patterns, like a lack of genuine emotion, overly verbose phrasing, or a lack of certain idiomatic expressions.
Detecting these stylistic quirks may help in identifying AI-generated content.
3. Consistency Check:
Text that an AI generates in separate chunks may lose context from one chunk to the next. Checking that the theme or topic stays consistent throughout the text can help.
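One crude way to check consistency is to measure how much vocabulary neighboring paragraphs share. The sketch below uses Jaccard similarity over content words. The stopword list, the function names, and the 0.05 threshold are illustrative assumptions, and a sharp topic change is also something human writers do on purpose, so treat a flag as a prompt for a closer look and not as proof.

```python
import re

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "this", "for", "on", "with", "as", "are", "was", "be",
}


def content_words(text):
    """Lowercased set of words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def adjacent_similarity(paragraphs):
    """Jaccard similarity between each paragraph and the one after it."""
    word_sets = [content_words(p) for p in paragraphs]
    return [jaccard(x, y) for x, y in zip(word_sets, word_sets[1:])]


def flag_topic_shifts(paragraphs, threshold=0.05):
    """Indices i where paragraph i and paragraph i + 1 share almost no vocabulary."""
    scores = adjacent_similarity(paragraphs)
    return [i for i, score in enumerate(scores) if score < threshold]
```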
4. Anomaly Detection:
Instead of looking for AI-specific patterns, look for anomalies in the text that wouldn’t typically appear in human-written text. This is a broader approach and might have some overlap with the stylistic analysis.
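A simple version of this idea is to measure a few statistics on text you know was written by people, then ask how far a new text falls from that baseline. The sketch below scores each feature by its z-score (distance from the baseline mean, in standard deviations). The feature dictionaries, function names, and the threshold of 3 are assumptions for illustration, and the baseline needs at least two samples.

```python
import math
from statistics import mean, stdev


def fit_baseline(samples):
    """samples is a list of {feature_name: value} dicts from human-written texts.

    Returns {feature_name: (mean, standard deviation)}.
    """
    names = list(samples[0].keys())
    baseline = {}
    for name in names:
        values = [sample[name] for sample in samples]
        baseline[name] = (mean(values), stdev(values))
    return baseline


def anomaly_scores(baseline, features):
    """Absolute z-score of each feature relative to the baseline."""
    scores = {}
    for name, (mu, sigma) in baseline.items():
        if sigma == 0:
            # No variation in the baseline: any difference at all is off the scale.
            scores[name] = 0.0 if features[name] == mu else math.inf
        else:
            scores[name] = abs(features[name] - mu) / sigma
    return scores


def is_anomalous(baseline, features, z_threshold=3.0):
    """True if any single feature is further than z_threshold from the baseline."""
    return any(z > z_threshold for z in anomaly_scores(baseline, features).values())
```

The feature values themselves can come from any measurement you trust, such as average sentence length or vocabulary variety.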
5. Metadata Analysis:
If the text is from online sources, metadata (like timestamps or posting frequency) can sometimes give clues. For instance, an AI might produce content at an unnaturally consistent pace.
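To make "unnaturally consistent pace" concrete, one option is the coefficient of variation of the gaps between posts: the standard deviation of the gaps divided by their mean. A value near zero means the posts arrive like clockwork. The function names and the 0.1 threshold below are assumptions, and scheduled posts by a human would trip the same check, so this is only one clue among several.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive posts.

    timestamps is a list of datetime objects. Returns None when there are
    fewer than three timestamps or all posts share one instant.
    """
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)


def looks_machine_paced(timestamps, cv_threshold=0.1):
    """True if the posting gaps are nearly identical."""
    cv = interval_cv(timestamps)
    return cv is not None and cv < cv_threshold
```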
6. Using Pre-existing Tools:
There are already tools and APIs available that attempt to detect deepfakes and AI-generated content. Incorporating these into your solution might give you a head start.
7. Frequency Analysis:
Repetitive or unnaturally frequent use of certain words or phrases can signal AI-generated text.
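A quick way to surface repeated phrasing is to count word n-grams (runs of n consecutive words) and see which ones recur. This sketch is an illustration, with hypothetical function names and a default of three-word phrases. What counts as "unnatural" frequency depends on the length and genre of the text, so compare against similar human-written material before drawing conclusions.

```python
import re
from collections import Counter


def repeated_ngrams(text, n=3, min_count=2):
    """Return {phrase: count} for every n-word phrase seen at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(gram): count for gram, count in grams.items() if count >= min_count}


def repetition_rate(text, n=3):
    """Fraction of all n-gram occurrences that belong to a repeated n-gram."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words) - n + 1, 0)
    if total == 0:
        return 0.0
    return sum(repeated_ngrams(text, n).values()) / total
```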
8. Sentence Complexity and Structure:
Analyzing the complexity of sentences, their structures, and the transition between ideas can give hints about the text's origin.
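As a starting point, the sketch below measures three things: the average sentence length, how much sentence length varies, and how often a sentence opens with a stock transition word. The transition list and function name are assumptions, and the sentence splitter is deliberately naive (it will stumble on abbreviations like "Dr."). Uniform sentence lengths and heavy use of formal transitions are often suggested as machine tells, but neither is conclusive.

```python
import re
from statistics import mean, pstdev

# Illustrative list only; extend it to suit the material you are checking.
TRANSITION_WORDS = {
    "however", "moreover", "furthermore", "therefore",
    "additionally", "consequently",
}


def sentence_profile(text):
    """Summarize sentence length and transition usage, or None if no words are found."""
    # Split wherever sentence-ending punctuation is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    tokenized = [tokens for tokens in tokenized if tokens]
    if not tokenized:
        return None
    lengths = [len(tokens) for tokens in tokenized]
    transition_openers = sum(1 for tokens in tokenized if tokens[0] in TRANSITION_WORDS)
    return {
        "sentences": len(lengths),
        "mean_length": mean(lengths),
        "length_stdev": pstdev(lengths),  # low values mean very uniform sentences
        "transition_opener_share": transition_openers / len(lengths),
    }
```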
Remember, the boundary between AI-generated and human-written text is blurring rapidly. A program created today might have difficulty detecting the AI-generated content of tomorrow. It's essential to keep updating the model and refining your approach.
[SIDE NOTE] As a writer, I suspect numbers 2, 4, and 8 are the most likely sources of useful markers. I’m also curious to see how generated language degrades when it is sourced from its own output, similar to the way genes do.