
BERT hates permutations…

Astonishingly, Google processes, or indexes, an estimated 130 trillion web pages, at least 25% of which, by volume, contain readable natural-language text.

You would think that there should be extremely few 5-word (or slightly longer) phrases that do not already exist as meaningful expressions in this massive text repository.

You would be wrong…

Meaningful natural-language word permutations are so numerous that even the fairly simple phrase above, “…think that there should be extremely…”, has, at the time of writing, zero results in Google.

Take the phrase “I really love coffee in the morning”. If each of its seven words can be swapped for any one of just 10 alternatives, e.g. “We definitely prefer tea over an evening”, the result is 10 to the power of 7, or 10 million, meaningful phrases…all merely describing the same general proposition: that we feel positive about some kind of beverage during different parts of the day.
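The arithmetic behind that figure can be checked with a short sketch. The word lists below are hypothetical and hold only two alternatives per slot so that the full enumeration stays small; the 10-alternative case is computed rather than enumerated.

```python
from itertools import product

# Hypothetical alternatives for each of the seven word slots in
# "I really love coffee in the morning" (two per slot, for illustration only).
slots = [
    ["I", "We"],
    ["really", "definitely"],
    ["love", "prefer"],
    ["coffee", "tea"],
    ["in", "over"],
    ["the", "an"],
    ["morning", "evening"],
]

# Enumerate every combination: one word chosen per slot, order preserved.
variants = [" ".join(words) for words in product(*slots)]
print(len(variants))  # 2 ** 7 combinations

# With 10 alternatives per slot the count is 10 ** 7.
# Computing the number is cheap; there is no need to enumerate the phrases.
options_per_slot = 10
total_with_ten = options_per_slot ** len(slots)
print(total_with_ten)
```

The count grows as (alternatives per slot) raised to the number of slots, which is why even a short sentence with a modest synonym list produces millions of variants.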

Google found that within three years of the release of GPT-2, the AI content in sampled websites increased nearly five-fold. Google’s BERT, a language model published in 2018 (shortly before GPT-2) and used in Search since 2019, is a powerful tool for recognising the writing styles and patterns often seen in AI-generated text: text that permutes an original into multiple copies substantially similar in meaning.

Best practices around AI generation are hugely important in the ongoing battle against low-quality website text, which is exactly what Google BERT hates. Google’s algorithms increasingly favour more nuanced, experience-based content, so it is critical that webmasters do not merely swap words in phrases while the meaning of the changed phrases keeps mirroring the original.

Even though phrases permuted from an original can register few Google results “hits”, BERT increasingly “understands” that AI generators often merely add duplicates to the vast ocean of meaningfully similar phrases.
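The gap between surface wording and meaning can be illustrated with a toy sketch. It is not how BERT works: BERT relies on learned contextual representations, whereas the hand-built CONCEPTS table below is a hypothetical stand-in, and all function names are invented for this example. It only shows that two phrases can share no words at all and still share every concept.

```python
# Hypothetical concept table: each word maps to a coarse meaning label.
# A real system would use learned embeddings instead of a lookup table.
CONCEPTS = {
    "i": "PERSON", "we": "PERSON",
    "really": "INTENSIFIER", "definitely": "INTENSIFIER",
    "love": "LIKE", "prefer": "LIKE",
    "coffee": "BEVERAGE", "tea": "BEVERAGE",
    "in": "PREPOSITION", "over": "PREPOSITION",
    "the": "DETERMINER", "an": "DETERMINER",
    "morning": "TIME_OF_DAY", "evening": "TIME_OF_DAY",
}


def tokens(phrase):
    """Lower-case the phrase and split it on whitespace."""
    return phrase.lower().split()


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lexical_similarity(p, q):
    """Overlap of the literal words in the two phrases."""
    return jaccard(tokens(p), tokens(q))


def concept_similarity(p, q):
    """Overlap after mapping each word to its concept label."""
    def to_concepts(phrase):
        # Words missing from the table fall back to themselves.
        return [CONCEPTS.get(t, t) for t in tokens(phrase)]
    return jaccard(to_concepts(p), to_concepts(q))


original = "I really love coffee in the morning"
variant = "We definitely prefer tea over an evening"

print(lexical_similarity(original, variant))  # no shared words
print(concept_similarity(original, variant))  # identical concept sets
```

An exact-phrase search behaves like the first measure and sees the variant as new; a meaning-aware model behaves more like the second and sees a near-duplicate.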

Vreeslik helps SEO practitioners by supplying a 24/7, real-time stream of information snippets on any single topic, in volumes sufficient to make it easy to see which phrases around that topic uniquely stand out, and easy to reject AI-generated source material that is terminologically different but semantically similar.