I noticed that a remarkable number of words starting with gr are still words if you swap the gr for h: the words in the title of this post, for example. How many words is this true for? And which pair of prefixes shares the most suffixes?
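
To make the question concrete, here is a minimal sketch of the check for a single pair of prefixes. It is not the script linked in this post: the function name `common_suffixes` and the toy word list are made up for illustration.

```python
def common_suffixes(words, prefix_a, prefix_b):
    """Return every suffix s such that prefix_a + s and prefix_b + s are both words."""
    words = set(words)
    # Strip each prefix off the words that start with it, keeping what is left over.
    suffixes_a = {w[len(prefix_a):] for w in words if w.startswith(prefix_a)}
    suffixes_b = {w[len(prefix_b):] for w in words if w.startswith(prefix_b)}
    return suffixes_a & suffixes_b


# Toy word list, made up for illustration only.
toy_words = ["grouse", "house", "grand", "hand", "great", "heat", "green", "hello"]
shared = common_suffixes(toy_words, "gr", "h")
print(sorted(shared))  # ['and', 'eat', 'ouse']
```

With a real word list, `len(shared)` is the number I'm after for that pair.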

Here’s a Python script I wrote to answer those questions. Here’s the list of words I used. And here are the results. I only looked at prefixes of one or two letters.
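
The linked script isn't reproduced here, but one way to do this kind of search can be sketched as follows. This is an illustration under my own assumptions, not necessarily how the script works: `rank_prefix_pairs`, `max_prefix_len` and the toy word list are hypothetical names and data, and I've assumed that a suffix has to be at least one letter long.

```python
from collections import defaultdict
from itertools import combinations


def rank_prefix_pairs(words, max_prefix_len=2):
    """Rank pairs of prefixes by how many suffixes they share, best first."""
    # Map each prefix of 1..max_prefix_len letters to the suffixes that follow it.
    suffixes = defaultdict(set)
    for word in set(words):
        # Stop one letter short of the whole word so the suffix is never empty.
        for n in range(1, min(max_prefix_len, len(word) - 1) + 1):
            suffixes[word[:n]].add(word[n:])

    # Count the shared suffixes for every unordered pair of prefixes.
    scores = {}
    for a, b in combinations(sorted(suffixes), 2):
        shared = suffixes[a] & suffixes[b]
        if shared:
            scores[(a, b)] = len(shared)

    # Highest count first; ties are broken alphabetically by pair.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy word list, made up for illustration only.
toy_words = ["grouse", "house", "grand", "hand", "band", "stand", "bring", "string"]
ranking = rank_prefix_pairs(toy_words)
print(ranking[:2])  # [(('b', 'st'), 2), (('gr', 'h'), 2)]
```

Building the prefix-to-suffixes map once means each pair only costs a set intersection, which should be manageable for the few hundred one- and two-letter prefixes that turn up in an English word list.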

The best pair was no and u. Here’s the list of suffixes they have in common. Most of them come from words which can be prefixed with either un or non: each such word X gives the shared suffix nX, since non-X is no + nX and un-X is u + nX. That isn’t very interesting, so I think the real winner is (b, st), with 1085 suffixes in common. It’s the first pair in the ranking where one of the prefixes is two letters and most of the words aren’t just words with another Latin prefix in front of them.

I could do loads of calculations like this. If you import the Python script as a module, you can have a look at all the data it computes. Very interesting!

The word list I used probably skewed the results quite a bit. It contains lots of words which are conjugations, pluralisations or other inflections of the same root word, as well as a load of really weird words which probably occur only once in the whole corpus. If I look at this again, I think I’ll use something like this frequency list, and use the frequencies of words as weights when scoring prefix pairs.
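
I haven't settled on how the weighting would work, so the following is only one possible sketch. It assumes that each shared suffix contributes the frequency of the rarer of its two words, so a pairing only scores highly when both words are common. The function `weighted_score` and the frequencies in `toy_freq` are invented for illustration and don't come from any real frequency list.

```python
def weighted_score(freq, prefix_a, prefix_b):
    """Sum, over shared suffixes, the frequency of the rarer word in each pairing.

    freq maps each word to how often it occurs in some corpus.
    """
    total = 0
    for word, count in freq.items():
        # Only consider words that start with prefix_a and have a non-empty suffix.
        if word.startswith(prefix_a) and len(word) > len(prefix_a):
            partner = prefix_b + word[len(prefix_a):]
            # A partner missing from the list counts as frequency 0.
            total += min(count, freq.get(partner, 0))
    return total


# Invented frequencies, for illustration only.
toy_freq = {"grand": 50, "hand": 400, "grouse": 2, "house": 900, "grunt": 5}
print(weighted_score(toy_freq, "gr", "h"))  # 50 + 2 = 52
```

Taking the minimum keeps one very common word from propping up a pairing with a very obscure one, which is the problem with the weird once-in-a-corpus words. Other choices, such as the product or the geometric mean of the two frequencies, would be just as easy to swap in.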