How many words differ only by letter duplication

I’m working with football (the European variant) commentary, looking at word frequency. Often enough, the commentary will spell their excitement by writing ‘Gooaal’, which for the purposes of my word frequency is different from the properly spelled ‘goal’. I’m considering capturing these cases by removing all successive duplicate letters. This question is not about how to do that, but I need a guess of how big an error will I create with this technique.

How many words collapse to the same word (string) if one removes duplicate letters? I just need a rough estimate. My guess is less than 1‰, but I’m not a native speaker, so I may be missing cases.
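One way to get such an estimate empirically is to run the collapse over a word list and count how many distinct words share a collapsed form with at least one other word. The sketch below assumes you supply your own word list (a dictionary file or the vocabulary of the commentary corpus); the function names `collision_groups` and `collision_rate` are made up for illustration. It counts distinct words, not occurrences, so it says nothing about how often the colliding words actually appear in the commentary.

```python
import re
from collections import defaultdict

# A letter followed by one or more copies of itself.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1+")


def collapse_repeats(word):
    """Replace each run of an identical letter with a single copy."""
    return _REPEATED_LETTER.sub(r"\1", word)


def collision_groups(words):
    """Map each collapsed form to the distinct words that produce it,
    keeping only forms shared by two or more words."""
    groups = defaultdict(set)
    for word in words:
        word = word.lower()
        groups[collapse_repeats(word)].add(word)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def collision_rate(words):
    """Fraction of distinct words that collide with at least one other word."""
    distinct = {word.lower() for word in words}
    if not distinct:
        return 0.0
    colliding = sum(len(group) for group in collision_groups(distinct).values())
    return colliding / len(distinct)


# Tiny hand-picked sample, only to show the call; a real estimate needs a full word list.
sample = ["feel", "fell", "teen", "ten", "goal", "match"]
print(collision_groups(sample))
print(collision_rate(sample))
```

Weighting each word by its frequency in the corpus would give a figure closer to the actual error in a word-frequency count.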

To make it clearer, here is the first paragraph of this question after running it through the algorithm:

I’m working with fotbal (the European variant) comentary, loking at
word frequency. Often enough, the comentary wil spel their excitement
by writing ‘Goal’, which for the purposes of my word frequency is
diferent from the properly speled ‘goal’. I’m considering capturing
these cases by removing al sucesive duplicate leters. This question is
not about how to do that, but I ned a gues of how big an eror wil I
create with this technique
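For reference, a minimal sketch of the transformation that appears to produce the text above: every run of the same letter is replaced by a single copy. It is case-sensitive and touches letters only, which is an assumption read off the sample output; the name `collapse_repeats` is made up for illustration.

```python
import re

# A letter (no digits, no underscore) followed by one or more copies of itself.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1+")


def collapse_repeats(text):
    """Replace each run of an identical letter with a single copy."""
    return _REPEATED_LETTER.sub(r"\1", text)


print(collapse_repeats("Gooaal"))
print(collapse_repeats("football commentary, looking at word frequency"))
```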

(I have no idea what tags are appropriate, please add more relevant ones.)

Answer

It would seem more efficient to write an algorithm that matches any number of repeated o’s and a’s, and so catches these variations on “Goal!”, than to eliminate all double letters in your entire database.
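A targeted rule of that kind could look like the sketch below. It assumes the word count is case-insensitive, so every variant is rewritten to lowercase “goal”; the pattern and the name `normalize_goal` are illustrative, not taken from any existing tool.

```python
import re

# "g", one or more "o", one or more "a", then "l", as a whole word.
# The trailing \b keeps words such as "goals" and "goalkeeper" untouched.
_GOAL_VARIANT = re.compile(r"\bgo+a+l\b", re.IGNORECASE)


def normalize_goal(text):
    """Rewrite stretched spellings such as 'Gooaal' to plain 'goal'."""
    return _GOAL_VARIANT.sub("goal", text)


print(normalize_goal("Gooaal! What a goooaaal from the goalkeeper's long ball"))
```

The same idea extends to a short list of other excited spellings, each with its own pattern, while leaving ordinary double letters alone.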

The frequently used verbs meet, speed, bleed, and feed form their past tenses by shortening the vowel: met, sped, bled, and fed. Collapsing double letters would therefore merge each of these verbs with its own past tense.

Other minimal pairs that will reduce to the same result:

refereed, referred
feel, fell
devotee, devotees; devote, devotes
divorcee, divorcees; divorce, divorces
stooped, stopped
steeped, stepped
career, carer
kneel, knell
pall, pal
teen, ten
commandeer, commander
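The pairs above, together with the four verbs and their past tenses, can be checked mechanically. The sketch below assumes the collapse is a plain “replace each run of a repeated letter with one copy” rule; the helper names are made up for illustration.

```python
import re

# A letter followed by one or more copies of itself.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1+")


def collapse_repeats(word):
    """Replace each run of an identical letter with a single copy."""
    return _REPEATED_LETTER.sub(r"\1", word)


PAIRS = [
    ("meet", "met"), ("speed", "sped"), ("bleed", "bled"), ("feed", "fed"),
    ("refereed", "referred"), ("feel", "fell"),
    ("devotee", "devote"), ("devotees", "devotes"),
    ("divorcee", "divorce"), ("divorcees", "divorces"),
    ("stooped", "stopped"), ("steeped", "stepped"),
    ("career", "carer"), ("kneel", "knell"), ("pall", "pal"),
    ("teen", "ten"), ("commandeer", "commander"),
]

# Pairs that would still be told apart after collapsing.
still_distinct = [(a, b) for a, b in PAIRS if collapse_repeats(a) != collapse_repeats(b)]
print(still_distinct)
```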

There are simply too many common words your program wouldn’t distinguish. Even if you had to deal with “Goooaaal!” and “Gooaal!” separately, abandoning this method of sorting your data would be a better use of your time and would risk skewing your results less.

Attribution
Source: Link, Question Author: Huang_d, Answer Author: KarlG
