I’m working with football (the European variant) commentary, looking at word frequency. Often enough, the commentary will spell their excitement by writing ‘Gooaal’, which for the purposes of my word frequency is different from the properly spelled ‘goal’. I’m considering capturing these cases by removing all successive duplicate letters. This question is not about how to do that, but I need a guess of how big an error I will create with this technique.
How many words collapse to the same word (string) if one removes duplicate letters? I need just a rough estimate. My guess is less than 1‰, but I’m not a native speaker, so I may miss cases.
To make it clearer, here is the first paragraph of this question after running it through the algorithm:
I’m working with fotbal (the European variant) comentary, loking at
word frequency. Often enough, the comentary wil spel their excitement
by writing ‘Goal’, which for the purposes of my word frequency is
diferent from the properly speled ‘goal’. I’m considering capturing
these cases by removing al sucesive duplicate leters. This question is
not about how to do that, but I ned a gues of how big an eror I wil
create with this technique.
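For reference, the collapsing step described above can be sketched as a one-line regex substitution (a sketch of the idea, not necessarily how you’d implement it in production):

```python
import re

def collapse_repeats(text):
    """Collapse any run of the same character to a single character.

    Note that this also damages legitimate double letters,
    e.g. 'football' becomes 'fotbal'.
    """
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_repeats("Gooaal"))    # -> Goal
print(collapse_repeats("football"))  # -> fotbal
```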
(I have no idea what tags are appropriate; please add more relevant ones.)
It would seem more efficient to write an algorithm that captures any number of repeated o’s and a’s in these variations on “Goal!” than to eliminate all double letters in your entire database.
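That targeted approach, normalizing only the elongated “Goal!” variants instead of deduplicating everything, might look like this (the exact pattern is just an illustration; tune it to what actually appears in your data):

```python
import re

# Match 'goal' with any letter optionally stretched: g+o+a+l+, case-insensitive.
# Word boundaries keep unrelated words like 'goals' or 'goalie' untouched.
GOAL_RE = re.compile(r"\bg+o+a+l+\b", re.IGNORECASE)

def normalize_goal(text):
    """Replace elongated 'Gooaal'-style tokens with plain 'goal'."""
    return GOAL_RE.sub("goal", text)

print(normalize_goal("Goooaaal! What a goal!"))  # -> goal! What a goal!
```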
The frequently used verbs meet, speed, bleed, and feed form their past tenses by shortening the vowel to met, sped, bled, and fed.
Other minimal pairs that will reduce to the same result:
devotee, devotees; devote, devotes
divorcee, divorcees; divorce, divorces
There are simply too many common words your program wouldn’t distinguish. Even if you had to deal with “Goooaaal!” and “Gooaal!” separately, it would be a better use of your time, and would skew your results less, to abandon this method of sorting your data.