Latin Scrabble II: Calculating Letter Frequencies

Getting hold of a sample of Latin text of any size is easy enough. I wanted 100,000 characters, so I copied three books from different websites into a single file. I believe they were Cicero’s First Catilinarian, Vergil’s Eclogues, and a book of Ovid’s Metamorphoses, but they could have been almost anything. (With a large enough sample, it shouldn’t make much difference which authors are chosen, so I didn’t write them down.)

I then tried to use my very rusty knowledge of C to put together a C++ program that would open the file, read it character by character, and compile totals for each letter. It was only after wasting many hours on this project that I realized it could be done very simply without any programming at all. The Microsoft Word search-and-replace function tells you how many replacements it has made. Therefore, to calculate letter frequencies for any text, all you need to do is:

  1. Copy it into Word.
  2. Remove all the punctuation and spaces by replacing them with nothing using the search-and-replace function.
  3. Trim the remaining alphabet soup to the appropriate size. Having precisely 100,000 characters, or some similar round number, will greatly simplify calculating the percentages for individual letters.
  4. Go through the alphabet, replacing A with nothing (or * or any nonalphabetic character), then B, C, D, and so on, jotting down the total number of replacements made each time.
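For anyone who would rather script the count than click through Word, the same steps can be sketched in a few lines of Python. This is only an illustration, not the abandoned C++ program: the function name `letter_counts` and the `sample_size` parameter are made up for the example, and it assumes the text uses the plain 26-letter alphabet (anything else, including vowels marked with macrons, is simply dropped).

```python
from collections import Counter
import string


def letter_counts(text, sample_size=None):
    """Count each letter A-Z in text, ignoring case, spaces, and punctuation.

    sample_size (a hypothetical parameter) trims the letters-only text to a
    round number of characters, like step 3 above.
    """
    # Steps 1-2: keep only plain alphabetic characters, uppercased.
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    # Step 3: trim the "alphabet soup" to the desired size, if one is given.
    if sample_size is not None:
        letters = letters[:sample_size]
    # Step 4: tally every letter in one pass.
    return Counter(letters)


counts = letter_counts("Arma virumque cano")
print(counts["A"])           # 3
print(sum(counts.values()))  # 16
```

For a real sample, the text would be read from a file and passed in with `sample_size=100_000`.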

Assuming a Scrabble set of 100 tiles in all, if you have (e.g.) 9,425 As and 1,663 Bs in a 100,000-character text, you know that you need 9 A tiles and 2 B tiles.
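The arithmetic is just the letter's share of the sample scaled to the size of the set and rounded to the nearest whole tile. A small sketch, with a hypothetical helper name `tiles_for`:

```python
def tiles_for(count, sample_size=100_000, set_size=100):
    """Scale a raw letter count to a whole number of tiles.

    Assumes simple rounding to the nearest tile; the rounded totals for all
    letters will not necessarily add up to exactly set_size.
    """
    return round(count * set_size / sample_size)


print(tiles_for(9425))  # 9  (9.425% of 100 tiles)
print(tiles_for(1663))  # 2  (1.663% of 100 tiles)
```

Because each letter is rounded independently, the tile counts may come to a little more or less than 100 and need adjusting by hand.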

This is not the ideal method for calculating letter frequencies. For Scrabble, the significant number is not the percentage of As or Bs in a Latin text, or in the sum of all Latin texts, but the percentage of As or Bs in the set containing every distinct form of every Latin word, with all the duplicates removed. For instance, when we consider the frequency of interrogatives, relatives, et, and –que in Latin, my method probably overstates the number of Qs, Us, and Es, since these common words will come up far more often than the average word. H is probably the most overrepresented letter, since very few different Latin words have an H in them, but those few include all the forms of hic, haec, hoc, which come up over and over. R and S are probably underrepresented, since the complete set of all possible distinct forms of all Latin words would have tens of thousands of each just in the verb endings. In the long run, a computer program with a very large database of Latin texts would provide much more accurate percentages. However, they would probably not be very different from mine, so my method makes a tolerable substitute. It is certainly much better than using the English percentages, as we would be doing if we played Latin Scrabble with an unaltered English Scrabble set.
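A rough step in that direction is to count each distinct word form only once before tallying its letters. The sketch below is an assumption-laden illustration, not a substitute for a real word list: the name `distinct_form_counts` is invented here, it sees only the forms that happen to occur in the sample, and it treats a word with an enclitic such as –que attached as a separate form.

```python
import re
from collections import Counter


def distinct_form_counts(text):
    """Count letters across the distinct word forms in text.

    Each form contributes its letters once, however often it occurs.
    """
    # Split into lowercase alphabetic words and drop duplicates.
    forms = set(re.findall(r"[a-z]+", text.lower()))
    counts = Counter()
    for form in forms:
        counts.update(form)  # adds one per letter in the form
    return counts


text = "hic haec hoc hic hic arma"
print(distinct_form_counts(text)["h"])  # 3: hic, haec, hoc counted once each
```

In the raw text above, H appears five times; after removing duplicates it appears three times, which is the kind of correction described for hic, haec, hoc.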

It would be interesting to calculate the percentages of long and short vowels separately, but I see no way to do that without scanning a lot of hundred-year-old school texts in which all the long vowels are marked. Even then, it might not be easy to find texts that mark the hidden quantities, that is, the vowels that are always long by position, and are therefore often not marked long even when they are long by nature. (For scanning verse, the fact that the second A in amans and the E in dolens are long vowels is irrelevant, since the –ns will make the syllables containing them long in any case. Even texts that profess to mark the long vowels often do not bother with the hidden longs, as if scansion counted for everything and pronunciation for nothing.)

