The problem is to find a scoring function: a function that assigns a number to how likely a message is to be English.
What I tried for the score-computing function
Most of these are based on ‘ETAOIN SHRDLU’ (frequency analysis).
- Position-based weighting (higher score is better): each character in ‘ETAOIN SHRDLU’ is weighted by its position in that string. This doesn’t work very well.
    def scorecompute(msg):
        score = 0
        positions = 'ETAOIN SHRDLU'
        for i in positions:
            # Characters earlier in the string get a larger weight.
            weight = len(positions) - positions.find(i)
            score += msg.upper().count(i) * weight
        return score
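From this point on, higher scores are worse.
The next one is not based on frequency analysis.
- Ensure every character is an ASCII letter, digit, space or newline. This is stricter than "all characters are ASCII": punctuation is rejected too.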
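    import re

    # Assumed module-level switch; the original snippet uses it without defining it.
    check_english = True

    def scorecompute(msg):
        # Reject (infinite score) any message containing a character other than
        # letters, digits, spaces and newlines; every other message scores 1.
        if check_english:
            if not re.fullmatch('[A-Z 0-9\n]+', msg.upper()):
                return float('inf')
        return 1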
- Chi-squared based on character frequency. The table also includes frequencies for non-alphabetic characters (‘.’, ‘'’, ‘:’, etc.).
    def scorecompute(msg: str):
        """
        Uses the chi-squared statistic to compute a 'score'.
        A lower score means the message is more likely to be English
        (it matches the frequency table more closely).
        """
        # Expected frequencies in percent; space and some non-alpha characters are included too.
        freq_table = {
            "A": 8.55, "K": 0.81, "U": 2.68, "B": 1.60, "L": 4.21, "V": 1.06,
            "C": 3.16, "M": 2.53, "W": 1.83, "D": 3.87, "N": 7.17, "X": 0.19,
            "E": 12.10, "O": 7.47, "Y": 1.72, "F": 2.18, "P": 2.07, "Z": 0.11,
            "G": 2.09, "Q": 0.10, "H": 4.96, "R": 6.33, "I": 7.33, "S": 6.73,
            "J": 0.22, "T": 8.94, ":": 7.40, "'": 7.40, " ": 12.10,
        }
        if not msg:
            return float('inf')  # avoid dividing by zero on an empty message
        upper = msg.upper()
        score = 0
        for char, expected in freq_table.items():
            observed = (upper.count(char) / len(msg)) * 100
            score += ((observed - expected) ** 2) / expected
        return score
- Chi-squared test that penalizes non-ASCII characters: add a fixed number to the score for every non-ASCII character found in the message.
    for i in msg:
        if not i.isascii():
            score += 200
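The penalty loop above is only a fragment, so here is a minimal sketch of how it could wrap any "lower is better" scorer. The names `penalized_score`, `base_score` and `penalty` are made up for illustration; only the value 200 comes from the notes.

```python
from typing import Callable


def penalized_score(msg: str,
                    base_score: Callable[[str], float],
                    penalty: float = 200.0) -> float:
    """Add a fixed penalty per non-ASCII character to a base score (lower is better)."""
    score = base_score(msg)
    # str.isascii() is available from Python 3.7 onwards.
    non_ascii = sum(1 for ch in msg if not ch.isascii())
    return score + penalty * non_ascii
```

With a dummy base scorer that always returns 1.0, `penalized_score("héllo", lambda m: 1.0)` gives 201.0, because the message contains one non-ASCII character. In practice the chi-squared function would be passed as `base_score`.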
Other things I could try
- All of the methods above are ‘1-gram’ (single-character) methods. One thing I didn’t try is using n-gram frequency tables for English.
- To make this a proper statistical test, a final step would be to find the p-value of the chi-squared statistic for the degrees of freedom (the number of classes in the frequency table minus one) and compare it with a chosen significance level (say 0.05).
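The p-value step could be sketched as follows, under some assumptions. A chi-squared test is normally computed on counts, while the scorer above works on percentages; for a message of n characters the two differ by a factor of n/100, so the score is rescaled first. The p-value here uses the Wilson-Hilferty normal approximation so that only the standard library is needed; a statistics package would give the exact value. The function names are hypothetical, and the test is only strictly valid if the table's classes cover every character and its percentages sum to 100, which is not quite true of the table in these notes.

```python
import math


def chi2_from_percent_score(score: float, n_chars: int) -> float:
    """Convert a chi-squared sum computed on percentages to one computed on counts."""
    return score * n_chars / 100.0


def chi2_p_value_approx(statistic: float, dof: int) -> float:
    """Approximate upper-tail p-value using the Wilson-Hilferty approximation."""
    if dof < 1:
        raise ValueError("dof must be at least 1")
    if statistic <= 0:
        return 1.0
    variance_term = 2.0 / (9.0 * dof)
    # The cube root of (statistic / dof) is approximately normally distributed.
    z = ((statistic / dof) ** (1.0 / 3.0) - (1.0 - variance_term)) / math.sqrt(variance_term)
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With the 29 classes in the table above, the degrees of freedom would be 28, so a check might read `chi2_p_value_approx(chi2_from_percent_score(score, len(msg)), 28) < 0.05`, rejecting "this is English" when it is true.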