The problem is to find a scoring function.

A function that assigns a number to a message indicating how likely it is to be English.

What I tried for the score-computing function

Most of them are based on ‘ETAOIN SHRDLU’ (frequency analysis).

  1. Position-based weighting (a higher score is better)

        def scorecompute(msg):
            score = 0
            positions = 'ETAOIN SHRDLU'

            # weight each character by its position in 'ETAOIN SHRDLU':
            # 'E' gets the largest weight, 'U' the smallest
            for i in positions:
                weight = len(positions) - positions.find(i)
                score += msg.upper().count(i) * weight

            return score

     This doesn’t work so well (see the example below).
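
     For example (the input strings below are just made-up illustrations), a
     message that is nothing but repeated high-frequency letters outscores a
     real English phrase, because repetition and message length are never
     normalised:

        print(scorecompute("EEEEEEEEEE"))   # 130: ten 'E's at weight 13
        print(scorecompute("HELLO WORLD"))  # 58, even though this is real English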

From this point on, lower scores are better (a higher score means the message is less likely to be English).

The first approach below is not based on frequency analysis.

  1. Ensure all characters are ASCII (in fact only letters, digits, spaces and newlines)

        import re

        check_english = True  # module-level toggle for the character check

        def scorecompute(msg):
            if check_english:
                # anything outside A-Z, 0-9, space or newline disqualifies the message
                if not re.fullmatch('[A-Z 0-9\n]+', msg.upper()):
                    return float('inf')
            return 1
    
  2. Chi-squared based on character frequency

     Include frequencies for non-alphabetic characters as well (‘.’, the apostrophe, ‘:’, etc.).

        def scorecompute(msg: str):
            """
            Uses a chi-square test to compute the 'score'.
            A lower score means the message is more likely to be English
            (i.e. it matches the 'frequency table' more closely).
            """
            # letter frequencies in percent, plus space and a couple of
            # non-alphabetic characters
            freq_table = {"A": 8.55, "K": 0.81, "U": 2.68,
                          "B": 1.60, "L": 4.21, "V": 1.06,
                          "C": 3.16, "M": 2.53, "W": 1.83,
                          "D": 3.87, "N": 7.17, "X": 0.19,
                          "E": 12.10, "O": 7.47, "Y": 1.72,
                          "F": 2.18, "P": 2.07, "Z": 0.11,
                          "G": 2.09, "Q": 0.10,
                          "H": 4.96, "R": 6.33,
                          "I": 7.33, "S": 6.73,
                          "J": 0.22, "T": 8.94,
                          ":": 7.40, "'": 7.40, " ": 12.10}
            score = 0
            for i in freq_table:
                observed = (msg.upper().count(i) / len(msg)) * 100
                expected = freq_table[i]
                score += ((observed - expected) ** 2) / expected
            return score
    
  3. Chi-square test which penalizes non-ASCII characters: add a fixed number to the score for every non-ASCII character found in the message (a quick sanity check follows below).

        # added inside the chi-square scorecompute above, just before it
        # returns the score
        for i in msg:
            if not i.isascii():
                score += 200
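
     Putting 2 and 3 together, a quick sanity check might look like this (the
     input strings are just illustrations; scorecompute is the chi-square
     version with the non-ASCII penalty folded in):

        english = "A FAIRLY ORDINARY ENGLISH SENTENCE WITH NO SURPRISES"
        gibberish = "XQZJ KVWXQ QZJXK VWQZJ XQKZV WXJQZ"
        print(scorecompute(english))    # comparatively low: close to the frequency table
        print(scorecompute(gibberish))  # much higher: dominated by rare letters like X, Q, Z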
    

Other things I could try

  • Since these methods are all ‘1-gram’ (single-character frequencies), one method I didn’t try was using ‘n-gram’ frequency tables for English (a rough sketch follows after this list).

  • To make this a proper statistical test, a final step would be to compute the p-value of the chi-square statistic and compare it against a chosen significance level (say 0.05), using the appropriate degrees of freedom (the number of classes in the frequency table minus one).
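
A minimal sketch of that last step, assuming SciPy is available and reusing the chi-square scorecompute from step 2 (the function name looks_english and the n_classes default are mine, taken from the 29 entries in that frequency table):

    from scipy.stats import chi2

    def looks_english(msg, alpha=0.05, n_classes=29):
        """Goodness-of-fit test: is msg consistent with the English frequency table?"""
        statistic = scorecompute(msg)      # chi-square score from step 2
        dof = n_classes - 1                # degrees of freedom: classes minus one
        p_value = chi2.sf(statistic, dof)  # P(chi2 >= statistic) under the null
        return p_value > alpha             # fail to reject => plausibly English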
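
For the n-gram idea, a rough sketch of a bigram (n = 2) version, assuming some plain-text English reference corpus is available to build the table from (the file name 'corpus.txt' and both function names here are hypothetical):

    from collections import Counter

    def bigram_table(reference_text):
        """Relative bigram frequencies (%) computed from a reference English corpus."""
        text = reference_text.upper()
        pairs = [text[i:i + 2] for i in range(len(text) - 1)]
        counts = Counter(pairs)
        total = sum(counts.values())
        return {pair: 100 * c / total for pair, c in counts.items()}

    def scorecompute_bigrams(msg, freq_table):
        """Chi-square over bigram frequencies; a lower score means more English-like."""
        msg = msg.upper()
        pairs = [msg[i:i + 2] for i in range(len(msg) - 1)]
        counts = Counter(pairs)
        total = max(len(pairs), 1)
        score = 0
        for pair, expected in freq_table.items():
            observed = 100 * counts.get(pair, 0) / total
            score += (observed - expected) ** 2 / expected
        return score

    # freq_table = bigram_table(open('corpus.txt').read())
    # print(scorecompute_bigrams("HELLO WORLD", freq_table))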