Beyond ratios : Sampling & confidence intervals

Problem:

Suppose we have 100 log lines, each with one of three severity levels: INFO, WARN, or SEVERE.

Since processing all of them might be expensive, how do we sample a proportion of these log lines? And what can we say about the ‘population’ of log lines from this sample?
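One way to take such a sample is a simple random sample without replacement. The sketch below is only illustrative: the log lines are made up, and the helper names (`make_log_lines`, `sample_lines`, `severe_proportion`) are hypothetical, not part of any real logging library.

```python
import random

# Hypothetical severities and a made-up population of log lines.
SEVERITIES = ["INFO", "WARN", "SEVERE"]


def make_log_lines(n, seed=0):
    """Build n fake log lines, each ending with a random severity."""
    rng = random.Random(seed)
    return [f"line {i} {rng.choice(SEVERITIES)}" for i in range(n)]


def sample_lines(lines, k, seed=None):
    """Simple random sample of k lines, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(lines, k)


def severe_proportion(lines):
    """Fraction of the given lines whose severity is SEVERE."""
    return sum(1 for line in lines if line.endswith("SEVERE")) / len(lines)


population = make_log_lines(100)
sample = sample_lines(population, 20, seed=42)
print(len(sample), severe_proportion(sample))
```

Every line has the same chance of being picked, which is what lets the sample proportion stand in for the population proportion.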

Confidence intervals

A confidence interval lets us make statements such as ‘with an x% level of confidence, the number of severe lines in the overall population will be between y and z.’ In other words, the proportion observed in the sample lets us infer something about the ‘population’.

Using Confidence intervals

Three quantities let us derive a confidence interval: the standard error (the standard deviation of the sample proportion), the critical value (taken from a distribution for the chosen level of confidence), and the margin of error (the critical value multiplied by the standard error).

So if 20 lines were sampled from 100 and 2 of them were ‘SEVERE’, we can use this sample proportion to come up with a confidence interval for the population proportion.

level of confidence = say 95%

critical value = 1.96 (this is the 97.5th percentile of the standard normal distribution, which leaves 2.5% in each tail for a 95% level of confidence)
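The value 1.96 can be reproduced with Python's standard library. This is a minimal sketch, assuming a two-sided interval based on the standard normal distribution; the function name `critical_value` is made up for illustration.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided critical value of the standard normal distribution."""
    alpha = 1.0 - confidence  # probability left outside the interval
    # Half of alpha sits in each tail, so look up the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


print(round(critical_value(0.95), 2))  # prints 1.96
```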

\[ \text{Proportion} = P = {2 \over 20} = 0.1 \]

\[ \text{Standard error} = S.E. = \sqrt{P(1-P) \over N} = \sqrt{(0.1 \times 0.9) \over 20} \approx 0.067 \] where P = proportion of SEVERE lines in the sample and N = number of elements in the sample

The confidence interval is given by \[(0.1 - (1.96 \times S.E.), 0.1 + (1.96 \times S.E.))\] \[(0.1 - 0.1313, 0.1 + 0.1313)\] \[(-0.0313, 0.2313)\] A proportion cannot be negative, so the lower bound is clamped to 0. Scaling to the population of 100 lines and rounding to whole lines, as fractional lines don’t make sense, gives (0, 23).

So we can say, with a 95% level of confidence, that the number of SEVERE lines in the overall population of 100 will be between 0 and 23.
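The whole calculation can be sketched in a few lines of Python. This is an illustration under the same assumptions as the worked example (a normal approximation, 2 SEVERE lines in a sample of 20, a population of 100); `proportion_ci` is a hypothetical helper name. Because the code does not round the standard error to 0.067 along the way, its unrounded bounds differ slightly in the fourth decimal place.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n                    # sample proportion
    se = math.sqrt(p * (1 - p) / n)      # standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z * se                      # margin of error
    # A proportion must lie in [0, 1], so clamp both bounds.
    return max(0.0, p - margin), min(1.0, p + margin)


low, high = proportion_ci(2, 20)
population_size = 100
# Scale the proportions to line counts and round to whole lines.
print(round(low * population_size), round(high * population_size))  # prints 0 23
```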