Chi-square

 

What is a chi-square distribution?

 

Like the Poisson distribution mentioned before, chi-square is a probability distribution.  Unlike the Poisson distribution, which is discrete, chi-square is continuous.  This means that an observation following this distribution can take any of infinitely many values within a range.  For example, the time you spend waiting for the next bus can be described using a continuous distribution.  If you are lucky, the next bus comes within 5 minutes.  If you are unlucky, it might take an hour.  If you are really unlucky and miss the only bus that will run on this route for the next 100 years, the waiting time is longer than your lifetime.

 

A continuous random variable can be summarized by its MEAN and VARIANCE, just like a discrete random variable.  Instead of a probability mass function, which assigns a probability to each possible value, a continuous random variable has a probability density function.

 

Let X be a continuous random variable with the density function f(x) shown below:

[Figure: density curve f(x), with the region between x = 5 and x = 8 shaded blue]

To calculate the probability that X will take values between 5 and 8, we integrate f(x) between x=5 and x=8 to get the area below the curve (the blue region).  Note that the probability that a continuous random variable will take on any single value is ZERO, since the area under the curve above a single point is always 0.
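
The area calculation can be sketched numerically.  The density from the figure is not available here, so this example assumes an exponential waiting-time density with a mean of 10 minutes (in the spirit of the bus example); the names exponential_density and integrate are made up for illustration.

```python
import math

def exponential_density(x, mean_wait=10.0):
    """Assumed density for illustration: exponential waiting time, mean 10."""
    if x < 0:
        return 0.0
    return math.exp(-x / mean_wait) / mean_wait

def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Probability that X falls between 5 and 8: the area under the curve there.
prob_5_to_8 = integrate(exponential_density, 5, 8)

# Probability that X equals exactly 5: an interval of zero width has zero area.
prob_exactly_5 = integrate(exponential_density, 5, 5)

print(prob_5_to_8, prob_exactly_5)
```

For this assumed density the first number should be close to the exact value exp(-0.5) - exp(-0.8), and the second is 0.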

 

What does it look like?

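The probability density function for a chi-square random variable is:

\[ f(x) = \frac{x^{n/2 - 1} e^{-x/2}}{2^{n/2}\,\Gamma(n/2)} \quad \text{when } x \ge 0 \]

\[ f(x) = 0 \quad \text{otherwise} \]

where \Gamma is the gamma function and n is the number of degrees of freedom.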
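
As a rough check on the standard chi-square density, the sketch below codes it with Python's math.gamma and confirms numerically that the total area is about 1 and that the mean is about n.  The function names and the choice n = 5 are illustrative assumptions.

```python
import math

def chi_square_pdf(x, n):
    """Standard chi-square density with n degrees of freedom."""
    # x = 0 is treated as 0 here; a single point does not affect any area.
    if x <= 0:
        return 0.0
    half_n = n / 2
    return x ** (half_n - 1) * math.exp(-x / 2) / (2 ** half_n * math.gamma(half_n))

def integrate(f, a, b, steps=100_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

n = 5  # illustrative choice
# The upper limit 200 stands in for infinity; the density is negligible beyond it.
total_area = integrate(lambda x: chi_square_pdf(x, n), 0, 200)
mean = integrate(lambda x: x * chi_square_pdf(x, n), 0, 200)

print(total_area, mean)
```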

 

 

Here are some graphs:

[Figure: chi-square density curves for n = 5, n = 10, n = 25, and n = 50]

 

Note that chi-square distributions are not symmetrical; they are skewed to the right.  The skew becomes less pronounced as n increases.

 

To see more examples of Chi-square, check out  http://www.xycoon.com/chisq1.htm

 

When to use Chi-square distribution?

 

Statistical tests whose test statistic follows a chi-square distribution when the null hypothesis is true are called “chi-square tests”.  Some examples of when to use a chi-square test are:

 

1. to test for independence

2. to test for homogeneity

3. to test for goodness of fit

 

Relation to Contingency Tables

 

For data in a contingency table, we can use Pearson’s chi-square statistic.  For example, suppose we want to test whether smoking is associated with getting cancer, and we have the data table below:

 

                       Get cancer?
                       Yes      No       Total
Smoking?     Yes       100      20       120
             No         40      80       120
             Total     140     100       240

 

In other words, we want to check whether the variables smoking habit and cancer occurrence are independent of each other. 

 

Pearson’s chi-square statistic looks like this:

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

where O_i is what we observe (the numbers in the table) and E_i is what we expect if there’s no dependence between the 2 variables.

 

If there’s no dependence between the 2 variables (this is our null model), then,

 

P(smoke and get cancer) = P(smoke) * P(get cancer)

                                       = 120 / 240 * 140 / 240

                                       = 0.2917

Then the expected number of people who smoke and get cancer is P(smoke and get cancer) * total people = 0.2917 * 240 = 70

 

We do the same thing for all 4 cells and get this table of the counts expected under our null model:

 

                       Get cancer?
                       Yes      No       Total
Smoking?     Yes        70      50       120
             No         70      50       120
             Total     140     100       240
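
The expected counts can also be computed directly from the row and column totals of the observed table.  This is a minimal sketch; the variable names are illustrative.

```python
# Observed counts: rows are smoking (yes, no), columns are cancer (yes, no).
observed = [[100, 20],
            [40, 80]]

row_totals = [sum(row) for row in observed]          # [120, 120]
col_totals = [sum(col) for col in zip(*observed)]    # [140, 100]
grand_total = sum(row_totals)                        # 240

# Under independence, expected count = row total * column total / grand total,
# which is the same as P(row) * P(column) * total people.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

print(expected)  # [[70.0, 50.0], [70.0, 50.0]]
```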

 

Using the above formula, we get

\[ \chi^2 = \frac{(100-70)^2}{70} + \frac{(20-50)^2}{50} + \frac{(40-70)^2}{70} + \frac{(80-50)^2}{50} = 61.71 \]
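
The same arithmetic in a few lines of Python, as a quick check of the value above (variable names are illustrative):

```python
# Observed and expected counts for the 4 cells, in the same order.
observed = [100, 20, 40, 80]
expected = [70, 50, 70, 50]

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi_square, 2))  # 61.71
```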

 

The larger the value of χ², the worse the null model fits the data.

 

To see how significant this number is, we calculate the p-value.  We need to determine the degrees of freedom first.

Degrees of freedom = number of cells - number of independent parameters fitted - 1

                                =  4 - 2 (smoke, cancer) - 1

                                = 1

p-value = P(χ² > a specific value | model is correct)

 

In this case, the specific value is 61.71

 

Using a chi-square table we get a p-value of 4 x 10^-15 (use the online calculator http://www.stat.sc.edu/~west/applets/chisqdemo.html or Excel).  This means that if smoking and cancer are independent, the probability of getting a χ² value at least as large as the one we observed is 4 x 10^-15, i.e. very unlikely.  So we can reject our null hypothesis.
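
As an alternative to a table, the p-value for this example can be sketched with the Python standard library.  The shortcut below works only for 1 degree of freedom: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(sqrt(x / 2)).  The (rows - 1) * (columns - 1) rule used here is the standard equivalent, for a contingency table, of the degrees-of-freedom count above.

```python
import math

# Degrees of freedom for a 2 x 2 contingency table.
rows, cols = 2, 2
degrees_of_freedom = (rows - 1) * (cols - 1)

chi_square = 61.71

# Upper-tail probability P(chi-square > 61.71); valid for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(degrees_of_freedom, p_value)
```

The result should be on the order of 4 x 10^-15, in line with the table lookup.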

 

Relation to Likelihood Ratio Tests

 

In short, a likelihood ratio test is based on a ratio of likelihoods:

\[ \text{ratio} = \frac{L(\alpha)}{L(\theta)} \]

where α and θ are two different models (θ is often referred to as the alternate model, while α is the null model).

 

Using our last example, one model assumes that there’s no relationship between smoking and getting cancer while the other model assumes that a relationship exists.  How to calculate the likelihood for models is beyond the scope of this course.  The important take-home message is that under certain assumptions, -2 ln(ratio) approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated in the 2 models.

 

For example, suppose we are trying to model the weather tomorrow using a bunch of variables:  number of cars on the street right now, number of people in GSC, and amount of snow on the ground.  The model α says that none of the 3 factors matters and has a likelihood of 100, while the model θ says that all 3 factors contribute and has a likelihood of 140.  Taking the ratio of the null likelihood to the alternate likelihood, we can calculate -2 ln(ratio) = -2 ln(100/140) = 0.67.  The degrees of freedom are the difference between the numbers of estimated parameters, which is 3.  So we can use the chi-square table to look up the p-value for this value.
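
The arithmetic in this example can be sketched as follows, using the ratio 100/140 from the calculation above.  In place of a chi-square table, the p-value is approximated by numerically integrating the standard chi-square density; the helper names chi_square_pdf and upper_tail are made up for illustration.

```python
import math

def chi_square_pdf(x, n):
    """Standard chi-square density with n degrees of freedom."""
    if x <= 0:
        return 0.0
    half_n = n / 2
    return x ** (half_n - 1) * math.exp(-x / 2) / (2 ** half_n * math.gamma(half_n))

def upper_tail(x, n, steps=100_000):
    """Approximate P(chi-square > x) as 1 minus the area from 0 to x (midpoint rule)."""
    width = x / steps
    area = sum(chi_square_pdf((i + 0.5) * width, n) for i in range(steps)) * width
    return 1 - area

# Test statistic from the example: -2 ln(ratio) with ratio = 100/140.
statistic = -2 * math.log(100 / 140)

# 3 degrees of freedom: the difference in the number of estimated parameters.
p_value = upper_tail(statistic, 3)

print(round(statistic, 2), p_value)
```

A p-value this large (close to 0.9) would give no reason to prefer the model with the 3 extra factors.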

 

<Christina Chen  2004-02-12>