In this week's assignment you will design and implement your own k-means clustering algorithm and use it to partition microarray data collected from various timepoints in the yeast cell cycle. In this experiment, yeast cells were synchronized by arresting them at 37 degrees with a temperature-sensitive allele of CDC4. mRNA was then collected every ten minutes for the next two hours (approximately two turns of the cell cycle) and hybridized to Affymetrix microarrays (see Cho et al. for more info). As a starting point we are providing you with Affymetrix data that has been pre-filtered. You can find the data here. The data was first filtered to ensure that each gene was expressed at a level of at least 100 in at least one condition. In addition, the data was filtered to remove genes whose normalized standard deviation (stdev/mean) was less than 0.5.
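To make the two filters concrete, here is a minimal Perl sketch of what they compute for a single gene. The sub name `passes_filters`, its parameters, and the example values are made up for illustration. The sketch also assumes the sample standard deviation (dividing by n - 1), which the description above does not specify.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

# Hypothetical helper: returns 1 if a gene's expression values pass both
# filters, 0 otherwise. Assumes the sample standard deviation (n - 1);
# the original filtering may have used a different convention.
sub passes_filters {
    my ($values, $min_level, $min_cv) = @_;
    my $n = scalar @$values;
    return 0 if $n < 2;

    # Filter 1: expressed at $min_level or higher in at least one condition.
    return 0 if max(@$values) < $min_level;

    # Filter 2: normalized standard deviation (stdev / mean) of at least $min_cv.
    my $mean = sum(@$values) / $n;
    return 0 if $mean <= 0;
    my $variance = sum(map { ($_ - $mean) ** 2 } @$values) / ($n - 1);
    my $cv = sqrt($variance) / $mean;
    return $cv >= $min_cv ? 1 : 0;
}

# Made-up example values, not taken from the data set.
my @flat_gene    = (110, 120, 115, 118);   # expressed, but barely changes
my @cycling_gene = (20, 300, 40, 280);     # expressed and strongly varying
my @silent_gene  = (5, 60, 10, 40);        # varies, but never reaches 100

print "flat: ",    passes_filters(\@flat_gene, 100, 0.5),    "\n";
print "cycling: ", passes_filters(\@cycling_gene, 100, 0.5), "\n";
print "silent: ",  passes_filters(\@silent_gene, 100, 0.5),  "\n";
```

Only the second example gene passes both tests with these thresholds.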
QUESTION 1: Explain the logic behind these filters.
Here is a script called normalize.pl. Download normalize.pl and run it on the expression data. This script takes in expression data, stores it, and normalizes every gene by dividing each expression value for that gene by the magnitude (or norm) of that gene's expression values (its expression vector or expression array). The norm is defined as:
$\lVert x \rVert = \sqrt{\sum_i x_i^2}$
By dividing every element by the gene's norm, you make it so every gene now has a norm of 1. This makes it easier to compare the expression patterns of two genes, because differences in overall expression level no longer matter.
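As a small illustration of this normalization (not the contents of normalize.pl), the sketch below computes the norm of one expression vector and scales it to norm 1. The helper names `vector_norm` and `normalize_vector` and the example values are assumptions made for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: Euclidean norm of an expression vector (array reference).
sub vector_norm {
    my ($values) = @_;
    return sqrt(sum(0, map { $_ ** 2 } @$values));
}

# Hypothetical helper: return a copy of the vector scaled to norm 1.
# An all-zero vector is returned unchanged to avoid dividing by zero.
sub normalize_vector {
    my ($values) = @_;
    my $norm = vector_norm($values);
    return [@$values] if $norm == 0;
    return [ map { $_ / $norm } @$values ];
}

my @gene = (3, 4);                        # made-up values; the norm is 5
my $unit = normalize_vector(\@gene);
printf "%.2f %.2f (norm %.2f)\n", @$unit, vector_norm($unit);
# prints "0.60 0.80 (norm 1.00)"
```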
Ok, so now we want to cluster these normalized gene expression values using k-means. Here is a general outline of how to proceed.
Step 1 Read in the data and store it in a two-dimensional (2D) array. For more info on 2D arrays, click here. You may want to look at normalize.pl to get an idea of how to do this. You are free to use any part of this code in your own script (just make sure you comment it).
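If the 2D array idea is new, the following sketch shows one way to build an array of arrays in Perl. It assumes a tab-delimited layout with the gene name first and one value per timepoint after it, which may not match the actual data file, and the expression values are made up. It reads from an in-memory string so that it runs without any file.

```perl
use strict;
use warnings;

# Assumed layout (not confirmed by the assignment): one gene per line,
# tab-delimited, gene name first, then one expression value per timepoint.
my $sample = "YAL001C\t0.1\t0.2\t0.3\nYAL002W\t0.4\t0.5\t0.6\n";

# Read from an in-memory string so the sketch runs without a data file;
# in a real script you would open the data file by name instead.
open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";

my @names;        # $names[$i] is the name of gene $i
my @expression;   # $expression[$i][$j] is gene $i at timepoint $j

while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;             # skip blank lines
    my ($name, @values) = split /\t/, $line;
    push @names, $name;
    push @expression, [@values];          # store a reference to a copy of the row
}
close $fh;

print "$names[1] at timepoint 2: $expression[1][2]\n";
# prints "YAL002W at timepoint 2: 0.6"
```

Each row is stored as an array reference, so `$expression[$i]` can be passed directly to a subfunction that expects a reference to a one-dimensional array.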
Step 2 Initialize the centroids. Centroids are the points that define the centers of the clusters in k-means clustering. For a given gene, its closest centroid tells you what cluster it's in. For this assignment we'll use 3 centroids. Use the following statements (exactly) to initialize the locations of your centroids:
@centroid1 = (0, .5, .5, 0, 0, 0, 0, 0, 0, 0, .5, .5, 0, 0, 0);
@centroid2 = (.3535, 0, .3535, 0, .3535, 0, .3535, 0, .3535, 0, .3535, 0,
.3535, 0, .3535);
@centroid3 = (.2582, .2582, .2582, .2582, .2582, .2582, .2582, .2582, .2582,
.2582, .2582, .2582, .2582, .365, 0);
Step 3 For each gene, calculate the correlation coefficient between it and centroids 1, 2, and 3. To do this we're going to use a subfunction called correlate. Subfunctions are used frequently in Perl for tasks that a script performs many times. For instance, this subfunction takes as input two one-dimensional arrays and returns their correlation coefficient. Download this file to get the subfunction. Paste the entire contents of this file at the end of your script. Subfunctions are found at the end of your script after the main body of the program. The syntax for using this particular subfunction in the main section of your script is:
$correlation = correlate(\@array1, \@array2);
Use this subfunction to assign all of the genes to their closest centroid (the centroid with the highest correlation coefficient for that gene).
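The assignment step can be sketched as follows. The `correlate` here is a stand-in Pearson correlation written only so the example runs on its own; the subfunction you download may differ in detail. The helper `closest_centroid` and the three-timepoint data are likewise made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Stand-in for the provided subfunction: Pearson correlation of two
# equal-length arrays passed by reference. Written for this sketch only.
sub correlate {
    my ($x, $y) = @_;
    my $n      = scalar @$x;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    return 0 if $sxx == 0 or $syy == 0;   # constant vector: treat as uncorrelated
    return $sxy / sqrt($sxx * $syy);
}

# Hypothetical helper: index of the centroid with the highest correlation.
sub closest_centroid {
    my ($gene, $centroids) = @_;
    my ($best, $best_r);
    for my $c (0 .. $#$centroids) {
        my $r = correlate($gene, $centroids->[$c]);
        ($best, $best_r) = ($c, $r) if !defined $best_r or $r > $best_r;
    }
    return $best;
}

# Made-up three-timepoint example, not the assignment's centroids.
my @centroids  = ([1, 0, 0], [0, 0, 1]);
my @genes      = ([0.9, 0.3, 0.1], [0.1, 0.2, 0.9]);
my @assignment = map { closest_centroid($_, \@centroids) } @genes;
print "assignments: @assignment\n";       # prints "assignments: 0 1"
```

Storing the assignment as an array of centroid indices, parallel to the gene array, keeps the later centroid update simple.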
Step 4 Now we want to change the coordinates of the centroids to the center of their clusters. To do this you want to add up all of the elements of each cluster separately. For example, say you had two genes in one of the clusters:
Gene 1: (x1, x2, x3)
Gene 2: (y1, y2, y3)
Then, to add them: Gene 1 + Gene 2 = (x1 + y1, x2 + y2, x3 + y3)
After adding up all of the genes in the cluster, normalize these arrays (see normalize.pl) and make them the new centroids.
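A minimal sketch of this update for a single cluster, assuming the cluster's genes are available as an array of array references. The helper name `new_centroid` and the example values are made up, and the handling of an empty cluster is left to the caller because the assignment does not address it.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the genes in one cluster element by element,
# then scale the summed vector to norm 1 to get the new centroid.
sub new_centroid {
    my ($members) = @_;                   # reference to an array of gene vectors
    return unless @$members;              # empty cluster: caller decides what to do
    my @sum = (0) x scalar @{ $members->[0] };
    for my $gene (@$members) {
        $sum[$_] += $gene->[$_] for 0 .. $#sum;
    }
    my $norm = 0;
    $norm += $_ ** 2 for @sum;
    $norm = sqrt($norm);
    return [@sum] if $norm == 0;          # avoid dividing by zero
    return [ map { $_ / $norm } @sum ];
}

# Made-up genes; their element-wise sum is (3, 4, 0), which has norm 5.
my @cluster  = ([1, 1, 0], [2, 3, 0]);
my $centroid = new_centroid(\@cluster);
printf "%.1f %.1f %.1f\n", @$centroid;    # prints "0.6 0.8 0.0"
```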
Step 5 Repeat steps 3 and 4 ten times, each time outputting the new centroids in a readable format.
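One possible way to print centroids readably on each pass is sketched below. The helper `format_centroid`, the four-decimal format, and the two-timepoint centroids are illustrative assumptions; the loop body only marks where steps 3 and 4 would go.

```perl
use strict;
use warnings;

# Hypothetical helper: format one centroid as a labeled, tab-separated line
# with four decimal places per timepoint.
sub format_centroid {
    my ($label, $centroid) = @_;
    return join("\t", $label, map { sprintf "%.4f", $_ } @$centroid);
}

# Made-up two-timepoint centroids standing in for the real ones.
my @centroids = ([0.6, 0.8], [1, 0]);

for my $iteration (1 .. 2) {
    print "Iteration $iteration\n";
    # Steps 3 and 4 (assign genes, recompute centroids) would go here.
    for my $c (0 .. $#centroids) {
        print format_centroid("centroid" . ($c + 1), $centroids[$c]), "\n";
    }
}
```

Tab-separated output like this can be pasted straight into a spreadsheet for graphing.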
Step 6 Assign the genes one last time and print out all of the genes assigned to centroid 1.
QUESTION 2: We did 10 iterations. Looking at your centroids, how many rounds did you really need?
QUESTION 3: Using any program you want (like Excel), graph the centroids before you start clustering and after clustering. Are there any differences?
QUESTION 4: Go to www.yeastgenome.org or use the ORF table provided and look up some of the annotations for the genes in cluster 1. Based on these annotations and the shape of the centroid, what is your interpretation of cluster 1? What is your interpretation of the other two clusters?
Please turn in your code and answers to a folder labelled Assignment03 in your Homework folder.
The assignment is due Friday, Feb. 13th at midnight.