Assignment 1
Preparatory Steps:
Note: ccbtl00:~ username$ is just showing the command prompt (ie. do not type it into your terminal). You should have something similar.
- Move to the Homework directory.
ccbtl00:~ username$ cd Homework
- Make a directory for your assignment and change into it.
ccbtl00:Homework username$ mkdir Assignment1
ccbtl00:Homework username$ cd Assignment1
- Download human chromosome 20 (Here is the direct link to the file, but try using the Unix commands.)
ccbtl00:Assignment1 username$ ftp ftp.ncbi.nih.gov
- Follow the instructions to log into ftp anonymously
- Get the file hs_ref_chr20.fa.gz in /genomes/H_sapiens/CHR_20
...... LOG IN ANONYMOUSLY HERE ..... (ie. type anonymous as your user and email as your password)
ftp> cd /genomes/H_sapiens/CHR_20
ftp> get hs_ref_chr20.fa.gz
bye (logout)
- Unzip the files
ccbtl00:~/Assignment1 username$ gunzip hs_ref_chr20.fa.gz
There should be a file called hs_ref_chr20.fa now in your directory.
Warning: Your file may not be called hs_ref_chr20.fa if you download it using the direct link (it is ref_chr20.fa).
Questions:
- Each contig starts with a fasta header i.e.:
>gi|27501117|ref|NT_011512.7|Hs21_11669 Homo sapiens chromosome 20
Using Unix commands, how many contigs are there in human chromosome 20?
- Human chromosome 20 is 59505253 nts long.
Assume that %A + %T = %G + %C = 50% (or in other words, equal numbers of A, T, C, and G).
How many times would you expect the sequence "CTTG" to occur?
On average, how far apart show each instance of "CTTG" be spaced?
- Now lets test that assumption about uniformly distributed NT frequencies.
Here is a PERL script (save it on the Assignment1 folder as nuc_count.pl).
It reads in the contents of a fasta file, counts the occurences of As, Ts, Cs, Gs, and Ns, and prints the results.
Type perl nuc_count.pl to view the usage statement so that you know how to run the program.
How many times do each of the 4 nucleotides occur on chr 20?
-
Using the output from the previous question, calculate the real
frequencies of A, T, C, and G in chromosome 20.
Round the results to 3 decimal places.
-
Using the real nucleotide frequencies, calculate the expected number of occurences of "CTTG".
-
Copy nuc_count.pl into a new file called cttg_count.pl
Modify cttg_count.pl so that it
- Counts the occurences of "CTTG" in chr 20
- Calculates the frequencies of each of the 4 nucleotides
- Calculates the expected number of "CTTG" occurences
- Outputs these results in a human readable format (please remove the lines of code that output the raw nucleotide counts)
What is the actual number of occurences of "CTTG" in chromosome 20?
Is the number close to the expected number?
-
Now find the number of occurences of "CGCG" in chr20.
You will want to copy cttg_count.pl
into a new file called cgcg_count.pl
so you don't get confused.
What is the result compared to expected?
Can you give a biological reason why this is?
-
The file hs_ref_chr20.fa
is made up of different contigs.
Modify your script, copying it into contig_freq_count.pl, so that it prints the contig names
(ie. the fasta headers) and the 4 nucleotide frequencies for each
contig in the file. An example of what we want as output is
below (the frequency values are wrong though).
>gi|20558635|ref|NT_011333.5|Hs20_11490 Homo sapiens chromosome 20 genomic contig, reference assembly
Afreq: 0.25
Tfreq: 0.25
Gfreq: 0.25
Cfreq: 0.25
>gi|27501067|ref|NT_011387.8|Hs20_11544 Homo sapiens chromosome 20 genomic contig, reference assembly
.
.
.
and so on ...
Does it look like the frequencies are pretty even across all regions of the chromosome?
How do the frequencies for the individual contigs compare to the
average of all contigs (i.e., the overall nucleotide frequencies for
the chromosome)?
Note: Question 8 might be hard for some people. If you are new to
programming, don't sweat it, this problem would be hard for anyone
just learning to program. A great way to debug your programs is to
test them on a smaller, known data set. If you are having problems, or
are not sure if your script is working, try making a toy data set and
testing your script on it. This can really help a lot. And, remember,
the goal is to learn Perl and have fun; don't worry about your
grade. If you make a clear effort, you'll be fine. Good luck!
HandIn:
We want you to hand in a text file with the answers to the questions (show work as necessary).
Also, please hand in all of the programs that you have programmed as well with appropriate commenting (cttg_count.pl, cgcg_count.pl and contig_freq_count.pl)
Last updated: Wed Jan 14 2009