Assignment 1 Solutions

1) There are 6 contigs for this chromosome. Each contig starts with a FASTA header line beginning with ">", so a quick way to check is to count those lines with the Unix command:
grep -c ">" hs_ref_chr20.fa

2) The expected number of occurrences of the sequence CTTG, assuming all four bases are equally likely:
p(C)*p(T)*p(T)*p(G)*# of bases = 0.25^4 * 59505253 = 232442
Since the probability of seeing CTTG at any given position is 0.25^4 = 1/256, occurrences of CTTG should be 256 base pairs apart on average.
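
The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumption as the answer (independent, equally likely bases); the variable names are mine and are not part of the assignment scripts.

```python
# Expected number of CTTG occurrences if every base has probability 0.25
# and positions are independent (the assumption made in this answer).
n_bases = 59505253          # total # of bases, taken from the answer above
p_motif = 0.25 ** 4         # p(C)*p(T)*p(T)*p(G) = 1/256

expected = p_motif * n_bases
mean_spacing = 1 / p_motif  # average distance between occurrences, in bp

print(round(expected))      # 232442
print(mean_spacing)         # 256.0
```

Strictly, a 4-base sequence can start at only (# of bases - 3) positions, but for a chromosome of this size the difference is negligible.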

3) Base counts:
A: 16523028
T: 16725209
G: 13149290
C: 13107726

4) Base frequencies (count / total # of bases):
A: 0.278
T: 0.281
G: 0.221
C: 0.220
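
For reference, here is one way the counts and frequencies could be computed in Python. It is a sketch, not the script used for these solutions: the function names are hypothetical, and ignoring case and non-ACGT symbols such as N is my assumption.

```python
from collections import Counter

def base_counts(sequence):
    """Count A, T, G and C, ignoring case and any other symbols (e.g. N)."""
    counts = Counter(sequence.upper())
    return {base: counts[base] for base in "ATGC"}

def base_frequencies(counts):
    """Turn a dict of base counts into a dict of frequencies."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

# The counts reported in 3); they add up to 59505253 bases.
reported = {"A": 16523028, "T": 16725209, "G": 13149290, "C": 13107726}
freqs = base_frequencies(reported)
for base, freq in freqs.items():
    print(f"{base}: {freq:.3f}")
# A: 0.278
# T: 0.281
# G: 0.221
# C: 0.220
```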

5) Using the observed base frequencies from 4) instead of 0.25, the expected # of CTTG's is:
p(C)*p(T)*p(T)*p(G)*# of bases = 0.22 * 0.281 * 0.281 * 0.221 * 59505253 = 228446
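
The same calculation works for any short sequence. The sketch below assumes independent positions and uses the rounded frequencies from 4); expected_motif_count is a hypothetical helper name, not one of the assignment scripts.

```python
def expected_motif_count(motif, base_freqs, n_bases):
    """Expected occurrences of motif if bases are independent with the given frequencies."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p * n_bases

# Rounded frequencies as used in the answer above (C is entered as 0.22).
freqs = {"A": 0.278, "T": 0.281, "G": 0.221, "C": 0.22}
n_bases = 59505253

print(round(expected_motif_count("CTTG", freqs, n_bases)))  # 228446
```

Because four rounded frequencies are multiplied together, passing unrounded frequencies would shift the result by roughly a few hundred.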

6) There are 274682 occurrences of CTTG on this chromosome. This is about 20% higher than the expected # from 5) (228446), but of the same order of magnitude.
See cttg_count.pl.
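
For readers without Perl, a counting routine might look like the Python sketch below. It is not a port of cttg_count.pl; whether that script counts overlapping matches or ignores case is unknown to me, so both choices here are assumptions.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    sequence = sequence.upper()
    motif = motif.upper()
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Step forward by one base so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return count

print(count_occurrences("ACTTGGCTTGA", "CTTG"))  # 2
print(count_occurrences("CGCGCG", "CGCG"))       # 2 (str.count would give 1)
```

CTTG cannot overlap with itself, so the overlap handling only matters for sequences like CGCG. On the real file the function would be applied to each contig separately, so that no match spans a contig boundary.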

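7) There are 140664 CGCG's expected on the chromosome (0.22^2 * 0.221^2 * 59505253), but there are only 18753, about 7.5 times fewer than expected.
That is because CGCG contains CpG dinucleotides (a C immediately followed by a G), which are depleted throughout most of the genome.
The cytosine in a CpG is prone to methylation, and a methylated cytosine can then be deaminated, which turns it into thymine, so CpGs are gradually lost over evolutionary time.
The exceptions are CpG islands: regions with a high density of CpGs, often found at gene promoters, where the CpGs are typically protected from methylation.

See cgcg_count.pl.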
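
The size of the shortfall is easier to see as an observed/expected ratio. A small Python sketch, using the rounded frequencies from 4) and the count reported above (the variable names are mine):

```python
n_bases = 59505253
p_c, p_g = 0.22, 0.221      # rounded frequencies from 4)

expected = (p_c * p_g) ** 2 * n_bases   # p(C)*p(G)*p(C)*p(G) * # of bases
observed = 18753                        # count reported above

print(int(expected))                    # 140664
print(f"{observed / expected:.2f}")     # 0.13
```

A ratio near 1 would mean the sequence occurs about as often as the base frequencies predict; here it occurs at roughly 13% of the expected rate.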

8) See contig_freq_count.pl. The frequencies vary considerably between the different contigs: G+C content ranges from about 40% (NT_025215.4) to about 58% (NT_035608.2). Here are the answers I get:

>gi|14772189|ref|NT_025215.4|Hs20_25371 Homo sapiens chromosome 20 genomic conti

A: 0.293139426215867
T: 0.304699601858846
G: 0.198110429762011
C: 0.204050542163276

>gi|20558635|ref|NT_011333.5|Hs20_11490 Homo sapiens chromosome 20 genomic conti

A: 0.215886966483565
T: 0.223506741474018
G: 0.280580442381694
C: 0.280025849660723

>gi|27501067|ref|NT_011387.8|Hs20_11544 Homo sapiens chromosome 20 genomic conti

A: 0.289341611052337
T: 0.292519385980783
G: 0.209447573187511
C: 0.208691429779369

>gi|51475129|ref|NT_011362.9|Hs20_11519 Homo sapiens chromosome 20 genomic conti

A: 0.272576623010424
T: 0.275601255537864
G: 0.226143921896956
C: 0.225678199554756

>gi|51475228|ref|NT_028392.5|Hs20_28551 Homo sapiens chromosome 20 genomic contig

A: 0.264648247668827
T: 0.269204760324607
G: 0.234008321339583
C: 0.232138670666983

>gi|51475229|ref|NT_035608.2|Hs20_35770 Homo sapiens chromosome 20 genomic contig

A: 0.204456987154535
T: 0.215189345492966
G: 0.293360395929489
C: 0.286993271423011
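
A per-contig calculation like the one above can be sketched in Python as follows. This is not a port of contig_freq_count.pl: read_fasta and contig_frequencies are hypothetical names, and counting only A/T/G/C (ignoring case and symbols such as N) is my assumption.

```python
from collections import Counter

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def contig_frequencies(sequence):
    """Frequencies of A, T, G, C among the A/T/G/C bases of one contig."""
    counts = Counter(sequence.upper())
    total = sum(counts[b] for b in "ATGC")
    return {b: (counts[b] / total if total else 0.0) for b in "ATGC"}

# Tiny made-up example; for the real data, pass an open file handle instead.
example = [">contig_one", "AATT", "GC", ">contig_two", "GGCC", "GCAT"]
for header, seq in read_fasta(example):
    print(">" + header)
    for base, freq in contig_frequencies(seq).items():
        print(f"{base}: {freq}")
```

Since an open file is itself an iterable of lines, the real chromosome could be processed with something like read_fasta(open("hs_ref_chr20.fa")), which holds one contig in memory at a time.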