Homework 2 - BLAST

Part 1:

1. The nr database is "non-redundant". This is advantageous for a number of reasons, almost all of which come from the fact that it is a smaller database file than if you had just blindly combined Swiss-Prot, GenBank, and the model organism databases. The smaller size is good for computational reasons, as BLAST can run faster. It's good for statistical reasons because a smaller database means you would expect to see fewer high-scoring hits by chance alone. It's also good for biological reasons because in many cases two labs have sequenced the same gene and given it two different names. The nr database will hopefully prevent the confusion this causes.

2. 40

3. Saccharomyces castellii

4. Score = 808 bits, %identity = 52%.

5. I got 39 hits with an e-value < 1. This is fewer than before because we changed the matrix to BLOSUM80, which looks for more identity because it was built from protein alignments clustered at 80% identity. The BLAST search is therefore more stringent, because the alignments have to be better in order to score well. This matrix is better than BLOSUM62 for differentiating between closely related sequences.

6. Saccharomyces dairenensis

7. Score = 825 bits, %identity = 66%.

8. 46 hits, fewer than before. This might be due to two reasons:
   1: The extension penalty was increased, so long gaps incur a higher penalty, resulting in a lower number of hits.
   2: We have decreased the penalty for small gaps, so many more hits will score well. This raises the score threshold needed for a statistically significant hit, resulting in fewer significant hits.

9. The score is lower. This is because BLAST has now inserted more gaps and the gap extension penalty is larger.

10. This will likely take longer. Lowering the word size means more hits will be found initially, which means more sequences that need to be extended and joined together, making more work for BLAST.

Part 2:

Your favorite gene was part of the see my example code.
I also posted code from a student in the class who solved the problem slightly differently than I did. Hopefully this shows you that there are many, many different ways to tackle a programming problem.
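
As a side note on the statistical argument in Part 1, question 1, here is a minimal sketch of why database size matters. It uses the Karlin-Altschul relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The function name and the lengths below are made-up illustration values, not real database sizes, and real BLAST uses corrected "effective" lengths, so treat this as an approximation only.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a given bit score: E = m * n * 2**(-S').

    Simplification: uses raw lengths rather than the edge-corrected
    effective lengths that BLAST itself reports.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to illustrate the scaling.
query_len = 500                  # residues in the query
redundant_db = 4_000_000_000     # total residues if databases were naively merged
nonredundant_db = 1_000_000_000  # total residues after removing duplicates

for name, db_len in [("redundant", redundant_db), ("non-redundant", nonredundant_db)]:
    e = expected_hits(50.0, query_len, db_len)
    print(f"{name:>14}: E-value for a 50-bit hit = {e:.2e}")
```

Because E is proportional to n, the same 50-bit alignment has an E-value four times larger against the database that is four times bigger, so it looks less significant there.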