The final project of this class was to annotate an aphid gene. I don't want to dwell on the process itself; I just want to share some experiences and interesting things from our collaboration. Our team had two programmers, one mathematician, and one biologist.
1. First, we searched NCBI by the gene ID. There was no annotation for it on NCBI, but the other parts of the GenBank file were still useful throughout the process.
2. Download the sequence from GenBank and use RepeatMasker to mask repeats and transposons. Something interesting happened with this step. After the computational analysis, our biologist wrote up the biological meaning of the results, and I was very surprised to see that her report drew a lot of conclusions about transposons. I was confused: hadn't we masked them at the beginning? But our biologist was 99.9% confident, so we rechecked the RepeatMasker introduction. It turns out that it cannot mask every repeat, since some transposons carry genes; also, masking all of them would be a disadvantage for gene prediction. Oh, bioinformatics really needs us to know three fields very well.
3. Use GeneMark, FGENESH, and GENSCAN to do the prediction. At first, one of our members got a result from GLIMMER, but the GLIMMER web server he chose is only for viruses and bacteria, not eukaryotes, so we did not use that result. Also, when running each program, we need to choose the model for the species. GeneMark and FGENESH support more species than GENSCAN. GENSCAN only has three species, two plants and human; we had to choose human since the aphid is an animal. For GeneMark and FGENESH, our biologist helped us choose the most closely related species.
4. Once we had the results, we needed Apollo to visualize them. That was the weirdest part. Apollo seems to be very strict about its input format. At first, I didn't want to parse the GENSCAN file, since I had found that Apollo is supposed to accept GENSCAN files. However, neither the GENSCAN nor the FGENESH files worked. I rechecked the manual and found a sentence like: other formats are not well developed, so you'd better use GFF or GenBank format. Oh~~~ So I parsed the file into GFF. At first, I only marked the exons in the GFF file, and the view in Apollo was not nice: only separate pieces of exons, with no links representing the introns. (Later another member added mRNA annotations to the GFF file, and it worked!!) Then I re-parsed it into a GenBank file, and that worked too. We were very surprised here: Apollo cannot recognize a file with even one extra space!!!
5. Check the picture generated by Apollo, pick out the overlapping predicted genes, and use BLASTp to validate them. All of the programs predicted some very short genes. Long genes always got good BLASTp hits (make sure to choose the nr database here; results from other databases are sometimes not as good, since they are smaller). For the proteins with good hits, we chose about 30 different homologous proteins.
6. Use MEGA to generate multiple alignments and phylogenetic trees. I think MEGA is easy to install on Windows. I used a lot of bioinformatics software this semester, and most of it is easy to install and use on Ubuntu, haha, that's why Ubuntu is so popular. However, some programs don't work well or are hard to install on Ubuntu. Cn3D was one of them, when I did the protein structure homework, and MEGA is another. MEGA needed Wine when I tried to install it on Ubuntu, and our programmer told me Wine is used for Windows compatibility on Ubuntu. So I installed MEGA on a lab computer, and it was really easy to use.
7. Finally, our biologist gave us the report on the biological meaning of what we had done. It was a really nice collaboration experience, and I learned a lot from my teammates during this project.
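For anyone who gets stuck at step 4 like I did, here is a minimal Python sketch of the idea: write each predicted gene as a gene line, an mRNA line, and exon lines that point back to the mRNA, so a viewer can link the exons together. This is not the script we actually used. The function name `exons_to_gff3`, the IDs, the coordinates, and the choice of GFF3-style `ID`/`Parent` attributes are all my assumptions for illustration; check which GFF version your Apollo release expects. The columns are joined with single tabs only, since stray spaces were exactly what broke our files.

```python
def exons_to_gff3(seqid, source, gene_id, strand, exons):
    """Return GFF3-style lines (gene, mRNA, exons) for one predicted gene.

    exons is a list of (start, end) pairs, 1-based and inclusive.
    All names here are hypothetical, for illustration only.
    """
    exons = sorted(exons)
    start = exons[0][0]
    end = max(e for _, e in exons)
    mrna_id = gene_id + ".mRNA"

    def row(ftype, s, e, attrs):
        # Nine tab-separated columns; score and phase are left as ".".
        return "\t".join(
            [seqid, source, ftype, str(s), str(e), ".", strand, ".", attrs]
        )

    lines = ["##gff-version 3"]
    lines.append(row("gene", start, end, "ID=" + gene_id))
    # The mRNA feature is what ties the exons into one transcript.
    lines.append(row("mRNA", start, end, "ID=%s;Parent=%s" % (mrna_id, gene_id)))
    for i, (s, e) in enumerate(exons, 1):
        attrs = "ID=%s.exon%d;Parent=%s" % (mrna_id, i, mrna_id)
        lines.append(row("exon", s, e, attrs))
    return lines


if __name__ == "__main__":
    # Made-up coordinates, not from our real prediction.
    for line in exons_to_gff3("scaffold1", "GENSCAN", "gene1", "+",
                              [(100, 200), (300, 450)]):
        print(line)
```

The parser that reads the GENSCAN or FGENESH output into the `exons` list is left out, since it depends on the exact output format of each program.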