Last week, Doctor King organized a genomics workshop. I was told the reception on Monday was very nice... What a pity, I fell asleep that evening and missed it. :(
I attended several talks at the workshop, but the part I liked most was the Lab Session on assembly. I had some experience with gene prediction and microarray analysis before, but I did not know much about assembly. The Lab Session gave me the big picture of assembly, and I played around with some of the software and learned how to work with it.
Several developers gave tutorials in the Lab Session and introduced database tools like GMOD and TURNKEY. That was really new to me. I like to learn a lot of different things in school, not only my major. I think it will be helpful when I start to work after graduation.
Wednesday, December 15, 2010
How to annotate genes manually---for my final project
The final project of this class is to annotate an aphid gene. I don't want to dwell on the process itself; I just want to share some experience and interesting things from our collaboration. Our team has two programmers, one mathematician, and one biologist.
1. First, we went to NCBI and searched by the gene ID. There was no annotation for it on NCBI, but the other parts of the GenBank file were still useful throughout the process.
2. Download the sequence from GenBank and use RepeatMasker to mask repeats and transposons. There was something interesting in this part. After the computational analysis, our biologist gave us the biological interpretation. I was very surprised to see her report: she drew a lot of conclusions about transposons. I was confused, because we had masked them at the beginning. However, our biologist was 99.9% confident, so we rechecked the RepeatMasker introduction. It turns out it cannot mask every repeat, since some transposons carry genes; also, masking all of them would hurt the gene prediction. Oh, bioinformatics really needs us to know three fields very well.
3. Use GeneMark/FGENESH/GENSCAN to do the prediction. At first, one of our members got a result from Glimmer, but the Glimmer web server he chose was only for viruses and bacteria, not eukaryotes, so we did not use that result. Also, when we ran the programs, we needed to choose the species model. GeneMark and FGENESH support more species than GENSCAN. GENSCAN offered only 3 choices, 2 plants and human, so we had to choose human since the aphid is an animal. For GeneMark and FGENESH, our biologist helped us choose the most closely related species.
4. Once we had the results, we needed Apollo to visualize them. That was the weirdest part. Apollo seems to be very strict about the input format. At first I did not want to parse the GENSCAN file, since I had found that Apollo can read GENSCAN output. However, neither the GENSCAN nor the FGENESH files worked. I rechecked the manual and found a sentence along the lines of: other formats are not well supported, so you had better use GFF or GenBank format. Oh~~~ So I parsed the file into GFF. At first I only marked the exons in the GFF file, and the view in Apollo was not nice: just separate exon pieces, with no links representing introns. (Later another team member added mRNA annotation to the GFF file, and it worked!!) Then I re-parsed it into a GenBank file, and that worked too. We were very surprised here: Apollo could not recognize the file if it had even one extra space!!!
5. Check the picture generated by Apollo, pick out the overlapping predicted genes, and use BLASTp to validate them. All of the programs predicted some very short genes. The long genes always got good BLASTp hits (make sure to choose the nr database here; results from other databases are sometimes not as good because they are smaller). For the proteins with good hits, we chose about 30 different homologous proteins.
6. Use MEGA to generate the multiple alignment and phylogenetic trees. I think MEGA is easy to install on Windows. I used a lot of bioinformatics software this semester, and most of it is easy to install and use on Ubuntu, haha, that's why Ubuntu is so popular. However, some programs do not work well on Ubuntu or are hard to install there. Cn3D was one of them, when I did my protein structure homework, and MEGA is another. MEGA needed Wine when I tried to install it on Ubuntu, and our programmer told me Wine is used to run Windows programs on Ubuntu. So I installed MEGA on a lab computer, and it was really easy to use.
7. Finally, our biologist gave us her report on the biological meaning of what we did. It was a really nice collaboration experience. I learned a lot from my teammates during this project.
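To make the GFF fix from step 4 concrete, here is a minimal sketch of writing one predicted gene with a gene row, an mRNA row, and exon rows that point back to the mRNA. It assumes GFF3-style ID/Parent attributes; I am not sure this is exactly the layout our Apollo version wanted, and the scaffold name, gene ID, and coordinates are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Build GFF3 lines for one predicted gene: a gene row, an mRNA row, and one
# exon row per exon. All names and coordinates used below are hypothetical.
sub gene_to_gff3 {
    my ($seqid, $source, $gene_id, $strand, @exons) = @_;

    # Order exons by start; the gene spans the smallest start to the largest end.
    @exons = sort { $a->[0] <=> $b->[0] } @exons;
    my $start   = min(map { $_->[0] } @exons);
    my $end     = max(map { $_->[1] } @exons);
    my $mrna_id = "$gene_id.mRNA1";

    # Nine tab-separated columns:
    # seqid, source, type, start, end, score, strand, phase, attributes
    my @lines = (
        join("\t", $seqid, $source, 'gene', $start, $end, '.', $strand, '.',
             "ID=$gene_id"),
        join("\t", $seqid, $source, 'mRNA', $start, $end, '.', $strand, '.',
             "ID=$mrna_id;Parent=$gene_id"),
    );

    # Each exon names the mRNA as its Parent, which ties the pieces together.
    my $n = 0;
    for my $exon (@exons) {
        $n++;
        push @lines, join("\t", $seqid, $source, 'exon', $exon->[0], $exon->[1],
                          '.', $strand, '.', "ID=$mrna_id.exon$n;Parent=$mrna_id");
    }
    return @lines;
}

# Hypothetical three-exon gene on a made-up scaffold.
my @gff = gene_to_gff3('scaffold_1', 'GENSCAN', 'gene1', '+',
                       [100, 250], [400, 520], [700, 910]);
print "##gff-version 3\n";
print "$_\n" for @gff;
```

Without the mRNA row and the Parent links, each exon is an unrelated feature, which matches what we saw in Apollo at first: only pieces of exons.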
Monday, December 6, 2010
Threads in Perl---learned this week
I have always wanted to learn how to use threads. This semester I am working on a project that needs to run blastx on 5000 genes, which took me almost 3 days. Besides taking a long time, it is also very risky, since a job that runs that long can easily break partway through. This week I learned how to use multiple threads in Perl. There is a lot to cover if you want to go deep into threads. Here I just write down the basics, so that we can start using threads quickly.
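1. Load the module at the beginning: use threads; (note the plural, the module is threads, not thread).
2. Construct a thread object: $thread = threads->new(\&subroutine, parameters of the subroutine). Here new is an alias for create, and the thread starts running the subroutine as soon as it is created.
3. Wait for each thread to finish, one by one: $thread->join. This also gives back whatever the subroutine returned.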
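The three steps above can be sketched as a small script. This is only a minimal sketch, assuming a Perl that was built with thread support; process_batch and the gene IDs are made up, and the worker just reports its input instead of really calling blastx.

```perl
use strict;
use warnings;
use threads;    # the module name is plural

# Hypothetical worker: stands in for running blastx on one batch of genes.
sub process_batch {
    my ($batch_id, @genes) = @_;
    # A real worker would run blastx here; this one only reports its input.
    return "batch $batch_id: " . scalar(@genes) . " genes";
}

# Hypothetical gene IDs, split into two batches.
my @batches = ([qw(gene1 gene2 gene3)], [qw(gene4 gene5)]);

# threads->create (threads->new is an alias) starts each thread right away.
my @workers;
for my $i (0 .. $#batches) {
    push @workers, threads->create(\&process_batch, $i, @{ $batches[$i] });
}

# join waits for a thread to finish and returns the subroutine's return value.
my @results = map { $_->join } @workers;
print "$_\n" for @results;
```

Creating all the threads first and joining them afterwards is what lets the batches run at the same time; joining each thread right after creating it would run them one after another again.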
Actually, there are a lot of tricky things about threads; you could even write a book about them. My experience tells me the best way to learn coding is to read other people's code and write code yourself. When I learned C++ as an undergraduate, we always had to read the book, memorize rules, and take exams on the rules. I just felt that was not very helpful. I think at work we need to learn things very quickly, and the fastest way to learn is to watch how others do it, then mimic and practice.