Last week, Doctor King organized a genomics workshop. I was told the reception on Monday was very nice... What a pity, I fell asleep that evening and missed it. :(
I attended several talks at the workshop, but the part I liked most was the lab session on assembly. I had some experience with gene prediction and microarray analysis before, but I did not know much about assembly. The lab session gave me the big picture, and I played around with some of the software and learned how to work with it.
Several developers gave tutorials during the lab session and introduced database tools like GMOD and TURNKEY. That was really new to me. I like to learn a lot of different things at school, not only my major. I think it will be helpful when I start to work after graduation.
Wednesday, December 15, 2010
How to annotate genes manually---for my final project
The final project of this class is to annotate an aphid gene. I don't want to go through the whole process in detail; I just want to share some experience and interesting things from our collaboration. Our team has two programmers, one mathematician, and one biologist.
1. First, we searched NCBI by the gene ID. There was no annotation for it on NCBI, but the other parts of the GenBank file were still useful throughout the process.
2. Download the sequence from GenBank and use RepeatMasker to mask repeats and transposons. There was something interesting here. After the computational analysis, our biologist wrote up the biological meaning, and I was very surprised to see that her report drew a lot of conclusions about transposons. I was confused: we had masked them at the beginning. But our biologist was 99.9% confident, so we reread the RepeatMasker introduction. It turns out it cannot mask every repeat, since some transposons contain genes, and masking all of them would also hurt the gene prediction. Bioinformatics really needs us to know three fields very well.
3. Use GeneMark, FGENESH, and GENSCAN to do the prediction. At first, one of our members got a result from GLIMMER, but the GLIMMER web server he chose was only for viruses and bacteria, not eukaryotes, so we did not use that result. When running each program, we also had to choose the model for the species. GeneMark and FGENESH support more species than GENSCAN. GENSCAN offered only three choices, two plants and human, so we had to choose human since the aphid is an animal. For GeneMark and FGENESH, our biologist helped us choose the most closely related species.
4. Once we had the results, we used Apollo to visualize them. That was the weirdest part. Apollo seems to be very strict about the input format. At first, I did not want to parse the GENSCAN file, since I had read that Apollo can load GENSCAN output directly. However, neither the GENSCAN nor the FGENESH files worked. I rechecked the manual and found a sentence saying, roughly, that the other formats are not well developed and you had better use GFF or GenBank format. So I parsed the file into GFF. At first, I only marked the exons in the GFF file, and the view in Apollo was not nice: only separate pieces of exons, with no links representing introns. (Later another team member added an mRNA annotation to the GFF file, and it worked!) Then I re-parsed it into a GenBank file, and that worked too. We were very surprised that even one extra space in the file stops Apollo from recognizing it!
5. Check the picture generated by Apollo, pick the predicted genes that overlap between programs, and use BLASTp to validate them. All of the programs predicted some very short genes. The long genes nearly always got good BLASTp hits (make sure to choose the nr database; the other databases are smaller, so their results are sometimes not as good). For the proteins with good hits, we chose about 30 different homologous proteins.
6. Use MEGA to generate the multiple alignment and phylogenetic trees. MEGA is easy to install on Windows. I used a lot of bioinformatics software this semester, and most of it is easy to install and use on Ubuntu, haha, that's why Ubuntu is so popular. However, some programs do not work well or are hard to install on Ubuntu. Cn3D was one of them, when I did my protein structure homework, and MEGA is another. Installing MEGA on Ubuntu requires Wine, which our programmer told me is a compatibility layer for running Windows programs on Ubuntu. So I installed MEGA on a lab computer instead, and it was really easy to use.
7. Finally, our biologist gave us a report on the biological meaning of what we did. It was a really nice collaboration experience, and I learned a lot from my teammates during this project.
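To make step 4 more concrete, here is a small Python sketch of writing one predicted gene as GFF with an mRNA row that the exon rows point back to. It is only an illustration: it assumes GFF3-style ID/Parent attributes, and the sequence name, gene name, and coordinates are made up. The exact columns our Apollo version expected are not recorded here, so treat the layout as an assumption.

```python
def prediction_to_gff3(seqid, source, gene_name, strand, exons):
    """Build GFF3 lines for one predicted gene: one mRNA row plus its exon rows.

    exons is a list of (start, end) pairs, 1-based and inclusive.
    """
    exons = sorted(exons)
    mrna_id = f"{gene_name}.mRNA"

    def row(feature_type, start, end, attributes):
        # Nine tab-separated columns; "." marks score and phase as not given.
        return "\t".join(
            [seqid, source, feature_type, str(start), str(end), ".", strand, ".", attributes]
        )

    lines = ["##gff-version 3"]
    # The mRNA spans from the first exon start to the last exon end.
    lines.append(row("mRNA", exons[0][0], exons[-1][1], f"ID={mrna_id}"))
    for number, (start, end) in enumerate(exons, start=1):
        # Parent ties each exon to the mRNA, so a viewer can link the exons together.
        lines.append(row("exon", start, end, f"ID={mrna_id}.exon{number};Parent={mrna_id}"))
    return lines


if __name__ == "__main__":
    # Hypothetical prediction with two exons on the forward strand.
    for line in prediction_to_gff3("scaffold_1", "GENSCAN", "gene1", "+", [(500, 610), (120, 340)]):
        print(line)
```

Joining the columns with a single tab and nothing else is deliberate, given how sensitive Apollo was to stray spaces.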
Monday, December 6, 2010
Thread for perl---learn this week
I have always wanted to learn how to use threads. This semester I am working on a project that needs to run blastx on 5000 genes, which took me almost 3 days. Besides taking a long time, it is also risky, since a run that long can easily break partway through. This week I learned how to use multiple threads in Perl. There is a lot to cover if you want to go deep into threads; here I just write down the basics so that we can start using threads quickly.
1. Load the module at the beginning: use threads;
2. Create a thread: $thread = threads->new(\&subroutine, @parameters). This creates the thread object and starts running the subroutine right away.
3. Wait for each thread to finish: $thread->join. join blocks until the thread is done and returns the subroutine's return value.
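The same create-then-join pattern can be sketched in Python (not Perl) for a job like the blastx run. This is only a sketch under assumptions: the gene IDs and the worker are stand-ins, and a real worker would launch blastx for each gene, for example through subprocess. Note that Python separates creating a thread from starting it, so there is an explicit start() call.

```python
import threading


def split_into_batches(items, batch_count):
    """Deal items round-robin into batch_count lists."""
    return [items[i::batch_count] for i in range(batch_count)]


def run_batches_in_threads(worker, items, batch_count=4):
    """Run worker(item) for every item, one thread per batch; return {item: result}."""
    results = {}
    lock = threading.Lock()

    def process(batch):
        for item in batch:
            outcome = worker(item)
            with lock:  # protect the shared dictionary
                results[item] = outcome

    threads = [
        threading.Thread(target=process, args=(batch,))
        for batch in split_into_batches(items, batch_count)
    ]
    for thread in threads:
        thread.start()  # launch every batch
    for thread in threads:
        thread.join()  # wait until every batch has finished
    return results


if __name__ == "__main__":
    # Stand-in worker: a real one would run blastx on the gene and return the output path.
    def fake_blastx(gene_id):
        return f"{gene_id}.blastx.out"

    print(run_batches_in_threads(fake_blastx, [f"gene{n}" for n in range(10)]))
```

Using a few batches instead of one thread per gene keeps the number of threads small, and if one batch breaks, only that batch has to be rerun.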
Actually, there are a lot of tricky things about threads; you could even write a book about them. My experience tells me the best way to learn coding is to read other people's code and write code yourself. When I learned C++ as an undergraduate, we were always asked to read the book, memorize rules, and take exams on those rules. I did not find that very helpful. At work we need to learn things very quickly, and I think the fastest way is to watch how others do it, then mimic and practice.
Tuesday, November 30, 2010
Analyze the Optical Maps
This week I presented a paper about optical maps with my group members. I focused on how to reconstruct the whole-genome map, so here I will discuss how we can reconstruct it from the single-molecule maps that optical mapping generates.
Motivation
Goal: Compare a new sequence to a reference sequence to see whether there are any differences.
The most straightforward way is to align the two nucleotide sequences. However:
(1) It is experimentally expensive.
(2) It is computationally expensive: aligning two sequences takes O(mn) time, where m and n are the lengths of the sequences.
Analyzing the result of optical mapping
What do we get from optical mapping?
Input: a lot of segments of the genome.
Output: for each segment, the cut positions and the size of the fragment between every two consecutive cuts.
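As a small illustration of this output, the fragment sizes follow directly from the cut positions. The positions below are made-up numbers, and the function name is my own.

```python
def fragment_sizes(cut_positions):
    """Return the sizes of the fragments between consecutive cuts on one segment."""
    cuts = sorted(cut_positions)
    # Each fragment runs from one cut to the next one.
    return [later - earlier for earlier, later in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    # Hypothetical cut positions (in kb) on a single segment.
    print(fragment_sizes([10, 30, 55, 62]))
```

The ordered list of sizes is what gets compared between segments, instead of the nucleotides themselves.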
Goal
Goal: Reconstruct the whole-genome restriction map, align it to the reference sequence, and check for variation.
Cluster
Goal: Find the clusters of segments that belong to the same region of the reference sequence.
Algorithm: Local optimal alignment.
Assemble
Goal: For each cluster, find a consensus map.
Algorithm: Multiple alignment.
Pairwise alignment
Goal: Align each assembled map contig to the reference sequence.
Algorithm: local optimal alignment.
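To give a feeling for what local alignment of restriction maps looks like, here is a toy Python sketch in the Smith-Waterman style that aligns two lists of fragment sizes instead of two nucleotide strings. It is not the scoring used in the paper: the match/mismatch/gap scores and the 10% size tolerance are my own assumptions, and it ignores missing or false cuts, which a real optical map aligner has to handle.

```python
def sizes_match(a, b, tolerance=0.1):
    """Treat two fragment sizes as equal if they differ by at most 10% of the larger one."""
    return abs(a - b) <= tolerance * max(a, b)


def local_alignment_score(query, reference, tolerance=0.1, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between two lists of fragment sizes."""
    rows, cols = len(query) + 1, len(reference) + 1
    # score[i][j] is the best score of an alignment ending at query[i-1], reference[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if sizes_match(query[i - 1], reference[j - 1], tolerance) else mismatch
            score[i][j] = max(
                0,                            # start a new local alignment
                score[i - 1][j - 1] + step,   # align the two fragments
                score[i - 1][j] + gap,        # skip a query fragment
                score[i][j - 1] + gap,        # skip a reference fragment
            )
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    # Hypothetical fragment sizes (kb): the contig matches the middle of the reference.
    print(local_alignment_score([5, 10, 20, 7], [3, 5, 10, 20, 7, 9]))
```

The table is still O(mn), but here m and n count fragments rather than nucleotides, so the maps being compared are far shorter than the sequences.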
Saturday, October 30, 2010
Microarray Analysis
Microarray analysis is a new thing I learned in this class. I also went to several seminars about ChIP-chip analysis. It is another topic in bioinformatics besides gene prediction. I will skip the microarray experiment itself and only talk about the analysis.
1. The goal of microarray analysis is to compare two gene lists: which genes are upregulated, and which are downregulated? We cannot simply say gene A is upregulated in whichever list shows more of gene A, since the measurements carry a lot of bias. We need statistical methods to find the genes whose upregulation is statistically significant.
2. Software for microarray analysis. In the project for this class, we used BRB-ArrayTools, which is an add-in for Microsoft Office (it runs inside Excel). But be careful: in my experience, Office 2003 is the best version for this add-in, and Office 2007 may crash. :( With BRB-ArrayTools, we import the data, then filter and normalize it. Then we tag each array (depending on how we want to compare them) and run the comparison. It generates a list of significantly different genes, plus heat maps and cluster trees if you select them. It's convenient, but I think understanding the whole process and the algorithms behind it would take a lot of time. I borrowed a book from the library, "Beginner for microarray", since I want to know the analysis process. I have only read a small part of it; it really covers a lot.
3. Principal component analysis and microarray data. This is the hardest part, since it needs a lot of linear algebra and statistics. I checked several references, and the following is my understanding of PCA. In gene expression arrays, many genes are often co-expressed in response to the same biological phenomenon, so there are a large number of measurements and the measurements are correlated. Principal component analysis is a method for reducing the dimensionality when one has correlated measurements. It forms linear combinations of a group of variables in such a way that the linear combinations represent the data well.
The idea of PCA is this: let X_1, ..., X_p be a set of real-valued random variables, define a vector a = (a_1, a_2, ..., a_p)^T, and seek a derived variable Z = a_1 X_1 + a_2 X_2 + ... + a_p X_p such that var(Z) is maximized under the constraint ||a|| = 1. The derived variable Z then attempts to capture the common variation in the variables X_i. Usually the single variable Z is not enough to represent the original variables X_1, ..., X_p. In that case we find a second derived variable, uncorrelated with the first, with the largest remaining variance, and so on.
More formally, the task is to find variables Z_k = a_1k X_1 + a_2k X_2 + ... + a_pk X_p, each uncorrelated with Z_1, ..., Z_(k-1), such that var(Z_k) is maximized under the constraint ||a_k|| = 1.
4. Sample PCA: Let us represent an expression array by x_ij, where i indexes one of the p genes and j indexes one of the n samples. The first sample principal component z_1j is defined to be the linear combination z_1j = sum_i a_i1 x_ij, with ||a_1|| = 1, that has the largest sample variance. To calculate the principal components, we only need the eigenvectors of the sample covariance matrix. These eigenvectors a_k are called eigenarrays, and the z_kj are called eigengenes.
After reading this, I feel much clearer about what PCA is and how it applies to microarray analysis.
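To check my understanding of sample PCA, here is a small pure-Python sketch that finds the first eigenarray a_1 and the first eigengene z_1j. It is only an illustration: it uses power iteration on the sample covariance matrix, which is my own choice for keeping the example short (a linear algebra library would normally do the eigendecomposition), and the expression values are made up. Power iteration can fail if the starting vector happens to be orthogonal to a_1.

```python
import math


def sample_covariance(x):
    """x[i][j] is gene i in sample j; return the p x p sample covariance matrix."""
    p, n = len(x), len(x[0])
    centered = []
    for row in x:
        mean = sum(row) / n
        centered.append([value - mean for value in row])
    return [
        [sum(centered[i][j] * centered[k][j] for j in range(n)) / (n - 1) for k in range(p)]
        for i in range(p)
    ]


def first_principal_component(x, iterations=500):
    """Return (a_1, z_1): the unit vector a_1 and the scores z_1j = sum_i a_i1 x_ij."""
    cov = sample_covariance(x)
    p, n = len(x), len(x[0])
    a = [1.0 / math.sqrt(p)] * p  # starting guess with ||a|| = 1
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale back to unit length.
        b = [sum(cov[i][k] * a[k] for k in range(p)) for i in range(p)]
        norm = math.sqrt(sum(value * value for value in b))
        if norm == 0.0:
            break  # no variance at all in the data
        a = [value / norm for value in b]
    z = [sum(a[i] * x[i][j] for i in range(p)) for j in range(n)]
    return a, z


if __name__ == "__main__":
    # Hypothetical data: 2 genes x 4 samples, gene 2 is always twice gene 1.
    a1, z1 = first_principal_component([[1, 2, 3, 4], [2, 4, 6, 8]])
    print(a1, z1)
```

In this made-up example the two genes are perfectly correlated, so a single component carries all of the variation, which is exactly the dimension reduction PCA is meant to give.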
Sunday, October 10, 2010
Perl for Bioinformatics
This semester I learned a lot about Perl and Linux. The following are useful links that I found while doing my homework:
Bioperl tutorial
FTP download blast database
Perl: Subroutines
Regular Expression
Tutorial for awk
1. Linux system: I started using Ubuntu this summer, when I took "Introduction to database". At first I tried to install MySQL and Java on Windows, but I failed. When I asked our TAs, they told me: "Sorry, we don't know how to install it on Windows, we only know Ubuntu". That was the first time I heard of Ubuntu, and then I installed it. That's my rule: if everyone around me uses one system or programming language, I'd better use the same one, since it keeps things compatible when we work together. Ubuntu actually has a lot of advantages. It's fast, and the Software Center is awesome: you don't need to worry about downloads and upgrades, because it handles them for you.
2. Perl: I learned Perl last year in the Bioinformatics class, but not systematically. This semester I learned it in more depth, including object-oriented programming and multithreading. BioPerl is really huge and contains a lot of modules. It even wraps some software as modules, like GeneWise and GENSCAN, so that we can parse the output as an object. However, my experience tells me you'd better run the program with "system" and parse the output file yourself with regular expressions, since then you know what is happening when the program throws an exception!! I only used the GenBank and BLAST objects in BioPerl, since I think these two are well developed. For other software, I'd rather parse the files myself.
3. Bioinformatics: From this course, I also got a systematic picture of bioinformatics and the problems that need to be solved.
Shotgun sequences--->(Assemble)--->Contigs---->(gene prediction)---->genes---->(phylogenetic analysis)---->phylogenetic tree---->biological meaning
Next semester, in computational genomics, I will learn more about these problems.
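The "run it yourself and parse the output with regular expressions" approach from point 2 looks roughly like this when sketched in Python instead of Perl. This is only an illustration under assumptions: the one-exon-per-line output format is invented for the example and is not the real output of GENSCAN or any other tool, and the external program is faked with a one-line Python command.

```python
import re
import subprocess
import sys

# Invented output format for illustration: "exon <start> <end> <strand>".
EXON_LINE = re.compile(r"^\s*exon\s+(\d+)\s+(\d+)\s+([+-])\s*$")


def run_tool(command):
    """Run an external program and return its stdout; raise with its stderr if it fails."""
    done = subprocess.run(command, capture_output=True, text=True)
    if done.returncode != 0:
        # Surfacing stderr here is what makes failures easy to understand.
        raise RuntimeError(f"{command[0]} exited with {done.returncode}: {done.stderr.strip()}")
    return done.stdout


def parse_exons(text):
    """Pull (start, end, strand) out of every line that looks like an exon record."""
    exons = []
    for line in text.splitlines():
        found = EXON_LINE.match(line)
        if found:
            exons.append((int(found.group(1)), int(found.group(2)), found.group(3)))
    return exons


if __name__ == "__main__":
    # Stand-in for a gene predictor: a tiny Python command that prints fake output.
    fake_tool = [
        sys.executable,
        "-c",
        "print('exon 120 340 +'); print('# summary line'); print('exon 500 610 +')",
    ]
    print(parse_exons(run_tool(fake_tool)))
```

Keeping the run step and the parse step separate means the raw output can be saved and inspected when something goes wrong, which is the main reason I prefer this over a wrapper module.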
Sunday, September 19, 2010
maolilan---Introduce myself
Name: Tianjun Ye
Major: PhD mathematics
Hi, I am a PhD student in the math department, and I am also interested in computational biology.
I like traveling and hiking. This semester I took several exciting trips, so let me post some beautiful pictures:
| Rocky Mountain---Colorado |
| Rocky Mountain---Colorado |
| Rocky Mountain---Colorado |
| Biltmore Estate |
| Biltmore Estate |
| Smoky Mountain |
| Brasstown |
| Smoky Mountain |
| Grand Lake |
| Upper Michigan |
| Upper Michigan |
