Saturday, October 30, 2010

Microarray Analysis

Microarray analysis is something new I learned in this class, and I also went to several seminars about ChIP-chip analysis. It is another topic in bioinformatics besides gene prediction. I will skip the microarray experiment itself and only talk about the analysis part.

1. The goal of microarray analysis is to compare two gene lists: between the two, which genes are upregulated, and which are downregulated? We cannot say gene A is upregulated just because one gene list contains more of gene A than the other, since the raw measurements carry a lot of bias. We need statistical methods to find the genes whose upregulation is statistically significant.
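To make "statistically significant" a bit more concrete, here is a minimal sketch of a per-gene two-sample test. It uses Welch's t-statistic, which is my choice for illustration and not necessarily what any particular tool computes. The gene names and expression values are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic: difference of means over its standard error."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical log2 expression values: gene -> (condition A arrays, condition B arrays)
expression = {
    "geneA": ([8.1, 8.3, 7.9, 8.2], [10.0, 10.4, 9.8, 10.1]),
    "geneB": ([6.0, 9.5, 7.2, 8.8], [7.1, 9.0, 6.5, 9.3]),
}

def rank_genes(data):
    """Return (gene, t) pairs sorted by |t|, largest first.

    t > 0 means the gene is higher in condition B than in condition A.
    """
    scores = [(gene, welch_t(b, a)) for gene, (a, b) in data.items()]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

for gene, t in rank_genes(expression):
    direction = "up" if t > 0 else "down"
    print(f"{gene}: t = {t:.2f} ({direction} in condition B)")
```

geneB has a mean difference too, but its values are so spread out that the difference is small next to the noise, so it ranks far below geneA. A real analysis would also turn the statistic into a p-value and correct for testing thousands of genes at once.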

2. Software for microarray analysis. In the project for this class, we used BRB-ArrayTools to do the microarray analysis. BRB-ArrayTools is an add-in for Microsoft Excel (part of Microsoft Office). But be careful: from my experience, Office 2003 is the best one to install this add-in on, and Office 2007 may crash. :( With BRB-ArrayTools, we need to import the data, then filter and normalize it. Then we add a tag to each array (depending on how we want to compare them) and run the comparison. It generates a list of significantly different genes, plus heat maps and cluster trees if you select them. It's convenient; however, I think it would take a lot of time to understand the whole process and the algorithms behind it. I borrowed a book from the library, "Beginner for microarray", since I want to know how the analysis works. I have only read a small part of it, and it really covers a lot.
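To get a feeling for what "filter and normalize" can mean, here is a small generic sketch: log2 transform, median-centering of each array, and a simple variance filter. These are common textbook choices that I picked for illustration. I am not claiming this is what BRB-ArrayTools does internally, and the intensities and the threshold are made up.

```python
from math import log2
from statistics import median, variance

# Hypothetical raw intensities: gene -> one value per array
raw = {
    "gene1": [200.0, 400.0, 800.0],
    "gene2": [100.0, 210.0, 390.0],
    "gene3": [50.0, 100.0, 200.0],
}

def log_transform(data):
    """Put intensities on a log2 scale."""
    return {gene: [log2(v) for v in values] for gene, values in data.items()}

def median_center(data):
    """Subtract each array's median so the arrays are comparable."""
    genes = list(data)
    n_arrays = len(data[genes[0]])
    medians = [median(data[g][j] for g in genes) for j in range(n_arrays)]
    return {g: [data[g][j] - medians[j] for j in range(n_arrays)] for g in genes}

def filter_by_variance(data, min_var):
    """Keep only genes whose values vary across arrays by at least min_var."""
    return {g: values for g, values in data.items() if variance(values) >= min_var}

normalized = median_center(log_transform(raw))
kept = filter_by_variance(normalized, min_var=0.001)
print(sorted(kept))
```

In this toy data the brightness differences between the three arrays disappear after centering, and gene2, which is flat once the arrays are comparable, is filtered out.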

3. Principal component analysis and microarray data. This is the hardest part, since it needs a lot of linear algebra and statistics. I checked several references, and the following is my understanding of PCA. In gene expression arrays, many genes are often co-expressed in response to the same biological phenomenon, so there is a large number of measurements and the measurements are correlated. Principal component analysis is a method for reducing the dimensionality of correlated measurements. It forms linear combinations of a group of variables in such a way that a few of these combinations represent the data well.
The idea of PCA is this: let X_1, ..., X_p be a set of real-valued random variables, define a vector a = (a_1, a_2, ..., a_p)^T, and seek a derived variable Z = a_1 X_1 + a_2 X_2 + ... + a_p X_p such that var(Z) is maximized under the constraint ||a|| = 1. The derived variable Z then captures the common variation in the variables X_i. Usually a single variable Z is not enough to represent the original variables X_1, ..., X_p. In that case we find a second derived variable, uncorrelated with the first, that has the largest remaining variance, and so on.
More formally, the task is to find mutually uncorrelated variables Z_k such that Z_k = a_1k X_1 + a_2k X_2 + ... + a_pk X_p and var(Z_k) is maximized under the constraint ||a_k|| = 1.
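The step from "maximize var(Z)" to an eigenvector problem can be filled in with a Lagrange multiplier. Here is a short derivation for the first component, writing Σ for the covariance matrix of (X_1, ..., X_p) and λ for the multiplier (my notation):

```latex
\begin{aligned}
\operatorname{var}(Z) &= \operatorname{var}(a^{T}X) = a^{T}\Sigma a, \\
L(a,\lambda) &= a^{T}\Sigma a - \lambda\,(a^{T}a - 1), \\
\frac{\partial L}{\partial a} &= 2\Sigma a - 2\lambda a = 0
  \;\Longrightarrow\; \Sigma a = \lambda a, \\
\operatorname{var}(Z) &= a^{T}\Sigma a = \lambda\, a^{T}a = \lambda .
\end{aligned}
```

So a must be a unit eigenvector of Σ, and var(Z) equals its eigenvalue. The maximum is reached by the eigenvector with the largest eigenvalue, and the later Z_k come from the eigenvectors with the next largest eigenvalues.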

4. Sample PCA: Let us represent an expression array by x_ij, where i indexes one of the p genes and j indexes one of the n samples. The first sample principal component z_1j is defined as the linear combination z_1j = sum_i a_i1 x_ij, with ||a_1|| = 1, that has the largest sample variance. To calculate the principal components, we only need to calculate the eigenvectors of the sample covariance matrix. These eigenvectors a_k are called eigenarrays, and the z_kj are called eigengenes.
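As a check on my understanding, here is a minimal sketch that computes the first eigenarray a_1 and eigengene z_1j for a tiny made-up matrix. To stay in plain Python it uses power iteration instead of a full eigendecomposition. That is a simplification I chose, and it only finds the eigenvector with the largest eigenvalue.

```python
from math import sqrt

# Hypothetical expression matrix x[i][j]: p = 3 genes (rows), n = 4 samples (columns)
x = [
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.1, 2.9, 4.0],
    [5.0, 4.9, 5.1, 5.0],
]

def center_rows(matrix):
    """Subtract each gene's mean across samples."""
    centered = []
    for row in matrix:
        m = sum(row) / len(row)
        centered.append([v - m for v in row])
    return centered

def covariance(centered):
    """p x p sample covariance matrix between genes."""
    n = len(centered[0])
    return [[sum(a * b for a, b in zip(ri, rk)) / (n - 1) for rk in centered]
            for ri in centered]

def first_eigenvector(cov, iterations=200):
    """Power iteration: repeatedly apply cov and rescale to unit length."""
    a = [1.0] * len(cov)
    for _ in range(iterations):
        a = [sum(c * v for c, v in zip(row, a)) for row in cov]
        norm = sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]
    return a

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

xc = center_rows(x)
cov = covariance(xc)
a1 = first_eigenvector(cov)  # first eigenarray, one weight per gene
# First eigengene: z_1j = sum_i a_i1 * x_ij, using the centered data
z1 = [sum(a1[i] * xc[i][j] for i in range(len(xc))) for j in range(len(xc[0]))]

print("a_1:", [round(v, 3) for v in a1])
print("var(z_1):", round(sample_variance(z1), 3))
```

The variance of z_1 should be at least as large as the variance of any single gene, because picking one gene alone is just another unit-length choice of a.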


After reading this, I feel much clearer about what PCA is and how it applies to microarray analysis.

Sunday, October 10, 2010

Perl for Bioinformatics

This semester I learned a lot about Perl and the Linux system. The following are useful links that I found while doing my homework:
Bioperl tutorial
FTP download blast database
Perl: Subroutines
Regular Expression
Tutorial for awk

1. Linux system: I started to use Ubuntu this summer, when I took "Introduction to database". At first I tried to install MySQL and Java on Windows, but I failed. When I went to ask our TAs, they told me: "Sorry, we don't know how to install it on Windows, we only know Ubuntu". That was the first time I heard of Ubuntu, and then I installed it. That's the rule: if everyone around me uses one system or programming language, I'd better use the same one, since it makes it easier to stay compatible when working together. Actually, Ubuntu has a lot of advantages. It's fast, and the software center is awesome: you don't need to worry about downloads and upgrades/updates, since it handles them for you.

2. Perl: I learned Perl last year in the Bioinformatics class, but not systematically or professionally. This semester I learned it in more depth, including object-oriented programming and multithreading. BioPerl is really huge and contains a lot of modules. It even wraps some software as modules, like GeneWise and GENSCAN, so we can parse their output as objects. However, my experience tells me you'd better use "system" to run the program and parse the output file with regular expressions yourself, since then you know what is happening when the program throws an exception!! I only used the GenBank and BLAST objects in BioPerl, since I think these two are well developed. For other software, I'd rather parse the file myself.
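The "run it yourself, parse it yourself" approach looks roughly like this. The sketch is in Python rather than Perl, and the "tool" is only a stand-in one-liner that prints made-up, tab-separated predictions, so the output format here is hypothetical and not that of any real gene finder.

```python
import re
import subprocess
import sys

# Stand-in for an external gene finder (hypothetical output format):
# name <tab> start <tab> end <tab> strand, plus a comment line.
FAKE_TOOL = [
    sys.executable, "-c",
    "print('gene1\\t100\\t450\\t+'); print('gene2\\t900\\t1300\\t-'); print('# done')",
]

# One prediction per line; anything else (comments, banners) is skipped.
LINE = re.compile(r"^(\S+)\t(\d+)\t(\d+)\t([+-])$")

def run_and_parse(command):
    """Run an external command and parse its output with a regular expression."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # We see the real exit code and error text instead of a wrapper's guess.
        raise RuntimeError(f"tool failed ({result.returncode}): {result.stderr.strip()}")
    genes = []
    for line in result.stdout.splitlines():
        match = LINE.match(line)
        if match:
            name, start, end, strand = match.groups()
            genes.append((name, int(start), int(end), strand))
    return genes

for name, start, end, strand in run_and_parse(FAKE_TOOL):
    print(name, start, end, strand)
```

The point is that the exit code, the error text, and every unparsed line stay visible, so when something goes wrong it is clear whether the program or the parsing failed.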

3. Bioinformatics: From this course, I also got a systematic picture of bioinformatics and the problems it needs to solve:
Shotgun sequences ---> (assembly) ---> contigs ---> (gene prediction) ---> genes ---> (phylogenetic analysis) ---> phylogenetic tree ---> biological meaning
Next semester, in Computational Genomics, I will learn more about these problems.