Three-layer approach to reconstruct condition specific gene network

Protocol: Three-layer approach to reconstruct condition specific gene network

Created by Ming Wu, 07/30/2012

The three layers are: 
  1. Identification of candidate genes
  2. Identification of regulators I: based on dependencies in gene expression
  3. Identification of regulators II: TF activity estimation

0. Prepare the microarray data

The approach requires multiple condition to compare in order to find the "unique" changes in your condition of interest. Thus an integrative dataset is needed. Microarrays from different conditions (but of a same system/species, assuming the generic gene network underlying these different phenotypes are mostly the same) can be combined and normalized to build the "integrative dataset". The microarrays must be in a same platform. 

Here I show an example to normalize different microarray data (human samples, affy array) using GCRMA approach with MATLAB.
more information and references of RMA/GCRMA can be found here: (http://www.mathworks.com/help/toolbox/bioinfo/ref/gcrma.html
It is pretty easy, put every souce file (.CEL files) in the working folder and execute this one line in MATLAB:
exp=affygcrma('*','HG-U133A.CDF','HG-U133A_probe_tab','CELPath', '/mnt/home/wuming1/MyR/ImicroArray/humanETABM/E-TABM185_CEL');
In this command, the '*' means to normalize any CEL files in this folder, the 'CELPath' defines the searching folder (location where the CEL files are), and the 'HG-U133A.CDF' and 'HG-U133A_probe_tab' are mapping annotations for the affy platform. You should be able to find these two files from Affy Product information or by Google. For human data, I include this two files in the (example/) folder in this protocol. The final normalized expression matrix will be stored at variable "exp".

Note: This requires the Bioinformatics toolbox in MATLAB. One could also use Bioconductor in R to do similar thing, one can refer to
http://www.bioconductor.org/packages/release/bioc/html/gcrma.html 

The normalization does take some time to compute, when you have too many array data to integrate, e.g. more than 200 samples, MATLAB in a laptop, desktop or even a 32G MEM tower server will report OUT of MEM error. In this situation it is better to run it on a HPCC server. The differences are that you have to put the command in a M file and use another server-defined job loader to call the M file into MATLAB, since you will not have the graphic interface of MATLAB. Here is an example to do it on MSU hpcc server:

The file for MATLAB is:
 my01.m
exp=affygcrma('*','HG-U133A.CDF','HG-U133A_probe_tab','CELPath', '/mnt/home/wuming1/MyR/ImicroArray/humanETABM/E-TABM185_CEL');
save('matlab1.mat','exp');
quit;

The file for hpcc to submit a job is:

 myJob01.qsub
#!/bin/bash -login

#PBS -N MATLABmicroarrayHuman1
#PBS -l nodes=1:ppn=1,walltime=10:00:00,mem=100gb
#PBS -W x=gres:MATLAB module load powertools
cd /mnt/home/wuming1/MyR/ImicroArray/
matlab -nodisplay -nosplash -r "my01" -logfile mylog01.log
One should refer to HPCC document at https://wiki.hpcc.msu.edu/display/hpccdocs/Documentation+and+User+Manual to know the parameters here and to know how to submit a job with this code. The "walltime" is how long you expect the job will be running, based on our experience, for normalizing ~6000 slides (human affy HG-U133A), ~>10 hours are necessary. 


1. Identification of candidate genes
We use ReliefF algorithm to identify genes in the integrative dataset. The description of ReliefF can be found in our paper.  
our sample MATLAB code of ReliefF is provided in this protocol : Re1.m 
Re1.m


(score, sorted-Score, ranking)=Re1(data, distanceMatrix, Ylabel, k)

for which
Inputs are:
  • "data": input data file, rows are samples, columns are features(genes).
  • "distanceMatrix": the pair-wise distance between samples.
  • "Ylabel": the label of samples, for example, 0 for cold, 1 for warm; or 1 for breast cancer, 0 for all others.
  • "k": the parameter in relief algorithm, I usually use k=10. 

ReliefF is based on "Nearest Neighbors", thus it is required to have a distance matrix to describe the similarity of transcriptomes. To obtain the distanceMatrix one could use: 

squareform(pdist(data));

Euclidean distance is by default and works ok as in our paper. If one has to try other distance metrics, there are different metrics in the pdist function. ReliefF takes in whatever distanceMatrix that is given.

(The parameter k is a bit tricky, generally 10 should be ok, but if you think there is more intriguing sub-types that contains less than 10 samples, you could try less than that. note that k should be smaller than the number of samples in a group, meaning if you only have 10 samples in your condition of interest, it is required to use k less than 10. please refer to algorithm ReliefF for more understandings of this parameter)

The results will show the "score" for each gene. The top-ranked genes are the candidate genes. There is no pre-defined explicit cut-off, one could plot the scores to see if there are steep drops in the scores so as to select the top few genes.  The algorithm helps to find the most specific genes but one has to look into the top-ranked genes to see how many genes to pursue in the next steps. 

When there is missing data, the ReliefF still works but there will be "NaNs" for the sorting and ranking. One may need to fix the missing data (either remove the gene/sample, or fill with mean of gene/samples). If not, one can still get the ReliefF scores and sort them separately, e.g. in Excel.



2. Identification of regulators I: based on dependences in expression

The approach is based on the calculation of mutual information between genes. The reason to use mutual information instead of pearson correlation is that the mutual information account for any type of dependencies, linear or nonlinear. 

The code for computing mutual information is : FastPairMI.m
FastPairMI.m


If you are interested in the algorithm, this is the website: http://icbp.stanford.edu/software/FastPairMI/ , you need to cite this work if use the alglrithm. There is another way of computing MI, which I introduced in another Protocol [Computing mutual information], which is a more "controllable" way that one can estimate information gain for adding features, but is pretty slow and not as convenient as the fast one when apply to large number of gene pairs. 

to compute the unconditional mutual information, which is the mutual information between any two genes for all conditions:

MIall=FastPairMI(M, 1);

in which M is the data matrix for all conditions, with each row is a gene's expression.

To compute the conditional mutual information, do the same thing

MIcon=FastPairMI(M',1);

but this time the M' only include the condition of interest.
Note that "1" is the std of the Gaussian kernel for density estimation in order to compute mutual information. (you can adjust the number if your expression data is of larger or smaller scale as compared with human U-133G (expression level 0-14)).

Now the (MIcon-MIall) is the score. But we want to integrate the PPI and P-DNA interaction data and only account for the direct interactions to the candidate genes. 

PPI and P-DNA data can be obtained from databases. For yeast this should not be difficult. Using this physical interaction data, one can extract the genes that connect to the candidate genes identified in layer I, including the TFs bind to the candidate genes (P-DNA), as well as the proteins interact with those TFs (PPI), or candidate genes. In the situation when P-DNA data is not complete available (e.g. in human), one need to specify the potential TFs for candidates (from literature, database or motif search, or other approaches), and their interacting proteins should be available from PPI data.  

Since computing pairwise MI is slow (yes it is slow even with this "FastMI"), one can first filter out other "irrelevant genes" and only input those interacting genes and candidate genes into MI computation, i.e. in matrix M and M', only those genes who are candidate genes or TFs binding on candidate genes, or proteins interact with these TFs or candidate genes are included. If you don't know how to "filter this out", you can use CytoScape ( http://www.cytoscape.org/). Play with it a little bit, it should not be complex to figure out.

Ok, now we have the network around the candidate genes (you can visualize it in Cytoscape), and we have scores for genes. We have scores for each gene pairs, but we only care about the score between those putative regulator (TFs and proteins interact with TFs) and the candidate genes. You can map this scores into cytoscape, and associate, e.g. the size of the node with the scores so that you have a intuitive map of the "reconstructed network".

nodes with zero scores (MIcond<MIall) can be removed. 

If there is still many nodes in the network (it may happen when your candidates are well-known and there are too many TFs and proteins that could be regulating it in previous studies) , and the score is generally low for many of them, you can further refer to CLR ( http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050008 ), and use this idea to remove many low-scored genes. A simple way is to set a threshold based on MIcon score (not the difference) distribution for all pairs (e.g. >lower 20% MI scores). You can make it more complex to follow the CLR way to set different threshold for different gene. Note that in these situations you will have to compute all the pair MIs for all genes to get the distribution.

Ok. Now with layer I and II one should have a network, which is based on physical interactions but many nodes are filtered out (zero scores and below thresholds). The center of the network is the candidate genes, and TFs are potential regulators. Some TFs have good MI scores with candidate genes, their expression could be regulating candidate expression in the condition of interest. Some TFs have zero score but since there are proteins with nonzero score go through them towards the candidate genes, they will be kept in the network  (for layer III).

There is one thing that may take your extra time to figure out: since everything here is in matrix form, the "genes"/"TFs" are essentially the row numbers in the expression matrix, thus one may need to do a lot of conversion/mapping from IDs, gene names etc to row numbers that is used in all calculations. The question is pretty trivial but the efforts could be great, and this is going to be a routine for the people in comp bio. For me I usually scratch a simple perl script to map one list/ID to another. Since this is a general topic I share our way (not necessary the best way)  in the Protocol [How to map a list to another/extract some information in perl].


3. Identification of regulators III: estimate TF activity

Many TFs' activity is not correlated with expression level, thus we need to estimate TF activity.

We use the expression of putative targets of TFs as indicators of their activity, and use ReliefF to select the TFs whose activity change the most in the condition of interest.

In the situation when you have the full matrix of P-DNA interaction, e.g. in yeast, you can do it this way:

CombData=data'*TFAc_GeneID;

The CombData now is the TF activity matrix by summing up putative targets of each TFs as indicator of TF activity for each condition. The input is the data (gene by conditions), and the P-DNA binding matrix (gene by TF). In the P-DNA binding matrix if TF i binds to gene j, then the element i,j is 1, otherwise 0.

Define a corresponding Y (1 for condition of interest, 0 for others), the one can apply ReliefF to find out which TFs activity change the most in the condition of interest.

P=squareform(pdist(CombData));
[score, sortScore, ranking]=Re1(CombData,P,Y,10);


Now we know the TF activity changes in the condition of interest, go back to the network identified in layer II, if the TF activity changed and the gene expression is unique (candidate genes), then the TF could be the regulator of the candidate gene, although its expression level may not correlate with the candidate gene.

It is more difficult if the complete P-DNA matrix is not available. In this case one has to look for the putative targets for TFs one by one, at least for all the TFs left in layer II (for some TFs that lack information we cannot address them).  A good resource to look at is TargetMine ( http://targetmine.nibio.go.jp/ ) which is a tool integrates information from curated databases. 

Search a gene in TargetMine, and look at the 
  • Gene(s) --> Upstream Transcription Factors
  • Gene(s) --> Downstream Transcription Target genes
Depending on what you are looking for.

We are looking for putative targets for a TF, thus we can search for the TF and look for downstream Target genes that are known. This way one can have a relatively reliable set of target gene for a TF (much more reliable than the motif search which usually generates too many predictions).

 Still you need to do the work to map the gene names/IDs to row numbers in the expression matrix, see [How to map a list to another/extract some information in perl].

For example, the putative target of JUN can be given as:

iJUN=[1904,5525,5526,7868,10187,3067,7465,2387,10913,3463,13171,3399,15968,954];

The numbers here are row number of target genes in the expression matrix. Then the activity indicator for JUN is the matrix:

iJUNY=sum(ETABM185(iJUN',:),1);

Matrix ETABM185 is our expression matrix. 
After assigning all the TFs, then we can collate them as

humanTF=[iTP53Y;iJUNY;iRESTY;iNFKB1Y;iCREB1Y, ......];

and run them with ReliefF:

[score, sortScore,ranking]=Re1(humanTF',YdistEu1,[CancerListVV==7],10);
in this example, the YdistEu1 is the pre-computed distance matrix (refer to layer 1 to see how to do it), and for the Y label I use CancerListVV==7 which in our data describes breast cancer samples.

I would suggest include positive and negative controls, which are TFs that are well-known to be activated in your condition and TFs that are not likely activated, besides the TFs in layer II that one is interested in, to give a rough idea what scores are considered good.



If you want to compare with GSEA analysis, this is a code of GSEA from other group:

gsea2.m

Based on this I craft a code for testing human TF:

testTFgsea2_human.m

TFoHumanBC_JUN = testTFgsea2_human(TScoresHumanBC,iJUN');

The TScores are scores for genes (in the differentially expression analysis), and the iJUN are the putative targets for JUN.

check "mattest" if you don't know how to get Tscores.

CLICK HERE TO DOWNLOAD ANNOTATION FILE

CLICK HERE TO DOWNLOAD DATA

CLICK HERE TO DOWNLOAD THE THREE LAYER APPROACH PACKAGE