- "data": input data file, rows are samples, columns are features(genes).
- "distanceMatrix": the pair-wise distance between samples.
- "Ylabel": the label of samples, for example, 0 for cold, 1 for warm; or 1 for breast cancer, 0 for all others.
- "k": the parameter in relief algorithm, I usually use k=10.
ReliefF is based on "Nearest Neighbors", thus it is required to have a distance matrix to describe the similarity of transcriptomes. To obtain the distanceMatrix one could use:
Euclidean distance is by default and works ok as in our paper. If one has to try other distance metrics, there are different metrics in the pdist function. ReliefF takes in whatever distanceMatrix that is given.
(The parameter k is a bit tricky, generally 10 should be ok, but if you think there is more intriguing sub-types that contains less than 10 samples, you could try less than that. note that k should be smaller than the number of samples in a group, meaning if you only have 10 samples in your condition of interest, it is required to use k less than 10. please refer to algorithm ReliefF for more understandings of this parameter)
The results will show the "score" for each gene. The top-ranked genes are the candidate genes. There is no pre-defined explicit cut-off, one could plot the scores to see if there are steep drops in the scores so as to select the top few genes. The algorithm helps to find the most specific genes but one has to look into the top-ranked genes to see how many genes to pursue in the next steps.
When there is missing data, the ReliefF still works but there will be "NaNs" for the sorting and ranking. One may need to fix the missing data (either remove the gene/sample, or fill with mean of gene/samples). If not, one can still get the ReliefF scores and sort them separately, e.g. in Excel.
2. Identification of regulators I: based on dependences in expression
The approach is based on the calculation of mutual information between genes. The reason to use mutual information instead of pearson correlation is that the mutual information account for any type of dependencies, linear or nonlinear.
The code for computing mutual information is : FastPairMI.m
If you are interested in the algorithm, this is the website: http://icbp.stanford.edu/software/FastPairMI/
, you need to cite this work if use the alglrithm. There is another way of computing MI, which I introduced in another Protocol [Computing mutual information]
, which is a more "controllable" way that one can estimate information gain for adding features, but is pretty slow and not as convenient as the fast one when apply to large number of gene pairs.
to compute the unconditional mutual information, which is the mutual information between any two genes for all conditions:
in which M is the data matrix for all conditions, with each row is a gene's expression.
To compute the conditional mutual information, do the same thing
but this time the M' only include the condition of interest.
Note that "1" is the std of the Gaussian kernel for density estimation in order to compute mutual information. (you can adjust the number if your expression data is of larger or smaller scale as compared with human U-133G (expression level 0-14)).
Now the (MIcon-MIall) is the score. But we want to integrate the PPI and P-DNA interaction data and only account for the direct interactions to the candidate genes.
PPI and P-DNA data can be obtained from databases. For yeast this should not be difficult. Using this physical interaction data, one can extract the genes that connect to the candidate genes identified in layer I, including the TFs bind to the candidate genes (P-DNA), as well as the proteins interact with those TFs (PPI), or candidate genes. In the situation when P-DNA data is not complete available (e.g. in human), one need to specify the potential TFs for candidates (from literature, database or motif search, or other approaches), and their interacting proteins should be available from PPI data.
Since computing pairwise MI is slow (yes it is slow even with this "FastMI"), one can first filter out other "irrelevant genes" and only input those interacting genes and candidate genes into MI computation, i.e. in matrix M and M', only those genes who are candidate genes or TFs binding on candidate genes, or proteins interact with these TFs or candidate genes are included. If you don't know how to "filter this out", you can use CytoScape ( http://www.cytoscape.org/
). Play with it a little bit, it should not be complex to figure out.
Ok, now we have the network around the candidate genes (you can visualize it in Cytoscape), and we have scores for genes. We have scores for each gene pairs, but we only care about the score between those putative regulator (TFs and proteins interact with TFs) and the candidate genes. You can map this scores into cytoscape, and associate, e.g. the size of the node with the scores so that you have a intuitive map of the "reconstructed network".
nodes with zero scores (MIcond<MIall) can be removed.
If there is still many nodes in the network (it may happen when your candidates are well-known and there are too many TFs and proteins that could be regulating it in previous studies) , and the score is generally low for many of them, you can further refer to CLR ( http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050008
), and use this idea to remove many low-scored genes. A simple way is to set a threshold based on MIcon score (not the difference) distribution for all pairs (e.g. >lower 20% MI scores). You can make it more complex to follow the CLR way to set different threshold for different gene. Note that in these situations you will have to compute all the pair MIs for all genes to get the distribution.
Ok. Now with layer I and II one should have a network, which is based on physical interactions but many nodes are filtered out (zero scores and below thresholds). The center of the network is the candidate genes, and TFs are potential regulators. Some TFs have good MI scores with candidate genes, their expression could be regulating candidate expression in the condition of interest. Some TFs have zero score but since there are proteins with nonzero score go through them towards the candidate genes, they will be kept in the network (for layer III).
There is one thing that may take your extra time to figure out: since everything here is in matrix form, the "genes"/"TFs" are essentially the row numbers in the expression matrix, thus one may need to do a lot of conversion/mapping from IDs, gene names etc to row numbers that is used in all calculations. The question is pretty trivial but the efforts could be great, and this is going to be a routine for the people in comp bio. For me I usually scratch a simple perl script to map one list/ID to another. Since this is a general topic I share our way (not necessary the best way) in the Protocol [How to map a list to another/extract some information in perl].
3. Identification of regulators III: estimate TF activity
Many TFs' activity is not correlated with expression level, thus we need to estimate TF activity.
We use the expression of putative targets of TFs as indicators of their activity, and use ReliefF to select the TFs whose activity change the most in the condition of interest.
In the situation when you have the full matrix of P-DNA interaction, e.g. in yeast, you can do it this way:
The CombData now is the TF activity matrix by summing up putative targets of each TFs as indicator of TF activity for each condition. The input is the data (gene by conditions), and the P-DNA binding matrix (gene by TF). In the P-DNA binding matrix if TF i binds to gene j, then the element i,j is 1, otherwise 0.
Define a corresponding Y (1 for condition of interest, 0 for others), the one can apply ReliefF to find out which TFs activity change the most in the condition of interest.
[score, sortScore, ranking]=Re1(CombData,P,Y,10);
Now we know the TF activity changes in the condition of interest, go back to the network identified in layer II, if the TF activity changed and the gene expression is unique (candidate genes), then the TF could be the regulator of the candidate gene, although its expression level may not correlate with the candidate gene.
It is more difficult if the complete P-DNA matrix is not available. In this case one has to look for the putative targets for TFs one by one, at least for all the TFs left in layer II (for some TFs that lack information we cannot address them). A good resource to look at is TargetMine ( http://targetmine.nibio.go.jp/
) which is a tool integrates information from curated databases.
Search a gene in TargetMine, and look at the
- Gene(s) --> Upstream Transcription Factors
- Gene(s) --> Downstream Transcription Target genes
Depending on what you are looking for.
We are looking for putative targets for a TF, thus we can search for the TF and look for downstream Target genes that are known. This way one can have a relatively reliable set of target gene for a TF (much more reliable than the motif search which usually generates too many predictions).
Still you need to do the work to map the gene names/IDs to row numbers in the expression matrix, see [How to map a list to another/extract some information in perl].
For example, the putative target of JUN can be given as:
The numbers here are row number of target genes in the expression matrix. Then the activity indicator for JUN is the matrix:
Matrix ETABM185 is our expression matrix.
After assigning all the TFs, then we can collate them as
and run them with ReliefF: