Protocol: siRNA sequence design
Created by Ming Wu, 07/2012
1. Obtain the mRNA sequence and generate the candidate siRNAs
This is the Perl code for preprocessing.
The program asks for the accession number of the mRNA and extracts the sequence from the NCBI database. The output includes:
- siRNATarget.set.out : list of the siRNA target sequences
- siRNACandidate.set.out : list of the siRNA candidates, complementary to siRNA target sequences
- mRNA.seq.out : the mRNA sequence
The siRNA targets are generated by sliding a 19 bp window along the mRNA sequence.
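The sliding-window step can be sketched as follows. This is a minimal Python illustration, not the Perl preprocessing program itself. The function names, the one-nucleotide step size, the T-to-U conversion, and writing each candidate as the reverse complement of its target (5' to 3') are assumptions made for the example, and the mRNA sequence is made up.

```python
WINDOW = 19  # window length stated in the protocol

# Watson-Crick partners in the RNA alphabet
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def sirna_targets(mrna, window=WINDOW):
    """Return every window-length subsequence of the mRNA (a step of 1 is assumed)."""
    seq = mrna.upper().replace("T", "U")  # assumption: accept DNA-style input
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]


def sirna_candidate(target):
    """Return the strand complementary to a target, written 5'->3' (assumed)."""
    return "".join(COMPLEMENT[base] for base in reversed(target))


mrna = "AUGGCUAGCUAGGCUUACGAUCGGAUC"  # made-up sequence for illustration
targets = sirna_targets(mrna)
candidates = [sirna_candidate(t) for t in targets]
```

A sequence of length N yields N - 18 targets, so each candidate keeps the same row position as its target in the two output lists.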
2. Compute the thermodynamic asymmetry
Upload the sequences (antisense: siRNACandidate, sense: siRNATarget) to the DINAMelt Web Server and obtain the "Loop Free-Energy Decomposition" page of the results.
Save the results and put the energy decomposition information into a TXT file.
You need to do this twice: once for Antisense-to-Sense and once for Sense-to-Antisense.
Examples of the two TXT files that should be obtained:
Based on these energy calculations, the delta-delta-G (ddG) can be computed using our Perl code:
The ddG result (based on 3-nucleotide stacking energy) for each candidate will be in the file:
which has 4 columns: the row number of the candidate, the 3' terminal energy (antisense 5'), the 5' terminal energy, and the ddG.
The ddG vector for the candidates is now ready.
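As a rough illustration of the asymmetry calculation, the sketch below sums the stacking energies at each end of one duplex and takes the difference. It is written in Python and is not the Perl code referred to above. The function name, the number of stacks summed per end, which end of the list corresponds to which terminus, and the sign convention (5' minus 3') are all assumptions, and the energies are made-up numbers.

```python
def terminal_ddg(stack_energies, n_stacks=3):
    """Return (three_prime, five_prime, ddG) from per-stack free energies.

    stack_energies: stacking free energies (kcal/mol) listed along the duplex.
    Assumptions: the list starts at the 5' terminus and ends at the 3' terminus,
    n_stacks stacks are summed per end, and ddG = five_prime - three_prime.
    """
    if len(stack_energies) < 2 * n_stacks:
        raise ValueError("duplex too short for the requested number of stacks")
    five_prime = sum(stack_energies[:n_stacks])
    three_prime = sum(stack_energies[-n_stacks:])
    return three_prime, five_prime, five_prime - three_prime


# Made-up energies for a single duplex, for illustration only
energies = [-2.0, -1.0, -3.0, -2.5, -1.5, -1.0, -2.0, -1.0]
three_p, five_p, ddg = terminal_ddg(energies)
```

The return order mirrors the column order of the result file (3' terminal energy, 5' terminal energy, ddG); check the sign and end conventions against the actual Perl output before relying on them.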
The code that prepares the terminal nucleotide vector for each candidate is here:
The code reads in SeqAnti.TXT (siRNA candidate sequences) and SeqSen.TXT (siRNA target sequences) and outputs a file, termResult.txt, with two columns: the antisense-sense terminal nucleotide pair, and a number indicating its type (i.e. one of the 16 possible combinations).
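One way to map a terminal nucleotide pair to a type number is sketched below in Python. This is an illustration only: the protocol does not state which terminus is used or how the 16 combinations are numbered, so taking the first nucleotide of each strand, the A, C, G, U ordering, and the resulting numbering are hypothetical choices, as are the function name and the example sequences.

```python
BASES = "ACGU"  # assumed ordering of nucleotides


def terminal_type(antisense, sense):
    """Return (pair, type_number) for the first nucleotide of each strand.

    type_number runs from 1 to 16; this numbering is hypothetical.
    """
    a = antisense[0].upper().replace("T", "U")
    s = sense[0].upper().replace("T", "U")
    number = BASES.index(a) * len(BASES) + BASES.index(s) + 1
    return a + "-" + s, number


# Made-up antisense and sense sequences for illustration
pair, number = terminal_type("UAGCCUAGCUAGCCAUGGA", "UCCAUGGCUAGCUAGGCUA")
```

Whatever numbering is used, it has to match the one the model parameters were trained with, so the type numbers in termResult.txt should be taken as authoritative.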
3. Logistic regression model to predict the activity of siRNA sequences
The model is in the MATLAB code:
Parameters are determined based on training sets.
The ddG vector and the terminal nucleotide vector should be entered into MATLAB (you can use Excel to import the text files stack3result.txt and termResult.txt generated in the previous steps, and copy the last column of each into the MATLAB variables "ddG3" and "termN").
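If you would rather script the extraction than copy columns by hand, the last column can be pulled out with a few lines of Python, as sketched here. The helper names are hypothetical, and the sketch assumes the files are whitespace-delimited with no header row, which the protocol does not state. The example rows are made up.

```python
def last_column(lines):
    """Return the last whitespace-separated field of each non-empty line as a float."""
    return [float(line.split()[-1]) for line in lines if line.strip()]


def load_last_column(path):
    """Read a text file and return its last column as a list of floats."""
    with open(path) as handle:
        return last_column(handle)


# Made-up rows in the four-column layout described for the ddG file
example = ["1 -4.0 -6.0 -2.0", "2 -5.5 -3.0 2.5", ""]
ddG3 = last_column(example)
```

With the real files, `load_last_column("stack3result.txt")` and `load_last_column("termResult.txt")` would give the values to paste into the MATLAB variables.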
Run the model with:
The output variable Prob_H_M_L is a three-column matrix: the first column is the probability that an siRNA candidate is highly active, the second is the probability of medium activity, and the third is the probability of low activity.
The vector c indicates, for each siRNA candidate, which of the three probabilities is the highest.
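The relationship between Prob_H_M_L and c can be illustrated with a short Python sketch. It is not the MATLAB model: the probabilities are made up, the function name is hypothetical, and it assumes that c holds the 1-based column index of the largest probability, with ties going to the first such column.

```python
LABELS = ["High", "Medium", "Low"]  # column order given in the protocol


def most_likely_class(prob_rows):
    """Return, for each row, the 1-based index of the largest probability."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in prob_rows]


# Made-up probabilities for three candidates
Prob_H_M_L = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.25, 0.5, 0.25]]
c = most_likely_class(Prob_H_M_L)
names = [LABELS[i - 1] for i in c]
```

Here the first candidate would be called highly active, the second low, and the third medium.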