RNA Design based on Simulated SHAPE Data

RNA plays essential roles in numerous cellular processes, including gene regulation and DNA replication. These roles are known to be dictated by the higher-order structures of the RNA molecule. Hence, it is of prime importance to find an RNA sequence that folds into a structure with a particular function, for example for use in pharmaceuticals. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are some well-known algorithms for solving this problem, they use only hard constraints to evaluate the predicted sequences. Recently, SHAPE data has emerged as a soft constraint for improving RNA secondary structure prediction algorithms. This motivates us to report a new method for the accurate design of RNA sequences based on their secondary structures, using SHAPE data as pseudo-free energy. We first use a stochastic model to simulate SHAPE data based on a set of 16S and 23S ribosomal RNA sequences. Then, a harmony search algorithm is applied to accurately predict an RNA sequence using free energy and simulated SHAPE data.
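As a rough illustration of how SHAPE reactivities can act as a soft constraint, the sketch below uses the widely used log-linear pseudo-free energy term, dG_SHAPE(i) = m * ln(reactivity_i + 1) + b. The slope m = 2.6 and intercept b = -0.8 kcal/mol are commonly quoted defaults and are assumptions here; this page does not state which parameters HRDSSD uses. The function names are hypothetical, and applying the term to every paired position is a simplification.

```python
import math

# Commonly quoted defaults for the log-linear SHAPE term (assumed here,
# not taken from HRDSSD): slope m and intercept b, both in kcal/mol.
M_SLOPE = 2.6
B_INTERCEPT = -0.8

def shape_pseudo_energy(reactivity, m=M_SLOPE, b=B_INTERCEPT):
    """Pseudo-free energy (kcal/mol) contributed by one nucleotide."""
    if reactivity < 0:
        # Negative reactivities are treated as missing data in this sketch.
        return 0.0
    return m * math.log(reactivity + 1.0) + b

def total_pseudo_energy(reactivities, paired):
    """Sum the SHAPE term over the positions that the structure pairs.

    Simplification: the term is added once per paired position.
    """
    return sum(shape_pseudo_energy(r) for r, p in zip(reactivities, paired) if p)
```

With these parameters an unreactive nucleotide (reactivity 0) receives a bonus of -0.8 kcal/mol for being paired, while a highly reactive one is penalized, which is what makes the constraint soft rather than hard.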

We compare our algorithm with some well-known algorithms. Our algorithm precisely predicts 26 new sequences for the structures extracted from the Rfam dataset, while the other algorithms predict 22 out of 29. The proposed algorithm is comparable to them on the RNA-SSD datasets, where they can predict 33 appropriate sequences for RNA secondary structures (out of 34). Finally, we show that the predicted 3D structures of sequences designed with SHAPE data are more similar to those found in nature.
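The harmony search step mentioned in the abstract can be sketched generically as follows. This is a minimal illustration, not the HRDSSD implementation: the cost function is a toy stand-in that only counts target base pairs that cannot form, whereas the abstract describes scoring with free energy and simulated SHAPE data. All names and the parameter values (memory size, HMCR, PAR, iteration count) are assumptions.

```python
import random

NUCS = "ACGU"
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def pair_table(structure):
    """Return (i, j) index pairs for a dot-bracket structure."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs

def toy_cost(seq, pairs):
    """Stand-in objective: number of target pairs that cannot form."""
    return sum((seq[i], seq[j]) not in PAIRS for i, j in pairs)

def harmony_search(structure, memory_size=10, hmcr=0.9, par=0.3,
                   iterations=2000, seed=0):
    rng = random.Random(seed)
    pairs = pair_table(structure)
    n = len(structure)
    # Harmony memory: random candidate sequences stored with their costs.
    memory = []
    for _ in range(memory_size):
        seq = [rng.choice(NUCS) for _ in range(n)]
        memory.append((toy_cost(seq, pairs), seq))
    for _ in range(iterations):
        new = []
        for pos in range(n):
            if rng.random() < hmcr:
                # Memory consideration: reuse this position from a stored harmony.
                base = rng.choice(memory)[1][pos]
                if rng.random() < par:
                    # Pitch adjustment: switch to a different nucleotide.
                    base = rng.choice([b for b in NUCS if b != base])
            else:
                base = rng.choice(NUCS)  # random selection
            new.append(base)
        cost = toy_cost(new, pairs)
        worst = max(range(memory_size), key=lambda k: memory[k][0])
        if cost < memory[worst][0]:
            memory[worst] = (cost, new)  # replace the worst harmony
        if cost == 0:
            break
    best_cost, best_seq = min(memory, key=lambda item: item[0])
    return "".join(best_seq), best_cost

sequence, cost = harmony_search("((((....))))")
print(sequence, cost)
```

In a real design loop the toy cost would be replaced by a score derived from folding the candidate sequence, which is far more expensive per evaluation than this pairing check.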

How to use our software

  1. A 64-bit version of Ubuntu is recommended (our software is an ELF64 binary file, built on Ubuntu).
  2. Download and build GTfold.
  3. Download the HRDSSD binary file.
  4. Place the required files (gtmfe, Data folder, HRDSSD, Astem.txt, ...) in the same directory.
  5. Run the program with a structure file, for example: ./HRDSSD <input file> <output file>
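The steps above can be wrapped in a small helper that checks the working directory before launching the binary. This is only a sketch: the file names come from step 4 (whose list is open-ended, so the check is not exhaustive), the two positional arguments follow the usage line in step 5, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Names taken from the steps above; the original list ends with "..",
# so other files may also be needed.
REQUIRED = ["gtmfe", "Data", "HRDSSD", "Astem.txt"]

def missing_requirements(workdir):
    """Return the required entries that are absent from workdir."""
    root = Path(workdir)
    return [name for name in REQUIRED if not (root / name).exists()]

def run_hrdssd(workdir, input_file, output_file):
    """Run ./HRDSSD with an input and an output file inside workdir."""
    missing = missing_requirements(workdir)
    if missing:
        raise FileNotFoundError("missing required files: " + ", ".join(missing))
    # Positional arguments follow the usage line in step 5.
    return subprocess.run(["./HRDSSD", input_file, output_file],
                          cwd=workdir, check=True)
```

Calling `run_hrdssd` from the directory prepared in step 4 fails early with a list of missing files instead of a less obvious error from the binary itself.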

You may download our dataset from here.

Also, you can download HRDSSD-Suk and HRDWSD.


Comparison of HRDSSD and the other methods on dataset A, based on success count (SC) and average time (s).

Rfam AC  length    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time
RF00001     117    50    1.94    50   10.12    50    2.18    50    0.92    28    0.79    49    0.15    27    8.52
RF00002     151    50    1.12    50    4.14    50    1.64    49    2.98    13    2.00     0       -     3   60.03
RF00003     161    50    2.00    50    8.20    23   17.00    41    5.94    11    2.18     0       -     0       -
RF00004     193    50    0.94    50    2.66    50    0.58    50    2.51    48    0.77    19    2.73    46    0.35
RF00005      74    50    0.28    50    0.66    50    0.30    50    0.03    35    0.37    49    0.04    47    0.08
RF00006      89    50    0.38    50    0.96    50    0.44    50    0.17    44    0.34    40    0.06    50    0.10
RF00007     154    50    0.86    50    2.96    50    0.82    50    0.71    48    0.58    42    0.30    44    0.40
RF00008      54    50    0.22    50    0.60    50    0.24    50    0.01    46    0.24    50    0.00    50    0.06
RF00009     348    50    9.50    50   32.04    49   11.73    49   22.52    39    3.13     0       -     8    1.27
RF00010     357     0       -     0       -     0       -     0       -     0       -     0       -     0       -
RF00011     382     1  282.00     0       -     0       -     0       -     0       -     0       -     0       -
RF00012     215    50    1.02    50    3.50    50    0.82    50    1.52    47    0.96     3   23.24    50    0.41
RF00013     185    50    0.96    50    3.20    50    0.70    50    1.42    38    1.13    13    2.00    50    0.42
RF00014      87    50    0.26    50    0.80    50    0.26    50    0.04    40    0.40    50    0.02    50    0.12
RF00015     140    50    0.54    50    2.08    50    0.58    50    0.86    44    0.57    19    1.74    47    0.33
RF00016     129     0       -     0       -     0       -     0       -     0       -     0       -     0       -
RF00017     301    50    2.18    50   10.00    50    1.14    50    2.13    42    2.76    48    0.80    48    2.49
RF00018     360    50   10.12    50   42.46    50    7.04     0       -     0       -     0       -     0       -
RF00019      83    50    0.36    50    1.04    50    0.30    50    0.09    46    0.33    47    0.03    50    0.10
RF00020     119    50    0.54    50    1.80     0       -     0       -     0       -     0       -     0       -
RF00021     118    50    0.28    50    0.74    50    0.26    50    0.13    40    0.63    50    0.09    50    0.58
RF00022     148    50    0.50    50    1.44    50    0.48    50    1.02    44    0.68     8    1.05    42    0.21
RF00024     451     0       -     0       -     0       -     0       -     0       -     0       -     0       -
RF00025     210    50    1.06    50    3.50    50    0.68    50    3.36    45    0.87     1    0.93    50    0.33
RF00026     102    50    0.24    50    0.54    50    0.22    50    0.02    45    0.36     2    3.11    50    0.06
RF00027      79    50    0.28    50    0.64    50    0.28    50    0.03    49    0.35    50    0.04    50    0.21
RF00028     344    46   87.33    41  174.54    45   17.67    50   26.42     0       -     0       -     1    1.54
RF00029      73    50    0.34    50    0.88    50    0.40    50    0.06    39    0.28     3    0.01    50    0.08
RF00030     340    50    7.98    50   28.74    50    5.94     0       -    44    2.34     0       -    26    5.55
sum              1247  413.23  1241  338.24  1167   71.70  1089   72.88   875   22.05   543   36.32   889   83.25
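The sum row and the number of solved structures can be reproduced from the per-family success counts. The sketch below uses the first SC column of the dataset A table; reading that column as the HRDSSD results is an assumption, since the header does not name the methods. A structure is counted as solved when at least one run succeeded.

```python
# First SC column of the dataset A table, in row order.
first_column_sc = [
    50, 50, 50, 50, 50, 50, 50, 50, 50, 0,   # RF00001 to RF00010
    1, 50, 50, 50, 50, 0, 50, 50, 50, 50,    # RF00011 to RF00020
    50, 50, 0, 50, 50, 50, 46, 50, 50,       # RF00021 to RF00030 (no RF00023)
]

total_successes = sum(first_column_sc)
solved = sum(1 for sc in first_column_sc if sc > 0)

print(total_successes, solved, len(first_column_sc))  # prints "1247 26 29"
```

The total matches the first entry of the sum row, and 26 solved structures out of 29 agrees with the figure quoted in the abstract.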

Comparison of HRDSSD and the other methods on RNA-SSD dataset B, based on success count (SC) and average time (s).

Id        length    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time
AB015827     857    50  152.66    49  223.02    50   13.32    50    7.15    45   20.64    37   11.68    39   40.66
AF029195    1054    50  525.42    48  354.40    50   29.66    50   11.58     0       -    47   32.67    43    2.15
AF056938    1399    50  841.30    38 1738.50    50   30.08    50   43.85     0       -    43  112.64     0       -
AF096836     647    50   22.76    49  118.06    50    4.48    50    4.32     0       -    43    3.40    35   16.13
AF106618     351    50    5.18    50   24.62    50    1.54    50    1.84    41    3.56    47    0.58    49    4.58
AF107506     338    50    3.76    50   24.60    50    1.78    50    2.38    39    3.59    44    2.04     1    2.19
AF141485     474    50   10.64    50   78.30    50    3.32    50    3.42    45    5.93    33    4.61     0       -
AJ011149     377    50   12.38    50   41.38    48    6.85    50    1.43     0       -    37    1.72    22    2.90
AJ130779     507    50   13.58    49  109.71    50    3.98    50    2.41    47    6.19    49    1.72    19    9.87
AJ132572     781    50  103.62    38  312.26    50    9.98    50    7.12     0       -    42   12.90    22   18.87
AJ133622    1297     1  599.00    34   62.06    26  176.23    50   18.79     0       -    49  233.92    45  289.25
AJ236455     752     6  314.17     2  341.50    11  136.18    50   18.22     0       -     5  255.65     0       -
D38777       859    50  162.08    20  342.20    50   17.08    50   14.14     0       -    42  164.69     0       -
L11935       265    50    1.92    50    7.04    50    0.80    50    0.78    49    1.51    49    0.81    50    1.21
L77117      1476     2  298.50     4  238.00     4  238.00    50   23.63     0       -    50   55.28     0       -
LIU92530     290     7   90.57     4  191.00    15   62.73    50    1.02     0       -    22    1.26    43    4.08
S70838       390    46   50.57    46   89.80    48    7.38    50    2.99     0       -    46    2.96    49   18.11
U63350       419    50    5.14    50   26.06    50    1.76    50    1.54    49    4.14    48    3.49    44    4.73
U81771       492    50    7.80    50   57.76    50    2.28    50    1.99     0       -    46    2.71     0       -
U84629       300    48   19.46    46   53.48    49    4.55    50    0.97     5   19.40    40    2.40    50    2.43
X61771       660    49   86.18    36  383.28    49   92.47    50   13.19     0       -    17   17.09     0       -
X81949      1201    18  340.50    39   82.00    47  206.45    50   17.76     0       -    48  128.07    31   19.23
X99676      1443     2  544.50     0       -     4   40.50    50   31.17     0       -    46  220.37     0       -
Z83250       261    50    6.10    50    3.70    50    1.72    50    0.68    35    2.20    50    0.74    45    1.85
sum                929 3868.72   902 4812.93  1001 1085.75  1200  232.35   355   67.17   980 1273.41   587  438.22

Comparison of HRDSSD and the other methods on RNA-SSD dataset C, based on success count (SC) and average time (s).

Id  length    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time    SC    time
1       65    50     0.2    50    0.72    50    0.28    50    0.04    39    0.31    50    0.02    50    0.10
2       79    50    0.26    50     0.7    50     0.3    50    0.03    46    0.35    50    0.01    50    0.09
3      122    50     1.7    50    5.44    50    0.96    50    0.29     6    3.33    50    0.06    50    0.18
4      166    50    0.72    50    2.26    50    0.44    50    0.34    50    0.86    50    0.10    50    0.53
5      180    50    5.12    50   40.58     0       -     8    9.28     0       -     0       -     0       -
6      314    50    2.26    50    9.24    50    1.26    50    4.83    49    1.92    11   10.95    11    1.51
7      340    50    4.12    50   12.84    50    1.56    50    3.19    45    2.64    14   23.02    24    2.29
8      372    50     5.1    50   16.34    50    1.66    50   13.94    46    3.02    431.17        25    2.19
9      376     0       -     0       -     0       -     0       -     0       -     0       -     0       -
10     583    50   16.26    50   97.08    50    2.82    50    4.39    45    9.24    438.00        12    9.68
sum          450   35.74   450   185.2   400    9.28   408   36.32   326   21.68   272   73.34   272   16.58

Contact Us