• Recent studies have shown that RNA structural motifs play essential roles in RNA folding and in interactions with other molecules. Computational identification and analysis of RNA structural motifs remain challenging. Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variation. Other structural motif identification methods consider only nested canonical base-pairing structures and cannot identify complex RNA structural motifs, which often contain various non-canonical base pairs formed by uncommon hydrogen-bond interactions. In this article, we present RNAMotifScan, a novel RNA structural alignment method for RNA structural motif identification that takes into account isosteric base pairs (both canonical and non-canonical) and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan are demonstrated by searching for kink-turn, C-loop, sarcin-ricin, reverse kink-turn and E-loop motifs in a 23S rRNA (PDBid: 1S72), in which the occurrences of these motifs are well characterized. Finally, we search for these motifs in the RNA structures of the entire Protein Data Bank and estimate their abundances.



    RNAMotifScan Version 2.0 ReadMe:

    For comments or bug reports, please email Cuncong Zhong (cczhong@cs.ucf.edu).

    School of EECS, University of Central Florida


    RNAMotifScan.pl provides an interface for scanning through annotated PDB files. The annotated PDB files need to be in './RNAinPDB'. The process of generating the annotated PDB files will be discussed in detail later.


    To see the arguments and how to run RNAMotifScan.pl, please run 'perl RNAMotifScan.pl --help'.


    To align two RNA structural segments, run './RNAMotifAlign' for reduced output (score only) or './RNAMotifAlign_align' for full output (alignment included). The arguments of 'RNAMotifAlign' and 'RNAMotifAlign_align' can be found in the programs.


    Unless you are working with a new PDB entry (one later than the version released in August 2008), you probably do not need to read the following steps. The PDB entries you need are already parsed and deposited in './RNAinPDB'.

    Getting annotated PDB files:

    1. Download the PDB file(s).
    2. Use MC-Annotate (http://www.major.iric.ca/MC-Tools_files/MC-Annotate.zip)
      to annotate the PDB file.
      (An alternative annotation program is RNAVIEW
      (http://ndbserver.rutgers.edu/services/download/), but you may need
      to write your own parsing scripts to generate formatted inputs.)
    3. Download the PDB sequence file from 'ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt' (this file is not included because it is too large, and it may be outdated if you want to process a new PDB entry),
      and place it in the same directory as 'Make_RNA_Input.pl'.
      Then use the program 'Make_RNA_Input.pl' to parse the output
      generated by MC-Annotate, and save the result in a file.
      You may need to change the directory of the MC-Annotate generated files.
    4. If you have processed more than one MC-Annotate result, split the result
      into separate files with the program 'CutMultiMCA.pl'. The input of
      the program is the output file generated in step 3.
    5. Copy the split files into ./RNAinPDB/.
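
    Step 5 above can be sketched in Python as follows. This is a minimal sketch and not part of the RNAMotifScan distribution: the function name 'install_annotated_files' and the source directory 'cut_output' are hypothetical, and it assumes the files written by 'CutMultiMCA.pl' sit directly in a single directory. No assumption is made about how those files are named.

```python
import shutil
from pathlib import Path


def install_annotated_files(src_dir, dest_dir="RNAinPDB"):
    """Copy every regular file in src_dir into dest_dir (step 5).

    Hypothetical helper: src_dir is assumed to hold the files written by
    CutMultiMCA.pl. Returns the sorted list of copied file names.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create ./RNAinPDB if it is missing
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():  # skip subdirectories
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


if __name__ == "__main__":
    # 'cut_output' is a placeholder for wherever the split files were written.
    if Path("cut_output").is_dir():
        for name in install_annotated_files("cut_output"):
            print("copied", name)
```

    Files already present in './RNAinPDB' under the same name are overwritten, so a re-annotated entry replaces its older copy.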


    Annotation of output alignment:

    1. The line labeled 'iso' indicates base-pair isostericity. A star '*' indicates that the matched location in the sequence is part of a matched isosteric base pair. A plus symbol '+' indicates that the matched location is part of a matched non-isosteric base pair.
    2. The line labeled 'edge' is the interacting edges. 'W' is for Watson-Crick, 'H' is for Hoogsteen, and 'S' is for sugar edge. Upper case letters are for 'cis' orientation base-pairs while lower case letters are for 'trans' orientation base-pairs.
    3. The line labeled 'struc' represents the base-pair locations. A pair of '(' and ')' represents a base pair that was identified in the first stage, while '<' and '>' represent a base pair that was identified in the second stage.
    4. A dot in the sequence marks a break in the original sequence, which occurs when the query motif consists of multiple strands. A '-' symbol represents a gap in the sequence.
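
    The symbols listed in items 1 and 2 can be decoded mechanically. The following Python sketch maps a single character from the 'iso' or 'edge' line to its meaning as described above. The function and table names are illustrative only, and the exact column layout of the alignment output is not assumed.

```python
# Lookup tables taken from items 1 and 2 of the annotation list.
EDGE_NAMES = {"W": "Watson-Crick", "H": "Hoogsteen", "S": "sugar edge"}
ISO_MEANINGS = {"*": "isosteric base pair", "+": "non-isosteric base pair"}


def decode_edge(symbol):
    """Return (edge name, orientation) for one 'edge' character.

    Upper case means a cis base pair, lower case a trans base pair.
    Returns None for any character that is not W, H or S.
    """
    name = EDGE_NAMES.get(symbol.upper())
    if name is None:
        return None
    orientation = "cis" if symbol.isupper() else "trans"
    return name, orientation


def decode_iso(symbol):
    """Return the meaning of one 'iso' character, or None if unmarked."""
    return ISO_MEANINGS.get(symbol)


if __name__ == "__main__":
    print(decode_edge("h"))  # ('Hoogsteen', 'trans')
    print(decode_iso("*"))   # isosteric base pair
```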