##date: 2016-8-15
##provider: LncRNAWiki Team
##contact: lncwiki@big.ac.cn
##description: In 2014, we integrated lncRNA sequences and annotation information (e.g. genomic location, transcript structure) from three data sources: GENCODE (version 19; 23,898 human lncRNA transcripts), NONCODE (version 4.0; 95,135 human lncRNA transcripts), and LNCipedia (version 2.1; 32,181 human lncRNA transcripts). Errors and redundancy were then eliminated from the integrated data set in three steps. First, we removed sequences containing "N" from each data source; as a result, a total of 8 lncRNAs in LNCipedia were removed. Second, we excluded lncRNAs with an ambiguous naming scheme: within a data source, two or more lncRNA transcripts that have 100% sequence identity over the whole transcript length (based on blastn results) and occupy the same genomic location but carry different IDs are considered questionable lncRNAs. Consequently, 14, 20, and 8 lncRNAs were removed from GENCODE, NONCODE, and LNCipedia, respectively. Third, since different databases may have different naming schemes and a given lncRNA transcript may accordingly have different identifiers in different databases, we ran blastn across the three data sources. LncRNA transcripts having 100% sequence identity (based on blastn results) and occupying the same genomic location were regarded as the same lncRNA. In the end, we obtained a total of 105,255 non-redundant lncRNA transcripts. LncRNAWiki is updated frequently through community annotation of human lncRNAs and integration of newly identified lncRNAs with experimental evidence. As of 28 June 2016, the updated version contains a total of 719 community-curated lncRNAs (the detailed list is available at Featured LncRNAs).
##Data sources (RawData.tar.gz): Data sources of LncRNAWiki.
##Basic information (lncRNAWiki_105255_information.txt): Basic information on the 105,255 non-redundant lncRNAs.
Transcript ID: The original lncRNA ID in each database.
Source: The database and version that this lncRNA comes from.
Same with: LncRNAs in other databases that have the same sequence and the same genomic location.
Classification: Classification based on genomic location and context. We obtained genomic location information from GENCODE, NONCODE and LNCipedia. Following the categories of Derrien et al. (Derrien, T. et al. (2012) The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res), we classified lncRNAs into seven groups (Intergenic, Intronic (S), Intronic (AS), Overlapping (S), Overlapping (AS), Sense and Antisense) based on their genomic location with respect to protein-coding genes. Our classification differs from Derrien's in that lncRNAs intersecting protein-coding genes are classified as Sense or Antisense by considering the whole transcript sequence instead of the exonic regions only. Most lncRNAs belong to only one category; a small number (1,264) belong to more than one category.
Intergenic: LncRNAs transcribed from intergenic regions.
Intronic (S): LncRNAs transcribed entirely from introns of protein-coding genes.
Intronic (AS): LncRNAs transcribed from the antisense strand of protein-coding genes, with the entire sequence covered by introns of protein-coding genes.
Overlapping (S): LncRNAs that contain coding genes within an intron on the sense strand.
Overlapping (AS): LncRNAs that contain coding genes within an intron on the antisense strand.
Sense: LncRNAs transcribed from the sense strand of protein-coding genes, where the entire sequence of the lncRNA is covered by protein-coding genes (Intronic lncRNAs are not included), or the entire sequence of the protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene intersect each other partially.
Antisense: LncRNAs transcribed from the antisense strand of protein-coding genes, where the entire sequence of the lncRNA is covered by protein-coding genes (Intronic lncRNAs are not included), or the entire sequence of the protein-coding gene is covered by the lncRNA (Overlapping lncRNAs are not included), or the lncRNA and the protein-coding gene intersect each other partially.
Length, Genomic location, Exon number and Exons: The length, genomic location, and exon number of the lncRNA, and the genomic location of each exon. This information is obtained from the GENCODE, NONCODE and LNCipedia annotations.
Gene: lncRNA genes in different databases.
##Sequence: FASTA sequences of all lncRNAs in LncRNAWiki.
non-redundant_105255.fasta: Transcript sequences of the 105,255 non-redundant lncRNAs.
featured_lncRNAs.fasta: Transcript sequences of the featured lncRNAs.
##Small protein (micropeptide.txt): Predicted small proteins of all lncRNAs in LncRNAWiki. Each predicted small protein (a protein of 100 amino acids or less in the absence of processing) includes 13 sub-sections: 'Name', 'Length(aa)', 'Molecular weight', 'Aromaticity', 'Instability index', 'Isoelectric point', 'Runs', 'Runs residual', 'Runs probability', 'Amino acid sequence', 'Secondary structure', 'PRMN', and 'PiMo'.
Name: The name of the predicted small protein.
Length (aa), Molecular weight, Aromaticity, Instability index, Isoelectric point: The length (aa), molecular weight, aromaticity, instability index, and isoelectric point of the predicted small protein.
Runs: The runs of secondary structure of the predicted small protein.
Runs residual: The residual between the runs/length of the predicted small protein and the average runs/length of mRNAs that have the same length as the predicted small protein.
Runs probability: The average probability of the predicted small protein runs.
Amino acid sequence, Secondary structure: The amino acid sequence of the predicted small protein and its secondary structure.
PRMN: Refined predicted transmembrane helix.
PiMo: Predicted transmembrane region.
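
The three-step error and redundancy elimination given in the description can be sketched as follows. This is a minimal illustration, not the pipeline that produced the release: exact string equality stands in for a full-length 100% identity blastn hit, the `Transcript` record, the `location` string format, and the function name `merge_sources` are hypothetical, and keeping the first-seen transcript as the representative of a merged group is an assumption (the description does not say which ID is retained).

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    transcript_id: str  # original ID in its source database
    source: str         # e.g. "GENCODE", "NONCODE", "LNCipedia"
    location: str       # hypothetical format, e.g. "chr1:100-200:+"
    sequence: str


def merge_sources(transcripts):
    """Return (representatives, same_with) after the three filtering steps."""
    # Step 1: drop sequences containing "N".
    clean = [t for t in transcripts if "N" not in t.sequence.upper()]

    # Step 2: within one source, identical sequence and location under
    # different IDs is treated as questionable; all such entries are dropped.
    per_source = defaultdict(list)
    for t in clean:
        per_source[(t.source, t.sequence.upper(), t.location)].append(t)
    kept = [group[0] for group in per_source.values() if len(group) == 1]

    # Step 3: across sources, identical sequence and location means the
    # same lncRNA. Assumption: the first-seen entry is the representative.
    merged = defaultdict(list)
    for t in kept:
        merged[(t.sequence.upper(), t.location)].append(t)

    representatives = [group[0] for group in merged.values()]
    same_with = {
        group[0].transcript_id: [other.transcript_id for other in group[1:]]
        for group in merged.values()
    }
    return representatives, same_with
```

The `same_with` mapping plays the role of the "Same with" column: for each retained transcript it lists the IDs of identical transcripts found in the other databases.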