<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi Vasisht,<div> I’ve attached a couple of files you might find useful. </div><div><br></div><div>e79_mrna_mismatch_GRCh37.txt: This file contains a list of imported RefSeq transcripts where the underlying genomic sequence does not match the mRNA sequence the transcript is based on. This happens because RefSeq annotation is carried out at the transcript level and then mapped to the genome, and some transcripts do not map perfectly.</div><div><br></div><div>e79_mrna_no_comparison_GRCh37.txt: This file contains a list of imported RefSeq transcripts where we were unable to compare the underlying genomic sequence with a corresponding mRNA. This usually happens when the accession has changed or been retired, or when we were not able to parse the mRNA accession from the gff3 file during the import. This is the case for a lot of the imported RefSeq transcripts in GRCh37. It does not mean that there is definitely a difference between the genomic sequence and the mRNA the transcript was annotated from, just that we were not able to carry out the comparison.</div><div><br></div><div>Some points to note about these files:</div><div><br></div><div>1) They are based on our upcoming update to GRCh37, so they were not generated from the publicly available homo_sapiens_otherfeatures_78_37 db. However, the information in the files should allow you to examine the transcripts you’re interested in by using the genomic coordinates and stable ids.</div><div><br></div><div>2) In our upcoming release (e79) we have added an analysis called ‘refseq_import’ to the GRCh37 otherfeatures db (an import of the publicly available RefSeq gff3 file for GRCh37). 
This is the set of models that the lists were generated from (not the currently available ‘refseq_human_import’).</div><div><br></div><div>3) The stable ids for ‘refseq_import’ models are not unique. This is because a transcript can sometimes be mapped to several places on the genome. It’s therefore important to use the genomic coordinates to make sure you’re looking at the correct transcript model (these are in the files). As an aside, I’ve noticed that in cases where the transcript maps to multiple places, often there is one perfect mapping and the others do not match the genomic sequence.</div><div><br></div><div>4) Whether there’s a mismatch between the genomic and mRNA sequences is decided by pairwise alignment. The two sequences are aligned, and if they’re identical they are considered a perfect match. Failing this, the mRNA sequence undergoes polyA clipping and the alignment is carried out again. If the two sequences then align with 100 percent identity and coverage, this is also considered a perfect match. Otherwise they are flagged as a mismatch. Note that this is done across the entire length of the transcript (so UTR is included if present). I’ve noticed that occasionally the only difference is that the 5’ UTR of the mRNA is longer; in future we might consider trimming these cases to see if it gives a perfect match.</div><div><br></div><div>5) This will all be easier to investigate with the update to GRCh37, as you will be able to find all this information as transcript attributes for the ‘refseq_import’ models. 
The attributes will indicate whether the match is perfect or imperfect and, for imperfect matches, in which region the mismatch occurred (5’ UTR, CDS or 3’ UTR for coding models, or just the whole transcript for non-coding models), or whether no comparison was possible because no matching mRNA accession was found.</div><div><br></div><div>Hope this is of some help,</div><div><br></div><div>Fergal.</div><div><br></div><div></div></body></html>