<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>

<p id="v1v1reply-intro"><span style="font-size: 10pt;">Hi Omar,</span></p>

<div id="v1v1replybody1">

<div dir="ltr">

<div dir="ltr"> </div>

<div dir="ltr">Thanks for getting back to me. I asked whether it must be EMF because GERP scores are also included in comment lines (starting with '#') of the MAF files generated from the same alignments as the EMF files (e.g. <a href="https://ftp.ensembl.org/pub/release-109/maf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/90_mammals.epo_extended.1_1.maf.gz">https://ftp.ensembl.org/pub/release-109/maf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/90_mammals.epo_extended.1_1.maf.gz</a>). I don't know of an existing Python script or resource that will load the 90-mammal alignment, complete with GERP scores, from either EMF or MAF format into Python. However, with a little coaxing, it's possible to load them from the MAF files using the Biopython MafIterator function (<a href="https://biopython.org/docs/1.81/api/Bio.AlignIO.MafIO.html#Bio.AlignIO.MafIO.MafIterator">https://biopython.org/docs/1.81/api/Bio.AlignIO.MafIO.html#Bio.AlignIO.MafIO.MafIterator</a>).</div>

<div dir="ltr"> </div>

<div dir="ltr">The Biopython MafIterator ignores comment lines, but we could wrap the input MAF file in a class such as 'CommentStashFile' below, which iterates over the lines of the file, sets aside the comment lines and returns everything else. This gives us access to the comments in the MAF file, while letting 'MafIterator' take care of parsing the alignments.</div>

<div dir="ltr"> </div>

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">

<div dir="ltr">

<p><span style="font-family: 'courier new', courier, monospace;">from collections.abc import Iterator</span></p>

<p><span style="font-family: 'courier new', courier, monospace;">class CommentStashFile(Iterator):</span></p>

<p><span style="font-family: 'courier new', courier, monospace;">    def __init__(self, file_object):</span><br /><span style="font-family: 'courier new', courier, monospace;">        self.file_object = file_object</span><br /><span style="font-family: 'courier new', courier, monospace;">        self.comment_lines = []</span></p>

<p><span style="font-family: 'courier new', courier, monospace;">    def __next__(self):</span><br /><span style="font-family: 'courier new', courier, monospace;">        for line in self.file_object:</span><br /><span style="font-family: 'courier new', courier, monospace;">            if line.startswith("#"):</span><br /><span style="font-family: 'courier new', courier, monospace;">                self.comment_lines.append(line)</span><br /><span style="font-family: 'courier new', courier, monospace;">            else:</span><br /><span style="font-family: 'courier new', courier, monospace;">                return line</span><br /><span style="font-family: 'courier new', courier, monospace;">        raise StopIteration</span></p>

</div>

</blockquote>
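
<div dir="ltr">As a quick check that needs no Biopython, the class can be tried on an in-memory file. This is only a sketch: the MAF-like text is made up for illustration (it is not copied from an Ensembl file, and where the GERP comment sits within a block is an assumption), and the class is repeated only so that the snippet runs on its own.</div>

<pre>
```python
import io
from collections.abc import Iterator


class CommentStashFile(Iterator):
    """Same class as above, repeated so this snippet is self-contained."""

    def __init__(self, file_object):
        self.file_object = file_object
        self.comment_lines = []

    def __next__(self):
        for line in self.file_object:
            if line.startswith("#"):
                self.comment_lines.append(line)
            else:
                return line
        raise StopIteration


# Made-up, MAF-like text for illustration only.
sample_text = (
    "##maf version=1\n"
    "a score=0\n"
    "s species_a.chr1 0 4 + 100 ACGT\n"
    "s species_b.chr1 0 4 + 100 ACGA\n"
    "# gerp scores: 0.5 . 1.2 -0.3\n"
    "\n"
)

stash_file = CommentStashFile(io.StringIO(sample_text))
passed_through = list(stash_file)

print(passed_through)            # the non-comment lines, in order
print(stash_file.comment_lines)  # the stashed lines starting with '#'
```
</pre>

<div dir="ltr">Note that the '##maf' header line also starts with '#', so it ends up in the stash along with the GERP comment.</div>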

<div dir="ltr"> </div>

<div dir="ltr">With the 'CommentStashFile' class, we can create a 'GerpMafIterator' function, which stashes the comments for each MAF block, then checks for the comment containing the GERP scores. These GERP scores can be added as column annotations to the 'MultipleSeqAlignment' object representing the corresponding MAF block:</div>

<div dir="ltr"> </div>

<div dir="ltr">

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">

<p><span style="font-family: 'courier new', courier, monospace;">from Bio.AlignIO.MafIO import MafIterator</span></p>

<p><span style="font-family: 'courier new', courier, monospace;">def GerpMafIterator(handle):<br />    gerp_line_prefix = "# gerp scores: "<br />    comment_stash_file = CommentStashFile(handle)<br />    for maf_block in MafIterator(comment_stash_file):<br />        # Find the GERP comment among the comments stashed for this block.<br />        gerp_score_line = None<br />        for comment_line in comment_stash_file.comment_lines:<br />            if comment_line.startswith(gerp_line_prefix):<br />                gerp_score_line = comment_line<br />                break<br />        if gerp_score_line is None:<br />            raise ValueError("GERP score line not found for MAF block")<br />        gerp_score_text = gerp_score_line[len(gerp_line_prefix):].rstrip("\n")<br />        # "." entries are stored as None; everything else is converted to float.<br />        maf_block.column_annotations["gerp_scores"] = [<br />            None if x == "." else float(x)<br />            for x in gerp_score_text.split()<br />        ]<br />        yield maf_block<br />        # Discard this block's comments before the next block is parsed.<br />        comment_stash_file.comment_lines.clear()</span></p>

</blockquote>
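
<p>The score-parsing step can also be tried in isolation. In this minimal sketch, the helper name 'parse_gerp_line' and the sample comment line are hypothetical; only the "# gerp scores: " prefix and the handling of "." are taken from the function above.</p>

<pre>
```python
GERP_LINE_PREFIX = "# gerp scores: "


def parse_gerp_line(comment_line):
    """Convert a GERP comment line into a list of floats, with None for '.'."""
    if not comment_line.startswith(GERP_LINE_PREFIX):
        raise ValueError("not a GERP score line")
    score_text = comment_line[len(GERP_LINE_PREFIX):].rstrip("\n")
    return [None if x == "." else float(x) for x in score_text.split()]


# Hypothetical comment line, for illustration only.
print(parse_gerp_line("# gerp scores: 0.5 . 1.25 -0.75\n"))
# prints [0.5, None, 1.25, -0.75]
```
</pre>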

<p><br /></p>

<p>It should be possible to use 'GerpMafIterator' on any current Ensembl MAF file containing GERP scores encoded as comments. For example, the file "90_mammals.epo_extended.1_1.maf.gz" could be downloaded and then processed as follows:</p>

</div>

<div dir="ltr">

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">

<p><span style="font-family: 'courier new', courier, monospace;">import gzip</span></p>

<p><span style="font-family: 'courier new', courier, monospace;">with gzip.open("90_mammals.epo_extended.1_1.maf.gz", "rt") as in_file_obj:</span><br /><span style="font-family: 'courier new', courier, monospace;">    for maf_block in GerpMafIterator(in_file_obj):</span><br /><span style="font-family: 'courier new', courier, monospace;">        gerp_scores = maf_block.column_annotations["gerp_scores"]</span><br /><span style="font-family: 'courier new', courier, monospace;">        # ... do stuff with GERP scores ...</span></p>

</blockquote>
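
<p>One possible thing to do with the scores is to match them to positions in a single aligned sequence, skipping gap columns. The sketch below uses only plain strings and lists; the helper name 'scores_by_position' and the example values are hypothetical, and it assumes a forward-strand sequence with a zero-based start coordinate and one score per alignment column.</p>

<pre>
```python
def scores_by_position(aligned_seq, gerp_scores, start):
    """Pair each non-gap base with its sequence position and column score."""
    if len(aligned_seq) != len(gerp_scores):
        raise ValueError("expected one GERP score per alignment column")
    position = start
    paired = []
    for base, score in zip(aligned_seq, gerp_scores):
        if base == "-":
            continue  # gap column: no base of this sequence to annotate
        paired.append((position, base, score))
        position += 1
    return paired


# Hypothetical aligned sequence and scores, for illustration only.
print(scores_by_position("AC-GT", [0.5, None, 1.5, -0.5, 2.0], start=100))
# prints [(100, 'A', 0.5), (101, 'C', None), (102, 'G', -0.5), (103, 'T', 2.0)]
```
</pre>

<p>A sequence on the reverse strand would need its coordinates converted first, which this sketch does not attempt.</p>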

</div>

<div dir="ltr"> </div>

<div dir="ltr">Do you think something like this might do the job?</div>

<div dir="ltr"> </div>

<div dir="ltr">All the best,</div>

<div dir="ltr"> </div>

<div dir="ltr">Thomas.</div>

<div dir="ltr"> </div>

<div dir="ltr"> </div>

<div dir="ltr">On 2023-03-09 22:18, Omar Gamel wrote:</div>

</div>

</div>

<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">

<div id="v1v1replybody1">

<div dir="ltr">

<div dir="ltr">Hi Thomas,

<div> </div>

<div>Thank you for your response.</div>

<div>I am interested in loading the GERP and associated alignment data into Python, not necessarily from an EMF file.</div>

<div>So far I have loaded the human genome and GERP scores separately from the release 109 FTP site, matching them by position.</div>

<div>I would similarly like to be able to get the GERP scores for an entire alignment.</div>

<div> </div>

<div>Best</div>

<div>Omar</div>

</div>

<br />

<div class="v1v1v1gmail_quote">

<div class="v1v1v1gmail_attr" dir="ltr">On Wed, Mar 8, 2023 at 5:56 PM Thomas Walsh &lt;<a href="mailto:twalsh@ebi.ac.uk" rel="noreferrer">twalsh@ebi.ac.uk</a>&gt; wrote:</div>

<blockquote class="v1v1v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">

<div style="font-size: 10pt; font-family: Verdana,Geneva,sans-serif;">

<p>Hello Omar,</p>

<p>My name is Thomas Walsh from the Ensembl Compara team, and I helped create the EMF files that you've expressed an interest in.</p>

<p>I understand that you're especially interested in accessing GERP and alignment data together. Is it necessary for this to be taken from an EMF file? Or are you primarily interested in loading the GERP and associated alignment data into Python, regardless of how it got there?</p>

<p>All the best,</p>

<p>Thomas Walsh.</p>

<blockquote style="padding: 0px 0.4em; border-left: 2px solid #1010ff; margin: 0px;">

<p>-------- Forwarded Message --------<br />Subject: [ensembl-dev] Python Script to read emf dump files?<br />Date: Thu, 2 Mar 2023 13:21:18 -0500<br />From: Omar Gamel &lt;<a href="mailto:omar.gamel@utoronto.ca" rel="noreferrer">omar.gamel@utoronto.ca</a>&gt;<br />Reply-To: Ensembl developers list &lt;<a href="mailto:dev@ensembl.org" rel="noreferrer">dev@ensembl.org</a>&gt;<br />To: <a href="mailto:dev@ensembl.org" rel="noreferrer">dev@ensembl.org</a></p>

<p>Hello,</p>

<p>I am looking for a Python package or script that can parse the alignment EMF files in the Ensembl FTP dump into an object I can manipulate.<br />Example files here: <a href="https://ftp.ensembl.org/pub/current_emf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/" target="_blank" rel="noopener noreferrer">https://ftp.ensembl.org/pub/current_emf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/</a><br />I am particularly interested in the GERP data along with the alignment.</p>

<p>I see the example scripts provided are largely in perl: <a href="https://github.com/Ensembl/ensembl-compara/tree/release/109/scripts/dumps" target="_blank" rel="noopener noreferrer">https://github.com/Ensembl/ensembl-compara/tree/release/109/scripts/dumps</a></p>

<p>Please let me know where I can find such a python resource.</p>

<p>Thank you<br />Omar</p>

</blockquote>

<div> </div>

<div> </div>

<div id="v1v1v1m_-4133689087722384834signature">

<div id="v1v1v1m_-4133689087722384834message-htmlpart1">

<div lang="EN-GB">

<div>

<table border="0" cellspacing="0" cellpadding="0">

<tbody>

<tr>

<td valign="top" width="368">

<p><strong><span>Thomas Walsh</span></strong></p>

<p><strong>Bioinformatics Developer, Ensembl Compara</strong></p>

<p><span>European Bioinformatics Institute (EMBL-EBI)</span></p>

<p><span>Wellcome Genome Campus</span></p>

<p><span>Hinxton</span></p>

<p><span>Cambridge CB10 1SD</span></p>

<p><span>United Kingdom</span></p>

<p><span>Email: <a href="mailto:twalsh@ebi.ac.uk" rel="noreferrer">twalsh@ebi.ac.uk</a></span></p>

</td>


</tr>

</tbody>

</table>

<p><br /></p>

</div>

</div>

</div>

</div>

</div>

_______________________________________________<br />Dev mailing list    <a href="mailto:Dev@ensembl.org" rel="noreferrer">Dev@ensembl.org</a><br />Posting guidelines and subscribe/unsubscribe info: <a href="https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org" target="_blank" rel="noopener noreferrer">https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org</a><br />Ensembl Blog: <a href="http://www.ensembl.info/" target="_blank" rel="noopener noreferrer">http://www.ensembl.info/</a></blockquote>

</div>

</div>

</div>

<br />


</blockquote>

</body></html>