<?xml version="1.0" encoding="UTF-8"?>        <rss version="2.0"
             xmlns:atom="http://www.w3.org/2005/Atom"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
             xmlns:admin="http://webns.net/mvcb/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <channel>
            <title>Parsing rules for Sequence Repository - Utilities</title>
            <link>https://www.profiproteomics.fr/forum/general-utilities/parsing-rules-for-sequence-repository/</link>
            <description>Forum dedicated to Proline Software</description>
            <language>fr-FR</language>
            <lastBuildDate>Fri, 15 May 2026 05:29:05 +0000</lastBuildDate>
            <generator>wpForo</generator>
            <ttl>60</ttl>
							                    <item>
                        <title>Parsing rules for Sequence Repository : Which parsing rule to use for NCBI fasta files</title>
                        <link>https://www.profiproteomics.fr/forum/general-utilities/parsing-rules-for-sequence-repository/#post-8</link>
                        <pubDate>Thu, 20 Sep 2018 12:24:45 +0000</pubDate>
                        <description><![CDATA[How do you configure Sequence Repository for an NCBI entry? To extract the protein name gi|47169226 from an entry formatted as:    &gt;gi|47169226|pdb|1UB2|A Chain A, Crystal Structure ...]]></description>
                        <content:encoded><![CDATA[<p><span style="font-size: 10pt">How do you configure Sequence Repository for an NCBI entry?</span><br /><br /><span style="font-size: 10pt">To extract the protein name <strong>gi|47169226</strong> from an entry formatted as:</span><br /><span style="font-size: 10pt">    <em>&gt;gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of Catalase-Peroxidase From Synechococcus Pcc 7942</em></span><br /><span style="font-size: 10pt">the parsing_rules file should contain a dedicated rule:</span><br /><br /><span style="font-size: 10pt">parsing-rules = [{</span><br /><span style="font-size: 10pt">   name="uniprot1",</span><br /><span style="font-size: 10pt"> ...</span><br /><span style="font-size: 10pt">},</span><br /><span style="font-size: 10pt">{</span><br /><span style="font-size: 10pt">  name="NCBI",</span><br /><span style="font-size: 10pt">   fasta-name=["NCBI"],                                 <span style="color: #ff9933">// all files whose name starts with 'NCBI' are handled by this rule</span></span><br /><span style="font-size: 10pt">   fasta-version="NCBI([^_]*)_.*.fasta",   <span style="color: #ff9933">// the NCBI version is extracted after the word 'NCBI' and before the first '_'</span></span><br /><span style="font-size: 10pt">   protein-accession ="&gt;(\\w{2}\\|[^\\|]*)\\|"  <span style="color: #ff9933">// the protein accession is extracted from the beginning to the second '|'</span></span><br /><span style="font-size: 10pt">}]</span></p>]]></content:encoded>
						                            <category domain="https://www.profiproteomics.fr/forum/general-utilities/">Utilities</category>                        <dc:creator>Véronique Dupierris</dc:creator>
                        <guid isPermaLink="true">https://www.profiproteomics.fr/forum/general-utilities/parsing-rules-for-sequence-repository/#post-8</guid>
                    </item>
				                    <item>
                        <title>Parsing rules for Sequence Repository</title>
                        <link>https://www.profiproteomics.fr/forum/general-utilities/parsing-rules-for-sequence-repository/#post-7</link>
                        <pubDate>Thu, 20 Sep 2018 12:22:55 +0000</pubDate>
                        <description><![CDATA[The Sequence Repository module retrieves protein sequences from FASTA files and computes related information such as coverage. In order to be efficient, it is better to inst...]]></description>
                        <content:encoded><![CDATA[<p><span style="font-size: 10pt">The Sequence Repository module retrieves protein sequences from FASTA files and computes related information such as coverage.</span><br /><span style="font-size: 10pt">For efficiency, it is better to install it on the same computer that runs your Mascot Server, because this module needs to access and read the FASTA files.</span></p><p> </p><p><span style="font-size: 10pt">The parsing_rules file is configured with regular expressions (Java syntax). The RegEx site is very helpful for building a valid regular expression. </span><br /><span style="font-size: 10pt">Here are three examples of parsing rules:</span></p><ol><li><span style="font-size: 10pt">Case one: Uniprot file. Suppose your FASTA files are named like uniprot_&lt;someTextWithOUT'_'&gt;_2017_02.fasta. In these files, entries look like </span><br /><span style="font-size: 10pt"><em>&gt;sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1</em> and the protein name to extract is 6PGD2_YEAST</span></li><li><span style="font-size: 10pt">Case two: Uniprot file 2. Suppose your FASTA files are named like UP_&lt;someTextWithOUT'_'&gt;_20170225.fasta. In these files, entries look like </span><br /><span style="font-size: 10pt"><em>&gt;sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1</em> and the protein name to extract is P53319</span></li><li><span style="font-size: 10pt">Case three: TAIR file. Suppose your FASTA files are named like TAIR10_&lt;anyText&gt;.fasta. In these files, entries look like </span><br /><span style="font-size: 10pt"><em>&gt;AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392</em> and the protein name to extract is AT1G51380.1</span></li></ol><p><span style="font-size: 10pt">The parsing rules should then be:</span></p><p> </p><p><span style="font-size: 10pt">parsing-rules = [{</span><br /><span style="font-size: 10pt">   name="uniprot1",</span><br /><span style="font-size: 10pt">  fasta-name=["uniprot"],                                       <span style="color: #ff9900">// all files whose name starts with 'uniprot' are handled by this rule</span></span><br /><span style="font-size: 10pt">  fasta-version="uniprot_[^_]*_(.*).fasta",   <span style="color: #ff9900">// the uniprot version is extracted from the second '_' to the end of the file name: 2017_02 in Case 1</span></span><br /><span style="font-size: 10pt">  protein-accession ="&gt;\\w{2}\\|[^\\|]*\\|(\\S+)" <span style="color: #ff9900">// extracts the last part of the entry identifier as the accession: 6PGD2_YEAST in Case 1</span></span><br /><span style="font-size: 10pt">},</span><br /><span style="font-size: 10pt">{</span><br /><span style="font-size: 10pt">  name="uniprot2",</span><br /><span style="font-size: 10pt">  fasta-name=["UP_"],                                           <span style="color: #ff9900">// all files whose name starts with 'UP_' are handled by this rule</span> </span><br /><span style="font-size: 10pt">  fasta-version="UP_[^_]*_(.*).fasta",             <span style="color: #ff9900">// the uniprot version is extracted from the second '_' to the end of the file name: 20170225 in Case 2</span></span><br /><span style="font-size: 10pt">  protein-accession ="&gt;\\w{2}\\|([^\\|]+)\\|" <span style="color: #ff9900">// extracts the second part of the entry identifier as the accession: P53319 in Case 2</span></span><br /><span style="font-size: 10pt">},</span><br /><span style="font-size: 10pt">{</span><br /><span style="font-size: 10pt">  name="TAIR",</span><br /><span style="font-size: 10pt">  fasta-name=["TAIR"],                                  <span style="color: #ff9900">// all files whose name starts with 'TAIR' are handled by this rule</span> </span><br /><span style="font-size: 10pt">  fasta-version="TAIR([^_]*)_.*.fasta",    <span style="color: #ff9900">// the TAIR version is extracted after the word 'TAIR' and before the first '_'</span></span><br /><span style="font-size: 10pt">  protein-accession ="&gt;(\\S+)"                    <span style="color: #ff9900">// the protein accession is extracted from the beginning to the first space</span></span><br /><span style="font-size: 10pt">}]</span></p>]]></content:encoded>
						                            <category domain="https://www.profiproteomics.fr/forum/general-utilities/">Utilities</category>                        <dc:creator>Véronique Dupierris</dc:creator>
                        <guid isPermaLink="true">https://www.profiproteomics.fr/forum/general-utilities/parsing-rules-for-sequence-repository/#post-7</guid>
                    </item>
							        </channel>
        </rss>
		