The Sequence Repository module retrieves protein sequences from FASTA files and calculates related information such as sequence coverage.
For efficiency, install it on the same computer as your Mascot Server, because this module needs to access and read the FASTA files.
The parsing_rules file is configured with Java regular expressions. To create and test a valid regular expression, the [url= http://www.regexplanet.com/advanced/java/index.html ]RegEx site[/url] is very helpful.
Here are 3 examples of parsing rules.
The parsing rules should be:
parsing-rules = [{
name="uniprot1",
fasta-name=["uniprot"], // all files whose names start with 'uniprot' are handled by this rule
fasta-version="uniprot_([^_]*)_(.*).fasta", // the UniProt version is extracted from the second '_' to the end of the file name: 2017_02 in Case 1
protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)" // extracts the last part of the entry as the accession: 6PGD2_YEAST in Case 1
},
{
name="uniprot2",
fasta-name=["UP_"], // all files whose names start with 'UP_' are handled by this rule
fasta-version="UP_[^_]*_(.*).fasta", // the UniProt version is extracted from the second '_' to the end of the file name: 20170225 in Case 2
protein-accession =">\\w{2}\\|([^\\|]+)\\|" // extracts the second part of the entry as the accession: P53319 in Case 2
},
{
name="TAIR",
fasta-name=["TAIR"], // all files whose names start with 'TAIR' are handled by this rule
fasta-version="TAIR([^_]*)_.*.fasta", // the TAIR version is extracted after the word 'TAIR' and before the first '_'
protein-accession =">(\\S+)" // the protein accession is extracted from the beginning of the entry to the first space
}]
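Before restarting the module with new rules, it can help to try the regular expressions on a sample header. The following is only a minimal sketch, not part of the Sequence Repository itself. It assumes that the quoted values can be pasted as-is into Java string literals (the doubled backslashes read the same way in both places), that the module searches the text with find() and keeps the first capture group, and that a UniProt header looks like the one below. The class name, the sample headers and the file names are hypothetical; the header is reconstructed from the accession values quoted in the rule comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParsingRuleCheck {

    // Returns capture group 1 of the first match, or null when the rule does not match.
    // Assumption: the module applies the rule with find() and reads group 1.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample headers (description text is a placeholder).
        String uniprotHeader = ">sp|P53319|6PGD2_YEAST placeholder description";
        String tairHeader = ">AT1G01010.1 | placeholder description";

        // protein-accession values copied from the three rules above
        System.out.println(firstGroup(">\\w{2}\\|[^\\|]*\\|(\\S+)", uniprotHeader)); // 6PGD2_YEAST (uniprot1)
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", uniprotHeader));     // P53319 (uniprot2)
        System.out.println(firstGroup(">(\\S+)", tairHeader));                        // AT1G01010.1 (TAIR)

        // fasta-version values applied to hypothetical file names
        System.out.println(firstGroup("UP_[^_]*_(.*).fasta", "UP_Yeast_20170225.fasta")); // 20170225
        System.out.println(firstGroup("TAIR([^_]*)_.*.fasta", "TAIR10_pep.fasta"));        // 10
    }
}
```

If a rule prints null for one of your own headers or file names, the expression does not match and should be adjusted before it goes into the parsing_rules file.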
How to configure Sequence Repository for NCBI entries?
To extract the protein accession gi|47169226 from an entry formatted as:
>gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of Catalase-Peroxidase From Synechococcus Pcc 7942
the parsing_rules file should be configured with a specific entry:
parsing-rules = [{
name="uniprot1",
...
},
{
name="NCBI",
fasta-name=["NCBI"], // all files whose names start with 'NCBI' are handled by this rule
fasta-version="NCBI([^_]*)_.*.fasta", // the NCBI version is extracted after the word 'NCBI' and before the first '_'
protein-accession =">(\\w{2}\\|[^\\|]*)\\|" // the protein accession is extracted from the beginning of the entry to the second '|'
}]
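The NCBI protein-accession expression can be checked against the header quoted above in the same way. This is a rough sketch under the assumption that the module uses find() and the first capture group; the class name and the second header are hypothetical. It also shows one limit of the expression: \w{2} only accepts a two-character database tag such as gi, so headers starting with a longer tag are not matched by this rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NcbiRuleCheck {

    // protein-accession value of the "NCBI" rule, copied from the configuration above
    static final Pattern ACCESSION = Pattern.compile(">(\\w{2}\\|[^\\|]*)\\|");

    // Returns the extracted accession, or null when the header does not match the rule.
    static String accessionOf(String headerLine) {
        Matcher m = ACCESSION.matcher(headerLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = ">gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of "
                + "Catalase-Peroxidase From Synechococcus Pcc 7942";
        System.out.println(accessionOf(header)); // gi|47169226

        // Hypothetical header with a three-letter tag: \w{2} cannot match "ref"
        System.out.println(accessionOf(">ref|XX_000000.1| placeholder description")); // null
    }
}
```

If your FASTA files mix several header styles, a rule that returns null for some of them needs a broader expression or a separate entry in parsing-rules.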