Parsing rules for Sequence Repository : Which parsing rule to use for NCBI fasta files

Véronique Dupierris — Thu, 20 Sep 2018 12:24:45 +0000

How do you configure Sequence Repository for NCBI entries?

To extract the protein name gi|47169226 from an entry formatted as:
>gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of Catalase-Peroxidase From Synechococcus Pcc 7942
the parsing_rules file should be configured with a specific entry:

parsing-rules = [{
name="uniprot1",
...
},
{
name="NCBI",
fasta-name=, // every file whose name starts with 'NCBI' is handled by this rule
fasta-version="NCBI([^_]*)_.*.fasta", // the NCBI version is extracted after the word NCBI and before the first '_'
protein-accession =">(\\w{2}\\|[^\\|]*)\\|" // the protein accession is extracted from the beginning of the entry up to the second '|'
}]
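
As a quick sanity check, the protein-accession expression can be tried against the example entry outside of Sequence Repository. The sketch below is only an illustration: the class and method names are hypothetical, the [^\\|] character class is an assumed reading of the rule comment ("from beginning to second |"), and it assumes the accession is taken from the first capture group of a partial match, which the post does not state.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Sequence Repository.
class NcbiAccessionCheck {

    // Assumed reading of the NCBI rule: capture everything after '>' up to the second '|'.
    // In a Java string literal, "\\|" is the regex \| (a literal pipe character).
    static final Pattern NCBI_ACCESSION = Pattern.compile(">(\\w{2}\\|[^\\|]*)\\|");

    // Returns the first capture group, or null when the header does not match.
    static String extractAccession(String header) {
        Matcher m = NCBI_ACCESSION.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = ">gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of "
                + "Catalase-Peroxidase From Synechococcus Pcc 7942";
        System.out.println(extractAccession(header)); // prints gi|47169226
    }
}
```

If the printed value is not the accession you expect for your own fasta headers, adjust the expression here first and copy it into the parsing_rules file afterwards.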

Parsing rules for Sequence Repository

Véronique Dupierris — Thu, 20 Sep 2018 12:22:55 +0000

The Sequence Repository module is needed to retrieve protein sequences from fasta files and to calculate related information such as coverage.
For efficiency, it is best to install it on the same computer that runs your Mascot Server, because this module needs to access and read the fasta files.

The parsing_rules file is configured using regular expressions (Java syntax). To create a valid regular expression, the RegEx site is very helpful.
Here are three examples of parsing rules.

Case one: Uniprot file. Suppose your fasta files are named like uniprot__2017_02.fasta. In these files, entries look like
>sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
and the protein name to extract is 6PGD2_YEAST.
Case two: Uniprot file 2. Suppose your fasta files are named like UP__20170225.fasta. In these files, entries look like
>sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
and the protein name to extract is P53319.
Case three: TAIR file. Suppose your fasta files are named like TAIR10_.fasta. In these files, entries look like
>AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
and the protein name to extract is AT1G51380.1.

The parsing rules should be:

parsing-rules = [{
name="uniprot1",
fasta-name=, // every file whose name starts with 'uniprot' is handled by this rule
fasta-version="uniprot_[^_]*_(.*).fasta", // the uniprot version is extracted from the second '_' to the end of the file name: 2017_02 in Case one
protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)" // extracts the last part of the entry identifier as accession: 6PGD2_YEAST in Case one
},
{
name="uniprot2",
fasta-name=, // every file whose name starts with 'UP_' is handled by this rule
fasta-version="UP_[^_]*_(.*).fasta", // the uniprot version is extracted from the second '_' to the end of the file name: 20170225 in Case two
protein-accession =">\\w{2}\\|([^\\|]+)\\|" // extracts the second part of the entry identifier as accession: P53319 in Case two
},
{
name="TAIR",
fasta-name=, // every file whose name starts with 'TAIR' is handled by this rule
fasta-version="TAIR([^_]*)_.*.fasta", // the TAIR version is extracted after the word TAIR and before the first '_'
protein-accession =">(\\S+)" // the protein accession is extracted from the beginning of the entry up to the first space
}]
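
Before restarting the module with new rules, it can help to try the expressions against the file names and entries from the three cases. The following is a minimal sketch, not Sequence Repository code: the class and method names are hypothetical, the [^_] and [^\\|] character classes are an assumed reading of what the rule comments describe, and it assumes the value is taken from the first capture group of a partial match. It also assumes the doubled backslashes in the config strings are read the same way as in a Java string literal. File names are copied as they appear in the cases above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness for trying parsing-rule expressions by hand.
class ParsingRuleCheck {

    // Returns the first capture group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String uniprotEntry = ">sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, "
                + "decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) "
                + "GN=GND2 PE=1 SV=1";
        String tairEntry = ">AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein "
                + "| chr1:19047960-19049967 FORWARD LENGTH=392";

        // protein-accession expressions (assumed reconstructions of the three rules)
        System.out.println(firstGroup(">\\w{2}\\|[^\\|]*\\|(\\S+)", uniprotEntry)); // 6PGD2_YEAST
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", uniprotEntry));     // P53319
        System.out.println(firstGroup(">(\\S+)", tairEntry));                       // AT1G51380.1

        // fasta-version expressions, applied to the example file names
        System.out.println(firstGroup("uniprot_[^_]*_(.*).fasta", "uniprot__2017_02.fasta")); // 2017_02
        System.out.println(firstGroup("UP_[^_]*_(.*).fasta", "UP__20170225.fasta"));          // 20170225
        System.out.println(firstGroup("TAIR([^_]*)_.*.fasta", "TAIR10_.fasta"));              // 10
    }
}
```

Replace the sample strings with a real file name and a real entry line from your own fasta file to check a rule before adding it to the parsing_rules file.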
