Parsing rules for Sequence Repository

Topic started 20/09/2018 2:22 pm [#5]

The Sequence Repository module is needed to retrieve protein sequences from fasta files and to calculate related information such as coverage.
For efficiency, it is best installed on the same computer that runs your Mascot Server, since this module needs to access and read the fasta files.

The parsing_rules file is configured using regular expressions (Java syntax). To build and test a valid regular expression, the [url= http://www.regexplanet.com/advanced/java/index.html ]RegEx site[/url] is very helpful.
Here are 3 examples of parsing rules.

Case one: Uniprot file. Suppose your fasta files are named like: uniprot_<someTextWithOUT'_'>_2017_02.fasta. In these files, entries look like
>sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
and the protein name to extract is 6PGD2_YEAST
Case two: Uniprot file 2. Suppose your fasta files are named like: UP_<someTextWithOUT'_'>_20170225.fasta. In these files, entries look like
>sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
and the protein name to extract is P53319
Case three: TAIR file. Suppose your fasta files are named like: TAIR10_<anyText>.fasta. In these files, entries look like
>AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
and the protein name to extract is AT1G51380.1

The parsing rules should be:

parsing-rules = [{
name="uniprot1",
fasta-name=["uniprot"], // all files whose name starts with 'uniprot' are handled by this rule
fasta-version="uniprot_[^_]*_(.*).fasta", // the version is extracted from after the second '_' to the end of the file name: 2017_02 in Case 1
protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)" // extracts the last part of the entry identifier as the accession: 6PGD2_YEAST in Case 1
},
{
name="uniprot2",
fasta-name=["UP_"], // all files whose name starts with 'UP_' are handled by this rule
fasta-version="UP_[^_]*_(.*).fasta", // the version is extracted from after the second '_' to the end of the file name: 20170225 in Case 2
protein-accession =">\\w{2}\\|([^\\|]+)\\|" // extracts the second part of the entry identifier as the accession: P53319 in Case 2
},
{
name="TAIR",
fasta-name=["TAIR"], // all files whose name starts with 'TAIR' are handled by this rule
fasta-version="TAIR([^_]*)_.*.fasta", // the TAIR version is extracted after the word TAIR and before the first '_': 10 in Case 3
protein-accession =">(\\S+)" // the protein accession is extracted from the start of the entry up to the first space: AT1G51380.1 in Case 3
}]
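
Before restarting the module, it can be useful to check the expressions locally. The following is only a minimal sketch, not part of the Sequence Repository itself: the class name ParsingRuleCheck, the helper firstGroup and the three file names are hypothetical, and it assumes that the module takes the first capturing group of the first match (the uniprot1 version pattern is therefore written with a single capturing group, as in uniprot2). The double backslashes are written exactly as in the parsing_rules file, since Java string literals use the same escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParsingRuleCheck {

    /** Returns the first capturing group of the first match, or null if there is no match. */
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Entry lines copied from the three cases above.
        String uniprotEntry = ">sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, "
                + "decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) "
                + "GN=GND2 PE=1 SV=1";
        String tairEntry = ">AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein "
                + "| chr1:19047960-19049967 FORWARD LENGTH=392";

        // protein-accession rules
        System.out.println(firstGroup(">\\w{2}\\|[^\\|]*\\|(\\S+)", uniprotEntry)); // Case 1
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", uniprotEntry));     // Case 2
        System.out.println(firstGroup(">(\\S+)", tairEntry));                       // Case 3

        // fasta-version rules, applied to hypothetical file names
        System.out.println(firstGroup("uniprot_[^_]*_(.*).fasta", "uniprot_yeast_2017_02.fasta"));
        System.out.println(firstGroup("UP_[^_]*_(.*).fasta", "UP_human_20170225.fasta"));
        System.out.println(firstGroup("TAIR([^_]*)_.*.fasta", "TAIR10_pep.fasta"));
    }
}
```

If a line prints null, the expression does not match that input and the corresponding rule would need adjusting.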
