
[Sticky] Parsing rules for Sequence Repository  


(@vero)
Member Admin
Joined: 10 months ago
Posts: 18
20/09/2018 3:22  

The Sequence Repository module retrieves protein sequences from FASTA files and computes related information such as coverage.
For efficiency, install it on the same computer that runs your Mascot Server, since the module has to access and read the FASTA files.

 

The parsing_rules file is configured with regular expressions (Java syntax). To build and test a valid regular expression, the [url=http://www.regexplanet.com/advanced/java/index.html]RegEx site[/url] is very helpful.
Here are three examples of parsing rules.

  1. Case one: UniProt file. Suppose your FASTA files are named like uniprot_<someTextWithOUT'_'>_2017_02.fasta. In these files, entries look like
    >sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
    and the protein name to extract is 6PGD2_YEAST.
  2. Case two: UniProt file 2. Suppose your FASTA files are named like UP_<someTextWithOUT'_'>_20170225.fasta. In these files, entries look like
    >sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
    and the protein name to extract is P53319.
  3. Case three: TAIR file. Suppose your FASTA files are named like TAIR10_<anyText>.fasta. In these files, entries look like
    >AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
    and the protein name to extract is AT1G51380.1.

For these three cases, the parsing rules should be:

 

parsing-rules = [{
  name="uniprot1",
  fasta-name=["uniprot"],                        // every file whose name starts with 'uniprot' is handled by this rule
  fasta-version="uniprot_[^_]*_(.*)\\.fasta",    // the UniProt version is everything after the second '_', up to the extension: 2017_02 in Case 1
  protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)" // extracts the third '|'-separated field of the entry as accession: 6PGD2_YEAST in Case 1
},
{
  name="uniprot2",
  fasta-name=["UP_"],                            // every file whose name starts with 'UP_' is handled by this rule
  fasta-version="UP_[^_]*_(.*)\\.fasta",         // the UniProt version is everything after the second '_', up to the extension: 20170225 in Case 2
  protein-accession =">\\w{2}\\|([^\\|]+)\\|"     // extracts the second '|'-separated field of the entry as accession: P53319 in Case 2
},
{
  name="TAIR",
  fasta-name=["TAIR"],                           // every file whose name starts with 'TAIR' is handled by this rule
  fasta-version="TAIR([^_]*)_.*\\.fasta",        // the TAIR version is the text between the word 'TAIR' and the first '_'
  protein-accession =">(\\S+)"                   // the protein accession runs from just after '>' to the first whitespace
}]
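
Before restarting the module, the expressions can be tried out locally. The sketch below is only a rough check, not how Sequence Repository is implemented: it assumes the module searches the text for the pattern and keeps the first capture group. The class ParsingRuleCheck, the method firstGroup and the three file names (the "yeast" and "pep" parts) are made up for illustration. The file-name patterns are written with a single capture group and with the dot before "fasta" escaped. Note that the double backslashes are the same in the parsing_rules file and in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sequence Repository: it only mimics
// "search the text, keep capture group 1" so that rules can be tried locally.
class ParsingRuleCheck {

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Entry lines copied from the three cases above.
        String uniprotEntry = ">sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, "
                + "decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) "
                + "GN=GND2 PE=1 SV=1";
        String tairEntry = ">AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein "
                + "| chr1:19047960-19049967 FORWARD LENGTH=392";

        // protein-accession expressions
        System.out.println(firstGroup(">\\w{2}\\|[^\\|]*\\|(\\S+)", uniprotEntry)); // 6PGD2_YEAST
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", uniprotEntry));     // P53319
        System.out.println(firstGroup(">(\\S+)", tairEntry));                       // AT1G51380.1

        // fasta-version expressions, applied to made-up file names
        System.out.println(firstGroup("uniprot_[^_]*_(.*)\\.fasta", "uniprot_yeast_2017_02.fasta")); // 2017_02
        System.out.println(firstGroup("UP_[^_]*_(.*)\\.fasta", "UP_yeast_20170225.fasta"));          // 20170225
        System.out.println(firstGroup("TAIR([^_]*)_.*\\.fasta", "TAIR10_pep.fasta"));                // 10
    }
}
```

If a call prints null, the expression does not match that line at all; if it prints the wrong text, the parentheses are around the wrong part of the expression.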

This topic was modified 3 months ago by Véronique Dupierris

(@vero)
Member Admin
Joined: 10 months ago
Posts: 18
20/09/2018 3:24  

How to configure Sequence Repository for NCBI entries?

To extract the protein name gi|47169226 from an entry formatted as:
    >gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of Catalase-Peroxidase From Synechococcus Pcc 7942
the parsing_rules file should contain a specific NCBI rule:

parsing-rules = [{
   name="uniprot1",
 ...
},
{
  name="NCBI",
  fasta-name=["NCBI"],                           // every file whose name starts with 'NCBI' is handled by this rule
  fasta-version="NCBI([^_]*)_.*\\.fasta",        // the NCBI version is the text between the word 'NCBI' and the first '_'
  protein-accession =">(\\w{2}\\|[^\\|]*)\\|"     // the protein accession runs from just after '>' up to, but not including, the second '|'
}]
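
The NCBI expression can be checked the same way. This is again only a sketch under the assumption that the module keeps the first capture group of the first match; the class NcbiRuleCheck and its method firstGroup are hypothetical names. It also shows why the parentheses enclose the first two fields: the "uniprot2" expression from the first post would drop the gi| prefix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sequence Repository.
class NcbiRuleCheck {

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Entry line copied from the post above.
        String ncbiEntry = ">gi|47169226|pdb|1UB2|A Chain A, Crystal Structure Of "
                + "Catalase-Peroxidase From Synechococcus Pcc 7942";

        // NCBI rule: the group spans the two-letter tag, the '|' and the identifier.
        System.out.println(firstGroup(">(\\w{2}\\|[^\\|]*)\\|", ncbiEntry)); // gi|47169226

        // "uniprot2" rule for comparison: the group holds the identifier only.
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", ncbiEntry)); // 47169226
    }
}
```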

