................ SHORT DOC .............................................

VIRAPOPS:
   Simulates a viral population (proteins or DNA) from a family of seed sequences.

References:
   Petitjean M., Vanet A.
   VIRAPOPS: a Forward Simulator Dedicated to Rapidly Evolved Viral Populations.
   Bioinformatics, 2014, 30[4], 578-580.
   DOI 10.1093/bioinformatics/btt724

   Petitjean M., Vanet A.
   VIRAPOPS2 Supports the Influenza Virus Reassortments.
   Source Code Biol. Med., 2014, 9:18.
   DOI 10.1186/1751-0473-9-18

Hotline:
   For biology related questions: anne.vanet@univ-paris-diderot.fr
   For computation related problems: petitjean.chiral@gmail.com
   (please provide relevant input and output files in your contact email)

Summary:
   VIRAPOPS reads a set of initial DNA sequences from a file.
   These are mutated if requested, copied to simulate budding virions, and
   translated into protein sequences (standard genetic code).
   The mutated DNA sequences are added to those of the previous generation, and
   the whole set of DNA sequences is then translated into protein sequences.
   The process is applied iteratively until a user-defined number of generations
   is reached.
   Several parameters and variants of the scheme above can be defined by the user.

Calling VIRAPOPS by script:

# script: begin
# in this example, it is assumed that:
# (i)   the executable is named VIRAPOPS, and
# (ii)  it is located in the current working directory, and
# (iii) it was flagged as executable, e.g. through the command
#       chmod 500 ./VIRAPOPS
#
cat << eof | ./VIRAPOPS
data lines as read from keyboard (see parameters below)
data lines as read from keyboard (see parameters below)
data lines as read from keyboard (see parameters below)
etc...
eof
#
# script: end

Input data and parameters:
-------------------------

1. Sequences in FASTA format [Y/N].
   If the FASTA format is selected:
   Only A,C,G,T or a,c,g,t symbols are allowed in user input DNA sequences.
   Sequences are read and written on single lines.
   Comment lines written in the output contain the symbol > followed by:
   for a DNA sequence, the internal sequence number of this DNA sequence, or,
   for a protein sequence, the internal sequence number of its parent DNA sequence
   followed by the internal sequence number of this protein sequence.

2. Input sequence(s) file name.
   The name of the file containing the initial DNA sequence(s).
   The input DNA sequences are expected to have a common length that is a multiple of 3.
   Capital letters are allowed but not mandatory.
   The maximal DNA sequence length is 10500.

   Optional complementation of the initial sequence(s) (** VIRAPOPS versions 2 and above **)
   Enter 0: no complementation; 1: the ancestral sequence is complemented;
   2: all sequences are complemented.

   Optional. Deletion of the first read DNA sequence and its coded protein.
   This optional deletion is useful when VIRAPOPS reads more than one input DNA sequence.
   The first DNA sequence is **always** the ancestral one, even if only one sequence is read.
   If the ancestral DNA sequence is deleted, it is used only to compute mutation statistics,
   and it is ignored by the iterative process generating new sequences.

3. Number n of generations to be simulated.
   The standard genetic code is applied.
   The proteins coded from the sequence(s) read in the input file correspond to
   generation number 0.
   The DNA sequences stored at generation k-1 mutate according to the rules selected
   by the user, then code for proteins.
   The whole set of DNA sequences and the whole set of protein sequences are updated
   according to the user rules, producing the content of generation k.

4. Number of generated DNA sequences per DNA parent sequence.
   This is the number of times that the mutation and coding process is repeated
   at each generation.

5. Optional. Maximum life duration for DNA sequences.
   Each DNA sequence is created at age 0.
   At each generation the ages of all DNA sequences are increased by 1.

6. Optional. User-defined maximum cumulated number of DNA sequences.
   The number of sequences grows exponentially in the worst case, which is why
   the user can enter this maximum value.
   The mutation process is performed first. Then, if that maximum value is exceeded,
   DNA sequences are randomly deleted so that the cumulated number of DNA sequences
   remains equal to that maximum value.
   Remark: the maximum number of storable DNA sequences is limited to 65536.
   If the user enters a larger value, it is set to this maximum value.
   The source has to be recompiled to extend this limit.

7. Default mutation rate per DNA position, in [0;1] (e.g., 1.E-4).
   Entering 0 means that no mutation will occur.

8. Optional. Mutational hotspots.
   For each hotspot, enter: DNA position, probability (in [0;1]) to mutate at this position.
   It overrides the default mutation rate per DNA position.

9. Reassortments (** VIRAPOPS versions 2 and above **)
   Enter the lengths of the fragments (e.g. 7 or 8 fragment lengths to be entered).
   When the fragments of a long sequence, separated by spaces and/or line breaks,
   are stored in a file named, say, virapops_fragments.txt, the following command
   gives the desired fragment lengths:
   awk 'BEGIN {FS=" "}{for(i=1;i<=NF;i=i+1) {printf length($i); printf " "}}{printf " "}' virapops_fragments.txt; echo " "
   It is also possible to store this command in a shell script file (to be flagged
   executable), say, virapops_fragments.s:
   awk 'BEGIN {FS=" "}{for(i=1;i<=NF;i=i+1) {printf length($i); printf " "}}{printf " "}' $1; echo " "
   and then execute the script file:
   ./virapops_fragments.s virapops_fragments.txt
   In both cases, the fragment lengths are displayed on the screen and can be
   copied into VIRAPOPS with the mouse.
   Enter the probabilities of reassortment of the fragments (e.g. 7 or 8 values in [0;1]).

   Optional. Recombinations (all versions).
   Enter either the number of recombination events within this interval, or the
   recombination percentage followed by the symbol % (no space should be inserted
   between the number and the % symbol).
   Two DNA sequences are recombined as follows: a random cut occurs on each sequence
   (at the same location for both sequences); then the second part of the second
   sequence is concatenated to the first part of the first sequence, while the second
   part of the first sequence is concatenated to the first part of the second sequence.

10. Optional. Selection pressure on compensatory protein positions, step 1.
    Enter a closed generations interval n1, n2 in which the selection applies,
    followed by the 1 letter reference amino acid code
    (enter * to select the one of the ancestral sequence)
    /* Remark: the amino acid letter above is required for version 2.1 and above */
    followed by a probability of survival (in [0..1]),
    followed by a list of positions a1, a2, ..., ak.
    The uniform law applies to non flagged positions.

11. Optional. Selection pressure on synthetic lethal protein positions, step 2.
    Enter a closed generations interval n1, n2 in which the selection applies,
    followed by the 1 letter reference amino acid code
    (enter * to select the one of the ancestral sequence)
    /* Remark: the amino acid letter above is required for version 2.1 and above */
    followed by a probability of extinction (in [0..1]),
    followed by a list of positions a1, a2, ..., ak.
    The uniform law applies to non flagged positions.

12. Optional. Selection pressure on protein positions, step 3.
    Enter a closed generations interval n1, n2 in which the selection applies,
    followed by the 1 letter amino acid code,
    followed by a probability p of obtaining it (p in [0..1]),
    followed by a list of protein positions a1, a2, ..., ak.
    The triplet coding the selected amino acid appears with a probability p in the
    mutated DNA sequence.
    If several such coding triplets exist, the closest to the non mutated one is selected.
    If several such coding triplets still exist, a random one is selected.
    Then the protein sequence is generated according to that modified DNA sequence.

13. Optional. Genetic drift.
    Enter the generation number, followed by the number or percentage of surviving
    sequences (to indicate a percentage, no space should be inserted between the
    number and the % symbol).

14. Optional. Gene flow.
    An additional set of DNA sequences is read from a user specified file.
    Enter the generation number at which this file is to be read.
    If it is nonzero, enter the input gene flow file name.
    The sequences are expected to have the same length as the ancestral sequence.
    All sequences are included together at the specified generation.

15. Optional. Stop criterion based on the percentage of DNA mutations.
    Enter the interval p1, p2 in which the mean percentage of mutations should fall
    to stop the program (no % symbol is required).

16. Optional. Stop criterion based on the percentage of mutations in protein sequences.
    Enter the interval p1, p2 in which the mean percentage of mutations should fall
    to stop the program (no % symbol is required).

17. Trial number (default: 0).
    Running the program twice with identical input data generates the same output
    twice, unless the trial numbers differ.
    A small integer number is expected.

18. Optional. Removal of redundant DNA sequences [Y/N].
    Entering Y significantly reduces the size of the output, although it has no
    immediate biological interpretation.

19. Optional. Removal of redundant protein sequences [Y/N].
    Entering Y significantly reduces the size of the output, although it has no
    immediate biological interpretation.

20. Number m of final generations to be displayed.
    The sequences obtained at the first n-m generations are not displayed.
    Entering m=1 means that only the content of the final generation is displayed,
    unless the user stop criterion 15 or 16 induces an exit before this display occurs.

21. Optional output statistics (** VIRAPOPS versions 2 and above **)
    Number of mutations per DNA position
    Number of mutations per protein position
    Modal DNA mutated values
    Modal protein mutated values
    Observed DNA mutated values
    Observed protein mutated values
    DNA consensus sequence
    Protein consensus sequence

Output results:
--------------
   The proteins coded by the initial DNA sequences.
   When more than one initial DNA sequence is read, the number of mutations and
   the mean number of mutations are reported, both for the DNA sequences and for
   their coded protein sequences.
   This is the generation number 0.

   The list of the DNA sequences followed by the list of the protein sequences,
   at each displayed generation.

   The cumulated numbers of DNA sequences and protein sequences, their associated
   cumulated mutated positions and mean mutation ratios are output at each generation.

................ END SHORT DOC .........................................
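
Checking an input sequence file before a run (sketch):
   The constraints stated for the input sequences (only A,C,G,T or a,c,g,t symbols,
   a common length that is a multiple of 3, at most 10500 symbols per sequence)
   can be verified before calling VIRAPOPS. The function below is a minimal sketch
   and is not part of VIRAPOPS: the name check_virapops_input is hypothetical, and
   it assumes one sequence per line, with lines starting with > treated as comment
   lines and blank lines ignored. How VIRAPOPS itself reacts to such lines is not
   described here.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of VIRAPOPS.
# Assumptions: one DNA sequence per line; lines starting with ">" are
# comment lines; blank lines are ignored.
# Returns 0 if every sequence satisfies the documented constraints, 1 otherwise.
check_virapops_input() {
    awk '
        /^>/       { next }    # skip comment lines
        /^[ \t]*$/ { next }    # skip blank lines
        {
            n++
            len = length($0)
            # only A,C,G,T symbols (upper or lower case) are allowed
            if ($0 !~ /^[ACGTacgt]+$/) {
                printf "line %d: symbol other than A,C,G,T\n", NR; bad = 1
            }
            # the length must be a multiple of 3
            if (len % 3 != 0) {
                printf "line %d: length %d is not a multiple of 3\n", NR, len; bad = 1
            }
            # documented maximal DNA sequence length
            if (len > 10500) {
                printf "line %d: length %d exceeds 10500\n", NR, len; bad = 1
            }
            # all sequences must share the length of the first one
            if (n == 1) {
                first = len
            } else if (len != first) {
                printf "line %d: length %d differs from first sequence (%d)\n", NR, len, first; bad = 1
            }
        }
        END {
            if (n == 0) { print "no sequence found"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

   Once the function is defined in the current shell, it can be called on the
   input file, e.g. check_virapops_input my_sequences.txt (the file name is only
   an example); a nonzero exit status means at least one constraint is violated.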