Please note: This user interface responds dynamically to your input prompts.

There is no "Submit" button. If it seems slow to update, it is busy processing your data. Please be patient.

Step 1: Select Sites

This phase of analysis requires:

  1. A protein alignment, provisionally assumed to be in FASTA format.
  2. A way to recognize how the longitudinal samples are labeled.
    By default, sequence names are assumed to be dot-delimited, with the timepoint label in the first (left-most) field.
  3. An indication of which sequence is the TF sequence.
    This is currently taken to be the first sequence in the alignment. If HXB2 is included with the alignment, it is used to number alignment columns then removed.

Specify a Protein Alignment

You can choose either to use the example data from CH505 or your own alignment. If you select your own alignment file, consider using the Advanced Options.

Choose a Cutoff TF Loss

You must choose a setting for the cutoff parameter value. The number of sites selected depends on the TF loss cutoff threshold. Higher cutoffs include fewer sites. This is set high by default, but can be adjusted using the slider bar.

Review Results

A list of selected sites will appear when defined. The table is interactive. You can sort rows by clicking on the column name and page through results. Searching is also possible.

Things to look for when reviewing results are the number of sites and accurately parsed timepoint labels.

Timepoint labels can be in any units and are assumed to indicate time elapsed since infection. A mixture of different units is not advised. The values are used primarily when reordering sites. After parsing the field indicated, any characters (e.g. "d123" or "56dpi") will be stripped and the remaining digits converted to a numeric value (123 and 56).

Other informative columns for review are when_up and tf_area. The former indicates when the site first exceeds 10% TF loss, or another value if given by the "Order sites by when they first reach..." slider. The tf_area is computed as the polygon area under the TF line, which may help resolve timing between sites that revert or change at different rates. The columns labeled 'l' and 'r' are used internally to label sites where indels occur relative to the reference sequence, and can be disregarded. (They will eventually be removed from this table.) For rows where these values equal, either value gives the reference sequence numbering. Rows where 'l' and 'r' differ indicate inserted sites, relative to the reference sequence, as is common for hypervariable loops of the HIV-1 Envelope glycoprotein.

When you are satisfied with the selected sites, check the "Send selected sites to Step 2" box to proceed, then click on "Step 2: Select Sequences".

Advanced Options

Check the box to enable several more choices:

  1. Specify the format of your alignment file.
  2. Transform asparagines (N) in PNG motifs to another character (O). The PNG motif is Nx[ST], where x is any amino acid except proline.
  3. Define options to parse the sample timepoint label from each sequence. Identifying timepoint labels is central to the longitudinal analysis. Each sequence should use a standard naming scheme, delimited by one of the characters listed, and in the same field. Parse errors will appear in place of the table of selected sites.

Not yet done:

  1. Force inclusion/exclusion of specific sites.
  2. Add option to specify TF sequence name.
  3. Add option to specify reference sequence name.

Step 2: Select Sequences

Information on this tab depends on information provided in the preceding step. When results are available, a summary of the number of selected sequences will appear on the top left, and a list of concatamers will appear in a large panel.

Not yet done (future "advanced options"):

  1. Generate/download time-series plots per site.
  2. Force inclusion/exclusion of specific sequences by name.
  3. Specify maximum tolerated deletion & insertion length.

Concatamer List

Concatamers are the selected sites from Step 1, strung together. One concatamer is listed for each of the selected sequences. The order of sites from left to right is intended to capture their relative strength of selection, sorted first by when each site first reaches at least 10% TF loss, then by the cumulative (non-) TF area.

Download Results

Each link under "Download Results" provides a different summary of the analysis, including plots of sequence logos for selected sites among the selected sequences.

Advanced Options

  1. The "Minimum variant count" setting specifies that each mutation should occur in the alignment at least this many times for inclusion among selected sequences, e.g. 2 omits singleton mutations. Increasing this setting, or reducing the number of selected sites in Step 1, will give fewer sets of selected sequences.
  2. To do: specify indel lengths to be tolerated for inclusion.
  3. Add support to include and exclude sequences by name.
  4. Add ability to download refseq lookup table.

Sequence Logos

The code to generate sequence logos is a standalone python program. If you see an error like this "Could not find Ghostscript on path. There should be either a gs executable or a gswin32c.exe on your system's path" then please install gs or gswin32c.exe per instructions for "Dependencies" provided at http://weblogo.threeplusone.com/manual.html#download




phraber/meta-lass documentation built on May 25, 2019, 6:02 a.m.