README.md

Molecular Data Organization for Publication (MDOP)

Description:

The Molecular Data Organization for Publication (MDOP) R package has functions to assist researchers in uploading molecular sequence data and associated metadata to public databases.

Functions:

target_file_list()

recursive_copy()

max_packs()

copy_by_list()

degap()

rank_seq()

head_derep()

seq_derep()

multi_to_single_fasta()

Dependent Functions:

cmd_line_instruction_directory()

cmd_line_instruction_file()

filetypeasinteger()

readmaxfoldersize()

readpath()

Installation:

Function Descriptions:

Upload tools

The following three sections describe scenarios where files or data need to be obtained or manipulated. The package and its nine tools work in Microsoft Windows and macOS environments, and no arguments need to be supplied when calling a function. If desired, the user can supply one or more of the arguments associated with each tool; the tool will then prompt the user for any data that are still missing.

Obtaining file lists

When preparing uploads to centralized databases, a list of all files with associated information (e.g., image file data, trace file data) is needed. These data can be obtained with DOS or macOS command-line commands. However, use of the command prompt is not always possible, particularly where security features restrict it, such as in government institutions and industry.

target_file_list()

The function target_file_list() can be used to address this difficulty. This function lists files with the extensions JPG, AB1, or FASTA/FAS for a chosen directory and all subdirectories. The list of file names and file paths is saved as a text file in the chosen directory. The user can either submit the file path and the file type as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. When running without arguments, the user is first prompted to choose a file folder as the location to save the output file, and then to input the type of file to be listed (JPG, AB1, or FAS). The output is written to a text file with the naming convention YYYYMMDD_target_file_list_TYP.txt, where the first eight characters represent the date of running, the second section refers to the function name, and the final section, TYP, is the file type chosen (JPG, AB1, or FAS).
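The listing step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's own implementation: the helper name `list_by_extension()` and the output file name are hypothetical, and the demo runs on a throwaway directory tree.

```r
# Hypothetical helper, not part of MDOP: list files with a given extension
# in a directory and all of its subdirectories, using base R only.
list_by_extension <- function(dir, ext) {
  # Case-insensitive pattern such as "\\.jpg$"
  pattern <- paste0("\\.", ext, "$")
  paths <- list.files(dir, pattern = pattern, recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  data.frame(file = basename(paths), path = paths, stringsAsFactors = FALSE)
}

# Demo on a throwaway directory tree
demo_dir <- file.path(tempdir(), "mdop_list_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "A.jpg"),
            file.path(demo_dir, "sub", "B.JPG"),
            file.path(demo_dir, "notes.txt"))

listing <- list_by_extension(demo_dir, "jpg")

# Save the listing under a dated name (illustrative name, modelled on the
# YYYYMMDD prefix used in this README)
out_file <- file.path(demo_dir,
                      paste0(format(Sys.Date(), "%Y%m%d"), "_file_list_JPG.txt"))
write.table(listing, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The `listing` data frame holds one row per matching file, with the file name in the first column and its full path in the second.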

File Manipulation

Moving, copying, sorting, and subsetting large numbers of files is often necessary when preparing to upload data to shared databases. The organization of files for upload is made more difficult when processing files from multiple sources, research groups, and over time. These three functions can assist in the organization of diverse sets of files for upload.

recursive_copy()

Often, the submission of numerous files, including image and chromatogram (trace) files, to a centralized data system is necessary. Bringing files into a central folder may be difficult when dealing with large numbers of files stored in cascading file structures. recursive_copy() is written to bring all files with a specific extension into a central location, thereby making it easier to upload these files. The function copies files with the extensions JPG, AB1, or FASTA/FAS from a directory and all subdirectories and places them in a single destination folder. The user can either submit the file path and the file type as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. When running without arguments, the user is first prompted to choose a file folder where the new folder of copied files will be located, and then to input the type of file to be copied (JPG, AB1, or FAS). The output appears in a file folder with the naming convention YYYYMMDD_recursive_copy_TYP, where the first eight characters represent the date of running, the second section refers to the function name, and the final section, TYP, is the file type chosen (JPG, AB1, or FAS).
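As a rough illustration of the flattening idea, the following base R sketch copies every file with a given extension from a directory tree into one folder. The helper `flatten_copy()` is hypothetical and is not the package's code; the choice to keep the first copy when two files share a base name is an assumption made for this sketch.

```r
# Hypothetical helper, not part of MDOP: copy every file with a given
# extension from a directory tree into one flat destination folder.
flatten_copy <- function(src_dir, dest_dir, ext) {
  paths <- list.files(src_dir, pattern = paste0("\\.", ext, "$"),
                      recursive = TRUE, full.names = TRUE, ignore.case = TRUE)
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  # Files sharing a base name would collide in a flat folder;
  # overwrite = FALSE keeps the first copy (an assumption of this sketch)
  copied <- file.copy(paths, file.path(dest_dir, basename(paths)),
                      overwrite = FALSE)
  basename(paths)[copied]
}

# Demo on a throwaway directory tree
src <- file.path(tempdir(), "mdop_copy_src")
dest <- file.path(tempdir(), "mdop_copy_dest")
unlink(c(src, dest), recursive = TRUE)
dir.create(file.path(src, "1", "nested"), recursive = TRUE)
file.create(file.path(src, "1", "A.ab1"),
            file.path(src, "1", "nested", "B.ab1"),
            file.path(src, "C.jpg"))

copied <- flatten_copy(src, dest, "ab1")
```

In this sketch the destination is kept outside the source tree so that a second run does not pick up files copied by the first.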

max_packs()

Uploads of image files to centralized databases are often limited to a particular size per upload. It can be time consuming and challenging to partition files into folders of target sizes. The max_packs() function creates these partitioned folders quickly and easily. This function takes a single file folder (containing no subfolders) with target files (JPG, AB1, or FAS) and places them into file folders based on a maximum file folder size. The user can either submit the file path, file type, and maximum desired file folder size as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. When running without arguments, the user is first prompted to choose the file folder where the new folders of copied files will be located, and then to input the type of file to be copied (JPG, AB1, or FAS). Finally, the user is required to input an integer value (in MB) for the maximum allowable size of the folders created with the copied files. The output appears in the target file folder location with the naming convention YYYYMMDD_max_packs_TYP_#, where the first eight characters represent the date of running, the second section refers to the function name, the third element, TYP, is the file type chosen (JPG, AB1, or FAS), and the final element, #, is an index for the folder number.
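The README does not state how files are assigned to folders. One plausible approach, sketched below as an assumption rather than as the package's algorithm, is to fill folders sequentially and start a new one whenever the next file would push the running total over the limit. The helper `assign_packs()` and the file sizes are made up for illustration.

```r
# Hypothetical helper, not part of MDOP: assign files, in order, to numbered
# packs so that the total size of each pack stays at or below max_size
# (max_size uses the same units as sizes).
assign_packs <- function(sizes, max_size) {
  pack <- integer(length(sizes))
  current <- 1L
  used <- 0
  for (i in seq_along(sizes)) {
    # Start a new pack when this file would exceed the limit, unless the
    # current pack is still empty (an oversized file gets a pack of its own)
    if (used > 0 && used + sizes[i] > max_size) {
      current <- current + 1L
      used <- 0
    }
    pack[i] <- current
    used <- used + sizes[i]
  }
  pack
}

# Made-up file sizes in MB
sizes_mb <- c(A = 8, B = 7, C = 5, D = 12, E = 9, F = 3)
packs <- assign_packs(sizes_mb, max_size = 20)

# Group file names by pack number
split(names(sizes_mb), packs)
```

For real files, sizes in MB could be obtained with `file.size(paths) / 1024^2` before calling the helper.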

copy_by_list()

It is likely, after scrutiny, that some files associated with molecular records will not need to be uploaded to shared databases due to quality filtering. For example, if a DNA sequence is of poor quality it might be removed from the dataset for potential upload. This would then require the removal of associated metadata files. It is often time consuming to complete a point-and-click removal for all these records. The screening of these poor-quality records is often completed in fasta files and/or through the use of lists. copy_by_list() copies selected files from a larger file folder into a new file folder, based on a list of file names in a target text file. This script will not look at subdirectories in the target directory. To place all files, including those in subdirectories, into a single file folder see recursive_copy(). The user can either submit the file path and target file list as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. When running without arguments, the user is first prompted to choose a file folder where the new folder with copied files will be located, and then to select the target file with the list of desired files. The name of the new destination file folder follows the format YYYYMMDD_copy_by_list, where the first eight characters represent the date of running and the second section refers to the function name. The text file listing the files to be copied needs to have one file name per line and a single blank line at the end of the list.
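The list-driven copy can be sketched in base R as follows. The helper `copy_listed()` is hypothetical and only illustrates the idea; reporting the listed names that were not found is an addition of this sketch, not a documented feature of copy_by_list().

```r
# Hypothetical helper, not part of MDOP: copy the files named in a text file
# (one name per line) from src_dir into dest_dir, ignoring subdirectories.
copy_listed <- function(src_dir, list_file, dest_dir) {
  wanted <- trimws(readLines(list_file))
  wanted <- wanted[nzchar(wanted)]  # drop blank lines, e.g. the trailing one
  present <- wanted[file.exists(file.path(src_dir, wanted))]
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  file.copy(file.path(src_dir, present), dest_dir)
  list(copied = present, missing = setdiff(wanted, present))
}

# Demo on throwaway files
src <- file.path(tempdir(), "mdop_bylist_src")
dest <- file.path(tempdir(), "mdop_bylist_dest")
unlink(c(src, dest), recursive = TRUE)
dir.create(src)
file.create(file.path(src, c("A.jpg", "B.jpg", "C.jpg")))

# One file name per line, followed by a single blank line
list_file <- file.path(tempdir(), "mdop_keep.txt")
writeLines(c("A.jpg", "C.jpg", "Z.jpg", ""), list_file)

result <- copy_listed(src, list_file, dest)
```

Here `result$copied` holds the names that were found and copied, and `result$missing` holds listed names with no matching file in the source folder.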

Sequence Manipulation

Organizing sequence data can also be a challenge. This is especially true when dealing with large data sets containing multiple markers from different sources, researchers, or naming conventions. The following five R functions help to manipulate multiple sequence fasta files. Note that degap(), rank_seq(), head_derep(), and seq_derep() require a single line (not multiline) fasta input file for proper functioning. If the working file is in multiline format, the user can use multi_to_single_fasta() to convert it to single line format.

degap()

It is often desired to only upload unaligned data to public databases. To accomplish this easily we present the degap() tool. This function is designed to remove gaps (represented by “-”) from all sequences in the selected fasta file. Users will need to select a file folder as a location to save the output file and an input fasta file. The user can either choose to submit the file path and the file they want to work on as arguments when initiating the tool or they can run the tool with no arguments and be prompted for the necessary information. The output file will follow the naming convention YYYYMMDD_degap.fas and be saved in the selected working directory.
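Gap removal is a simple string substitution. The sketch below shows the idea in base R on an in-memory example; `strip_gaps()` is a hypothetical helper, not the package's degap() code, and it leaves header lines untouched so that a "-" inside a header survives.

```r
# Hypothetical helper, not part of MDOP: remove "-" gap characters from the
# sequence lines of a FASTA file held as a character vector of lines.
strip_gaps <- function(fasta_lines) {
  is_header <- startsWith(fasta_lines, ">")
  # fixed = TRUE treats "-" as a literal character, not a regular expression
  fasta_lines[!is_header] <- gsub("-", "", fasta_lines[!is_header], fixed = TRUE)
  fasta_lines
}

aligned <- c(">seq1", "AC--GT-A", ">seq2-copy", "--TTGA--")
degapped <- strip_gaps(aligned)
```

With a file on disk, the lines could be read with `readLines()` and the result written back with `writeLines()`.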

rank_seq()

Often it is useful to screen out sequences of shorter length from further analyses. rank_seq() takes a selected multiple sequence fasta file, organizes the sequences from shortest to longest, and saves the output in a new fasta file. This eases the visualisation of the fasta file in an alignment program and facilitates the selection of sequences over a given length and the removal of sequences below a target length. Users will need to select a file folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. The new sorted file will be saved in the selected location with the naming convention YYYYMMDD_rank_seq.fas.
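Sorting by length can be sketched in a few lines of base R. The helper `sort_by_length()` is hypothetical and assumes a single line fasta layout, with headers on odd lines and sequences on even lines; it is an illustration, not the rank_seq() source.

```r
# Hypothetical helper, not part of MDOP: order the records of a single line
# FASTA (header, sequence, header, sequence, ...) from shortest to longest.
sort_by_length <- function(fasta_lines) {
  headers <- fasta_lines[c(TRUE, FALSE)]  # odd lines
  seqs <- fasta_lines[c(FALSE, TRUE)]     # even lines
  ord <- order(nchar(seqs))               # ties keep their original order
  # Interleave headers and sequences again
  as.vector(rbind(headers[ord], seqs[ord]))
}

unsorted <- c(">a", "ACGTAC", ">b", "AC", ">c", "ACGT")
ranked <- sort_by_length(unsorted)
```

After sorting, records below a target length sit together at the top of the file, which makes them easy to select and remove.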

head_derep()

Removing duplicate records based on the fasta file header may be necessary to ensure no repetition of data. The head_derep() and seq_derep() functions address this need in easily applicable scripts. head_derep() reduces a selected fasta file to all unique entries based on the headers. Users will need to select a file folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. The output will be saved in the selected directory with the naming convention YYYYMMDD_head_derep.fas.
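Header-based dereplication can be illustrated with base R's `duplicated()`. The helper `unique_by_header()` is hypothetical, assumes single line fasta input, and keeps the first record seen for each header; which record head_derep() itself keeps is not stated in this README.

```r
# Hypothetical helper, not part of MDOP: keep the first record for each
# distinct header in a single line FASTA held as a character vector.
unique_by_header <- function(fasta_lines) {
  headers <- fasta_lines[c(TRUE, FALSE)]  # odd lines
  seqs <- fasta_lines[c(FALSE, TRUE)]     # even lines
  keep <- !duplicated(headers)            # TRUE for the first occurrence only
  as.vector(rbind(headers[keep], seqs[keep]))
}

records <- c(">s1", "ACGT", ">s2", "TTGA", ">s1", "ACGA")
dereplicated <- unique_by_header(records)
```

Note that the second ">s1" record is dropped even though its sequence differs, because only the header is compared.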

seq_derep()

Removing duplicate records based on sequence may be necessary to ensure no repetition of data or when looking to determine the haplotype diversity in a multiple sequence file. This tool reduces a selected fasta file to all unique entries based on the sequences. Users will need to select a file folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. The output will be saved in the selected directory with the naming convention YYYYMMDD_seq_derep.fas.
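A sequence-based variant can also report how many times each unique sequence occurred, which is useful when looking at haplotype diversity. The sketch below is an assumption-laden illustration: `unique_by_sequence()` is hypothetical, compares sequences exactly (case-sensitive), and the copy counts are an addition of this sketch rather than a documented output of seq_derep().

```r
# Hypothetical helper, not part of MDOP: keep the first record for each
# distinct sequence and count how often each kept sequence occurred.
unique_by_sequence <- function(fasta_lines) {
  headers <- fasta_lines[c(TRUE, FALSE)]  # odd lines
  seqs <- fasta_lines[c(FALSE, TRUE)]     # even lines
  keep <- !duplicated(seqs)               # exact, case-sensitive comparison
  counts <- table(seqs)[seqs[keep]]       # occurrences of each kept sequence
  list(fasta = as.vector(rbind(headers[keep], seqs[keep])),
       copies = as.integer(counts))
}

records <- c(">h1", "ACGT", ">h2", "TTGA", ">h3", "ACGT")
haplotypes <- unique_by_sequence(records)
```

In this example `haplotypes$fasta` holds two records and `haplotypes$copies` gives the number of input records sharing each kept sequence.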

multi_to_single_fasta()

Multiline fasta files, where the header is on the first line followed by one or more lines of up to 80 characters of nucleotide sequence data, can be problematic when using different programs or tools. multi_to_single_fasta() accepts a multi-line fasta file and converts it to a single line format, where each header is followed by a single line of nucleotide sequence data. Users will need to select a file folder as a location to save the output file and an input fasta file. The user can either submit the file path and the file they want to work on as arguments when calling the tool, or run the tool with no arguments and be prompted for the necessary information. The output will be saved in the selected directory with the naming convention YYYYMMDD_multi_to_single_fasta.fas.
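The conversion amounts to joining the sequence lines that follow each header. A minimal base R sketch, assuming the file is held as a character vector of lines, is shown below; `to_single_line()` is a hypothetical helper and not the package's implementation.

```r
# Hypothetical helper, not part of MDOP: join the sequence lines that follow
# each header so that every record occupies exactly two lines.
to_single_line <- function(fasta_lines) {
  fasta_lines <- fasta_lines[nzchar(fasta_lines)]  # drop empty lines
  is_header <- startsWith(fasta_lines, ">")
  record <- cumsum(is_header)                      # record index of each line
  headers <- fasta_lines[is_header]
  # Group sequence lines by record; the factor levels keep records that have
  # no sequence lines, which then become an empty string
  chunks <- split(fasta_lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- vapply(chunks, paste, character(1), collapse = "")
  as.vector(rbind(headers, seqs))
}

multi <- c(">seq1", "ACGT", "TTGA", "", ">seq2", "GG", "CC")
single <- to_single_line(multi)
```

The result alternates header and sequence lines, which is the layout the other sequence tools in this section expect.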

Function Examples:

Each of the following examples uses the files included in the folder ‘TEST’ found with the package on GitHub. Using these files will provide the same results as described in this section. Download the ‘TEST’ folder to your computer and follow the examples below, either supplying the arguments when calling the functions or entering them when prompted. The examples below were completed with the example files in the 'TEST' folder located at the root directory of the computer. They call the functions with arguments, but each function's output is identical when the arguments are entered at the prompts instead. These examples can be executed in either R or RStudio.

target_file_list()

Enter the following command in the R terminal...

target_file_list("C:/TEST",1)

The first argument is the target file folder and the second argument indicates the selection of .JPG files (as opposed to 2 for .AB1 files and 3 for .FAS files).

The terminal should display the message...

[1] "Task complete, look for output file YYYYMMDD_target_file_list_JPG.txt at this location C:/TEST"

If you navigate to the 'TEST' file folder you will find the file 'YYYYMMDD_target_file_list_JPG.txt'. The contents of this file should look like the following...

      A.jpg   C:\TEST\1\A.jpg
      B.jpg   C:\TEST\1\B.jpg
      C.jpg   C:\TEST\1\C.jpg
      D.jpg   C:\TEST\1\D.jpg
      E.jpg   C:\TEST\2\E.jpg
      F.jpg   C:\TEST\2\F.jpg
      G.jpg   C:\TEST\2\G.jpg
      H.jpg   C:\TEST\2\H.jpg
      I.jpg   C:\TEST\2\I.jpg
      J.jpg   C:\TEST\2\J.jpg
      K.jpg   C:\TEST\2\Subfolder_2_1\K.jpg
      L.jpg   C:\TEST\2\Subfolder_2_1\L.jpg
      M.jpg   C:\TEST\2\Subfolder_2_1\M.jpg
      N.jpg   C:\TEST\3\N.jpg
      O.jpg   C:\TEST\3\O.jpg
      P.jpg   C:\TEST\3\P.jpg
      Q.jpg   C:\TEST\3\Subfolder_3_1\Q.jpg
      R.jpg   C:\TEST\3\Subfolder_3_1\Subfloder_3_1_1\R.jpg

recursive_copy()

Enter the following command in the R terminal...

recursive_copy("C:/TEST",1)

The first argument is the target file folder and the second argument indicates the selection of .JPG files (as opposed to 2 for .AB1 files and 3 for .FAS files).

The terminal should display the message...

[1] "Task complete, look for output file folder 20200120_recursive_copy_JPG at this location C:/TEST"

If you navigate to the 'TEST' file folder you will find the folder 'YYYYMMDD_recursive_copy_JPG'. The contents of this folder should have the following files...

A.jpg B.jpg C.jpg D.jpg E.jpg F.jpg G.jpg H.jpg I.jpg J.jpg K.jpg L.jpg M.jpg N.jpg O.jpg P.jpg Q.jpg R.jpg

max_packs()

NOTE: To complete this example you first need to run recursive_copy() as described above.

Enter the following command in the R terminal...

max_packs("C:/TEST/20200120_recursive_copy_JPG",1,20)

The first argument is the target file folder, the second argument indicates the selection of .JPG files (as opposed to 2 for .AB1 files and 3 for .FAS files), and the third argument is the maximum file folder size of 20 MB.

The terminal should display the message...

[1] "Task complete. Find file folders with a max size of 20 MB at C:/TEST/YYYYMMDD_recursive_copy_JPG"

If you navigate to the 'TEST/YYYYMMDD_recursive_copy_JPG' folder you will find three new folders, each containing image files as shown in the following list...

YYYYMMDD_max_packs_JPG_1 A.jpg B.jpg C.jpg D.jpg E.jpg F.jpg G.jpg H.jpg

YYYYMMDD_max_packs_JPG_2 I.jpg J.jpg K.jpg L.jpg M.jpg N.jpg O.jpg

YYYYMMDD_max_packs_JPG_3 P.jpg Q.jpg R.jpg

copy_by_list()

degap()

rank_seq()

head_derep()

seq_derep()

multi_to_single_fasta()



rgyoung6/Molecular-Data-Organization-for-Publication-MDOP documentation built on Jan. 21, 2020, 12:12 a.m.