Converts the example dataset provided with the package to PLINK format. This function is for demonstration purposes only. To convert your own data, you will need to adapt it to go from your file format to PLINK.
ex2plink(dir.file, dir.out, file.name = "genotypes_10_90.txt",
    annotation.name = "Identifiers_comma.csv", out.prefix.ped = "genotypes_",
    out.prefix.dat = "genos_chr")
dir.file |
The directory where the input files (file.name and annotation.name) are located. |
dir.out |
The directory to which output files should go. |
file.name |
The name of the file that contains the example dataset. This file should be of the following format:
  Status   1   0   1  ...
  1719214  AG  GG  AG ...
  2320341  TT  TT  TT ...
  ...
- Tab delimited
- No header
- First row is the disease status
- First column is the list of Markers
- rows: geno information, no separator between alleles.
- columns: individuals/patients/samples |
annotation.name |
The file containing SNP annotation, with the following columns:
  Marker,RefSNP_ID,CHROMOSOME,CHROMOSOME_LOCATION ...
  1546,,1,2103664 ...
  1996,rs1338382,1,2708522 ...
  2841,"rs2887274,rs4369170",1,3504300 ...
  ...
- Comma delimited (due to missing values)
- Has a header
- Col 1: Markers, most appear in Col 1 of file.name
- Col 2: RefSNP_ID:
  * empty if missing
  * one SNP ID
  * two or three corresponding SNP IDs, in double quotes, comma separated, no space.
- Col 3: chromosome number
- Col 4: physical location
- First 4 columns are important, other columns will be ignored.
- rows: correspond to all available SNP IDs |
out.prefix.ped |
The beginning of the output file name for pedigree files. This prefix will be used to name the .ped file for each chromosome. These files will be of the following format:
  p1  p1  0  0  1  2  C/C  N/N  T/C ...
  p2  p2  0  0  1  2  T/T  A/C  G/G ...
  ...
- Tab separated
- No header
- 6 non-SNP leading columns
- Col 1 and Col 2: patient ID: some unique ID
- Col 3 and Col 4: parent IDs (father, mother): set to 0
- Col 5: gender, defaults to 1 (male)
- Col 6: disease status: 1 CONTROL and 2 CASE
- Col 7+: geno information, slash separator between alleles. |
out.prefix.dat |
The beginning of the output file name for the .map file of each chromosome. The file will be of the following format:
  19 rs32453434 0 5465475
  19 rs6547434 0 23534543
  ...
- Space separated
- No header
- 4 columns:
  - Col 1: Chromosome number (Col 3 from annotation file)
  - Col 2: SNP ID or Marker if SNP is not known (Col 2 from annotation file, or Col 1 if Col 2 = "")
  - Col 3: always 0
  - Col 4: physical location (Col 4 from annotation file)
- Number of rows is the number of SNPs used in the given chromosome (= number of SNP columns of .ped). |
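The two input layouts described above can be reproduced with a toy dataset. The following is a minimal sketch in base R, not part of the package: the file names `toy_genotypes.txt` and `toy_annotation.csv` and all values are invented for illustration, and the files are read back with the same `read.table` settings the function body uses.

```r
# Toy input files in the two layouts described above (values are made up).
tmp.dir <- tempfile("ex2plink_toy_")
dir.create(tmp.dir)

# Genotype file: tab delimited, no header, first row is the disease status,
# first column is the marker, genotypes have no separator between alleles.
geno.lines <- c("Status\t1\t0\t1",
                "1546\tAG\tGG\tAG",
                "1996\tTT\tTT\tTT")
writeLines(geno.lines, file.path(tmp.dir, "toy_genotypes.txt"))

# Annotation file: comma delimited with a header; multiple SNP IDs are quoted.
ann.lines <- c("Marker,RefSNP_ID,CHROMOSOME,CHROMOSOME_LOCATION",
               "1546,,1,2103664",
               "1996,rs1338382,1,2708522",
               "2841,\"rs2887274,rs4369170\",1,3504300")
writeLines(ann.lines, file.path(tmp.dir, "toy_annotation.csv"))

# Read both files back the same way the function body does.
geno <- read.table(file.path(tmp.dir, "toy_genotypes.txt"),
                   sep = "\t", header = FALSE, stringsAsFactors = FALSE)
ann <- read.table(file.path(tmp.dir, "toy_annotation.csv"),
                  sep = ",", header = TRUE, stringsAsFactors = FALSE)
```

Here `geno` has one row per marker plus the status row, and `ann` keeps the quoted pair of SNP IDs as a single field, which is why the annotation file is comma delimited with quoting rather than tab delimited.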
This function is not part of the functionality of the GenMOSS package. It is merely a demo that shows how to convert one existing file format to the desired PLINK format. Users will need to write something similar to convert their own file format to PLINK. The function writes two files for each chromosome: a .ped file and a .map file.
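For readers adapting the demo to their own data, the per-chromosome conversion comes down to a few small steps. The following is a minimal sketch on an in-memory toy matrix, assuming genotypes are two-character strings and the status is coded 0/1 as in the example dataset; the variable names and values are illustrative and not taken from the package.

```r
# Toy data after transposition: rows are individuals, columns are markers.
status <- c(1, 0, 1)                      # input coding: 0 = control, 1 = case
geno <- matrix(c("AG", "GG", "AG",
                 "TT", "TT", "TT"), nrow = 3)   # 3 individuals x 2 markers

# Insert a slash between the two alleles of every genotype.
geno.slash <- matrix(paste(substr(geno, 1, 1), substr(geno, 2, 2), sep = "/"),
                     nrow = nrow(geno))

# Six leading .ped columns: two ID columns, two parent columns set to 0,
# sex set to 1, and the status recoded to 1 = control, 2 = case.
ids <- paste0("p", seq_len(nrow(geno)))
ped <- cbind(ids, ids, 0, 0, 1, status + 1, geno.slash)

# Matching .map rows: chromosome, SNP ID (or marker name), 0, position.
map <- data.frame(chrom = c(1, 1),
                  snp = c("1546", "rs1338382"),
                  dist = 0,
                  pos = c(2103664, 2708522),
                  stringsAsFactors = FALSE)
```

Each row of `map` corresponds to one genotype column of `ped`, so `ncol(ped) - 6` should always equal `nrow(map)`; checking this before writing the files is a cheap way to catch misaligned markers.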
Olia Vesselova
Wherever genotype file is obtained from.
pre0.dir.create, pre1.plink2mach.batch, pre1.plink2mach
## The function is currently defined as
function (dir.file, dir.out, file.name = "genotypes_10_90.txt",
annotation.name = "Identifiers_comma.csv", out.prefix.ped = "genotypes_",
out.prefix.dat = "genos_chr")
{
## Read in the data file and annotation file
data.file <- read.table(paste(dir.file, file.name, sep = "/"),
sep = "\t", header = FALSE, stringsAsFactors = FALSE)
ann.file <- read.table(paste(dir.file, annotation.name, sep = "/"),
sep = ",", header = TRUE, stringsAsFactors = FALSE)
# Transpose the data.file, such that columns are SNPs,
# and 1st column becomes disease status.
# and 1st row lists all the SNP Markers.
data.file <- t(data.file)
# Save the disease status and SNP Marker names separately
disease.status <- data.file[2:nrow(data.file), 1]
marker.names <- data.file[1, 2:ncol(data.file)]
# Now set data.file to be pure data
data.file <- data.file[2:nrow(data.file), 2:ncol(data.file)]
ncols <- ncol(data.file)
# ******************************************************************** #
# Iterate over all the Markers of data file.
# For each marker, find its corresponding row in annotation file
# If a marker does not exist in annotation file, print error
# (since we don't know chromosome number for it)
i <- 1
# Array that keeps at which index in annotation file Marker was found.
ids.ann <- matrix(0, ncols, 1)
# Since finding the indexes takes a long time, we can save them and
# use them instead of generating them every time.
index.name <- paste(dir.file, "indices.ann.txt", sep = "/")
if (file.exists(index.name)) {
ids.ann <- read.table(index.name, header = FALSE, sep = " ",
stringsAsFactors = FALSE)
ids.ann <- unlist(ids.ann)
}
else {
# The following code shows how to generate that file with indices.
print(paste("Processing ", ncols, " SNPs. This is slow...",
sep = ""))
while (i <= ncols) {
if (i%%1000 == 0)
print(paste("i = ", i, sep = ""))
# Find index of current Marker in annotation file's 1st column
id <- match(marker.names[i], ann.file[, 1])
# If the search failed, then we do not know anything about this marker
if (is.na(id)) {
# Report the marker name itself (data.file now holds only genotypes)
print(paste("Warning: Marker ", marker.names[i],
" was not found in annotation file", sep = ""))
}
else {
ids.ann[i] <- id
}
i <- i + 1
}
# save the indexes
write.table(ids.ann, file = index.name, sep = " ", col.names = FALSE,
row.names = FALSE, quote = FALSE)
}
# ********************************************************************* #
# Now ids.ann contain annotation file IDs for each marker in data.file.
# Get all the SNPs that are used and throw out the rest.
# Set ann.file to contain all info from annotation file only for used SNPs,
# ordered in the same way as SNPs are ordered in the data file.
# Get all chromosome numbers that are used (all.chroms) and sort them.
ann.file <- ann.file[ids.ann, 1:4]
all.chroms <- unique(ann.file[, 3])
# Convert all chromosomes to numeric values (luckily for this dataset,
# all chroms are numeric, but if they were not, we would need to encode
# non-numeric values as numeric: for example "X" as 23, "Y" as 24, etc).
all.chroms.sort <- sort(as.numeric(all.chroms))
# ********************************************************************* #
# For each chromosome, create 2 files: .ped and .map of the format described above.
i <- 1
while (i <= length(all.chroms.sort)) {
curr.chrom <- all.chroms.sort[i]
# boolean has TRUE for all rows that correspond to current chromosome
bool.chrom <- (ann.file[, 3] == curr.chrom)
# Data for this chromosome, its annotation, and its markers
chrom.data <- data.file[, bool.chrom]
chrom.ann <- ann.file[bool.chrom, ]
chrom.markers <- marker.names[bool.chrom]
# Data should consist of Alleles separated by a slash,
# whereas this dataset currently has no separator between Alleles
chrom.data <- matrix(paste(substr(chrom.data, 1, 1),
substr(chrom.data, 2, 2), sep = "/"), nrow = nrow(chrom.data),
byrow = FALSE)
# Prepare the .ped file format:
# Col 1 and 2: invent some unique names for data rows
# Col 3 and 4: remain 0s
# Col 5: set to 1, as if all are males.
# Col 6: disease status, originally we have 0-CONTROL and 1-CASE,
# now we re-encode it as 1-CONTROL and 2-CASE
ped.file <- matrix(0, nrow(chrom.data), 6)
ped.file[, 1] <- paste("p", (1:nrow(chrom.data)), sep = "")
ped.file[, 2] <- ped.file[, 1]
ped.file[, 5] <- rep(1, nrow(chrom.data))
ped.file[, 6] <- as.numeric(disease.status) + 1
ped.file <- cbind(ped.file, chrom.data)
# Save .ped file:
ped.name <- paste(dir.out, "/", out.prefix.ped, curr.chrom,
".ped", sep = "")
write.table(ped.file, file = ped.name, col.names = FALSE,
row.names = FALSE, quote = FALSE, sep = "\t")
# Prepare the .map file format:
# Col1: chrom number
# Col2: SNP ID, or Marker if no SNP ID
# Col3: 0
# Col4: physical location, Col4 from annotation
dat.file <- matrix(0, ncol(chrom.data), 4)
dat.file[, 1] <- rep(curr.chrom, ncol(chrom.data))
# Iterate over all SNP IDs in annotation, extract the first SNP ID from
# each row (since for any one entry there may be multiple SNP IDs, comma separated)
# If there is no SNP ID for given entry, then use the Marker name
id.splits <- strsplit(chrom.ann[, 2], ",")
j <- 1
while (j <= ncol(chrom.data)) {
dat.file[j, 2] <- unlist(id.splits[j])[1]
if (is.na(dat.file[j, 2]))
dat.file[j, 2] <- chrom.markers[j]
j <- j + 1
}
dat.file[, 4] <- chrom.ann[, 4]
# Save the .map file
dat.name <- paste(dir.out, "/", out.prefix.dat, curr.chrom,
".map", sep = "")
write.table(dat.file, file = dat.name, col.names = FALSE,
row.names = FALSE, quote = FALSE, sep = " ")
print(paste("Chromosome ", curr.chrom, " written.", sep = ""))
i <- i + 1
}
}
print("See the demo 'gendemo'.")
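The marker lookup loop in the function body is slow for large datasets, and a marker missing from the annotation file leaves a 0 in `ids.ann`, which R silently drops when indexing. The following is a minimal sketch of one possible alternative, not the package's implementation: it uses the vectorised form of `match` and keeps a logical mask of the markers that were found. The helper `encode.chrom` is hypothetical and illustrates the "X" as 23, "Y" as 24 encoding mentioned in the code comments; all data values are made up.

```r
# Toy marker names and annotation table (made-up values).
marker.names <- c("1996", "9999", "1546")
ann <- data.frame(Marker = c("1546", "1996", "2841"),
                  CHROMOSOME = c("1", "X", "2"),
                  stringsAsFactors = FALSE)

# match() is vectorised, so all markers can be looked up in a single call.
ids.ann <- match(marker.names, ann$Marker)

# Report markers with no annotation and keep a mask of the ones found,
# so genotype columns and annotation rows can be subset consistently.
found <- !is.na(ids.ann)
if (any(!found)) {
  message("Markers not found in annotation file: ",
          paste(marker.names[!found], collapse = ", "))
}
ann.used <- ann[ids.ann[found], ]

# Hypothetical helper: encode non-numeric chromosome labels as numbers.
encode.chrom <- function(x) {
  codes <- c(X = 23, Y = 24)
  out <- suppressWarnings(as.numeric(x))
  is.label <- is.na(out) & x %in% names(codes)
  out[is.label] <- codes[x[is.label]]
  out
}
chrom.num <- encode.chrom(ann.used$CHROMOSOME)
```

If this approach were used inside a conversion function, the genotype matrix and the marker names would need the same `found` mask applied (for example `data.file[, found]`), so that every remaining genotype column still lines up with its annotation row.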