referencePrepare: Creates reference file
In IntEREst: Intron-Exon Retention Estimator


Creates the reference data used by IntEREst functions, e.g. interest(). The function relies on functions of the biomaRt library.

referencePrepare( outFileTranscriptsAnnotation="",
	annotateGeneIds=TRUE, 
	u12IntronsChr=c(), u12IntronsBeg=c(), u12IntronsEnd=c(),
	u12IntronsRef, collapseExons=TRUE, sourceBuild="UCSC", 
	ucscGenome="hg19", ucscTableName="knownGene",
	ucscUrl="http://genome-euro.ucsc.edu/cgi-bin/",
	biomart="ENSEMBL_MART_ENSEMBL",
	biomartDataset="hsapiens_gene_ensembl",
	biomartTranscriptIds=NULL, biomartExtraFilters=NULL, 
	biomartIdPrefix="ensembl_", biomartHost="www.ensembl.org",
	biomartPort=80, circSeqs="", miRBaseBuild=NA, taxonomyId=NA,
	filePath="", fileFormat=c("auto", "gff3", "gtf"), fileDatSrc=NA,
	fileOrganism=NA, fileChrInf=NULL, 
	fileDbXrefTag=c(), addCollapsedTranscripts=TRUE, 
	ignore.strand=FALSE )

`outFileTranscriptsAnnotation`	If defined, the transcript annotations are written to this output file.
`annotateGeneIds`	Whether to annotate and add the gene ID information.
`collapseExons`	Whether to collapse (i.e. reduce) the exonic regions. `TRUE` by default.
`sourceBuild`	The source used to build the reference data; `"UCSC"`, `"biomaRt"`, and `"file"` (for GFF3 or GTF files) are supported.
`ucscGenome`	The genome to use. `"hg19"` is the default. See `genome` parameter of `makeTxDbFromUCSC` function of `GenomicFeatures` library for more information.
`ucscTableName`	The UCSC table name to use. See `tablename` parameter of `makeTxDbFromUCSC` function of `GenomicFeatures` library for more information.
`ucscUrl`	The UCSC URL address. See `url` parameter of `makeTxDbFromUCSC` function of `GenomicFeatures` library for more information.
`u12IntronsChr`	A vector of character strings giving the chromosomes of the U12-type introns. If defined together with `u12IntronsBeg` and `u12IntronsEnd`, the three vectors are used to annotate the U12-type introns.
`u12IntronsBeg`	A vector of numbers that defines the begin (or start) coordinates of the U12-type introns.
`u12IntronsEnd`	A vector of numbers that defines the end coordinates of the U12-type introns.
`u12IntronsRef`	A GRanges object that includes the coordinates of the U12-type introns. If defined, it is used to annotate the U12-type introns.
`biomart`	BioMart database name. See `biomart` parameter of `makeTxDbFromBiomart` function of `GenomicFeatures` library for more information.
`biomartDataset`	BioMart dataset name; default is "hsapiens_gene_ensembl". See `dataset` parameter of `makeTxDbFromBiomart` function of `GenomicFeatures` library for more information.
`biomartTranscriptIds`	Optional parameter that restricts the retrieved transcript annotations to a defined set of transcript IDs. See `transcript_ids` parameter of `makeTxDbFromBiomart` function of `GenomicFeatures` library for more information.
`biomartExtraFilters`	A named list of additional filters to use in the BioMart query. See `filters` parameter of `makeTxDbFromBiomart` function of `GenomicFeatures` library for more information.
`biomartIdPrefix`	The prefix used in the BioMart attribute names; the default is "ensembl_" (as in the attribute "ensembl_transcript_id"). See `id_prefix` parameter of `makeTxDbFromBiomart` function of `GenomicFeatures` library for more information.
`biomartHost`	Host to connect to; the default is "www.ensembl.org". For older GRCh assemblies you can provide the archive websites, e.g. for GRCh37 you can use "grch37.ensembl.org".
`biomartPort`	The port to use in the HTTP communication with the host. Default is 80.
`circSeqs`	A character vector that includes chromosomes that should be marked as circular. See `circ_seqs` parameter of `makeTxDbFromBiomart` and `makeTxDbFromUCSC` functions of `GenomicFeatures` library for more information.
`miRBaseBuild`	The build information from mirbase.db to use for microRNAs (default is `NA`). See `miRBaseBuild` parameter of `makeTxDbFromBiomart` and `makeTxDbFromUCSC` functions of `GenomicFeatures` library for more information.
`taxonomyId`	The taxonomy ID of the organism. It is set to `NA` by default. You can check the taxonomy IDs with the `available.species()` function in `GenomeInfoDb` package. For more information see `taxonomyId` parameter of `makeTxDbFromBiomart` and `makeTxDbFromUCSC` functions of `GenomicFeatures` library.
`filePath`	Character string giving the path to the input file. Used if `sourceBuild` is `"file"`.
`fileFormat`	The format of the input file; `"auto"`, `"gff3"` and `"gtf"` are supported.
`fileDatSrc`	Character string describing the source of the data file. Used if `sourceBuild` is `"file"`.
`fileOrganism`	The genus and species name of the organism. Used if `sourceBuild` is `"file"`.
`fileChrInf`	Data frame that includes information about the chromosomes. The first column holds the chromosome name and the second column holds the length of the chromosome. Used if `sourceBuild` is `"file"`.
`fileDbXrefTag`	A vector of character strings which, if defined, are used as feature names. Used if `sourceBuild` is `"file"`.
`addCollapsedTranscripts`	Whether to add a column that includes the collapsed transcripts information. Used if `collapseExons` is `TRUE`.
`ignore.strand`	Whether to ignore the strands in the reference. If set to `TRUE`, the strands are ignored.
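
The `fileChrInf` argument described above expects a two-column data frame (chromosome name first, chromosome length second). A minimal sketch of building one in base R follows; the column names `chr` and `length` and all values are placeholders chosen for illustration, not real chromosome lengths and not names required by IntEREst.

```r
# Hypothetical chromosome table: names and lengths are placeholders,
# not real genome values. Replace them with those of your own assembly.
chrInf <- data.frame(
  chr = c("chr1", "chr2", "chrM"),   # first column: chromosome names
  length = c(1000000, 800000, 16000), # second column: chromosome lengths
  stringsAsFactors = FALSE
)

# Quick structural check before handing the table to referencePrepare()
stopifnot(ncol(chrInf) == 2, !anyDuplicated(chrInf[[1]]))
```

A table like this would presumably be passed as `fileChrInf=chrInf` together with `sourceBuild="file"`.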

Data frame that includes the coordinates and annotations of the introns and exons of the transcripts, i.e. the reference.

Ali Oghabian

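	# Load the package; the 'u12' example data used below ships with IntEREst
	library(IntEREst)

	# Build test gff3 data from the exons of a single transcript
	tmpGen<- u12[u12[,"ens_trans_id"]=="ENST00000413811",]
	tmpEx<-tmpGen[tmpGen[,"int_ex"]=="exon",]
	# One GFF3 line per exon, each with the transcript as its parent
	exonDat<- cbind(tmpEx[,3], ".", 
		tmpEx[,c(7,4,5)], ".", tmpEx[,6], ".",paste("ID=exon", 
		tmpEx[,11], "; Parent=ENST00000413811", sep="") )
	# One GFF3 line for the transcript itself, spanning all of its exons
	trDat<- c(tmpEx[1,3], ".", "mRNA", as.numeric(min(tmpEx[,4])), 
		as.numeric(max(tmpEx[,5])), ".", tmpEx[1,6], ".", 
		"ID=ENST00000413811")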

	# Create a temporary output folder (no warning if it already exists)
	outDir<- file.path(tempdir(),"tmpFolder")
	dir.create(outDir, showWarnings=FALSE)
	outDir<- normalizePath(outDir)

	gff3File<- file.path(outDir, "gffFile.gff")

	# Write the GFF3 header line, then the transcript line, then the exon lines
	cat("##gff-version 3\n",file=gff3File, append=FALSE)
	cat(paste(paste(trDat, collapse="\t"),"\n", sep=""),
		file=gff3File, append=TRUE)

	write.table(exonDat, gff3File,
		row.names=FALSE, col.names=FALSE,
		sep='\t', quote=FALSE, append=TRUE)

	# Selecting U12 introns info from 'u12' data
	u12Int<- u12[u12$int_ex=="intron" & u12$int_type=="U12",]

	# Test the function
	refseqRef<- referencePrepare(sourceBuild="file", 
		filePath=gff3File, u12IntronsChr=u12Int[,"chr"], 
		u12IntronsBeg=u12Int[,"begin"], 
		u12IntronsEnd=u12Int[,"end"], collapseExons=TRUE, 
		fileFormat="gff3", annotateGeneIds=FALSE)
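
The example above passes the U12 introns as three vectors. The argument list also allows a single GRanges object through `u12IntronsRef`, so here is a minimal sketch of building such an object with the `GenomicRanges` package. The coordinates are made up for illustration and are not real U12 introns, and the variable names are hypothetical.

```r
library(GenomicRanges)

# Hypothetical intron coordinates, for illustration only
u12Chr <- c("chr1", "chr2")
u12Beg <- c(1000, 5000)
u12End <- c(1500, 5800)

# Bundle chromosome, start and end into one GRanges object
u12Ref <- GRanges(
  seqnames = u12Chr,
  ranges = IRanges::IRanges(start = u12Beg, end = u12End)
)
```

An object like `u12Ref` would presumably be supplied as `u12IntronsRef=u12Ref` in place of the `u12IntronsChr`, `u12IntronsBeg` and `u12IntronsEnd` arguments.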