% A Guide to using dataMashR
This is a guide to entering data to create a database using dataMashR, with minimal coding involved.
dataMashR is an R package that compiles multiple data files into a single database. It's designed so that all information is transparent: it's easy to identify every variable, where it came from and any methods used, and if something was entered incorrectly it's also easy to change.
This program is picky, however. Entering data here takes a little longer than it would if you were simply building a database in a single Excel spreadsheet, and if there are any mistakes it won't compile until they are fixed. So take care to be as accurate as possible.
If you are working with multiple people on this project, make sure to pull the project from GitHub using RStudio when you begin working on it, and push it to GitHub when you are done making changes. This keeps everyone's copy up to date.
This section contains information on all of the files and folders necessary to create your database from scratch. If you are using the template, you need only read the first subsection.
To start, you will need RStudio and the dataMashR package installed on your computer. See Daniel about this.
You will need to start a new git repository on GitHub (the + symbol next to your username). Give it a name, whatever you want to call your database folder.
Copy the HTTPS address it gives you; you will need it in the next step.
Open RStudio, go to File -> New Project -> Version Control -> Git, then paste the HTTPS address into the top line.
a. This will create a folder in your documents, and this is where your database is going to be.
If you want to use the template I have provided to start off your database, copy and paste all files from the template into your newly-created database folder. You can then skip to the next section, “Entering in Data”.
If you are making a database from scratch, or want more information on the folders and files used, keep reading this section.
Once you have created your database folder, it contains only three files (blank, .Rhistory and an R Project file with the same name as your folder). Ignore these.
You are going to need to add all of your files and folders from scratch, and keep in mind all Excel spreadsheets must be saved in .csv format, NOT .xls/.xlsx.
The most important folder is the config folder. This must include the following files:
    postProcess <- function(data, lookupSpecies = "none") {
      data
    }
Though not necessary for the program to run, you will need to create a few more files in order to successfully incorporate data into your database.
The first folder is the Template folder. This will contain all of the empty files you will need to fill in for each data source in order for it to compile into the database. You need only copy, paste and rename it for each data source you wish to add. Keep in mind all Excel files must be saved in .csv format, NOT .xls/.xlsx. Here are the files you need to add to it:
    manipulate <- function(raw) {
      raw
    }
dataNew.csv: this facilitates things if you need to alter variables and values later on. Headings you will need:
studyContact.csv: this provides a record of who filled out each data file, and how to contact them. Headings include:
This folder will be automatically created by dataMashR if not already present. But if you want to know what to expect, here it is:
You already have everything you need. But here are a few suggestions of folders you may find useful:
This is the how-to of entering data into a database. It's written from the perspective of entering data from a scientific paper, but it can be applied to a variety of sources. When entering data, copy and paste as much as possible to avoid unnecessary errors.
All data from your paper needs to be entered into a single spreadsheet called 'data' within the template folder. Don't worry about altering your column titles too much; that will be done later.
a. Ensure all the data you want from your source is here in a single table.
   i. Combine all tables from your source here.
   ii. Create new columns for data extracted from the text.
b. If any cells are blank for any reason (which happens a lot when combining multiple tables), put 'NA' in them rather than leaving them blank.
   i. It's easy to do a 'find and replace' for this.
c. Highlight and copy all of your data before closing the .csv file for the first time. Re-open it and check that your formatting hasn't changed; you can paste it back if it has.
   i. .csv files can mess with your formatting, which can be tedious in your data files. But once a file survives being closed once, it will survive it again.
d. A few things to keep in mind:
   i. Copy and paste as much as possible, even for single words or numbers. This drastically reduces errors from mindless typos.
   ii. Make sure your titles don't have any symbols besides full stops (.) and underscores (_). Anything else will throw R into a fit, so alter your titles to accommodate this.
   iii. Make sure your titles don't have units, as units usually involve symbols. You'll add those in later.
   iv. Don't mix numbers with characters (symbols/letters) in a column that only requires a number. You will need to tell R later whether each column is numeric or character; you can have numeric values in character columns, but not the other way around.
      1. This includes all symbols, so standard errors (±) must be removed, as must values presented as a range (2-6).
   v. The =TRIM function is useful for removing unwanted spaces that can occur when importing tables from pdf format using Adobe Acrobat. This may save you all kinds of trouble when odd spaces prevent the database from compiling.
   vi. Consider adding a notes column at the end of the table; sometimes values need an explanation to be properly understood.
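The cleaning rules above (trimming stray spaces, filling blanks with NA, stripping symbols from numeric columns) can be sketched in R. This is illustrative only; the table and column names below are invented for the example:

```r
# Hypothetical combined table, as it might look before cleaning
raw <- data.frame(
  species = c(" Acacia aneura ", "Eucalyptus saligna"),
  height  = c("12.4 ± 0.3", ""),   # a value with a standard error, and a blank cell
  stringsAsFactors = FALSE
)

# Trim stray whitespace (the spreadsheet equivalent of =TRIM)
raw[] <- lapply(raw, trimws)

# Replace blank cells with NA rather than leaving them empty
raw[raw == ""] <- NA

# Strip symbols such as the standard error so the column is purely numeric
raw$height <- as.numeric(sub("\\s*±.*$", "", raw$height))

# write.csv(raw, "data.csv", row.names = FALSE)
```

Doing this once in code, rather than cell by cell in Excel, also gives you a record of exactly what was changed.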
Fill out the studyMetadata file.
a. Open this spreadsheet and copy and paste in your headings from the 'data' spreadsheet (using the transpose option in 'paste special' when you right-click). They should be listed vertically under the 'Topic' heading.
b. Under the 'Description' heading enter information about each topic: state what it refers to, what the units are (if any), and where the information was extracted from.
   i. This makes things easier if you have forgotten what a variable refers to, or if someone else needs to know where you got a piece of information.
c. A few things to keep in mind:
   i. From this table, one should understand exactly what each heading in your data table refers to and where you got that information.
   ii. Make sure you copy and paste your heading titles from your data into the 'Topic' column, because they must match exactly.
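For illustration, a filled-in studyMetadata.csv might look like the following; the topics and descriptions here are invented, and your own Topic entries must match your 'data' headings exactly:

```csv
Topic,Description
Height_cm,"Plant height in cm, extracted from Table 2 of the paper"
Notes,"Comments on individual values, taken from the methods text"
```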
Papers will report the same data in different ways, and this next process is what will standardize your variable names and units so you can directly compare data across multiple sources. You will be going back and forth a lot between your dataMatchColumns file and the files in your config folder.
There is no exact order in which the config files need to be completed. As long as they are filled in accurately, everything will work just fine.
ANY empty cells in your dataMatchColumns (usually in method and unit_in) must be filled in with NA.
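For illustration only, a couple of rows of dataMatchColumns.csv might look like the following. The unit_in, method and var_out columns are mentioned above; the var_in column (the heading exactly as it appears in your 'data' spreadsheet) is an assumption, and all values are invented:

```csv
var_in,var_out,unit_in,method
Height_cm,height,cm,measured with tape from base to tallest point
Notes,notes,NA,NA
```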
You will also need to enter a reference file into each template folder.
1. Go into Mendeley, enter your paper and let Mendeley extract the relevant data for you.
2. Insert the reference into your template folder:
   a. In Mendeley, go to file -> export.
   b. Rename it 'studyRef' and save it as a .bib file (this should be the default option) into your template folder.
And you are done! It will get easier: once you add more papers, many of the var_out values will already be there ready for copying and pasting, methods may be the same, and some of the exact conversions you need will already be there. Remember: everything must match perfectly in order for it to mash seamlessly into a nice, clean database.
There are only a few differences when entering data from a database instead of a scientific paper:
This section describes how to error-check and build your database, and explains the output after the database is built.
Test

    library(testthat)
    library(dataMashR)
    checkPackage(".")
Build

    library(dataMashR)
    options(error = recover)
    mashData(verbose = TRUE)
As mentioned previously, this is the folder where your database is kept; it will be created automatically when your database is built. Most files in it will be self-evident when you click on them, but here's a breakdown of what to expect in each of the two files:
If there are variables that could be calculated from other given variables, that's where postProcess in the config folder comes in. Talk to Daniel and he can arrange for these calculations to be filled in for you.
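As a sketch only of what such a calculation could look like (the variable names below are invented; your actual postProcess function lives in the config folder), postProcess receives the compiled data and returns it with any derived columns added:

```r
# Hypothetical postProcess deriving one variable from two others
postProcess <- function(data, lookupSpecies = "none") {
  # e.g. compute a total mass when both component columns are present
  if (all(c("leaf.mass", "stem.mass") %in% names(data))) {
    data$total.mass <- data$leaf.mass + data$stem.mass
  }
  data
}

# Example: a two-row table gains a computed total.mass column
d <- postProcess(data.frame(leaf.mass = c(1, 2), stem.mass = c(3, 4)))
```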
A quick guide to check while working to make sure you don't miss any steps.