In eriqande/HatcheryPedAgree:

knitr::opts_chunk$set(echo = TRUE)

Topics for discussion during meetings

Thursday, May 14, 2020

1. Installing packages

Make sure this is done:

install.packages(c("devtools", "roxygen2", "testthat", "knitr", "styler"))

2. Code style

This is mandatory reading: https://style.tidyverse.org/, at least, for now, Chapters 1--7.

3. Roxygen

Make sure that the configurations are done on their projects. See and follow directions at:
https://r-pkgs.org/man.html#man-workflow-2

Note that we have:

Roxygen: list(markdown = TRUE)

in the DESCRIPTION file.

4. Forking and Pull requests

Let's follow through the packages book.

git config --global merge.conflictstyle diff3

Also set up upstream:

git remote add upstream https://github.com/eriqande/HatcheryPedAgree.git

Then, any time you want to get new commits from the upstream repo, you do:

git fetch upstream

That "fetches" the objects.

And, to merge them into your current branch, once they are fetched, you do:

git merge upstream/master

To be able to pull upstream into the current branch, you can set that up with:

git branch -u upstream/master

Then, to pull it into master you can:

git checkout master
git pull

To make it easier to work with branches, we installed .git-prompt.sh and updates PS1 to show the branch on our bash prompt.

Be sure to get upstream set up. Then run through the pull request procedure.

5. Data sets

What format are these currently in? I want to to have both of the steelhead data sets in a canonical form to be in this project, but have them gitignored. Then I will put the coho SNPs up on GitHub for running through examples, etc.

I will send a specification of what I want, once I have figured that out, and then Anne and Laura can email those to me and I can put them together in a directory that we can send out to everyone. That way we can all test new functions on these private data sets.

DON'T COMMIT THESE DATA SETS TO THE REPO!

For: Wednesday, May 20, 2020

1. Data set format

Laura and Anne, here is the format we want the main, initial datasets to be in:

A tibble of genotypes that must have at least these columns:
1. indiv: the individual ID (could be, for example, the NMFS_DNA_ID
2. locus: the name of the locus
3. gene_copy: a column of 1's and 2's
4. allele_int: a column of alleles, 1-4, with NA for missing data.
These columns can be in any order, but ordered as above is natural. Note that the genotypes can contain more columns. For example, it could have a full microhaplotype allele column, etc. But for the SNP data sets, we don't really need another one.
A tibble of metadata relevant to the parentage. At this point I am thinking that this includes:
1. indiv
2. sex a column of Male, Female, or NA
3. spawner_group: typically the data of collection/spawning
4. year: the year the individual was sampled/spawned
Anne and Laura, can you think of anything else that belongs in here?

An example of these two data files can be found now by loading the package which gives access to coho_genotypes and coho_metadata. (You will have to pull from upstream to master and load that version).

2. Let's talk about the `data` folder and `data.R`

Just make sure we understand how package data can work.
Talk about extdata too.

3. Matching samples

We need to do a few things here:

Function to convert to rubias format
Run rubias to find matching samples.
Find connected components.
Figure out which of the version of the individuals will be kept as the canonical genotype.

Along the way, we will want to investigate the results. So, we will break this down into different functions.

Here are some things to read about that we will use:

pivoting. This is gather/spread, but on steroids. New from the 'tidyr' package.
tidygraph Thomas Lin Pedersen's tidyverse solution to handling graphs and networks. TLP was an intern with Hadley at RStudio when he did gganinmate. Now, he is still with RStudio, I think, and does amazing work.

I need you guys to get your steelhead data sets into that format so we can have those to test on.

4. A little bit about NSE and rlang::enquo()

Weird stuff, but it will let us convert to rubias with microhaps, when we have those.

5. R CMD Check

Two main points here:

Namespace addressing and Imports in Description.
Visible bindings for 'global' variables.

For: Thursday, May 21, 2020

1. Eric munges the RR and CV steelhead data sets

Both Anne and Laura have sent me their genotypes and metadata file. They did a lot of the backend hard work, but I have a few things to do here to clean them up.

CVSH

Genotypes: These all looked great. I checked them, ungrouped the tibble and then reordered the column names. I also arranged things so that all the genotypes for an individual are together in one place. This is not a necessity, but it makes it easier to look at.

Metadata. This looks good, but I caught a few issues:

# read in the original
cm <- read_rds("../private/as_sent/cvsh_meta.rds")

# It turns out that the hatchery names are not consistent. So we
# need to fix those. Also, I don't want spaces in those names.

# Note, we need to put spaces in hatchery names into a check() function.
cm2 <- cm %>%
  mutate(
    hatchery = recode(
      hatchery,
      `Coleman National Fish Hatchery` = "Coleman Hatchery",
      `Mokelumne Hatchery` = "Mokelumne River Hatchery",
      `Nimbus River Hatchery` = "Nimbus Hatchery"
    ),
    hatchery = str_replace_all(hatchery, " +", "_")
  )

# then i resaved that
write_rds(cm2, path = "../private/cvsh_metadata.rds", compress = "xz")

Russian River

# having a look
rg <- read_rds("../private/as_sent/RR_steelhead_geno.rds")

# minor tweaks
rg %>% 
  select(indiv, locus, gene_copy, allele_int) %>%
  arrange(indiv, locus, gene_copy) %>%
  write_rds("../private/rrsh_genotypes.rds", compress = "xz")

# how about metadata?
# It needs ungrouping...
# hatchery names and sex look good
read_rds("../private/as_sent/RR_steelhead_meta.rds") %>%
  ungroup() %>%
  write_rds(path = "../private/rrsh_metadata.rds", compress = "xz")

Week of Memorial Day (May 26--29)

A Minor Revamp of snppit

After looking over Anne and Laura's data about the frequency of iteroparous fish and also the occurrence of individuals that are spawned on multiple days within a year (sometimes under different names), I realized that it might work better to finish implementing snppit support for individuals to belong to multiple spawner groups. Looking over the code I see that it already has support for fish to spawn in multiple years, and it should not take too much to allow individuals to belong to multiple different spawner groups.

I have done this with a few changes to the code. I am doing it in a new branch of the snppit repo called allow-multiple-spawner-groups. Here are the changes that I made: https://github.com/eriqande/snppit/commit/5ccad7b2e619e14436df38a0d4d2086fbbe62d48.

Now that is done...

Still working on testing it, but now the drill for repeat spawners and "matching samples" is to:

standardize all occurrences to a single ID. I think by convention we should use the earliest occurrence.
For occurrence as a candidate parent, we will include only a single row for the individual, but we include all years that they occur in the year field (comma separated) and all spawner groups in the spawner group field (also comma separated).
For the inclusion in the offspring field, we will include only their earliest occurrence in the SNPPIT input file.

So, I must throw down some R code to do that...

Friday, May 29, 2020

Some things we have dealt with:

Missing data in the spawner groups needs to be explicitly denoted by ? instead of NA. But, in the user-supplied meta data we should still use NA if there is nothing known about it. Only in cases where user wants to add the unknown spawner to within a comma-separated string does the user code missing as ?. (For example 4/11/204,?,3/12/2013). However it is hard to imagine that this is something anyone would ever need to do.

Stuff to discuss:

Multiple spawner groups and years with matchers.
Cross-hatchery and cross-sex matchers. Note cross-hatchery stuff is tough to fix, so we make extra copies of individuals. There is potential for self-assignment in those cases, so we have named those copies so that it is easy to find them in the output and vet them.
The system and system2 commands. For now we are just going to keep it simple.
Temporary files and directories.
The inst directory and binaries (They don't fly with CRAN!).

To Do:

Add a new_id column to the cross_hatchery_matches and to cross_sex_matches as well, if possible.

eriqande/HatcheryPedAgree documentation built on Sept. 21, 2023, 7:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

eriqande/HatcheryPedAgree

In eriqande/HatcheryPedAgree:

Topics for discussion during meetings

Thursday, May 14, 2020

1. Installing packages

2. Code style

3. Roxygen

4. Forking and Pull requests

5. Data sets

For: Wednesday, May 20, 2020

1. Data set format

2. Let's talk about the `data` folder and `data.R`

3. Matching samples

4. A little bit about NSE and rlang::enquo()

5. R CMD Check

For: Thursday, May 21, 2020

1. Eric munges the RR and CV steelhead data sets

CVSH

Russian River

Week of Memorial Day (May 26--29)

A Minor Revamp of snppit

Now that is done...

Friday, May 29, 2020

To Do:

R Package Documentation

Browse R Packages

We want your feedback!

eriqande/HatcheryPedAgree

In eriqande/HatcheryPedAgree:

Topics for discussion during meetings

Thursday, May 14, 2020

1. Installing packages

2. Code style

3. Roxygen

4. Forking and Pull requests

5. Data sets

For: Wednesday, May 20, 2020

1. Data set format

2. Let's talk about the data folder and data.R

3. Matching samples

4. A little bit about NSE and rlang::enquo()

5. R CMD Check

For: Thursday, May 21, 2020

1. Eric munges the RR and CV steelhead data sets

CVSH

Russian River

Week of Memorial Day (May 26--29)

A Minor Revamp of snppit

Now that is done...

Friday, May 29, 2020

To Do:

R Package Documentation

Browse R Packages

We want your feedback!

2. Let's talk about the `data` folder and `data.R`