README.md

kgp


This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.

Installation

You can install the released version of kgp from CRAN with:

install.packages("kgp")

You can install the development version of kgp from GitHub with:

# install.packages("devtools")
devtools::install_github("stephenturner/kgp")

About the data

The 1000 Genomes Project Phase 3 data contains 2,504 samples with sequence data available. The collection was later expanded to 3,202 samples sequenced at high coverage, adding 602 trios. Data is available through the 1000 Genomes FTP site and GitHub.

There are three data sets available in the kgp package.

library(kgp)
data(kgp)

The kgp3 data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.

kgp3
#> # A tibble: 2,504 × 10
#>    fid     id      pid   mid     sex sexf   pop   reg   population        region
#>    <chr>   <chr>   <chr> <chr> <int> <fct>  <chr> <chr> <chr>             <chr> 
#>  1 HG00096 HG00096 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  2 HG00097 HG00097 0     0         2 female GBR   EUR   British in Engla… Europe
#>  3 HG00099 HG00099 0     0         2 female GBR   EUR   British in Engla… Europe
#>  4 HG00100 HG00100 0     0         2 female GBR   EUR   British in Engla… Europe
#>  5 HG00101 HG00101 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  6 HG00102 HG00102 0     0         2 female GBR   EUR   British in Engla… Europe
#>  7 HG00103 HG00103 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  8 HG00105 HG00105 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  9 HG00106 HG00106 0     0         2 female GBR   EUR   British in Engla… Europe
#> 10 HG00107 HG00107 0     0         1 male   GBR   EUR   British in Engla… Europe
#> # … with 2,494 more rows
#> # ℹ Use `print(n = ...)` to see more rows

The kgpe data contains pedigree and population information for all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.

kgpe
#> # A tibble: 3,202 × 11
#>    fid     id      pid   mid     sex sexf   pop   reg   population region phase3
#>    <chr>   <chr>   <chr> <chr> <int> <fct>  <chr> <chr> <chr>      <chr>  <lgl> 
#>  1 HG00096 HG00096 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  2 HG00097 HG00097 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  3 HG00099 HG00099 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  4 HG00100 HG00100 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  5 HG00101 HG00101 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  6 HG00102 HG00102 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  7 HG00103 HG00103 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  8 HG00105 HG00105 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  9 HG00106 HG00106 0     0         2 female GBR   EUR   British i… Europe TRUE  
#> 10 HG00107 HG00107 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#> # … with 3,192 more rows
#> # ℹ Use `print(n = ...)` to see more rows

The kgpmeta data contains population metadata for the 26 populations across five continental regions.

kgpmeta
#> # A tibble: 26 × 7
#>    pop   population                          reg   region   regco…¹   lat    lng
#>    <chr> <chr>                               <chr> <chr>    <chr>   <dbl>  <dbl>
#>  1 CDX   Chinese Dai in Xishuangbanna, China EAS   East As… #778500  22   101.  
#>  2 CHB   Han Chinese in Bejing, China        EAS   East As… #778500  39.9 116.  
#>  3 JPT   Japanese in Tokyo, Japan            EAS   East As… #778500  35.7 140.  
#>  4 KHV   Kinh in Ho Chi Minh City, Vietnam   EAS   East As… #778500  10.8 107.  
#>  5 CHS   Southern Han Chinese, China         EAS   East As… #778500  23.1 113.  
#>  6 BEB   Bengali in Bangladesh               SAS   South A… #c44cfd  23.7  90.4 
#>  7 GIH   Gujarati Indian in Houston,TX       SAS   South A… #c44cfd  29.8 -95.4 
#>  8 ITU   Indian Telugu in the UK             SAS   South A… #c44cfd  52.5  -1.89
#>  9 PJL   Punjabi in Lahore,Pakistan          SAS   South A… #c44cfd  31.6  74.4 
#> 10 STU   Sri Lankan Tamil in the UK          SAS   South A… #c44cfd  52.5  -1.89
#> # … with 16 more rows, and abbreviated variable name ¹​regcolor
#> # ℹ Use `print(n = ...)` to see more rows

Examples

library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)

Count the number of samples in each region, or in each population:

kgp3 %>% 
  count(region) %>% 
  knitr::kable()

| region     |   n |
|:-----------|----:|
| Africa     | 661 |
| America    | 347 |
| East Asia  | 504 |
| Europe     | 503 |
| South Asia | 489 |

kgp3 %>% 
  count(region, population) %>% 
  knitr::kable()

| region     | population                                                 |   n |
|:-----------|:-----------------------------------------------------------|----:|
| Africa     | African Ancestry in Southwest US                           |  61 |
| Africa     | African Caribbean in Barbados                              |  96 |
| Africa     | Esan in Nigeria                                            |  99 |
| Africa     | Gambian in Western Division, The Gambia                    | 113 |
| Africa     | Luhya in Webuye, Kenya                                     |  99 |
| Africa     | Mende in Sierra Leone                                      |  85 |
| Africa     | Yoruba in Ibadan, Nigeria                                  | 108 |
| America    | Colombian in Medellin, Colombia                            |  94 |
| America    | Mexican Ancestry in Los Angeles, California                |  64 |
| America    | Peruvian in Lima, Peru                                     |  85 |
| America    | Puerto Rican in Puerto Rico                                | 104 |
| East Asia  | Chinese Dai in Xishuangbanna, China                        |  93 |
| East Asia  | Han Chinese in Bejing, China                               | 103 |
| East Asia  | Japanese in Tokyo, Japan                                   | 104 |
| East Asia  | Kinh in Ho Chi Minh City, Vietnam                          |  99 |
| East Asia  | Southern Han Chinese, China                                | 105 |
| Europe     | British in England and Scotland                            |  91 |
| Europe     | Finnish in Finland                                         |  99 |
| Europe     | Iberian populations in Spain                               | 107 |
| Europe     | Toscani in Italy                                           | 107 |
| Europe     | Utah residents with Northern and Western European ancestry |  99 |
| South Asia | Bengali in Bangladesh                                      |  86 |
| South Asia | Gujarati Indian in Houston,TX                              | 103 |
| South Asia | Indian Telugu in the UK                                    | 102 |
| South Asia | Punjabi in Lahore,Pakistan                                 |  96 |
| South Asia | Sri Lankan Tamil in the UK                                 | 102 |

kgp3 %>% 
  count(region, population) %>% 
  arrange(region, n) %>% 
  mutate(population=forcats::fct_inorder(population)) %>% 
  ggplot(aes(population, n)) + 
  geom_col(aes(fill=region)) + 
  labs(fill=NULL, x=NULL, y="N") + 
  coord_flip() + 
  theme_bw() + 
  theme(legend.position="bottom")

The latitude and longitude coordinates in kgpmeta can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.

pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() + 
  geom_polygon(data=map_data("world"), 
               aes(long, lat, group=group), 
               col="gray30", fill="gray95", lwd=.2, alpha=.5) + 
  geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) + 
  scale_colour_manual(values=pal) +
  theme_minimal() + 
  theme(axis.ticks = element_blank(), 
        axis.text = element_blank(), 
        axis.title = element_blank(), 
        legend.title = element_blank(),
        panel.grid = element_blank(),
        legend.position = "bottom")

Map showing locations of the 1000 Genomes Phase 3 populations.
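The interactive alternative mentioned above could look roughly like the following sketch. It is not taken from the package documentation: the small `pops` data frame is a stand-in built from three rows of the kgpmeta printout shown earlier (coordinates rounded as printed), and the marker styling is an arbitrary choice. With the package loaded, kgpmeta itself could be passed to `leaflet()` instead.

```r
library(leaflet)

# Stand-in for kgpmeta: three rows copied from the printout above,
# with coordinates rounded as they were printed.
pops <- data.frame(
  pop      = c("JPT", "BEB", "GIH"),
  regcolor = c("#778500", "#c44cfd", "#c44cfd"),
  lat      = c(35.7, 23.7, 29.8),
  lng      = c(140, 90.4, -95.4)
)

# One circle marker per population, colored by the region hex code
# and labeled with the population code on hover.
m <- leaflet(pops) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat,
                   color = ~regcolor, label = ~pop,
                   radius = 6, stroke = FALSE, fillOpacity = 0.9)

# Print `m` in an interactive session to display the map.
```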

The table below shows pedigree information for a selection of samples from kgpe. This pedigree information can be used in downstream analysis to filter out related individuals, select only trios, or visualize family structure.

kgpe %>% 
  filter(pid!="0" & mid!="0") %>% 
  group_by(pop) %>% 
  slice(1) %>% 
  head(12) %>% 
  arrange(reg, pop) %>% 
  select(fid:reg) %>% 
  select(-sexf) %>% 
  knitr::kable()

| fid    | id      | pid     | mid     | sex | pop | reg |
|:-------|:--------|:--------|:--------|----:|:----|:----|
| BB01   | HG01881 | HG01879 | HG01880 |   2 | ACB | AFR |
| 2367   | NA19702 | NA19700 | NA19701 |   1 | ASW | AFR |
| NG06   | HG02924 | HG02923 | HG02922 |   1 | ESN | AFR |
| GB15   | HG02463 | HG02461 | HG02462 |   1 | GWD | AFR |
| SL02   | HG03056 | HG03054 | HG03055 |   1 | MSL | AFR |
| CLM03  | HG01114 | HG01112 | HG01113 |   2 | CLM | AMR |
| SH001  | HG00405 | HG00403 | HG00404 |   2 | CHS | EAS |
| VN046  | HG02015 | HG02017 | HG02016 |   1 | KHV | EAS |
| 1341   | NA06991 | NA06993 | NA06985 |   2 | CEU | EUR |
| IBS001 | HG01502 | HG01500 | HG01501 |   1 | IBS | EUR |
| BD01   | HG03008 | HG03006 | HG03007 |   1 | BEB | SAS |
| IT002  | HG03719 | HG03725 | HG03722 |   2 | ITU | SAS |
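As a minimal sketch of how the pedigree columns might be used to select trios or drop children, the example below works on a toy table with the same `fid`, `id`, `pid`, `mid`, and `sex` layout as kgpe. The sample IDs are made up for illustration, and "0" is assumed to mean "no parent recorded", as in the printed data. Keeping only samples without recorded parents is a simple approximation of an unrelated set, not a full relatedness check.

```r
library(dplyr)

# Toy pedigree in the kgpe column layout (hypothetical IDs, not real samples).
# S6 lists a father (S9) who is not present in the table.
ped <- tibble(
  fid = c("F1", "F1", "F1", "F2", "F3", "F3"),
  id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  pid = c("0",  "0",  "S1", "0",  "0",  "S9"),
  mid = c("0",  "0",  "S2", "0",  "0",  "S5"),
  sex = c(1L,   2L,   2L,   1L,   2L,   1L)
)

# Children of complete trios: both parents are listed and present as samples.
trio_children <- ped %>%
  filter(pid %in% id, mid %in% id)

# Every member of a complete trio (the child plus both parents).
trio_members <- ped %>%
  filter(id %in% c(trio_children$id, trio_children$pid, trio_children$mid))

# Founders: samples with no recorded father or mother.
founders <- ped %>%
  filter(pid == "0", mid == "0")
```

The same filters should apply to kgpe directly by replacing `ped` with `kgpe`.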

The figure below shows an example of a pedigree plot made by parsing the pedigree information using skater and plotting using kinship2. The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.

kgpe %>% 
  filter(fid=="13291") %>% 
  transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>% 
  skater::fam2ped() %>% 
  pull(ped) %>% 
  purrr::pluck(1) %>% 
  kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)

Trios in 1000 Genomes Project family 13291.




kgp documentation built on Dec. 28, 2022, 1:17 a.m.