inst/extdata/analysis/01_data_exploration.md

The Marine’s dataset

The first step aims to investigate simple information about Marina’s dataset. Let’s read it and print a sample of the data.frame.

marine <- read.table(file = "../../../data/raw/ships.csv", header = TRUE, sep = ",", dec = ".", fill = T)
head(marine)
##        LAT      LON SPEED COURSE HEADING DESTINATION FLAG LENGTH SHIPNAME
## 1 54.77127 18.99692    99    200     196      GDANSK   MT    100   KAROLI
## 2 54.76542 18.99361   100    200     196      GDANSK   MT    100   KAROLI
## 3 54.76007 18.99059   102    196     196      GDANSK   MT    100   KAROLI
## 4 54.75468 18.98753   102    198     196      GDANSK   MT    100   KAROLI
## 5 54.74926 18.98447   102    196     195      GDANSK   MT    100   KAROLI
## 6 54.74385 18.98150   102    198     196      GDANSK   MT    100   KAROLI
##   SHIPTYPE SHIP_ID WIDTH  DWT             DATETIME   PORT       date week_nb
## 1        7    2764    14 5727 2016-12-19T11:29:01Z gdansk 2016-12-19      51
## 2        7    2764    14 5727 2016-12-19T11:31:02Z gdansk 2016-12-19      51
## 3        7    2764    14 5727 2016-12-19T11:33:02Z gdansk 2016-12-19      51
## 4        7    2764    14 5727 2016-12-19T11:35:02Z gdansk 2016-12-19      51
## 5        7    2764    14 5727 2016-12-19T11:37:02Z gdansk 2016-12-19      51
## 6        7    2764    14 5727 2016-12-19T11:39:01Z gdansk 2016-12-19      51
##   ship_type   port is_parked
## 1     Cargo Gdańsk         0
## 2     Cargo Gdańsk         0
## 3     Cargo Gdańsk         0
## 4     Cargo Gdańsk         0
## 5     Cargo Gdańsk         0
## 6     Cargo Gdańsk         0

Although it takes around 10 seconds to read the data, I will not use a different package (such as data.table) just to read it faster.

Missing data

Let’s see which columns have missing data and their proportion.

100*sort(colMeans(is.na(marine)))
##         LAT         LON       SPEED      COURSE     HEADING        FLAG 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##    SHIPNAME    DATETIME        PORT        date   ship_type        port 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##    SHIPTYPE     SHIP_ID     week_nb   is_parked      LENGTH       WIDTH 
##   0.2932108   0.2932108   0.2932108   0.2932108   0.6163292   0.6163292 
## DESTINATION         DWT 
##   8.7926180  38.6579660

We can observe missing data in the SHIP_ID variable which will be used later as an id. Therefore I am going to remove these lines since I will not be able to use them.

marine <- marine %>%
  dplyr::filter(!is.na(SHIP_ID))

Now I have missing data only for LENGTH, WIDTH, DWT, and DESTINATION. I think it is ok for now.

SHIP_ID column

As SHIP_ID is an identifier, it should be unique. In my mind (assumption), we should not have different ids for two vessels for a specific ship type. Let’s check!

name_id <- marine %>%
  dplyr::group_by(SHIPNAME, ship_type) %>%
  dplyr::summarise(n = dplyr::n_distinct(SHIP_ID)) %>%
  dplyr::arrange(desc(n))
## `summarise()` has grouped output by 'SHIPNAME'. You can override using the `.groups` argument.
name_id
## # A tibble: 1,201 × 3
## # Groups:   SHIPNAME [1,186]
##    SHIPNAME          ship_type       n
##    <chr>             <chr>       <int>
##  1 [SAT-AIS]         Tug            19
##  2 AMANDA            Cargo           2
##  3 ARGO              Tug             2
##  4 TINDRA            Pleasure        2
##  5 TORNADO           Tug             2
##  6 VOVAN             Tanker          2
##  7 YASNYY            Tug             2
##  8 . PRINCE OF WAVES Cargo           1
##  9 .WLA-311          Fishing         1
## 10 0                 Unspecified     1
## # … with 1,191 more rows

I will remove the [SAT-AIS] ship as it seems something related to the AIS technology. For the remaining, I am going to keep them, and later I will pick one of the IDs at random instead of removing them.

marine <- marine %>%
  dplyr::filter(SHIPNAME != "[SAT-AIS]")

Another possibility is the opposite: Can I have different ship names for the same SHIP_ID?

id_name <- marine %>%
  dplyr::group_by(SHIP_ID, ship_type) %>%
  dplyr::summarise(n = dplyr::n_distinct(SHIPNAME)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::filter(n > 1)
## `summarise()` has grouped output by 'SHIP_ID'. You can override using the `.groups` argument.
id_name
## # A tibble: 11 × 3
## # Groups:   SHIP_ID [11]
##    SHIP_ID ship_type       n
##      <dbl> <chr>       <int>
##  1 4666609 Unspecified     6
##  2  315731 Tug             2
##  3  315950 Fishing         2
##  4  316404 Pleasure        2
##  5  316482 Pleasure        2
##  6  345254 Tug             2
##  7  347195 Tanker          2
##  8  364937 Cargo           2
##  9  406999 Cargo           2
## 10  757619 Cargo           2
## 11 3653787 Tug             2

It happens a lot. I will remove the id 4666609 and explore the remaining.

marine <- marine %>%
  dplyr::filter(SHIP_ID != 4666609)
marine %>%
  dplyr::filter(SHIP_ID %in% id_name$SHIP_ID) %>%
  dplyr::select(SHIP_ID, ship_type, SHIPNAME) %>%
  unique() %>%
  dplyr::arrange(SHIP_ID, ship_type)
##    SHIP_ID ship_type          SHIPNAME
## 1   315731       Tug              ODYS
## 2   315731       Tug              BBAS
## 3   315950   Fishing          .WLA-311
## 4   315950   Fishing           WLA-311
## 5   316404  Pleasure KAPITAN BORCHARDT
## 6   316404  Pleasure KM ,TAN BORCHARDT
## 7   316482  Pleasure   WYSPA SZCZESCIA
## 8   316482  Pleasure   WXA A SZCZESCIA
## 9   345254       Tug IVAN KRUZENSHTERN
## 10  345254       Tug           SAT AIS
## 11  347195    Tanker  GAZPROMNEFT WEST
## 12  347195    Tanker             VOVAN
## 13  364937     Cargo        BOMAR MOON
## 14  364937     Cargo        NBAAR MOON
## 15  406999     Cargo SEASTAR ENDURANCE
## 16  406999     Cargo ZBASTAR ENDURANCE
## 17  757619     Cargo . PRINCE OF WAVES
## 18  757619     Cargo   PRINCE OF WAVES
## 19 3653787       Tug              ARGO
## 20 3653787       Tug                 C

Most of them seem to be the same. Again I am going to keep them and later I will pick one at random.

Let’s see what we have for each vessel:

ship_id <- sample(x = marine$SHIP_ID, size = 1)

marine_plt <- marine %>%
  dplyr::filter(SHIP_ID == ship_id) %>%
  dplyr::arrange(DATETIME)

ggplot2::ggplot(data = marine_plt) +
  ggplot2::geom_point(mapping = aes(x = LON, y = LAT), color = "steelblue") +
  ggplot2::theme_bw()

# Using ggplot2 just to make it easy to show on github
# leaflet::leaflet(data = marine_plt) %>%
#   leaflet::addCircleMarkers(
#     lng = ~LON,
#     lat = ~LAT,
#     label = ~SHIPNAME,
#     color = "white",
#     radius = 1
#   ) %>%
#   leaflet::addProviderTiles(
#     provider = leaflet::providers$CartoDB.DarkMatter,
#     options = leaflet::providerTileOptions(noWrap = T)
#   )

Some stats

Now I am going to visualize some statistics and plots looking for oddities.

Ship types

dplyr::n_distinct(marine$ship_type)
## [1] 9
unique(marine$ship_type)
## [1] "Cargo"        "Tanker"       "Unspecified"  "Tug"          "Fishing"     
## [6] "Passenger"    "Pleasure"     "Navigation"   "High Special"

Ports

dplyr::n_distinct(marine$port)
## [1] 6
unique(marine$port)
## [1] "Gdańsk"         "Gdynia"         "gothenborg"     "Kalingrad"     
## [5] "Klaipeda"       "St. Petersburg"

Ports

dplyr::n_distinct(marine$SHIP_ID)
## [1] 1189

Destination

dplyr::n_distinct(marine$DESTINATION)
## [1] 643
head(sort(table(marine$DESTINATION), decreasing = TRUE))
## 
##       CLASS B        GDANSK           SPB ST.PETERSBURG        GDYNIA 
##        578906        210951        196469        125474        116179 
## ST PETERSBURG 
##         89167

Speed

summary(marine$SPEED) # 1550? :O
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    7.821    0.000 1550.000
ggplot2::ggplot(data = marine) + # Too many zeros
  ggplot2::geom_histogram(mapping = aes(x = SPEED), fill = "steelblue", bins = 100) +
  ggplot2::xlab("Speed") +
  ggplot2::ylab("Count") +
  ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
  ggplot2::theme_bw()

Let’s remove those observations for parked vessels.

marine_np <- marine %>%
  dplyr::filter(!is_parked)

summary(marine_np$SPEED)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   37.00   67.00   71.53   89.00 1550.00
ggplot2::ggplot(data = marine_np) +
  ggplot2::geom_histogram(mapping = aes(x = SPEED), fill = "steelblue", bins = 100) +
  ggplot2::xlab("Speed") +
  ggplot2::ylab("Count") +
  ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
  ggplot2::theme_bw()

I think this variable has some inconsistency. However, I am going to present it in the shiny app.

Length

ggplot2::ggplot(data = marine) +
  ggplot2::geom_histogram(mapping = aes(x = LENGTH), fill = "steelblue", bins = 50) +
  ggplot2::xlab("Length (m)") +
  ggplot2::ylab("Count") +
  ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
  ggplot2::theme_bw()
## Warning: Removed 4872 rows containing non-finite values (stat_bin).

DWT

ggplot2::ggplot(data = marine) +
  ggplot2::geom_histogram(mapping = aes(x = DWT), fill = "steelblue", bins = 50) +
  ggplot2::xlab("Deadweight (Ton.)") +
  ggplot2::ylab("Count") +
  ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
  ggplot2::theme_bw()
## Warning: Removed 1175780 rows containing non-finite values (stat_bin).

Datetime

marine <- marine %>%
  dplyr::mutate(DATETIME = lubridate::ymd_hms(DATETIME))

range(marine$DATETIME)
## [1] "2016-12-13 11:21:01 UTC" "2016-12-20 11:11:06 UTC"


DouglasMesquita/marineApp documentation built on Dec. 17, 2021, 5:29 p.m.