The first step aims to investigate simple information about Marina’s dataset. Let’s read it and print a sample of the data.frame.
marine <- read.table(file = "../../../data/raw/ships.csv", header = TRUE, sep = ",", dec = ".", fill = T)
head(marine)
## LAT LON SPEED COURSE HEADING DESTINATION FLAG LENGTH SHIPNAME
## 1 54.77127 18.99692 99 200 196 GDANSK MT 100 KAROLI
## 2 54.76542 18.99361 100 200 196 GDANSK MT 100 KAROLI
## 3 54.76007 18.99059 102 196 196 GDANSK MT 100 KAROLI
## 4 54.75468 18.98753 102 198 196 GDANSK MT 100 KAROLI
## 5 54.74926 18.98447 102 196 195 GDANSK MT 100 KAROLI
## 6 54.74385 18.98150 102 198 196 GDANSK MT 100 KAROLI
## SHIPTYPE SHIP_ID WIDTH DWT DATETIME PORT date week_nb
## 1 7 2764 14 5727 2016-12-19T11:29:01Z gdansk 2016-12-19 51
## 2 7 2764 14 5727 2016-12-19T11:31:02Z gdansk 2016-12-19 51
## 3 7 2764 14 5727 2016-12-19T11:33:02Z gdansk 2016-12-19 51
## 4 7 2764 14 5727 2016-12-19T11:35:02Z gdansk 2016-12-19 51
## 5 7 2764 14 5727 2016-12-19T11:37:02Z gdansk 2016-12-19 51
## 6 7 2764 14 5727 2016-12-19T11:39:01Z gdansk 2016-12-19 51
## ship_type port is_parked
## 1 Cargo Gdańsk 0
## 2 Cargo Gdańsk 0
## 3 Cargo Gdańsk 0
## 4 Cargo Gdańsk 0
## 5 Cargo Gdańsk 0
## 6 Cargo Gdańsk 0
Although it takes around 10 seconds to read the data, I will not use a different package (such as data.table) just to read it faster.
Let’s see which columns have missing data and their proportion.
100*sort(colMeans(is.na(marine)))
## LAT LON SPEED COURSE HEADING FLAG
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## SHIPNAME DATETIME PORT date ship_type port
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## SHIPTYPE SHIP_ID week_nb is_parked LENGTH WIDTH
## 0.2932108 0.2932108 0.2932108 0.2932108 0.6163292 0.6163292
## DESTINATION DWT
## 8.7926180 38.6579660
We can observe missing data in the SHIP_ID variable which will be used later as an id. Therefore I am going to remove these lines since I will not be able to use them.
marine <- marine %>%
dplyr::filter(!is.na(SHIP_ID))
Now I have missing data only for LENGTH, WIDTH, DWT, and DESTINATION. I think it is ok for now.
As SHIP_ID is an identifier, it should be unique. In my mind (assumption), we should not have different ids for two vessels for a specific ship type. Let’s check!
name_id <- marine %>%
dplyr::group_by(SHIPNAME, ship_type) %>%
dplyr::summarise(n = dplyr::n_distinct(SHIP_ID)) %>%
dplyr::arrange(desc(n))
## `summarise()` has grouped output by 'SHIPNAME'. You can override using the `.groups` argument.
name_id
## # A tibble: 1,201 × 3
## # Groups: SHIPNAME [1,186]
## SHIPNAME ship_type n
## <chr> <chr> <int>
## 1 [SAT-AIS] Tug 19
## 2 AMANDA Cargo 2
## 3 ARGO Tug 2
## 4 TINDRA Pleasure 2
## 5 TORNADO Tug 2
## 6 VOVAN Tanker 2
## 7 YASNYY Tug 2
## 8 . PRINCE OF WAVES Cargo 1
## 9 .WLA-311 Fishing 1
## 10 0 Unspecified 1
## # … with 1,191 more rows
I will remove the [SAT-AIS] ship as it seems something related to the AIS technology. For the remaining, I am going to keep them, and later I will pick one of the IDs at random instead of removing them.
marine <- marine %>%
dplyr::filter(SHIPNAME != "[SAT-AIS]")
Another possibility is the opposite: Can I have different ship names for the same SHIP_ID?
id_name <- marine %>%
dplyr::group_by(SHIP_ID, ship_type) %>%
dplyr::summarise(n = dplyr::n_distinct(SHIPNAME)) %>%
dplyr::arrange(desc(n)) %>%
dplyr::filter(n > 1)
## `summarise()` has grouped output by 'SHIP_ID'. You can override using the `.groups` argument.
id_name
## # A tibble: 11 × 3
## # Groups: SHIP_ID [11]
## SHIP_ID ship_type n
## <dbl> <chr> <int>
## 1 4666609 Unspecified 6
## 2 315731 Tug 2
## 3 315950 Fishing 2
## 4 316404 Pleasure 2
## 5 316482 Pleasure 2
## 6 345254 Tug 2
## 7 347195 Tanker 2
## 8 364937 Cargo 2
## 9 406999 Cargo 2
## 10 757619 Cargo 2
## 11 3653787 Tug 2
It happens a lot. I will remove the id 4666609 and explore the remaining.
marine <- marine %>%
dplyr::filter(SHIP_ID != 4666609)
marine %>%
dplyr::filter(SHIP_ID %in% id_name$SHIP_ID) %>%
dplyr::select(SHIP_ID, ship_type, SHIPNAME) %>%
unique() %>%
dplyr::arrange(SHIP_ID, ship_type)
## SHIP_ID ship_type SHIPNAME
## 1 315731 Tug ODYS
## 2 315731 Tug BBAS
## 3 315950 Fishing .WLA-311
## 4 315950 Fishing WLA-311
## 5 316404 Pleasure KAPITAN BORCHARDT
## 6 316404 Pleasure KM ,TAN BORCHARDT
## 7 316482 Pleasure WYSPA SZCZESCIA
## 8 316482 Pleasure WXA A SZCZESCIA
## 9 345254 Tug IVAN KRUZENSHTERN
## 10 345254 Tug SAT AIS
## 11 347195 Tanker GAZPROMNEFT WEST
## 12 347195 Tanker VOVAN
## 13 364937 Cargo BOMAR MOON
## 14 364937 Cargo NBAAR MOON
## 15 406999 Cargo SEASTAR ENDURANCE
## 16 406999 Cargo ZBASTAR ENDURANCE
## 17 757619 Cargo . PRINCE OF WAVES
## 18 757619 Cargo PRINCE OF WAVES
## 19 3653787 Tug ARGO
## 20 3653787 Tug C
Most of them seem to be the same. Again I am going to keep them and later I will pick one at random.
ship_id <- sample(x = marine$SHIP_ID, size = 1)
marine_plt <- marine %>%
dplyr::filter(SHIP_ID == ship_id) %>%
dplyr::arrange(DATETIME)
ggplot2::ggplot(data = marine_plt) +
ggplot2::geom_point(mapping = aes(x = LON, y = LAT), color = "steelblue") +
ggplot2::theme_bw()
# Using ggplot2 just to make it easy to show on github
# leaflet::leaflet(data = marine_plt) %>%
# leaflet::addCircleMarkers(
# lng = ~LON,
# lat = ~LAT,
# label = ~SHIPNAME,
# color = "white",
# radius = 1
# ) %>%
# leaflet::addProviderTiles(
# provider = leaflet::providers$CartoDB.DarkMatter,
# options = leaflet::providerTileOptions(noWrap = T)
# )
Now I am going to visualize some statistics and plots looking for oddities.
dplyr::n_distinct(marine$ship_type)
## [1] 9
unique(marine$ship_type)
## [1] "Cargo" "Tanker" "Unspecified" "Tug" "Fishing"
## [6] "Passenger" "Pleasure" "Navigation" "High Special"
dplyr::n_distinct(marine$port)
## [1] 6
unique(marine$port)
## [1] "Gdańsk" "Gdynia" "gothenborg" "Kalingrad"
## [5] "Klaipeda" "St. Petersburg"
dplyr::n_distinct(marine$SHIP_ID)
## [1] 1189
dplyr::n_distinct(marine$DESTINATION)
## [1] 643
head(sort(table(marine$DESTINATION), decreasing = TRUE))
##
## CLASS B GDANSK SPB ST.PETERSBURG GDYNIA
## 578906 210951 196469 125474 116179
## ST PETERSBURG
## 89167
summary(marine$SPEED) # 1550? :O
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 7.821 0.000 1550.000
ggplot2::ggplot(data = marine) + # Too many zeros
ggplot2::geom_histogram(mapping = aes(x = SPEED), fill = "steelblue", bins = 100) +
ggplot2::xlab("Speed") +
ggplot2::ylab("Count") +
ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
ggplot2::theme_bw()
Let’s remove those observations for parked vessels.
marine_np <- marine %>%
dplyr::filter(!is_parked)
summary(marine_np$SPEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 37.00 67.00 71.53 89.00 1550.00
ggplot2::ggplot(data = marine_np) +
ggplot2::geom_histogram(mapping = aes(x = SPEED), fill = "steelblue", bins = 100) +
ggplot2::xlab("Speed") +
ggplot2::ylab("Count") +
ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
ggplot2::theme_bw()
I think this variable has some inconsistency. However, I am going to present it in the shiny app.
ggplot2::ggplot(data = marine) +
ggplot2::geom_histogram(mapping = aes(x = LENGTH), fill = "steelblue", bins = 50) +
ggplot2::xlab("Length (m)") +
ggplot2::ylab("Count") +
ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
ggplot2::theme_bw()
## Warning: Removed 4872 rows containing non-finite values (stat_bin).
ggplot2::ggplot(data = marine) +
ggplot2::geom_histogram(mapping = aes(x = DWT), fill = "steelblue", bins = 50) +
ggplot2::xlab("Deadweight (Ton.)") +
ggplot2::ylab("Count") +
ggplot2::scale_y_continuous(labels = scales::label_number_si()) +
ggplot2::theme_bw()
## Warning: Removed 1175780 rows containing non-finite values (stat_bin).
marine <- marine %>%
dplyr::mutate(DATETIME = lubridate::ymd_hms(DATETIME))
range(marine$DATETIME)
## [1] "2016-12-13 11:21:01 UTC" "2016-12-20 11:11:06 UTC"
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.