Response to reviewers

I would like to thank the reviewers for taking the time to review visdat so thoroughly. It is fantastic to hear such encouraging comments from members of the community.

Below I respond to suggested changes and other constructive criticisms.

@seaaan's Review

  1. Documentation: README

You might consider breaking the "experimental" parts of the package into a separate README like you do with the vignettes.

Link to the vignettes from the README.

The title of the github project is "A package to assist in visually testing a dataset", which doesn't describe the project that well. I suggest changing it to be the same title as on the DESCRIPTION file, which is effective in quickly summarizing what the package does.

You don't introduce the airquality data set at first usage, but you do introduce it after the second plot. Move the introduction to first use.

I don't understand this sentence: "When there is <0.1% of missingness, vis_miss indicates that there is >1% missingness."

Make sure you have permission to reproduce the image from R4DS. The first few paragraphs of the "Using visdat" vignette are a bit unnecessary in my opinion. You could start much more simply, with something like: "When you get a new data set, you need to get a sense of what it contains and potential problems with it." and then continue with the discussion of different ways to approach this challenge (e.g. head, glimpse, ...).

In the "Using visdat" vignette, it says "missing data represented by black", but it shows up as gray on my computer.

I don't know why, but in the plotly section, a vis_dat() (non-interactive) plot of df_test appears between the first two interactive plots. I can't explain it, hopefully it's just a weird quirk on my computer.

There is a "future work and experimental features" section in the "Using visdat" vignette -- I suggest transferring that to the experimental vignette and just linking to it from the "Using visdat" vignette. Perhaps move the vis_guess() and the interactive plots to the experimental features vignette as well, since they seem to be evolving.

The plot in the experimental features vignette shows up small, I think you need to add knitr::opts_chunk$set(fig.width = 5, fig.height = 4) to the first chunk like you have in the other vignette.

Function documentation

Did you mean to export guess_type()?

Formatting: links, code, and other formatting need to be done with .Rd syntax. For example, for code, use \code{}, not backticks. For bold, use \strong{} instead of asterisks.

Return values: The documentation doesn't include the "Value" section which is typically used to say what a function returns. I would add this. E.g. @return A \code{ggplot} object displaying the type of values in the data frame and the position of any missing values.

?vis_compare: The documentation for this function needs to be updated. It is more a list of ideas for how to implement the function than a description of what it does.

?vis_guess: First sentence of description seems to have an extra word or phrase, not sure exactly what was intended.

?vis_miss_ly: The reference to vis_miss in the "See Also" section should be a link. Like this: \code{\link[visdat]{vis_miss}}

?visdat: It's good that this page exists. However, it doesn't have any content -- add a brief overview of the package with links to the documentation page for the main functions.

Examples

You don't need to call library() for visdat or for packages you use internally (only if the example code itself calls a function from another package).

example(vis_dat): "palette" is misspelled.

example(vis_compare) is a good example but gives a bunch of warning messages. example(vis_guess) also gives a warning message:

vis_compare / (vis_guess) is in BETA! If you have suggestions or errors
post an issue at https://github.com/njtierney/visdat/issues

Do you think this is OK?

Community guidelines

There are no guidelines for how to contribute, either in the README or in a CONTRIBUTING file.

JOSS

The paper seems more detailed than most other JOSS papers and goes into specific functions. Additionally, does JOSS allow figures? I didn't see any in the papers I looked at or see mentions of them in the author guidelines. But I'm not an expert on JOSS, so maybe someone else can weigh in on that.

The paper doesn't have any references in the "References" section, but does have inline references.

If you keep it, make sure you have permission to reproduce the image from R4DS.

Functionality

Common issues

It is unintuitive to me to have the rows in reverse order (i.e. row 1 at the bottom) and for the columns to be clustered by type, rather than appear in the order they appear in the data frame. I think the default behavior should be for rows and columns to appear in the same order as in the input. Additionally, always putting the titles at the top of the columns makes sense to me. Including sort_type as an option might be useful, but by default it should be off. I'm not quite sure of the use case for the flip argument, but maybe there is one!

The flip argument should be available for all of the functions if you provide it for any, for the sake of consistency. Possibly also the palette argument.

A number of the functions emit warning messages ("Warning message: attributes are not identical across measure variables; they will be dropped"), specifically when they are called with a data frame that has >1 factor column with different levels. This arises from the call tidyr::gather_(x, "variables", "value", names(x))$value. This message is not relevant to the end user and should be suppressed.

There is some code duplication across some of the functions, where you could replace it with a helper function. The code creating the plots is generally similar across functions. In addition, the following code blocks appear in essentially identical form in >1 function:

  dplyr::mutate(rows = seq_len(nrow(.))) %>%
  tidyr::gather_(key_col = "variables",
                 value_col = "valueType",
                 gather_cols = names(.)[-length(.)])

Using a helper function for this next one in particular would allow you to suppress the warning messages just once:

  d$value <- tidyr::gather_(x, "variables", "value", names(x))$value
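For instance, a small internal helper along these lines would centralise the reshaping and let you suppress the warning in one place (the helper name and exact form are just an illustration, not the package's actual code):

  # Illustrative helper: gather a data frame into long format once,
  # silencing the "attributes are not identical" warning in a single place.
  vis_gather_long <- function(x) {
    suppressWarnings(
      tidyr::gather_(x, key_col = "variables", value_col = "value",
                     gather_cols = names(x))
    )
  }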

The code for flipping the rows is also duplicated:

 if (flip == TRUE){
    suppressMessages({
      vis_dat_plot +
        ggplot2::scale_y_reverse() +
        ggplot2::scale_x_discrete(position = "top",
                                  limits = type_order_index) +
        ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.5)) +
        ggplot2::labs(x = "")
    })
  } else if (flip == FALSE){
    vis_dat_plot
  }
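A wrapper in the same spirit could factor out the flip logic (again, just a sketch; the function name is made up):

  # Sketch of a shared flip helper, reusing the duplicated block above
  vis_flip_plot <- function(plot, flip, type_order_index) {
    if (!flip) {
      return(plot)
    }
    suppressMessages(
      plot +
        ggplot2::scale_y_reverse() +
        ggplot2::scale_x_discrete(position = "top",
                                  limits = type_order_index) +
        ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.5)) +
        ggplot2::labs(x = "")
    )
  }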

vis_compare and vis_guess are indicated as being in beta, which seems also to apply to the plotly versions of the functions. The message that they emit is a helpful indication that they may change in the future. Before submitting to CRAN, however, you might consider moving those functions to the development version of the package and only uploading the functions with a stable API, then adding the beta version later once they stabilize. This is a judgment call; I think I tend towards being conservative on this issue personally.

This is a minor issue, but was so thoroughly drummed into me by CS professors that I have to mention it. In calls like if (sort_type == TRUE), the == TRUE is redundant, so it's equivalent to just write if (sort_type). Similarly, if (x == FALSE) is equivalent to if (!x). There are a number of these in the code for various functions.

vis_dat()

Maximum data frame size: the heatmaps stop working on my computer with very large data sets. E.g. library(nycflights13); vis_dat(flights) hangs for a minute or so and then displays an empty grid. It works with 10000 rows, but not 100000. Similar issue with many columns. Especially because this package is designed for people to get a sense of new data sets, where users may not have realized how big the data are, I think it makes sense to prevent this issue. Possibilities: (1) max_rows = 10000 parameter, which stops if nrow(data) > max_rows and gives a descriptive error message. Then the user can increase the maximum or subset their data frame. Might also need to consider max_cols because 10000 rows and 2 cols is very different from 10000 rows and 500 cols. (2) Downsample the non-missing rows somehow and indicate to the user where omissions were made. (3) ???

vis_dat(really_large_data)

Warning: This dataset contains over 10,000 rows, which may take a while to render. Are you sure that you want to continue?

1: Yes
2: No (You should use a sample of your data)

What do you think?

The palette argument doesn't cause color changes, as far as I can tell. I think you need to do vis_dat_plot <- vis_dat_plot + , whereas currently the value of vis_dat_plot isn't being updated in the blocks related to color palettes. Additionally, you could do ggplot2::scale_color_brewer(type = "qual", palette = "Set1") rather than explicitly writing out the color names, although the colors would be in a slightly different order than you have now.

vis_dat(airquality, palette = "qual")
vis_dat(airquality, palette = "cb_safe")

These return different plots now, right?

If the user provides a palette name, you should check that it's valid. If not, throw an error with a message. Currently, if you misspell a palette name, it just goes ahead with the default palette.

     } else  {
       warning("palette arguments need to be either 'qual' 'cb_safe' or 'default'")
     } # close else brace

at the end of vis_dat.R
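One way to get that validation is a small check at the top of the function that fails fast (a sketch only; the helper name is made up):

  # Sketch: error on an unknown palette instead of silently
  # falling back to the default
  check_palette <- function(palette) {
    valid_palettes <- c("default", "qual", "cb_safe")
    if (!palette %in% valid_palettes) {
      stop("palette must be one of: ",
           paste(valid_palettes, collapse = ", "),
           call. = FALSE)
    }
    palette
  }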

Minor comments

There are some lines of commented-out code: # x = airquality and # mutate_each_(funs(fingerprint), tbl_vars(.)) %>%

Inside of the first if-block, there are some comments about arranging columns in order of missingness, but as far as I can tell, that is not done.

vis_miss()

It would be cool to indicate the % missing for each column individually (maybe in column labels?) like how you indicate the % missing for the data overall.

vis_compare()

This is a neat function! It would be cool if you could generalize it to also handle data frames of unequal dimensions, perhaps by showing which rows/columns are only in one data frame or the other. Just an idea I had though, not a requirement at all!

Warning message: "1: In if (dim(df1) != dim(df2)) { : the condition has length > 1 and only the first element will be used". This is a consequence of calling dim(df1) != dim(df2). dim() returns a two-element vector here, so the comparison produces a two-element logical vector, and if() only uses its first element (i.e. it is only comparing the number of rows). Instead, use !identical(dim(df1), dim(df2)).

In the error message about data frames of unequal dimensions, you mean to say "vis_compare only handles dataframes of identical dimensions", not "does not handle".

  if (!identical(dim(df1), dim(df2))){
    stop("Dimensions of df1 and df2 are not the same. vis_compare currently only handles dataframes of identical dimensions.")
  }

vis_guess

vis_guess lacks some of the parameters that vis_dat has, but should have the same options (except maybe not sort_type?)

Performance: Setting the locale once and then calling collectorGuess instead of guess_parser is about 33% faster on my computer. I think the savings come from avoiding repeated calls to the locale function and to stopifnot (inside the readr code).

l <- readr::locale()

output[!nas] <- vapply(FUN = function(y) readr:::collectorGuess(y, locale_ = l),
                       X = x[!nas], FUN.VALUE = character(1))

If you're interested in converting to C++ for speed, I played around a little bit with implementing a vectorized version of guess_parser in readr. It's about 20X faster. Basically, I wrote a wrapper around collectorGuess that operates on a vector instead of single elements. This is faster because it does the looping in C++ rather than repeatedly transitioning from R to C++, and because it does some initial work once per vector instead of once per element. It is currently written as a modification of readr because that was more convenient, but I imagine it is possible to implement in your package as well. Let me know if you want to discuss that, but I'm definitely not an expert. To get it up and running, clone readr and then put this in readr/src/collectorGuess.cpp:

std::string collectorGuess2(const std::string& input, LocaleInfo locale) {

  // Work from strictest to most flexible
  if (canParse(input, isLogical, &locale))
    return "logical";
  if (canParse(input, isInteger, &locale))
    return "integer";
  if (canParse(input, isDouble, &locale))
    return "double";
  if (canParse(input, isNumber, &locale))
    return "number";
  if (canParse(input, isTime, &locale))
    return "time";
  if (canParse(input, isDate, &locale))
    return "date";
  if (canParse(input, isDateTime, &locale))
    return "datetime";

  // Otherwise can always parse as a character
  return "character";
}

// [[Rcpp::export]]
CharacterVector collectorGuessVector(CharacterVector input, List locale_) {
  LocaleInfo locale(locale_);
  CharacterVector output(input.size());

  if (input.size() == 0 || allMissing(input)) {
    CharacterVector result(1);
    result[0] = "character";
    return result;
  }

  for (int i = 0; i < input.size(); i++) {
    output[i] = collectorGuess2(std::string(input[i]), locale);
  }

  return output;
}

Then put this in readr/R/collectors.R:

#' @rdname parse_guess
#' @export
guess_parser_vector <- function(x, locale = default_locale()) {
  stopifnot(is.locale(locale))
  collectorGuessVector(x, locale)
}

Then rebuild readr. Back in visdat, call guess_parser_vector once with the whole vector of unknowns instead of calling guess_parser repeatedly.
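In visdat, the per-element vapply call shown earlier could then collapse to a single vectorised call, roughly (assuming the patched readr above):

  # Assumes the guess_parser_vector() patch to readr shown above
  output[!nas] <- guess_parser_vector(x[!nas])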

vis_miss_ly

The pop-up window is great! Can you make it the same as for ggplotly(vis_dat())? It's especially helpful to show the row number.

Why not have the same arguments for this function as for vis_miss?

vis_dat_ly()

With vis_dat(airquality) %>% ggplotly(), the window that appears on mouseover should say "type" instead of "valueType", "variable" instead of "variables" and "row" instead of "rows"

If you provide vis_miss_ly, it makes sense to me to also provide vis_dat_ly, and possibly also vis_compare_ly and vis_guess_ly. An initial implementation could just be vis_dat(x) %>% ggplotly(), but then you have the API in place at least.
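For example, a first-pass vis_dat_ly() could be little more than a thin wrapper (a sketch only; tooltips and extra arguments could come later):

  # Placeholder implementation, as suggested above
  vis_dat_ly <- function(x, ...) {
    plotly::ggplotly(vis_dat(x, ...))
  }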

Tests

Cool use of vdiffr and generally good coverage. The tests don't all pass on my machine (they are mostly skipped), which I think is because I don't have the .svg reference files on my computer.

I suggest adding a test for each function using typical_data, because that currently emits a warning message. That way, once you stop the warning from occurring, you'll have a regression test for the future.

Maybe I don't understand how vdiffr works, but I'm confused about test-vis-dat.R where in the calls to vdiffr you always provide vis_dat_plot, never any of the other plots you created. Additionally, both of the plots you create using non-default palettes are given the same name, so you're overwriting one of them.

On my computer, the palette argument doesn't change the appearance of the vis_dat output, but the tests seem to pass, so can you check that you have a test that catches that issue? When I run vdiffr, vis-dat-qualitative-palette.svg and vis-dat-vanilla.svg appear to be the same.
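For instance, giving each palette its own doppelganger name would catch both issues (a sketch; the test names are placeholders):

  # Sketch: one named expectation per palette, so palette differences are caught
  vdiffr::expect_doppelganger("vis-dat-qual-palette",
                              vis_dat(airquality, palette = "qual"))
  vdiffr::expect_doppelganger("vis-dat-cb-safe-palette",
                              vis_dat(airquality, palette = "cb_safe"))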

@batpigandme's Review

vis_dat's power lies in the fact that it's essentially a visual analogue to one's data, if imagined as a frame/spreadsheet. In order to enhance this, I would suggest that the charts, by default, "read" in the spirit of this analogy by going from top to bottom.

Documentation: README

I have made some minor changes to the wording in README and submitted them as a pull request here, most of which I think are self-explanatory.

vis_dat(airquality) shows numeric and integer values when I run it, not character as described, so I've swapped numeric in for character.

Added canonical link to the mi package, since you do so for wakefield

Though the README suggests that when less than one percent is missing, "vis_miss indicates that there is >1%", the chart that results from vis_miss(test_miss_df) reads Missing (<0.1%) in the legend

When running the first example from the README (vis_dat(airquality)), the user will get a warning about the deprecation of dmap(). If this is of no consequence to the package's functionality, then you might consider adding it to the suppressed messages. If it's something most users suppress locally, then you might suggest that somewhere in the README.

#> dmap() is deprecated. Please use the new colwise family in dplyr.
#> E.g., summarise_all(), mutate_all(), etc.

Minor suggestion (which may or may not comply with the actual vignette formatting): for the purpose of demonstrating the difference between vis_guess() and vis_dat(), you might consider showing the two side by side (perhaps show the vis_guess() example alone and then the two side by side). For this you could just add fig.show='hold', out.width='50%' to the chunk options in the Rmd.

Based on the README, the status of vis_dat_ly isn't totally clear (since it's not yet a function of its own).

I didn't quite understand the Visualising expectations section. That said, I've also never used expectation in assertr, so this may fall under the category of things that those who need it will understand, and thus no more need be said.

Documentation: Vignettes

Some of the same copy edits I made in the README I would also suggest in the vignettes (e.g. the "compare this to...Where" sentence that sort of goes around the vis_dat figure in the vis_guess example section).

The experimental features vignette is a bit hard to follow at the moment, which may be in part due to the layout (the warning messages take up a good bit of space). The figure is also quite small. I'd recommend adjusting the settings to match your main vignette to make it more legible.

vis_dat_ly() and vis_miss_ly() aren't in the experimental vignette, though the opening paragraph suggests that they will be.

Documentation: Help

The vis_guess description is a bit hard to follow. I'd start by stating what it does do, before making the comparison to vis_dat, just so it's a bit more self-contained.

The visdat and visdat-package help sections are empty; I assume you'll want the description to be included (right now the description just says "visdat").

Examples

Note: the function examples in R all throw the warning dmap() is deprecated. As noted in @seaaan's comments, there's also the warning about data frames of different dimensions, e.g. example("vis_compare") results in:

  #>  Warning messages:
  #> 1: In if (dim(df1) != dim(df2)) { :
  #>  the condition has length > 1 and only the first element will be used
  #> 2: attributes are not identical across measure variables; they will be dropped
  #> 3: attributes are not identical across measure variables; they will be dropped

The vis_compare example also seems to cluster by default (all of the differing data is shown at the bottom of the two columns).

Community guidelines

I don't believe these are currently in the package.

Paper

The vis_miss chunk is unnamed. I'd put a name in for you, but I can't knit the doc (see below). Either way, quick fix.

I'm getting an error 83 because pandoc-citeproc is missing. I believe this is something I'm missing on my machine, and, thus you don't need to worry about. But, just in case, I'm letting you know!

Functionality

Installation and Tests

Installation built successfully (on three different computers, and versions of R at that), as did automated testing. Good work!

Functionality

All of the functions run as expected, and as described. My feedback here primarily concerns messages and warnings that apply to several functions.

Warning messages can be scary, so it's always a relief to see them acknowledged or shown in vignette and/or README examples (as is done for the non-identical attributes warning). vis_compare (a function for which I can imagine plenty of use cases) currently returns enough text that it may leave the user confused as to what's most important, or wondering whether there's an error on their end. Currently, running vis_compare(iris_diff, iris) (as in the given example) returns:

#> vis_compare is still in BETA! If you have suggestions or errors,
#>        post an issue at https://github.com/njtierney/visdat/issues
#>  dmap() is deprecated. Please use the new colwise family in dplyr.
#>  E.g., summarise_all(), mutate_all(), etc.
#>  Warning messages:
#>  1: In if (dim(df1) != dim(df2)) { :
#>    the condition has length > 1 and only the first element will be used
#>  2: attributes are not identical across measure variables; they will be dropped
#>  3: attributes are not identical across measure variables; they will be dropped

The function runs fine, and appears to have worked, but there is enough "warning" text that it may cause concern for the user.

vis_compare(typical_data)
...

vis_guess is where the cell-to-row visual disparity becomes most confusing. I think this is a fantastic function, especially for beginners, and so the ability to essentially "read" the visual, as one would a table (top to bottom), would be invaluable.

vis_dat with plotly / interactive functions are great, and, again, here it would be even more helpful if one could "read" the charts as one would a table.

njtierney's summary response and remaining tasks to complete

This has been a fantastic process to be a part of. To summarise, the main changes that I made were:

Then, in a separate branch, there will be all of the features:

- vis_compare()
- vis_guess()
- the vis_*_ly() family

There is still more work to be done on these functions. Specifically:

I now summarise some of the changes that I wanted to confirm with the reviewers, @seaaan and @batpigandme:

  1. Is the warning message for the vis_guess/vis_compare code OK? Running example(vis_guess) / example(vis_compare) gives:
vis_compare / (vis_guess) is in BETA! If you have suggestions or errors
post an issue at https://github.com/njtierney/visdat/issues
  2. The default sort option. I think that the default sort_type option should be to put all of the similar columns together - to me this makes it easier to identify what is similar and what is different; when I start looking at a dataset I usually want to know which variables are of which class. I am willing to change this, however.

  3. Maximum row size issue. This is an interesting one! It was brought to my attention here by @MilesMcBain, but I couldn't replicate the issue, and I can actually display the nycflights data, so it seems that computing power has a bit to do with it. I think that option 1 is the best case here; option 2 gets a little bit difficult, because it leads to another issue: effectively downsampling a dataset whilst maintaining its structure. This could be a problem, say, if I did perform down-sampling and then the user looked at the data with visdat and plotly - they might get a misrepresented sample. Likewise, if there is a very large proportion of missing data, this might be difficult to preserve exactly. Ways around this might include some very explicit warning messages, or perhaps even a note directly on the plot that the data have been downsampled. At the moment one solution forward is to have a check first, something like:

vis_dat(really_large_data)

Warning: This dataset contains over 10,000 rows, which may take a while to render. Are you sure that you want to continue?

1: Yes
2: No (You should use a sample of your data)
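A minimal sketch of how such a guard could look (the function and argument names, the 10,000-row default, and the message wording are all placeholders, not a final API):

  # Hypothetical guard, run at the top of vis_dat(); everything here is a sketch
  vis_dat_check_size <- function(data, max_rows = 10000) {
    if (nrow(data) > max_rows && interactive()) {
      answer <- utils::menu(
        choices = c("Yes", "No (You should use a sample of your data)"),
        title = paste0("This dataset contains over ", max_rows, " rows, ",
                       "which may take a while to render. ",
                       "Are you sure that you want to continue?")
      )
      if (answer != 1) {
        stop("Try again with a sample of your data.", call. = FALSE)
      }
    }
    invisible(data)
  }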
  4. Speed for vis_guess. @seaaan wrote some nice code to make this work much, much faster, and I think that this is a fantastic idea! But I'm not sure of the best way to move this into production in visdat, since it sounds like we would need to borrow a chunk of code from readr and then modify it. I wonder if perhaps the solution here is to do a pull request for this specific feature in readr?

  5. Can the two issues below be replicated?

I don't know why, but in the plotly section, a vis_dat() (non-interactive) plot of df_test appears between the first two interactive plots. I can't explain it, hopefully it's just a weird quirk on my computer.

I'm getting an error 83 because pandoc-citeproc is missing. I believe this is something I'm missing on my machine, and, thus you don't need to worry about. But, just in case, I'm letting you know!


