summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:
| Function | Description | 
|---|---|
| freq() | Frequency Tables featuring counts, proportions, as well as missing data information | 
| ctable() | Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions | 
| descr() | Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion | 
| dfSummary() | Extensive Data Frame Summaries featuring type-specific information for all variables in a data frame: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance | 
The package was developed with the following objectives in mind:
Results can be

- Except for dfSummary(), all core functions support sampling weights
- Global options can be set with st_options(); this simplifies coding and minimizes redundancy

The freq() function generates frequency tables with counts, proportions, as well as missing data information.
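A call along the following lines produces a table like the one shown next. This is a minimal sketch: the text says that plain.ascii and style were specified in this first example, but the exact values used here are an assumption.

```r
library(summarytools)

# Frequency table for a factor; the argument values are assumed,
# chosen to suit markdown output
species_freq <- freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")
species_freq
```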
iris$Species
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally with st_options() in the setup chunk, they are redundant and will be omitted from here on. See section 13 for more details on this vignette’s setup.
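The report.nas argument can be set to FALSE in order to ignore missing values (NA’s). Doing so drops the <NA> row, removes the % Total and % Total Cum. columns, and shortens the remaining column names to % and % Cum., as the resulting table shows:
| | Freq | % | % Cum. |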
|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 
| versicolor | 50 | 33.33 | 66.67 | 
| virginica | 50 | 33.33 | 100.00 | 
| Total | 150 | 100.00 | 100.00 | 
Note that the headings = FALSE parameter suppresses the heading section.
By “switching off” all optional elements, a much simpler table will be produced:
| | Freq | % |
|---|---|---|
| setosa | 50 | 33.33 | 
| versicolor | 50 | 33.33 | 
| virginica | 50 | 33.33 | 
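The two tables above can be reproduced with calls like these. This is a sketch: report.nas and headings are named in the text, while totals and cumul are assumed here to be the freq() arguments that switch off the remaining optional elements.

```r
library(summarytools)

# Ignore missing values and suppress the heading section
freq(iris$Species, report.nas = FALSE, headings = FALSE)

# Also drop the totals row and the cumulative column for the simplest table
simple_freq <- freq(iris$Species, report.nas = FALSE, totals = FALSE,
                    cumul = FALSE, headings = FALSE)
simple_freq
```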
To generate frequency tables for all variables in a data frame, no need to use lapply(); freq() handles whole data frames, too:
To avoid cluttering the results, numerical columns having more than 25 distinct values will be discarded. This threshold of 25 can be changed by using for example st_options(freq.ignore.threshold = 10).
Note: the tobacco data frame contains simulated data and is included in the package.
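As a sketch of the two points above, the first call below passes the whole tobacco data frame to freq(), and the second changes the threshold with the st_options() call quoted in the text. The object name tobacco_freqs is only illustrative.

```r
library(summarytools)

# Frequency tables for every suitable column of the bundled tobacco data frame
tobacco_freqs <- freq(tobacco)

# Numerical columns with more distinct values than this are now skipped
st_options(freq.ignore.threshold = 10)
```

The threshold is a global option, so it stays in effect for later freq() calls in the same session until it is changed again.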
The rows parameter allows subsetting frequency tables; we can use this parameter in different ways:

- rows = 1:10 will show the frequencies for the first 10 values only
- rows also accepts a regular expression; see ?regex for more information on regular expressions

Used in combination with the order argument, the subsetting feature can be quite practical. For a character variable containing a large number of distinct values, showing only the most frequent is easily done:
tobacco$disease
Type: Character
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| Hypertension | 36 | 16.22 | 16.22 | 3.60 | 3.60 | 
| Cancer | 34 | 15.32 | 31.53 | 3.40 | 7.00 | 
| Cholesterol | 21 | 9.46 | 40.99 | 2.10 | 9.10 | 
| Heart | 20 | 9.01 | 50.00 | 2.00 | 11.10 | 
| Pulmonary | 20 | 9.01 | 59.01 | 2.00 | 13.10 | 
| (Other) | 91 | 40.99 | 100.00 | 9.10 | 22.20 | 
| <NA> | 778 | | | 77.80 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 | 
Instead of "freq", we can use "-freq" to reverse the ordering and get results ranked from lowest to highest in frequency.
To account for the frequencies of unshown values, the “(Other)” row is automatically added.
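The table above, with its five most frequent values and an “(Other)” row, is consistent with a call like the first one below. This is a sketch: the value rows = 1:5 is inferred from the output rather than quoted from the text, and the object names are illustrative.

```r
library(summarytools)

# Five most frequent values of a character variable; the remaining
# values are pooled into an "(Other)" row
top_diseases <- freq(tobacco$disease, order = "freq", rows = 1:5)
top_diseases

# Same subset size, ranked from lowest to highest frequency
rare_diseases <- freq(tobacco$disease, order = "-freq", rows = 1:5)
rare_diseases
```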
When generating html results, use the collapse = TRUE argument with print() or view() to get collapsible sections; clicking on the variable name in the heading section will collapse / reveal the frequency table (results not shown).
ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.
Since markdown does not support multiline table headings (but does accept html code), we’ll use the html rendering feature for this section.
Using the tobacco data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.
| smoker | diseased: Yes | diseased: No | Total |
|---|---|---|---|
| Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
| No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
| Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
Row proportions are shown by default. To display column or total proportions, use prop = "c" or prop = "t", respectively. To omit proportions altogether, use prop = "n".
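A call of the following shape produces the cross-tabulation above. It is a sketch assembled from the arguments named in this section (prop, and method = "render" for html output); the object name is illustrative.

```r
library(summarytools)

# Cross-tabulation with row proportions (the default)
smoker_by_diseased <- ctable(x = tobacco$smoker, y = tobacco$diseased,
                             prop = "r")

# Render as html, since markdown cannot show the two-level heading
print(smoker_by_diseased, method = "render")

# Column proportions instead of row proportions
ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "c")
```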
By “switching off” all optional features, we get a simple “2 x 2” table:
with(tobacco, 
     print(ctable(x = smoker, y = diseased, prop = 'n',
                  totals = FALSE, headings = FALSE),
           method = "render"))

| smoker | diseased: Yes | diseased: No |
|---|---|---|
| Yes | 125 | 173 |
| No | 99 | 603 |
To display the chi-square statistic, set chisq = TRUE. To show how pipes can be used with summarytools, we’ll use magrittr’s %$% and %>% operators:
library(magrittr)
tobacco %$%  # The %$% operator replaces with(tobacco, ...)
  ctable(gender, smoker, chisq = TRUE, headings = FALSE) %>%
  print(method = "render")

| gender | smoker: Yes | smoker: No | Total |
|---|---|---|---|
| F | 147 (30.1%) | 342 (69.9%) | 489 (100.0%) |
| M | 143 (29.2%) | 346 (70.8%) | 489 (100.0%) |
| <NA> | 8 (36.4%) | 14 (63.6%) | 22 (100.0%) |
| Total | 298 (29.8%) | 702 (70.2%) | 1000 (100.0%) |

Χ² = 0.5415, df = 2, p = 0.7628
descr() generates descriptive / univariate statistics, i.e. common central tendency statistics and measures of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.
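The output shown next corresponds to calling descr() on the whole iris data frame. A minimal sketch, with an illustrative object name:

```r
library(summarytools)

# All available statistics for the numeric columns of iris;
# the Species factor is skipped with a message
iris_descr <- descr(iris)
iris_descr
```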
Non-numerical variable(s) ignored: Species

iris
N: 150
| | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
|---|---|---|---|---|
| Mean | 3.76 | 1.20 | 5.84 | 3.06 | 
| Std.Dev | 1.77 | 0.76 | 0.83 | 0.44 | 
| Min | 1.00 | 0.10 | 4.30 | 2.00 | 
| Q1 | 1.60 | 0.30 | 5.10 | 2.80 | 
| Median | 4.35 | 1.30 | 5.80 | 3.00 | 
| Q3 | 5.10 | 1.80 | 6.40 | 3.30 | 
| Max | 6.90 | 2.50 | 7.90 | 4.40 | 
| MAD | 1.85 | 1.04 | 1.04 | 0.44 | 
| IQR | 3.50 | 1.50 | 1.30 | 0.50 | 
| CV | 0.47 | 0.64 | 0.14 | 0.14 | 
| Skewness | -0.27 | -0.10 | 0.31 | 0.31 | 
| SE.Skewness | 0.20 | 0.20 | 0.20 | 0.20 | 
| Kurtosis | -1.42 | -1.36 | -0.61 | 0.14 | 
| N.Valid | 150.00 | 150.00 | 150.00 | 150.00 | 
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 | 
Results can be transposed by using transpose = TRUE, and statistics can be selected using the stats argument:
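A sketch of a call combining both arguments follows. The choice of "mean" and "sd" matches the columns of the table shown next; headings = FALSE is assumed because that table has no heading section.

```r
library(summarytools)

# Keep only the mean and standard deviation, one variable per row
iris_mean_sd <- descr(iris, stats = c("mean", "sd"), transpose = TRUE,
                      headings = FALSE)
iris_mean_sd
```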
Non-numerical variable(s) ignored: Species

| | Mean | Std.Dev |
|---|---|---|
| Petal.Length | 3.76 | 1.77 | 
| Petal.Width | 1.20 | 0.76 | 
| Sepal.Length | 5.84 | 0.83 | 
| Sepal.Width | 3.06 | 0.44 | 
See ?descr for a list of all available statistics. Special values “all”, “fivenum”, and “common” are also valid values for the stats argument. The default value is “all”.
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
To see the results in RStudio’s Viewer (or in the default Web browser if working in another IDE or from a terminal window), we use the view() function:
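As a minimal sketch, the summary can be built once and then passed to view(). The interactive() guard is an addition for illustration, so that the snippet also runs in scripts where no viewer or browser is available.

```r
library(summarytools)

# Build the summary once, then send it to the Viewer or browser
iris_summary <- dfSummary(iris)

# view() opens a viewer pane or browser tab, so only call it interactively
if (interactive()) {
  view(iris_summary)
}
```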
When using dfSummary() in Rmarkdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.
dfSummary(tobacco, plain.ascii = FALSE, style = "grid", 
          graph.magnif = 0.75, valid.col = FALSE, tmp.img.dir = "/tmp")

The tmp.img.dir parameter is mandatory when generating dfSummaries in Rmarkdown documents, except for html rendering. The explanation for this can be found further below.
This function
Although most columns can be excluded using the function’s parameters, it is also possible to delete them with the following syntax (results not shown):
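One way to do this, sketched under the assumption that a dfSummary object behaves like an ordinary data frame with a column named Variable, is to assign NULL to the unwanted column:

```r
library(summarytools)

iris_summary <- dfSummary(iris)

# Drop a column by assigning NULL to it (the "Variable" column is
# chosen here purely for illustration)
iris_summary$Variable <- NULL
```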
To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():
(iris_stats_by_species <- stby(data = iris, 
                               INDICES = iris$Species, 
                               FUN = descr, stats = "common", transpose = TRUE))

Non-numerical variable(s) ignored: Species

iris
Group: Species = setosa
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 | 50.00 | 100.00 | 
| Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 | 50.00 | 100.00 | 
| Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 | 50.00 | 100.00 | 
| Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 | 50.00 | 100.00 | 
Group: Species = versicolor
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 | 50.00 | 100.00 | 
| Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 | 50.00 | 100.00 | 
| Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 | 50.00 | 100.00 | 
| Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 | 50.00 | 100.00 | 
Group: Species = virginica
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|
| Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 | 50.00 | 100.00 | 
| Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 | 50.00 | 100.00 | 
| Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 | 50.00 | 100.00 | 
| Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 | 50.00 | 100.00 | 
When used to produce split-group statistics for a single variable, stby() assembles everything into a single table instead of displaying a series of one-column tables.
with(tobacco, stby(data = BMI, INDICES = age.gr, 
                   FUN = descr, stats = c("mean", "sd", "min", "med", "max")))

BMI by age.gr
Data Frame: tobacco
N: 258
| | 18-34 | 35-50 | 51-70 | 71 + |
|---|---|---|---|---|
| Mean | 23.84 | 25.11 | 26.91 | 27.45 | 
| Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 | 
| Min | 8.83 | 10.35 | 9.01 | 16.36 | 
| Median | 24.04 | 25.11 | 26.77 | 27.52 | 
| Max | 34.84 | 39.44 | 39.21 | 38.37 | 
When stby() is used with ctable(), the syntax is a little trickier, so here is an example (results not shown):
stby(list(x = tobacco$smoker, y = tobacco$diseased), 
     INDICES = tobacco$gender, FUN = ctable)
# or equivalently
with(tobacco, 
     stby(list(x = smoker, y = diseased), 
          INDICES = gender, FUN = ctable))

To create grouped statistics with freq(), descr() or dfSummary(), it is possible to use dplyr’s group_by() as an alternative to stby(). Syntactic differences aside, one key distinction is that group_by() considers NA values on the grouping variables as a valid category, albeit with a warning message suggesting the use of forcats::fct_explicit_na to make NA’s explicit in factors. Following this advice, we get:
library(dplyr)
# %<>% comes from magrittr, loaded earlier in this document
tobacco$gender %<>% forcats::fct_explicit_na()
tobacco %>% group_by(gender) %>% descr(stats = "fivenum")

Non-numerical variable(s) ignored: age.gr, smoker, diseased, disease

tobacco
Group: gender = F
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 9.01 | 18.00 | 0.00 | 0.86 | 
| Q1 | 22.98 | 34.00 | 0.00 | 0.86 | 
| Median | 25.87 | 50.00 | 0.00 | 1.04 | 
| Q3 | 29.48 | 66.00 | 10.50 | 1.05 | 
| Max | 39.44 | 80.00 | 40.00 | 1.06 | 
Group: gender = M
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 8.83 | 18.00 | 0.00 | 0.86 | 
| Q1 | 22.52 | 34.00 | 0.00 | 0.86 | 
| Median | 25.14 | 49.50 | 0.00 | 1.04 | 
| Q3 | 27.96 | 66.00 | 11.00 | 1.05 | 
| Max | 36.76 | 80.00 | 40.00 | 1.06 | 
Group: gender = (Missing)
N: 22
| | BMI | age | cigs.per.day | samp.wgts |
|---|---|---|---|---|
| Min | 20.24 | 19.00 | 0.00 | 0.86 | 
| Q1 | 24.97 | 36.00 | 0.00 | 1.04 | 
| Median | 27.16 | 55.50 | 0.00 | 1.05 | 
| Q3 | 30.23 | 64.00 | 10.00 | 1.05 | 
| Max | 32.43 | 80.00 | 28.00 | 1.06 | 
When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:
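The two tibbles shown next are consistent with calls like the following. This is a sketch: stats = "common", cumul = FALSE and report.nas = FALSE are inferred from the columns of the output, and the object names are illustrative.

```r
library(summarytools)
library(magrittr)  # provides the %>% pipe

# Descriptive statistics as a tibble, one row per variable
descr_tbl <- iris %>% descr(stats = "common") %>% tb()

# Frequencies as a tibble, without cumulative or NA-related columns
freq_tbl <- iris$Species %>% freq(cumul = FALSE, report.nas = FALSE) %>% tb()
```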
# A tibble: 4 x 8
  variable      mean    sd   min   med   max n.valid pct.valid
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
1 Petal.Length  3.76 1.77    1    4.35   6.9     150       100
2 Petal.Width   1.20 0.762   0.1  1.3    2.5     150       100
3 Sepal.Length  5.84 0.828   4.3  5.8    7.9     150       100
4 Sepal.Width   3.06 0.436   2    3      4.4     150       100

# A tibble: 3 x 3
  Species     freq   pct
  <fct>      <dbl> <dbl>
1 setosa        50  33.3
2 versicolor    50  33.3
3 virginica     50  33.3

By definition, no total rows are part of tidy tables, and the row names are converted to a regular column. Note that for displaying tibbles using Rmarkdown, the knitr chunk option ‘results’ should be set to “markup” instead of “asis”.
Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.
grouped_descr <- stby(data = exams, INDICES = exams$gender, 
                      FUN = descr, stats = "common")
grouped_descr %>% tb()

# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 3 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 4 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100  
 5 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100  
 6 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
 7 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100  
 8 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100  
 9 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100  
10 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
11 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100  
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

The order parameter controls row ordering. With order = 2, rows are sorted by variable first, then by group:
# A tibble: 12 x 9
   gender variable   mean    sd   min   med   max n.valid pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14      93.3
 2 Boy    economics  75.2  9.40  60.5  71.7  94.2      15     100  
 3 Girl   english    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 Boy    english    77.8  5.94  69.6  77.6  90.2      15     100  
 5 Girl   french     71.1 12.4   44.8  68.4  93.7      14      93.3
 6 Boy    french     76.6  8.63  63.2  74.8  94.7      15     100  
 7 Girl   geography  67.3  8.26  50.4  67.3  78.9      15     100  
 8 Boy    geography  73   12.4   47.2  71.2  96.3      14      93.3
 9 Girl   history    71.2  9.17  53.9  72.9  86.4      15     100  
10 Boy    history    74.4 11.2   54.4  72.6  93.5      15     100  
11 Girl   math       73.8  9.03  55.6  74.8  86.3      14      93.3
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14      93.3

Setting order = 3 changes the order of the sort variables exactly as with order = 2, but it also reorders the columns:
# A tibble: 12 x 9
   variable  gender  mean    sd   min   med   max n.valid pct.valid
   <chr>     <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 economics Girl    72.5  7.79  62.3  70.2  89.6      14      93.3
 2 economics Boy     75.2  9.40  60.5  71.7  94.2      15     100  
 3 english   Girl    73.9  9.41  58.3  71.8  93.1      14      93.3
 4 english   Boy     77.8  5.94  69.6  77.6  90.2      15     100  
 5 french    Girl    71.1 12.4   44.8  68.4  93.7      14      93.3
 6 french    Boy     76.6  8.63  63.2  74.8  94.7      15     100  
 7 geography Girl    67.3  8.26  50.4  67.3  78.9      15     100  
 8 geography Boy     73   12.4   47.2  71.2  96.3      14      93.3
 9 history   Girl    71.2  9.17  53.9  72.9  86.4      15     100  
10 history   Boy     74.4 11.2   54.4  72.6  93.5      15     100  
11 math      Girl    73.8  9.03  55.6  74.8  86.3      14      93.3
12 math      Boy     73.3  9.68  60.5  72.2  93.2      14      93.3

For more details, see ?tb.
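The two reordered tibbles above can be obtained as sketched here, reusing the grouped_descr definition from earlier in this section. The names by_variable and variable_first are illustrative.

```r
library(summarytools)
library(magrittr)

grouped_descr <- stby(data = exams, INDICES = exams$gender,
                      FUN = descr, stats = "common")

# order = 2: sort rows by variable first, then by group
by_variable <- grouped_descr %>% tb(order = 2)

# order = 3: same row order, with the variable column placed first
variable_first <- grouped_descr %>% tb(order = 3)
```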
summarytools objects are not always compatible with packages focused on table formatting, such as formattable or kableExtra. However, tb() can be used as a “bridge”, an intermediary step turning freq() and descr() objects into simple tables that any package can work with. Here is an example using kableExtra:
library(kableExtra)
library(magrittr)
stby(iris, iris$Species, descr, stats = "fivenum") %>%
  tb(order = 3) %>%
  kable(format = "html", digits = 2) %>%
  collapse_rows(columns = 1, valign = "top")

| variable | Species | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
| | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
| | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
| Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
| | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
| | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
| Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
| | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
| | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
| Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
| | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
| | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |
Using the file argument with print() or view(), we can write outputs to a file, be it html, Rmd, md, or just plain text (txt). The file extension is used to determine the type of content to write out.
view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")

A Note About PDF documents
There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output). Another option is to create an Rmd document using PDF as the output format, but with a caveat: displaying graphs with dfSummary() will cause vertical misalignment (we hope to resolve this issue in a future version).
The append argument allows adding content to existing files generated by summarytools. This is useful if we wish to include several statistical tables in a single file. It is a quick alternative to creating an Rmd document.
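A minimal sketch of appending follows. The temporary file location and the choice of tables are assumptions made for illustration; only the file and append arguments come from the text.

```r
library(summarytools)

# Write to a temporary markdown file (location chosen for illustration)
out_file <- file.path(tempdir(), "iris_report.md")

# The first call creates the file; the second appends to it
print(freq(iris$Species), file = out_file)
print(descr(iris), file = out_file, append = TRUE)
```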
The following options can be set with st_options():
| Option name | Default | Note | 
|---|---|---|
| style | “simple” | Set to “rmarkdown” in .Rmd documents | 
| plain.ascii | TRUE | Set to FALSE in .Rmd documents | 
| round.digits | 2 | Number of decimals to show | 
| headings | TRUE | Formerly “omit.headings” | 
| footnote | “default” | Personalize, or set to NA to omit | 
| display.labels | TRUE | Show variable / data frame labels in headings | 
| bootstrap.css (*) | TRUE | Include Bootstrap 4 CSS in html output files | 
| custom.css | NA | Path to your own CSS file | 
| escape.pipe | FALSE | Useful for some Pandoc conversions | 
| subtitle.emphasis | TRUE | Controls headings formatting | 
| lang | “en” | Language (always 2-letter, lowercase) | 
(*) Set to FALSE in Shiny apps
| Option name | Default | Note | 
|---|---|---|
| freq.totals | TRUE | Display totals row in freq() | 
| freq.report.nas | TRUE | Display <NA> row and “valid” columns |
| freq.silent | FALSE | Hide console messages | 
| ctable.prop | “r” | Display row proportions by default | 
| ctable.totals | TRUE | Show marginal totals | 
| descr.stats | “all” | “fivenum”, “common” or vector of stats | 
| descr.transpose | FALSE | Display stats in columns instead of rows | 
| descr.silent | FALSE | Hide console messages | 
| dfSummary.varnumbers | TRUE | Show variable numbers in 1st col. | 
| dfSummary.labels.col | TRUE | Show variable labels when present | 
| dfSummary.graph.col | TRUE | Show graphs | 
| dfSummary.valid.col | TRUE | Include the Valid column in the output | 
| dfSummary.na.col | TRUE | Include the Missing column in the output | 
| dfSummary.graph.magnif | 1 | Zoom factor for bar plots and histograms | 
| dfSummary.silent | FALSE | Hide console messages | 
| tmp.img.dir | NA | Directory to store temporary images | 
Examples
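The calls below sketch typical uses of st_options(), using option names from the tables above. The values are arbitrary, and querying a single option by name is assumed to return its current value.

```r
library(summarytools)

st_options()                      # display all global options and their values
st_options("round.digits")        # query a single option
st_options(style = "rmarkdown")   # set one option
st_options(round.digits = 1,      # set several options at once
           footnote = NA)
```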
When a summarytools object is created, its formatting attributes are stored within it. However, we can override most of them when using print() or view().
This table indicates what arguments can be used with print() or view() to override formatting attributes:
| Argument | freq | ctable | descr | dfSummary | 
|---|---|---|---|---|
| style | x | x | x | x | 
| round.digits | x | x | x | |
| plain.ascii | x | x | x | x | 
| justify | x | x | x | x | 
| headings | x | x | x | x | 
| display.labels | x | x | x | x | 
| varnumbers | | | | x |
| labels.col | | | | x |
| graph.col | | | | x |
| valid.col | | | | x |
| na.col | | | | x |
| col.widths | | | | x |
| totals | x | x | ||
| report.nas | x | |||
| display.type | x | |||
| missing | x | |||
| split.tables (*) | x | x | x | x | 
| caption (*) | x | x | x | x | 
(*) These are pander options
To change the information shown in the heading section, use the following arguments with print() or view():
| Argument | freq | ctable | descr | dfSummary | 
|---|---|---|---|---|
| Data.frame | x | x | x | x | 
| Data.frame.label | x | x | x | x | 
| Variable | x | x | x | |
| Variable.label | x | x | x | |
| Group | x | x | x | x | 
| date | x | x | x | x | 
| Weights | x | x | ||
| Data.type | x | |||
| Row.variable | | x | | |
| Col.variable | | x | | |
In the following example, we will override three formatting attributes and one heading attribute:
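The two tables that follow are consistent with the sketch below. The specific overrides are inferred from the differences between those tables (no <NA> row, no totals row, no Type line, and an added label), and age_stats is an illustrative name.

```r
library(summarytools)

# Create the table once; its formatting attributes are stored with it
age_stats <- freq(tobacco$age.gr)
age_stats

# Override three formatting attributes and one heading attribute at print time
print(age_stats, report.nas = FALSE, totals = FALSE, display.type = FALSE,
      Variable.label = "Age Group")
```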
tobacco$age.gr
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 | 25.80 | 25.80 | 
| 35-50 | 241 | 24.72 | 51.18 | 24.10 | 49.90 | 
| 51-70 | 317 | 32.51 | 83.69 | 31.70 | 81.60 | 
| 71 + | 159 | 16.31 | 100.00 | 15.90 | 97.50 | 
| <NA> | 25 | | | 2.50 | 100.00 |
| Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 | 
tobacco$age.gr
Label: Age Group
| | Freq | % | % Cum. |
|---|---|---|---|
| 18-34 | 258 | 26.46 | 26.46 | 
| 35-50 | 241 | 24.72 | 51.18 | 
| 51-70 | 317 | 32.51 | 83.69 | 
| 71 + | 159 | 16.31 | 100.00 | 

1 - print() or view() parameters have precedence (overriding feature)
2 - freq() / ctable() / descr() / dfSummary() parameters come second
3 - Global options set with st_options() come third

When creating html reports, both Bootstrap’s CSS and summarytools.css are included by default. For greater control on the looks of html content, it is also possible to add class definitions in a custom CSS file.
We need to use a very small font size for a simple html report containing a dfSummary(). For this, we create a .css file (with the name of our choosing) which contains the following class definition:
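The class definition itself is not reproduced here, so the following is only a sketch of how such a file could be created from R. The class name tiny-text matches the table.classes value used in the print() call of this section; the font size and the file location are assumed values.

```r
# Write a small CSS file defining one class; the font size is an
# assumed value chosen for illustration
css_path <- file.path(tempdir(), "custom.css")
writeLines(c(".tiny-text {",
             "  font-size: 8px;",
             "}"),
           css_path)
```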
Then we use print()’s custom.css argument to specify the location of our newly created CSS file (results not shown):
print(dfSummary(tobacco), custom.css = 'path/to/custom.css', 
      table.classes = 'tiny-text', file = "tiny-tobacco-dfSummary.html")

To successfully include summarytools functions in Shiny apps,

- Use bootstrap.css = FALSE to avoid interacting with the app’s layout
- Set headings = FALSE in case problems arise
- Adjust graph sizes with print()’s graph.magnif parameter or with the dfSummary.graph.magnif global option
- If dfSummary() tables are too wide, omit a column or two (valid.col and varnumbers, for instance)
- Adjust column widths with print()’s col.widths parameter

Example (results not shown)
print(dfSummary(somedata, varnumbers = FALSE, valid.col = FALSE, 
                graph.magnif = 0.8), 
      method = 'render',
      headings = FALSE,
      bootstrap.css = FALSE)

When using dfSummary() in an Rmd document using markdown styling (as opposed to html rendering), three elements are needed in order to display the png graphs properly:
1 - plain.ascii must be set to FALSE
2 - style must be set to “grid”
3 - tmp.img.dir must be defined
Why the third element? Although R makes it really easy to create temporary files and directories, they do have long pathnames, especially on Windows. Unfortunately, Pandoc determines the final (rendered) column widths by counting characters in a cell, even if those characters are paths pointing to images.
At this time, there seems to be only one solution around this problem: cut down on characters in image paths. So instead of this:
+-----------+---------------------------------------------------------------------+---------+
| Variable  | Graph                                                               | Valid   |
+===========+=====================================================================+=========+
| gender\   |  | 978\    |
| [factor]  |                                                                     | (97.8%) |
+----+---------------+------------------------------------------------------------+---------+

…we aim for this:
+---------------+----------------------+---------+
| Variable      | Graph                | Valid   |
+===============+======================+=========+
| gender\       |  | 978\    |
| [factor]      |                      | (97.8%) |
+---------------+----------------------+---------+

CRAN policies are really strict when it comes to writing content in the user directories, or anywhere outside R’s temporary zone (for good reasons). So the users need to set this location themselves, therefore consenting to having content written outside R’s predefined temporary zone.
On Mac OS and Linux, using “/tmp” makes a lot of sense: it’s a short path, and it’s self-cleaning. On Windows, there is no such convenient directory, so we need to pick one – be it absolute (“/tmp”) or relative (“img”, or simply “.”). Two things are to be kept in mind: it needs to be short (5 characters max) and it needs to be cleaned up manually.
Thanks to the R community’s efforts, the following languages can be used, in addition to English (default): French (fr), Portuguese (pt), Russian (ru), Spanish (es), and Turkish (tr).
To switch languages, simply set the lang option with st_options(), for example st_options(lang = "fr").
All output from the core functions will now use that language:
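As a short sketch, switching to French and printing a frequency table would look like this. The lang option and the "fr" code come from the text; the use of iris$Species matches the output shown next.

```r
library(summarytools)

st_options(lang = "fr")   # switch the output language to French
freq(iris$Species)        # headings and column names now appear in French
```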
iris$Species
Type: Facteur
| | Fréq. | % Valide | % Valide cum. | % Total | % Total cum. |
|---|---|---|---|---|---|
| setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 | 
| versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 | 
| virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 | 
| <NA> | 0 | | | 0.00 | 100.00 |
| Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 | 
On most Windows systems, it will be necessary to change the LC_CTYPE element of the locale settings if the character set is not included in the system’s default locale. For instance, in order to get good results with the Russian language in a “latin1” environment, we need to do the following:
Then to go back to default settings:
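Both steps can be sketched as follows. Locale names are platform-specific: "russian" is assumed here to be a valid locale name on Windows, and the OS check is an addition so that the snippet does not attempt it elsewhere.

```r
library(summarytools)

# Change the character type locale (Windows only; the locale name is assumed)
if (.Platform$OS.type == "windows") {
  Sys.setlocale("LC_CTYPE", "russian")
}
st_options(lang = "ru")

# ... produce tables in Russian here ...

# Back to the default locale and language
if (.Platform$OS.type == "windows") {
  Sys.setlocale("LC_CTYPE", "")
}
st_options(lang = "en")
```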
Using the function use_custom_lang(), it is possible to add your own set of translations. To achieve this, get the csv template, customize the +/- 70 items, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom translations will not persist across R sessions. This means that you should always have this csv file handy for future use.
Sometimes, all you might want to do is change just a few keywords – for instance, you could prefer using “N” instead of “Freq” in the title row of freq() tables. For this, use define_keywords(). Calling this function without any arguments will bring up, on systems that support graphical devices (the vast majority, that is), an editable window allowing you to modify only the desired item(s).
After closing the edit window, you will be able to export the resulting “custom language” into a csv file that you can reuse in the future by calling use_custom_lang().
It is also possible to programmatically define one or several keywords using define_keywords(). For instance:
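A sketch of the “N” instead of “Freq” case mentioned earlier follows. The keyword name freq is an assumption about how that entry is named in the translation template; ?define_keywords lists the actual names.

```r
library(summarytools)

# Replace the "Freq" column title with "N"; the keyword name `freq`
# is assumed, not confirmed by the text
define_keywords(freq = "N")
freq(iris$Species)
```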
See ?define_keywords for more details.
Knowing how this vignette is configured can help users get started with using summarytools in Rmarkdown documents.
The output element is the one that matters:
## ---
## output: 
##   rmarkdown::html_vignette: 
##     css: 
##     - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", 
##                         package = "rmarkdown")
## ---

## ```{r setup, include=FALSE}
## library(knitr)
## opts_chunk$set(results = 'asis',      # Can also be set at the chunk-level
##                comment = NA,
##                prompt  = FALSE,
##                cache   = FALSE)
## library(summarytools)
## st_options(plain.ascii = FALSE,        # Always use this option in Rmd documents
##            style        = "rmarkdown", # Always use this option in Rmd documents
##            footnote     = NA,          # Makes html-rendered results more concise
##            subtitle.emphasis = FALSE)  # Improves layout with some rmarkdown themes
## ```

The needed CSS is automatically added to html files created using print() or view() with the file argument. But in Rmarkdown documents, this needs to be done explicitly:
## ```{r, echo=FALSE}
## st_css()
## ```

The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.
Check out the GitHub project’s page; from there you can see the latest updates and also submit feature requests.
For a preview of what’s coming in the next release, have a look at the development branch.