The tableone package is an R package that eases the construction of “Table 1”, i.e., the patient baseline characteristics table commonly found in biomedical research papers. The package can summarize both continuous and categorical variables mixed within one table. Categorical variables can be summarized as counts and/or percentages. Continuous variables can be summarized in the “normal” way (means and standard deviations) or the “nonnormal” way (medians and interquartile ranges).
A screencast demonstrating this vignette is available at: https://www.youtube.com/watch?v=IZgDKmOC0Wg&feature=youtu.be
## tableone package itself
library(tableone)
## survival package for Mayo Clinic's PBC data
library(survival)
data(pbc)
The simplest use case is summarizing the whole dataset. You can just feed in the data frame to the main workhorse function CreateTableOne(). You can see there are 418 patients in the dataset.
CreateTableOne(data = pbc)
                       Overall          
  n                        418          
  id (mean (sd))        209.50 (120.81) 
  time (mean (sd))     1917.78 (1104.67)
  status (mean (sd))      0.83 (0.96)   
  trt (mean (sd))         1.49 (0.50)   
  age (mean (sd))        50.74 (10.45)  
  sex = f (%)              374 (89.5)   
  ascites (mean (sd))     0.08 (0.27)   
  hepato (mean (sd))      0.51 (0.50)   
  spiders (mean (sd))     0.29 (0.45)   
  edema (mean (sd))       0.10 (0.25)   
  bili (mean (sd))        3.22 (4.41)   
  chol (mean (sd))      369.51 (231.94) 
  albumin (mean (sd))     3.50 (0.42)   
  copper (mean (sd))     97.65 (85.61)  
  alk.phos (mean (sd)) 1982.66 (2140.39)
  ast (mean (sd))       122.56 (56.70)  
  trig (mean (sd))      124.70 (65.15)  
  platelet (mean (sd))  257.02 (98.33)  
  protime (mean (sd))    10.73 (1.02)   
  stage (mean (sd))       3.02 (0.88)   
Most of the categorical variables are coded numerically, so we either have to transform them to factors in the dataset or use the factorVars argument to transform them on the fly. It is also better practice to specify which variables to summarize with the vars argument and to exclude the ID variable(s). How do we know which ones are numerically coded categorical variables? Please check your data dictionary (in this case help(pbc)). This time I am saving the result object in a variable.
## Get variable names
dput(names(pbc))
c("id", "time", "status", "trt", "age", "sex", "ascites", "hepato", 
"spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", 
"ast", "trig", "platelet", "protime", "stage")
## Vector of variables to summarize
myVars <- c("time", "status", "trt", "age", "sex", "ascites", "hepato",
          "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos",
          "ast", "trig", "platelet", "protime", "stage")
## Vector of categorical variables that need transformation
catVars <- c("status", "trt", "ascites", "hepato",
             "spiders", "edema", "stage")
## Create a TableOne object
tab2 <- CreateTableOne(vars = myVars, data = pbc, factorVars = catVars)
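The other route mentioned above, converting the columns to factors in the data frame itself, could look roughly like the following sketch. The data frame toy, its values, and the name toyCatVars are made up for illustration and merely stand in for pbc and catVars; only base R's factor() and lapply() are used.

```r
## Toy stand-in for pbc (values invented for illustration only)
toy <- data.frame(status = c(0, 1, 2, 0),
                  trt    = c(1, 2, NA, 1),
                  age    = c(58.8, 56.4, 70.1, 54.7))
## Numerically coded categorical columns (hypothetical name)
toyCatVars <- c("status", "trt")
## Convert them to factors in place; age stays numeric
toy[toyCatVars] <- lapply(toy[toyCatVars], factor)
str(toy)
```

With the columns stored as factors, the factorVars argument should no longer be needed for them.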
OK. It's more interpretable now. Binary categorical variables are summarized as counts and percentages of the second level. For example, if a variable is coded as 0 and 1, the “1” level is summarized. For variables with three or more categories, all levels are summarized. Please bear in mind that the percentages are calculated after excluding missing values.
tab2
                       Overall          
  n                        418          
  time (mean (sd))     1917.78 (1104.67)
  status (%)                            
     0                     232 (55.5)   
     1                      25 ( 6.0)   
     2                     161 (38.5)   
  trt = 2 (%)              154 (49.4)   
  age (mean (sd))        50.74 (10.45)  
  sex = f (%)              374 (89.5)   
  ascites = 1 (%)           24 ( 7.7)   
  hepato = 1 (%)           160 (51.3)   
  spiders = 1 (%)           90 (28.8)   
  edema (%)                             
     0                     354 (84.7)   
     0.5                    44 (10.5)   
     1                      20 ( 4.8)   
  bili (mean (sd))        3.22 (4.41)   
  chol (mean (sd))      369.51 (231.94) 
  albumin (mean (sd))     3.50 (0.42)   
  copper (mean (sd))     97.65 (85.61)  
  alk.phos (mean (sd)) 1982.66 (2140.39)
  ast (mean (sd))       122.56 (56.70)  
  trig (mean (sd))      124.70 (65.15)  
  platelet (mean (sd))  257.02 (98.33)  
  protime (mean (sd))    10.73 (1.02)   
  stage (%)                             
     1                      21 ( 5.1)   
     2                      92 (22.3)   
     3                     155 (37.6)   
     4                     144 (35.0)   
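To make the denominator explicit, here is a small base R sketch that rebuilds the trt variable from counts reported in this vignette (158 in group 1, 154 in group 2, and 106 missing out of 418) and computes the percentage both ways. The object names are hypothetical, and this is only meant to illustrate the arithmetic, not the package internals.

```r
## Rebuild trt from the reported counts: 158, 154, and 106 missing
trt <- factor(c(rep(1, 158), rep(2, 154), rep(NA, 106)))
## table() drops NA by default, so these percentages exclude missing values
pctNonMissing <- round(100 * prop.table(table(trt)), 1)
## For comparison: percentages with all 418 patients in the denominator
pctAll <- round(100 * table(trt) / length(trt), 1)
pctNonMissing
pctAll
```

The first version gives the 49.4% shown for trt = 2; with missing values kept in the denominator the same count would be only 36.8%.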
If you want to show all levels, you can use the showAllLevels argument to the print() method.
print(tab2, showAllLevels = TRUE)
                       level Overall          
  n                              418          
  time (mean (sd))           1917.78 (1104.67)
  status (%)           0         232 (55.5)   
                       1          25 ( 6.0)   
                       2         161 (38.5)   
  trt (%)              1         158 (50.6)   
                       2         154 (49.4)   
  age (mean (sd))              50.74 (10.45)  
  sex (%)              m          44 (10.5)   
                       f         374 (89.5)   
  ascites (%)          0         288 (92.3)   
                       1          24 ( 7.7)   
  hepato (%)           0         152 (48.7)   
                       1         160 (51.3)   
  spiders (%)          0         222 (71.2)   
                       1          90 (28.8)   
  edema (%)            0         354 (84.7)   
                       0.5        44 (10.5)   
                       1          20 ( 4.8)   
  bili (mean (sd))              3.22 (4.41)   
  chol (mean (sd))            369.51 (231.94) 
  albumin (mean (sd))           3.50 (0.42)   
  copper (mean (sd))           97.65 (85.61)  
  alk.phos (mean (sd))       1982.66 (2140.39)
  ast (mean (sd))             122.56 (56.70)  
  trig (mean (sd))            124.70 (65.15)  
  platelet (mean (sd))        257.02 (98.33)  
  protime (mean (sd))          10.73 (1.02)   
  stage (%)            1          21 ( 5.1)   
                       2          92 (22.3)   
                       3         155 (37.6)   
                       4         144 (35.0)   
If you need more detailed information, including the number and proportion missing, use the summary() method on the result object. The continuous variables are shown first, and the categorical variables are shown second.
summary(tab2)
     ### Summary of continuous variables ###
strata: Overall
           n miss p.miss mean     sd median    p25  p75   min   max  skew kurt
time     418    0    0.0 1918 1104.7   1730 1092.8 2614  41.0  4795  0.47 -0.5
age      418    0    0.0   51   10.4     51   42.8   58  26.3    78  0.09 -0.6
bili     418    0    0.0    3    4.4      1    0.8    3   0.3    28  2.72  8.1
chol     418  134   32.1  370  231.9    310  249.5  400 120.0  1775  3.41 14.3
albumin  418    0    0.0    3    0.4      4    3.2    4   2.0     5 -0.47  0.6
copper   418  108   25.8   98   85.6     73   41.2  123   4.0   588  2.30  7.6
alk.phos 418  106   25.4 1983 2140.4   1259  871.5 1980 289.0 13862  2.99  9.7
ast      418  106   25.4  123   56.7    115   80.6  152  26.4   457  1.45  4.3
trig     418  136   32.5  125   65.1    108   84.2  151  33.0   598  2.52 11.8
platelet 418   11    2.6  257   98.3    251  188.5  318  62.0   721  0.63  0.9
protime  418    2    0.5   11    1.0     11   10.0   11   9.0    18  2.22 10.0
=======================================================================================
     ### Summary of categorical variables ### 
strata: Overall
     var   n miss p.miss level freq percent cum.percent
  status 418    0    0.0     0  232    55.5        55.5
                             1   25     6.0        61.5
                             2  161    38.5       100.0
     trt 418  106   25.4     1  158    50.6        50.6
                             2  154    49.4       100.0
     sex 418    0    0.0     m   44    10.5        10.5
                             f  374    89.5       100.0
 ascites 418  106   25.4     0  288    92.3        92.3
                             1   24     7.7       100.0
  hepato 418  106   25.4     0  152    48.7        48.7
                             1  160    51.3       100.0
 spiders 418  106   25.4     0  222    71.2        71.2
                             1   90    28.8       100.0
   edema 418    0    0.0     0  354    84.7        84.7
                           0.5   44    10.5        95.2
                             1   20     4.8       100.0
   stage 418    6    1.4     1   21     5.1         5.1
                             2   92    22.3        27.4
                             3  155    37.6        65.0
                             4  144    35.0       100.0
It looks like most of the continuous variables are highly skewed except time, age, albumin, and platelet (biomarkers are usually distributed with strong positive skews). Summarizing them as such may please your future peer reviewer(s). Let's do it with the nonnormal argument to the print() method. Can you see the difference? If you just say nonnormal = TRUE, all variables are summarized the “nonnormal” way.
biomarkers <- c("bili","chol","copper","alk.phos","ast","trig","protime")
print(tab2, nonnormal = biomarkers)
                          Overall                  
  n                           418                  
  time (mean (sd))        1917.78 (1104.67)        
  status (%)                                       
     0                        232 (55.5)           
     1                         25 ( 6.0)           
     2                        161 (38.5)           
  trt = 2 (%)                 154 (49.4)           
  age (mean (sd))           50.74 (10.45)          
  sex = f (%)                 374 (89.5)           
  ascites = 1 (%)              24 ( 7.7)           
  hepato = 1 (%)              160 (51.3)           
  spiders = 1 (%)              90 (28.8)           
  edema (%)                                        
     0                        354 (84.7)           
     0.5                       44 (10.5)           
     1                         20 ( 4.8)           
  bili (median [IQR])        1.40 [0.80, 3.40]     
  chol (median [IQR])      309.50 [249.50, 400.00] 
  albumin (mean (sd))        3.50 (0.42)           
  copper (median [IQR])     73.00 [41.25, 123.00]  
  alk.phos (median [IQR]) 1259.00 [871.50, 1980.00]
  ast (median [IQR])       114.70 [80.60, 151.90]  
  trig (median [IQR])      108.00 [84.25, 151.00]  
  platelet (mean (sd))     257.02 (98.33)          
  protime (median [IQR])    10.60 [10.00, 11.10]   
  stage (%)                                        
     1                         21 ( 5.1)           
     2                         92 (22.3)           
     3                        155 (37.6)           
     4                        144 (35.0)           
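As a rough illustration of why the two summaries differ for skewed data, the base R sketch below contrasts them on a made-up vector. The values are invented, and the default quantile() definition (type 7) is an assumption here; it appears consistent with quartiles such as 41.25 in the output above.

```r
## Toy right-skewed values (invented for illustration)
x <- c(1, 2, 3, 4, 100)
## "Normal" summary: pulled upward by the single large value
normalSummary <- c(mean = mean(x), sd = sd(x))
## "Nonnormal" summary: median with the 25th and 75th percentiles
nonnormalSummary <- quantile(x, probs = c(0.5, 0.25, 0.75))
normalSummary
nonnormalSummary
```

The mean is 22 while the median is 3 with quartiles 2 and 4, so the median and IQR describe the bulk of the data much better in this case.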
If you want to fine tune the table further, please check out ?print.TableOne for the full list of options.
Often you want to group patients and summarize group by group. It's also pretty simple. Grouping by exposure categories is probably the most common way, so let's do it by the treatment variable. According to ?pbc, it is coded as (1) D-penicillmain (it's really “D-penicillamine”), (2) placebo, and (NA) not randomized. NA does not function as a grouping level, so the patients with NA are dropped. If you do want to show the result for the NA group, you need to recode it to something other than NA.
tab3 <- CreateTableOne(vars = myVars, strata = "trt" , data = pbc, factorVars = catVars)
tab3
                      Stratified by trt
                       1                 2                 p      test
  n                        158               154                      
  time (mean (sd))     2015.62 (1094.12) 1996.86 (1155.93)  0.883     
  status (%)                                                0.894     
     0                      83 (52.5)         85 ( 55.2)              
     1                      10 ( 6.3)          9 (  5.8)              
     2                      65 (41.1)         60 ( 39.0)              
  trt = 2 (%)                0 ( 0.0)        154 (100.0)   <0.001     
  age (mean (sd))        51.42 (11.01)     48.58 (9.96)     0.018     
  sex = f (%)              137 (86.7)        139 ( 90.3)    0.421     
  ascites = 1 (%)           14 ( 8.9)         10 (  6.5)    0.567     
  hepato = 1 (%)            73 (46.2)         87 ( 56.5)    0.088     
  spiders = 1 (%)           45 (28.5)         45 ( 29.2)    0.985     
  edema (%)                                                 0.877     
     0                     132 (83.5)        131 ( 85.1)              
     0.5                    16 (10.1)         13 (  8.4)              
     1                      10 ( 6.3)         10 (  6.5)              
  bili (mean (sd))        2.87 (3.63)       3.65 (5.28)     0.131     
  chol (mean (sd))      365.01 (209.54)   373.88 (252.48)   0.748     
  albumin (mean (sd))     3.52 (0.44)       3.52 (0.40)     0.874     
  copper (mean (sd))     97.64 (90.59)     97.65 (80.49)    0.999     
  alk.phos (mean (sd)) 2021.30 (2183.44) 1943.01 (2101.69)  0.747     
  ast (mean (sd))       120.21 (54.52)    124.97 (58.93)    0.460     
  trig (mean (sd))      124.14 (71.54)    125.25 (58.52)    0.886     
  platelet (mean (sd))  258.75 (100.32)   265.20 (90.73)    0.555     
  protime (mean (sd))    10.65 (0.85)      10.80 (1.14)     0.197     
  stage (%)                                                 0.201     
     1                      12 ( 7.6)          4 (  2.6)              
     2                      35 (22.2)         32 ( 20.8)              
     3                      56 (35.4)         64 ( 41.6)              
     4                      55 (34.8)         54 ( 35.1)              
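If the not-randomized patients should appear as a group of their own, the recoding could be sketched as below. The vector toyTrt, the label "not randomized", and the column name trtGroup are hypothetical choices for illustration; the commented-out call only indicates how the recoded column might then be used.

```r
## Toy treatment codes with missing values (invented for illustration)
toyTrt <- c(1, 2, NA, 1, NA, 2)
## Replace NA with an explicit label so it can serve as a group
trtGroup <- ifelse(is.na(toyTrt), "not randomized", as.character(toyTrt))
trtGroup <- factor(trtGroup, levels = c("1", "2", "not randomized"))
table(trtGroup)
## Not run: after adding such a column to pbc, it could be passed to strata
## CreateTableOne(vars = myVars, strata = "trtGroup", data = pbc, factorVars = catVars)
```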
As you can see in the previous table, when there are two or more groups, group comparison p-values are printed along with the table (well, let's not argue the appropriateness of hypothesis testing for Table 1 in an RCT for now). Very small p-values are shown with the less-than sign. The hypothesis test functions used by default are chisq.test() for categorical variables (with continuity correction) and oneway.test() for continuous variables (with the equal variance assumption, i.e., regular ANOVA). Two-group ANOVA is equivalent to a t-test.
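The default tests can be tried directly in base R. The sketch below first rebuilds the 2 x 2 table of sex by trt from the counts printed above (males are the group sizes minus the females, 158 - 137 and 154 - 139), then checks the ANOVA and t-test equivalence on the built-in ToothGrowth data, which is used here only as a convenient two-group example. Object names are hypothetical.

```r
## 2 x 2 table of sex by trt, rebuilt from the printed counts
sexByTrt <- rbind(m = c(21, 15), f = c(137, 139))
colnames(sexByTrt) <- c("1", "2")
## Chi-squared test with continuity correction
pSex <- chisq.test(sexByTrt, correct = TRUE)$p.value
round(pSex, 3)

## Two-group equal-variance ANOVA versus equal-variance t-test
pAnova <- oneway.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)$p.value
pTtest <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)$p.value
all.equal(pAnova, pTtest)
```

The first p-value should agree with the 0.421 reported for sex, and the last line should confirm that the two-group ANOVA and the t-test give the same p-value.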
You may be worried about the nonnormal variables and the small cell counts in the stage variable. In such a situation, you can use the nonnormal argument as before, as well as the exact argument of the print() method. Now kruskal.test() is used for the nonnormal continuous variables and fisher.test() is used for the categorical variables specified in the exact argument. kruskal.test() is equivalent to wilcox.test() in the two-group case. The column named test indicates which p-values were calculated using the non-default tests.
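A small caveat on the equivalence: as far as I can tell, the p-values match exactly only when wilcox.test() uses the normal approximation without continuity correction, which is not its default. The sketch below checks this on the built-in ToothGrowth data (chosen only as a handy two-group example) and also runs fisher.test() on the stage counts copied from the table above. Object names are hypothetical.

```r
## Kruskal-Wallis test with two groups
pKruskal <- kruskal.test(len ~ supp, data = ToothGrowth)$p.value
## Wilcoxon rank sum test, normal approximation, no continuity correction
pWilcox <- wilcox.test(len ~ supp, data = ToothGrowth,
                       exact = FALSE, correct = FALSE)$p.value
all.equal(pKruskal, pWilcox)

## Fisher's exact test on stage by trt (counts from the stratified table)
stageByTrt <- rbind("1" = c(12, 4), "2" = c(35, 32),
                    "3" = c(56, 64), "4" = c(55, 54))
pStage <- fisher.test(stageByTrt)$p.value
round(pStage, 3)
```

The exact-test p-value should be close to the 0.205 that tableone reports for stage when exact = "stage" is given.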
To also show standardized mean differences, use the smd option.
print(tab3, nonnormal = biomarkers, exact = "stage", smd = TRUE)
                         Stratified by trt
                          1                         2                         p      test    SMD   
  n                           158                       154                                        
  time (mean (sd))        2015.62 (1094.12)         1996.86 (1155.93)          0.883          0.017
  status (%)                                                                   0.894          0.054
     0                         83 (52.5)                 85 ( 55.2)                                
     1                         10 ( 6.3)                  9 (  5.8)                                
     2                         65 (41.1)                 60 ( 39.0)                                
  trt = 2 (%)                   0 ( 0.0)                154 (100.0)           <0.001          NaN  
  age (mean (sd))           51.42 (11.01)             48.58 (9.96)             0.018          0.270
  sex = f (%)                 137 (86.7)                139 ( 90.3)            0.421          0.111
  ascites = 1 (%)              14 ( 8.9)                 10 (  6.5)            0.567          0.089
  hepato = 1 (%)               73 (46.2)                 87 ( 56.5)            0.088          0.207
  spiders = 1 (%)              45 (28.5)                 45 ( 29.2)            0.985          0.016
  edema (%)                                                                    0.877          0.058
     0                        132 (83.5)                131 ( 85.1)                                
     0.5                       16 (10.1)                 13 (  8.4)                                
     1                         10 ( 6.3)                 10 (  6.5)                                
  bili (median [IQR])        1.40 [0.80, 3.20]         1.30 [0.72, 3.60]       0.842 nonnorm  0.171
  chol (median [IQR])      315.50 [247.75, 417.00]   303.50 [254.25, 377.00]   0.544 nonnorm  0.038
  albumin (mean (sd))        3.52 (0.44)               3.52 (0.40)             0.874          0.018
  copper (median [IQR])     73.00 [40.00, 121.00]     73.00 [43.00, 139.00]    0.717 nonnorm <0.001
  alk.phos (median [IQR]) 1214.50 [840.75, 2028.00] 1283.00 [922.50, 1949.75]  0.812 nonnorm  0.037
  ast (median [IQR])       111.60 [76.73, 151.51]    117.40 [83.78, 151.90]    0.459 nonnorm  0.084
  trig (median [IQR])      106.00 [84.50, 146.00]    113.00 [84.50, 155.00]    0.370 nonnorm  0.017
  platelet (mean (sd))     258.75 (100.32)           265.20 (90.73)            0.555          0.067
  protime (median [IQR])    10.60 [10.03, 11.00]      10.60 [10.00, 11.40]     0.588 nonnorm  0.146
  stage (%)                                                                    0.205 exact    0.246
     1                         12 ( 7.6)                  4 (  2.6)                                
     2                         35 (22.2)                 32 ( 20.8)                                
     3                         56 (35.4)                 64 ( 41.6)                                
     4                         55 (34.8)                 54 ( 35.1)                                
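For readers wondering what the SMD column contains, the following sketch applies the usual two-group definitions (absolute difference divided by the square root of the average of the two variances) to numbers printed in the table. That tableone uses exactly these formulas is an assumption on my part, although the age and sex rows appear to agree with it; the function names are hypothetical.

```r
## Continuous variable: difference in means over the pooled SD
smdContinuous <- function(m1, s1, m2, s2) {
  abs(m1 - m2) / sqrt((s1^2 + s2^2) / 2)
}
## Binary variable: same idea with p * (1 - p) as the variance
smdBinary <- function(p1, p2) {
  abs(p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
}
## Age: means and SDs as printed (already rounded)
smdAge <- smdContinuous(51.42, 11.01, 48.58, 9.96)
## Sex: proportions of females from the printed counts
smdSex <- smdBinary(137 / 158, 139 / 154)
round(c(age = smdAge, sex = smdSex), 2)
```

These come out at about 0.27 and 0.11, in line with the table. Note also that copper keeps an SMD of <0.001 even though it is printed as median [IQR], which fits its nearly identical group means (97.64 and 97.65) and suggests the SMD does not change with the nonnormal argument.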
My typical next step is to export the table to Excel for editing, and then to Word (clinical medical journals usually do not offer LaTeX submission).
The quick and dirty way that I usually use is copy and paste. Use the quote = TRUE argument to show the quotes and noSpaces = TRUE to remove the spaces used to align text in the R console (the latter is optional). Now you can just copy and paste the whole thing into an Excel spreadsheet. After pasting, click the small pasting icon and choose Use Text Import Wizard…; in the dialogue you can just click Finish to fit the values into the appropriate cells. Then you can edit or re-align things as you like. I usually center-align the group summaries and right-align the p-values.
print(tab3, nonnormal = biomarkers, exact = "stage", quote = TRUE, noSpaces = TRUE)
                           "Stratified by trt"
 ""                         "1"                         "2"                         "p"      "test"   
  "n"                       "158"                       "154"                       ""       ""       
  "time (mean (sd))"        "2015.62 (1094.12)"         "1996.86 (1155.93)"         "0.883"  ""       
  "status (%)"              ""                          ""                          "0.894"  ""       
  "   0"                    "83 (52.5)"                 "85 (55.2)"                 ""       ""       
  "   1"                    "10 (6.3)"                  "9 (5.8)"                   ""       ""       
  "   2"                    "65 (41.1)"                 "60 (39.0)"                 ""       ""       
  "trt = 2 (%)"             "0 (0.0)"                   "154 (100.0)"               "<0.001" ""       
  "age (mean (sd))"         "51.42 (11.01)"             "48.58 (9.96)"              "0.018"  ""       
  "sex = f (%)"             "137 (86.7)"                "139 (90.3)"                "0.421"  ""       
  "ascites = 1 (%)"         "14 (8.9)"                  "10 (6.5)"                  "0.567"  ""       
  "hepato = 1 (%)"          "73 (46.2)"                 "87 (56.5)"                 "0.088"  ""       
  "spiders = 1 (%)"         "45 (28.5)"                 "45 (29.2)"                 "0.985"  ""       
  "edema (%)"               ""                          ""                          "0.877"  ""       
  "   0"                    "132 (83.5)"                "131 (85.1)"                ""       ""       
  "   0.5"                  "16 (10.1)"                 "13 (8.4)"                  ""       ""       
  "   1"                    "10 (6.3)"                  "10 (6.5)"                  ""       ""       
  "bili (median [IQR])"     "1.40 [0.80, 3.20]"         "1.30 [0.72, 3.60]"         "0.842"  "nonnorm"
  "chol (median [IQR])"     "315.50 [247.75, 417.00]"   "303.50 [254.25, 377.00]"   "0.544"  "nonnorm"
  "albumin (mean (sd))"     "3.52 (0.44)"               "3.52 (0.40)"               "0.874"  ""       
  "copper (median [IQR])"   "73.00 [40.00, 121.00]"     "73.00 [43.00, 139.00]"     "0.717"  "nonnorm"
  "alk.phos (median [IQR])" "1214.50 [840.75, 2028.00]" "1283.00 [922.50, 1949.75]" "0.812"  "nonnorm"
  "ast (median [IQR])"      "111.60 [76.73, 151.51]"    "117.40 [83.78, 151.90]"    "0.459"  "nonnorm"
  "trig (median [IQR])"     "106.00 [84.50, 146.00]"    "113.00 [84.50, 155.00]"    "0.370"  "nonnorm"
  "platelet (mean (sd))"    "258.75 (100.32)"           "265.20 (90.73)"            "0.555"  ""       
  "protime (median [IQR])"  "10.60 [10.03, 11.00]"      "10.60 [10.00, 11.40]"      "0.588"  "nonnorm"
  "stage (%)"               ""                          ""                          "0.205"  "exact"  
  "   1"                    "12 (7.6)"                  "4 (2.6)"                   ""       ""       
  "   2"                    "35 (22.2)"                 "32 (20.8)"                 ""       ""       
  "   3"                    "56 (35.4)"                 "64 (41.6)"                 ""       ""       
  "   4"                    "55 (34.8)"                 "54 (35.1)"                 ""       ""       
If you do not like the manual labor of copy-and-paste, you can automate the task in the following way. The print() method for a TableOne object invisibly returns a matrix identical to what you see. You can capture this by assigning it to a variable (here tab3Mat). Do not use the quote argument in this case; the noSpaces argument is again optional. The self-contradictory printToggle = FALSE for the print() method avoids unnecessary printing if you wish. Then you can save the object to a CSV file. As it is a regular matrix object, you can also save it to an Excel file using packages such as XLConnect.
tab3Mat <- print(tab3, nonnormal = biomarkers, exact = "stage", quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
## Save to a CSV file
write.csv(tab3Mat, file = "myTable.csv")
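To see what ends up in the file, here is a small round trip with a toy character matrix shaped like the one returned by print(). The matrix toyMat holds two rows copied from the stratified table, and the file goes to a temporary path; all object names are hypothetical.

```r
## Toy character matrix shaped like a printed TableOne matrix
toyMat <- matrix(c("158", "154",
                   "51.42 (11.01)", "48.58 (9.96)"),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("n", "age (mean (sd))"), c("1", "2")))
## Write to a temporary file; row names become the first CSV column
csvPath <- tempfile(fileext = ".csv")
write.csv(toyMat, file = csvPath)
## Read back as text so that cells such as "158" stay character strings
toyBack <- read.csv(csvPath, row.names = 1, check.names = FALSE,
                    colClasses = "character")
toyBack
```

Reading everything as character keeps the formatted cells exactly as they were printed, which is usually what you want for a table that will only be edited cosmetically.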
You may want to see the categorical or continuous variables only. You can do this by accessing the CatTable part and the ContTable part of the TableOne object as follows. summary() methods are defined for both, as well as print() methods with various arguments. Please see ?print.CatTable and ?print.ContTable for details.
## Categorical part only
tab3$CatTable
                 Stratified by trt
                  1           2            p      test
  n               158         154                     
  status (%)                                0.894     
     0             83 (52.5)   85 ( 55.2)             
     1             10 ( 6.3)    9 (  5.8)             
     2             65 (41.1)   60 ( 39.0)             
  trt = 2 (%)       0 ( 0.0)  154 (100.0)  <0.001     
  sex = f (%)     137 (86.7)  139 ( 90.3)   0.421     
  ascites = 1 (%)  14 ( 8.9)   10 (  6.5)   0.567     
  hepato = 1 (%)   73 (46.2)   87 ( 56.5)   0.088     
  spiders = 1 (%)  45 (28.5)   45 ( 29.2)   0.985     
  edema (%)                                 0.877     
     0            132 (83.5)  131 ( 85.1)             
     0.5           16 (10.1)   13 (  8.4)             
     1             10 ( 6.3)   10 (  6.5)             
  stage (%)                                 0.201     
     1             12 ( 7.6)    4 (  2.6)             
     2             35 (22.2)   32 ( 20.8)             
     3             56 (35.4)   64 ( 41.6)             
     4             55 (34.8)   54 ( 35.1)             
## Continuous part only
print(tab3$ContTable, nonnormal = biomarkers)
                         Stratified by trt
                          1                         2                         p      test   
  n                       158                       154                                     
  time (mean (sd))        2015.62 (1094.12)         1996.86 (1155.93)          0.883        
  age (mean (sd))           51.42 (11.01)             48.58 (9.96)             0.018        
  bili (median [IQR])        1.40 [0.80, 3.20]         1.30 [0.72, 3.60]       0.842 nonnorm
  chol (median [IQR])      315.50 [247.75, 417.00]   303.50 [254.25, 377.00]   0.544 nonnorm
  albumin (mean (sd))        3.52 (0.44)               3.52 (0.40)             0.874        
  copper (median [IQR])     73.00 [40.00, 121.00]     73.00 [43.00, 139.00]    0.717 nonnorm
  alk.phos (median [IQR]) 1214.50 [840.75, 2028.00] 1283.00 [922.50, 1949.75]  0.812 nonnorm
  ast (median [IQR])       111.60 [76.73, 151.51]    117.40 [83.78, 151.90]    0.459 nonnorm
  trig (median [IQR])      106.00 [84.50, 146.00]    113.00 [84.50, 155.00]    0.370 nonnorm
  platelet (mean (sd))     258.75 (100.32)           265.20 (90.73)            0.555        
  protime (median [IQR])    10.60 [10.03, 11.00]      10.60 [10.00, 11.40]     0.588 nonnorm