In this vignette, we demonstrate how complex structural types in Presto can be translated into R types (e.g., vectors, list, and tibbles).
library(RPresto)You can check your RPresto version by running the
packageVersion() function. You need version 1.3.9
or later to have a more comprehensive and robust complex types
support.
packageVersion("RPresto")
#> [1] '1.4.5'Complex types refer to structural
types including ARRAY, MAP and
ROW. Those data types are basically containers to hold
other data types (hence complex).
We summarize the similarities and differences between the complex types below.
| Type | Is atomic? | Has names/keys? |
|---|---|---|
| ARRAY | Yes | No |
| MAP | Yes | Yes |
| ROW | No | Yes |
Atomic here means all elements in the container share the same type
(usually primitive types, but can be complex types too). For example, an
ARRAY of integer can only hold integer type elements
whereas a ROW can have elements of different types (e.g.,
one element is integer and the other boolean).
We assume that the user already have a Presto server with a memory connector set up. If you don’t have such a server set up, refer to the Presto documentation for instructions if you want to follow along.
We first create a PrestoConnection which will serve as
the bridge between the Presto server and R.
con <- DBI::dbConnect(
drv = RPresto::Presto(),
host = "http://localhost",
port = 8080,
user = Sys.getenv("USER"),
catalog = "memory",
schema = "default"
)We first issue a simple query to see if the Presto connection is working properly.
DBI::dbGetQuery(con, "SELECT 1+1 AS res")
#> # A tibble: 1 × 1
#> res
#> <int>
#> 1 2| Presto type | R type |
|---|---|
| ARRAY | unnamed typed vector |
| MAP | named typed vector |
ARRAYsGiven the atomic and unnamed nature of ARRAYs, we map
them to unnamed typed vectors in R.
ARRAYs of primitive typesWe first create a table with ARRAYs of all supported
primitive Presto data types using the
create_primitive_arrays_table() function included in the
RPresto package.
RPresto:::create_primitive_arrays_table(
con, table_name = "presto_primitive_arrays", verbose = FALSE
)We can check if the table now exists in Presto.
DBI::dbExistsTable(con, "presto_primitive_arrays")
#> [1] TRUEWe can list the fields in the table. They are named after the Presto types they represent.
DBI::dbListFields(con, "presto_primitive_arrays")
#> [1] "array_boolean" "array_tinyint"
#> [3] "array_smallint" "array_integer"
#> [5] "array_bigint" "array_real"
#> [7] "array_double" "array_decimal"
#> [9] "array_varchar" "array_char"
#> [11] "array_varbinary" "array_date"
#> [13] "array_time" "array_time_with_tz"
#> [15] "array_timestamp" "array_timestamp_with_tz"
#> [17] "array_interval_year_to_month" "array_interval_day_to_second"ARRAYs to R vectors(
df.array_of_primitive_types <- dbGetQuery(
con,
"SELECT * FROM presto_primitive_arrays",
bigint = "integer64"
)
)
#> # A tibble: 1 × 18
#> array_boolean array_tinyint array_smallint array_integer array_bigint
#> <list> <list> <list> <list> <list>
#> 1 <lgl [3]> <int [3]> <int [3]> <int [3]> <int64 [3]>
#> # ℹ 13 more variables: array_real <list>, array_double <list>,
#> # array_decimal <list>, array_varchar <list>, array_char <list>,
#> # array_varbinary <list>, array_date <list>, array_time <list>,
#> # array_time_with_tz <list>, array_timestamp <list>,
#> # array_timestamp_with_tz <list>, array_interval_year_to_month <list>,
#> # array_interval_day_to_second <list>We can verify the R types of each column.
tibble::enframe(purrr::map_chr(df.array_of_primitive_types, ~class(.[[1]])[1]))
#> # A tibble: 18 × 2
#> name value
#> <chr> <chr>
#> 1 array_boolean logical
#> 2 array_tinyint integer
#> 3 array_smallint integer
#> 4 array_integer integer
#> 5 array_bigint integer64
#> 6 array_real numeric
#> 7 array_double numeric
#> 8 array_decimal character
#> 9 array_varchar character
#> 10 array_char character
#> 11 array_varbinary list
#> 12 array_date Date
#> 13 array_time hms
#> 14 array_time_with_tz hms
#> 15 array_timestamp POSIXct
#> 16 array_timestamp_with_tz POSIXct
#> 17 array_interval_year_to_month Duration
#> 18 array_interval_day_to_second DurationAll vectors are unnamed.
purrr::every(df.array_of_primitive_types, ~is.null(names(.[[1]])))
#> [1] TRUEWe can also call functions such as length() on each of
the columns to get the ARRAY cardinality. It shows that all
ARRAYs have 3 elements in them.
tibble::enframe(purrr::map_int(df.array_of_primitive_types, ~length(.[[1]])))
#> # A tibble: 18 × 2
#> name value
#> <chr> <int>
#> 1 array_boolean 3
#> 2 array_tinyint 3
#> 3 array_smallint 3
#> 4 array_integer 3
#> 5 array_bigint 3
#> 6 array_real 3
#> 7 array_double 3
#> 8 array_decimal 3
#> 9 array_varchar 3
#> 10 array_char 3
#> 11 array_varbinary 3
#> 12 array_date 3
#> 13 array_time 3
#> 14 array_time_with_tz 3
#> 15 array_timestamp 3
#> 16 array_timestamp_with_tz 3
#> 17 array_interval_year_to_month 3
#> 18 array_interval_day_to_second 3MAPsA MAP in Presto can be thought as a combination of two
same-length ARRAYs. The first ARRAY contains
the keys of the MAP and the second contains the values of
the MAP. In fact, that’s exactly how MAP
literals are created in Presto (e.g.,
MAP(ARRAY[1, 2], ARRAY['a', 'b']) creates a 2-element
MAP).
Following the logic, we translate MAPs to named
typed vectors in R.
MAPS of primitive typesWe first create a table with MAPSs of all supported
primitive Presto data types using the
create_primitive_maps_table() function included in the
RPresto package.
RPresto:::create_primitive_maps_table(
con, table_name = "presto_primitive_maps", verbose = FALSE
)We can check if the table now exists in Presto.
DBI::dbExistsTable(con, "presto_primitive_maps")
#> [1] TRUEWe can list the fields in the table. They are named after the Presto types they represent.
DBI::dbListFields(con, "presto_primitive_maps")
#> [1] "map_boolean" "map_tinyint"
#> [3] "map_smallint" "map_integer"
#> [5] "map_bigint" "map_real"
#> [7] "map_double" "map_decimal"
#> [9] "map_varchar" "map_char"
#> [11] "map_varbinary" "map_date"
#> [13] "map_time" "map_time_with_tz"
#> [15] "map_timestamp" "map_timestamp_with_tz"
#> [17] "map_interval_year_to_month" "map_interval_day_to_second"MAPs to R vectors(
df.map_of_primitive_types <- dbGetQuery(
con,
"SELECT * FROM presto_primitive_maps",
bigint = "integer64"
)
)
#> # A tibble: 1 × 18
#> map_boolean map_tinyint map_smallint map_integer map_bigint map_real
#> <list> <list> <list> <list> <list> <list>
#> 1 <lgl [3]> <int [3]> <int [3]> <int [3]> <int64 [3]> <dbl [3]>
#> # ℹ 12 more variables: map_double <list>, map_decimal <list>,
#> # map_varchar <list>, map_char <list>, map_varbinary <list>, map_date <list>,
#> # map_time <list>, map_time_with_tz <list>, map_timestamp <list>,
#> # map_timestamp_with_tz <list>, map_interval_year_to_month <list>,
#> # map_interval_day_to_second <list>We can verify the R types of each column.
tibble::enframe(purrr::map_chr(df.map_of_primitive_types, ~class(.[[1]])[1]))
#> # A tibble: 18 × 2
#> name value
#> <chr> <chr>
#> 1 map_boolean logical
#> 2 map_tinyint integer
#> 3 map_smallint integer
#> 4 map_integer integer
#> 5 map_bigint integer64
#> 6 map_real numeric
#> 7 map_double numeric
#> 8 map_decimal character
#> 9 map_varchar character
#> 10 map_char character
#> 11 map_varbinary list
#> 12 map_date Date
#> 13 map_time hms
#> 14 map_time_with_tz hms
#> 15 map_timestamp POSIXct
#> 16 map_timestamp_with_tz POSIXct
#> 17 map_interval_year_to_month Duration
#> 18 map_interval_day_to_second DurationAll vectors are named.
purrr::none(df.map_of_primitive_types, ~is.null(names(.[[1]])))
#> [1] TRUEARRAYs and MAPsIt’s possible to have repeated ARRAYs and
MAPs in Presto in the form of ARRAYs of
ARRAYs and ARRAYs of MAPs.
| Repeated Presto type | R type |
|---|---|
| ARRAY of ARRAY | not supported |
| ARRAY of MAP | unnamed list of named typed vectors |
We are not supporting nested ARRAYs at the moment
although it’s technically possible in Presto.
For ARRAYs of MAPs, we translate the
ARRAY container into an unnamed list and each of the
MAP element into a named typed vector.
We first create an array-of-maps table by using the
create_array_of_maps_table() function.
RPresto:::create_array_of_maps_table(
con, table_name = "presto_array_of_maps", verbose = FALSE
)We can check if the table now exists in Presto.
DBI::dbExistsTable(con, "presto_array_of_maps")
#> [1] TRUEWe can list the fields in the table.
DBI::dbListFields(con, "presto_array_of_maps")
#> [1] "array_map_boolean" "array_map_tinyint"
#> [3] "array_map_smallint" "array_map_integer"
#> [5] "array_map_bigint" "array_map_real"
#> [7] "array_map_double" "array_map_decimal"
#> [9] "array_map_varchar" "array_map_char"
#> [11] "array_map_varbinary" "array_map_date"
#> [13] "array_map_time" "array_map_time_with_tz"
#> [15] "array_map_timestamp" "array_map_timestamp_with_tz"
#> [17] "array_map_interval_year_to_month" "array_map_interval_day_to_second"Let’s import all the data into R.
(
df.array_of_maps <- dbGetQuery(
con,
"SELECT * FROM presto_array_of_maps",
bigint = "integer64"
)
)
#> # A tibble: 1 × 18
#> array_map_boolean array_map_tinyint array_map_smallint array_map_integer
#> <list> <list> <list> <list>
#> 1 <list [2]> <list [2]> <list [2]> <list [2]>
#> # ℹ 14 more variables: array_map_bigint <list>, array_map_real <list>,
#> # array_map_double <list>, array_map_decimal <list>,
#> # array_map_varchar <list>, array_map_char <list>,
#> # array_map_varbinary <list>, array_map_date <list>, array_map_time <list>,
#> # array_map_time_with_tz <list>, array_map_timestamp <list>,
#> # array_map_timestamp_with_tz <list>,
#> # array_map_interval_year_to_month <list>, …We need to pry open the wrapping unnamed list to reveal the types of the vectors underneath.
tibble::enframe(purrr::map_chr(df.array_of_maps, ~class(.[[1]][[1]])[1]))
#> # A tibble: 18 × 2
#> name value
#> <chr> <chr>
#> 1 array_map_boolean logical
#> 2 array_map_tinyint integer
#> 3 array_map_smallint integer
#> 4 array_map_integer integer
#> 5 array_map_bigint integer64
#> 6 array_map_real numeric
#> 7 array_map_double numeric
#> 8 array_map_decimal character
#> 9 array_map_varchar character
#> 10 array_map_char character
#> 11 array_map_varbinary list
#> 12 array_map_date Date
#> 13 array_map_time hms
#> 14 array_map_time_with_tz hms
#> 15 array_map_timestamp POSIXct
#> 16 array_map_timestamp_with_tz POSIXct
#> 17 array_map_interval_year_to_month Duration
#> 18 array_map_interval_day_to_second DurationROW typeThe easiest way to think about the ROW type in Presto is
to think of it literally as a row of a table. Just as a table
can have multiple columns of different data types, a ROW
can have multiple elements of different types. And just like a table
having a name for each column, every element of a ROW has a
name associated with the value.
Depending on whether the ROW type is repeated (i.e.,
wrapped in an ARRAY), the translation into R is
different.
We translate single ROW value to a named list in
R.
Rather than interpret repeated ROWs (i.e.,
ARRAY of ROWs) as a list of named lists, we
translate the collection of ROWs into a tibble.
| Presto type | R type |
|---|---|
| Single ROW | named list |
| Repeated ROWs | tibble |
ROW translationTo demonstrate how ROW types are translated into R
types, we first create a table using an auxiliary
create_primitive_rows_table() function included in the
package. The resulting table has only 1 column named
row_primitive_types which is a ROW that
includes 18 sub-columns representing all supported primitive types.
RPresto:::create_primitive_rows_table(
con, table_name = "presto_primitive_rows", verbose = FALSE
)We can check if the table now exists in Presto.
DBI::dbExistsTable(con, "presto_primitive_rows")
#> [1] TRUEWe can list the fields in the table.
DBI::dbListFields(con, "presto_primitive_rows")
#> [1] "row_primitive_types"We can then retrieve all the data from the table.
(
df.row_of_primitive <- dbGetQuery(
con,
"SELECT row_primitive_types FROM presto_primitive_rows",
bigint = "integer64"
)
)
#> # A tibble: 3 × 1
#> row_primitive_types
#> <list>
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>We can check the R types of each element in the named list.
tibble::enframe(
purrr::map_chr(df.row_of_primitive$row_primitive_types[[1]], ~class(.)[1])
)
#> # A tibble: 18 × 2
#> name value
#> <chr> <chr>
#> 1 boolean logical
#> 2 tinyint integer
#> 3 smallint integer
#> 4 integer integer
#> 5 bigint integer64
#> 6 real numeric
#> 7 double numeric
#> 8 decimal character
#> 9 varchar character
#> 10 char character
#> 11 varbinary list
#> 12 date Date
#> 13 time hms
#> 14 time_with_tz hms
#> 15 timestamp POSIXct
#> 16 timestamp_with_tz POSIXct
#> 17 interval_year_to_month Duration
#> 18 interval_day_to_second DurationROWs translationTo construct a repeated ROW column, we use the auxiliary
create_array_of_rows_table() function.
RPresto:::create_array_of_rows_table(
con, table_name = "presto_array_of_rows", verbose = FALSE
)We can check if the table now exists in Presto and the field name.
DBI::dbExistsTable(con, "presto_array_of_rows")
#> [1] TRUEDBI::dbListFields(con, "presto_array_of_rows")
#> [1] "array_of_rows"We can import the whole data into R.
(
df.array_of_rows <- dbGetQuery(
con,
"SELECT array_of_rows FROM presto_array_of_rows",
bigint = "integer64"
)
)
#> # A tibble: 3 × 1
#> array_of_rows
#> <list>
#> 1 <tibble [2 × 18]>
#> 2 <tibble [2 × 18]>
#> 3 <tibble [2 × 18]>We can verify the tibble’s column types.
tibble::enframe(
purrr::map_chr(df.array_of_rows$array_of_rows[[1]], ~class(.)[1])
)
#> # A tibble: 18 × 2
#> name value
#> <chr> <chr>
#> 1 boolean logical
#> 2 tinyint integer
#> 3 smallint integer
#> 4 integer integer
#> 5 bigint integer64
#> 6 real numeric
#> 7 double numeric
#> 8 decimal character
#> 9 varchar character
#> 10 char character
#> 11 varbinary list
#> 12 date Date
#> 13 time hms
#> 14 time_with_tz hms
#> 15 timestamp POSIXct
#> 16 timestamp_with_tz POSIXct
#> 17 interval_year_to_month Duration
#> 18 interval_day_to_second DurationDBI::dbDisconnect(con)
#> [1] TRUE