efts is an R package to access ensemble forecast time series (EFTS) stored in netCDF format. It offers convenient functions to access time series, hiding the bug-prone details of netCDF array manipulations.
EFTS netCDF data sets follow the schema described at this location at the time of writing.
The package comes with API documentation, as well as the present vignette. You can access both by navigating via
?efts
Perhaps as an unusual introduction, we will learn to write an EFTS netCDF data set prior to reading it.
Typical EFTS data is multidimensional, and the netCDF schema reflects this. To create a new data set, you need to define these dimensions. To create the information about the main time axis, you can use the create_time_info helper function in the package. The following commands create the time axis information for an hourly data set over two days:
library(efts)
timeAxisStart <- ISOdate(year=2010, month=08, day=01, hour = 12, min = 0, sec = 0, tz = "UTC")
timeDimInfo <- create_time_info(from=timeAxisStart, n=48, time_step='hours since', time_step_delta=1L)
str(timeDimInfo)
#> List of 2
#>  $ units : chr "hours since 2010-08-01 12:00:00 +0000"
#>  $ values: int [1:48] 0 1 2 3 4 5 6 7 8 9 ...
Other dimensions are easier to create. Variables are more involved; one way to define several variables is via a data frame. For instance, you can start from a stub such as the one created by the following:
variable_names <- c('rain_sim','pet_sim')
varDef <- create_variable_definition_dataframe(variable_names=variable_names, long_names = rep('synthetic data', 2), dimensions=4L)
str(varDef)
#> 'data.frame':    2 obs. of  12 variables:
#>  $ name                : chr  "rain_sim" "pet_sim"
#>  $ longname            : chr  "synthetic data" "synthetic data"
#>  $ standard_name       : chr  "rain_sim" "pet_sim"
#>  $ units               : chr  "mm" "mm"
#>  $ missval             : num  -9999 -9999
#>  $ precision           : chr  "double" "double"
#>  $ dimensions          : int  4 4
#>  $ type                : int  2 2
#>  $ type_description    : chr  "accumulated over the preceding interval" "accumulated over the preceding interval"
#>  $ dat_type            : chr  "der" "der"
#>  $ dat_type_description: chr  "AWAP data interpolated from observations" "AWAP data interpolated from observations"
#>  $ location_type       : chr  "Point" "Point"
Do note the column names of the data frame varDef; the package is picky about these, to comply with the EFTS netCDF data schema.
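Since the column names matter, one cautious way to customise such a stub is to edit values by row and then check that no expected column was renamed or dropped. The sketch below is only an illustration: varDefStub is a hand-built stand-in (an assumption, so that the snippet runs without efts) carrying a subset of the columns printed above, and the new long name is a made-up example value.

```r
# Stand-in for the stub returned by create_variable_definition_dataframe(),
# built by hand (assumption) with a subset of the columns printed above.
varDefStub <- data.frame(
  name = c("rain_sim", "pet_sim"),
  longname = rep("synthetic data", 2),
  standard_name = c("rain_sim", "pet_sim"),
  units = c("mm", "mm"),
  missval = c(-9999, -9999),
  precision = c("double", "double"),
  stringsAsFactors = FALSE
)

# Edit values by row, leaving the column names untouched.
isPet <- varDefStub$name == "pet_sim"
varDefStub$longname[isPet] <- "synthetic pet data"  # made-up example value

# Guard against accidentally renamed or dropped columns.
expectedCols <- c("name", "longname", "standard_name",
                  "units", "missval", "precision")
missingCols <- setdiff(expectedCols, names(varDefStub))
stopifnot(length(missingCols) == 0)
```

With the package loaded, the same row-wise edits would presumably be applied to varDef itself before passing it on.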
Let’s create a data set with 2 point stations, 3 ensemble members and a forecast lead time of 4 time steps.
stationIds <- c(123,456)
nEns <- 3
nLead <- 4
fname <- tempfile() # or something you prefer.
It is mandatory to provide some global file attributes; the function create_global_attributes provides a starting point.
global_attr <- create_global_attributes(
  title="data set title", 
  institution="my org", 
  catchment="My_Catchment", 
  source="A journal reference, URL", 
  comment="example for vignette")
Similarly for optional geographic metadata variables:
(opt_metadatavars <- default_optional_variable_definitions_v2_0())
#>                name                                     longname
#> x                 x  easting from the GDA94 datum in MGA Zone 55
#> y                 y northing from the GDA94 datum in MGA Zone 55
#> area           area                               catchment area
#> elevation elevation            station elevation above sea level
#>                   standard_name units missval precision
#> x         northing_GDA94_zone55            NA     float
#> y          easting_GDA94_zone55            NA     float
#> area                       area  km^2   -9999     float
#> elevation             elevation     m   -9999     float
With these definitions in place, we can create the EFTS data set:
snc <- create_efts(
  fname=fname,
  data_var_definitions=varDef,
  optional_vars=opt_metadatavars,
  time_dim_info=timeDimInfo,
  stations_ids=stationIds,
  station_names=NULL,
  nc_attributes=global_attr,
  lead_length=nLead,
  ensemble_length=nEns)
The default print method for the object snc gives the same output as for objects of class ncdf4:
snc
#> File /tmp/RtmpKDVAG9/file1e206b54f8d7 (NC_FORMAT_CLASSIC):
#> 
#>      10 variables (excluding dimension variables):
#>         double rain_sim[lead_time,station,ens_member,time]   
#>             units: mm
#>             _FillValue: -9999
#>             long_name: synthetic data
#>             standard_name: rain_sim
#>             type: 2
#>             type_description: accumulated over the preceding interval
#>             dat_type: der
#>             dat_type_description: AWAP data interpolated from observations
#>             location_type: Point
#>         double pet_sim[lead_time,station,ens_member,time]   
#>             units: mm
#>             _FillValue: -9999
#>             long_name: synthetic data
#>             standard_name: pet_sim
#>             type: 2
#>             type_description: accumulated over the preceding interval
#>             dat_type: der
#>             dat_type_description: AWAP data interpolated from observations
#>             location_type: Point
#>         int station_id[station]   
#>             long_name: station or node identification code
#>         char station_name[str_len,station]   
#>             long_name: station or node name
#>         float lat[station]   
#>             units: degrees north
#>             _FillValue: -9999
#>             long_name: latitude
#>             axis: y
#>         float lon[station]   
#>             units: degrees east
#>             _FillValue: -9999
#>             long_name: longitude
#>             axis: x
#>         float x[station]   
#>             _FillValue: NaN
#>             long_name: easting from the GDA94 datum in MGA Zone 55
#>             standard_name: northing_GDA94_zone55
#>             axis: x
#>         float y[station]   
#>             _FillValue: NaN
#>             long_name: northing from the GDA94 datum in MGA Zone 55
#>             standard_name: easting_GDA94_zone55
#>             axis: y
#>         float area[station]   
#>             units: km^2
#>             _FillValue: -9999
#>             long_name: catchment area
#>             standard_name: area
#>         float elevation[station]   
#>             units: m
#>             _FillValue: -9999
#>             long_name: station elevation above sea level
#>             standard_name: elevation
#> 
#>      5 dimensions:
#>         lead_time  Size:4
#>             units: hours since time
#>             long_name: forecast lead time
#>             standard_name: lead_time
#>             axis: v
#>         station  Size:2
#>         ens_member  Size:3
#>             units: member id
#>             long_name: ensemble member
#>             standard_name: ens_member
#>             axis: u
#>         time  Size:48   *** is unlimited ***
#>             units: hours since 2010-08-01 12:00:00 +0000
#>             long_name: time
#>             standard_name: time
#>             time_standard: UTC
#>             axis: t
#>         str_len  Size:30
#>             long_name: string length
#> 
#>     8 global attributes:
#>         STF_convention_version: 2
#>         STF_nc_spec: https://github.com/jmp75/efts/blob/107c553045a37e6ef36b2eababf6a299e7883d50/docs/netcdf_for_water_forecasting.md
#>         history: : 2018-04-26 01:13:59 UTC file created with the R package efts 0.9-0
#>         title: data set title
#>         institution: my org
#>         source: A journal reference, URL
#>         catchment: My_Catchment
#>         comment: example for vignette
snc is a type of object that few R aficionados are aware of: an instance of a reference class. Without entering into unnecessary technical details, this is mostly a design choice made to achieve better memory usage and performance in some contexts.
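As a rough illustration of what reference semantics mean in practice, the sketch below contrasts a reference class object with a regular list. The class Holder is hypothetical and is not part of efts; the point is only that assigning a reference object to a new name does not copy it.

```r
library(methods)

# Hypothetical class, for illustration only; it is not part of efts.
Holder <- setRefClass("Holder", fields = list(values = "numeric"))

a <- Holder$new(values = c(1, 2, 3))
b <- a               # no copy: 'b' refers to the same object as 'a'
b$values[1] <- 100
a$values[1]          # 100, the change made via 'b' is visible via 'a'

# Contrast with a regular list, which has the usual copy semantics.
l1 <- list(values = c(1, 2, 3))
l2 <- l1
l2$values[1] <- 100
l1$values[1]         # still 1
```

Avoiding such copies is the kind of saving that matters when an object wraps a large file handle or large arrays.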
The following command displays the main characteristics of this reference object, of class EftsDataSet. Note that you should really use the methods, and not access the fields directly.
str(snc, max.level=2)
#> Reference class 'EftsDataSet' [package "efts"] with 8 fields
#>  $ ncfile                  :List of 14
#>   ..- attr(*, "class")= chr "ncdf4"
#>  $ time_dim                : POSIXct[1:1], format: NA
#>  $ time_zone               : chr "UTC"
#>  $ identifiers_dimensions  : list()
#>  $ stations_varname        : chr "station_id"
#>  $ stations_dim_name       : chr "station"
#>  $ lead_time_dim_name      : chr "lead_time"
#>  $ ensemble_member_dim_name: chr "ens_member"
#>  and 46 methods, of which 32 are  possibly relevant:
#>    close, get_all_series, get_dim_names, get_ensemble_for_stations,
#>    get_ensemble_forecasts, get_ensemble_forecasts_for_station,
#>    get_ensemble_series, get_ensemble_size, get_lead_time_count,
#>    get_single_series, get_station_count, get_stations_varname,
#>    get_time_dim, get_time_unit, get_time_zone, get_utc_offset, get_values,
#>    get_variable_dim_names, get_variable_names, index_for_identifier,
#>    index_for_time, initialize, initialize#NetCdfDataSet,
#>    put_ensemble_forecasts, put_ensemble_forecasts_for_station,
#>    put_ensemble_series, put_single_series, put_values, set_time_zone,
#>    show#envRefClass, summary, sync
You can get to the documentation page for the methods in this class with:
?efts::EftsDataSet
Our EFTS data set object is ready to be populated with data. Let’s create synthetic data. The basic idea of the object’s methods is to offer intuitive and concise means to get/set ensembles of forecast time series. A method can be called with a syntax that may be unfamiliar to most R users, but is similar to that of most object-oriented languages: theObject$theMethodName(someArgumentsIfAny), for instance snc$get_time_dim() in our EFTS data set, to retrieve its time axis.
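For readers who have not met this calling style, here is a minimal sketch of a reference class with put/get style methods. SeriesStore and its methods are hypothetical names chosen to echo the naming seen above; this is not how EftsDataSet is implemented.

```r
library(methods)

# Hypothetical class, for illustration only; not the EftsDataSet implementation.
SeriesStore <- setRefClass("SeriesStore",
  fields = list(series = "list"),
  methods = list(
    put_series = function(x, variable_name) {
      # '<<-' modifies the field of the object itself
      series[[variable_name]] <<- x
      invisible(.self)
    },
    get_series = function(variable_name) {
      series[[variable_name]]
    },
    get_variable_names = function() {
      names(series)
    }
  )
)

store <- SeriesStore$new()
# Methods are invoked with theObject$theMethodName(arguments)
store$put_series(c(0, 1.5, 3), variable_name = "rain_sim")
store$get_variable_names()
store$get_series("rain_sim")
```

Note that put_series changes store in place; there is no need to reassign its result, which is also how the put_ methods of snc are used in this vignette.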
TODO: test, document and demonstrate missing value handling.
set.seed(42)
td <- snc$get_time_dim()
for (i in seq_along(td)) {
    for (station in stationIds) {
        rain <- 6 * rnorm(nEns*nLead)
        rain <- matrix(pmax(as.numeric(rain), 0), nrow=nLead) # nEns replicates of a forecast of length nLead
        pet <- 6.0 + rnorm(nEns*nLead)
        pet <- matrix(pmax(as.numeric(pet), 0), nrow=nLead)
        dtime = td[i]
        snc$put_ensemble_forecasts(rain, variable_name = variable_names[1], identifier = station, start_time = dtime) 
        snc$put_ensemble_forecasts(pet,  variable_name = variable_names[2], identifier = station, start_time = dtime) 
    }
}
Now we can demonstrate how to retrieve data. If you had closed the previous EFTS data set object snc and deleted the variable, you’d reopen it for reading with the following command:
if (!exists('snc')) snc <- open_efts(fname)
You get the ensemble forecast for a variable (pet_sim in this case) at a point in time with the following commands:
td <- snc$get_time_dim()
timeStamp <- td[5] # for instance
d <- snc$get_ensemble_forecasts(variable_names[2], stationIds[1], start_time=timeStamp)
str(d)
#> An 'xts' object on 2010-08-01 16:00:00/2010-08-01 19:00:00 containing:
#>   Data: num [1:4, 1:3] 4.62 4.85 5.29 4.95 5.35 ...
#>   Indexed by objects of class: [POSIXct,POSIXt] TZ: UTC
#>   xts Attributes:  
#>  NULL
The object returned is of class xts. The package xts has been chosen as the default time series structure for efts. The rationale is empirical: previous experience with xts in conjunction with plyr (note to self: possibly dplyr in the future) showed good performance when calculating hydrologic statistics on ensembles of time series.
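To give an idea of the kind of statistic meant here, the sketch below computes an ensemble mean and range per lead time. It uses a plain matrix, fcast, as a stand-in (an assumption, so that it runs without efts or xts) with the same shape as the data in d: 4 lead times in rows, 3 ensemble members in columns.

```r
set.seed(1)
# Stand-in for one forecast: rows are lead times, columns are ensemble members.
fcast <- matrix(6 + rnorm(4 * 3), nrow = 4)

# Ensemble mean: one value per lead time
ensMean <- rowMeans(fcast)

# Ensemble range: minimum and maximum across members, per lead time
ensRange <- t(apply(fcast, 1, range))
colnames(ensRange) <- c("min", "max")
```

The same base functions would presumably work on the xts object d, since it holds its data as a matrix, but that has not been verified here.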
To quickly visualize it (if the data is not too big), using the function plot.zoo has the advantage of stacking the series:
zoo::plot.zoo(d)
As of version 0.5-1, there is a convenience function to retrieve the UTC offset from the units of the time dimension. Here there is no offset, hence a duration of zero is returned:
paste0( "UTC offset as a string: ", snc$get_utc_offset())
#> [1] "UTC offset as a string: +0000"
paste0( "UTC offset as a difftime: ", snc$get_utc_offset(as_string=FALSE))
#> [1] "UTC offset as a difftime: 0"
Close the data set with the following commands:
snc$close()
rm(snc)
By default, this vignette creates a temporary file, which can be cleaned up with:
if(file.exists(fname)) file.remove(fname)
#> [1] TRUE