Depending on the state of your spatial data, they may require some cleaning and modification before interpolation. The types of issues to address pre-interpolation fall into a few distinct categories:
Each of these conditions will be discussed below. The following examples assume:
library(areal)
library(dplyr) # data wrangling
library(sf) # spatial data operations
# load data into enviornment
race <- ar_stl_race
asthma <- ar_stl_asthma
wards <- ar_stl_wards
# create example data - non-spatial data
asthmaTbl <- ar_stl_asthma
st_geometry(asthmaTbl) <- NULL
# create example data - wrong crs
race83 <- st_transform(race, crs = 4269)areal has a built-in tool for data validation, ar_validate(), that provides an excellent starting place for ensuring your data are ready to interpolate. The validation process covers the first three issues listed in the introduction. It can be run in a simple format:
If ar_validate() returns a FALSE value, it can also be run in a verbose manner that returns a detailed tibble:
ar_validate(source = asthma, target = wards, varList = "ASTHMA", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation TRUEUse the verbose = TRUE output as a checklist to address issues before interpolating your data.
Data need to be loaded as sf objects in R. If ar_validate() returns FALSE for the first condition, data are not stored as sf objects:
ar_validate(source = asthmaTbl, target = wards, varList = "ASTHMA", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects FALSE
#> 2 CRS Match NA
#> 3 CRS is Planar NA
#> 4 Variables Exist in Source NA
#> 5 No Variable Conflicts in Target NA
#> 6 Overall Evaluation FALSEIf you have external data, using st_read() is the best way to load them:
The st_read() function supports a variety of the most common data sources.
If you are working with data that is an sp object in R, the sf package has a function st_as_sf() can be used to convert the object to sf:
If you are using either the tidycensus or tigris packages to access spatial data, there are options for downloading data as sf objects as well. In tidycensus, the geometry = TRUE option should be used. In tigris, the class = "sf" option should be used.
If you have tabular data, you will need to merge them with a shapefile or other spatial data set that contains the geometry for the corresponding spatial features. For American users, a potentially common scenario here will be table of census geography (such as tracts) that lacks the geometry for those features. These can be downloaded using tigris and then combined using dplyr. The following example assumes the tract data contains an identification number column named GEOID with the appropriate tract identifiers for St. Louis:
Two coordinate system rules are enforced by ar_validate: both the source and the target data must be in the same coordinate system and that coordinate system must be a projected coordinate system. The sf package contains a function named st_crs() for previewing the current coordinate system of your data:
st_crs(race83)
#> Coordinate Reference System:
#> User input: EPSG:4269
#> wkt:
#> GEOGCS["NAD83",
#> DATUM["North_American_Datum_1983",
#> SPHEROID["GRS 1980",6378137,298.257222101,
#> AUTHORITY["EPSG","7019"]],
#> TOWGS84[0,0,0,0,0,0,0],
#> AUTHORITY["EPSG","6269"]],
#> PRIMEM["Greenwich",0,
#> AUTHORITY["EPSG","8901"]],
#> UNIT["degree",0.0174532925199433,
#> AUTHORITY["EPSG","9122"]],
#> AUTHORITY["EPSG","4269"]]The EPSG value 4269 refers to a specific type of coordinate system known as as geographic coordinate system. These data use latitude-longitude values, which vary in length and thus cannot be used to calculate area (a key component of carrying out areal interpolations). These data would fail the ar_validate() process:
ar_validate(source = race83, target = wards, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match FALSE
#> 3 CRS is Planar FALSE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation FALSEThese data fail both the matching CRS validation and the planar CRS validation because the race and wards data are in two different coordinate systems. We’ve already seen the CRS data for race above so we’ll look at the CRS data forwards`:
st_crs(wards)
#> Coordinate Reference System:
#> User input: EPSG:26915
#> wkt:
#> PROJCS["NAD83 / UTM zone 15N",
#> GEOGCS["NAD83",
#> DATUM["North_American_Datum_1983",
#> SPHEROID["GRS 1980",6378137,298.257222101,
#> AUTHORITY["EPSG","7019"]],
#> TOWGS84[0,0,0,0,0,0,0],
#> AUTHORITY["EPSG","6269"]],
#> PRIMEM["Greenwich",0,
#> AUTHORITY["EPSG","8901"]],
#> UNIT["degree",0.0174532925199433,
#> AUTHORITY["EPSG","9122"]],
#> AUTHORITY["EPSG","4269"]],
#> PROJECTION["Transverse_Mercator"],
#> PARAMETER["latitude_of_origin",0],
#> PARAMETER["central_meridian",-93],
#> PARAMETER["scale_factor",0.9996],
#> PARAMETER["false_easting",500000],
#> PARAMETER["false_northing",0],
#> UNIT["metre",1,
#> AUTHORITY["EPSG","9001"]],
#> AXIS["Easting",EAST],
#> AXIS["Northing",NORTH],
#> AUTHORITY["EPSG","26915"]]If these data fail that process, st_crs() can be used to identify whether the source or the target (or both) are in an inappropriate coordinate system. The EPSG value 26915 refers to another type of coordinate system known as as projected coordinate system (or “planar” data). Further, 26915 is a particular type of projected coordinate system known as the Universal Transverse Mercator coordinate system. This is a good option for using areal because it is available around the world. There are a variety of tools for identifying the appropriate UTM zone, including this excellent interactive map. Once you have an EPSG zone, the website epsg.io can be used to search for the appropriate EPSG value.
In this case, the wards data are in an appropriate system but the race data need to be transformed. The st_transform() function can be used to modify them to an appropriate coordinate system:
Once this is done, these data should pass the ar_validate() process:
ar_validate(source = raceFixed, target = wards, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation TRUEIf a variable does not exist in the source data, the validation process will fail:
ar_validate(source = race, target = wards, varList = "TOTAL", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source FALSE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation FALSEIn this case, the fix is as easy as correcting the typo in our ar_validate() call - the variable TOTAL should really be TOTAL_E:
names(race)
#> [1] "GEOID" "STATEFP" "COUNTYFP" "TRACTCE" "NAMELSAD" "ALAND"
#> [7] "AWATER" "TOTAL_E" "TOTAL_M" "WHITE_E" "WHITE_M" "BLACK_E"
#> [13] "BLACK_M" "AIAN_E" "AIAN_M" "ASIAN_E" "ASIAN_M" "NHPI_E"
#> [19] "NHPI_M" "OTHER_E" "OTHER_M" "TWOPLUS_E" "TWOPLUS_M" "geometry"Another common problem is that a variable exists in both our source and target data. For instance, we could add a TOTAL_E variable to our target data:
wardsVar <- mutate(wards, TOTAL_E = TRUE)
ar_validate(source = race, target = wardsVar, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target FALSE
#> 6 Overall Evaluation FALSENow, we fail the validation process because there is a conflict between the source and target data - TOTAL_E exists in both data sets. We can use the select() function from dplyr to remove the offending variable from the target data before proceeding:
wardsFixed <- select(wardsVar, -TOTAL_E)
ar_validate(source = race, target = wardsFixed, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation TRUEAnother option would be to rename the column using the rename() function from dplyr:
wardsFixed <- rename(wardsVar, oldTotal = TOTAL_E)
ar_validate(source = race, target = wardsFixed, varList = "TOTAL_E", method = "aw", verbose = TRUE)
#> # A tibble: 6 x 2
#> test result
#> <chr> <lgl>
#> 1 sf Objects TRUE
#> 2 CRS Match TRUE
#> 3 CRS is Planar TRUE
#> 4 Variables Exist in Source TRUE
#> 5 No Variable Conflicts in Target TRUE
#> 6 Overall Evaluation TRUEEither option will work for areal’s purposes.
One cautionary note when using areal is that the validation process will not check to see if too many variables or observations exist in the data. Cleaning the data ahead of time so that only the relevant observations exist can help improve performance - if you have tract data for an entire state but only need a particular city, everything will go faster if you subset these data using the filter() function from dplyr:
Having too many columns poses an organizational problem more than anything else. We strongly encourage users to limit their interpolated data to only the needed values. For example, the wards data contains an AREA column whose units we don’t know and an OBJECTID column that is an artifact from its creation using an ESRI product:
We can use the select() function from dplyr to remove these before interpolation since we don’t need them: