Reading spatial data from SQL Server with DBI and wk (not sf)


Whew! So that was a long hiatus.

Things have changed quite a bit for me in the last year and a half. About a year ago I moved up to Washington State and began a job with the Department of Ecology as a software developer and GIS specialist with the Spills Prevention, Preparedness, and Response program. It’s been an interesting ride, and I’ve had the opportunity to develop some new skills and proficiency in SQL Database architecture, reproducible data pipelines, and high-performance computing as a developer supporting the Safety of Oil Transportation Act risk modeling work—which just so happens to be implemented in R, so I was able to jump right in.

My work on the risk model has required handling and processing billions of data records describing maritime traffic; geographic points of interest such as berths, anchorages, and waypoints; ship attributes; ocean bathymetry; and modeled wind and ocean currents. That’s, um, slightly more data than R is generally equipped to handle, so we use a series of Microsoft SQL Server databases for data storage and retrieval (nevermind that MS SQL Server has extremely mediocre support for spatial data).

The DBI and odbc packages provide the tools we need to interface with the databases from R… mostly. DBI can’t handle the SQL Server geometry or geography spatial types directly, so those objects get replaced with an unhelpful character string in the resulting R data frame:

library(DBI)
library(odbc)

# connect to database
db_con <- dbConnect(odbc(), server = "my_server", database = "my_db")

# query data
dbGetQuery(db_con, "SELECT PointID, Shape FROM my_table")
##   PointID Shape
## 1       5 \036i
## 2       3 \036i
## 3      86 \036i
## 4      27 \036i
## ...

This isn’t that surprising since MS SQL Server spatial data formats aren’t the same as those used by R. You can cast the spatial data columns into Well-Known Binary (WKB) or Well-Known-Text (WKT) in the SQL statement, but this doesn’t immediately result in something useable either:

dbGetQuery(db_con, "SELECT PointID, Shape.STAsBinary() AS Shape FROM my_table")
##   PointID      Shape
## 1       5 blob[21 B]
## 2       3 blob[21 B]
## 3      86 blob[21 B]
## 4      27 blob[21 B]
## ...

Most of the advice online suggests using st_read() (from the sf package) to get the spatial data in a useable form:

library(sf)

st_read(db_con, query = "Select PointID, Shape.STAsBinary() AS Shape FROM my_table")
## Simple feature collection with 154 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## CRS:  NA
## First 10 features:
##    PointID                 Shape
## 1        5 POINT (-81.47276 3...
## 2        3 POINT (-81.23989 3...
## 3       86 POINT (-80.45634 3...
## 4       27 POINT (-76.00897 3...
## ...

This works, but it precludes the use of some of the fanciest capabilities of DBI; specifically, using dbBind() to parameterize statements (which is critical if you, say, need to grab wind and ocean current data for tens of thousands of vessels scattered across the entire Salish Sea at arbitrary times throughout the year). This means that you need to either pull the whole dataset into R at once (not feasible for me) or use for loops/lapply()/etc. to pull data down in chunks, which is much slower than dbBind() and starts getting messy fast if you need to do anything remotely complex.

Thankfully, there’s a way to use DBI to get spatial data without relying on sf. Instead, we can use the wk package to parse the WKB objects directly after pulling them into R. However, this requires using list columns, so we’ll get an error unless we convert the output to a tbl_df object first (e.g., using the dplyr package).

library(dplyr)
library(wk)

# get the data
dbGetQuery(db_con, "Select PointID, Shape.STAsBinary() AS Shape FROM my_table") |>
  # convert to tibble
  as_tibble() |>
  # parse the WKB
  mutate(Shape = as_wkb(Shape))
## # A tibble: 154 x 2
##   PointID Shape
##     <int> <wk_wkb>
## 1       5 POINT (-81.47276 3...
## 2       3 POINT (-81.23989 3...
## 3      86 POINT (-80.45634 3...
## 4      27 POINT (-76.00897 3...
## ...

Once this is done, we can easily convert the dataset to a spatial data frame using sf::st_set_geometry()—or keep it as a tbl_df object until we need to call sf functions.

The primary advantage of doing things this way is that we can now use dbBind() with spatial tables. Even better, we can use this strategy to use dbplyr workflows and take advantage of lazy evaluation:

library(dbplyr)

tbl(db_con, "my_table") |>
  filter(PointID %in% c(5L, 27L)) |>
  mutate(Shape = sql("Shape.STAsBinary()")) |>
  select(PointID, Shape) |>
  collect() |>
  mutate(Shape = as_wkb(Shape))
## # A tibble: 2 x 2
##   PointID Shape
##     <int> <wk_wkb>
## 1       5 POINT (-81.47276 3...
## 2      27 POINT (-76.00897 3...

Note that we need to use mutate() (rather than select()) in conjunction with sql() to extract the “Shape” field as WKB objects, but otherwise we can use all the features and functionality of dbplyr. This lets us perform much more of the work on the SQL side to e.g., take advantage of indexes and the server’s computing power. However, I’m not sure how well it will perform when e.g., filtering by hundreds or thousands of values, so I may need to fall back to dbGetQuery()/dbBind() for those cases.

That’s it! Working with spatial data this way has not only sped up my code, but also made it more readable—especially for some of my colleagues who do not have much experience with SQL databases but are quite familiar with dplyr syntax. There is still a bit of fiddling required to write the spatial data back to SQL Server, since you need to first write the data as WKB objects and then convert them to the SQL Server geometry/geography objects after. I’ll make another post if I figure out a smarter way to do that too.


Comments