cuSpatial Library Design#

Overview#

At a high level, cuSpatial has three parts:

  • A GPU-backed GeoDataFrame data structure

  • A set of computation APIs

  • A Cython API layer

Core Data Structures#

Note

The core data structure of cuSpatial shares its name with that of geopandas, so we refer to geopandas’ dataframe object as geopandas.GeoDataFrame and to cuSpatial’s dataframe object as GeoDataFrame.

Introduction to GeoArrow Format#

Under the hood, cuSpatial can perform parallel computation on geometry data thanks to its structure-of-arrays (SoA) format. Specifically, cuSpatial adopts the GeoArrow format, an extension to the Apache Arrow format that uses Arrow’s Variable-size List Layout to support geometry arrays.

By definition, each increase in geometry complexity (dimension or multi-geometry) requires an extra level of indirection. In cuSpatial, we use the following names for the levels of indirection, from highest to lowest: geometries, parts, rings, and coordinates. The first three are integer offset arrays and the last is a floating-point interleaved xy-coordinate array.
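The four levels can be sketched with plain Python lists. This is only an illustration with made-up values. It assumes that each offsets array has one more entry than the level it describes, and that ring offsets count coordinate pairs rather than individual floats. The variable names follow the level names above, and `multipolygon_at` is a hypothetical helper, not a cuSpatial function.

```python
# Two multipolygons stored as structure of arrays (illustrative values).
# Geometry 0: one polygon with one ring.
# Geometry 1: two polygons; the first has an outer ring and a hole.
xy = [
    0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0,   # ring 0
    2.0, 2.0, 5.0, 2.0, 2.0, 5.0, 2.0, 2.0,   # ring 1 (outer)
    2.5, 2.5, 3.0, 2.5, 2.5, 3.0, 2.5, 2.5,   # ring 2 (hole)
    6.0, 6.0, 7.0, 6.0, 6.0, 7.0, 6.0, 6.0,   # ring 3
]
ring_offsets = [0, 4, 8, 12, 16]   # ring r spans points ring_offsets[r]:ring_offsets[r+1]
part_offsets = [0, 1, 3, 4]        # polygon p spans rings part_offsets[p]:part_offsets[p+1]
geometry_offsets = [0, 1, 3]       # geometry g spans polygons geometry_offsets[g]:geometry_offsets[g+1]


def multipolygon_at(g):
    """Follow the three offset levels down to the coordinates of geometry g."""
    polygons = []
    for p in range(geometry_offsets[g], geometry_offsets[g + 1]):
        rings = []
        for r in range(part_offsets[p], part_offsets[p + 1]):
            start, end = ring_offsets[r], ring_offsets[r + 1]
            # De-interleave: point k lives at xy[2k] and xy[2k + 1].
            rings.append([(xy[2 * k], xy[2 * k + 1]) for k in range(start, end)])
        polygons.append(rings)
    return polygons
```

Because every level is a flat array, each geometry, part, or ring can be located independently of the others, which is what makes the layout suitable for parallel processing.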

GeoArrow also allows a mixture of geometry types to be present in the same column by adopting the Dense Union Array Layout.

Read the GeoArrow format specification for more detail.

GeoColumn#

cuSpatial implements a specialization of Arrow dense union via GeoColumn and GeoMeta. A GeoColumn is a composition of child columns and a GeoMeta object. The GeoMeta owns two arrays that are similar to the types buffer and offsets buffer from Arrow dense union.
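As a rough sketch of the dense union idea, the lookup below uses a types buffer to select a child array and an offsets buffer to index into it. The type codes, the child arrays, and `element_at` are hypothetical and chosen for illustration; they do not reflect the actual fields of GeoColumn or GeoMeta.

```python
# Hypothetical type codes for two child arrays.
POINT, LINESTRING = 0, 1

# Each child array stores only the elements of its own type.
child_arrays = {
    POINT: [(0.0, 0.0), (3.0, 4.0)],
    LINESTRING: [
        [(0.0, 0.0), (1.0, 1.0)],
        [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
    ],
}

# Logical column order: point, linestring, linestring, point.
types = [POINT, LINESTRING, LINESTRING, POINT]   # which child holds row i
offsets = [0, 0, 1, 1]                           # position of row i inside that child


def element_at(row):
    """Resolve one row of the mixed column through the two buffers."""
    return child_arrays[types[row]][offsets[row]]
```

The union is "dense" because each child array holds no padding for rows of other types; the offsets buffer supplies the compact index.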

Note

Currently, GeoColumn implements four concrete array types: points, multipoints, multilinestrings and multipolygons. Linestrings and multilinestrings are stored uniformly as multilinestrings in the multilinestrings array. Polygons and multipolygons are stored uniformly as multipolygons in the multipolygons array.

Points and multipoints are stored separately in different arrays, because storing points in a multipoints array requires 50% more storage. While the same kind of overhead applies to linestrings and polygons, many uses of cuSpatial involve more complex linestrings and polygons, for which the storage overhead of multigeometry indirection is proportionally lower than for points.
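One way to arrive at the 50% figure is to count array elements rather than bytes, which is an assumption here: the byte-level ratio depends on the widths of the coordinate and offset types. Under that assumption, the extra level of indirection adds one offset per geometry (plus a final terminating entry) on top of two coordinate values per point. The helper name below is hypothetical.

```python
def extra_level_overhead(n_geometries, points_per_geometry):
    """Ratio of extra offset elements to coordinate elements.

    Assumes one additional offsets array with n_geometries + 1 entries
    and two coordinate values (x, y) per point.
    """
    offset_elements = n_geometries + 1
    coordinate_elements = 2 * n_geometries * points_per_geometry
    return offset_elements / coordinate_elements


# A point is a geometry with exactly one point: the ratio approaches 1/2.
points_ratio = extra_level_overhead(1_000_000, 1)

# A linestring with 100 vertices pays the same one offset, spread over 200 values.
lines_ratio = extra_level_overhead(1_000_000, 100)
```

With one point per geometry the ratio tends to 0.5, while for geometries with many vertices it shrinks toward zero, which matches the reasoning for treating points differently from linestrings and polygons.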

GeoSeries and GeoDataFrame inherit from cudf.Series and cudf.DataFrame respectively. Series and DataFrame are both generic Frame objects which represent a collection of generic columns. cuSpatial extends these cuDF objects by allowing GeoColumns to be present in the frame.

GeoSeries and GeoDataFrame are convertible to and from geopandas. Interoperability between cuSpatial, geopandas, and other data formats is maintained in the cuspatial.io package.

UnionArray Compliance#

As previously mentioned, cuSpatial’s GeoColumn is a specialization of Arrow’s dense UnionArray. Ideally, such a fundamental data type would be implemented in cuDF so that GeoColumn could simply inherit its functionality. However, dense UnionArray is distinct from the existing data types in libcudf and requires substantial effort to implement. In the interim, cuSpatial provides a GeoColumn that complies with the dense UnionArray specification. This may be upstreamed to libcudf as it matures.

Geospatial computation APIs#

In addition to data structures, cuSpatial provides a set of computation APIs. The computation APIs are organized into several modules, and all spatial computation modules are further grouped into a spatial subpackage. Each module name should correspond to a specific computation category, such as distance or join. cuSpatial avoids general category names, such as generic.

Legacy and Modern APIs#

For historical reasons, older cuSpatial APIs expose raw array inputs, requiring users to provide raw geometry coordinate arrays and offsets. Newer Python APIs should accept a GeoSeries or GeoDataFrame as input. Developers may extract geometry offsets and coordinates via cuSpatial’s geometry accessors, such as GeoSeries.points, GeoSeries.multipoints, GeoSeries.lines, and GeoSeries.polygons. Developers can then pass the geometry offsets and coordinate arrays to Cython APIs.
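The relationship between the two API styles can be sketched in plain Python, without a GPU. Everything below is a stand-in: `LineAccessor`, `FakeGeoSeries`, `linestring_lengths_raw`, and `linestring_lengths` are hypothetical names, not cuSpatial APIs, and the accessor fields are assumed for illustration. Only the shape of the flow mirrors the text: accept a series, pull offsets and coordinates from an accessor, and hand raw arrays to a lower layer.

```python
import math
from dataclasses import dataclass


@dataclass
class LineAccessor:
    """Stand-in for an accessor such as GeoSeries.lines (fields are assumed)."""
    offsets: list   # linestring i spans points offsets[i]:offsets[i + 1]
    xy: list        # interleaved x, y coordinates


@dataclass
class FakeGeoSeries:
    """Stand-in for a GeoSeries that holds only linestrings."""
    lines: LineAccessor


def linestring_lengths_raw(offsets, xy):
    """Legacy-style entry point: the caller supplies raw offsets and coordinates."""
    lengths = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        total = 0.0
        for k in range(start, end - 1):
            dx = xy[2 * (k + 1)] - xy[2 * k]
            dy = xy[2 * (k + 1) + 1] - xy[2 * k + 1]
            total += math.hypot(dx, dy)
        lengths.append(total)
    return lengths


def linestring_lengths(series):
    """Modern-style entry point: accepts a series, then delegates to raw arrays."""
    accessor = series.lines
    return linestring_lengths_raw(accessor.offsets, accessor.xy)


series = FakeGeoSeries(
    LineAccessor(offsets=[0, 2, 5], xy=[0, 0, 3, 4, 0, 0, 1, 0, 1, 1])
)
```

In this arrangement the caller never handles offsets directly; the series-level function is the only place that knows how to unpack them.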

Cython Layer#

The lowest layer of cuSpatial is its interaction with libcuspatial via Cython. The Cython layer is composed of two components: C++ bindings and Cython wrappers. The first component consists of .pxd files, which are Cython declaration files that expose the contents of C++ header files to other Cython files. The second component consists of Cython wrappers for this functionality. These wrappers are necessary to expose this functionality to pure Python code.

To interact with the column-based APIs in libcuspatial, developers should have basic familiarity with libcudf objects. libcudf is built around two principal objects whose names are largely self-explanatory: column and table. libcudf also defines corresponding non-owning “view” types, column_view and table_view. Both libcudf and libcuspatial APIs typically accept views and return owning types. When a cuSpatial object owns one or more C++ owning objects, the lifetime of these objects is automatically managed by Python’s reference counting mechanism.
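The lifetime rule can be illustrated with a small pure-Python analogy. The classes below are stand-ins, not cuSpatial or libcudf types, and the immediate release shown relies on CPython's reference counting; other Python implementations may release the object later.

```python
import weakref


class OwnedColumn:
    """Stand-in for a C++ owning object (for example, a libcudf column)."""


class Wrapper:
    """Stand-in for a Python object that owns one C++ object."""

    def __init__(self, name, log):
        self._owned = OwnedColumn()
        # Record the release once this wrapper is garbage collected.
        weakref.finalize(self, log.append, name)


def demo():
    log = []
    wrapper = Wrapper("col0", log)
    alias = wrapper          # second reference to the same wrapper
    del wrapper
    before = list(log)       # the alias still keeps the wrapper alive
    del alias                # last reference dropped
    after = list(log)
    return before, after
```

The owned object is released only when the last Python reference to its wrapper disappears, so no explicit free call appears in user code.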

Similar to cuDF, Cython wrappers must convert Column objects into column_view objects, call the libcuspatial API, and reconstruct a cuDF object from the C++ result. By the time code reaches this stage, the objects are assumed to be fully legal inputs to the libcuspatial API. Therefore, the wrapper should contain nothing beyond these three steps.
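The three steps can be mimicked in plain Python to show how thin such a wrapper is meant to be. This is a mock, not Cython: `Column`, `ColumnView`, `from_owned`, `lib_scale`, and `scale` are hypothetical stand-ins for the cuDF column, the libcudf view type, and a libcuspatial function.

```python
from dataclasses import dataclass


@dataclass
class ColumnView:
    """Stand-in for a non-owning column_view: shares storage, no copy."""
    data: list


@dataclass
class Column:
    """Stand-in for an owning Python-side column."""
    data: list

    def view(self):
        return ColumnView(self.data)

    @classmethod
    def from_owned(cls, owned):
        # Take over the result produced by the lower layer.
        return cls(owned)


def lib_scale(view, factor):
    """Stand-in for a library function: accepts a view, returns an owning result."""
    return [value * factor for value in view.data]


def scale(column, factor):
    """Wrapper with exactly three steps and no validation."""
    view = column.view()                  # 1. convert Column to a view
    result = lib_scale(view, factor)      # 2. call the library API
    return Column.from_owned(result)      # 3. rebuild a Python-side object
```

Input checking is deliberately absent from `scale`; in this design it belongs to the Python layer above the wrapper.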