Internals#

This page describes the internal data structures of cuspatial.

GeoArrow Format#

Geospatial data is context rich: a set of coordinates is more than a list of numbers, because the coordinates must be grouped to represent a particular geometry. For example, 5 points in a plane could be 5 separate points, 2 line segments, a single linestring, or a pentagon. Many geometry libraries store the points in arrays of geometric objects, a layout commonly known as “Array of Structures” (AoS). AoS is not efficient for accelerated computing on parallel devices such as GPUs. The GeoArrow format was therefore introduced to store geodata in a densely packed layout, commonly known as “Structure of Arrays” (SoA).
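The difference between the two layouts can be sketched in plain Python. This is only an illustration: the dict-based geometry objects and the `to_soa` helper are hypothetical and are not part of cuspatial.

```python
# Array of Structures: one Python object per geometry, each owning its points.
# (Hypothetical representation, used only for illustration.)
aos = [
    {"type": "LineString", "coords": [(0.0, 0.0), (1.0, 0.0)]},
    {"type": "LineString", "coords": [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]},
]


def to_soa(geometries):
    """Flatten geometries into one interleaved xy buffer plus an offsets buffer."""
    xy = []
    offsets = [0]
    for geom in geometries:
        for x, y in geom["coords"]:
            xy.extend((x, y))
        # Each offset marks where the next geometry starts in `xy`.
        offsets.append(len(xy))
    return {"xy": xy, "offsets": offsets}


soa = to_soa(aos)
print(soa["xy"])
print(soa["offsets"])
```

The SoA form keeps all coordinates in one contiguous buffer, which is the property that makes it suitable for parallel devices.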

The GeoArrow format specifies a tabular data format for geometry information. Supported types include Point, MultiPoint, LineString, MultiLineString, Polygon, and MultiPolygon. In order to store these coordinate types in a strictly tabular fashion, columns are created for Points, MultiPoints, LineStrings, and Polygons. MultiLineStrings and MultiPolygons are stored in the same data structure as LineStrings and Polygons.

The GeoArrow format packs complex geometry types into 14 single-column Arrow tables. See the GeoArrowBuffers docstring for the complete list of keys for the columns.

Examples#

The Point geometry is the simplest. N points are stored in a length 2*N buffer with interleaved x,y coordinates. An optional z buffer of length N can be used.
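As a minimal sketch of the interleaved layout, the following plain-Python helper separates an xy buffer into x and y lists and pairs them with an optional z buffer. The name `split_interleaved` is hypothetical and is not a cuspatial function.

```python
def split_interleaved(xy):
    """Split an interleaved [x0, y0, x1, y1, ...] buffer into x and y lists."""
    if len(xy) % 2 != 0:
        raise ValueError("interleaved xy buffer must have even length")
    # Even positions hold x values, odd positions hold y values.
    return xy[0::2], xy[1::2]


points_xy = [0, 0, 0, 1, 0, 2]  # 3 points -> buffer of length 2 * 3
points_z = [5, 6, 7]            # optional z buffer, one value per point

x, y = split_interleaved(points_xy)
points = list(zip(x, y, points_z))
print(points)
```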

A MultiPoint is a group of points and is the second simplest GeoArrow geometry type. It is stored like points, with the addition of a multipoints_offsets buffer. For N multipoints, the offsets buffer stores N+1 indices. The first multipoint starts at 0, which is always stored in offsets[0]; the second starts at offsets[1], and so on. The extent of multipoint i is the difference between offsets[i+1] and offsets[i]. In the example below the offsets index into the interleaved xy buffer, so each difference of 6 corresponds to 3 points.

Consider:

buffers = GeoArrowBuffers({
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18]
})

which encodes the following GeoPandas Series:

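import geopandas
from shapely.geometry import MultiPoint

# Each MultiPoint takes a single sequence of (x, y) tuples.
series = geopandas.GeoSeries([
    MultiPoint([(0, 0), (0, 1), (0, 2)]),
    MultiPoint([(1, 0), (1, 1), (1, 2)]),
    MultiPoint([(2, 0), (2, 1), (2, 2)]),
])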
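The offsets arithmetic can be checked with a short plain-Python sketch. It assumes, as the example values suggest, that the offsets index into the interleaved xy buffer; `decode_multipoints` is a hypothetical helper, not a cuspatial API.

```python
def decode_multipoints(xy, offsets):
    """Group an interleaved xy buffer into lists of (x, y) tuples."""
    multipoints = []
    # Consecutive offsets delimit one multipoint each.
    for start, end in zip(offsets[:-1], offsets[1:]):
        chunk = xy[start:end]
        multipoints.append(list(zip(chunk[0::2], chunk[1::2])))
    return multipoints


multipoints_xy = [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2]
multipoints_offsets = [0, 6, 12, 18]

decoded = decode_multipoints(multipoints_xy, multipoints_offsets)
for multipoint in decoded:
    print(multipoint)
```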

LineString geometry is more complicated than multipoints because the format allows LineStrings and MultiLineStrings to share the same buffers, via the mlines buffer. The mlines buffer stores 2M indices, where M is the number of MultiLineStrings. The starting and ending LineString offsets of the i-th MultiLineString are stored at mlines[2*i] and mlines[2*i+1] respectively.

Consider:

buffers = GeoArrowBuffers({
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
            3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3]
})

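which encodes the following GeoPandas Series:

import geopandas
from shapely.geometry import LineString, MultiLineString

series = geopandas.GeoSeries([
    LineString([(0, 0), (0, 1), (0, 2)]),
    # A MultiLineString takes a sequence of coordinate sequences.
    MultiLineString([
        [(1, 0), (1, 1), (1, 2)],
        [(2, 0), (2, 1), (2, 2)],
    ]),
    LineString([(3, 0), (3, 1), (3, 2)]),
    LineString([(4, 0), (4, 1), (4, 2)]),
])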

Note that mlines has 2 entries, so buffers contains 1 MultiLineString. It consists of 2 LineStrings: the second and third LineStrings defined by lines_offsets.
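The grouping rule can be sketched in plain Python. The sketch assumes that lines_offsets index into the interleaved xy buffer and that each mlines pair is a half-open range of LineString indices, which is consistent with the example; `decode_lines` is a hypothetical helper, not a cuspatial API.

```python
def decode_lines(xy, offsets, mlines=()):
    """Return ("LineString", coords) or ("MultiLineString", [coords, ...]) items."""
    # First slice the coordinate buffer into individual linestrings.
    lines = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        chunk = xy[start:end]
        lines.append(list(zip(chunk[0::2], chunk[1::2])))

    # Map the first linestring index of each MultiLineString to its end index.
    groups = dict(zip(mlines[0::2], mlines[1::2]))

    result = []
    i = 0
    while i < len(lines):
        if i in groups:
            end = groups[i]
            if end <= i:
                raise ValueError("mlines end index must exceed its start index")
            result.append(("MultiLineString", lines[i:end]))
            i = end
        else:
            result.append(("LineString", lines[i]))
            i += 1
    return result


lines_xy = [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
            3, 1, 3, 2, 4, 0, 4, 1, 4, 2]
lines_offsets = [0, 6, 12, 18, 24, 30]
mlines = [1, 3]

geometries = decode_lines(lines_xy, lines_offsets, mlines)
for kind, coords in geometries:
    print(kind, coords)
```

When mlines is empty, every entry decodes as a plain LineString.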

Polygon geometry includes an mpolygons buffer for MultiPolygons, similar to the mlines buffer for LineString geometry. Polygons are encoded using the same format as Shapefile, with left-wound external rings and right-wound internal rings.
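Ring winding can be checked with the shoelace formula. The sketch below is a generic illustration and not cuspatial code: it reports whether a ring is clockwise or counter-clockwise in a standard x-right, y-up plane. How "left-wound" and "right-wound" map onto those two directions is not stated on this page, so that mapping is left open.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    total = 0.0
    # Pair every vertex with its successor, wrapping around at the end.
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_clockwise(ring):
    return signed_area(ring) < 0


ring = [(0, 0), (0, 1), (1, 1), (1, 0)]  # up, right, down: clockwise
reversed_ring = list(reversed(ring))     # same square, opposite winding

print(signed_area(ring), is_clockwise(ring))
print(signed_area(reversed_ring), is_clockwise(reversed_ring))
```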

GeoArrow Internal APIs#

class cuspatial.GeoArrowBuffers(data: typing.Union[dict, cuspatial.geometry.geoarrowbuffers.T], data_locale: object = cudf)#

A GPU GeoArrowBuffers object.

Parameters
data : A dict or a GeoArrowBuffers object.
The GeoArrow format specifies a tabular data format for geometry
information. Supported types include `Point`, `MultiPoint`, `LineString`,
`MultiLineString`, `Polygon`, and `MultiPolygon`. In order to store
these coordinate types in a strictly tabular fashion, columns are
created for Points, MultiPoints, LineStrings, and Polygons.
MultiLineStrings and MultiPolygons are stored in the same data structure
as LineStrings and Polygons. GeoArrowBuffers are constructed from a dict
of host buffers with accepted keys:
* points_xy
* points_z
* multipoints_xy
* multipoints_z
* multipoints_offsets
* lines_xy
* lines_z
* lines_offsets
* mlines
* polygons_xy
* polygons_z
* polygons_polygons
* polygons_rings
* mpolygons
There are no correlations in length between any of the above columns.
Accepted host buffer object types include Python lists and any type that
implements numpy’s `__array_interface__` protocol.
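Because the key names are easy to mistype, it can help to check a dict against the accepted keys before constructing the buffers. This is a hedged sketch: `ACCEPTED_KEYS` simply copies the list above, `unknown_keys` is a hypothetical helper, and how GeoArrowBuffers itself treats unrecognized keys is not stated here.

```python
# The 14 keys listed in the GeoArrowBuffers docstring.
ACCEPTED_KEYS = frozenset({
    "points_xy", "points_z",
    "multipoints_xy", "multipoints_z", "multipoints_offsets",
    "lines_xy", "lines_z", "lines_offsets", "mlines",
    "polygons_xy", "polygons_z", "polygons_polygons", "polygons_rings",
    "mpolygons",
})


def unknown_keys(data):
    """Return the keys of `data` that are not in the accepted list, sorted."""
    return sorted(set(data) - ACCEPTED_KEYS)


candidate = {"points_xy": [0, 0, 1, 1], "point_z": [5, 6]}  # note the typo
print(unknown_keys(candidate))
```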
GeoArrow Format
The GeoArrow format packs complex geometry types into 14 single-column Arrow
tables. This description is included to help explain the GeoArrow
format. Interacting with the GeoArrowBuffers is only required if you want
to convert cudf data to GeoPandas objects without starting from GeoPandas.
The points geometry is the simplest: N points are stored in a length 2*N
buffer with interleaved x,y coordinates. An optional z buffer of length N
can be used.
The multipoints geometry is the second simplest: it is identical to points,
with the addition of a multipoints_offsets buffer. The offsets buffer
stores N+1 indexes. The first multipoint starts at 0, which is always
stored in offsets[0], and ends at offsets[1], which is therefore the length
of the first multipoint geometry. Each subsequent offset is the prefix sum
of the lengths of the preceding multipoints.
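The prefix-sum relationship can be sketched with `itertools.accumulate`. The helper name `offsets_from_lengths` is hypothetical, and the lengths are measured in buffer entries (two per point) on the assumption that the offsets index into the interleaved xy buffer.

```python
from itertools import accumulate


def offsets_from_lengths(lengths):
    """Build an N+1 offsets buffer as the prefix sum of N geometry lengths."""
    return [0] + list(accumulate(lengths))


# Three multipoints of three points each, two buffer entries per point.
lengths = [3 * 2, 3 * 2, 3 * 2]
offsets = offsets_from_lengths(lengths)
print(offsets)

# The lengths can be recovered by differencing adjacent offsets.
recovered = [end - start for start, end in zip(offsets[:-1], offsets[1:])]
print(recovered)
```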
Consider:

buffers = GeoArrowBuffers({
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18]
})

which encodes the following GeoPandas Series:

series = geopandas.GeoSeries([
    MultiPoint([(0, 0), (0, 1), (0, 2)]),
    MultiPoint([(1, 0), (1, 1), (1, 2)]),
    MultiPoint([(2, 0), (2, 1), (2, 2)]),
])

LineString geometry is more complicated than multipoints because the
format allows for the use of LineStrings and MultiLineStrings in the same
buffer, via the mlines key:

buffers = GeoArrowBuffers({
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3]
})

which encodes the following GeoPandas Series:

series = geopandas.GeoSeries([
    LineString([(0, 0), (0, 1), (0, 2)]),
    MultiLineString([
        [(1, 0), (1, 1), (1, 2)],
        [(2, 0), (2, 1), (2, 2)],
    ]),
    LineString([(3, 0), (3, 1), (3, 2)]),
    LineString([(4, 0), (4, 1), (4, 2)]),
])

Polygon geometry includes `mpolygons` for MultiPolygons similar to the
LineString geometry. Polygons are encoded using the same format as
Shapefiles, with left-wound external rings and right-wound internal rings.
An exact example of `GeoArrowBuffers` to `geopandas.Series` is left to the
reader as an exercise. Convert any GeoPandas `Series` or `DataFrame` with
`cuspatial.from_geopandas(geopandas_object)`.

Notes

Legacy cuspatial algorithms depend on separated x and y columns. Access them with the .x and .y properties.

Examples

GeoArrowBuffers accepts a dict as its argument. Valid keys are in the bullet list above. Valid values are any datatype that implements numpy’s __array_interface__. Any or all of the four basic geometry types can be supplied:

buffers = GeoArrowBuffers({
    "points_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18],
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3],
    "polygons_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "polygons_polygons": [0, 1, 2],
    "polygons_rings": [0, 1, 2],
    "mpolygons": [1, 3],
})

or another GeoArrowBuffers:

buffers2 = GeoArrowBuffers(buffers)

Attributes
lines

Contains the coordinates column, an offsets column, and an mlines column.

multipoints

Similar to the Points column with the addition of an offsets column.

points

A simple numeric column.

polygons

Contains the coordinates column, a rings column specifying the beginning and end of every polygon, and a polygons column specifying the beginning, or exterior, ring of each polygon and the end ring.

Methods

copy([deep])

Create a copy of all of the GPU-backed data structures in this GeoArrowBuffers.

to_host

copy(deep=True)#

Create a copy of all of the GPU-backed data structures in this GeoArrowBuffers.

property lines#

Contains the coordinates column, an offsets column, and an mlines column. The mlines column is optional. It stores the indices of the offsets that indicate the beginning and end of each MultiLineString. The absence of an mlines column indicates that the data source contains no MultiLineStrings, only `LineString`s.

property multipoints#

Similar to the Points column with the addition of an offsets column. The offsets column stores the position in the coordinates column at which each MultiPoint in the GeoArrowBuffers begins, from which its size follows.

property points#

A simple numeric column. x and y coordinates are interleaved such that values at even indices are x coordinates and values at odd indices are y coordinates.

property polygons#

Contains the coordinates column, a rings column specifying the beginning and end of every polygon, and a polygons column specifying the beginning, or exterior, ring of each polygon and the end ring. All rings after the first ring are interior rings. Finally, an mpolygons column stores the offsets of the polygons that should be grouped into MultiPolygons.
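One way to read this nesting is as two levels of offsets, sketched below in plain Python. The sketch rests on assumptions that this page does not confirm: that the rings offsets index into the interleaved xy buffer, that the polygons offsets index into the list of rings, and that both hold N+1 entries. The helper `decode_polygons` and the sample values are hypothetical.

```python
def decode_polygons(xy, rings_offsets, polygons_offsets):
    """Return one dict per polygon with an exterior ring and interior rings.

    Assumes rings_offsets index into `xy` and polygons_offsets index into
    the decoded list of rings (an assumption, not confirmed cuspatial behavior).
    """
    rings = []
    for start, end in zip(rings_offsets[:-1], rings_offsets[1:]):
        chunk = xy[start:end]
        rings.append(list(zip(chunk[0::2], chunk[1::2])))

    polygons = []
    for start, end in zip(polygons_offsets[:-1], polygons_offsets[1:]):
        group = rings[start:end]
        # The first ring is the exterior; all later rings are interior.
        polygons.append({"exterior": group[0], "interiors": group[1:]})
    return polygons


xy = [0, 0, 0, 4, 4, 4, 4, 0,   # ring 0: outer square
      1, 1, 2, 1, 2, 2, 1, 2,   # ring 1: hole in the square
      5, 0, 5, 1, 6, 0]         # ring 2: separate triangle
rings_offsets = [0, 8, 16, 22]
polygons_offsets = [0, 2, 3]

polygons = decode_polygons(xy, rings_offsets, polygons_offsets)
for polygon in polygons:
    print(polygon)
```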

class cuspatial.geometry.geocolumn.GeoMeta(meta: Union[cuspatial.geometry.geoarrowbuffers.GeoArrowBuffers, dict])#

Creates input_types and input_lengths for GeoColumns that are created using native GeoArrowBuffers. These will be used to convert to GeoPandas GeoSeries if necessary.

Methods

copy

class cuspatial.geometry.geocolumn.GeoColumn(data: cuspatial.geometry.geoarrowbuffers.GeoArrowBuffers, meta: Optional[cuspatial.geometry.geocolumn.GeoMeta] = None, shuffle_order: Optional[cudf.core.index.Index] = None)#
Parameters
data : A GeoArrowBuffers object
meta : A GeoMeta object (optional)

Notes

The GeoColumn class subclasses NumericalColumn. Combined with _copy_type_metadata, this ensures support for existing cudf algorithms.

Attributes
iloc

Return the i-th row of the GeoSeries.

lines
loc

Not currently supported.

multipoints
points
polygons

Methods

copy([deep])

Create a copy of all of the GPU-backed data structures in this GeoColumn.

max([skipna, min_count])

Compute max of column values.

min([skipna, min_count])

Compute min of column values.

product([skipna, min_count])

Compute product of column values.

sum([skipna, min_count])

Compute sum of column values.

sum_of_squares([skipna, min_count])

Compute sum_of_squares of column values.

cummax

cummin

cumprod

cumsum

to_host

copy(deep=True)#

Create a copy of all of the GPU-backed data structures in this GeoColumn.

property iloc#

Return the i-th row of the GeoSeries.

property loc#

Not currently supported.

max(skipna: bool = None, min_count: int = 0, *args, **kwargs)#

Compute max of column values.

skipna : bool

Whether NA values are skipped.

min_count : int, default 0

The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.
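The interaction of skipna and min_count can be sketched in plain Python. This is an illustrative model of the documented parameters, not the cudf or cuspatial implementation: `reduce_max` is a hypothetical helper, it treats both None and NaN as NA, and it assumes that min_count counts the non-NA entries.

```python
import math


def _is_na(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def reduce_max(values, skipna=True, min_count=0):
    """Max reduction that returns NaN when too few valid entries are present."""
    valid = [v for v in values if not _is_na(v)]
    # Without skipna, any missing value makes the result missing.
    if not skipna and len(valid) != len(values):
        return math.nan
    # Fewer valid entries than min_count (or none at all) yields NaN.
    if len(valid) < min_count or not valid:
        return math.nan
    return max(valid)


print(reduce_max([1, None, 3]))
print(reduce_max([1, None, 3], skipna=False))
print(reduce_max([1, None], min_count=2))
```

The same skipna and min_count parameters appear on the min, product, sum, and sum_of_squares reductions.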

min(skipna: bool = None, min_count: int = 0, *args, **kwargs)#

Compute min of column values.

skipna : bool

Whether NA values are skipped.

min_count : int, default 0

The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.

product(skipna: bool = None, min_count: int = 0, *args, **kwargs)#

Compute product of column values.

skipna : bool

Whether NA values are skipped.

min_count : int, default 0

The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.

sum(skipna: bool = None, min_count: int = 0, *args, **kwargs)#

Compute sum of column values.

skipna : bool

Whether NA values are skipped.

min_count : int, default 0

The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.

sum_of_squares(skipna: bool = None, min_count: int = 0, *args, **kwargs)#

Compute sum_of_squares of column values.

skipna : bool

Whether NA values are skipped.

min_count : int, default 0

The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.