Internals#
This page describes the internal data structures of cuSpatial.
GeoArrow Format#
Geospatial data is context rich: beyond a set of numbers representing coordinates, the points together form geometries that require grouping. For example, 5 points in a plane could be 5 separate points, 2 line segments, a single linestring, or a pentagon. Many geometry libraries store the points in arrays of geometric objects, a layout commonly known as “Array of Structures” (AoS). AoS is not efficient for accelerated computing on parallel devices such as GPUs. The GeoArrow format was therefore introduced to store geodata in a densely packed layout, commonly known as “Structure of Arrays” (SoA).
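The difference between the two layouts can be illustrated with a small sketch. It uses plain NumPy rather than cuSpatial, and the variable names are illustrative only:

```python
import numpy as np

# Array of Structures: one Python object per point.
aos = [{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": 0.0}, {"x": 1.0, "y": 1.0}]

# Structure of Arrays: one contiguous buffer per field.
soa = {
    "x": np.array([p["x"] for p in aos]),
    "y": np.array([p["y"] for p in aos]),
}

# A bulk operation on one field touches a single contiguous buffer.
shifted_x = soa["x"] + 10.0
```

In the SoA layout a whole column can be processed in one vectorized step, which is the access pattern parallel devices favor.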
The GeoArrow format specifies a tabular data format for geometry information. Supported types include Point, MultiPoint, LineString, MultiLineString, Polygon, and MultiPolygon. In order to store these geometry types in a strictly tabular fashion, columns are created for Points, MultiPoints, LineStrings, and Polygons. MultiLineStrings and MultiPolygons are stored in the same data structures as LineStrings and Polygons.
GeoArrow format packs complex geometry types into 14 single-column Arrow tables. See the GeoArrowBuffers docstring for the complete list of keys for the columns.
Examples#
The Point geometry is the simplest. N points are stored in a length 2*N buffer with interleaved x,y coordinates. An optional z buffer of length N can be used.
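As a rough illustration of the interleaved layout, the sketch below splits a 2*N buffer into x and y and attaches an optional z buffer. It uses NumPy only; the buffer names mirror the GeoArrowBuffers keys, but the code does not call cuSpatial and the z values are made up:

```python
import numpy as np

# Three points (0, 0), (0, 1), (0, 2) stored as x0, y0, x1, y1, x2, y2.
points_xy = np.array([0, 0, 0, 1, 0, 2], dtype="float64")
# Optional z buffer: one value per point (values here are illustrative).
points_z = np.array([5, 6, 7], dtype="float64")

x = points_xy[0::2]  # even positions hold x
y = points_xy[1::2]  # odd positions hold y
n_points = len(points_xy) // 2

# One row per point: (x, y, z).
coords = np.column_stack([x, y, points_z])
```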
A MultiPoint is a group of points, and is the second simplest GeoArrow geometry type. It is identical to points, with the addition of a multipoints_offsets buffer. The offsets buffer stores N+1 indices for N multipoints. The first offset is always 0 and is stored in offsets[0]. The second offset is stored in offsets[1], and so on. The size of multipoint i is the difference between offsets[i+1] and offsets[i]. In the example that follows, the offsets index into the interleaved xy buffer, so a difference of 6 values corresponds to 3 points.
Consider:
buffers = GeoArrowBuffers({
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18]
})
which encodes the following GeoPandas GeoSeries:
series = geopandas.GeoSeries([
    MultiPoint([(0, 0), (0, 1), (0, 2)]),
    MultiPoint([(1, 0), (1, 1), (1, 2)]),
    MultiPoint([(2, 0), (2, 1), (2, 2)]),
])
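To make the role of the offsets concrete, here is a minimal decoding sketch. It assumes, as the example values suggest, that the offsets index into the interleaved xy buffer; decode_multipoints is a hypothetical helper, not part of cuSpatial:

```python
import numpy as np

multipoints_xy = np.array([0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2])
multipoints_offsets = np.array([0, 6, 12, 18])


def decode_multipoints(xy, offsets):
    """Return one (k, 2) array of points per multipoint.

    Assumes offsets index the interleaved xy buffer (2 values per point).
    """
    result = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        result.append(xy[start:end].reshape(-1, 2))
    return result


multipoints = decode_multipoints(multipoints_xy, multipoints_offsets)
```

Each entry of multipoints holds the points of one MultiPoint row of the series above.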
LineString geometry is more complicated than multipoints because the format allows LineStrings and MultiLineStrings in the same buffer, via the mlines buffer. The mlines buffer stores 2M indices, where M is the number of MultiLineStrings. The starting and ending LineString offsets of the i-th MultiLineString are stored at mlines[2*i] and mlines[2*i+1] respectively.
Consider:
buffers = GeoArrowBuffers({
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3]
})
which encodes the following GeoPandas GeoSeries:
series = geopandas.GeoSeries([
    LineString([(0, 0), (0, 1), (0, 2)]),
    MultiLineString([
        [(1, 0), (1, 1), (1, 2)],
        [(2, 0), (2, 1), (2, 2)],
    ]),
    LineString([(3, 0), (3, 1), (3, 2)]),
    LineString([(4, 0), (4, 1), (4, 2)]),
])
Note that mlines has 2 entries, and therefore there is 1 MultiLineString in buffers. It consists of 2 LineStrings: the second and third LineStrings defined by lines_offsets.
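The grouping implied by mlines can be sketched as follows. The helper group_lines is hypothetical and assumes that mlines holds sorted, non-overlapping [start, end) pairs of LineString indices, which matches the example but is not confirmed as a general rule:

```python
def group_lines(n_lines, mlines):
    """Group LineString indices into geometries using mlines pairs.

    Assumes mlines holds sorted, non-overlapping, non-empty [start, end) pairs.
    """
    pairs = list(zip(mlines[0::2], mlines[1::2]))
    groups = []
    i = 0
    while i < n_lines:
        match = next(((s, e) for s, e in pairs if s == i and e > s), None)
        if match is not None:
            groups.append(("MultiLineString", list(range(match[0], match[1]))))
            i = match[1]
        else:
            groups.append(("LineString", [i]))
            i += 1
    return groups


lines_offsets = [0, 6, 12, 18, 24, 30]
mlines = [1, 3]
geometries = group_lines(len(lines_offsets) - 1, mlines)
```

With the buffers above, the five LineStrings collapse into four geometries, one of which is a MultiLineString built from LineStrings 1 and 2 (zero-based).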
Polygon geometry includes mpolygons for MultiPolygons, similar to the LineString geometry. Polygons are encoded using the same format as Shapefile, with left-wound external rings and right-wound internal rings.
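Ring winding can be checked numerically with the shoelace formula. The sketch below is generic geometry, not a cuSpatial function; it only reports the sign of the area (positive for counterclockwise rings in a standard x-right, y-up frame) and makes no claim about which sign corresponds to "left-wound" in the text above:

```python
def signed_area(ring):
    """Shoelace formula over a closed ring given as a list of (x, y) tuples."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


# A unit square traversed counterclockwise, then the same ring reversed.
square_ccw = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
square_cw = square_ccw[::-1]

area_ccw = signed_area(square_ccw)
area_cw = signed_area(square_cw)
```

Exterior and interior rings of the same polygon should produce areas of opposite sign.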
GeoArrow Internal APIs#
- class cuspatial.GeoArrowBuffers(data: typing.Union[dict, cuspatial.geometry.geoarrowbuffers.T], data_locale: object = cudf)#
A GPU GeoArrowBuffers object.
- Parameters
- data: A dict or a GeoArrowBuffers object.
The GeoArrow format specifies a tabular data format for geometry information. Supported types include `Point`, `MultiPoint`, `LineString`, `MultiLineString`, `Polygon`, and `MultiPolygon`. In order to store these geometry types in a strictly tabular fashion, columns are created for Points, MultiPoints, LineStrings, and Polygons. MultiLineStrings and MultiPolygons are stored in the same data structures as LineStrings and Polygons. GeoArrowBuffers are constructed from a dict of host buffers with accepted keys:
* points_xy
* points_z
* multipoints_xy
* multipoints_z
* multipoints_offsets
* lines_xy
* lines_z
* lines_offsets
* mlines
* polygons_xy
* polygons_z
* polygons_polygons
* polygons_rings
* mpolygons
There are no correlations in length between any of the above columns. Accepted host buffer object types include Python lists and any type that implements numpy’s `__array_interface__` protocol.
GeoArrow Format
GeoArrow format packs complex geometry types into 14 single-column Arrow tables. This description is included to give a better understanding of the GeoArrow format. Interacting with GeoArrowBuffers directly is only required if you want to convert cudf data to GeoPandas objects without starting from GeoPandas.
The points geometry is the simplest: N points are stored in a length 2*N buffer with interleaved x,y coordinates. An optional z buffer of length N can be used.
The multipoints geometry is the second simplest: it is identical to points, with the addition of a multipoints_offsets buffer. The offsets buffer stores N+1 indices. The first multipoint starts at 0, which is always stored in offsets[0], and ends at offsets[1], which is therefore the length of the first multipoint geometry. Subsequent offsets are the prefix sum of the lengths of the previous multipoints.
Consider:
buffers = GeoArrowBuffers({
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18]
})
which encodes the following GeoPandas GeoSeries:
series = geopandas.GeoSeries([
    MultiPoint([(0, 0), (0, 1), (0, 2)]),
    MultiPoint([(1, 0), (1, 1), (1, 2)]),
    MultiPoint([(2, 0), (2, 1), (2, 2)]),
])
LineString geometry is more complicated than multipoints because the format allows for the use of LineStrings and MultiLineStrings in the same buffer, via the mlines key:
buffers = GeoArrowBuffers({
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3]
})
which encodes the following GeoPandas GeoSeries:
series = geopandas.GeoSeries([
    LineString([(0, 0), (0, 1), (0, 2)]),
    MultiLineString([
        [(1, 0), (1, 1), (1, 2)],
        [(2, 0), (2, 1), (2, 2)],
    ]),
    LineString([(3, 0), (3, 1), (3, 2)]),
    LineString([(4, 0), (4, 1), (4, 2)]),
])
Polygon geometry includes `mpolygons` for MultiPolygons, similar to the LineString geometry. Polygons are encoded using the same format as Shapefiles, with left-wound external rings and right-wound internal rings. An exact example of `GeoArrowBuffers` to `geopandas.Series` is left to the reader as an exercise. Convert any GeoPandas `Series` or `DataFrame` with `cuspatial.from_geopandas(geopandas_object)`.
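The prefix-sum relationship described for the offsets buffers can be sketched with NumPy. This assumes, as in the multipoints example, that offsets count values in the interleaved xy buffer (2 per point); the variable names are illustrative:

```python
import numpy as np

# Number of points in each of three multipoints.
points_per_geometry = np.array([3, 3, 3])

# Each point contributes two values (x and y) to the interleaved buffer.
values_per_geometry = points_per_geometry * 2

# Offsets start at 0 and accumulate the lengths of all previous geometries.
offsets = np.concatenate([[0], np.cumsum(values_per_geometry)])

# Differencing the offsets recovers the per-geometry lengths.
recovered = np.diff(offsets)
```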
Notes
Legacy cuspatial algorithms depend on separated x and y columns. Access them with the .x and .y properties.
Examples
GeoArrowBuffers accepts a dict as its argument. Valid keys are in the bullet list above. Valid values are any datatype that implements numpy’s __array_interface__. Any or all of the four basic geometry types are supported as arguments:
buffers = GeoArrowBuffers({
    "points_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2],
    "multipoints_offsets":
        [0, 6, 12, 18],
    "lines_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "lines_offsets":
        [0, 6, 12, 18, 24, 30],
    "mlines":
        [1, 3],
    "polygons_xy":
        [0, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 3, 0,
         3, 1, 3, 2, 4, 0, 4, 1, 4, 2],
    "polygons_polygons": [0, 1, 2],
    "polygons_rings": [0, 1, 2],
    "mpolygons": [1, 3],
})
or another GeoArrowBuffers:
buffers2 = GeoArrowBuffers(buffers)
- Attributes
lines
Contains the coordinates column, an offsets column, and a mlines column.
multipoints
Similar to the Points column with the addition of an offsets column.
points
A simple numeric column.
polygons
Contains the coordinates column, a rings column specifying the beginning and end of every polygon, a polygons column specifying the beginning, or exterior, ring of each polygon and the end ring.
Methods
copy([deep])
Create a copy of all of the GPU-backed data structures in this GeoArrowBuffers.
to_host
- copy(deep=True)#
Create a copy of all of the GPU-backed data structures in this GeoArrowBuffers.
- property lines#
Contains the coordinates column, an offsets column, and a mlines column. The mlines column is optional. The mlines column stores the indices of the offsets that indicate the beginning and end of each MultiLineString segment. The absence of an mlines column indicates there are no MultiLineStrings in the data source, only `LineString`s.
- property multipoints#
Similar to the Points column with the addition of an offsets column. The offsets column stores where each MultiPoint in the GeoArrowBuffers begins and ends within the coordinates column, and therefore its size.
- property points#
A simple numeric column. x and y coordinates are interleaved such that even indices hold x values and odd indices hold y values.
- property polygons#
Contains the coordinates column, a rings column specifying the beginning and end of every polygon, and a polygons column specifying the beginning, or exterior, ring of each polygon and the end ring. All rings after the first ring are interior rings. Finally, an mpolygons column stores the offsets of the polygons that should be grouped into MultiPolygons.
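The nested indexing can be sketched under explicit assumptions: the rings buffer is taken to hold N+1 offsets into the interleaved coordinates, the polygons buffer N+1 offsets into the rings, and every polygon has at least one ring. The buffers and the decode_polygons helper below are hypothetical and are not taken from cuSpatial:

```python
import numpy as np

# Hypothetical buffers: one polygon with an exterior ring and one hole.
xy = np.array([0, 0, 4, 0, 4, 4, 0, 0,    # ring 0: 4 points
               1, 1, 2, 1, 2, 2, 1, 1])   # ring 1: 4 points
rings = [0, 8, 16]   # assumed: offsets into xy, one span per ring
polygons = [0, 2]    # assumed: offsets into rings, one span per polygon


def decode_polygons(xy, rings, polygons):
    """Return a list of dicts with an exterior ring and interior rings."""
    decoded = []
    for first, last in zip(polygons[:-1], polygons[1:]):
        ring_coords = [
            xy[rings[r]:rings[r + 1]].reshape(-1, 2) for r in range(first, last)
        ]
        # The first ring is the exterior; all later rings are interior.
        decoded.append({"exterior": ring_coords[0], "interiors": ring_coords[1:]})
    return decoded


decoded = decode_polygons(xy, rings, polygons)
```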
- class cuspatial.geometry.geocolumn.GeoMeta(meta: Union[cuspatial.geometry.geoarrowbuffers.GeoArrowBuffers, dict])#
Creates input_types and input_lengths for GeoColumns that are created using native GeoArrowBuffers. These will be used to convert to GeoPandas GeoSeries if necessary.
Methods
copy
- class cuspatial.geometry.geocolumn.GeoColumn(data: cuspatial.geometry.geoarrowbuffers.GeoArrowBuffers, meta: Optional[cuspatial.geometry.geocolumn.GeoMeta] = None, shuffle_order: Optional[cudf.core.index.Index] = None)#
- Parameters
- data: A GeoArrowBuffers object
- meta: A GeoMeta object (optional)
Notes
The GeoColumn class subclasses NumericalColumn. Combined with _copy_type_metadata, this ensures support for existing cudf algorithms.
- Attributes
Methods
copy([deep])
Create a copy of all of the GPU-backed data structures in this GeoColumn.
max([skipna, min_count])
Compute max of column values.
min([skipna, min_count])
Compute min of column values.
product([skipna, min_count])
Compute product of column values.
sum([skipna, min_count])
Compute sum of column values.
sum_of_squares([skipna, min_count])
Compute sum_of_squares of column values.
cummax
cummin
cumprod
cumsum
to_host
- copy(deep=True)#
Create a copy of all of the GPU-backed data structures in this GeoColumn.
- property iloc#
Return the i-th row of the GeoSeries.
- property loc#
Not currently supported.
- max(skipna: bool = None, min_count: int = 0, *args, **kwargs)#
Compute max of column values.
- skipna: bool
Whether or not NA values must be skipped.
- min_count: int, default 0
The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.
- min(skipna: bool = None, min_count: int = 0, *args, **kwargs)#
Compute min of column values.
- skipna: bool
Whether or not NA values must be skipped.
- min_count: int, default 0
The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.
- product(skipna: bool = None, min_count: int = 0, *args, **kwargs)#
Compute product of column values.
- skipna: bool
Whether or not NA values must be skipped.
- min_count: int, default 0
The minimum number of entries required for the reduction; with fewer entries, the reduction returns NaN.
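The interaction of skipna and min_count can be illustrated with a pure-Python sketch. reduce_product is a hypothetical stand-in, not the GeoColumn method, and the choice to return NaN when skipna is False and an NA value is present is an assumption borrowed from common dataframe semantics:

```python
import math


def reduce_product(values, skipna=True, min_count=0):
    """Product of values, returning NaN when too few valid entries exist."""
    valid = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    # Assumption: without skipna, any NA value makes the result NaN.
    if not skipna and len(valid) != len(values):
        return math.nan
    # Fewer valid entries than min_count also yields NaN.
    if len(valid) < min_count:
        return math.nan
    return math.prod(valid)


skipped = reduce_product([2.0, 3.0, math.nan])
too_few = reduce_product([2.0, math.nan], min_count=2)
not_skipped = reduce_product([2.0, math.nan], skipna=False)
```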