String UDF memory management#
Inside UDFs, some string methods like concat()
and replace()
produce new
strings. For a CUDA thread to create a new string, it must dynamically allocate
memory on the device to hold the string’s data. The cleanup of this memory by the
thread later on must preserve Python’s semantics, for example when the variable
corresponding to the new string goes out of scope. To accomplish this in cuDF, UDF
memory management (allocation and freeing of the underlying data) is handled
transparently for the user, via a reference counting mechanism. This reference
counting implementation is distinct from the one in Python and has its own interface
and requirements.
Along with the code generated from the functions and operations within the passed UDF, numba-cuda automatically weaves the necessary reference counting operations into the final device function that each thread will ultimately run. This allows the programmer to pass a UDF that uses memory-allocating types such as strings, generally as one would in Python:
def udf(string):
    if len(string) > 2:
        result = string.upper()   # new allocation
    else:
        result = string + string  # new allocation
    return result + 'abc'
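Such a UDF is applied to a string column through the usual Series.apply path. A minimal usage sketch (the column contents here are purely illustrative):

import cudf

sr = cudf.Series(["ab", "abcd"])

# Each row is processed by one CUDA thread; the intermediate strings created
# inside udf are allocated, reference counted, and freed on the device without
# any action from the user.
result = sr.apply(udf)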
Numba memory management and the Numba Runtime (NRT)#
The API functions used to update the reference count associated with a variable derive from Numba’s memory management for nopython mode code. This runtime library (NRT or Numba Runtime) provides implementations for the operators that increase and decrease a variable’s reference count (INCREF/DECREF), and Numba analyzes the passed UDF to determine where the calls targeting these implementations should go and which objects they should operate on. Below are some examples of situations where numba-cuda would detect that a reference counting operation needs to be applied to an object:
The creation of a new object: During object creation, memory is allocated and a structure to track the memory is created and initialized.
When new references are created: For example during assignments, the reference count of the assigned-from object is incremented.
When references are destroyed: For example when an object goes out of scope, or when an object holding a reference is destroyed. During these events, the reference count of the tracked object is decremented. If the reference count of an object falls to zero, the Numba Runtime will invoke its destructor.
When an intermediate variable is no longer needed: For example when creating a new variable for inspection then disposing of it, as in
string.upper() == 'A'
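As a rough illustration of where these operations land (the exact INCREF/DECREF placement is decided by Numba’s analysis of the compiled IR and may differ from this sketch):

def udf(string):
    upper = string.upper()   # a new tracked object is created; its count starts at 1
    result = upper + 'abc'   # a second new tracked object, also starting at 1
    # at function exit `upper` is no longer referenced: a DECREF drops its count
    # to 0 and its destructor frees the device allocation; `result` is returned,
    # so the caller takes over its surviving reference
    return result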
Numba does not reference count every variable, as only variables with an associated
heap memory allocation need to be tracked. Numba determines if this is true for a
variable during compilation by querying the properties of the datamodel underlying
the variable’s type. We provide a string type ManagedUDFString
that implements
the required properties and backs any new string that is created on the device. Its
datamodel is defined under the data structures section below and is registered to
the extension type as shown.
Data structures#
The core concept is a ManagedUDFString
numba extension type that fulfills the
requirements to be reference counted by NRT. It is composed of a cudf::udf_string
that owns the string data and a pointer to a MemInfo
object, which the NRT API
uses for reference counting.
from cudf.core.udf.strings_typing import ManagedUDFString
from numba.cuda.descriptor import cuda_target


@register_model(ManagedUDFString)
class managed_udf_string_model(models.StructModel):
    _members = (("meminfo", types.voidptr), ("udf_string", udf_string))

    def __init__(self, dmm, fe_type):
        super().__init__(dmm, fe_type, self._members)

    def has_nrt_meminfo(self):
        return True

    def get_nrt_meminfo(self, builder, value):
        # effectively returns self.meminfo in IR form
        udf_str_and_meminfo = numba.core.cgutils.create_struct_proxy(ManagedUDFString())(
            cuda_target.target_context, builder, value=value
        )
        return udf_str_and_meminfo.meminfo
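The ManagedUDFString referenced above is an ordinary Numba extension type; the interesting behaviour lives in the datamodel. A sketch of roughly what the type itself looks like (the real definition in cudf.core.udf.strings_typing may carry additional machinery):

from numba.core import types

class ManagedUDFString(types.Type):
    # a plain Numba type; the reference counting behaviour comes from the
    # datamodel registered for it above
    def __init__(self):
        super().__init__(name="managed_udf_string")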
The actual NRT APIs for adjusting the reference count of an object expect to operate
on this MemInfo
object itself rather than the instance:
extern "C"
struct MemInfo {
cuda::atomic<size_t, cuda::thread_scope_device> refct;
NRT_dtor_function dtor;
void* dtor_info;
void* data;
size_t size;
};
typedef struct MemInfo NRT_MemInfo;
Every instance of a reference counted type within the scope of a CUDA thread executing
the UDF is associated with a separate instance of this MemInfo
struct. An INCREF or
DECREF on the instance in numba’s intermediate representation formed during compilation
will resolve to an increase or decrease of the refct
of the MemInfo
associated
with that instance. The NRT_decref implementation calls the dtor
on the data
if
the refct
is found to be zero:
extern "C" __device__ void NRT_decref(NRT_MemInfo* mi)
{
if (mi != NULL) {
mi->refct--;
if (mi->refct == 0) { NRT_MemInfo_call_dtor(mi); }
}
}
NRT Requirements#
For a type to participate in Numba’s reference counting correctly, the following must be true:
1. The datamodel for the type needs to report that it has a meminfo. This is done by returning True from has_nrt_meminfo.
2. The datamodel must expose the location of the meminfo for that instance to numba’s lowering phase. This means implementing get_nrt_meminfo() such that it returns the meminfo in a predictable location in heap memory.
3. Operators or functions that return the type must initialize the meminfo and place it at the location numba will report it exists at through (2). This is done in the lowering for the operations we support, such as concat.
ManagedUDFString
fulfills (2) by tying the MemInfo and the string instance that it owns
together into a parent struct. This allows (2) to be implemented by just returning its own
.meminfo
member, effectively relating the meminfo location to self
via an offset.
Lowering for operations like concat populates this member before returning.
cuDF string data structures#
On the C++ side, libcudf permits storing entire columns of strings. The
cudf::string_view
class is a non-owning view of a string — usually a
single row in a libcudf column — that provides a convenient abstraction
over working with individual strings in device code, for example in custom
kernels. cuDF Python introduces the cudf::strings::udf::udf_string
class,
an owning container around a single string. This class is used by the numba UDF
code to create new strings in device code. All libcudf string functions are made
available in cuDF Python UDFs by constructing cudf::string_view
instances
that view the strings owned by udf_string
instances.
The cuDF extensions to Numba generate code to manipulate instances of these classes, so we outline the members of these classes to aid in understanding them. These classes also have various methods; consult the cuDF C++ Developer Documentation for further details of these structures.
class string_view {
// A pointer to the underlying string data
char const* p{};
// The length of the underlying string data in bytes
size_type bytes{};
// The offset into the underlying string data in characters
size_type char_pos{};
// The offset into the underlying string data in bytes
size_type byte_pos{};
};
class udf_string {
// A pointer to the underlying string data
char* m_data{};
// The length of the string data in bytes
cudf::size_type m_bytes{};
// The size of the underlying allocation in bytes
cudf::size_type m_capacity{};
};
Note
A udf_string
has a destructor that frees the underlying string data. This is
important, because the C++ destructor is invoked during destruction of a
Python-side Managed UDF String object.
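From the UDF author's point of view this split is invisible, but it helps to keep in mind which operations read through the non-owning view and which produce a new owning string. A hedged illustration (the precise set of supported methods is described in the cuDF UDF documentation):

def udf(string):
    if string.find('a') >= 0:    # view-only operation; no new allocation
        return string.upper()    # allocates a new owning string (udf_string + MemInfo)
    return string + '!'          # concatenation also allocates a new owning string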
Implementation#
The cuDF implementation for Managed UDF Strings is required to provide:
1. Typing and lowering for Managed UDF String operations. The typing has no special properties; it is similar to any other typing implementation in a Numba extension. The lowering is required to ensure that NRT_MemInfo objects for each managed object are created and initialized correctly.
2. C++ implementations of string functions, some of which use libcudf’s C++ string functionality. Other functions are provided by the strings_udf C++ library in cuDF Python. These help with the allocation of data and implement the required destructors.
3. Numba shim functions that adapt calls to C++ code for use in Numba code and Numba extensions.
4. Conversion of String UDF data to and from cudf::column.
Use of C++ code for string functionality is not a hard requirement for implementing string support in a Numba extension - it is instead a pragmatic choice so that the Python and C++ sides of cuDF can share a single implementation for string operations instead of trying to keep two separately-maintained implementations in sync.
The majority of the complexity in the implementation comes from two areas:
1. Combining the requirement to use C++ implementations with the need to provide correct initialization of NRT_MemInfo objects, and
2. Conversion of Managed UDF String objects back into cuDF columns when a UDF returns strings.
String Lifecycle Details#
Let’s trace the complete lifecycle of a string created by result = str1 + str2
in a UDF:
Phase 1: Compilation#
1.1 Numba Analysis
# User UDF
def my_udf(str1, str2):
    result = str1 + str2
    return result
Typing phase identifies str1 + str2 as returning a ManagedUDFString
Lowering phase begins for the + operator
1.2 Stack Allocation
managed_ptr = builder.alloca(
context.data_model_manager[managed_udf_string].get_value_type()
)
Allocates stack space for the complete ManagedUDFString instance
At this point, both fields are uninitialized
1.3 Member Pointer Extraction
udf_str_ptr = builder.gep(managed_ptr, [ir.IntType(32)(0), ir.IntType(32)(1)])
Gets pointer to the
udf_string
member within the allocated struct
Phase 2: String Creation via Shim Function#
2.1 Shim Function Call
meminfo = context.compile_internal(
builder, call_concat_string_view,
types.voidptr(_UDF_STRING_PTR, _STR_VIEW_PTR, _STR_VIEW_PTR),
(udf_str_ptr, lhs_ptr, rhs_ptr)
)
2.2 Inside the Shim Function
extern "C" __device__ int concat_shim(void** out_meminfo,
void* output_udf_str,
void* const* lhs,
void* const* rhs) {
auto lhs_sv = reinterpret_cast<cudf::string_view const*>(lhs);
auto rhs_sv = reinterpret_cast<cudf::string_view const*>(rhs);
// Perform actual concat- allocates GPU memory for result
auto result_str = cudf::strings::udf::concat(*lhs_sv, *rhs_sv);
// Place result into pre-allocated stack space using placement new
auto udf_str_ptr = new (output_udf_str) udf_string(std::move(result_str));
// Create and return the meminfo
*out_meminfo = make_meminfo_for_new_udf_string(udf_str_ptr);
return 0;
}
Critically, in the above the final string is constructed through placement new, which relieves the compiler of the responsibility for cleaning up the cudf::udf_string created there.
2.3 MemInfo Creation Details
__device__ NRT_MemInfo* make_meminfo_for_new_udf_string(udf_string* udf_str) {
struct mi_str_allocation {
NRT_MemInfo mi;
udf_string st;
};
// Single heap allocation for both structures
mi_str_allocation* heap_allocation = (mi_str_allocation*)NRT_Allocate(sizeof(mi_str_allocation));
NRT_MemInfo* mi_ptr = &(heap_allocation->mi);
udf_string* heap_str_ptr = &(heap_allocation->st);
// Initialize MemInfo pointing to co-allocated string
NRT_MemInfo_init(mi_ptr, heap_str_ptr, 0, udf_str_dtor, NULL);
// Copy string data to heap location
memcpy(heap_str_ptr, udf_str, sizeof(udf_string));
return mi_ptr;
}
mi_str_allocation
is similar in structure to ManagedUDFString
but has
a MemInfo
struct value as its first member rather than a pointer.
Phase 3: Object Assembly and Return#
3.1 Final Assembly
managed = cgutils.create_struct_proxy(managed_udf_string)(context, builder)
managed.meminfo = meminfo # Points to heap MemInfo
return managed._getvalue()
3.2 Current Memory State
Stack: ManagedUDFString struct with valid meminfo pointer and udf_string data
Heap: Co-allocated MemInfo and udf_string structures
GPU Memory: String data owned by heap-allocated udf_string
Reference Count: 1 (object just created)
Phase 4: Runtime Usage and Reference Management#
4.1 Assignment Operations
Within the broader kernel being launched, the result of the overall UDF is assigned:
result = my_udf(input_string)
At this point, result is a fully initialized ManagedUDFString:
Numba detects assignment of the reference counted return value
Automatically inserts NRT_incref(managed.meminfo)
heap_allocation->mi.refct becomes 2
The passed UDF exits, causing an NRT_decref(managed.meminfo) (the count returns to 1).
4.2 Setitem into the final array
The final line of the containing kernel sets the result into the output array:
output_string_ary[tid] = result
Adds an incref, bumping the refcount back up to 2.
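Putting 4.1 and 4.2 together, the kernel that wraps the UDF behaves roughly like the sketch below (the kernel and array names are illustrative, not the actual generated code):

from numba import cuda

@cuda.jit
def string_udf_kernel(input_strings, output_string_ary, size):
    tid = cuda.grid(1)
    if tid < size:
        # concatenates the row with itself, using my_udf from 1.1 above
        result = my_udf(input_strings[tid], input_strings[tid])
        # the assignment increfs the returned string (refct 2); the UDF's own
        # exit then decrefs its internal reference, leaving refct at 1
        output_string_ary[tid] = result
        # the setitem increfs the stored string again (refct 2); at kernel exit
        # `result` is decref'd (refct 1) and the last reference is released by
        # the freeing kernel once the output column has been built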
Phase 5: Destruction Sequence#
5.1 Final Reference Release
The kernel being launched is ultimately a void function. Any variables contained
locally therein will be decref’d at the function’s exit, as in any other function.
result variable decref’d, but still referred to by the output array
heap_allocation->mi.refct becomes 1
5.2 Destructor Execution
The function column_from_managed_udf_string_array
creates a cudf::column
from the output buffer containing the strings. cuDF launches a freeing kernel
that decrefs all the result strings one last time:
def free_managed_udf_string_array(ary, size):
    gid = cuda.grid(1)
    if gid < size:
        NRT_decref(ary[gid])
NRT_MemInfo_call_dtor
invokes the destructor for the object
__device__ void udf_str_dtor(void* udf_str, size_t size, void* dtor_info) {
auto ptr = reinterpret_cast<udf_string*>(udf_str);
ptr->~udf_string();
}
A MemInfo dies after invoking its destructor - the NRT API ensures that once this is done, the pointer originally obtained from NRT_Allocate is freed. This has the effect of freeing the entire mi_str_allocation.
5.3 Final Memory State
GPU String Memory: Freed
Heap MemInfo Block: Freed
Stack: Original ManagedUDFString becomes invalid/out-of-scope
Reference Count: N/A (object destroyed)
cuDF: A cudf::column of string type containing the result of the UDF