String handling

Series.str exposes the values of a Series as strings and provides vectorized methods that operate on them. Each method or property is accessed as Series.str.<function/property>.

Series.str

Vectorized string functions for Series and Index.

byte_count()

Computes the number of bytes of each string in the Series/Index.
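A byte count can differ from a character count whenever a string contains multi-byte characters. The following pure-Python sketch illustrates the difference per element. It assumes the strings are UTF-8 encoded, and `utf8_byte_count` is a hypothetical helper used for illustration, not part of the library.

```python
def utf8_byte_count(values):
    """Return the UTF-8 byte length of each string; None (null) stays None."""
    return [None if v is None else len(v.encode("utf-8")) for v in values]


data = ["abc", "héllo", "", None]

# Bytes per string: "é" occupies two bytes in UTF-8.
byte_counts = utf8_byte_count(data)

# Characters per string, for comparison.
char_counts = [None if v is None else len(v) for v in data]
```

Here "héllo" has five characters but six bytes.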

capitalize()

Convert strings in the Series/Index to be capitalized.

cat()

Concatenate strings in the Series/Index with given separator.

center(width[, fillchar])

Pad the left and right sides of strings in the Series/Index with an additional character.

character_ngrams([n, as_list])

Generate the n-grams from characters in a column of strings.
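As a rough illustration of what a character n-gram is, the sketch below slides a window of `n` characters over a single string. The default of `n=2` and the handling of strings shorter than `n` are assumptions made for this example, and the helper is hypothetical rather than the library's implementation.

```python
def character_ngrams(text, n=2):
    """Return every run of n consecutive characters in text, in order."""
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = character_ngrams("abcd")
trigrams = character_ngrams("abcd", n=3)
too_short = character_ngrams("a", n=2)
```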

character_tokenize()

Each string is split into individual characters.

code_points()

Returns an array by filling it with the UTF-8 code point values for each character of each string.

contains(pat[, case, flags, na, regex])

Test if pattern or regex is contained within a string of a Series or Index.

count(pat[, flags])

Count occurrences of pattern in each string of the Series/Index.

detokenize(indices[, separator])

Combines tokens into strings by concatenating them in the order in which they appear in the indices column.
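The grouping described above can be sketched in plain Python: each token carries a row index, and tokens that share an index are joined in order of appearance. This is a minimal illustration under the assumption that `indices` has one entry per token and that output rows are ordered by index. The function is a hypothetical stand-in, not the library's code.

```python
def detokenize(tokens, indices, separator=" "):
    """Join tokens that share the same row index, preserving token order."""
    rows = {}
    for token, row in zip(tokens, indices):
        rows.setdefault(row, []).append(token)
    # Emit one joined string per distinct row index, in ascending order.
    return [separator.join(rows[row]) for row in sorted(rows)]


tokens = ["hello", "world", "one", "two", "three"]
indices = [0, 0, 1, 1, 2]
joined = detokenize(tokens, indices)
```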

edit_distance(targets)

The targets strings are measured against the strings in this instance using the Levenshtein edit distance algorithm.

edit_distance_matrix()

Computes the edit distance between strings in the series.
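For readers unfamiliar with the Levenshtein edit distance used by edit_distance and edit_distance_matrix, here is a small pure-Python sketch of the classic dynamic-programming formulation, followed by a pairwise matrix. It illustrates the metric only. The function names are hypothetical, and the library's GPU implementation will differ.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_matrix(strings):
    """Pairwise edit distances between all strings in the list."""
    return [[levenshtein(a, b) for b in strings] for a in strings]


matrix = distance_matrix(["kitten", "sitting", "kitten"])
```

The matrix is symmetric with zeros on the diagonal, since each string is at distance zero from itself.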

endswith(pat)

Test if the end of each string element matches a pattern.

extract(pat[, flags, expand])

Extract capture groups in the regex pat as columns in a DataFrame.

filter_alphanum([repl, keep])

Remove non-alphanumeric characters from strings in this column.

filter_characters(table[, keep, repl])

Remove characters from each string using the character ranges in the given mapping table.

filter_tokens(min_token_length[, ...])

Remove tokens from within each string in the series that are smaller than min_token_length and optionally replace them with the replacement string.

find(sub[, start, end])

Return the lowest index in each string in the Series/Index where the substring is fully contained between [start:end].

findall(pat[, flags])

Find all occurrences of pattern or regular expression in the Series/Index.

find_multiple(patterns)

Find all first occurrences of patterns in the Series/Index.

get([i])

Extract element from each component at specified position.

get_json_object(json_path, *[, ...])

Applies a JSONPath string to an input strings column where each row in the column is a valid JSON string.

hex_to_int()

Returns integer value represented by each hex string.

htoi()

Returns integer value represented by each hex string.
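hex_to_int() and htoi() share the same description. As a per-element illustration of the conversion, the sketch below uses Python's built-in `int` with base 16. The helper is hypothetical, and how the library treats prefixes such as "0x" or invalid input is not covered here.

```python
def hex_to_int(values):
    """Interpret each string as a base-16 integer; None (null) stays None."""
    return [None if v is None else int(v, 16) for v in values]


converted = hex_to_int(["FF", "1A2B", "0", None])
```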

index(sub[, start, end])

Return the lowest index in each string where the substring is fully contained between [start:end].

insert([start, repl])

Insert the specified string into each string in the specified position.

ip2int()

Convert IP address strings to integers.

ip_to_int()

Convert IP address strings to integers.
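To show what the integer form of an IP address looks like, here is a short sketch using Python's standard `ipaddress` module. It assumes dotted-quad IPv4 input, and the helper name is chosen for illustration only.

```python
import ipaddress


def ipv4_to_int(address):
    """Pack the four octets of a dotted-quad IPv4 address into one integer."""
    return int(ipaddress.IPv4Address(address))


loopback = ipv4_to_int("127.0.0.1")
private = ipv4_to_int("192.168.0.1")
```

Each octet contributes eight bits, so "192.168.0.1" becomes 192·2^24 + 168·2^16 + 0·2^8 + 1.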

is_consonant(position)

Return true for strings where the character at position is a consonant.

is_vowel(position)

Return true for strings where the character at position is a vowel (that is, not a consonant).

isalnum()

Check whether all characters in each string are alphanumeric.

isalpha()

Check whether all characters in each string are alphabetic.

isdecimal()

Check whether all characters in each string are decimal.

isdigit()

Check whether all characters in each string are digits.

isempty()

Check whether each string is an empty string.

isfloat()

Check whether all characters in each string form a floating-point value.

ishex()

Check whether all characters in each string form a hex integer.

isinteger()

Check whether all characters in each string form an integer.

isipv4()

Check whether all characters in each string form an IPv4 address.

isspace()

Check whether all characters in each string are whitespace.

islower()

Check whether all characters in each string are lowercase.

isnumeric()

Check whether all characters in each string are numeric.

isupper()

Check whether all characters in each string are uppercase.

istimestamp(format)

Check whether all characters in each string can be converted to a timestamp using the given format.

istitle()

Check whether each string is title formatted.

jaccard_index(input, width)

Compute the Jaccard index between this column and the given input strings column.
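The Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to sets of character substrings, assuming that `width` is the number of characters in each substring. Both that interpretation and the helper names are assumptions made for illustration.

```python
def substring_set(text, width):
    """Set of all substrings of text that are exactly width characters long."""
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def jaccard_index(a, b, width):
    """|A ∩ B| / |A ∪ B| over the width-character substrings of a and b."""
    set_a, set_b = substring_set(a, width), substring_set(b, width)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings shorter than width: defined as 0 in this sketch
    return len(set_a & set_b) / len(union)


identical = jaccard_index("abcd", "abcd", 2)
overlap = jaccard_index("abcd", "bcde", 2)
disjoint = jaccard_index("abc", "xyz", 2)
```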

join([sep, string_na_rep, sep_na_rep])

Join lists contained as elements in the Series/Index with passed delimiter.

len()

Computes the length of each element in the Series/Index.

like(pat[, esc])

Test if a like pattern matches a string of a Series or Index.
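A "like pattern" is assumed here to follow SQL LIKE conventions, where `%` matches any run of characters (including none) and `_` matches exactly one character. The sketch below translates such a pattern into a regular expression. The `esc` argument is treated as a single escape character that makes the following character literal. This is an illustrative translation, not the library's matcher.

```python
import re


def like_to_regex(pattern, esc=None):
    """Translate a LIKE-style pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if esc and ch == esc and i + 1 < len(pattern):
            # Escaped character: match it literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern, esc=None):
    """True when the whole value matches the LIKE-style pattern."""
    return like_to_regex(pattern, esc).fullmatch(value) is not None
```

For example, `like("abc", "a%")` matches, while `like("500", "50!%", esc="!")` does not because the escaped `%` must appear literally.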

ljust(width[, fillchar])

Pad the right side of strings in the Series/Index with an additional character.

lower()

Converts all characters to lowercase.

lstrip([to_strip])

Remove leading characters.

match(pat[, case, flags])

Determine if each string matches a regular expression.

minhash([seeds, width])

Compute the minhash of a strings column.

ngrams([n, separator])

Generate the n-grams from a set of tokens, where each record in the series is treated as a token.

ngrams_tokenize([n, delimiter, separator])

Generate the n-grams using tokens from each string.

normalize_characters([do_lower])

Normalizes string characters for tokenizing.

normalize_spaces()

Remove extra whitespace between tokens and trim whitespace from the beginning and the end of each string.
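The effect described above can be approximated per string in plain Python by splitting on whitespace and rejoining with single spaces. Which characters the library counts as whitespace is not stated here, so this sketch simply uses Python's definition.

```python
def normalize_spaces(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return " ".join(text.split())


cleaned = normalize_spaces("  hello   world \t")
```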

pad(width[, side, fillchar])

Pad strings in the Series/Index up to width.

partition([sep, expand])

Split the string at the first occurrence of sep.

porter_stemmer_measure()

Compute the Porter Stemmer measure for each string.
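In Porter's stemming algorithm, a word is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels, and the measure is m. The sketch below computes m by counting vowel-to-consonant transitions. It follows Porter's original definition, in which "y" is a vowel only when preceded by a consonant. Whether the library treats "y" the same way is an assumption, and the helper names are hypothetical.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def porter_measure(word):
    """Count the VC sequences, i.e. the m in [C](VC)^m[V]."""
    w = word.lower()
    flags = [is_consonant(w, i) for i in range(len(w))]
    # Each vowel immediately followed by a consonant starts a new VC pair.
    return sum(1 for a, b in zip(flags, flags[1:]) if not a and b)


measures = [porter_measure(w) for w in ["tree", "trouble", "private"]]
```

The example words come from Porter's paper, where "tree" has measure 0, "trouble" has measure 1, and "private" has measure 2.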

repeat(repeats)

Duplicate each string in the Series or Index.

removeprefix(prefix)

Remove a prefix from an object series.

removesuffix(suffix)

Remove a suffix from an object series.

replace(pat, repl[, n, case, flags, regex])

Replace occurrences of pattern/regex in the Series/Index with some other string.

replace_tokens(targets, replacements[, ...])

The targets tokens are searched for within each string in the series and replaced with the corresponding replacements if found.

replace_with_backrefs(pat, repl)

Use the repl back-ref template to create a new string with the extracted elements found using the pat expression.
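Back-reference templates work much like the replacement argument of Python's `re.sub`, where `\1`, `\2`, and so on stand for captured groups. The sketch below shows that idea on a single string using the standard `re` module. It is an analogy, and the library's regex dialect may differ in details.

```python
import re


def replace_with_backrefs(text, pat, repl):
    """Rewrite each match of pat using the captured groups named in repl."""
    return re.sub(pat, repl, text)


# Swap the first two digits found and prefix them with "V".
swapped = [replace_with_backrefs(s, r"(\d)(\d)", r"V\2\1") for s in ["A543", "Z756"]]
```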

rfind(sub[, start, end])

Return the highest index in each string in the Series/Index where the substring is fully contained between [start:end].

rindex(sub[, start, end])

Return the highest index in each string where the substring is fully contained between [start:end].

rjust(width[, fillchar])

Pad the left side of strings in the Series/Index with an additional character.

rpartition([sep, expand])

Split the string at the last occurrence of sep.
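The difference between splitting at the first and the last occurrence of a separator can be seen with Python's built-in `str.partition` and `str.rpartition`, which behave analogously on a single string. Whether the accessor returns the three parts as columns or as lists depends on `expand` and is not shown here.

```python
text = "a_b_c"

# Split at the first separator: everything after it stays together.
first = text.partition("_")

# Split at the last separator: everything before it stays together.
last = text.rpartition("_")

# When the separator is absent, the whole string lands on one side.
missing_first = "abc".partition("_")
missing_last = "abc".rpartition("_")
```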

rsplit([pat, n, expand, regex])

Split strings around given separator/delimiter, starting from the end of each string.

rstrip([to_strip])

Remove trailing characters.

slice([start, stop, step])

Slice substrings from each element in the Series or Index.

slice_from(starts, stops)

Return substring of each string using positions for each string.

slice_replace([start, stop, repl])

Replace the specified section of each string with a new string.
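As a per-string illustration, replacing a slice amounts to keeping the text before `start`, inserting `repl`, and keeping the text from `stop` onward. The defaults below (an omitted `start` means the beginning, an omitted `stop` means the end) are assumptions borrowed from the usual slicing convention, and the helper is hypothetical.

```python
def slice_replace(text, start=None, stop=None, repl=""):
    """Replace text[start:stop] with repl."""
    head = text[:start] if start is not None else ""  # kept prefix
    tail = text[stop:] if stop is not None else ""    # kept suffix
    return head + repl + tail


middle = slice_replace("abcdef", 1, 3, "X")
prefix = slice_replace("abcdef", stop=2, repl="__")
suffix = slice_replace("abcdef", start=4, repl="!")
```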

split([pat, n, expand, regex])

Split strings around given separator/delimiter.

startswith(pat)

Test if the start of each string element matches a pattern.

strip([to_strip])

Remove leading and trailing characters.

swapcase()

Change each lowercase character to uppercase and vice versa.

title()

Uppercase the first letter of each word (the first letter after a space) and lowercase the rest.

token_count([delimiter])

Each string is split into tokens using the provided delimiter, and the number of tokens in each string is returned.

tokenize([delimiter])

Each string is split into tokens using the provided delimiter(s).
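The relationship between token_count and tokenize can be sketched in plain Python. This illustration assumes that tokenizing yields one flat sequence of tokens across all strings, that counting yields one number per string, and that empty tokens are dropped. The helpers are hypothetical and use whitespace when no delimiter is given.

```python
def split_tokens(text, delimiter=None):
    """Split one string into non-empty tokens (whitespace if delimiter is None)."""
    return [t for t in text.split(delimiter) if t]


def tokenize(values, delimiter=None):
    """All tokens from all strings, flattened in order of appearance."""
    tokens = []
    for v in values:
        tokens.extend(split_tokens(v, delimiter))
    return tokens


def token_count(values, delimiter=None):
    """Number of tokens in each string."""
    return [len(split_tokens(v, delimiter)) for v in values]


data = ["the quick fox", "", "a b"]
all_tokens = tokenize(data)
counts = token_count(data)
```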

translate(table)

Map all characters in the string through the given mapping table.

upper()

Convert each string to uppercase.

url_decode()

Returns a URL-decoded format of each string.

url_encode()

Returns a URL-encoded format of each string.

wrap(width, **kwargs)

Wrap long strings in the Series/Index to be formatted in paragraphs with length less than a given width.

zfill(width)

Pad strings in the Series/Index by prepending '0' characters.