String Tools

The string_tools module gathers utilities for string manipulation, mainly cleaning and n-gram hashing.

class string_tools.StringCleaner

Provides tools to clean strings, such as accent removal and standardisation.

clean_string(s, bord='')

Applies cleaning operations to a string, in particular accent removal.

Parameters:
  • s (string) – the string to clean
  • bord (string) – an optional character to surround the words with
Returns: the cleaned string
Return type: string
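
The description of clean_string can be illustrated with a minimal sketch. The exact cleaning steps are not listed here, so the lower-casing and whitespace normalisation below are assumptions, and clean_string_sketch is a hypothetical name rather than the module's own code. It only shows how accent removal and the bord parameter could fit together.

```python
import unicodedata


def clean_string_sketch(s, bord=""):
    """Hypothetical stand-in for StringCleaner.clean_string."""
    # Assumed step 1: strip accents by decomposing characters (NFD)
    # and dropping the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumed step 2: standardise case and collapse whitespace.
    words = no_accents.lower().split()
    # Surround every word with the border character (no effect when bord is '').
    return " ".join(bord + w + bord for w in words)


print(clean_string_sketch("  Café   Crème ", bord="#"))  # #cafe# #creme#
```

With the default bord='' the words are returned without any delimiter.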

remove_accents(s)

Replaces all accented characters with their unaccented equivalents.

Parameters: s (string) – the string to transform
Returns: the string with accents removed
Return type: string
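
A common way to implement this kind of accent removal in Python is Unicode decomposition. The sketch below uses the standard unicodedata module; whether string_tools uses the same approach is an assumption, and remove_accents_sketch is a hypothetical name.

```python
import unicodedata


def remove_accents_sketch(s):
    """Hypothetical stand-in for StringCleaner.remove_accents."""
    # NFD splits an accented character into its base letter followed by
    # one or more combining marks (e.g. 'é' -> 'e' + U+0301).
    decomposed = unicodedata.normalize("NFD", s)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents_sketch("Éléphant à Noël"))  # Elephant a Noel
```

Letters that have no canonical decomposition, such as 'ø' or 'ß', pass through this sketch unchanged.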
class string_tools.StringHasher(n=1)

Provides tools to transform a sentence into a bag-of-words vector.

Parameters: n (int) – the size of the n-grams

hash(s)

Transforms a string into an n-gram count representation.

Parameters: s (string) – the string to hash
Returns: the n-gram count representation of the input string
Return type: np.ndarray
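
For n = 1, an n-gram count representation is a bag-of-words vector: one counter per known word. The sketch below illustrates the idea with a plain Python list so that it needs no third-party package, whereas the documented method returns an np.ndarray. The function name, the whitespace tokenisation and the handling of unknown words are assumptions.

```python
def hash_words_sketch(s, ngrams):
    """Hypothetical bag-of-words count over a known list of words (n = 1)."""
    # Map each known word to its position in the output vector.
    index = {gram: i for i, gram in enumerate(ngrams)}
    counts = [0] * len(ngrams)
    # Assumption: tokens are separated by whitespace.
    for token in s.split():
        position = index.get(token)
        if position is not None:  # assumption: unknown words are ignored
            counts[position] += 1
    return counts


vocabulary = ["the", "cat", "sat"]
print(hash_words_sketch("the cat saw the dog", vocabulary))  # [2, 1, 0]
```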

init_ngrams(tokens)

Computes the n-grams from a list of words and assigns them to self.ngrams.

Todo

deal with the case n != 1

Parameters: tokens (list of strings) – list of words from which to compute the n-grams

load_ngrams(ngrams_)

Loads a list of n-grams into self.ngrams.

Parameters: ngrams_ (list of strings) – the list of n-grams to load

print_ngrams()

Prints the list of n-grams.

class string_tools.WordHasher(n=3, bord='#')

Provides tools to transform a string into a bag-of-ngrams vector.

Parameters:
  • n (int) – size of the n-grams
  • bord (string) – delimiter character to surround words with

hash(s)

Transforms a string into an n-gram count representation.

Parameters: s (string) – the string to hash
Returns: the n-gram count representation of the input string
Return type: np.ndarray
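
With the defaults n=3 and bord='#', the n-grams of a word are character trigrams taken over the word surrounded by the delimiter. The sketch below shows one way such a count vector could be built. It returns a plain list rather than an np.ndarray to stay dependency-free, and the function names, the whitespace tokenisation and the handling of unknown n-grams are assumptions.

```python
def char_ngrams_sketch(word, n=3, bord="#"):
    """Hypothetical helper: character n-grams of a bordered word."""
    padded = bord + word + bord  # 'cat' -> '#cat#'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hash_ngrams_sketch(s, ngrams, n=3, bord="#"):
    """Hypothetical bag-of-n-grams count over a known list of n-grams."""
    index = {gram: i for i, gram in enumerate(ngrams)}
    counts = [0] * len(ngrams)
    for word in s.split():  # assumption: whitespace tokenisation
        for gram in char_ngrams_sketch(word, n, bord):
            if gram in index:  # assumption: unknown n-grams are ignored
                counts[index[gram]] += 1
    return counts


print(char_ngrams_sketch("cat"))  # ['#ca', 'cat', 'at#']
print(hash_ngrams_sketch("cat cab", ["#ca", "cat", "at#", "cab"]))  # [2, 1, 1, 1]
```

The delimiter lets the representation distinguish an n-gram at the start or end of a word from the same letters in the middle of a word.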

init_ngrams(tokens)

Computes the n-grams from a list of words and assigns them to self.ngrams.

Parameters: tokens (list of strings) – list of words from which to compute the n-grams
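
As an illustration, the list of n-grams could be built by collecting every distinct character n-gram found in the bordered words. This is a sketch under assumptions: the function name is hypothetical, it returns the list instead of storing it in self.ngrams, and the sorted order is chosen here only to make the result deterministic.

```python
def init_ngrams_sketch(tokens, n=3, bord="#"):
    """Hypothetical n-gram list built from a list of words."""
    seen = set()
    for token in tokens:
        padded = bord + token + bord
        for i in range(len(padded) - n + 1):
            seen.add(padded[i:i + n])
    # Sorting is an assumption; it only fixes the order of the vector entries.
    return sorted(seen)


print(init_ngrams_sketch(["cat", "cab"]))
# ['#ca', 'ab#', 'at#', 'cab', 'cat']
```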

load_ngrams(ngrams_)

Loads a list of n-grams into self.ngrams.

Parameters: ngrams_ (list of strings) – the list of n-grams to load

print_ngrams()

Prints the list of n-grams.