String Tools
The string_tools module gathers utilities for string manipulation, primarily cleaning.

class string_tools.StringCleaner
    Provides tools to clean strings, such as accent removal and standardisation.
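
As an illustration of the kind of cleaning described above, here is a minimal sketch that uses only the Python standard library. The function name `clean_string` and the exact standardisation steps (lowercasing, collapsing whitespace) are assumptions made for this example; they are not the actual `StringCleaner` API.

```python
import unicodedata


def clean_string(s):
    """Hypothetical helper: strip accents, lowercase, and collapse whitespace."""
    # NFKD decomposition splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", s)
    # Drop the combining marks, which carry the accents.
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumed standardisation: lowercase and single spaces between words.
    return " ".join(without_accents.lower().split())


print(clean_string("  Crème   Brûlée "))  # creme brulee
```

Decomposing before filtering keeps the base letters intact, so "é" becomes "e" rather than disappearing.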

class string_tools.StringHasher(n=1)
    Provides tools to transform a sentence into a bag-of-words vector.

    Parameters: n (int) – the size of the n-grams

    hash(s)
        Transforms a string into an n-gram count representation.

        Parameters: s (string) – the string to hash
        Returns: n-gram count representation of the input string.
        Return type: np.ndarray

    init_ngrams(tokens)
        Computes the n-grams from a list of words and assigns them to self.ngrams.

        Todo: handle the case n != 1.

        Parameters: tokens (list of strings) – list of words from which to compute the n-grams
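
To make the bag-of-words idea concrete, here is a minimal sketch for the unigram case (n = 1). The class `BagOfWordsSketch` is hypothetical: it only mirrors the documented method names `init_ngrams` and `hash`. Storing `self.ngrams` as a token-to-index dictionary, tokenising on whitespace, and ignoring unknown tokens are all assumptions. It returns a plain list to stay dependency-free, whereas the documented return type is np.ndarray.

```python
class BagOfWordsSketch:
    """Hypothetical stand-in for a unigram (n=1) hasher; not the library's code."""

    def __init__(self):
        self.ngrams = {}  # maps each known token to its position in the vector

    def init_ngrams(self, tokens):
        # Assign one index per distinct token, in order of first appearance.
        self.ngrams = {}
        for token in tokens:
            if token not in self.ngrams:
                self.ngrams[token] = len(self.ngrams)

    def hash(self, s):
        # Count how often each known token occurs in the sentence.
        counts = [0] * len(self.ngrams)
        for token in s.split():  # assumed tokenisation: split on whitespace
            index = self.ngrams.get(token)
            if index is not None:  # assumption: unknown tokens are ignored
                counts[index] += 1
        return counts


hasher = BagOfWordsSketch()
hasher.init_ngrams(["the", "cat", "sat", "the", "mat"])
print(hasher.hash("the cat saw the dog"))  # [2, 1, 0, 0]
```

The vector length equals the number of distinct tokens passed to `init_ngrams`, so two sentences hashed with the same vocabulary are directly comparable.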

class string_tools.WordHasher(n=3, bord='#')
    Provides tools to transform a string into a bag-of-ngrams vector.

    Parameters:
        - n (int) – the size of the n-grams
        - bord (string) – delimiter character placed around each word

    hash(s)
        Transforms a string into an n-gram count representation.

        Parameters: s (string) – the string to hash
        Returns: n-gram count representation of the input string.
        Return type: np.ndarray

    init_ngrams(tokens)
        Computes the n-grams from a list of words and assigns them to self.ngrams.

        Parameters: tokens (list of strings) – list of words from which to compute the n-grams
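
The role of the `bord` delimiter is easiest to see on a small example. The sketch below is hypothetical and not the library's implementation. It assumes that each word is wrapped with one `bord` character on each side, that n-grams are taken at character level, and that the input string is split on whitespace. The names `char_ngrams` and `CharNgramSketch` are invented for illustration, and a plain list stands in for the documented np.ndarray.

```python
def char_ngrams(word, n=3, bord="#"):
    """Hypothetical helper: character n-grams of a word wrapped in `bord`."""
    padded = bord + word + bord
    # Slide a window of width n over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramSketch:
    """Hypothetical bag-of-ngrams hasher mirroring the documented method names."""

    def __init__(self, n=3, bord="#"):
        self.n = n
        self.bord = bord
        self.ngrams = {}  # maps each known n-gram to its position in the vector

    def init_ngrams(self, tokens):
        # Index every distinct n-gram found in the given words.
        self.ngrams = {}
        for token in tokens:
            for gram in char_ngrams(token, self.n, self.bord):
                if gram not in self.ngrams:
                    self.ngrams[gram] = len(self.ngrams)

    def hash(self, s):
        # Count occurrences of each known n-gram; unknown n-grams are ignored.
        counts = [0] * len(self.ngrams)
        for token in s.split():
            for gram in char_ngrams(token, self.n, self.bord):
                index = self.ngrams.get(gram)
                if index is not None:
                    counts[index] += 1
        return counts


print(char_ngrams("cat"))  # ['#ca', 'cat', 'at#']
hasher = CharNgramSketch(n=3, bord="#")
hasher.init_ngrams(["cat", "cats"])
print(hasher.hash("cat"))  # [1, 1, 1, 0, 0]
```

Under these assumptions the delimiter lets the vector distinguish an n-gram at the edge of a word (`at#`) from the same letters in the middle of a word (`ats`).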