Dataset Tools
The module dataset_tools provides functions to construct a dataset from raw data.
dataset_tools.build_dataset(papers, citations, bad_papers, num_entries=6, verbosity=1)
Build a dataset from feature variables.
Parameters:
- papers (dict) – features of the user’s papers (dict string -> np.ndarray)
- citations (dict) – features of cited papers (dict string -> tuple(list of strings, np.ndarray))
- bad_papers (dict) – features of unrelated papers (dict string -> np.ndarray)
- num_entries (int) – the number of compared papers in the DSSM structure
- verbosity (int) – 0: quiet; 1: normal; 2: high
Returns: the dataset
Return type: np.ndarray
dataset_tools.compute_features(papers, stringHasher, verbosity=1)
Computes the features of a list of papers, using a given list of ngrams.
Parameters:
- papers (list of tuples) – the list of papers (each element is a tuple of 3 strings: id, title, abstract)
- stringHasher (string_tools.StringHasher) – the object that contains the list of ngrams
- verbosity (int) – 0: quiet; 1: normal; 2: high
Returns: the papers represented as bag-of-words vectors
Return type: dict of np.ndarray
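As a rough illustration of what a bag-of-ngrams representation looks like, the sketch below counts occurrences of each ngram in a paper's title and abstract. It is not the module's implementation: the helper names are hypothetical, the ngrams are passed as a plain list instead of a string_tools.StringHasher (whose interface is not documented here), lowercasing and overlapping counts are assumptions, and plain Python lists stand in for np.ndarray to keep the example dependency-free.

```python
def count_ngrams(text, ngrams):
    """Return one count per ngram: its number of occurrences in text (overlaps included)."""
    text = text.lower()  # assumption: matching is case-insensitive
    counts = []
    for ngram in ngrams:
        n = len(ngram)
        counts.append(
            sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == ngram)
        )
    return counts


def compute_features_sketch(papers, ngrams):
    """Map each paper id to its ngram-count vector over title and abstract."""
    return {
        paper_id: count_ngrams(title + " " + abstract, ngrams)
        for paper_id, title, abstract in papers
    }


ngrams = ["ran", "ank", "dee"]
papers = [
    ("p1", "Learning to Rank", "Ranking papers."),
    ("p2", "Deep Models", "Deep networks."),
]
features = compute_features_sketch(papers, ngrams)
```

Each vector has one entry per ngram, in the same order as the ngram list, so vectors from different papers are directly comparable.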
dataset_tools.dataset_from_file(filename)
Load a dataset from a file.
Parameters: filename (string) – the name of the file from which to extract the dataset
Returns: the dataset (np.ndarray) and the ngrams (list of strings)
Return type: tuple
dataset_tools.dataset_to_file(dataset, ngrams, filename='dataset')
Save a dataset to a file.
Parameters:
- dataset (np.ndarray) – the dataset to save (built with dataset_tools.build_dataset())
- ngrams (list of strings) – the ngrams used to compute the features
- filename (string) – the filename without its extension (the extension will be .npz)
dataset_tools.generate_vocab(papers)
Returns the vocabulary used in the given papers, after cleaning and stopword removal.
Parameters: papers (list of tuples) – the raw list of papers from which to generate the vocabulary (each element is a tuple of 3 strings: id, title, abstract)
Returns: the list of tokens forming the vocabulary
Return type: list of strings
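The idea of cleaning followed by stopword removal can be sketched as follows. This is only an illustration under stated assumptions, not the module's code: the function name, the tiny stopword list, the lowercasing, the regex-based cleaning, and the sorted output are all hypothetical choices.

```python
import re

# Tiny illustrative stopword list (an assumption; the real list is not documented here).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def generate_vocab_sketch(papers):
    """Collect the distinct non-stopword tokens found in titles and abstracts."""
    vocab = set()
    for _paper_id, title, abstract in papers:
        text = (title + " " + abstract).lower()
        # "Cleaning" is approximated here by keeping alphabetic runs only.
        for token in re.findall(r"[a-z]+", text):
            if token not in STOPWORDS:
                vocab.add(token)
    return sorted(vocab)


papers = [
    ("p1", "Learning to Rank", "A model for ranking papers."),
    ("p2", "Deep Models", "Ranking with deep networks."),
]
vocab = generate_vocab_sketch(papers)
```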
dataset_tools.invert_citations(citations, verbosity=1)
Transforms a list of citation relations into a hash table cited_paper -> list of citing papers.
Parameters:
- citations (list of tuples) – the list of citation relations (each element is a tuple of 2 strings: citing paper’s id, cited paper’s id)
- verbosity (int) – 0: quiet; 1: normal; 2: high
Returns: a dict whose keys are the ids of cited papers and whose values are the lists of ids of the papers that cite them (string -> list of strings)
Return type: dict
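The inversion described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the module's actual source: the name invert_citations_sketch is hypothetical, and keeping citing papers in input order (with no de-duplication) is an assumption.

```python
from collections import defaultdict


def invert_citations_sketch(citations):
    """Map each cited paper id to the list of ids of the papers citing it.

    Hypothetical stand-in for dataset_tools.invert_citations; the real
    function may differ, for example in ordering or duplicate handling.
    """
    cited_by = defaultdict(list)
    for citing_id, cited_id in citations:
        cited_by[cited_id].append(citing_id)
    return dict(cited_by)


# Each relation is (citing paper's id, cited paper's id).
relations = [("p1", "c1"), ("p2", "c1"), ("p1", "c2")]
inverted = invert_citations_sketch(relations)
```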
dataset_tools.prepare_dataset(user_papers, citations, cited_papers, tokens, bad_papers=None, verbosity=1)
Prepares data from string representations of papers in order to build a numeric dataset.
The result is a tuple of 4 elements:
(1) the user’s papers, as a dictionary: each key is the id of a paper written by the user, and the value is the features of the paper (1D np.ndarray),
(2) the cited papers, as a dictionary: each key is the id of a paper cited by the user, and the value is a tuple made up of the list of ids of the papers in which the paper is cited (list of strings) and the features of the paper (1D np.ndarray),
(3) the irrelevant papers, as a dictionary like the first one,
(4) the ngrams used to compute the features (list of strings).
Parameters:
- user_papers (list of 3-tuples) – the papers written by the user (each element is a tuple of 3 strings: id, title, abstract)
- citations (list of 2-tuples) – the list of citation relations
- cited_papers (list of 3-tuples) – the papers that the user has cited (each element is a tuple of 3 strings: id, title, abstract)
- tokens (list of strings) – the vocabulary to use for computing features
- bad_papers (list of 3-tuples or None) – unrelated papers (each element is a tuple of 3 strings: id, title, abstract)
- verbosity (int) – 0: quiet; 1: normal; 2: high
Returns: data to build a dataset with
Return type: tuple
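To make the layout of the 4-element result concrete, the sketch below packs already-computed pieces into that shape. It only illustrates the data structure and is not how prepare_dataset works internally: the function name and sample values are hypothetical, plain lists stand in for 1D np.ndarray features, and the handling of bad_papers=None is not covered because it is not documented here.

```python
def assemble_prepared_data(user_feats, cited_feats, cited_by, bad_feats, ngrams):
    """Pack precomputed pieces into the 4-tuple layout described above."""
    # (2) cited id -> (ids of the papers citing it, features of the cited paper)
    cited = {
        paper_id: (cited_by.get(paper_id, []), feats)
        for paper_id, feats in cited_feats.items()
    }
    return user_feats, cited, bad_feats, ngrams


ngrams = ["ran", "dee"]
user_feats = {"u1": [2, 0]}    # (1) user's paper id -> features
cited_feats = {"c1": [1, 1]}   # features of each cited paper
cited_by = {"c1": ["u1"]}      # cited id -> ids of the citing papers
bad_feats = {"b1": [0, 0]}     # (3) same layout as (1)

prepared = assemble_prepared_data(user_feats, cited_feats, cited_by, bad_feats, ngrams)
user, cited, bad, used_ngrams = prepared
```

The first three elements match the papers, citations, and bad_papers arguments documented for build_dataset.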