Dataset Tools

The module dataset_tools provides functions to construct a dataset from raw data.

dataset_tools.build_dataset(papers, citations, bad_papers, num_entries=6, verbosity=1)

Builds a dataset from feature variables.

Parameters:	papers (dict) – features of the user’s papers (dict string -> np.ndarray)
	citations (dict) – features of cited papers (dict string -> tuple(list of string, np.ndarray))
	bad_papers (dict) – features of unrelated papers (dict string -> np.ndarray)
	num_entries (int) – the number of compared papers in the DSSM structure
	verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:	the dataset
Return type:	`np.ndarray`

dataset_tools.compute_features(papers, stringHasher, verbosity=1)

Computes the features of a list of papers, with a given list of ngrams.

Parameters:	papers (list of tuples) – the list of papers (each element is a tuple of 3 strings: id, title, abstract)
	stringHasher (`string_tools.StringHasher`) – the object which contains the list of ngrams
	verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:	the list of papers represented as bag-of-words vectors
Return type:	dict of `np.ndarray`
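
As a rough illustration of the bag-of-words representation described above, the sketch below counts n-gram occurrences in the title and abstract of each paper. It is not the module's implementation: the interface of `string_tools.StringHasher` is not documented here, so the hypothetical helper `bag_of_ngrams` takes a plain list of n-grams instead, and substring counting is an assumption.

```python
import numpy as np

def bag_of_ngrams(papers, ngrams):
    """Hypothetical stand-in for compute_features: count n-gram occurrences."""
    features = {}
    for paper_id, title, abstract in papers:
        # Assumption: title and abstract are concatenated and lowercased.
        text = (title + " " + abstract).lower()
        # One count per n-gram, in the order given by `ngrams`.
        features[paper_id] = np.array(
            [text.count(gram) for gram in ngrams], dtype=np.float32
        )
    return features

papers = [
    ("p1", "Deep learning", "Learning deep models"),
    ("p2", "Graphs", "Graph theory"),
]
ngrams = ["deep", "learn", "graph"]
features = bag_of_ngrams(papers, ngrams)
```

The result is a dict keyed by paper id, which matches the documented return type (dict of `np.ndarray`).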

dataset_tools.dataset_from_file(filename)

Load a dataset from file.

Parameters:	filename (string) – the name of the file from which to extract the dataset
Returns:	the dataset (np.ndarray) and the ngrams (list of strings)
Return type:	tuple

dataset_tools.dataset_to_file(dataset, ngrams, filename='dataset')

Save a dataset to a file.

Parameters:	dataset (`np.ndarray`) – the dataset to save (built with `dataset_tools.build_dataset()`)
	ngrams (list of strings) – the ngrams used to compute the features
	filename (string) – the filename without extension (the .npz extension is added)
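
A save and load round trip of this kind can be sketched with NumPy's `.npz` archives. The helper names `save_dataset` and `load_dataset` and the archive keys `dataset` and `ngrams` are assumptions made for illustration; the keys actually used by `dataset_to_file()` are not documented here, nor is whether `dataset_from_file()` expects the extension in its filename.

```python
import os
import tempfile

import numpy as np

def save_dataset(dataset, ngrams, filename="dataset"):
    # np.savez appends ".npz" when the name does not already end with it.
    np.savez(filename, dataset=dataset, ngrams=np.array(ngrams))

def load_dataset(filename):
    # Assumption: the full file name, extension included, is passed here.
    with np.load(filename) as archive:
        return archive["dataset"], archive["ngrams"].tolist()

dataset = np.arange(12, dtype=np.float32).reshape(2, 6)
ngrams = ["dee", "eep", "lea"]

path = os.path.join(tempfile.mkdtemp(), "dataset")
save_dataset(dataset, ngrams, path)
loaded_dataset, loaded_ngrams = load_dataset(path + ".npz")
```

Storing the n-grams next to the dataset keeps the feature columns interpretable after reloading.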

dataset_tools.generate_vocab(papers)

Returns the vocabulary used in the given papers, after cleaning and stopword removal.

Parameters:	papers (list of tuples) – the raw list of papers from which to generate the vocabulary (each element is a tuple of 3 strings: id, title and abstract)
Returns:	the list of tokens forming the vocabulary
Return type:	list of strings
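
The cleaning and stopword removal steps are not specified in this page. The following sketch shows one plausible reading, assuming that cleaning means lowercasing and keeping alphabetic words only; the helper `vocab_sketch` and the tiny `STOPWORDS` set are hypothetical and are not the ones used by `generate_vocab()`.

```python
import re

# Hypothetical, deliberately tiny stopword list; the real list is not documented.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def vocab_sketch(papers):
    vocab = set()
    for _paper_id, title, abstract in papers:
        # Cleaning step (assumed): lowercase and keep alphabetic runs only.
        words = re.findall(r"[a-z]+", (title + " " + abstract).lower())
        # Stopword removal: drop every word found in STOPWORDS.
        vocab.update(word for word in words if word not in STOPWORDS)
    # Sorted so that the output order is deterministic.
    return sorted(vocab)

papers = [("p1", "A Survey of Deep Learning", "Deep models for the analysis of text.")]
vocab = vocab_sketch(papers)
```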

dataset_tools.invert_citations(citations, verbosity=1)

Transforms a list of citation relations into a hashtable cited_paper -> list of citing papers.

Parameters:	citations (list of tuples) – the list of citation relations (each element is a tuple of 2 strings: citing paper’s id, cited paper’s id)
	verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:	a dict whose keys are the ids of cited papers and whose values are the lists of ids of the papers that cite them (string -> list of strings)
Return type:	dict
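
The transformation described above can be sketched in a few lines. The helper `invert_citations_sketch` is hypothetical and omits the `verbosity` parameter; it only illustrates the mapping from (citing, cited) pairs to a dict keyed by cited paper.

```python
from collections import defaultdict

def invert_citations_sketch(citations):
    cited_to_citing = defaultdict(list)
    for citing_id, cited_id in citations:
        # Group every citing paper under the paper it cites.
        cited_to_citing[cited_id].append(citing_id)
    # Return a plain dict so that missing keys raise KeyError as usual.
    return dict(cited_to_citing)

citations = [("u1", "c1"), ("u2", "c1"), ("u1", "c2")]
inverted = invert_citations_sketch(citations)
```

Here `c1` is cited by both `u1` and `u2`, so it maps to a list of two ids, while `c2` maps to a single-element list.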

dataset_tools.prepare_dataset(user_papers, citations, cited_papers, tokens, bad_papers=None, verbosity=1)

Prepares data from string representations of papers in order to build a numeric dataset.

The result is a tuple of 4 elements:

(1) the user’s papers, as a dictionary: each key is the id of a paper written by the user, and the value is the features of the paper (1D np.ndarray),
(2) the cited papers, as a dictionary: each key is the id of a paper cited by the user, and the value is a tuple made of the list of ids of the papers in which the paper is cited (list of strings) and the features of the paper (1D np.ndarray),
(3) the irrelevant papers, as a dictionary like the first one,
(4) the ngrams used to compute the features (list of strings).

Parameters:	user_papers (list of 3-tuples) – the papers written by the user (each element is a tuple of 3 strings: id, title, abstract)
	citations (list of 2-tuples) – the list of citation relations
	cited_papers (list of 3-tuples) – the papers that the user has cited (each element is a tuple of 3 strings: id, title, abstract)
	tokens (list of strings) – the vocabulary to use for computing features
	bad_papers (list of 3-tuples or None) – unrelated papers (each element is a tuple of 3 strings: id, title, abstract)
	verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:	data to build a dataset with
Return type:	tuple
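
To make the layout of the returned 4-tuple concrete, the sketch below builds a toy tuple by hand and checks it with a hypothetical helper, `check_prepared`. The values are invented for illustration. Two points are assumptions rather than documented behaviour: that each feature vector has one entry per ngram, and that the citing ids listed for a cited paper refer to the user's papers.

```python
import numpy as np

ngrams = ["dee", "eep", "lea"]  # toy n-grams, illustrative only

# (1) user's papers: id -> 1D feature vector
user_papers = {"u1": np.array([1.0, 1.0, 0.0]), "u2": np.array([0.0, 0.0, 2.0])}
# (2) cited papers: id -> (ids of the citing papers, 1D feature vector)
cited_papers = {"c1": (["u1", "u2"], np.array([1.0, 0.0, 1.0]))}
# (3) irrelevant papers: same layout as (1)
irrelevant_papers = {"b1": np.array([0.0, 0.0, 0.0])}

prepared = (user_papers, cited_papers, irrelevant_papers, ngrams)

def check_prepared(prepared):
    """Check a tuple against the layout described above (hypothetical helper)."""
    users, cited, irrelevant, grams = prepared
    size = len(grams)  # assumption: one feature per n-gram
    ok = all(vec.shape == (size,) for vec in users.values())
    ok = ok and all(vec.shape == (size,) for vec in irrelevant.values())
    for citing_ids, vec in cited.values():
        ok = ok and vec.shape == (size,)
        # Assumption: every citing id is one of the user's papers.
        ok = ok and all(cid in users for cid in citing_ids)
    return ok

is_valid = check_prepared(prepared)
```

The first three elements have the dict shapes that `dataset_tools.build_dataset()` documents for its `papers`, `citations` and `bad_papers` parameters.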