Dataset Tools

The module dataset_tools provides functions to construct a dataset from raw data.

dataset_tools.build_dataset(papers, citations, bad_papers, num_entries=6, verbosity=1)

Build a dataset from feature variables.

Parameters:
  • papers (dict) – features of the user’s papers (dict string -> np.ndarray)
  • citations (dict) – features of cited papers (dict string -> tuple(list of string, np.ndarray))
  • bad_papers (dict) – features of unrelated papers (dict string -> np.ndarray)
  • num_entries (int) – the number of compared papers in the DSSM structure
  • verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:

the dataset

Return type:

np.ndarray
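
The three dictionaries expected by build_dataset can be easier to picture with concrete values. The sketch below only builds inputs in the documented layout and checks them; the ids, the feature size, and the check_inputs helper are hypothetical and not part of dataset_tools. It also assumes that the citing ids listed in citations refer to keys of papers.

```python
import numpy as np

# papers: id -> 1D feature vector of a paper written by the user
papers = {"u1": np.array([1.0, 0.0, 2.0, 0.0])}

# citations: id of a cited paper -> (ids of the papers citing it, 1D feature vector)
citations = {"c1": (["u1"], np.array([0.0, 1.0, 1.0, 0.0]))}

# bad_papers: id -> 1D feature vector of an unrelated paper
bad_papers = {"b1": np.array([0.0, 0.0, 0.0, 3.0])}


def check_inputs(papers, citations, bad_papers):
    """Hypothetical helper: verify the three dicts follow the documented layout."""
    for vec in list(papers.values()) + list(bad_papers.values()):
        if not isinstance(vec, np.ndarray) or vec.ndim != 1:
            return False
    for citing_ids, vec in citations.values():
        # Assumption: every citing id is one of the user's papers.
        if not all(cid in papers for cid in citing_ids):
            return False
        if not isinstance(vec, np.ndarray) or vec.ndim != 1:
            return False
    return True


inputs_ok = check_inputs(papers, citations, bad_papers)
```

With the real module available, these dictionaries would be passed as dataset_tools.build_dataset(papers, citations, bad_papers).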

dataset_tools.compute_features(papers, stringHasher, verbosity=1)

Computes the features of a list of papers, with a given list of ngrams.

Parameters:
  • papers (list of tuples) – the list of papers (each element is a tuple of 3 strings: id, title, abstract)
  • stringHasher (string_tools.StringHasher) – the object which contains the list of ngrams
  • verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:

the papers, represented as bag-of-words vectors

Return type:

dict of np.ndarray
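
As a rough illustration of a bag-of-words representation, the sketch below counts how often each ngram occurs in the title and abstract of a paper. It is not the module's implementation: a plain list of strings stands in for string_tools.StringHasher, ngrams are treated as whitespace-separated lowercase tokens, and the bag_of_words name is hypothetical.

```python
import numpy as np


def bag_of_words(papers, ngrams):
    """Map each paper id to a count vector over `ngrams` (illustrative stand-in)."""
    index = {ngram: i for i, ngram in enumerate(ngrams)}
    features = {}
    for paper_id, title, abstract in papers:
        vec = np.zeros(len(ngrams))
        # Assumption: tokens are lowercase words split on whitespace.
        for token in (title + " " + abstract).lower().split():
            if token in index:
                vec[index[token]] += 1
        features[paper_id] = vec
    return features


papers = [("p1", "Deep models", "Deep semantic models for search")]
ngrams = ["deep", "semantic", "search"]
features = bag_of_words(papers, ngrams)  # {"p1": array([2., 1., 1.])}
```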

dataset_tools.dataset_from_file(filename)

Load a dataset from file.

Parameters: filename (string) – the name of the file from which to load the dataset
Returns: the dataset (np.ndarray) and the ngrams (list of strings)
Return type: tuple

dataset_tools.dataset_to_file(dataset, ngrams, filename='dataset')

Save a dataset to a file.

Parameters:
  • dataset (np.ndarray) – the dataset to save (built with dataset_tools.build_dataset())
  • ngrams (list of strings) – the ngrams used to compute the features
  • filename (string) – the output filename, without extension (the file is saved as .npz)
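
A save and load round trip through an .npz archive might look like the sketch below. It uses np.savez and np.load directly; the array keys "dataset" and "ngrams" and the function names are assumptions, since the archive layout actually used by dataset_to_file and dataset_from_file is not documented here.

```python
import os
import tempfile

import numpy as np


def save_dataset(dataset, ngrams, filename="dataset"):
    # np.savez appends ".npz" when the name does not already end with it.
    np.savez(filename, dataset=dataset, ngrams=np.array(ngrams))


def load_dataset(filename):
    # Assumption: the archive stores the two arrays under these key names.
    with np.load(filename) as data:
        return data["dataset"], data["ngrams"].tolist()


dataset = np.arange(6.0).reshape(2, 3)
ngrams = ["deep", "semantic", "search"]

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "dataset")
    save_dataset(dataset, ngrams, base)
    loaded_dataset, loaded_ngrams = load_dataset(base + ".npz")
```

Storing the ngrams next to the dataset keeps the feature columns interpretable after reloading.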

dataset_tools.generate_vocab(papers)

Returns the vocabulary used in the given papers, after cleaning and stopword removal.

Parameters: papers (list of tuples) – the raw list of papers from which to generate the vocabulary (each element is a tuple of 3 strings: id, title, abstract)
Returns: the list of tokens forming the vocabulary
Return type: list of strings
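
The cleaning and stopword removal steps are not detailed here, so the following is only a minimal sketch of the idea. It assumes that cleaning means lowercasing and keeping alphabetic words, and it uses a deliberately tiny, hypothetical stopword list; vocab_sketch is not a function of dataset_tools.

```python
import re

# Hypothetical stopword list, far smaller than any realistic one.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "to"}


def vocab_sketch(papers):
    """Collect the distinct non-stopword tokens found in titles and abstracts."""
    vocab = set()
    for _paper_id, title, abstract in papers:
        text = (title + " " + abstract).lower()
        # Assumption: cleaning keeps only runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text)
        vocab.update(t for t in tokens if t not in STOPWORDS)
    return sorted(vocab)


papers = [("p1", "Learning to Rank", "A study of ranking for the web.")]
vocab = vocab_sketch(papers)
# ['learning', 'rank', 'ranking', 'study', 'web']
```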

dataset_tools.invert_citations(citations, verbosity=1)

Transforms a list of citation relations into a hashtable cited_paper -> list of citing papers.

Parameters:
  • citations (list of tuples) – the list of citation relations (each element is a tuple of 2 strings: citing paper’s id, cited paper’s id)
  • verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:

a dict whose keys are the ids of cited papers and whose values are the lists of ids of the papers that cite them (string -> list of strings)

Return type:

dict
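
The inversion described above can be sketched in a few lines. The invert name is hypothetical and the verbosity parameter is left out; only the documented input and output shapes are taken from this page.

```python
from collections import defaultdict


def invert(citations):
    """Turn (citing id, cited id) pairs into cited id -> list of citing ids."""
    inverted = defaultdict(list)
    for citing_id, cited_id in citations:
        inverted[cited_id].append(citing_id)
    # Return a plain dict so missing keys raise KeyError instead of being created.
    return dict(inverted)


citations = [("u1", "c1"), ("u2", "c1"), ("u1", "c2")]
inverted = invert(citations)
# {'c1': ['u1', 'u2'], 'c2': ['u1']}
```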

dataset_tools.prepare_dataset(user_papers, citations, cited_papers, tokens, bad_papers=None, verbosity=1)

Prepares data from string representations of papers in order to build a numeric dataset.

The result is a tuple of 4 elements:

(1) the user’s papers, as a dictionary: each key is the id of a paper written by the user, and the value is the features of the paper (1D np.ndarray);
(2) the cited papers, as a dictionary: each key is the id of a paper cited by the user, and the value is a tuple made of the list of ids of the papers in which it is cited (list of strings) and the features of the paper (1D np.ndarray);
(3) the irrelevant papers, as a dictionary like the first one;
(4) the ngrams used to compute the features (list of strings).

Parameters:
  • user_papers (list of 3-tuples) – the papers written by the user (each element is a tuple of 3 strings: id, title, abstract)
  • citations (list of 2-tuples) – the list of citation relations
  • cited_papers (list of 3-tuples) – the papers that the user has cited (each element is a tuple of 3 strings: id, title, abstract)
  • tokens (list of strings) – the vocabulary to use for computing features
  • bad_papers (list of 3-tuples or None) – unrelated papers (each element is a tuple of 3 strings: id, title, abstract)
  • verbosity (int) – 0: quiet; 1: normal; 2: high
Returns:

data to build a dataset with

Return type:

tuple
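
The second element of the result combines two pieces of information per cited paper. The sketch below shows one way such a dictionary could be assembled, assuming the features have already been computed and the citations already inverted; the variable names and values are hypothetical, and this is not necessarily how prepare_dataset proceeds internally.

```python
import numpy as np

# Hypothetical inputs: features per cited paper, and
# citations inverted into cited id -> list of citing ids.
cited_features = {"c1": np.array([0.0, 1.0]), "c2": np.array([2.0, 0.0])}
inverted = {"c1": ["u1", "u2"], "c2": ["u1"]}

# Cited-papers dictionary: cited id -> (citing ids, features).
cited = {
    paper_id: (inverted.get(paper_id, []), vec)
    for paper_id, vec in cited_features.items()
}
```

A cited paper with no recorded citing paper gets an empty list here; whether the module keeps or drops such entries is not stated.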