Data Retrieval

The data_retrieval module provides functions for retrieving paper data from raw text files and from a SQL database. The text files must follow the block format described under data_retrieval.get_author_papers().

data_retrieval.generate_citations(author_papers)

Returns the citation relations.

Parameters: author_papers (list of dicts) – the author’s papers, as a list of dicts produced by the function data_retrieval.list2paper()
Returns: the citation relations
Return type: pandas.DataFrame

data_retrieval.get_author_papers(author_name, author_slug, input_file)

Reads the papers written by the author from a raw text file and returns them as dicts.

The text file must be formatted in the following way:

  • each paper is a block of lines;
  • each line represents either the index, the title, the abstract, the list of authors or a citation reference;
  • the type of each line can be recognised with a regular expression;
  • the papers are separated by a blank line.
Parameters:
  • author_name (string) – the real name of the author
  • author_slug (string) – a short ASCII string used in place of the author’s name
  • input_file (string) – the name of the file in which the author’s papers are stored
Returns:

the author’s papers as dicts, in two groups: those with an abstract and those without

Return type:

tuple
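
The file layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the line prefixes (`#index`, `#*`, `#@`, `#!`, `#%`), the helper name `split_blocks` and the sample text are all hypothetical, since this page does not show the actual markers.

```python
import io


def split_blocks(lines):
    """Group an iterable of lines into blocks separated by blank lines."""
    blocks, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current paper block.
            blocks.append(current)
            current = []
    if current:
        # The last block may not be followed by a blank line.
        blocks.append(current)
    return blocks


# Hypothetical sample file: two papers, only the first has an abstract.
sample = io.StringIO(
    "#index 1\n"
    "#* A first paper\n"
    "#@ Ada Lovelace, Alan Turing\n"
    "#! An abstract.\n"
    "#% 2\n"
    "\n"
    "#index 2\n"
    "#* A second paper\n"
    "#@ Ada Lovelace\n"
)

blocks = split_blocks(sample)

# Assumed abstract marker "#!": partition blocks into the two returned groups.
with_abstract = [b for b in blocks if any(l.startswith("#!") for l in b)]
without_abstract = [b for b in blocks if not any(l.startswith("#!") for l in b)]
```

Each block here is still a list of raw strings; turning a block into a dict is the job of data_retrieval.list2paper().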

data_retrieval.get_cited_papers(cited, db_cursor, papers_table='papers')

Retrieves the cited papers data from a SQL database.

The table papers_table must have the columns: id, title and abstract.

Parameters:
  • cited (list of strings) – list of the cited papers’ ids
  • db_cursor (MySQLdb.cursors.Cursor) – cursor of a SQL database in which there is a papers table
  • papers_table (string) – name of the papers table in the SQL database
Returns:

the results of the SQL query

Return type:

tuple of tuples
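
A query of this shape might look like the sketch below. To keep the example runnable without a database server, it uses the standard-library `sqlite3` module as a stand-in for MySQLdb; that substitution is an assumption, and the placeholder style differs (`?` in sqlite3, `%s` in MySQLdb). The function name `fetch_cited_papers`, the `ORDER BY` clause and the sample rows are hypothetical and do not reproduce the module's code.

```python
import sqlite3


def fetch_cited_papers(cited, db_cursor, papers_table="papers"):
    """Return (id, title, abstract) rows for the given paper ids."""
    if not cited:
        return ()
    # Table names cannot be bound as query parameters, so validate the
    # name before interpolating it into the SQL string.
    if not papers_table.isidentifier():
        raise ValueError("invalid table name: %r" % papers_table)
    placeholders = ", ".join("?" for _ in cited)
    query = "SELECT id, title, abstract FROM {} WHERE id IN ({}) ORDER BY id".format(
        papers_table, placeholders
    )
    db_cursor.execute(query, list(cited))
    return tuple(db_cursor.fetchall())


# Hypothetical in-memory database with the three required columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE papers (id TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
cur.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("1", "First paper", "Abstract one"),
        ("2", "Second paper", "Abstract two"),
        ("3", "Third paper", "Abstract three"),
    ],
)

rows = fetch_cited_papers(["2", "3"], cur)
```

Binding the ids as parameters rather than formatting them into the query string avoids SQL injection through the `cited` list.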

data_retrieval.get_irrelevant_cited_papers(bad_papers, db_cursor, papers_table='papers')

Retrieves, from a SQL database, the papers cited by the irrelevant papers given as input.

Parameters:
  • bad_papers (list of dicts) – the list of irrelevant papers, formatted as the output of data_retrieval.list2paper()
  • db_cursor (MySQLdb.cursors.Cursor) – cursor of a SQL database in which there is a papers table
  • papers_table (string) – name of the papers table in the SQL database
Returns:

the results of the SQL query

Return type:

tuple of tuples

data_retrieval.get_irrelevant_papers(input_file)

Returns the list of irrelevant papers (list of dicts) read from a raw text file.

Parameters: input_file (string) – relative path to the raw text file
Returns: the list of irrelevant papers (with abstract) formatted as dicts
Return type: list of dicts

data_retrieval.list2paper(l_paper, r_index=None, r_author=None, r_title=None, r_abstract=None, r_cite=None)

Transforms a raw paper (formatted as a list of strings) into a dict.

This function uses regular expressions to match the title, abstract, authors, etc. in each element of the input list. If a regex argument is None, a default regex is used in its place.

Parameters:
  • l_paper (list of strings) – the list of elements forming the paper (title, authors, etc.), in raw format
  • r_index (_sre.SRE_pattern) – a compiled regex to match an index string
  • r_author (_sre.SRE_pattern) – a compiled regex to match an authors list
  • r_title (_sre.SRE_pattern) – a compiled regex to match a title
  • r_abstract (_sre.SRE_pattern) – a compiled regex to match an abstract
  • r_cite (_sre.SRE_pattern) – a compiled regex to match a citation
Returns:

the paper as a dict, with a list of authors and a list of citations

Return type:

dict
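
The regex-based conversion described above could be sketched as follows. Everything specific here is an assumption made for illustration: the default patterns, the dict keys (`index`, `title`, `abstract`, `authors`, `citations`), the comma-separated author list and the function name `list_to_paper` are hypothetical, because this page does not show the module's actual defaults or output keys.

```python
import re

# Hypothetical default patterns; the module's real defaults may differ.
R_INDEX = re.compile(r"^#index\s*(.*)$")
R_TITLE = re.compile(r"^#\*\s*(.*)$")
R_AUTHOR = re.compile(r"^#@\s*(.*)$")
R_ABSTRACT = re.compile(r"^#!\s*(.*)$")
R_CITE = re.compile(r"^#%\s*(.*)$")


def list_to_paper(l_paper, r_index=None, r_author=None, r_title=None,
                  r_abstract=None, r_cite=None):
    """Turn a list of raw lines into a dict, one regex per line type."""
    # Fall back to the default pattern whenever an argument is None.
    r_index = R_INDEX if r_index is None else r_index
    r_author = R_AUTHOR if r_author is None else r_author
    r_title = R_TITLE if r_title is None else r_title
    r_abstract = R_ABSTRACT if r_abstract is None else r_abstract
    r_cite = R_CITE if r_cite is None else r_cite

    paper = {"index": None, "title": None, "abstract": None,
             "authors": [], "citations": []}
    for line in l_paper:
        m = r_index.match(line)
        if m:
            paper["index"] = m.group(1).strip()
            continue
        m = r_title.match(line)
        if m:
            paper["title"] = m.group(1).strip()
            continue
        m = r_abstract.match(line)
        if m:
            paper["abstract"] = m.group(1).strip()
            continue
        m = r_author.match(line)
        if m:
            # Assumes authors are comma-separated on a single line.
            paper["authors"] = [a.strip() for a in m.group(1).split(",") if a.strip()]
            continue
        m = r_cite.match(line)
        if m and m.group(1).strip():
            # One citation reference per line; empty references are skipped.
            paper["citations"].append(m.group(1).strip())
    return paper


paper = list_to_paper([
    "#index 1",
    "#* A first paper",
    "#@ Ada Lovelace, Alan Turing",
    "#! An abstract.",
    "#% 2",
    "#% 3",
])

# A caller-supplied compiled regex replaces the corresponding default.
custom = list_to_paper(["ID=7"], r_index=re.compile(r"^ID=(.*)$"))
```

Passing compiled patterns as keyword arguments lets the same function handle files that use different line markers without changing its body.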