Data Retrieval
The module data_retrieval provides functions to retrieve paper data from raw text files and from a SQL database.
The raw text files must hold one paper per block of lines, with blocks separated by a blank line.
data_retrieval.generate_citations(author_papers)
Returns the citation relations.
Parameters: author_papers (list of dicts) – the author’s papers, as a list of dicts produced by the function data_retrieval.list2paper()
Returns: the citation relations
Return type: pandas.DataFrame
Returns the papers written by the author, as lists of dicts, read from a raw text file.
The text file must be formatted in the following way:
- each paper is a block of lines;
- each line represents either the index, the title, the abstract, the list of authors or a citation reference;
- the type of each line can be recognised with a regular expression;
- the papers are separated by a blank line.
Parameters: - author_name (string) – the author’s real name
- author_slug (string) – a short ASCII string used in place of the author’s name
- input_file (string) – the name of the file in which the author’s papers are stored
Returns: the author’s papers as dictionaries, in two groups: those with an abstract and those without
Return type: tuple
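The block structure described above can be sketched as follows. This is only an illustration, not the module's own code: the helper name split_into_blocks and the line markers in the sample (#index, #*, #%) are assumptions made for the example.

```python
def split_into_blocks(raw_text):
    """Split raw text into papers: one list of lines per blank-line-separated block."""
    blocks, current = [], []
    for line in raw_text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the block being collected.
            blocks.append(current)
            current = []
    if current:
        # The last block may not be followed by a blank line.
        blocks.append(current)
    return blocks


# Hypothetical sample: two papers, the second one citing the first.
sample = "#index 1\n#* First title\n\n#index 2\n#* Second title\n#% 1\n"
blocks = split_into_blocks(sample)
```

Each resulting list of lines has the shape that data_retrieval.list2paper() expects as its l_paper argument.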
data_retrieval.get_cited_papers(cited, db_cursor, papers_table='papers')
Retrieves the cited papers data from a SQL database.
The table papers_table must have the columns id, title and abstract.
Parameters: - cited (list of strings) – list of the cited papers’ ids
- db_cursor (MySQLdb.cursors.Cursor) – cursor of a SQL database in which there is a papers table
- papers_table (string) – name of the papers table in the SQL database
Returns: the results of the SQL query
Return type: tuple of tuples
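A query of this kind might look like the sketch below. It is an assumption about the general approach, not the module's actual SQL: the helper fetch_papers_by_id is hypothetical, and the standard-library sqlite3 module stands in for MySQLdb so that the example runs without a server. With a MySQLdb cursor the placeholder would be %s instead of ?.

```python
import re
import sqlite3


def fetch_papers_by_id(cursor, ids, papers_table="papers", placeholder="?"):
    """Hypothetical helper: select id, title and abstract for the given ids."""
    # Table names cannot be bound as query parameters, so validate the name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", papers_table):
        raise ValueError("unexpected table name: %r" % papers_table)
    if not ids:
        return ()
    # One placeholder per id, so the ids themselves are bound by the driver.
    marks = ", ".join([placeholder] * len(ids))
    query = "SELECT id, title, abstract FROM %s WHERE id IN (%s)" % (papers_table, marks)
    cursor.execute(query, list(ids))
    return tuple(cursor.fetchall())


# In-memory database with the three required columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE papers (id TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
cur.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [("1", "First", "Abstract one"), ("2", "Second", "Abstract two"), ("3", "Third", "")],
)
rows = fetch_papers_by_id(cur, ["1", "3"])
```

Binding the ids as parameters rather than formatting them into the query string leaves quoting to the database driver.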
data_retrieval.get_irrelevant_cited_papers(bad_papers, db_cursor, papers_table='papers')
Retrieves the papers cited by the irrelevant papers given in input, from a SQL database.
Parameters: - bad_papers (list of dicts) – the list of irrelevant papers, formatted as the output of data_retrieval.list2paper()
- db_cursor (MySQLdb.cursors.Cursor) – cursor of a SQL database in which there is a papers table
- papers_table (string) – name of the papers table in the SQL database
Returns: the results of the SQL query
Return type: tuple of tuples
data_retrieval.get_irrelevant_papers(input_file)
Returns the list of irrelevant papers (list of dicts) read from a raw text file.
Parameters: input_file (string) – relative path to the raw text file
Returns: the list of irrelevant papers (with abstract) formatted as dicts
Return type: list of dicts
data_retrieval.list2paper(l_paper, r_index=None, r_author=None, r_title=None, r_abstract=None, r_cite=None)
Transforms a raw paper (formatted as a list) into a dict.
This function uses regular expressions to match the title, abstract, authors, etc. in each element of the input list. If a regex is None, a default regex is used instead.
Parameters: - l_paper (list of strings) – the list of elements forming the paper (title, authors, etc.), in raw format
- r_index (_sre.SRE_pattern) – a compiled regex to match an index string
- r_author (_sre.SRE_pattern) – a compiled regex to match an authors list
- r_title (_sre.SRE_pattern) – a compiled regex to match a title
- r_abstract (_sre.SRE_pattern) – a compiled regex to match an abstract
- r_cite (_sre.SRE_pattern) – a compiled regex to match a citation
Returns: the paper as a dict, with list of authors and list of citations
Return type: dict
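The regex-based matching can be illustrated with the sketch below. The default regexes and the keys of the returned dict are not documented here, so everything in the example is an assumption: the line markers (#index, #*, #@, #!, #%), the key names, and the stand-in function lines_to_paper, which only imitates the documented behaviour.

```python
import re

# Hypothetical line markers; the module's default regexes may differ.
R_INDEX = re.compile(r"^#index\s*(.+)$")
R_TITLE = re.compile(r"^#\*\s*(.+)$")
R_AUTHOR = re.compile(r"^#@\s*(.+)$")
R_ABSTRACT = re.compile(r"^#!\s*(.+)$")
R_CITE = re.compile(r"^#%\s*(.+)$")

# Fields that hold a single string value.
SINGLE_VALUED = {"index": R_INDEX, "title": R_TITLE, "abstract": R_ABSTRACT}


def lines_to_paper(lines):
    """Stand-in for list2paper: classify each line by regex and fill a dict."""
    paper = {"index": None, "title": None, "abstract": None,
             "authors": [], "citations": []}
    for line in lines:
        author_match = R_AUTHOR.match(line)
        cite_match = R_CITE.match(line)
        if author_match:
            # Assumes authors are comma-separated on one line.
            paper["authors"] = [a.strip() for a in author_match.group(1).split(",")]
        elif cite_match:
            # One citation reference per line, accumulated in a list.
            paper["citations"].append(cite_match.group(1).strip())
        else:
            for key, regex in SINGLE_VALUED.items():
                match = regex.match(line)
                if match:
                    paper[key] = match.group(1).strip()
                    break
    return paper


paper = lines_to_paper([
    "#index 42",
    "#* A title",
    "#@ Ada Lovelace, Alan Turing",
    "#! A short abstract.",
    "#% 7",
    "#% 9",
])
```

Compiled patterns of this kind are what the r_index, r_author, r_title, r_abstract and r_cite parameters accept when the raw file uses different markers from the defaults.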