Data Retrieval

The data_retrieval module provides functions to retrieve data from raw text files and from a SQL database. The text files must follow the block format described under get_author_papers.

data_retrieval.generate_citations(author_papers)

Returns the citation relations.

Parameters:	author_papers (list of dicts) – the author’s papers, as a list of dicts produced by the function `data_retrieval.list2paper()`
Returns:	the citation relations
Return type:	`pandas.DataFrame`
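
As a rough illustration of what "citation relations" could look like, the sketch below builds one row per (citing paper, cited paper) pair. The dict keys `index` and `citations`, the column names `citing` and `cited`, and the helper name `citation_pairs` are all assumptions made for this example; the actual layout produced by `generate_citations` is not specified here.

```python
import pandas as pd


def citation_pairs(author_papers):
    """Build one (citing, cited) row per citation (hypothetical layout)."""
    rows = [
        (paper["index"], cited)
        for paper in author_papers
        # Assumed key: "citations" holds the ids of the cited papers.
        for cited in paper.get("citations", [])
    ]
    return pd.DataFrame(rows, columns=["citing", "cited"])


papers = [
    {"index": "1", "citations": ["10", "11"]},
    {"index": "2", "citations": ["10"]},
    {"index": "3", "citations": []},
]
relations = citation_pairs(papers)
```

A long format like this one keeps papers without citations out of the table and makes it straightforward to group or join on either column.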

data_retrieval.get_author_papers(author_name, author_slug, input_file)

Returns the list of papers written by the author (list of dicts) from a raw text file.

The text file must be formatted in the following way:

each paper is a block of lines;
each line represents either the index, the title, the abstract, the list of authors or a citation reference;
the type of each line can be recognised with a regular expression;
the papers are separated by a blank line.

Parameters:	author_name (string) – the real name of the user
	author_slug (string) – a short ASCII string that replaces the author’s name
	input_file (string) – the name of the file in which the author’s papers are stored
Returns:	the author’s papers as dicts, split into those with an abstract and those without
Return type:	tuple
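
The block structure described above (papers separated by blank lines, one field per line) can be sketched as follows. The helper name `split_blocks` and the line prefixes in the sample (`#index`, `#*`, `#@`, `#%`) are assumptions for illustration only; the real prefixes depend on the regular expressions the module uses.

```python
def split_blocks(text):
    """Split raw text into papers: one list of non-empty lines per block."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the current paper block.
            blocks.append(current)
            current = []
    if current:
        # The last block may not be followed by a blank line.
        blocks.append(current)
    return blocks


# Hypothetical sample input with two papers.
raw = """#index 1
#* First title
#@ A. Author, B. Author

#index 2
#* Second title
#% 1
"""
blocks = split_blocks(raw)
```

Each resulting list of strings has the shape that `data_retrieval.list2paper()` takes as its `l_paper` argument.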

data_retrieval.get_cited_papers(cited, db_cursor, papers_table='papers')

Retrieves the cited papers data from a SQL database.

The table papers_table must have the columns: id, title and abstract.

Parameters:	cited (list of strings) – list of the cited papers’ ids
	db_cursor (`MySQLdb.cursors.Cursor`) – cursor of a SQL database in which there is a papers table
	papers_table (string) – name of the papers table in the SQL database
Returns:	the results of the SQL query
Return type:	tuple of tuples
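
A query of this kind might look like the sketch below. To keep the example self-contained it uses the standard-library `sqlite3` module in place of `MySQLdb`, and the helper name `fetch_cited` is hypothetical; it is not the module's implementation, only one way to select the `id`, `title` and `abstract` columns for a list of ids.

```python
import sqlite3


def fetch_cited(cited, db_cursor, papers_table="papers"):
    """Fetch (id, title, abstract) rows for the given paper ids."""
    if not cited:
        # "IN ()" is not valid SQL, so return early on an empty list.
        return ()
    # One "?" placeholder per id; MySQLdb would use "%s" instead.
    placeholders = ", ".join("?" for _ in cited)
    # Table names cannot be bound as parameters, so papers_table must
    # come from trusted code, never from user input.
    query = "SELECT id, title, abstract FROM {} WHERE id IN ({})".format(
        papers_table, placeholders
    )
    db_cursor.execute(query, list(cited))
    return tuple(db_cursor.fetchall())


# In-memory database standing in for the real papers table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE papers (id TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
cur.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("1", "First", "Abstract one"),
        ("2", "Second", ""),
        ("3", "Third", "Abstract three"),
    ],
)
rows = fetch_cited(["1", "3"], cur)
```

Binding the ids as parameters rather than formatting them into the query string avoids quoting problems and SQL injection through the id values.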

data_retrieval.get_irrelevant_cited_papers(bad_papers, db_cursor, papers_table='papers')

Retrieves the papers cited by the irrelevant papers given in input, from a SQL database.

Parameters:	bad_papers (list of dicts) – the list of irrelevant papers, formatted as the output of `data_retrieval.list2paper()`
	db_cursor (`MySQLdb.cursors.Cursor`) – cursor of a SQL database in which there is a papers table
	papers_table (string) – name of the papers table in the SQL database
Returns:	the results of the SQL query
Return type:	tuple of tuples

data_retrieval.get_irrelevant_papers(input_file)

Returns the list of irrelevant papers (list of dicts) read from a raw text file.

Parameters:	input_file (string) – relative path to the raw text file
Returns:	the list of irrelevant papers (with abstract) formatted as dicts
Return type:	list of dicts

data_retrieval.list2paper(l_paper, r_index=None, r_author=None, r_title=None, r_abstract=None, r_cite=None)

Transforms a raw paper (formatted as a list of strings) into a dict.

This function uses regular expressions to match the title, abstract, authors, etc. in each element of the input list. If a regex argument is None, a default regex is used for that field.

Parameters:	l_paper (list of strings) – the list of elements forming the paper (title, authors, etc.), in raw format
	r_index (`_sre.SRE_pattern`) – a compiled regex to match an index string
	r_author (`_sre.SRE_pattern`) – a compiled regex to match an authors list
	r_title (`_sre.SRE_pattern`) – a compiled regex to match a title
	r_abstract (`_sre.SRE_pattern`) – a compiled regex to match an abstract
	r_cite (`_sre.SRE_pattern`) – a compiled regex to match a citation
Returns:	the paper as a dict, with list of authors and list of citations
Return type:	dict
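
The regex-with-defaults behaviour described above can be sketched as follows. The default patterns, the dict keys, the comma-separated author format and the function name `lines_to_paper` are assumptions chosen for this example; the module's real defaults are not documented here.

```python
import re

# Hypothetical default patterns, one per line type.
DEFAULTS = {
    "index": re.compile(r"^#index\s*(.+)$"),
    "title": re.compile(r"^#\*\s*(.+)$"),
    "authors": re.compile(r"^#@\s*(.+)$"),
    "abstract": re.compile(r"^#!\s*(.+)$"),
    "cite": re.compile(r"^#%\s*(.+)$"),
}


def lines_to_paper(l_paper, r_index=None, r_author=None, r_title=None,
                   r_abstract=None, r_cite=None):
    """Turn the raw lines of one paper into a dict."""
    # Fall back to the default pattern whenever an argument is None.
    patterns = {
        "index": r_index or DEFAULTS["index"],
        "title": r_title or DEFAULTS["title"],
        "authors": r_author or DEFAULTS["authors"],
        "abstract": r_abstract or DEFAULTS["abstract"],
        "cite": r_cite or DEFAULTS["cite"],
    }
    paper = {"authors": [], "citations": []}
    for line in l_paper:
        for kind, pattern in patterns.items():
            match = pattern.match(line)
            if not match:
                continue
            value = match.group(1).strip()
            if kind == "authors":
                # Assumed format: author names separated by commas.
                paper["authors"] = [a.strip() for a in value.split(",")]
            elif kind == "cite":
                # A paper may contain several citation lines.
                paper["citations"].append(value)
            else:
                paper[kind] = value
            break  # each line has exactly one type
    return paper


paper = lines_to_paper(
    ["#index 7", "#* A title", "#@ Ann Lee, Bo Chen", "#% 3", "#% 5"]
)
```

Passing a compiled pattern such as `r_title=re.compile(r"^T:\s*(.+)$")` overrides only that field's matcher and leaves the other defaults in place.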