How to generate a dataset?
The script generate_dataset.py transforms raw data files (the author’s papers, irrelevant papers, cited papers) into a dataset usable by a neural network.
The input files must follow the text format described below.
Text file format
This script uses two raw input files: the user’s papers file (author-papers.txt) and the irrelevant papers file (bad-papers.txt).
They are formatted in the same way:
- each paper is represented by several lines;
- the papers are separated by one blank line;
- each line in a block carries one kind of information (title, abstract, citation, ...);
- the kind of information on a line can be identified from its prefix with a regular expression.
Here is an example of formatting, taken from [Tang 2008]:
>>> #* --- paperTitle
>>> #@ --- Authors
>>> #t ---- Year
>>> #c --- publication venue
>>> #index 00---- index id of this paper
>>> #% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
>>> #! --- Abstract
Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008). pp. 990-998.
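The formatting rules above can be sketched as a small parser. This is only an illustration, not the code of generate_dataset.py: the function name parse_papers, the FIELDS mapping and the field names are assumptions, and the sketch assumes that each value directly follows its marker (for instance #t2008).

```python
import re

# Marker -> field name, following the layout quoted from [Tang 2008].
# The field names on the right are illustrative choices.
FIELDS = {
    "*": "title",
    "@": "authors",
    "t": "year",
    "c": "venue",
    "index": "index",
    "%": "references",
    "!": "abstract",
}

# "index" is tried first so that "#index" is not confused with a
# single-character marker.
LINE_RE = re.compile(r"^#(index|[*@tc%!])\s*(.*)$")


def parse_papers(text):
    """Split raw text into papers (blocks separated by a blank line)."""
    papers = []
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue  # empty input or stray blank lines
        paper = {"references": []}
        for line in block.splitlines():
            match = LINE_RE.match(line.strip())
            if match is None:
                continue  # ignore lines without a known marker
            field = FIELDS[match.group(1)]
            value = match.group(2).strip()
            if field == "references":
                # A paper may have several "#%" lines, one per reference.
                if value:
                    paper["references"].append(value)
            else:
                paper[field] = value
        papers.append(paper)
    return papers
```

Each paper becomes a dictionary, and the repeated reference lines are collected into a list, so a block with two #% lines yields two entries under "references".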
Usage
In a console, you can use the following command:
>>> python generate_dataset.py -n "Gabriella Pasi" -s "pasi" -af "./data/pasi-papers.txt" -bf "./data/bad-papers.txt" -c 4 -d "dblp" -o "./data/dataset-pasi"
This will parse the files and query the SQL database in order to build a numeric dataset. The produced dataset is stored in a file and can be reused later, for example in the training part.
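To make the flags of the command above easier to read, here is a hedged sketch of how they could be declared with argparse. The flag names come from the command itself; the destination names are borrowed from the signature of main, and the pairing between the two (for instance -c with num_entries) is inferred from their order, so it is an assumption rather than the script's actual code.

```python
import argparse


def build_parser():
    """Hypothetical declaration of the command-line flags."""
    parser = argparse.ArgumentParser(
        description="Generate a dataset from raw paper files."
    )
    parser.add_argument("-n", dest="author_name", required=True,
                        help="full name of the author")
    parser.add_argument("-s", dest="author_slug", required=True,
                        help="short ASCII string for the author's name")
    parser.add_argument("-af", dest="author_papers_file", required=True,
                        help="path to the author's papers file")
    parser.add_argument("-bf", dest="bad_papers_file", required=True,
                        help="path to the irrelevant papers file")
    parser.add_argument("-c", dest="num_entries", type=int, required=True,
                        help="number of compared papers (assumed mapping)")
    parser.add_argument("-d", dest="db_name", required=True,
                        help="name of the SQL database")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="path of the file in which the dataset is saved")
    return parser


# Same arguments as in the console command above.
args = build_parser().parse_args([
    "-n", "Gabriella Pasi",
    "-s", "pasi",
    "-af", "./data/pasi-papers.txt",
    "-bf", "./data/bad-papers.txt",
    "-c", "4",
    "-d", "dblp",
    "-o", "./data/dataset-pasi",
])
```

With this layout, vars(args) yields a dictionary whose keys match the parameter names of main.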
API
generate_dataset.main(author_name, author_slug, author_papers_file, bad_papers_file, num_entries, db_name, output_file)
Given an author (name, papers), generates a dataset usable by the DSSM script.
Parameters:
- author_name (string) – the full name of the author
- author_slug (string) – a short ASCII string for the author’s name (example: “Gabriella Pasi” -> “pasi”)
- author_papers_file (string) – the relative path to the file containing the raw data of the author’s papers
- bad_papers_file (string) – the relative path to the file containing the raw data of irrelevant papers
- num_entries (int) – the number of compared papers in the DSSM structure (usually, 6)
- db_name (string) – the name of the SQL database in which all the papers are stored
- output_file (string) – the relative path to the file in which the dataset is saved
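The author_slug parameter is chosen by the caller. As an illustration of the example given for it (“Gabriella Pasi” -> “pasi”), the sketch below derives a slug from a full name. The helper make_slug is hypothetical and not part of generate_dataset.py; it assumes that the slug is the last word of the name, lowercased and stripped of accents.

```python
import unicodedata


def make_slug(full_name):
    """Hypothetical helper: last word of the name, ASCII and lowercase."""
    words = full_name.split()
    if not words:
        raise ValueError("full_name must not be empty")
    # Decompose accented characters, then drop the non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", words[-1])
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()
```

For the name used in the usage example, make_slug("Gabriella Pasi") returns "pasi".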