***************************************
How to generate a dataset?
***************************************

The script ``generate_dataset.py`` transforms raw data files (the author's papers, irrelevant papers, cited papers) into a dataset usable by a neural network. These files *must* follow the format described below.

------------------
Text file format
------------------

This script uses two raw input files: the user's papers file (``author-papers.txt``) and the irrelevant papers file (``bad-papers.txt``). They are formatted in the same way:

* each paper is represented by several lines;
* the papers are separated by one blank line;
* each line in a block carries one kind of information (title, abstract, citation, ...);
* the type of information is identified with a regular expression.

(A sketch of such a parser is given at the end of this page.)

Here is an example of formatting, taken from [Tang 2008]:

>>> #* --- paperTitle
>>> #@ --- Authors
>>> #t ---- Year
>>> #c --- publication venue
>>> #index 00---- index id of this paper
>>> #% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
>>> #! --- Abstract

.. pull-quote::

   Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008). pp. 990-998.

--------
Usage
--------

In a console, you can use the following command:

>>> python generate_dataset.py -n "Gabriella Pasi" -s "pasi" -af "./data/pasi-papers.txt" -bf "./data/bad-papers.txt" -c 4 -d "dblp" -o "./data/dataset-pasi"

This parses the files and queries the SQL database in order to build a numeric dataset. The produced dataset is stored in a file and can be reused later, for example in the training step.

-----
API
-----

.. automodule:: generate_dataset
    :members:
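
-----------------
Parsing example
-----------------

The exact regular expressions used by ``generate_dataset.py`` are defined in the module itself (see the API section above). The sketch below is only an illustration, under the assumption that the raw files use the [Tang 2008] markers shown earlier; the function name ``parse_papers`` and the field names are hypothetical::

    import re

    # Hypothetical mapping from the [Tang 2008] line markers to field names.
    # The real patterns live in generate_dataset.py; these are assumptions.
    MARKERS = [
        (re.compile(r"^#\*(.*)"), "title"),
        (re.compile(r"^#@(.*)"), "authors"),
        (re.compile(r"^#t(.*)"), "year"),
        (re.compile(r"^#c(.*)"), "venue"),
        (re.compile(r"^#index(.*)"), "index"),
        (re.compile(r"^#%(.*)"), "references"),
        (re.compile(r"^#!(.*)"), "abstract"),
    ]

    def parse_papers(path):
        """Yield one dict per paper block found in a raw text file."""
        paper = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    # Papers are separated by one blank line.
                    if paper:
                        yield paper
                        paper = {}
                    continue
                for pattern, field in MARKERS:
                    match = pattern.match(line)
                    if match:
                        value = match.group(1).strip()
                        if field == "references":
                            # One cited paper id per "#%" line.
                            paper.setdefault(field, []).append(value)
                        else:
                            paper[field] = value
                        break
        if paper:
            # Last block if the file does not end with a blank line.
            yield paper

For instance, ``list(parse_papers("./data/pasi-papers.txt"))`` would return one dictionary per paper block of the user's papers file.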