Commit 0289d15f authored by Rodrigo Barbado Esteban

minor changes

parent 73681502
......@@ -29,7 +29,7 @@ The following figure describes the architecture from a modular point of view, be
Tasks Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in Fuseki or ElasticSearch so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.
The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in `Fuseki <https://jena.apache.org/documentation/serving_data/>`_ or `ElasticSearch <https://www.elastic.co/>`_ so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.
This tasks server is triggered periodically by the cron job scheduler, so that new information is obtained every day. That way, any user can visualize data at any time with the certainty that there will be data stored in the system.
......@@ -45,9 +45,9 @@ As is represented above, pipelines architecture is divided into three main steps
* **Analyze** task is responsible for taking the input JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached in each request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.
* **Store** process consists of storing the previously generated JSON, which contains the analysis result, inside an elasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which is used to categorize information that belongs to the same index. There is a third parameter, the **id** of the entry, but it is generated automatically by default.
* **Store** process consists of storing the previously generated JSON, which contains the analysis result, inside an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which is used to categorize information that belongs to the same index. There is a third parameter, the **id** of the entry, but it is generated automatically by default.
To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. A suitable choice of elasticSearch parameters could be **twitter** as the elasticSearch *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform) and, lastly, the **datetime** when the task request was triggered as the *id*.
To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. A suitable choice of ElasticSearch parameters could be **twitter** as the ElasticSearch *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform) and, lastly, the **datetime** when the task request was triggered as the *id*.
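The following minimal Luigi sketch illustrates how such a Fetch, Analyze and Store sequence could be wired together. It is not the actual GSI Crawler code: the task names, file targets, Senpy endpoint and ElasticSearch parameters are illustrative assumptions.

.. code:: python

    # Illustrative sketch only: task names, endpoints and parameters are assumptions,
    # not the actual GSI Crawler implementation.
    import json

    import luigi
    import requests
    from elasticsearch import Elasticsearch

    SENPY_URL = "http://senpy:5000/api/"    # hypothetical Senpy endpoint
    ES_HOST = "http://elasticsearch:9200"   # hypothetical ElasticSearch endpoint


    class Fetch(luigi.Task):
        """Scrape or download the raw content and dump it as JSON."""
        url = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget("fetched.json")

        def run(self):
            # Placeholder for the real scraping/API logic.
            docs = [{"text": "Example tweet to analyze"}]
            with self.output().open("w") as f:
                json.dump(docs, f)


    class Analyze(luigi.Task):
        """Send each text to Senpy and attach the analysis result."""
        url = luigi.Parameter()

        def requires(self):
            return Fetch(url=self.url)

        def output(self):
            return luigi.LocalTarget("analyzed.json")

        def run(self):
            with self.input().open() as f:
                docs = json.load(f)
            for doc in docs:
                resp = requests.get(SENPY_URL, params={"algo": "sentiment", "i": doc["text"]})
                doc["analysis"] = resp.json()
            with self.output().open("w") as f:
                json.dump(docs, f)


    class Store(luigi.Task):
        """Index the analyzed documents, e.g. index='twitter', doc type='sentiment'."""
        url = luigi.Parameter()

        def requires(self):
            return Analyze(url=self.url)

        def run(self):
            es = Elasticsearch([ES_HOST])
            with self.input().open() as f:
                for doc in json.load(f):
                    es.index(index="twitter", doc_type="sentiment", body=doc)

With this sketch saved as, for example, ``pipeline.py``, running ``luigi --module pipeline Store --url <some-url> --local-scheduler`` would execute the three tasks in order; this kind of execution is what the Luigi task visualizer displays.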
Now that the Luigi orchestrator has been explained, we will conclude this section by detailing how the server behaves when it receives a user request, and which parameters are mandatory to run the operation. The workflow is shown in the diagram below:
......
What is GSI Crawler?
----------------
--------------------
GSI Crawler is an innovative and useful framework which aims to extract information from web pages, enriching it following semantic approaches. At the moment, there are three available platforms: Twitter, Reddit and News. The user interacts with the tool through a web interface, selecting the analysis type they want to carry out and the platform that is going to be examined.
.. image:: images/crawler1.png
:align: center
In this documentation we are going to introduce this framework, detailing the global architecture of the project and explaining the functionality of each module. Finally, we will present a case study in order to better understand the system itself.
.. image:: images/crawler1.png
:align: center
......@@ -10,8 +10,8 @@ Contents:
.. toctree::
gsicrawler
architecture
tutorials
architecture
:maxdepth: 4
.. api
......
......@@ -24,44 +24,42 @@ For docker installation in Ubuntu, visit this `link <https://store.docker.com/ed
Docker-compose installation detailed instructions are available `here <https://docs.docker.com/compose/install/>`_.
First of all, you need to clone the repository:
First of all, you need to clone the repositories:
.. code:: bash
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git
$ cd gsicrawler
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/dashboard-gsicrawler.git
Then, you need to set up the environment variables. To do so, first create a file named ``.env`` in the root directory of the project. As you can see, Twitter and MeaningCloud credentials are needed if you wish to use those services.
Then, you need to set up the environment variables. To do so, first create a file named ``.env`` in the root directory of each project (gsicrawler and dashboard-gsicrawler). As you can see, Twitter and MeaningCloud credentials are needed if you wish to use those services.
.. code::
TWITTER_CONSUMER_KEY={YourConsumerKey}
TWITTER_CONSUMER_SECRET={YourConsumerSecret}
TWITTER_ACCESS_TOKEN={YourAccessToken}
TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret}
TWITTER_CONSUMER_KEY={YourConsumerKey, get it on Twitter}
TWITTER_CONSUMER_SECRET={YourConsumerSecret, get it on Twitter}
TWITTER_ACCESS_TOKEN={YourAccessToken, get it on Twitter}
TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret, get it on Twitter}
ES_ENDPOINT=elasticsearch
ES_PORT=9200
ES_ENDPOINT_EXTERNAL=localhost:19200
FUSEKI_PASSWORD={YourFusekiPass}
FUSEKI_ENDPOINT_EXTERNAL=fuseki:3030
FUSEKI_ENDPOINT={YourFusekiEndPoint}
API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey}
API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey, get it on Meaningcloud}
FUSEKI_ENDPOINT_DASHBOARD={YourFusekiEndpoint, e.g. localhost:13030}
FUSEKI_ENDPOINT = fuseki
FUSEKI_PORT = 3030
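For reference, a service written in Python could read these values from the environment as in the minimal sketch below. The variable names are the ones listed above; how the actual GSI Crawler code loads its configuration may differ.

.. code:: python

    # Minimal sketch: assumes the variable names from the .env file above;
    # the actual GSI Crawler tasks may load their configuration differently.
    import os

    ES_ENDPOINT = os.environ.get("ES_ENDPOINT", "elasticsearch")
    ES_PORT = int(os.environ.get("ES_PORT", "9200"))
    TWITTER_CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]  # fails fast if missing
    API_KEY_MEANING_CLOUD = os.environ.get("API_KEY_MEANING_CLOUD")

    ES_URL = "http://{}:{}".format(ES_ENDPOINT, ES_PORT)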
Once you have created the file, you should add a new ``env_file`` attribute to the **luigi** service in ``docker-compose.yml``, with ``.env`` as its value.
.. code::
env_file:
- .env
Finally, to run the image:
Finally, execute the following commands to start both projects:
.. code:: bash
$ sudo docker-compose up
$ cd gsicrawler
$ sudo docker-compose up
$ cd ../dashboard-gsicrawler
$ sudo docker-compose up
The information related to the initialization can be found in the console. If you wish to see how the tasks are being executed, apart from reading the logs you can access the Luigi task visualizer at ``localhost:8082``. In the next steps you will discover more about Luigi.
......
......@@ -23,8 +23,7 @@
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Getting started" href="tutorials.html" />
<link rel="prev" title="What is GSI Crawler?" href="gsicrawler.html" />
<link rel="prev" title="Getting started" href="tutorials.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
......@@ -57,7 +56,7 @@
<img alt="_images/arch.png" class="align-center" src="_images/arch.png" />
<div class="section" id="tasks-server">
<h3>Tasks Server<a class="headerlink" href="#tasks-server" title="Permalink to this headline"></a></h3>
<p>The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in Fuseki or ElasticSearch so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.</p>
<p>The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in <a class="reference external" href="https://jena.apache.org/documentation/serving_data/">Fuseki</a> or <a class="reference external" href="https://www.elastic.co/">ElasticSearch</a> so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.</p>
<p>This tasks server is triggered periodically by the cron job scheduler, so that new information is obtained every day. That way, any user can visualize data at any time with the certainty that there will be data stored in the system.</p>
<p>All the pipelines have the same structure, as represented in the figure below.</p>
<a class="reference internal image-reference" href="_images/picLuigi.png"><img alt="_images/picLuigi.png" class="align-center" src="_images/picLuigi.png" style="width: 768.0px; height: 432.0px;" /></a>
......@@ -65,9 +64,9 @@
<ul class="simple">
<li><strong>Fetch</strong> refers to the process of obtaining tweets, comments or any content that is to be analyzed, from the provided URL. Most of the time, this task involves webpage parsing, recognizing valuable information contained inside HTML tags and building a new JSON file with the selected data. This process is commonly known as <em>scraping</em> a website. In order to facilitate this filtering process, there exist multiple extensions or libraries that offer a well-formed structure to carry out this task in a more comfortable way. Inside the Tasks Server, we have imported the Scrapy library in order to speed up the data mining process. Scrapy is an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. It is based on subclasses named <em>spiders</em>, which contain the required methods to extract the information. Apart from the Scrapy library, several APIs have also been used for retrieving data. The GSI Crawler application has three available scrapers, one each for the Twitter and Reddit platforms, and another one which includes spiders for different news sources. To conclude, this task focuses on extracting the valuable data and generating a JSON file which can be analyzed by the following task in the pipeline.</li>
<li><strong>Analyze</strong> task is responsible for taking the input JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached in each request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.</li>
<li><strong>Store</strong> process consists of storing the previously generated JSON, which contains the analysis result, inside an elasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the <strong>index</strong>, which represents the elastic index where the information will be saved, and the <strong>doc type</strong>, which is used to categorize information that belongs to the same index. There is a third parameter, the <strong>id</strong> of the entry, but it is generated automatically by default.</li>
<li><strong>Store</strong> process consists of storing the previously generated JSON, which contains the analysis result, inside an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the <strong>index</strong>, which represents the elastic index where the information will be saved, and the <strong>doc type</strong>, which is used to categorize information that belongs to the same index. There is a third parameter, the <strong>id</strong> of the entry, but it is generated automatically by default.</li>
</ul>
<p>To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a <strong>sentiment</strong> analysis for a certain <strong>Tweet</strong>. A suitable choice of elasticSearch parameters could be <strong>twitter</strong> as the elasticSearch <em>index</em>, <strong>sentiment</strong> as the <em>doc type</em> (since an emotion analysis could also exist within the same platform) and, lastly, the <strong>datetime</strong> when the task request was triggered as the <em>id</em>.</p>
<p>To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a <strong>sentiment</strong> analysis for a certain <strong>Tweet</strong>. A suitable choice of ElasticSearch parameters could be <strong>twitter</strong> as the ElasticSearch <em>index</em>, <strong>sentiment</strong> as the <em>doc type</em> (since an emotion analysis could also exist within the same platform) and, lastly, the <strong>datetime</strong> when the task request was triggered as the <em>id</em>.</p>
<p>Now that the Luigi orchestrator has been explained, we will conclude this section by detailing how the server behaves when it receives a user request, and which parameters are mandatory to run the operation. The workflow is shown in the diagram below:</p>
<img alt="_images/task-diagram.png" class="align-center" src="_images/task-diagram.png" />
</div>
......@@ -108,6 +107,7 @@
<h3>Navigation</h3>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Architecture</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#overview">Overview</a></li>
<li class="toctree-l2"><a class="reference internal" href="#modules">Modules</a><ul>
......@@ -117,7 +117,6 @@
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
</ul>
......
......@@ -76,8 +76,8 @@
<h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
</ul>
......
......@@ -23,7 +23,7 @@
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Architecture" href="architecture.html" />
<link rel="next" title="Getting started" href="tutorials.html" />
<link rel="prev" title="Welcome to GSI Crawler’s documentation!" href="index.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
......@@ -43,8 +43,8 @@
<div class="section" id="what-is-gsi-crawler">
<h1>What is GSI Crawler?<a class="headerlink" href="#what-is-gsi-crawler" title="Permalink to this headline"></a></h1>
<p>GSI Crawler is an innovative and useful framework which aims to extract information from web pages, enriching it following semantic approaches. At the moment, there are three available platforms: Twitter, Reddit and News. The user interacts with the tool through a web interface, selecting the analysis type they want to carry out and the platform that is going to be examined.</p>
<p>In this documentation we are going to introduce this framework, detailing the global architecture of the project and explaining the functionality of each module. Finally, we will present a case study in order to better understand the system itself.</p>
<img alt="_images/crawler1.png" class="align-center" src="_images/crawler1.png" />
<p>In this documentation we are going to introduce this framework, detailing the global architecture of the project and explaining the functionality of each module. Finally, we will present a case study in order to better understand the system itself.</p>
</div>
......@@ -77,8 +77,8 @@
<h3>Navigation</h3>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
</ul>
......
......@@ -45,6 +45,13 @@
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#first-glance-into-gsi-crawler">First glance into GSI Crawler</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-i-install">Tutorial I: Install</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-ii-crawling-news">Tutorial II: Crawling news</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-iii-semantic-enrichment-and-data-storage">Tutorial III: Semantic enrichment and data storage</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a><ul>
<li class="toctree-l2"><a class="reference internal" href="architecture.html#overview">Overview</a></li>
<li class="toctree-l2"><a class="reference internal" href="architecture.html#modules">Modules</a><ul>
......@@ -54,13 +61,6 @@
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#first-glance-into-gsi-crawler">First glance into GSI Crawler</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-i-install">Tutorial I: Install</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-ii-crawling-news">Tutorial II: Crawling news</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-iii-semantic-enrichment-and-data-storage">Tutorial III: Semantic enrichment and data storage</a></li>
</ul>
</li>
</ul>
</div>
</div>
......@@ -95,8 +95,8 @@
<h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
</ul>
......
......@@ -99,8 +99,8 @@
<h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
</ul>
......
......@@ -23,7 +23,8 @@
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="prev" title="Architecture" href="architecture.html" />
<link rel="next" title="Architecture" href="architecture.html" />
<link rel="prev" title="What is GSI Crawler?" href="gsicrawler.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
......@@ -55,32 +56,33 @@
<p>GSI Crawler installation is based on docker containers, so it is required to have both docker and docker-compose installed.</p>
<p>For docker installation in Ubuntu, visit this <a class="reference external" href="https://store.docker.com/editions/community/docker-ce-server-ubuntu?tab=description">link</a>.</p>
<p>Docker-compose installation detailed instructions are available <a class="reference external" href="https://docs.docker.com/compose/install/">here</a>.</p>
<p>First of all, you need to clone the repository:</p>
<p>First of all, you need to clone the repositories:</p>
<div class="code bash highlight-default"><div class="highlight"><pre><span></span>$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git
$ cd gsicrawler
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/dashboard-gsicrawler.git
</pre></div>
</div>
<p>Then, you need to set up the environment variables. To do so, first create a file named <code class="docutils literal"><span class="pre">.env</span></code> in the root directory of the project. As you can see, Twitter and MeaningCloud credentials are needed if you wish to use those services.</p>
<div class="code highlight-default"><div class="highlight"><pre><span></span><span class="n">TWITTER_CONSUMER_KEY</span><span class="o">=</span><span class="p">{</span><span class="n">YourConsumerKey</span><span class="p">}</span>
<span class="n">TWITTER_CONSUMER_SECRET</span><span class="o">=</span><span class="p">{</span><span class="n">YourConsumerSecret</span><span class="p">}</span>
<span class="n">TWITTER_ACCESS_TOKEN</span><span class="o">=</span><span class="p">{</span><span class="n">YourAccessToken</span><span class="p">}</span>
<span class="n">TWITTER_ACCESS_TOKEN_SECRET</span><span class="o">=</span><span class="p">{</span><span class="n">YourAccessTokenSecret</span><span class="p">}</span>
<p>Then, you need to set up the environment variables. To do so, first create a file named <code class="docutils literal"><span class="pre">.env</span></code> in the root directory of each project (gsicrawler and dashboard-gsicrawler). As you can see, Twitter and MeaningCloud credentials are needed if you wish to use those services.</p>
<div class="code highlight-default"><div class="highlight"><pre><span></span><span class="n">TWITTER_CONSUMER_KEY</span><span class="o">=</span><span class="p">{</span><span class="n">YourConsumerKey</span><span class="p">,</span> <span class="n">get</span> <span class="n">it</span> <span class="n">on</span> <span class="n">Twitter</span><span class="p">}</span>
<span class="n">TWITTER_CONSUMER_SECRET</span><span class="o">=</span><span class="p">{</span><span class="n">YourConsumerSecret</span><span class="p">,</span> <span class="n">get</span> <span class="n">it</span> <span class="n">on</span> <span class="n">Twitter</span><span class="p">}</span>
<span class="n">TWITTER_ACCESS_TOKEN</span><span class="o">=</span><span class="p">{</span><span class="n">YourAccessToken</span><span class="p">,</span> <span class="n">get</span> <span class="n">it</span> <span class="n">on</span> <span class="n">Twitter</span><span class="p">}</span>
<span class="n">TWITTER_ACCESS_TOKEN_SECRET</span><span class="o">=</span><span class="p">{</span><span class="n">YourAccessTokenSecret</span><span class="p">,</span> <span class="n">get</span> <span class="n">it</span> <span class="n">on</span> <span class="n">Twitter</span><span class="p">}</span>
<span class="n">ES_ENDPOINT</span><span class="o">=</span><span class="n">elasticsearch</span>
<span class="n">ES_PORT</span><span class="o">=</span><span class="mi">9200</span>
<span class="n">ES_ENDPOINT_EXTERNAL</span><span class="o">=</span><span class="n">localhost</span><span class="p">:</span><span class="mi">19200</span>
<span class="n">FUSEKI_PASSWORD</span><span class="o">=</span><span class="p">{</span><span class="n">YourFusekiPass</span><span class="p">}</span>
<span class="n">FUSEKI_ENDPOINT_EXTERNAL</span><span class="o">=</span><span class="n">fuseki</span><span class="p">:</span><span class="mi">3030</span>
<span class="n">FUSEKI_ENDPOINT</span><span class="o">=</span><span class="p">{</span><span class="n">YourFusekiEndPoint</span><span class="p">}</span>
<span class="n">API_KEY_MEANING_CLOUD</span><span class="o">=</span><span class="p">{</span><span class="n">YourMeaningCloudApiKey</span><span class="p">}</span>
<span class="n">API_KEY_MEANING_CLOUD</span><span class="o">=</span><span class="p">{</span><span class="n">YourMeaningCloudApiKey</span><span class="p">,</span> <span class="n">get</span> <span class="n">it</span> <span class="n">on</span> <span class="n">Meaningcloud</span><span class="p">}</span>
<span class="n">FUSEKI_ENDPOINT_DASHBOARD</span><span class="o">=</span><span class="p">{</span><span class="n">YourFusekiEndpoint</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">g</span><span class="o">.</span> <span class="n">localhost</span><span class="p">:</span><span class="mi">13030</span><span class="p">}</span>
<span class="n">FUSEKI_ENDPOINT</span> <span class="o">=</span> <span class="n">fuseki</span>
<span class="n">FUSEKI_PORT</span> <span class="o">=</span> <span class="mi">3030</span>
</pre></div>
</div>
<p>Once you have created the file, you should add a new <code class="docutils literal"><span class="pre">env_file</span></code> attribute to the <strong>luigi</strong> service in <code class="docutils literal"><span class="pre">docker-compose.yml</span></code>, with <code class="docutils literal"><span class="pre">.env</span></code> as its value.</p>
<div class="code highlight-default"><div class="highlight"><pre><span></span><span class="n">env_file</span><span class="p">:</span>
<span class="o">-</span> <span class="o">.</span><span class="n">env</span>
</pre></div>
</div>
<p>Finally, to run the image:</p>
<div class="code bash highlight-default"><div class="highlight"><pre><span></span>$ sudo docker-compose up
<p>Finally, execute the following commands to start both projects:</p>
<div class="code bash highlight-default"><div class="highlight"><pre><span></span>$ cd gsicrawler
$ sudo docker-compose up
$ cd ../dashboard-gsicrawler
$ sudo docker-compose up
</pre></div>
</div>
<p>The information related to the initialization can be found in the console. If you wish to see how the tasks are being executed, apart from reading the logs you can access the Luigi task visualizer at <code class="docutils literal"><span class="pre">localhost:8082</span></code>. In the next steps you will discover more about Luigi.</p>
......@@ -258,7 +260,6 @@ $ cd gsicrawler
<h3>Navigation</h3>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Getting started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#first-glance-into-gsi-crawler">First glance into GSI Crawler</a></li>
<li class="toctree-l2"><a class="reference internal" href="#tutorial-i-install">Tutorial I: Install</a></li>
......@@ -266,6 +267,7 @@ $ cd gsicrawler
<li class="toctree-l2"><a class="reference internal" href="#tutorial-iii-semantic-enrichment-and-data-storage">Tutorial III: Semantic enrichment and data storage</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
</ul>
......
......@@ -29,7 +29,7 @@ The following figure describes the architecture from a modular point of view, be
Tasks Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in Fuseki or ElasticSearch so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.
The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in `Fuseki <https://jena.apache.org/documentation/serving_data/>`_ or `ElasticSearch <https://www.elastic.co/>`_ so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks that facilitates the analysis process.
This tasks server is triggered periodically by the cron job scheduler, so that new information is obtained every day. That way, any user can visualize data at any time with the certainty that there will be data stored in the system.
......@@ -45,9 +45,9 @@ As is represented above, pipelines architecture is divided into three main steps
* **Analyze** task is responsible for taking the input JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached in each request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.
* **Store** process consists of storing the previously generated JSON, which contains the analysis result, inside an elasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which is used to categorize information that belongs to the same index. There is a third parameter, the **id** of the entry, but it is generated automatically by default.
* **Store** process consists of storing the previously generated JSON, which contains the analysis result, inside an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process, it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which is used to categorize information that belongs to the same index. There is a third parameter, the **id** of the entry, but it is generated automatically by default.
To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. A suitable choice of elasticSearch parameters could be **twitter** as the elasticSearch *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform) and, lastly, the **datetime** when the task request was triggered as the *id*.
To better understand these concepts, here is a clear example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. A suitable choice of ElasticSearch parameters could be **twitter** as the ElasticSearch *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform) and, lastly, the **datetime** when the task request was triggered as the *id*.
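As an illustration of how these parameters map onto an indexing call, here is a minimal sketch. The host, index, doc type and document fields are assumptions for the example, not the actual GSI Crawler code.

.. code:: python

    # Illustrative sketch: host, index, doc type and document fields are assumptions.
    # 'doc_type' follows the description above; recent ElasticSearch versions deprecate it.
    from datetime import datetime

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:19200"])  # e.g. the ES_ENDPOINT_EXTERNAL value

    doc = {
        "text": "I really enjoyed this concert!",
        "analysis": {"sentiment": "positive"},
    }

    # index = platform, doc type = analysis type, id = request datetime
    es.index(
        index="twitter",
        doc_type="sentiment",
        id=datetime.utcnow().isoformat(),
        body=doc,
    )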
Now that the Luigi orchestrator has been explained, we will conclude this section by detailing how the server behaves when it receives a user request, and which parameters are mandatory to run the operation. The workflow is shown in the diagram below:
......
What is GSI Crawler?
----------------
--------------------
GSI Crawler is an innovative and useful framework which aims to extract information from web pages, enriching it following semantic approaches. At the moment, there are three available platforms: Twitter, Reddit and News. The user interacts with the tool through a web interface, selecting the analysis type they want to carry out and the platform that is going to be examined.
.. image:: images/crawler1.png
:align: center
In this documentation we are going to introduce this framework, detailing the global architecture of the project and explaining the functionality of each module. Finally, we will present a case study in order to better understand the system itself.
.. image:: images/crawler1.png
:align: center
......@@ -10,8 +10,8 @@ Contents:
.. toctree::
gsicrawler
architecture
tutorials
architecture
:maxdepth: 4
.. api
......
......@@ -30,25 +30,24 @@ First of all, you need to clone the repositories:
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/dashboard-gsicrawler.git
$ cd gsicrawler
Then, you need to set up the environment variables. To do so, first create a file named ``.env`` in the root directory of each project (gsicrawler and dashboard-gsicrawler). As you can see, Twitter and MeaningCloud credentials are needed if you wish to use those services.
.. code::
TWITTER_CONSUMER_KEY={YourConsumerKey}
TWITTER_CONSUMER_SECRET={YourConsumerSecret}
TWITTER_ACCESS_TOKEN={YourAccessToken}
TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret}
TWITTER_CONSUMER_KEY={YourConsumerKey, get it on Twitter}
TWITTER_CONSUMER_SECRET={YourConsumerSecret, get it on Twitter}
TWITTER_ACCESS_TOKEN={YourAccessToken, get it on Twitter}
TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret, get it on Twitter}
ES_ENDPOINT=elasticsearch
ES_PORT=9200
ES_ENDPOINT_EXTERNAL=localhost:19200
FUSEKI_PASSWORD={YourFusekiPass}
FUSEKI_ENDPOINT_EXTERNAL=fuseki:3030
FUSEKI_ENDPOINT={YourFusekiEndPoint}
API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey}
API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey, get it on Meaningcloud}
FUSEKI_ENDPOINT_DASHBOARD={YourFusekiEndpoint, e.g. localhost:13030}
FUSEKI_ENDPOINT = localhost
FUSEKI_ENDPOINT = fuseki
FUSEKI_PORT = 3030
......@@ -57,7 +56,10 @@ Finally, in both repositories execute the following line:
.. code:: bash
$ sudo docker-compose up
$ cd gsicrawler
$ sudo docker-compose up
$ cd ../dashboard-gsicrawler
$ sudo docker-compose up
The information related to the initialization can be found in the console. If you wish to see how the tasks are being executed, apart from reading the logs you can access the Luigi task visualizer at ``localhost:8082``. In the next steps you will discover more about Luigi.
......