Commit be5833c9 authored by Rodrigo Barbado Esteban

Tutorial changes

parent ca94e4fc
Architecture
----------------
Overview
~~~~~~~~~~~~~~~~~~~~~
The GSI Crawler environment can be described from a high-level point of view as follows:

* **Data Ingestion**: this is the core function of GSI Crawler, and consists of extracting data according to the requests sent to it. It relies on web crawlers, which are explained in more detail in the Modules section.
* **Semantic Representation**: before being stored, data is enriched following semantic paradigms in order to allow more powerful analysis later.
* **Data Storage**: after data acquisition and enrichment, the storage process is carried out. At the moment, both ElasticSearch and Fuseki are available for fulfilling this task.
Modules
~~~~~~~~~~~~~~~~~~~~~
The following figure describes the architecture from a modular point of view; each of the modules is described below.
.. image:: images/arch.png
:align: center
Tasks Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in Fuseki or ElasticSearch, so that they can be displayed in the client application. The Luigi framework is used as an orchestrator to build the sequence of tasks and facilitate the analysis process.
This tasks server is triggered periodically by the cron process scheduler, so that new information is fetched every day. That way, any user can visualize data at any time with the certainty that there will be data stored in the system.
All the pipelines have the same structure, as represented in the figure below.
.. image:: images/picLuigiNews.png
:scale: 80%
:align: center
As represented above, the pipeline architecture is divided into four main steps, *Fetch*, *Analyze*, *Semantic* and *Store*:

* **Fetch** refers to the process of obtaining tweets, comments or any other content to be analyzed from the provided URL. Most of the time this task involves parsing a webpage, recognizing the valuable information contained inside HTML tags and building a new JSON file with the selected data; this process is commonly known as *scraping* a website. To facilitate this filtering process, there are multiple extensions and libraries that offer a well-formed structure to carry out the task more comfortably. Inside the Tasks Server we use the Scrapy library to speed up the data mining process. Scrapy is an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. It is based on subclasses named *spiders*, which contain the methods required to extract the information. Apart from the Scrapy library, several APIs have also been used for retrieving data. The GSI Crawler application has three available scrapers: one for Twitter, one for Reddit, and a third one that includes spiders for different news sources. To conclude, this task focuses on extracting the valuable data and generates a JSON file that can be analyzed by the following task in the pipeline.
* **Analyze** is responsible for taking the input JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached to each request. Once the task has collected the analysis result, it generates another JSON file containing each original sentence together with its analysis result.
* **Semantic** aims to structure the data into triples so that it can be understood in terms of the different ontologies supported. It takes the original JSON data as input and returns another JSON document with the desired structure.
* **Store** consists of storing the previously generated JSON document, which contains the analysis result, in an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process it is necessary to provide two arguments: the **index**, which represents the Elastic index where the information will be saved, and the **doc type**, which allows categorizing information that belongs to the same index. There is a third parameter, the **id** of the document, but it is generated automatically by default.
To better understand these concepts, here is a simple example of how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis of a certain **tweet**. One set of ElasticSearch parameters that would fit could be **twitter** as the ElasticSearch *index*, **sentiment** as the *doc type* (since an **emotion** analysis of the same platform could also live in the same index), and, lastly, the **datetime** at which the task request was triggered as the *id*.
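
A minimal sketch of this indexing step, sent directly to the ElasticSearch REST API, is shown below. It is an illustration only: the host, port and document body are assumptions, not the exact GSI Crawler code.

.. code-block:: python

    import datetime

    import requests

    # Hypothetical document produced for one analyzed tweet.
    doc = {
        "text": "I love this framework!",
        "sentiment": "marl:Positive",
        "polarity": 0.75,
    }

    # index = "twitter", doc type = "sentiment"; the request datetime is used as id.
    task_id = datetime.datetime.now().isoformat()
    r = requests.put("http://localhost:19200/twitter/sentiment/%s" % task_id, json=doc)

    # POSTing to .../twitter/sentiment/ without an id would let ElasticSearch
    # generate one automatically, as described above.
    print(r.json())
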
Having explained the Luigi orchestrator, we conclude this section by detailing how the server behaves when it receives a user request, and which parameters are mandatory to run the operation. The workflow is shown in the diagram below:
.. image:: images/task-diagram.png
:align: center
Web App - Polymer Web Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The GSI Crawler framework uses a webpage based on Polymer web components to interact with all the functionality offered by the tool. These Polymer web components are independent submodules that can be combined to build the general dashboard interface. In this section we present the components that actively participate in the main application workflow.
This example shows the representation of data obtained from the News scraper.
.. image:: images/news1.png
:align: center
|
The list of news matching the selected filters is shown in the following image. The headline of each news item appears along with the logo of its source and a representation of the emotion analysis of its content, displayed as an emoji.
|
.. image:: images/news2.png
:align: center
|
Additionally, it is possible to use the SPARQL editor to execute semantic queries, which make use of several ontologies in order to enrich the gathered data.
.. image:: images/news3.png
:align: center
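
For reference, the same kind of query issued from the editor can also be sent programmatically to the Fuseki SPARQL endpoint. The sketch below is an illustration only: the endpoint URL, dataset name and property names are assumptions based on the tutorial setup described later.

.. code-block:: python

    import requests

    # Hypothetical local Fuseki query endpoint (dataset "tutorial").
    FUSEKI_QUERY = "http://localhost:13030/tutorial/sparql"

    query = """
    PREFIX schema: <http://schema.org/>
    SELECT ?headline ?date
    WHERE {
      ?article a schema:NewsArticle ;
               schema:headline ?headline ;
               schema:datePublished ?date .
    }
    LIMIT 10
    """

    r = requests.post(FUSEKI_QUERY, data={"query": query},
                      headers={"Accept": "application/sparql-results+json"})
    for binding in r.json()["results"]["bindings"]:
        print(binding["headline"]["value"], binding["date"]["value"])
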
What is GSI Crawler?
--------------------
GSI Crawler [#f1]_ is an innovative and useful framework which aims to extract information from web pages and enrich it following semantic approaches. At the moment, there are three available platforms: Twitter, Reddit and News. The user interacts with the tool through a web interface, selecting the type of analysis to carry out and the platform to be examined.
In this documentation we introduce the framework, detailing the global architecture of the project and explaining the functionality of each module. Finally, we present a case study in order to better understand the system itself.
Install
-------
The GSI Crawler installation is based on Docker containers, so both docker and docker-compose are required.
For docker installation in Ubuntu, visit this `link <https://store.docker.com/editions/community/docker-ce-server-ubuntu?tab=description>`_.
Docker-compose installation detailed instructions are available `here <https://docs.docker.com/compose/install/>`_.
First of all, you need to clone the repository:
.. code:: bash
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git
$ cd gsicrawler
Then you need to set up the environment variables. To do so, first create a file named ``.env`` in the root directory of the project. Once you have created the file, add an ``env_file`` attribute pointing to ``.env`` to the luigi service in ``docker-compose.yml``.
Finally, to run the image:
.. code:: bash
$ sudo docker-compose up
Tutorial
--------
In this section we are going to build a crawler from scratch, using the CNN news crawler as a reference. A prerequisite for building a scraper is understanding how Luigi pipelines work. Essentially, they are a concatenation of parameterized tasks, where each task executes a concrete function and returns an output which is used by the following task. Task dependencies are described declaratively, as "one task requires another". For a better understanding of Luigi, please visit its `documentation <https://luigi.readthedocs.io/en/stable/>`_.
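
As a minimal illustration of this "one task requires another" idea (a sketch with made-up task and file names, not the GSI Crawler pipeline itself):

.. code-block:: python

    import json

    import luigi


    class FetchTask(luigi.Task):
        """Writes some raw items to a local JSON file."""
        id = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget('/tmp/fetch-%s.json' % self.id)

        def run(self):
            with self.output().open('w') as f:
                json.dump([{"text": "example item"}], f)


    class AnalyzeTask(luigi.Task):
        """Consumes the output of FetchTask; Luigi runs FetchTask first."""
        id = luigi.Parameter()

        def requires(self):
            # This is how "AnalyzeTask requires FetchTask" is expressed.
            return FetchTask(id=self.id)

        def output(self):
            return luigi.LocalTarget('/tmp/analyze-%s.json' % self.id)

        def run(self):
            with self.input().open('r') as f:
                items = json.load(f)
            with self.output().open('w') as f:
                json.dump(items, f)
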
As shown in the tasks server diagram, the first task is related to obtaining the data. In this case the scraper uses the CNN news API, so the information to be extracted comes from a JSON document returned by a certain endpoint. In other cases, the information has to be extracted using the `Scrapy <https://docs.scrapy.org/en/latest/>`_ library, which makes it easy to extract data located between HTML tags using CSS selectors.
So, the first step is to create the script in charge of extracting the desired data. Additionally, the data has to be organised according to `schema.org`. To achieve this, in our example we look at the `NewsArticle <http://schema.org/NewsArticle>`_ entity to decide how to structure our data fields. The result for each collected news item contains the following attributes, whose names are specified in the schema reference. The Python dictionary depicted below should be saved as JSON in order to be understood by the following task.
.. code-block:: python
aux = dict()
aux["type"] = "NewsArticle"
aux["@id"] = newsitem["url"]
aux["datePublished"] = newsitem["firstPublishDate"]
aux["dateModified"] = newsitem["lastModifiedDate"]
aux["articleBody"] = newsitem["body"]
aux["about"] = newsitem["topics"]
aux["author"] = newsitem["source"]
aux["headline"] = newsitem["headline"]
aux["search"] = search
aux["thumbnailUrl"] = newsitem["thumbnail"] ##########
news.append(aux)
This script is called from a task like the one shown below. As this is the first task, it does not require any other task.
.. code-block:: python
class ScrapyTask(luigi.Task):
"""
Generates a local file containing 5 elements of data in JSON format.
"""
url = luigi.Parameter()
id = luigi.Parameter()
analysisType = luigi.Parameter()
def run(self):
"""
Writes data in JSON format into the task's output target.
The data objects have the following attributes:
* _id is the default Elasticsearch id field,
* text: the text,
* date: the day when the data was created.
"""
filePath = '/tmp/_scrapy-%s.json' % self.id
retrieveCnnNews(self.url, 10, filePath)
retrieveNytimesNews(self.url, 10, filePath)
def output(self):
"""
Returns the target output for this task.
In this case, a successful execution of this task will create a file on the local filesystem.
:return: the target output for this task.
:rtype: object (:py:class:`luigi.target.Target`)
"""
return luigi.LocalTarget(path='/tmp/_scrapy-%s.json' % self.id)
The second step requires reading the information extracted by the first one, which was stored in a JSON file, and enriching it with a sentiment and emotion analysis. In our case we have used the Senpy service, which returns the sentiment score of each news article according to the Marl ontology and the emotion analysis according to the Onyx ontology. As a result of this process, the returned JSON contains three additional fields: sentiment, polarity (the score associated with the sentiment field) and emotion. The code would look like this (using the Senpy service):
.. code-block:: python

    if 'sentiments' in self.analysisType:
        i["containsSentimentsAnalysis"] = True
        r = requests.get('http://test.senpy.cluster.gsi.dit.upm.es/api/?algo=sentiment-tass&i=%s' % i["text"])
        response = r.content.decode('utf-8')
        response_json = json.loads(response)
        i["sentiments"] = response_json["entries"][0]["sentiments"]
    if 'emotions' in self.analysisType:
        i["containsEmotionsAnalysis"] = True
        r = requests.get('http://test.senpy.cluster.gsi.dit.upm.es/api/?algo=emotion-anew&i=%s' % i["text"])
        response = r.content.decode('utf-8')
        response_json = json.loads(response)
        # Mirrors the sentiments branch; the "emotions" key is assumed by analogy.
        i["emotions"] = response_json["entries"][0]["emotions"]

The following tasks store the JSON generated by the previous ones in the desired way, according to one or more database schemas. After that, the visualization part has to be developed. For creating a dashboard, follow this `documentation <http://sefarad.readthedocs.io/en/latest/dashboards-dev.html>`_.
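
For instance, a storage task in this style could read the enriched JSON produced by the analysis task and index each entry in ElasticSearch. This is only a sketch under assumed names: the required task, the index, the doc type and the port are placeholders.

.. code-block:: python

    import json

    import luigi
    import requests


    class ElasticsearchStoreTask(luigi.Task):
        """Hypothetical storage task: pushes every analyzed item to ElasticSearch."""
        id = luigi.Parameter()

        def requires(self):
            # AnalysisTask stands in for the Senpy analysis task described above.
            return AnalysisTask(id=self.id)

        def run(self):
            with self.input().open('r') as f:
                for line in f:
                    doc = json.loads(line)
                    # Index "tutorial", doc type "news"; ElasticSearch generates the id.
                    requests.post('http://localhost:19200/tutorial/news/', json=doc)
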
.. rubric:: References
.. [#f1] José Emilio Carmona. (2016). Development of a Social Media Crawler for Sentiment Analysis.
.. [#f2] J. Fernando Sánchez-Rada, Carlos A. Iglesias, Ignacio Corcuera-Platas & Oscar Araque (2016). Senpy: A Pragmatic Linked Sentiment Analysis Framework. In Proceedings DSAA 2016 Special Track on Emotion and Sentiment in Intelligent Systems and Big Social Data Analysis (SentISData).
.. [#f3] http://elastic.co.
Welcome to GSI Crawler's documentation!
=======================================

Contents:

.. toctree::
   :maxdepth: 4

   gsicrawler
   architecture
   tutorials

.. api
Getting started
---------------
Tutorial I: Install
~~~~~~~~~~~~~~~~~~~~
The GSI Crawler installation is based on Docker containers, so both docker and docker-compose are required.
For docker installation in Ubuntu, visit this `link <https://store.docker.com/editions/community/docker-ce-server-ubuntu?tab=description>`_.
Docker-compose installation detailed instructions are available `here <https://docs.docker.com/compose/install/>`_.
First of all, you need to clone the repository:
.. code:: bash
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git
$ cd gsicrawler
Then you need to set up the environment variables. To do so, first create a file named ``.env`` in the root directory of the project with the following content:
.. code::
LUIGI_ENDPOINT="gsicrawler-luigi"
LUIGI_ENDPOINT_EXTERNAL="gsicrawler-luigi.cluster.gsi.dit.upm.es"
CRAWLER_ENDPOINT="gsicrawler"
CRAWLER_ENDPOINT_EXTERNAL="gsicrawler.cluster.gsi.dit.upm.es"
TWITTER_CONSUMER_KEY={YourConsumerKey}
TWITTER_CONSUMER_SECRET={YourConsumerSecret}
TWITTER_ACCESS_TOKEN={YourAccessToken}
TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret}
ES_ENDPOINT=elasticsearch
ES_PORT=9200
FUSEKI_PASSWORD=gsi2017fuseki
FUSEKI_ENDPOINT_EXTERNAL=fuseki:3030
FUSEKI_ENDPOINT="gsicrawler-fuseki"
API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey}
FUSEKI_ENDPOINT_DASHBOARD={YourFusekiEndpoint}
Once you have created the file, add an ``env_file`` attribute to the **luigi** service in ``docker-compose.yml``, with ``.env`` as its value:
.. code::
env_file:
- .env
Finally, to run the image:
.. code:: bash
$ sudo docker-compose up
The information related to the initialization can be found in the console, and when the process finishes it is possible to access the demo dashboard by accessing ``localhost:8080`` in your web browser.
|
Tutorial II: Crawling news
~~~~~~~~~~~~~~~~~~~~~~~~~~
This second tutorial shows how to build a crawler that gathers news from CNN by extracting data from the CNN news API; in the general case we could use the `Scrapy <https://docs.scrapy.org/en/latest/>`_ library, which allows extracting data from arbitrary web pages, as sketched below.
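
For that general case, a minimal spider might look like the sketch below. The start URL and CSS selectors are placeholders for illustration, not a working CNN scraper.

.. code-block:: python

    import scrapy


    class NewsSpider(scrapy.Spider):
        """Illustrative spider: extracts headlines and links from a listing page."""
        name = "news"
        start_urls = ["https://example.com/news"]  # placeholder listing page

        def parse(self, response):
            # The selectors are site-specific; inspect the target page to adapt them.
            for article in response.css("article.news-item"):
                yield {
                    "headline": article.css("h2::text").get(),
                    "url": article.css("a::attr(href)").get(),
                }

Such a spider could be run with ``scrapy runspider spider.py -o news.json`` to obtain a JSON file similar to the one produced in this tutorial.
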
In this tutorial we will only obtain the headline and URL of each piece of news appearing on CNN related to one topic, storing those fields in a JSON file.
.. image:: images/cnnsearch.png
:align: center
|
The code of this example can be found in ``luigi/scrapers/tutorial2.py``:
.. code-block:: python
import requests
import json
def retrieveCnnNews(search, num, filepath):
r = requests.get("https://search.api.cnn.io/content?q=" + search + "&size=" + str(num) + "")
response = r.json()["result"]
with open(filepath, 'a') as outfile:
print("CRAWLING RESULT")
for newsitem in response:
aux = dict()
aux["url"] = newsitem["url"]
aux["headline"] = newsitem["headline"]
print(aux)
json.dump(aux, outfile)
outfile.write('\n')
Then we have to write a Luigi task that executes the code above. For more information about Luigi task pipelines, please visit this `documentation <https://luigi.readthedocs.io/en/stable/>`_. This task appears in ``luigi/tutorialtask.py``.
.. code-block:: python
class CrawlerTask(luigi.Task):
"""
Generates a local file containing 5 elements of data in JSON format.
"""
url = luigi.Parameter()
id = luigi.Parameter()
def run(self):
"""
Writes data in JSON format into the task's output target.
"""
filePath = '/tmp/_scrapy-%s.json' % self.id
print(self.url, filePath)
retrieveCnnNews(self.url, 10, filePath)
def output(self):
"""
Returns the target output for this task.
In this case, a successful execution of this task will create a file on the local filesystem.
"""
return luigi.LocalTarget(path='/tmp/_scrapy-%s.json' % self.id)
Finally, to run the tutorial, execute the following line from your repository path.
.. code:: bash
$ docker-compose exec luigi python -m crontasks tutorial2
|
The resulting JSON will appear on the console.
.. code:: json
{"headline": "Iraqi forces say they've recaptured Hawija city center from ISIS", "url": "http://www.cnn.com/2017/10/05/middleeast/iraq-isis-hawija/index.html"}
{"headline": "3 US troops killed in ambush in Niger", "url": "http://www.cnn.com/2017/10/04/politics/us-forces-hostile-fire-niger/index.html"}
Tutorial III: Semantic enrichment and data storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this tutorial we are going to structure our data according to the `NewsArticle <http://schema.org/NewsArticle>`_ entity from schema.org. The scraper code can be found in ``luigi/scrapers/tutorial3.py``.
.. code-block:: python
import requests
import json
def retrieveCnnNews(search, num, filepath):
r = requests.get("https://search.api.cnn.io/content?q=" + search + "&size=" + str(num) + "")
response = r.json()["result"]
with open(filepath, 'a') as outfile:
for newsitem in response:
aux = dict()
aux["@type"] = "schema:NewsArticle"
aux["@id"] = newsitem["url"]
aux["_id"] = newsitem["url"]
aux["schema:datePublished"] = newsitem["firstPublishDate"]
aux["schema:dateModified"] = newsitem["lastModifiedDate"]
aux["schema:articleBody"] = newsitem["body"]
aux["schema:about"] = newsitem["topics"]
aux["schema:author"] = newsitem["source"]
aux["schema:headline"] = newsitem["headline"]
aux["schema:search"] = search
aux["schema:thumbnailUrl"] = newsitem["thumbnail"]
json.dump(aux, outfile)
outfile.write('\n')
The Luigi pipeline is now more complex, as data has to be stored in both ElasticSearch and Fuseki. The code of the pipeline can also be found in ``luigi/scrapers/tutorial3.py``; the task execution workflow is initiated by ``PipelineTask``, which is in charge of triggering its dependent tasks.
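
The general shape of such an entry point is sketched below; the names of the dependent tasks are placeholders, since the actual ones live in ``luigi/scrapers/tutorial3.py``.

.. code-block:: python

    import luigi


    class PipelineTask(luigi.WrapperTask):
        """Entry point: triggers the whole crawl/analyze/store chain."""
        url = luigi.Parameter()
        id = luigi.Parameter()

        def requires(self):
            # Luigi resolves the chain backwards: each storage task requires the
            # semantic task, which in turn requires the crawler task, and so on.
            # The two task names below are illustrative placeholders.
            yield ElasticsearchStorageTask(url=self.url, id=self.id)
            yield FusekiStorageTask(url=self.url, id=self.id)
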
To execute this tutorial, run the following line:
.. code:: bash
$ docker-compose exec luigi python -m crontasks tutorial3
In order to access the data stored in ElasticSearch, open ``localhost:19200/tutorial/_search?pretty`` in your web browser.
.. code:: json
{
"_index" : "tutorial",
"_type" : "news",
"_id" : "http://www.cnn.com/2017/10/04/politics/syria-russia-us-assad-at-tanf/index.html",
"_score" : 1.0,
"_source" : {
"@type" : "schema:NewsArticle",
"@id" : "http://www.cnn.com/2017/10/04/politics/syria-russia-us-assad-at-tanf/index.html",
"schema:datePublished" : "2017-10-04T18:05:30Z",
"schema:dateModified" : "2017-10-04T18:05:29Z",
"schema:articleBody" : "Forces aligned with Syrian President Bashar al-Assad made an incursion Wednesday into the 55km \"de-confliction zone..." ",
"schema:about" : [
"Syria conflict",
"Armed forces",
"ISIS",
"Military operations"
],
"schema:author" : "cnn",
"schema:headline" : "Syrian regime forces enter buffer zone surrounding US base",
"schema:search" : "\"isis\"",
"schema:thumbnailUrl" : "http://i2.cdn.turner.com/cnnnext/dam/assets/170616041647-baghdadi-file-story-body.jpg"
  }
}
To see the data in Fuseki, the address would be ``localhost:13030/tutorial/data``.
.. code:: turtle
<http://www.cnn.com/2017/10/02/politics/las-vegas-domestic-terrorism/index.html>
a schema:NewsArticle ;
<http://latest.senpy.cluster.gsi.dit.upm.es/ns/_id>
"http://www.cnn.com/2017/10/02/politics/las-vegas-domestic-terrorism/index.html" ;
schema:about "Shootings" , "Mass murder" , "Las Vegas" , "2017 Las Vegas concert shooting" ;
schema:articleBody "President Donald Trump on Tuesday did not say ...\"" ;
schema:author "cnn" ;
schema:dateModified "2017-10-03T14:13:36Z" ;
schema:datePublished "2017-10-02T21:26:26Z" ;
schema:headline "Trump mum on whether Las Vegas shooting was domestic terrorism" ;
schema:search "\"isis\"" ;
schema:thumbnailUrl "http://i2.cdn.turner.com/cnnnext/dam/assets/171002123455-31-las-vegas-incident-1002-story-body.jpg" .
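
The same graph can also be retrieved programmatically, for example as follows (assuming the local setup above and that the endpoint serves Turtle when asked for it):

.. code-block:: python

    import requests

    # Same address as the browser example above, with an explicit Accept header.
    r = requests.get("http://localhost:13030/tutorial/data",
                     headers={"Accept": "text/turtle"})
    print(r.text)
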
For developing visual analysis tools, we suggest building a dashboard following this `documentation <http://sefarad.readthedocs.io/en/latest/dashboards-dev.html>`_.