Commit 24475457 authored by Rodrigo Barbado Esteban

Removed rst files

parent be5833c9
docs/_build/html/_images/arch.png (image replaced: 44.9 KB → 43.4 KB)

docs/_build/html/_images/picLuigi.png (image replaced: 43.6 KB → 41.1 KB)
......@@ -33,18 +33,16 @@ This tasks server is activated periodically by an administrator of processes cal
All the pipelines have the same structure, as represented in the figure below.
.. image:: images/picLuigiNews.png
.. image:: images/picLuigi.png
:scale: 80%
:align: center
As represented above, the pipeline architecture is divided into four main steps, *Fetch*, *Analyze*, *Semantic* and *Save*:
As represented above, the pipeline architecture is divided into three main steps, *Fetch*, *Analyze* and *Store*:
* **Fetch** refers to the process of obtaining tweets, comments or any other content to be analyzed from the provided URL. Most of the time this task involves parsing a webpage, recognizing the valuable information contained inside HTML tags and building a new JSON file with the selected data. This process is commonly known as *scraping* a website. To facilitate this filtering process, there are multiple extensions and libraries that offer a well-formed structure for carrying out this task in a more comfortable way. Inside the Tasks Server we use the Scrapy library to speed up the data mining process. Scrapy is an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. It is based on subclasses named *spiders*, which contain the required methods to extract the information. Apart from the Scrapy library, several APIs have also been used for retrieving data. The GSI Crawler application has three available scrapers: one for Twitter, one for Reddit, and another one which includes spiders for different news sources. To conclude, this task focuses on extracting the valuable data and generates a JSON file which can be analyzed by the following task in the pipeline.
* **Analyze** is responsible for taking the JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached to the request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.
* **Semantic** aims to structure the data as triples so that it can be understood in terms of the different supported ontologies. It takes as input the original JSON data and returns another JSON file with the desired structure.
* **Store** consists of storing the previously generated JSON file, which contains the analysis result, in an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which allows categorizing information that belongs to the same index. There is a third parameter, the **id** of the query, but it is generated automatically by default.
To better understand these concepts, here is a simple example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. One suitable choice of ElasticSearch parameters would be **twitter** as the *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform), and the **datetime** when the task request was triggered as the *id*.
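The sketch below illustrates how these steps could be chained together with Luigi. It is a minimal example rather than the actual GSI Crawler code: the Senpy endpoint and parameters, the file names and the ElasticSearch location are assumptions that would have to be adapted to a real deployment.

.. sourcecode:: python

    import json

    import luigi
    import requests
    from elasticsearch import Elasticsearch


    class FetchTask(luigi.Task):
        """Scrape the source and dump the selected fields to a JSON file."""
        url = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget('fetched.json')

        def run(self):
            # Placeholder for the real scraping step (Scrapy spiders or platform APIs).
            docs = [{'text': 'Example headline fetched from %s' % self.url}]
            with self.output().open('w') as f:
                json.dump(docs, f)


    class AnalyzeTask(luigi.Task):
        """Send each text to a Senpy endpoint and attach the returned analysis."""
        url = luigi.Parameter()

        def requires(self):
            return FetchTask(url=self.url)

        def output(self):
            return luigi.LocalTarget('analyzed.json')

        def run(self):
            with self.input().open('r') as f:
                docs = json.load(f)
            for doc in docs:
                # Assumed Senpy location and parameters; adjust to your instance.
                r = requests.get('http://localhost:5000/api/',
                                 params={'algo': 'sentiment140', 'i': doc['text']})
                doc['analysis'] = r.json()
            with self.output().open('w') as f:
                json.dump(docs, f)


    class StoreTask(luigi.Task):
        """Index the analyzed documents in ElasticSearch."""
        url = luigi.Parameter()

        def requires(self):
            return AnalyzeTask(url=self.url)

        def output(self):
            return luigi.LocalTarget('stored.marker')

        def run(self):
            es = Elasticsearch()  # assumed local instance on localhost:9200
            with self.input().open('r') as f:
                for i, doc in enumerate(json.load(f)):
                    # doc_type/body match the ES 5.x-era Python client
                    es.index(index='twitter', doc_type='sentiment', id=i, body=doc)
            with self.output().open('w') as f:
                f.write('done')

Assuming the sketch is saved as ``pipeline.py``, such a chain could then be launched with ``python -m luigi --module pipeline StoreTask --url <some-url> --local-scheduler``, mirroring the cron-triggered executions described above.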
......@@ -56,25 +54,4 @@ Once the Luigi orchestator has been explained, we will conclude this section det
Web App - Polymer Web Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface. In this section we present the components which actively participate in the main application workflow.
This example shows the representation of data obtained from the News scraper.
.. image:: images/news1.png
:align: center
|
The list of news items matching the selected filters is shown in the following image. The headline of each news item appears along with the logo of its source and an emoji representing the emotion analysis of its content.
|
.. image:: images/news2.png
:align: center
|
Additionally, it is possible to use the SPARQL editor to execute semantic queries, which make use of several ontologies to enrich the gathered data; a sketch of such a query is shown after the figure below.
.. image:: images/news3.png
:align: center
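As an illustration, a query of this kind could also be issued directly against a Fuseki endpoint over HTTP. The sketch below is only an example: the endpoint URL, the dataset name and the vocabulary used are assumptions and should be adjusted to the actual deployment.

.. sourcecode:: python

    import requests

    # Hypothetical Fuseki dataset exposing the enriched crawler data.
    FUSEKI_QUERY_URL = 'http://localhost:3030/gsicrawler/query'

    QUERY = """
    PREFIX schema: <http://schema.org/>
    SELECT ?headline WHERE {
      ?article a schema:NewsArticle ;
               schema:headline ?headline .
    } LIMIT 10
    """

    r = requests.get(FUSEKI_QUERY_URL,
                     params={'query': QUERY},
                     headers={'Accept': 'application/sparql-results+json'})
    for binding in r.json()['results']['bindings']:
        print(binding['headline']['value'])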
GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface.
Getting started
---------------
First glance into GSI Crawler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The quickest way of exploring the possibilities offered by GSI Crawler is to access this `demo <https://docs.docker.com/compose/install/>`_. There we can find a dashboard to visualize data collected from different news sources and Twitter. Some examples of the added value offered by this tool are topic and sentiment extraction, identification of people appearing in the scraped data, and geolocation of sources.
Tutorial I: Install
......
......@@ -60,12 +60,11 @@
<p>The tasks server is responsible for managing the incoming workflow and setting up a valid pipeline to obtain, analyze, organize and save the results in Fuseki or ElasticSearch to be displayed in the client application. The Luigi framework is used as an orchestrator to build a sequence of tasks in order to facilitate the analysis process.</p>
<p>This tasks server is activated periodically by an administrator of processes called cron, whose aim is to obtain more information every day. That way, any user can visualize data at any time with the certainty that there will be stored data in the system.</p>
<p>All the pipelines have the same structure, as represented in the figure below.</p>
<a class="reference internal image-reference" href="_images/picLuigiNews.png"><img alt="_images/picLuigiNews.png" class="align-center" src="_images/picLuigiNews.png" style="width: 768.0px; height: 432.0px;" /></a>
<p>As represented above, the pipeline architecture is divided into four main steps, <em>Fetch</em>, <em>Analyze</em>, <em>Semantic</em> and <em>Save</em>:</p>
<a class="reference internal image-reference" href="_images/picLuigi.png"><img alt="_images/picLuigi.png" class="align-center" src="_images/picLuigi.png" style="width: 768.0px; height: 432.0px;" /></a>
<p>As represented above, the pipeline architecture is divided into three main steps, <em>Fetch</em>, <em>Analyze</em> and <em>Store</em>:</p>
<ul class="simple">
<li><strong>Fetch</strong> refers to the process of obtaining tweets, comments or any other content to be analyzed from the provided URL. Most of the time this task involves parsing a webpage, recognizing the valuable information contained inside HTML tags and building a new JSON file with the selected data. This process is commonly known as <em>scraping</em> a website. To facilitate this filtering process, there are multiple extensions and libraries that offer a well-formed structure for carrying out this task in a more comfortable way. Inside the Tasks Server we use the Scrapy library to speed up the data mining process. Scrapy is an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. It is based on subclasses named <em>spiders</em>, which contain the required methods to extract the information. Apart from the Scrapy library, several APIs have also been used for retrieving data. The GSI Crawler application has three available scrapers: one for Twitter, one for Reddit, and another one which includes spiders for different news sources. To conclude, this task focuses on extracting the valuable data and generates a JSON file which can be analyzed by the following task in the pipeline.</li>
<li><strong>Analyze</strong> is responsible for taking the JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached to the request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.</li>
<li><strong>Semantic</strong> aims to structure the data as triples so that it can be understood in terms of the different supported ontologies. It takes as input the original JSON data and returns another JSON file with the desired structure.</li>
<li><strong>Store</strong> consists of storing the previously generated JSON file, which contains the analysis result, in an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process it is necessary to provide two arguments: the <strong>index</strong>, which represents the elastic index where the information will be saved, and the <strong>doc type</strong>, which allows categorizing information that belongs to the same index. There is a third parameter, the <strong>id</strong> of the query, but it is generated automatically by default.</li>
</ul>
<p>To better understand these concepts, here is a simple example that shows how the pipeline processes work internally. Imagine that the user requests a <strong>sentiment</strong> analysis for a certain <strong>Tweet</strong>. One suitable choice of ElasticSearch parameters would be <strong>twitter</strong> as the <em>index</em>, <strong>sentiment</strong> as the <em>doc type</em> (since an emotion analysis could also exist within the same platform), and the <strong>datetime</strong> when the task request was triggered as the <em>id</em>.</p>
......@@ -74,22 +73,7 @@
</div>
<div class="section" id="web-app-polymer-web-components">
<h3>Web App - Polymer Web Components<a class="headerlink" href="#web-app-polymer-web-components" title="Permalink to this headline"></a></h3>
<p>GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface. In this section we present the components which actively participate in the main application workflow.</p>
<p>This example shows the representation of data obtained from the News scraper.</p>
<img alt="_images/news1.png" class="align-center" src="_images/news1.png" />
<div class="line-block">
<div class="line"><br /></div>
</div>
<p>The list of news items matching the selected filters is shown in the following image. The headline of each news item appears along with the logo of its source and an emoji representing the emotion analysis of its content.</p>
<div class="line-block">
<div class="line"><br /></div>
</div>
<img alt="_images/news2.png" class="align-center" src="_images/news2.png" />
<div class="line-block">
<div class="line"><br /></div>
</div>
<p>Additionally, it is possible to use the SPARQL editor to execute semantic queries, which make use of several ontologies to enrich the gathered data.</p>
<img alt="_images/news3.png" class="align-center" src="_images/news3.png" />
<p>GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface.</p>
</div>
</div>
</div>
......@@ -126,7 +110,11 @@
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Architecture</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#overview">Overview</a></li>
<li class="toctree-l2"><a class="reference internal" href="#modules">Modules</a></li>
<li class="toctree-l2"><a class="reference internal" href="#modules">Modules</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tasks-server">Tasks Server</a></li>
<li class="toctree-l3"><a class="reference internal" href="#web-app-polymer-web-components">Web App - Polymer Web Components</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
......
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Dashboards &#8212; GSI Crawler 1.0 documentation</title>
<link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: '1.0',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true,
SOURCELINK_SUFFIX: '.txt'
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
</head>
<body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="dashboards">
<h1>Dashboards<a class="headerlink" href="#dashboards" title="Permalink to this headline"></a></h1>
<div class="section" id="available-dashboards">
<h2>Available dashboards<a class="headerlink" href="#available-dashboards" title="Permalink to this headline"></a></h2>
<div class="section" id="sparql-dbpedia">
<h3>SPARQL DBpedia <a class="footnote-reference" href="#f1" id="id1">[1]</a><a class="headerlink" href="#sparql-dbpedia" title="Permalink to this headline"></a></h3>
<p>DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data.</p>
<p>This dashboard provides a graphic interface to ask SPARQL queries against DBpedia.</p>
<a class="reference internal image-reference" href="images/dbpedia.png"><img alt="images/dbpedia.png" class="align-center" src="images/dbpedia.png" style="height: 400.0px;" /></a>
</div>
<div class="section" id="tourpedia">
<h3>Tourpedia <a class="footnote-reference" href="#f1" id="id2">[1]</a><a class="headerlink" href="#tourpedia" title="Permalink to this headline"></a></h3>
<p>TourPedia is the result of a European project. It is a demo of OpenNER (Open Polarity Enhanced Name Entity Recognition). It contains information about accommodations, restaurants, points of interest and attractions of different places in Europe.</p>
<p>TourPedia provides two main datasets: Places and Reviews. Each place contains useful information such as the name, the address and its URI to Facebook, Foursquare, GooglePlaces and Booking. Reviews contain also some useful details ready for us to exploit.</p>
<p>This dashboard also allows you to ask SPARQL queries against our TourPedia database.</p>
<a class="reference internal image-reference" href="images/tourpedia.png"><img alt="images/tourpedia.png" class="align-center" src="images/tourpedia.png" style="height: 400.0px;" /></a>
</div>
<div class="section" id="financial-twitter-tracker">
<h3>Financial Twitter Tracker <a class="footnote-reference" href="#f1" id="id3">[1]</a><a class="headerlink" href="#financial-twitter-tracker" title="Permalink to this headline"></a></h3>
<p>Financial Twitter Tracker is an R&amp;D project of GSI Group that contains information about people talking about brands in social media like Twitter, Facebook, and more…</p>
<p>This dashboard provides interactive Web Components to visualize people’s opinion polarities and also has a SPARQL editor to ask queries about these opinions using RDF specifications.</p>
<a class="reference internal image-reference" href="images/ftt.png"><img alt="images/ftt.png" class="align-center" src="images/ftt.png" style="height: 400.0px;" /></a>
</div>
<div class="section" id="footballmood">
<h3>Footballmood <a class="footnote-reference" href="#f2" id="id4">[2]</a><a class="headerlink" href="#footballmood" title="Permalink to this headline"></a></h3>
<p>Footballmood is an application developed for sentiment analysis of football in Twitter. This dashboard provides interactive Web Components to visualize people’s opinion polarities and also has a SPARQL editor to ask queries about football players against DBpedia.</p>
<a class="reference internal image-reference" href="images/footballmood.png"><img alt="images/footballmood.png" class="align-center" src="images/footballmood.png" style="height: 400.0px;" /></a>
</div>
<div class="section" id="aspects">
<h3>Aspects <a class="footnote-reference" href="#f3" id="id5">[3]</a><a class="headerlink" href="#aspects" title="Permalink to this headline"></a></h3>
<p>The Aspects dashboard is an analyzer developed for aspect-based sentiment analysis of restaurant reviews. The analysis results are shown on a dashboard based on web components and D3.js, where widgets are used to visualize the data.</p>
<p>The data used for the dashboard is the SemEval 2015 ABSA dataset (Task 12) for the restaurant domain, available <a class="reference external" href="http://alt.qcri.org/semeval2015/task12/">here</a>.</p>
<a class="reference internal image-reference" href="images/aspects.png"><img alt="images/aspects.png" class="align-center" src="images/aspects.png" style="height: 400.0px;" /></a>
</div>
<div class="section" id="gsi-crawler">
<h3>GSI Crawler <a class="footnote-reference" href="#f4" id="id6">[4]</a><a class="headerlink" href="#gsi-crawler" title="Permalink to this headline"></a></h3>
<p>This dashboard is useful for analyzing comments from external applications like Amazon and Foursquare. The user chooses the type of analysis to carry out (Emotions, Sentiments or Fake Analysis) and supplies, for instance, a direct URL to an Amazon product.</p>
<p>GSI Crawler will download the comments belonging to this element and, later, the pertinent analysis will be run using the Senpy tool <a class="footnote-reference" href="#f5" id="id7">[5]</a>. Once the analysis is finished, a summary of the result will be shown and the possibility of reviewing each comment one by one will also be offered.</p>
<a class="reference internal image-reference" href="_images/gsicrawler.png"><img alt="_images/gsicrawler.png" class="align-center" src="_images/gsicrawler.png" style="width: 1131.0px; height: 400.0px;" /></a>
</div>
</div>
<div class="section" id="developing-your-own-dashboard">
<h2>Developing your own dashboard<a class="headerlink" href="#developing-your-own-dashboard" title="Permalink to this headline"></a></h2>
<p>In this section we will explain how to create new dashboards in Sefarad, or import existing ones. First of all you must create a new directory inside <strong>elements</strong> (e.g. <code class="docutils literal"><span class="pre">elements/my-dashboard</span></code>), and move all your dashboard files inside (e.g. my-dashboard.html).</p>
<p>In addition, you have to define your dashboard structure as follows in <strong>my-dashboard.html</strong> file.</p>
<div class="highlight-html"><div class="highlight"><pre><span></span><span class="p">&lt;</span><span class="nt">dom-module</span> <span class="na">id</span><span class="o">=</span><span class="s">&quot;my-dashboard&quot;</span><span class="p">&gt;</span>
<span class="p">&lt;</span><span class="nt">template</span><span class="p">&gt;</span>
<span class="c">&lt;!-- dashboard content --&gt;</span>
<span class="p">&lt;/</span><span class="nt">template</span><span class="p">&gt;</span>
<span class="p">&lt;/</span><span class="nt">dom-module</span><span class="p">&gt;</span>
</pre></div>
</div>
<p>Inside <code class="docutils literal"><span class="pre">&lt;dom-module&gt;</span></code> tag you have to define your new Polymer dashboard adding some JavaScript:</p>
<div class="highlight-javascript"><div class="highlight"><pre><span></span><span class="nx">Polymer</span><span class="p">({</span>
<span class="nx">is</span><span class="o">:</span> <span class="s1">&#39;my-dashboard&#39;</span><span class="p">,</span>
<span class="nx">properties</span><span class="o">:</span> <span class="p">{</span>
<span class="c1">// dashboard properties</span>
<span class="p">},</span>
<span class="nx">ready</span><span class="o">:</span> <span class="kd">function</span><span class="p">(){</span>
<span class="nx">do_some_function</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">});</span>
</pre></div>
</div>
<p>It is also necessary to specify dependencies for this dashboard using a bower.json file. The structure of this file is like the following example:</p>
<div class="highlight-json"><div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;my-dashboard&quot;</span><span class="p">,</span>
<span class="nt">&quot;homepage&quot;</span><span class="p">:</span> <span class="s2">&quot;https://lab.cluster.gsi.dit.upm.es/sefarad/your-dashboard-url&quot;</span><span class="p">,</span>
<span class="nt">&quot;authors&quot;</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">&quot;GSI-UPM&quot;</span>
<span class="p">],</span>
<span class="nt">&quot;description&quot;</span><span class="p">:</span> <span class="s2">&quot;&quot;</span><span class="p">,</span>
<span class="nt">&quot;main&quot;</span><span class="p">:</span> <span class="s2">&quot;&quot;</span><span class="p">,</span>
<span class="nt">&quot;license&quot;</span><span class="p">:</span> <span class="s2">&quot;MIT&quot;</span><span class="p">,</span>
<span class="nt">&quot;dependencies&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;paper-card&quot;</span><span class="p">:</span> <span class="s2">&quot;PolymerElements/paper-card#^1.1.4&quot;</span><span class="p">,</span>
<span class="nt">&quot;polymer&quot;</span><span class="p">:</span> <span class="s2">&quot;polymer#*&quot;</span><span class="p">,</span>
<span class="nt">&quot;google-chart-elasticsearch&quot;</span><span class="p">:</span> <span class="s2">&quot;google-chart-elasticsearch#*&quot;</span>
<span class="p">}</span>
</pre></div>
</div>
<p>If you want to make your dashboard installable via bower you can register this package. This requires having a git repository with all your dashboard code.</p>
<div class="highlight-bash"><div class="highlight"><pre><span></span>$ bower register &lt;my-package-name&gt; &lt;git-endpoint&gt;
</pre></div>
</div>
<p>Afterwards, you have to create a new file in the <code class="docutils literal"><span class="pre">dashboards</span></code> folder. In this example, it is called <strong>newdashboard.html</strong>. This file must have the same structure as other files in this folder, but you need to change the following lines to display your new dashboard.</p>
<div class="highlight-html"><div class="highlight"><pre><span></span>...
<span class="p">&lt;</span><span class="nt">iron-pages</span> <span class="na">attr-for-selected</span><span class="o">=</span><span class="s">&quot;data-route&quot;</span> <span class="na">selected</span><span class="o">=</span><span class="s">&quot;my_dashboard_route&quot;</span><span class="p">&gt;</span>
<span class="p">&lt;</span><span class="nt">section</span> <span class="na">data-route</span><span class="o">=</span><span class="s">&quot;my_dashboard_route&quot;</span><span class="p">&gt;</span>
<span class="p">&lt;</span><span class="nt">my-dashboard</span><span class="p">&gt;&lt;/</span><span class="nt">my-dashboard</span><span class="p">&gt;</span>
<span class="p">&lt;/</span><span class="nt">section</span><span class="p">&gt;</span>
<span class="p">&lt;/</span><span class="nt">iron-pages</span><span class="p">&gt;</span>
...
</pre></div>
</div>
<p>Finally, complete the <code class="docutils literal"><span class="pre">app.wsgi</span></code> and <code class="docutils literal"><span class="pre">elements.html</span></code> files located inside elements directory.</p>
<p><strong>app.wsgi</strong></p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="o">...</span>
<span class="nd">@route</span><span class="p">(</span><span class="s1">&#39;/mydashboard&#39;</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">mydashboard</span><span class="p">():</span>
<span class="k">return</span> <span class="n">static_file</span><span class="p">(</span><span class="s1">&#39;/dashboards/newdashboard.html&#39;</span><span class="p">,</span> <span class="n">root</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span>
<span class="o">...</span>
</pre></div>
</div>
<p><strong>elements.html</strong></p>
<div class="highlight-html"><div class="highlight"><pre><span></span><span class="p">&lt;</span><span class="nt">link</span> <span class="na">rel</span><span class="o">=</span><span class="s">&quot;import&quot;</span> <span class="na">href</span><span class="o">=</span><span class="s">&quot;../bower_components/my_component/my_component.html&quot;</span><span class="p">&gt;</span>
<span class="p">&lt;</span><span class="nt">link</span> <span class="na">rel</span><span class="o">=</span><span class="s">&quot;import&quot;</span> <span class="na">href</span><span class="o">=</span><span class="s">&quot;my-dashboard/index.html&quot;</span><span class="p">&gt;</span>
</pre></div>
</div>
<p>Remember to add your Polymer Web Components to <code class="docutils literal"><span class="pre">bower_components</span></code> directory if not included yet. Edit css if necessary.</p>
<p>After following these steps, build up the Sefarad environment and you should see your dashboard successfully.</p>
<p class="rubric">References</p>
<table class="docutils footnote" frame="void" id="f1" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[1]</td><td><em>(<a class="fn-backref" href="#id1">1</a>, <a class="fn-backref" href="#id2">2</a>, <a class="fn-backref" href="#id3">3</a>)</em> Enrique Conde Sánchez. (2016). Development of a Social Media Monitoring System based on Elasticsearch and Web Components Technologies.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="f2" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[2]</a></td><td>Alberto Pascual Saavedra. (2016). Development of a Dashboard for Sentiment Analysis of Football in Twitter based on Web Components and D3.js.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="f3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[3]</a></td><td>Manuel García-Amado. (2016). Development of an Aspect-based Sentiment Analyzer for the Social Web and Application to Product Reviews.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="f4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id6">[4]</a></td><td>José Emilio Carmona. (2016). Development of a Social Media Crawler for Sentiment Analysis.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="f5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id7">[5]</a></td><td><ol class="first last upperalpha simple" start="10">
<li>Fernando Sánchez-Rada, Carlos A. Iglesias, Ignacio Corcuera-Platas &amp; Oscar Araque (2016). Senpy: A Pragmatic Linked Sentiment Analysis Framework. In Proceedings DSAA 2016 Special Track on Emotion and Sentiment in Intelligent Systems and Big Social Data Analysis (SentISData).</li>
</ol>
</td></tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<p class="logo">
<a href="index.html">
<img class="logo" src="_static/logo-gsi-crawler.png" alt="Logo"/>
</a>
</p>
<p>
<iframe src="https://ghbtns.com/github-btn.html?user=gsi-upm&repo=gsicrawler&type=watch&count=true&size=large&v=2"
allowtransparency="true" frameborder="0" scrolling="0" width="200px" height="35px"></iframe>
</p>
<h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
</ul>
<div id="searchbox" style="display: none" role="search">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<div><input type="text" name="q" /></div>
<div><input type="submit" value="Go" /></div>
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
&copy;2017, Antonio F. Llamas.
|
Powered by <a href="http://sphinx-doc.org/">Sphinx 1.6.3</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.10</a>
|
<a href="_sources/dashboards.rst.txt"
rel="nofollow">Page source</a>
</div>
<a href="https://github.com/gsi-upm/gsicrawler" class="github">
<img style="position: absolute; top: 0; right: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png" alt="Fork me on GitHub" class="github"/>
</a>
</body>
</html>
\ No newline at end of file
......@@ -55,6 +55,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#first-glance-into-gsi-crawler">First glance into GSI Crawler</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-i-install">Tutorial I: Install</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-ii-crawling-news">Tutorial II: Crawling news</a></li>
<li class="toctree-l2"><a class="reference internal" href="tutorials.html#tutorial-iii-semantic-enrichment-and-data-storage">Tutorial III: Semantic enrichment and data storage</a></li>
......
......@@ -2,5 +2,4 @@
# Project: GSI Crawler
# Version:
# The remainder of this file is compressed using zlib.
......@@ -181,8 +181,8 @@ For using this cron pipeline is necessary to change docker-compose.yml file addi
<h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html#architecture">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html#install">Install</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="tutorials.html">Getting started</a></li>
</ul>
......
......@@ -41,6 +41,10 @@
<div class="section" id="getting-started">
<h1>Getting started<a class="headerlink" href="#getting-started" title="Permalink to this headline"></a></h1>
<div class="section" id="first-glance-into-gsi-crawler">
<h2>First glance into GSI Crawler<a class="headerlink" href="#first-glance-into-gsi-crawler" title="Permalink to this headline"></a></h2>
<p>The quickest way of exploring the possibilities offered by GSI Crawler is to access this <a class="reference external" href="https://docs.docker.com/compose/install/">demo</a>. There we can find a dashboard to visualize data collected from different news sources and Twitter. Some examples of the added value offered by this tool are topic and sentiment extraction, identification of people appearing in the scraped data, and geolocation of sources.</p>
</div>
<div class="section" id="tutorial-i-install">
<h2>Tutorial I: Install<a class="headerlink" href="#tutorial-i-install" title="Permalink to this headline"></a></h2>
<p>GSI Crawler installation is based on docker containers, so it is required to have both docker and docker-compose installed.</p>
......@@ -257,6 +261,7 @@ $ cd gsicrawler
<li class="toctree-l1"><a class="reference internal" href="gsicrawler.html">What is GSI Crawler?</a></li>
<li class="toctree-l1"><a class="reference internal" href="architecture.html">Architecture</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Getting started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#first-glance-into-gsi-crawler">First glance into GSI Crawler</a></li>
<li class="toctree-l2"><a class="reference internal" href="#tutorial-i-install">Tutorial I: Install</a></li>
<li class="toctree-l2"><a class="reference internal" href="#tutorial-ii-crawling-news">Tutorial II: Crawling news</a></li>
<li class="toctree-l2"><a class="reference internal" href="#tutorial-iii-semantic-enrichment-and-data-storage">Tutorial III: Semantic enrichment and data storage</a></li>
......
......@@ -33,18 +33,16 @@ This tasks server is activated periodically by an administrator of processes cal
All the pipelines have the same structure, as represented in the figure below.
.. image:: images/picLuigiNews.png
.. image:: images/picLuigi.png
:scale: 80%
:align: center
As represented above, the pipeline architecture is divided into four main steps, *Fetch*, *Analyze*, *Semantic* and *Save*:
As represented above, the pipeline architecture is divided into three main steps, *Fetch*, *Analyze* and *Store*:
* **Fetch** refers to the process of obtaining tweets, comments or any other content to be analyzed from the provided URL. Most of the time this task involves parsing a webpage, recognizing the valuable information contained inside HTML tags and building a new JSON file with the selected data. This process is commonly known as *scraping* a website. To facilitate this filtering process, there are multiple extensions and libraries that offer a well-formed structure for carrying out this task in a more comfortable way. Inside the Tasks Server we use the Scrapy library to speed up the data mining process. Scrapy is an open source and collaborative framework for extracting data from websites in a fast, simple, yet extensible way. It is based on subclasses named *spiders*, which contain the required methods to extract the information. Apart from the Scrapy library, several APIs have also been used for retrieving data. The GSI Crawler application has three available scrapers: one for Twitter, one for Reddit, and another one which includes spiders for different news sources. To conclude, this task focuses on extracting the valuable data and generates a JSON file which can be analyzed by the following task in the pipeline.
* **Analyze** is responsible for taking the JSON file generated by the previous task, parsing it and analyzing each text string using a remote Senpy server. The Senpy service is based on HTTP calls, returning an analysis result for the text attached to the request. Once the task has collected the analysis result, it generates another JSON file containing the original sentence and its analysis result.
* **Semantic** aims to structure the data as triples so that it can be understood in terms of the different supported ontologies. It takes as input the original JSON data and returns another JSON file with the desired structure.
* **Store** consists of storing the previously generated JSON file, which contains the analysis result, in an ElasticSearch instance or in Fuseki. ElasticSearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores the data so it is possible to discover the expected and uncover the unexpected. To carry out the saving process it is necessary to provide two arguments: the **index**, which represents the elastic index where the information will be saved, and the **doc type**, which allows categorizing information that belongs to the same index. There is a third parameter, the **id** of the query, but it is generated automatically by default.
To better understand these concepts, here is a simple example that shows how the pipeline processes work internally. Imagine that the user requests a **sentiment** analysis for a certain **Tweet**. One suitable choice of ElasticSearch parameters would be **twitter** as the *index*, **sentiment** as the *doc type* (since an emotion analysis could also exist within the same platform), and the **datetime** when the task request was triggered as the *id*.
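As a concrete illustration of the *Fetch* step, the sketch below outlines what a minimal Scrapy spider for a news source could look like. The start URL and CSS selectors are hypothetical and depend entirely on the page being scraped.

.. sourcecode:: python

    import scrapy


    class HeadlineSpider(scrapy.Spider):
        """Minimal spider that extracts headlines and links as JSON-serializable items."""
        name = 'headlines'
        start_urls = ['https://example.com/news']  # hypothetical news source

        def parse(self, response):
            for article in response.css('article'):
                yield {
                    'headline': article.css('h2::text').extract_first(),
                    'link': article.css('a::attr(href)').extract_first(),
                }

Running it with ``scrapy runspider headline_spider.py -o fetched.json`` (file names assumed) produces a JSON file that plays the role of the *Fetch* output consumed by the *Analyze* task.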
......@@ -56,25 +54,4 @@ Once the Luigi orchestator has been explained, we will conclude this section det
Web App - Polymer Web Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface. In this section we present the components which actively participate in the main application workflow.
This example shows the representation of data obtained from the News scraper.
.. image:: images/news1.png
:align: center
|
The list of news items matching the selected filters is shown in the following image. The headline of each news item appears along with the logo of its source and an emoji representing the emotion analysis of its content.
|
.. image:: images/news2.png
:align: center
|
Additionally, it is possible to use the SPARQL editor to execute semantic queries, which make use of several ontologies to enrich the gathered data.
.. image:: images/news3.png
:align: center
GSI Crawler uses a webpage based on Polymer web components to interact with all the functionalities offered by the tool. These Polymer Web Components are simply independent submodules that can be grouped together to build the general dashboard interface.
Developing your own dashboard
-----------------------------
In this section we will explain how to create new dashboards in Sefarad, or import existing ones. First of all you can clone our dashboard development example from GitLab. Your dashboard should have the same files as this example.
.. sourcecode:: bash
$ git clone https://lab.cluster.gsi.dit.upm.es/sefarad/dashboard-tourpedia.git
$ cd dashboard-tourpedia
In addition, you have to define your dashboard structure as follows in the **my-dashboard.html** file, which is the main file of the development. In our example this file is called `dashboard-tourpedia.html`.
.. sourcecode:: html
<dom-module id="my-dashboard">
<template>
<!-- dashboard content -->
</template>
</dom-module>
Inside ``<dom-module>`` tag you have to define your new Polymer dashboard adding some JavaScript:
.. sourcecode:: javascript
Polymer({
is: 'my-dashboard',
properties: {
// dashboard properties
},
ready: function(){
do_some_function();
}
});
It is also necessary to specify dependencies (i.e. your widgets) for this dashboard using a bower.json file. The structure of this file is like the following example:
.. sourcecode:: json
{
"name": "my-dashboard",
"homepage": "https://lab.cluster.gsi.dit.upm.es/sefarad/your-dashboard-url",
"authors": [
"GSI-UPM"
],
"description": "",
"main": "",
"license": "MIT",
"dependencies": {
"paper-card": "PolymerElements/paper-card#^1.1.4",
"polymer": "polymer#*",
"google-chart-elasticsearch": "google-chart-elasticsearch#*"
}
}
If you want to make your dashboard installable via bower you can register this package. This requires having a git repository with all your dashboard code.
.. sourcecode:: bash
$ bower register <my-package-name> <git-endpoint>
Now it is time to test your dashboard visualization: create an `index.html` inside the demo folder. You need to add your dashboard tags the same way as in the dashboard-tourpedia example.
.. sourcecode:: html
<my-dashboard client="{{client}}"></my-dashboard>
After index.html is working, create a Dockerfile as in the example.
* In the Dockerfile, you need to edit the following line:
.. sourcecode:: bash
ENV NODE_PATH=/tmp/node_modules APP_NAME=<--- add your dashboard-name here --->
Now it is time to run it using docker-compose.
.. sourcecode:: bash
$ sudo docker-compose up
If your dashboard requires elasticsearch, just upload your data using Luigi pipelines.
.. sourcecode:: bash
$ sudo docker-compose exec luigi python -m luigi --module add_tweets Elasticsearch --index tourpedia --doc-type places --filename add_demo.json --local-scheduler