Getting started
---------------

First glance into GSI Crawler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The quickest way to explore the possibilities offered by GSI Crawler is to access this `demo <http://dashboard-gsicrawler.cluster.gsi.dit.upm.es//>`_. There you will find a dashboard that visualizes data collected from different news sources and from Twitter. Some examples of the added value offered by this tool are topic and sentiment extraction, identification of the people appearing in the scraped data, and geolocation of the sources.


.. image:: images/crawler2.png
  :align: center

|

.. image:: images/map.jpg
  :align: center


Tutorial I: Install
~~~~~~~~~~~~~~~~~~~~

GSI Crawler installation is based on Docker containers, so you need to have both Docker and docker-compose installed.

For Docker installation on Ubuntu, visit this `link <https://store.docker.com/editions/community/docker-ce-server-ubuntu?tab=description>`_.

Detailed docker-compose installation instructions are available `here <https://docs.docker.com/compose/install/>`_.

First of all, you need to clone the repositories:

.. code:: bash

   $ git clone http://lab.cluster.gsi.dit.upm.es/sefarad/gsicrawler.git

Then, you need to set up the environment variables. To do so, create a file named ``.env`` in the root directory of each project (gsicrawler and dashboard-gsicrawler) with the contents shown below. As you can see, `Twitter <https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens>`_ and `Meaningcloud <https://www.meaningcloud.com/developer/apis>`_ credentials are needed if you wish to use those services.

.. code::

  TWITTER_CONSUMER_KEY={YourConsumerKey, get it on Twitter}
  TWITTER_CONSUMER_SECRET={YourConsumerSecret, get it on Twitter}
  TWITTER_ACCESS_TOKEN={YourAccessToken, get it on Twitter}
  TWITTER_ACCESS_TOKEN_SECRET={YourAccessTokenSecret, get it on Twitter}
  ES_ENDPOINT=elasticsearch
  ES_PORT=9200
  ES_ENDPOINT_EXTERNAL=localhost:19200
  FUSEKI_ENDPOINT=fuseki
  FUSEKI_PORT=3030
  FUSEKI_ENDPOINT_EXTERNAL=localhost:13030
  FUSEKI_PASSWORD={YourFusekiPass}
  API_KEY_MEANING_CLOUD={YourMeaningCloudApiKey, get it on Meaningcloud}

Finally, execute the following lines:

.. code:: bash

    $ cd gsicrawler
    $ sudo docker-compose up

The information related to the initialization is printed to the console. If you wish to see how tasks are being executed, apart from reading the logs you can access the Luigi task visualizer at ``localhost:8082``. You will learn more about Luigi in the next steps.

When the process finishes, you can access the demo dashboard by visiting ``localhost:8080`` in your web browser.

|

Tutorial II: Crawling news
~~~~~~~~~~~~~~~~~~~~~~~~~~

This second tutorial shows how to build a crawler that gathers news from CNN using the CNN News API. In the general case we could use the `Scrapy <https://docs.scrapy.org/en/latest/>`_ library, which allows extracting data from arbitrary web pages; a sketch of such a spider is shown after the screenshot below.

We will only obtain the headline and URL of each piece of news published by CNN about a given topic, storing those fields in a JSON file.

.. image:: images/cnnsearch.png
  :align: center

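Although this tutorial relies on the CNN search API, a crawler for an arbitrary site could be written with Scrapy. A minimal sketch of such a spider follows; the spider name, start URL and CSS selectors are purely illustrative and are not part of the GSI Crawler code base:

.. code-block:: python

  import scrapy

  class HeadlineSpider(scrapy.Spider):
      """Illustrative spider, not part of the GSI Crawler code base."""
      name = 'headlines'
      # Hypothetical start page; any listing of articles would do.
      start_urls = ['https://edition.cnn.com/world']

      def parse(self, response):
          # The CSS selector below is hypothetical; adapt it to the target page.
          for link in response.css('a.container__link'):
              yield {
                  'headline': link.css('::text').get(),
                  'url': response.urljoin(link.attrib.get('href', '')),
              }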

The code of this example can be found in ``luigi/scrapers/tutorial2.py``:

.. code-block:: python

  import requests
  import json

  def retrieveCnnNews(search, num, filepath):
    # Query the CNN search API for `num` results matching `search`.
    r = requests.get("https://search.api.cnn.io/content?q=" + search + "&size=" + str(num))

    response = r.json()["result"]

    # Append one JSON object per news item to the output file.
    with open(filepath, 'a') as outfile:
      print("CRAWLING RESULT")
      for newsitem in response:
        aux = dict()
        aux["url"] = newsitem["url"]
        aux["headline"] = newsitem["headline"]
        print(aux)
        json.dump(aux, outfile)
        outfile.write('\n')

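As a quick sanity check, the scraper can also be invoked directly outside Luigi; the search term, result count and output path below are only illustrative:

.. code-block:: python

  # Hypothetical stand-alone run of the scraper defined above:
  # fetch 5 CNN results for the query "isis" and append them to a local file.
  retrieveCnnNews("isis", 5, "/tmp/cnn_news.json")

  # Each line of the output file is a small JSON object with "url" and "headline".
  with open("/tmp/cnn_news.json") as f:
      for line in f:
          print(line.strip())
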
Then, we have to define a Luigi task that executes the code above. For more information about Luigi task pipelines, please visit its `documentation <https://luigi.readthedocs.io/en/stable/>`_. This task can be found in ``luigi/tutorialtask.py``.

.. code-block:: python

  class CrawlerTask(luigi.Task):
      """
      Generates a local file containing 5 elements of data in JSON format.
      """
      url = luigi.Parameter()
      id = luigi.Parameter()

      def run(self):
          """
          Writes data in JSON format into the task's output target.
          """
          filePath = '/tmp/_scrapy-%s.json' % self.id
          print(self.url, filePath)
          retrieveCnnNews(self.url, 10, filePath)

      def output(self):
          """
          Returns the target output for this task.
          In this case, a successful execution of this task will create a file on the local filesystem.
          """
          return luigi.LocalTarget(path='/tmp/_scrapy-%s.json' % self.id)

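To illustrate how Luigi chains tasks together, the sketch below shows a hypothetical downstream task that consumes the output of ``CrawlerTask``; it is not part of the actual pipeline:

.. code-block:: python

  import luigi

  class CountNewsTask(luigi.Task):
      """Hypothetical task that counts the items produced by CrawlerTask."""
      url = luigi.Parameter()
      id = luigi.Parameter()

      def requires(self):
          # Luigi runs CrawlerTask first and exposes its target via self.input().
          # Assumes CrawlerTask (defined above) is available in the same module.
          return CrawlerTask(url=self.url, id=self.id)

      def output(self):
          return luigi.LocalTarget('/tmp/_count-%s.txt' % self.id)

      def run(self):
          with self.input().open('r') as infile:
              count = sum(1 for _ in infile)
          with self.output().open('w') as outfile:
              outfile.write('%d news items\n' % count)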


Finally, to run the tutorial, execute the following line from your repository path.

.. code:: bash

  $ sudo docker-compose run gsicrawler tutorial2

|

The resulting JSON will appear on the console.

.. code:: json
  
  {"headline": "Iraqi forces say they've recaptured Hawija city center from ISIS", "url": "http://www.cnn.com/2017/10/05/middleeast/iraq-isis-hawija/index.html"}
  {"headline": "3 US troops killed in ambush in Niger", "url": "http://www.cnn.com/2017/10/04/politics/us-forces-hostile-fire-niger/index.html"}


Tutorial III: Semantic enrichment and data storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this tutorial we are going to structure our data according to the `NewsArticle <http://schema.org/NewsArticle>`_ entity from Schema.org. The scraper code can be found in ``luigi/scrapers/tutorial3.py``.

.. code-block:: python

  import requests
  import json

  def retrieveCnnNews(search, num, filepath):
    r = requests.get("https://search.api.cnn.io/content?q=" + search + "&size=" + str(num))
    response = r.json()["result"]

    with open(filepath, 'a') as outfile:
      for newsitem in response:
        aux = dict()
        aux["@type"] = "schema:NewsArticle"
        aux["@id"] = newsitem["url"]
        aux["_id"] = newsitem["url"]
        aux["schema:datePublished"] = newsitem["firstPublishDate"]
        aux["schema:dateModified"] = newsitem["lastModifiedDate"]
        aux["schema:articleBody"] = newsitem["body"]
        aux["schema:about"] = newsitem["topics"]
        aux["schema:author"] = newsitem["source"]
        aux["schema:headline"] = newsitem["headline"]
        aux["schema:search"] = search
        aux["schema:thumbnailUrl"] = newsitem["thumbnail"]
        json.dump(aux, outfile)
        outfile.write('\n')

The Luigi pipeline is more complex now, as the data also has to be stored in Elasticsearch and Fuseki. The pipeline code can likewise be found in ``luigi/scrapers/tutorial3.py``; the task execution workflow is initiated by ``PipelineTask``, which is in charge of calling its dependent tasks.

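As a rough idea of what the storage step does, the sketch below indexes each scraped line into Elasticsearch with the official Python client; the file path and index settings are assumptions for illustration and do not reproduce the pipeline's actual code:

.. code-block:: python

  import json
  from elasticsearch import Elasticsearch

  # Endpoint and port correspond to ES_ENDPOINT and ES_PORT in the .env file.
  es = Elasticsearch(['http://elasticsearch:9200'])

  # Hypothetical output file produced by the scraper above.
  with open('/tmp/tutorial3.json') as infile:
      for line in infile:
          doc = json.loads(line)
          doc_id = doc.pop('_id')  # the news URL is used as the document id
          es.index(index='tutorial', doc_type='news', id=doc_id, body=doc)
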
To run this tutorial, execute the following line:

.. code:: bash

  $ sudo docker-compose run gsicrawler tutorial3

In order to access the data stored in Elasticsearch, open ``localhost:19200/tutorial/_search?pretty`` in your web browser.

.. code:: json

  {
    "_index" : "tutorial",
    "_type" : "news",
    "_id" : "http://www.cnn.com/2017/10/04/politics/syria-russia-us-assad-at-tanf/index.html",
    "_score" : 1.0,
    "_source" : {
      "@type" : "schema:NewsArticle",
      "@id" : "http://www.cnn.com/2017/10/04/politics/syria-russia-us-assad-at-tanf/index.html",
      "schema:datePublished" : "2017-10-04T18:05:30Z",
      "schema:dateModified" : "2017-10-04T18:05:29Z",
      "schema:articleBody" : "Forces aligned with Syrian President Bashar al-Assad made an incursion Wednesday into the 55km \"de-confliction zone..." ",
      "schema:about" : [
        "Syria conflict",
        "Armed forces",
        "ISIS",
        "Military operations"
      ],
      "schema:author" : "cnn",
      "schema:headline" : "Syrian regime forces enter buffer zone surrounding US base",
      "schema:search" : "\"isis\"",
      "schema:thumbnailUrl" : "http://i2.cdn.turner.com/cnnnext/dam/assets/170616041647-baghdadi-file-story-body.jpg"
    }
  }



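The same data can also be retrieved programmatically; a minimal sketch with ``requests`` (the query below simply lists the stored headlines):

.. code-block:: python

  import requests

  # ES_ENDPOINT_EXTERNAL in .env maps Elasticsearch to localhost:19200 on the host.
  r = requests.get('http://localhost:19200/tutorial/_search', params={'size': 10})
  for hit in r.json()['hits']['hits']:
      print(hit['_source']['schema:headline'])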

To see the same data in Fuseki, the address is ``localhost:13030/tutorial/data``.

.. code:: turtle

  <http://www.cnn.com/2017/10/02/politics/las-vegas-domestic-terrorism/index.html>
          a                     schema:NewsArticle ;
          <http://latest.senpy.cluster.gsi.dit.upm.es/ns/_id>
                  "http://www.cnn.com/2017/10/02/politics/las-vegas-domestic-terrorism/index.html" ;
          schema:about          "Shootings" , "Mass murder" , "Las Vegas" , "2017 Las Vegas concert shooting" ;
          schema:articleBody    "President Donald Trump on Tuesday did not say ...\"" ;
          schema:author         "cnn" ;
          schema:dateModified   "2017-10-03T14:13:36Z" ;
          schema:datePublished  "2017-10-02T21:26:26Z" ;
          schema:headline       "Trump mum on whether Las Vegas shooting was domestic terrorism" ;
          schema:search         "\"isis\"" ;
          schema:thumbnailUrl   "http://i2.cdn.turner.com/cnnnext/dam/assets/171002123455-31-las-vegas-incident-1002-story-body.jpg" .

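The Fuseki dataset can also be queried with SPARQL; a minimal sketch using ``requests`` follows (the dataset query endpoint and the expansion of the ``schema:`` prefix are assumptions based on the data shown above):

.. code-block:: python

  import requests

  # FUSEKI_ENDPOINT_EXTERNAL in .env maps Fuseki to localhost:13030 on the host.
  query = """
  PREFIX schema: <http://schema.org/>
  SELECT ?headline WHERE { ?article a schema:NewsArticle ; schema:headline ?headline . }
  """
  r = requests.get('http://localhost:13030/tutorial/query',
                   params={'query': query},
                   headers={'Accept': 'application/sparql-results+json'})
  for binding in r.json()['results']['bindings']:
      print(binding['headline']['value'])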

Developing Sefarad dashboards
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For more information about dashboard creation, please visit `Sefarad documentation <http://sefarad.readthedocs.io/en/latest/dashboards-dev.html>`_.