Commit 79ad1fc6 authored by J. Fernando Sánchez

Add script to dump backups

parent a40a0adf
@@ -16,20 +16,31 @@ NOTE: This command may require sudo
Loading backup data
===================
If you already have a list of analyzed posts that you want to import into Elasticsearch, you can do so with the `loadBackup.py` script.
For convenience, this script is included in the gsicrawler container.
Just copy your backup file (a jsonlines file named `lateralbackup.jsons`) into the gsicrawler folder and run this command:

```
docker-compose exec gsicrawler python loadBackup.py
```
NOTE: This command may require sudo
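
Once the import finishes, you can check that the documents actually reached Elasticsearch. The snippet below is only a sketch: it assumes the default `lateraldemo` index used by the scripts in this repository and the Elasticsearch port exposed by docker-compose (`localhost:19200`); adjust both if your setup differs.

```
# Sketch: count the documents imported by loadBackup.py.
# Assumes the default index name ('lateraldemo') and the Elasticsearch
# port exposed by docker-compose (localhost:19200).
from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:19200')
result = es.count(index='lateraldemo')
print('Documents in lateraldemo:', result['count'])
```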
Services
========
This demo includes four services:

* Luigi orchestrator (gsicrawler). This service provides both the task executor and a web interface to check the status of your workflows.
  Tasks are executed periodically, according to the period set in `crontasks.py` (24 hours by default).
  The web interface is available on http://localhost:8082.
* Elasticsearch: the official Elasticsearch image. It is available on localhost:19200.
* Senpy, used for sentiment and semantic analysis. It is mapped to http://localhost:5000/.
* Somedi dashboard (sefarad), a website developed with Sefarad. It displays the data stored in Elasticsearch and is available on http://localhost:8080.
The docker-compose definition adds all these services to the same network, so they can communicate with each other using their service names, without exposing external ports.
The endpoints used in each service (e.g. the Elasticsearch endpoint used by the gsicrawler service) are configurable through environment variables.
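
As a quick sanity check, you can probe the externally exposed endpoints from the host. This is only an illustrative sketch using the default ports listed above; it is not part of the project.

```
# Illustrative smoke test for the externally exposed endpoints listed above.
# The ports are the docker-compose defaults mentioned in this README; adjust
# them if you have changed the port mappings.
from urllib.request import urlopen
from urllib.error import URLError

SERVICES = {
    'Senpy': 'http://localhost:5000/',
    'Somedi dashboard': 'http://localhost:8080/',
    'Luigi web interface': 'http://localhost:8082/',
    'Elasticsearch': 'http://localhost:19200/',
}

for name, url in SERVICES.items():
    try:
        urlopen(url, timeout=5)
        print('{}: OK ({})'.format(name, url))
    except URLError as error:
        print('{}: unreachable ({}): {}'.format(name, url, error))
```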
Troubleshooting
@@ -41,3 +52,5 @@ This will solve it temporarily, until the next boot:
```
sudo sysctl -w vm.max_map_count=262144
```
If you want to make this permanent, set the value in your `/etc/sysctl.conf`.
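
For example, using the same value as the command above (depending on your distribution, a drop-in file under `/etc/sysctl.d/` may be preferred):

```
# /etc/sysctl.conf
vm.max_map_count=262144
```

After editing the file, reload the settings with `sudo sysctl -p` or reboot.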
# Dump all documents from an Elasticsearch index into a jsonlines backup
# file (lateralbackup.jsons), which can later be imported with loadBackup.py.
import os
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

# Endpoint and index are configurable through environment variables.
ENDPOINT = os.environ.get('ES_ENDPOINT', 'localhost:9200')
INDEX = os.environ.get('ES_INDEX', 'lateraldemo')

es = Elasticsearch(ENDPOINT)

# scan() returns a generator that transparently scrolls through every hit.
res = scan(es, index=INDEX, scroll='1m')

with open('lateralbackup.jsons', 'w') as f:
    # Write one JSON document per line (jsonlines format).
    for hit in res:
        print(json.dumps(hit['_source']), file=f)
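
To dump a different index or a different Elasticsearch instance, override the two environment variables when running the script. The filename `dumpBackup.py` below is hypothetical; use whatever name this script has in your checkout:

```
# Hypothetical invocation; 'dumpBackup.py' stands for this script's actual filename.
ES_ENDPOINT=localhost:19200 ES_INDEX=lateraldemo python dumpBackup.py
```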
@@ -5,11 +5,11 @@ import os
host = os.environ.get("ES_ENDPOINT_EXTERNAL", "elasticsearch:9200")
es = Elasticsearch(hosts=[host, ])
# The backup is a jsonlines file: one JSON document per line.
with open('lateralbackup.jsons') as infile:
    for line in infile:
        data = json.loads(line)
        print(data['@id'])
        # Index each document under its '@id' so re-running the import
        # overwrites existing documents instead of duplicating them.
        es.index(index='lateraldemo', doc_type='tweet', id=data['@id'], body=data)