
estela Entrypoint

The estela Entrypoint is a package that implements a wrapper layer to extract job data from the environment, prepare the job properly, and execute it using Scrapy.

It can be seen as the implementation of a contract for running spiders: a set of requirements that any image must satisfy to run on estela.

Besides fulfilling the contract, the Entrypoint takes care of:

  • Running the job with Scrapy.
  • Integrating transparently with estela Storage.
  • Keeping the job synchronized with estela.

Contract statements

  1. The image should be able to start the job via the estela-crawl command without arguments.
    $ estela-crawl
    2022-03-19 14:00:05 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: books)
    ...
    
  2. The image should be able to return its metadata via the estela-describe-project command without arguments. The metadata must be a JSON object containing two fields:
    • project_type: The type of the project, e.g. ‘scrapy’.
    • spiders: A list of the spider names within the project.
    $ estela-describe-project
    {"project_type": "scrapy", "spiders": ["spider_1", "spider_2"]}
    
  3. The job should be able to get all needed information using environment variables.
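
As an illustration of statement 2, the sketch below checks that a string such as the output of estela-describe-project is a JSON object with the two required fields. The function name validate_project_metadata is hypothetical and is not part of the estela Entrypoint; the sample string is the one shown above.

```python
import json


def validate_project_metadata(raw: str) -> dict:
    """Parse describe-project output and check the two required fields.

    Hypothetical helper for illustration; not part of the estela Entrypoint.
    """
    metadata = json.loads(raw)
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    if not isinstance(metadata.get("project_type"), str):
        raise ValueError("project_type must be a string")
    spiders = metadata.get("spiders")
    if not isinstance(spiders, list) or not all(isinstance(s, str) for s in spiders):
        raise ValueError("spiders must be a list of spider names")
    return metadata


# Sample output taken from the contract statement above.
sample = '{"project_type": "scrapy", "spiders": ["spider_1", "spider_2"]}'
metadata = validate_project_metadata(sample)
print(metadata["project_type"], metadata["spiders"])
```

A check like this can be useful when building a custom image, to confirm that the command's output satisfies the contract before deploying.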

Environment variables

JOB_INFO (Required)

Dictionary with all the job information in JSON format. The fields are:

Field      | Type   | Description                                                       | Example                          | Required
key        | string | Job key in the format job_ID/spider_ID/project_ID                 | "1/2/3"                          | Yes
spider     | string | Spider name                                                       | "spider_name"                    | Yes
auth_token | string | estela user authentication token                                  | "token-A23@#21j"                 | Yes
api_host   | string | estela API host                                                   | "https://api.host.com"           | Yes
collection | string | Name of the collection where items will be stored                 | "collection-name"                | Yes
unique     | string | Flag indicating whether the data is stored in a unique collection | "False"                          | Only for cronjobs
args       | dict   | Job arguments                                                     | {"arg1": "val1", "arg2": "val2"} | No
env_vars   | dict   | Job environment variables                                         | {"env1": "val1", "env2": "val2"} | No
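
A minimal sketch of how a job might read JOB_INFO, assuming the variable holds a JSON object with the fields listed above. The helpers load_job_info and split_job_key are hypothetical names used for illustration, not the Entrypoint's actual functions, and the key is split on "/" only because that is the format shown in the table.

```python
import json
import os
from typing import Mapping

# Fields marked "Yes" in the Required column above.
REQUIRED_FIELDS = ("key", "spider", "auth_token", "api_host", "collection")


def load_job_info(environ: Mapping[str, str] = os.environ) -> dict:
    """Read and validate JOB_INFO (hypothetical helper)."""
    raw = environ.get("JOB_INFO")
    if raw is None:
        raise KeyError("JOB_INFO is not set")
    job_info = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in job_info]
    if missing:
        raise ValueError(f"JOB_INFO is missing required fields: {missing}")
    # Optional fields default to empty dictionaries.
    job_info.setdefault("args", {})
    job_info.setdefault("env_vars", {})
    return job_info


def split_job_key(key: str) -> tuple:
    """Split a key of the form job_ID/spider_ID/project_ID into its parts."""
    job_id, spider_id, project_id = key.split("/")
    return job_id, spider_id, project_id


# Example values copied from the table; a plain dict stands in for os.environ.
fake_environ = {
    "JOB_INFO": json.dumps(
        {
            "key": "1/2/3",
            "spider": "spider_name",
            "auth_token": "token-A23@#21j",
            "api_host": "https://api.host.com",
            "collection": "collection-name",
        }
    )
}
job_info = load_job_info(fake_environ)
job_id, spider_id, project_id = split_job_key(job_info["key"])
print(job_info["spider"], job_id, spider_id, project_id)
```

Passing the environment as a parameter keeps the helper easy to test without modifying the real process environment.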

QUEUE_PLATFORM (Required)

The queue platform used by estela. See the list of currently supported platforms.

QUEUE_PLATFORM_{PARAMETERS} (Required)

Refer to the estela Queue Adapter documentation for the variables that must be declared.
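
The sketch below shows one way the queue settings could be gathered from the environment, assuming only the naming pattern described here: QUEUE_PLATFORM plus any number of QUEUE_PLATFORM_-prefixed variables. The function collect_queue_settings, the platform value "example-platform", and the parameter names HOST and PORT are all hypothetical; the real parameter names depend on the platform and are defined in the estela Queue Adapter documentation.

```python
import os
from typing import Mapping

PREFIX = "QUEUE_PLATFORM_"


def collect_queue_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Gather QUEUE_PLATFORM and its prefixed parameters (hypothetical helper)."""
    platform = environ.get("QUEUE_PLATFORM")
    if not platform:
        raise KeyError("QUEUE_PLATFORM is not set")
    # Strip the prefix so QUEUE_PLATFORM_HOST becomes the parameter HOST.
    params = {
        name[len(PREFIX):]: value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }
    return {"platform": platform, "params": params}


# Hypothetical values for illustration only; a dict stands in for os.environ.
fake_environ = {
    "QUEUE_PLATFORM": "example-platform",
    "QUEUE_PLATFORM_HOST": "localhost",
    "QUEUE_PLATFORM_PORT": "1234",
    "UNRELATED_VARIABLE": "ignored",
}
settings = collect_queue_settings(fake_environ)
print(settings["platform"], sorted(settings["params"]))
```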