Scrapy comes with a built-in web service for monitoring and controlling a running crawler. The service exposes most resources using the JSON-RPC 2.0 protocol, but there are also other (read-only) resources which just output JSON data.
Provides an extensible web service for managing a Scrapy process. It's enabled by the WEBSERVICE_ENABLED setting. The web server will listen on the port specified in WEBSERVICE_PORT, and will log to the file specified in WEBSERVICE_LOGFILE.
The web service is a built-in Scrapy extension which comes enabled by default, but you can also disable it if you’re running tight on memory.
The web service contains several resources, defined in the WEBSERVICE_RESOURCES setting. Each resource provides a different functionality. See Available JSON-RPC resources for a list of resources available by default.
Although you can implement your own resources using any protocol, there are two kinds of resources bundled with Scrapy:

Simple JSON resources: read-only resources that just output JSON data.
JSON-RPC resources: resources that provide direct access to Scrapy objects using the JSON-RPC 2.0 protocol.
These are the JSON-RPC resources available by default in Scrapy:
Crawler JSON-RPC resource (scrapy.contrib.webservice.crawler.CrawlerResource)
Provides access to the main Crawler object that controls the Scrapy process.
Available by default at: http://localhost:6080/crawler
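For example, a minimal sketch of stopping a running spider by calling a method on the engine, which is reachable through this resource (the spider name is a placeholder; this mirrors the stop command of the client script at the end of this document):

from scrapy.utils.jsonrpc import jsonrpc_client_call

# Ask the engine, reachable through the crawler resource, to close a spider.
jsonrpc_client_call('http://localhost:6080/crawler/engine', 'close_spider', 'somespider')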
Stats Collector JSON-RPC resource (scrapy.contrib.webservice.stats.StatsResource)
Provides access to the Stats Collector used by the crawler.
Available by default at: http://localhost:6080/stats
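For example, a sketch of retrieving the global stats over JSON-RPC (mirroring the get-global-stats command of the client script at the end of this document):

from scrapy.utils.jsonrpc import jsonrpc_client_call

# Call the get_stats() method of the Stats Collector through JSON-RPC.
stats = jsonrpc_client_call('http://localhost:6080/stats', 'get_stats')
for name, value in stats.items():
    print "%-40s %s" % (name, value)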
Spider manager JSON-RPC resource
You can access the spider manager JSON-RPC resource through the Crawler JSON-RPC resource at: http://localhost:6080/crawler/spiders
Extension manager JSON-RPC resource
You can access the extension manager JSON-RPC resource through the Crawler JSON-RPC resource at: http://localhost:6080/crawler/extensions
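For example, a sketch of listing the available spiders through the spider manager resource (mirroring the list-available command of the client script at the end of this document):

from scrapy.utils.jsonrpc import jsonrpc_client_call

# The spider manager exposes a list() method returning spider names.
for name in jsonrpc_client_call('http://localhost:6080/crawler/spiders', 'list'):
    print name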
These are the JSON resources available by default:
Engine status JSON resource (scrapy.contrib.webservice.enginestatus.EngineStatusResource)
Provides access to engine status metrics.
Available by default at: http://localhost:6080/enginestatus
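Since this is a plain JSON resource, a simple HTTP GET is enough to read it; for example, a sketch using the same approach as the json_get() helper of the client script at the end of this document:

import urllib
from scrapy.utils.py26 import json

# Plain JSON resources are read-only and fetched with a simple GET.
status = json.loads(urllib.urlopen('http://localhost:6080/enginestatus').read())
print status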
These are the settings that control the web service behaviour:
WEBSERVICE_ENABLED
Default: True
A boolean which specifies if the web service will be enabled (provided its extension is also enabled).
WEBSERVICE_LOGFILE
Default: None
A file to use for logging HTTP requests made to the web service. If unset, the log is sent to the standard Scrapy log.
WEBSERVICE_PORT
Default: [6080, 7030]
The port range to use for the web service. If set to None or 0, a dynamically assigned port is used.
WEBSERVICE_RESOURCES
Default: {}
The list of web service resources enabled for your project. See Web service resources. These are added to the ones available by default in Scrapy, defined in the WEBSERVICE_RESOURCES_BASE setting.
WEBSERVICE_RESOURCES_BASE
Default:
{
'scrapy.contrib.webservice.crawler.CrawlerResource': 1,
'scrapy.contrib.webservice.enginestatus.EngineStatusResource': 1,
'scrapy.contrib.webservice.stats.StatsResource': 1,
}
The list of web service resources available by default in Scrapy. You shouldn't change this setting in your project; change WEBSERVICE_RESOURCES instead. If you want to disable some resource, set its value to None in WEBSERVICE_RESOURCES.
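For example, a settings.py fragment enabling the service and customizing its resources (the module path myproject.webservice.BotInfoResource is hypothetical; see the resource examples below):

# settings.py
WEBSERVICE_ENABLED = True
WEBSERVICE_PORT = [6080, 7030]
WEBSERVICE_LOGFILE = 'webservice.log'

WEBSERVICE_RESOURCES = {
    # Hypothetical custom resource, added to the defaults.
    'myproject.webservice.BotInfoResource': 1,
    # Disable a default resource by setting it to None.
    'scrapy.contrib.webservice.enginestatus.EngineStatusResource': None,
}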
Web service resources are implemented using the Twisted Web API. See this Twisted Web guide for more information on Twisted web and Twisted web resources.
To write a web service resource you should subclass the JsonResource or JsonRpcResource classes and implement the render_GET method.
JsonResource
A subclass of twisted.web.resource.Resource that implements a JSON web service resource.

ws_name
The name by which the Scrapy web service will know this resource, and also the path where this resource will listen. For example, assuming the Scrapy web service is listening on http://localhost:6080/ and the ws_name is 'resource1', the URL for that resource will be:
http://localhost:6080/resource1
JsonRpcResource
This is a subclass of JsonResource for implementing JSON-RPC resources. JSON-RPC resources expose Python (Scrapy) objects through a JSON-RPC API. The wrapped object must be returned by the get_target() method, which, by default, returns the target passed to the constructor.

get_target()
Return the object wrapped by this JSON-RPC resource. By default, it returns the object passed to the constructor.

The following example, StatsResource, is a JSON-RPC resource that wraps the Stats Collector:
from scrapy.webservice import JsonRpcResource
from scrapy.stats import stats

class StatsResource(JsonRpcResource):

    ws_name = 'stats'

    def __init__(self, crawler):
        # Expose the global Stats Collector as the JSON-RPC target.
        JsonRpcResource.__init__(self, crawler, stats)
The next example, EngineStatusResource, is a JSON resource that exposes the engine status:

from scrapy.webservice import JsonResource
from scrapy.utils.engine import get_engine_status

class EngineStatusResource(JsonResource):

    ws_name = 'enginestatus'

    def __init__(self, crawler, spider_name=None):
        JsonResource.__init__(self, crawler)
        self._spider_name = spider_name
        self.isLeaf = spider_name is not None

    def render_GET(self, txrequest):
        status = get_engine_status(self.crawler.engine)
        if self._spider_name is None:
            return status
        # When scoped to a single spider, return only that spider's status.
        for sp, st in status['spiders'].items():
            if sp.name == self._spider_name:
                return st

    def getChild(self, name, txrequest):
        # Serve /enginestatus/<spider-name> with a resource scoped to that
        # spider (arguments passed in the order expected by __init__).
        return EngineStatusResource(self.crawler, name)
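As with the built-in enginestatus resource, requesting a child path such as http://localhost:6080/enginestatus/<spider-name> returns the status for that spider only.

A minimal custom resource might look like the following sketch (the class and the value it exposes are hypothetical; it assumes, as the example above does, that render_GET may return any JSON-serializable object):

from scrapy.webservice import JsonResource

class BotInfoResource(JsonResource):
    """Hypothetical read-only resource exposing basic bot information."""

    ws_name = 'botinfo'

    def render_GET(self, txrequest):
        # The returned object is serialized to JSON by JsonResource.
        return {'bot_name': self.crawler.settings['BOT_NAME']}

Once enabled through WEBSERVICE_RESOURCES, this resource would be available at http://localhost:6080/botinfo.

Finally, the following script is an example of a web service client: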
#!/usr/bin/env python
"""
Example script to control a Scrapy server using its JSON-RPC web service.

It only provides a reduced functionality as its main purpose is to illustrate
how to write a web service client. Feel free to improve or write your own.

Also, keep in mind that the JSON-RPC API is not stable. The recommended way
for controlling a Scrapy server is through the execution queue (see the
"queue" command).
"""
import sys, optparse, urllib
from urlparse import urljoin

from scrapy.utils.jsonrpc import jsonrpc_client_call, JsonRpcError
from scrapy.utils.py26 import json

def get_commands():
    return {
        'help': cmd_help,
        'stop': cmd_stop,
        'list-available': cmd_list_available,
        'list-running': cmd_list_running,
        'list-resources': cmd_list_resources,
        'get-global-stats': cmd_get_global_stats,
        'get-spider-stats': cmd_get_spider_stats,
    }

def cmd_help(args, opts):
    """help - list available commands"""
    print "Available commands:"
    for _, func in sorted(get_commands().items()):
        print "  ", func.__doc__

def cmd_stop(args, opts):
    """stop <spider> - stop a running spider"""
    jsonrpc_call(opts, 'crawler/engine', 'close_spider', args[0])

def cmd_list_running(args, opts):
    """list-running - list running spiders"""
    for x in json_get(opts, 'crawler/engine/open_spiders'):
        print x

def cmd_list_available(args, opts):
    """list-available - list names of available spiders"""
    for x in jsonrpc_call(opts, 'crawler/spiders', 'list'):
        print x

def cmd_list_resources(args, opts):
    """list-resources - list available web service resources"""
    for x in json_get(opts, '')['resources']:
        print x

def cmd_get_spider_stats(args, opts):
    """get-spider-stats <spider> - get stats of a running spider"""
    stats = jsonrpc_call(opts, 'stats', 'get_stats', args[0])
    for name, value in stats.items():
        print "%-40s %s" % (name, value)

def cmd_get_global_stats(args, opts):
    """get-global-stats - get global stats"""
    stats = jsonrpc_call(opts, 'stats', 'get_stats')
    for name, value in stats.items():
        print "%-40s %s" % (name, value)

def get_wsurl(opts, path):
    return urljoin("http://%s:%s/" % (opts.host, opts.port), path)

def jsonrpc_call(opts, path, method, *args, **kwargs):
    url = get_wsurl(opts, path)
    return jsonrpc_client_call(url, method, *args, **kwargs)

def json_get(opts, path):
    url = get_wsurl(opts, path)
    return json.loads(urllib.urlopen(url).read())

def parse_opts():
    usage = "%prog [options] <command> [arg] ..."
    description = "Scrapy web service control script. Use '%prog help' " \
        "to see the list of available commands."
    op = optparse.OptionParser(usage=usage, description=description)
    op.add_option("-H", dest="host", default="localhost",
        help="Scrapy host to connect to")
    op.add_option("-P", dest="port", type="int", default=6080,
        help="Scrapy port to connect to")
    opts, args = op.parse_args()
    if not args:
        op.print_help()
        sys.exit(2)
    cmdname, cmdargs = args[0], args[1:]
    commands = get_commands()
    if cmdname not in commands:
        sys.stderr.write("Unknown command: %s\n\n" % cmdname)
        cmd_help(None, None)
        sys.exit(1)
    return commands[cmdname], cmdargs, opts

def main():
    cmd, args, opts = parse_opts()
    try:
        cmd(args, opts)
    except IndexError:
        print cmd.__doc__
    except JsonRpcError, e:
        print str(e)
        if e.data:
            print "Server Traceback below:"
            print e.data

if __name__ == '__main__':
    main()
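Assuming the script above is saved as ws_client.py, typical invocations would look like python ws_client.py list-available or python ws_client.py -H example.com -P 6080 get-global-stats (the host name here is a placeholder).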