PyJSONProxy

simple proxy and scraper
git clone https://git.ce9e.org/PyJSONProxy.git

commit
a6af37d7d9895953177c5d3502b09b72f711ec11
parent
f20e16c35a8763f43e001e39c06f371010feee60
Author
Tobias Bengfort <tobias.bengfort@gmx.net>
Date
2015-02-05 16:41
write README

Diffstat

M README.rst 122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 files changed, 122 insertions, 0 deletions


diff --git a/README.rst b/README.rst

@@ -0,0 +1,122 @@
   -1     1 PyJSONProxy - simple proxy and scraper
   -1     2 
   -1     3 
   -1     4 simple proxy
   -1     5 ============
   -1     6 
   -1     7 AJAX requests are restricted by the `same origin policy`_. This can be
   -1     8 bypassed by using either `JSONP`_, `CORS`_ or a local proxy. PyJSONProxy
   -1     9 implements the third variant, so you can do something like this::
   -1    10 
   -1    11     $ curl http://localhost:5000/github/xi/
   -1    12     {
   -1    13       "login": "xi",
   -1    14       ...
   -1    15     }
   -1    16 
   -1    17 With a configuration like this::
   -1    18 
   -1    19     ENDPOINTS = {
   -1    20         'github': {
   -1    21             'host': 'https://api.github.com/users/'
   -1    22         }
   -1    23     }
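
As a rough illustration of the mapping implied by the example above (a sketch, not the project's actual code), the request path can be split into an endpoint name and a remainder that is appended to the configured ``host``. The ``upstream_url`` helper is hypothetical, and the exact handling of slashes is an assumption:

```python
ENDPOINTS = {
    'github': {
        'host': 'https://api.github.com/users/',
    },
}


def upstream_url(path, endpoints=ENDPOINTS):
    """Map a local request path to an upstream URL (hypothetical helper)."""
    # '/github/xi/' -> endpoint name 'github', remainder 'xi/'
    name, _, rest = path.lstrip('/').partition('/')
    if name not in endpoints:
        raise KeyError('unknown endpoint: %s' % name)
    # Assumption: the remainder is appended verbatim to the configured host.
    return endpoints[name]['host'] + rest


print(upstream_url('/github/xi/'))
```

Under these assumptions, a request for ``/github/xi/`` is forwarded to ``https://api.github.com/users/xi/``.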
   -1    24 
   -1    25 
   -1    26 scraping
   -1    27 ========
   -1    28 
   -1    29 Maybe the more interesting part is that this also contains a simple
   -1    30 scraping mechanism. So if a service does not offer an API but only plain
   -1    31 HTML pages, PyJSONProxy can extract information from there::
   -1    32 
   -1    33     $ curl http://localhost:5000/github/xi/
   -1    34     {
   -1    35       "url": "https://github.com/xi/",
   -1    36       "login": "xi",
   -1    37       ...
   -1    38     }
   -1    39     $ curl http://localhost:5000/repos/xi/
   -1    40     {
   -1    41       "url": "https://github.com/xi/",
   -1    42       "l": [
   -1    43         "/xi/pyjsonproxy",
   -1    44         ...
   -1    45       ]
   -1    46     }
   -1    47 
   -1    48 With a configuration like this::
   -1    49     ENDPOINTS = {
   -1    50         'github': {
   -1    51             'host': 'https://github.com/',
   -1    52             'type': 'scrape_item',
   -1    53             'fields': {
   -1    54                 'login': '.vcard-username',
   -1    55                 'fullname': '.vcard-fullname',
   -1    56                 'email': '.vcard-details .email',
   -1    57                 'join-date': '.vcard-details .join-date@datetime'
   -1    58             }
   -1    59         },
   -1    60         'repos': {
   -1    61             'host': 'https://github.com/',
   -1    62             'type': 'scrape_list',
   -1    63             'selector': '.repo-list-name a@href'
   -1    64         }
   -1    65     }
   -1    66 
   -1    67 There are two options here: ``scrape_item`` and ``scrape_list``. The first
   -1    68 one takes a mapping of field names to selectors and returns only the first
   -1    69 match for each selector. The latter takes a single selector and
   -1    70 returns every match for it.
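
The difference between the two types can be sketched as follows. This is only an illustration of the semantics described above, not the project's implementation: ``select`` is a hypothetical callable that returns all matches for a selector, and returning ``None`` for a field without matches is an assumption. The ``l`` key mirrors the example response shown earlier:

```python
def scrape_item(select, fields):
    """Return the first match for each field's selector."""
    result = {}
    for name, selector in fields.items():
        matches = select(selector)
        # Assumption: fields without any match are reported as None.
        result[name] = matches[0] if matches else None
    return result


def scrape_list(select, selector):
    """Return every match for a single selector under the key 'l'."""
    return {'l': list(select(selector))}


# Stand-in for a real HTML query: a fixed table of selector -> matches.
fake_matches = {
    '.vcard-username': ['xi', 'ignored second match'],
    '.repo-list-name a@href': ['/xi/pyjsonproxy', '/xi/other'],
}


def select(selector):
    return fake_matches.get(selector, [])


print(scrape_item(select, {'login': '.vcard-username'}))
print(scrape_list(select, '.repo-list-name a@href'))
```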
   -1    71 
   -1    72 Selectors are generally CSS-selectors with the additional option to
   -1    73 select an attribute by appending an ``@`` and the attribute name. If no
   -1    74 attribute is selected, the text content of the element will be used.
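
A minimal sketch of how such a selector could be split into its CSS part and the optional attribute name. The ``split_selector`` helper is hypothetical, and it assumes the last ``@`` in the string always introduces the attribute:

```python
def split_selector(selector):
    """Split 'css@attr' into (css, attr); attr is None if no '@' is present."""
    css, sep, attr = selector.rpartition('@')
    if not sep:
        # No '@' found: the element's text content would be used.
        return selector, None
    return css, attr


print(split_selector('.vcard-username'))
print(split_selector('.vcard-details .join-date@datetime'))
```

With this split, ``.repo-list-name a@href`` selects the ``href`` attribute of every matching ``a`` element, while ``.vcard-username`` falls back to the text content.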
   -1    75 
   -1    76 
   -1    77 CORS header
   -1    78 ===========
   -1    79 
   -1    80 By setting ``ALLOW_CORS`` to ``True``, an
   -1    81 ``Access-Control-Allow-Origin``-header with value ``*`` will be set for
   -1    82 all responses.
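
The effect of the setting can be sketched on a plain header dictionary. The ``add_cors_header`` helper is hypothetical and independent of whichever web framework the project uses:

```python
ALLOW_CORS = True


def add_cors_header(headers, allow_cors=ALLOW_CORS):
    """Return a copy of the response headers, with the CORS header if enabled."""
    headers = dict(headers)
    if allow_cors:
        # '*' allows scripts from any origin to read the response.
        headers['Access-Control-Allow-Origin'] = '*'
    return headers


print(add_cors_header({'Content-Type': 'application/json'}))
```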
   -1    83 
   -1    84 
   -1    85 Documentation
   -1    86 =============
   -1    87 
   -1    88 Some simple documentation is automatically generated and available under
   -1    89 ``/`` (for all endpoints) or ``/{endpoint}/`` (for an individual
   -1    90 endpoint). To provide some input for this documentation, you can add a
   -1    91 description to both endpoints and fields::
   -1    92 
   -1    93     ENDPOINTS = {
   -1    94         'github': {
   -1    95             'host': 'https://github.com/',
   -1    96             'type': 'scrape_item',
   -1    97             'doc': 'Access data about GitHub users',
   -1    98             'fields': {
   -1    99                 'login': '.vcard-username',
   -1   100                 'fullname': '.vcard-fullname',
   -1   101                 'email': '.vcard-details .email',
   -1   102                 'join-date': '.vcard-details .join-date@datetime'
   -1   103             },
   -1   104             'fields_doc': {
   -1   105                 'login': 'GitHub username',
   -1   106                 'fullname': "the user's full name",
   -1   107                 'join-date': 'date when the user joined GitHub, in ISO 8601 format'
   -1   108             }
   -1   109         }
   -1   110     }
   -1   111 
   -1   112 
   -1   113 Note on security and performance
   -1   114 ================================
   -1   115 
   -1   116 Security and performance were not a priority in this project. So it
   -1   117 might be bad.
   -1   118 
   -1   119 
   -1   120 .. _same origin policy: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
   -1   121 .. _JSONP: https://en.wikipedia.org/wiki/JSONP
   -1   122 .. _CORS: https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS