- commit: a6af37d7d9895953177c5d3502b09b72f711ec11
- parent: f20e16c35a8763f43e001e39c06f371010feee60
- Author: Tobias Bengfort <tobias.bengfort@gmx.net>
- Date: 2015-02-05 16:41
write README
Diffstat
| M | README.rst | 122 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 files changed, 122 insertions, 0 deletions
diff --git a/README.rst b/README.rst
@@ -0,0 +1,122 @@
-1 1 PyJSONProxy - simple proxy and scraper
-1 2
-1 3
-1 4 simple proxy
-1 5 ============
-1 6
-1 7 AJAX requests are restricted by the `same origin policy`_. This can be
-1 8 worked around with `JSONP`_, `CORS`_, or a local proxy. PyJSONProxy
-1 9 implements the third variant, so you can do something like this::
-1 10
-1 11     $ curl http://localhost:5000/github/xi/
-1 12     {
-1 13         "login": "xi",
-1 14         ...
-1 15     }
-1 16
-1 17 With a configuration like this::
-1 18
-1 19     ENDPOINTS = {
-1 20         'github': {
-1 21             'host': 'https://api.github.com/users/'
-1 22         }
-1 23     }
-1 24
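The mapping from a local request to the remote URL can be pictured with a short sketch. It assumes, based only on the example above, that the path after the endpoint name is appended to the endpoint's ``host``. ``build_upstream_url`` is a hypothetical helper for illustration, not part of PyJSONProxy.

```python
ENDPOINTS = {
    'github': {
        'host': 'https://api.github.com/users/',
    },
}


def build_upstream_url(endpoints, endpoint, path):
    """Hypothetical helper: join an endpoint's host with the requested path."""
    host = endpoints[endpoint]['host']
    # Assumption: the remaining path is appended to the host unchanged.
    return host + path.lstrip('/')


print(build_upstream_url(ENDPOINTS, 'github', 'xi/'))
```

Under that assumption, a request to ``/github/xi/`` would be forwarded to ``https://api.github.com/users/xi/``.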
-1 25
-1 26 scraping
-1 27 ========
-1 28
-1 29 Perhaps the more interesting part is the simple scraping mechanism
-1 30 that is also included. If a service offers no API but only plain HTML
-1 31 pages, PyJSONProxy can extract information from them::
-1 32
-1 33     $ curl http://localhost:5000/github/xi/
-1 34     {
-1 35         "url": "https://github.com/xi/",
-1 36         "login": "xi",
-1 37         ...
-1 38     }
-1 39     $ curl http://localhost:5000/repos/xi/
-1 40     {
-1 41         "url": "https://github.com/xi/",
-1 42         "l": [
-1 43             "/xi/pyjsonproxy",
-1 44             ...
-1 45         ]
-1 46     }
-1 47
-1 48 With a configuration like this::

-1 49     ENDPOINTS = {
-1 50         'github': {
-1 51             'host': 'https://github.com/',
-1 52             'type': 'scrape_item',
-1 53             'fields': {
-1 54                 'login': '.vcard-username',
-1 55                 'fullname': '.vcard-fullname',
-1 56                 'email': '.vcard-details .email',
-1 57                 'join-date': '.vcard-details .join-date@datetime'
-1 58             }
-1 59         },
-1 60         'repos': {
-1 61             'host': 'https://github.com/',
-1 62             'type': 'scrape_list',
-1 63             'selector': '.repo-list-name a@href'
-1 64         }
-1 65     }
-1 66
-1 67 There are two options here: ``scrape_item`` and ``scrape_list``. The first
-1 68 one takes a mapping of field names to selectors and returns only the first
-1 69 match for each selector. The latter takes a single selector and
-1 70 returns every match for it.
-1 71
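The difference between the two types can be sketched in a few lines. This is only an illustration, not PyJSONProxy's code: ``select`` is a hypothetical callable that returns all matches for a selector, ``FAKE_PAGE`` holds made-up values in place of a real HTML page, and returning ``None`` for a field without a match is an assumption.

```python
def scrape_item(select, fields):
    """Return the first match for each field's selector."""
    item = {}
    for name, selector in fields.items():
        matches = select(selector)
        # Assumption: a field without any match is reported as None.
        item[name] = matches[0] if matches else None
    return item


def scrape_list(select, selector):
    """Return every match for a single selector."""
    return list(select(selector))


# Made-up stand-in for a real page: selector -> all matching values.
FAKE_PAGE = {
    '.vcard-username': ['xi'],
    '.repo-list-name a@href': ['/xi/pyjsonproxy', '/xi/example-repo'],
}


def select(selector):
    """Hypothetical query function backed by FAKE_PAGE."""
    return FAKE_PAGE.get(selector, [])


print(scrape_item(select, {'login': '.vcard-username'}))
print(scrape_list(select, '.repo-list-name a@href'))
```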
-1 72 Selectors are generally CSS-selectors with the additional option to
-1 73 select an attribute by appending an ``@`` and the attribute name. If no
-1 74 attribute is selected, the text content of the element will be used.
-1 75
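The ``@`` suffix can be separated from the CSS part with a small helper. The sketch below uses a hypothetical ``split_selector`` function and splits on the last ``@``; how PyJSONProxy actually parses selectors is not shown here, and a CSS selector that itself contains ``@`` would need more careful handling.

```python
def split_selector(selector):
    """Split 'css@attr' into (css, attr); attr is None if there is no '@'."""
    css, sep, attr = selector.rpartition('@')
    if not sep:
        # No '@' present: the text content of the element would be used.
        return selector, None
    return css, attr


print(split_selector('.repo-list-name a@href'))
print(split_selector('.vcard-username'))
```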
-1 76
-1 77 CORS header
-1 78 ===========
-1 79
-1 80 By setting ``ALLOW_CORS`` to ``True``, an
-1 81 ``Access-Control-Allow-Origin``-header with value ``*`` will be set for
-1 82 all responses.
-1 83
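The effect of the setting can be sketched independently of any web framework. ``apply_cors`` is a hypothetical helper that works on a plain dict of response headers; it is not taken from PyJSONProxy.

```python
def apply_cors(headers, allow_cors):
    """Return a copy of the headers, with the CORS header added if enabled."""
    result = dict(headers)
    if allow_cors:
        # '*' allows requests from any origin.
        result['Access-Control-Allow-Origin'] = '*'
    return result


print(apply_cors({'Content-Type': 'application/json'}, True))
```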
-1 84
-1 85 Documentation
-1 86 =============
-1 87
-1 88 Some simple documentation is automatically generated and available under
-1 89 ``/`` (for all endpoints) or ``/{endpoint}/`` (for an individual
-1 90 endpoint). To provide some input for this documentation, you can add a
-1 91 description to both endpoints and fields::
-1 92
-1 93     ENDPOINTS = {
-1 94         'github': {
-1 95             'host': 'https://github.com/',
-1 96             'type': 'scrape_item',
-1 97             'doc': 'Access data about GitHub users',
-1 98             'fields': {
-1 99                 'login': '.vcard-username',
-1 100                 'fullname': '.vcard-fullname',
-1 101                 'email': '.vcard-details .email',
-1 102                 'join-date': '.vcard-details .join-date@datetime'
-1 103             },
-1 104             'fields_doc': {
-1 105                 'login': 'github username',
-1 106                 'fullname': "the user's full name",
-1 107                 'join-date': 'date when the user joined github in ISO-xx format'
-1 108             }
-1 109         }
-1 110     }
-1 111
-1 112
-1 113 Note on security and performance
-1 114 ================================
-1 115
-1 116 Security and performance were not a priority in this project, so both
-1 117 may be poor.
-1 118
-1 119
-1 120 .. _same origin policy: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
-1 121 .. _JSONP: https://en.wikipedia.org/wiki/JSONP
-1 122 .. _CORS: https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS