Web Scraping

With Everbase, you can parse websites using an intuitive GraphQL API. Here is a simple example of how to get the title of a web page:

{
  url(url: "http://example.com") {
    htmlDocument {
      title
    }
  }
}

The response:

{
  "data": {
    "url": {
      "htmlDocument": {
        "title": "Example Domain"
      }
    }
  }
}

Every web-scraping query starts with the url field; accessing its htmlDocument field fetches and parses the document behind the scenes.

In HTML, the main content lives inside the body element, and HTMLDocument exposes it through a field of the same name, which returns an HTMLNode. From there, we can use all or first with a CSS selector to extract the attribute values and text nodes that contain the content. This is how we can parse the front page of Hacker News into a machine-readable format:

{
  url(url: "https://news.ycombinator.com") {
    htmlDocument {
      title
      body {
        submissions: all(selector: "tr.athing") {
          rank: text(selector: "span.rank")
          text(selector: "a.storylink")
          url: attribute(selector: "a.storylink", name: "href")
          attrs: next {
            score: text(selector: "span.score")
            user: text(selector: "a.hnuser")
            comments: text(selector: "a:nth-of-type(3)")
          }
        }
      }
    }
  }
}

The submissions: all(selector: "tr.athing") syntax is called an alias in GraphQL: it makes the query return the value under submissions instead of all. In a query like this, aliases are highly recommended because they make the result much more readable:

{
  "data": {
    "url": {
      "htmlDocument": {
        "title": "Hacker News",
        "body": {
          "submissions": [
            {
              "rank": "1.",
              "text": "Collected Notes a note-taking blogging app I made",
              "url": "https://collectednotes.com/",
              "attrs": {
                "score": "148 points",
                "user": "alecrosa",
                "comments": "68 comments"
              }
            },
            {
              "rank": "2.",
              "text": "High-Resolution 3D Human Digitization",
              "url": "https://shunsukesaito.github.io/PIFu/",
              "attrs": {
                "score": "54 points",
                "user": "hliyan",
                "comments": "5 comments"
              }
            }
            // ...
          ]
        }
      }
    }
  }
}
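If you only need a single match, first takes the same selector argument as all but returns one node instead of a list. As a sketch, this extracts only the top submission (the fields mirror the query above; the exact shape of the result may differ):

{
  url(url: "https://news.ycombinator.com") {
    htmlDocument {
      body {
        top: first(selector: "tr.athing") {
          title: text(selector: "a.storylink")
          link: attribute(selector: "a.storylink", name: "href")
        }
      }
    }
  }
}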

For more information, check out HTMLDocument and HTMLNode. To try your own queries, use the editor.