With Everbase, you can scrape websites through an intuitive GraphQL API. Here is a simple example of how to get the title of a web page:
{
  url(url: "http://example.com") {
    htmlDocument {
      title
    }
  }
}
The response:

{
  "data": {
    "url": {
      "htmlDocument": {
        "title": "Example Domain"
      }
    }
  }
}
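Under the hood this is an ordinary GraphQL request, so any HTTP client can send it. Below is a minimal Python sketch of how such a request could be assembled; the endpoint URL is an assumption for illustration only, not the documented Everbase address.

```python
import json

# Assumed endpoint for illustration only; check the Everbase docs for the real URL.
ENDPOINT = "https://api.everbase.co/graphql"

# The same query as above, wrapped in the standard {"query": ...} JSON payload.
query = """
{
  url(url: "http://example.com") {
    htmlDocument {
      title
    }
  }
}
"""

payload = json.dumps({"query": query})

# Actually sending it would look like this (commented out to keep the sketch offline):
# import urllib.request
# req = urllib.request.Request(
#     ENDPOINT,
#     data=payload.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# result = json.load(urllib.request.urlopen(req))
# print(result["data"]["url"]["htmlDocument"]["title"])
```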
With web scraping, we always start with a url and access its htmlDocument property, which fetches the document behind the scenes. In HTML, the main content sits inside the body element, and body is also the field on HTMLDocument that accesses it, returning an HTMLNode. From there, we can use all or first with a CSS selector to extract the attributes and text nodes that contain the content. This is how we can parse the front page of Hacker News into a machine-readable format:
{
  url(url: "https://news.ycombinator.com") {
    htmlDocument {
      title
      body {
        submissions: all(selector: "tr.athing") {
          rank: text(selector: "span.rank")
          text(selector: "a.storylink")
          url: attribute(selector: "a.storylink", name: "href")
          attrs: next {
            score: text(selector: "span.score")
            user: text(selector: "a.hnuser")
            comments: text(selector: "a:nth-of-type(3)")
          }
        }
      }
    }
  }
}
The submissions: all(selector: "tr.athing") syntax is called an alias in GraphQL: it makes the query return the value under submissions rather than under all. With a query like this, aliases are highly recommended because they make the result much more readable:
{
  "data": {
    "url": {
      "htmlDocument": {
        "title": "Hacker News",
        "body": {
          "submissions": [
            {
              "rank": "1.",
              "text": "Collected Notes a note-taking blogging app I made",
              "url": "https://collectednotes.com/",
              "attrs": {
                "score": "148 points",
                "user": "alecrosa",
                "comments": "68 comments"
              }
            },
            {
              "rank": "2.",
              "text": "High-Resolution 3D Human Digitization",
              "url": "https://shunsukesaito.github.io/PIFu/",
              "attrs": {
                "score": "54 points",
                "user": "hliyan",
                "comments": "5 comments"
              }
            }
            // ...
          ]
        }
      }
    }
  }
}
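Because the result is plain JSON, consuming it from any language is straightforward. Here is a small Python sketch that walks the response shown above (abbreviated to the first submission):

```python
import json

# The response from the Hacker News query above, abbreviated to one submission.
response = json.loads("""
{
  "data": {
    "url": {
      "htmlDocument": {
        "title": "Hacker News",
        "body": {
          "submissions": [
            {
              "rank": "1.",
              "text": "Collected Notes a note-taking blogging app I made",
              "url": "https://collectednotes.com/",
              "attrs": {
                "score": "148 points",
                "user": "alecrosa",
                "comments": "68 comments"
              }
            }
          ]
        }
      }
    }
  }
}
""")

# The aliased fields become ordinary dictionary keys.
for sub in response["data"]["url"]["htmlDocument"]["body"]["submissions"]:
    print(sub["rank"], sub["text"], "-", sub["attrs"]["score"])
```

Note how the aliases chosen in the query (submissions, rank, url, attrs) are exactly the keys the consuming code reads, which is why naming them well pays off.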
For more information, check out HTMLDocument and HTMLNode. To try your own queries, use the editor.