Scraping for transparency

The States of Guernsey’s website (www.gov.gg) has poor search facilities. Whether this is through incompetence or intent is unclear.

Openness, accountability and good corporate governance demand easy and fast access to information; this is particularly so with States Business – the meetings, Hansards, proposals etc.

A solution to this problem would be to scrape the content carefully and index it properly with a full-text search.

Accordingly, I made a tool to do this: it scrapes thousands of public documents from the government server, restructures them, extracts metadata and text, and indexes them properly with a full-text search tool. You can find the source code on GitHub: https://github.com/JBDAC/govgg-scraper

Almost all search engines crawl web servers to extract text to index.

Some sites do try to manage web crawler and scraper activity, though typically at the expense of search placement. The most prevalent method is the robots.txt file, placed in the root directory, which tells crawlers which pages to avoid. Additionally, the meta robots tag in the HTML head of individual pages can specify indexing and following preferences, such as noindex or nofollow. For non-HTML content, the X-Robots-Tag HTTP header performs a similar role. Another technique is the rel='nofollow' attribute on hyperlinks, which asks crawlers not to follow specific links. More restrictive methods include password-protecting certain areas of the site and disabling directory listings to hide files from crawlers. These methods offer varying degrees of control, but none of them compel a crawler to obey: they are guidelines only.
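
For illustration, a well-behaved crawler that wanted to honour these signals might check them along these lines before fetching a page (a minimal Python sketch; whether to obey the result is still entirely up to the client):

import urllib.request
import urllib.robotparser

URL = "https://www.gov.gg/article/163276/States-Meeting-information-index"

# robots.txt: site-wide guidance on which paths crawlers should avoid
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.gov.gg/robots.txt")
rp.read()
print("robots.txt allows crawling:", rp.can_fetch("*", URL))

# X-Robots-Tag: the HTTP-header equivalent of the meta robots tag
with urllib.request.urlopen(URL) as resp:
    print("X-Robots-Tag header:", resp.headers.get("X-Robots-Tag") or "not set")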

In any event, gov.gg uses none of these, and nor should it: it is a public service website whose content is in the public domain. Its problem is the abysmal searching. You can check the downloaded files with:

grep -RilE "nofollow|noindex|robots.txt" ./

So we’re free to crawl. Having grabbed the files (which are in the public domain anyway), we can use Recoll, a powerful open-source full-text search tool that indexes many document formats and lets you quickly locate specific content within a collection. To make the data usable by other people, we run Recoll behind a web front end.
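
If you want to reproduce the indexing side yourself, the outline is roughly this (a sketch assuming a stock Recoll install; the path is mine, and the web front end, a separate Recoll web UI project, is configured independently):

# in ~/.recoll/recoll.conf: point Recoll at the download root
topdirs = /run/media/jbdac/nvme/gov/all

# build (or update) the index
recollindex

# quick sanity check from the command line
recollq Hansard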

What follows downloads the States Business material from the gov.gg web server so that it can be properly indexed with a full-text search. We convert the flat file structure on the States web server into a date hierarchy. Steps:

Create a folder such as: /run/media/jbdac/nvme/gov/all/
This is our root.

Copy the Python scripts there and run them from a terminal opened at this location. The scripts will:

Download the States Business index HTML page into our root folder and extract the 300+ links (as of 21/06/2023) to the relevant ‘Meeting files’, storing them in a links.txt file.
Parse the links.txt file and download each ‘Meeting file’ from the States web server, saving it in a newly created yyyy/mm/dd folder.
Recursively process this new hierarchy, extracting from each meeting file the links to its associated documents (Hansards and so on) and downloading them into the relevant dated directories (a minimal sketch of the date-folder idea follows this list).
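
To make the restructuring concrete, the date-folder idea boils down to something like this (a minimal sketch only; the helper name and the dd/mm/yyyy date format are assumptions, the real logic lives in the step scripts):

import os
from datetime import datetime

def dated_folder(root, date_text, fmt="%d/%m/%Y"):
    """Map e.g. '21/06/2023' to <root>/2023/06/21, creating the folders if needed."""
    d = datetime.strptime(date_text, fmt)
    path = os.path.join(root, f"{d:%Y}", f"{d:%m}", f"{d:%d}")
    os.makedirs(path, exist_ok=True)
    return path

print(dated_folder("/run/media/jbdac/nvme/gov/all", "21/06/2023"))
# -> /run/media/jbdac/nvme/gov/all/2023/06/21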

We start at this URL:

https://www.gov.gg/article/163276/States-Meeting-information-index#selectednavItem191711

and access linked files.
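
As an illustration of the first step, pulling candidate meeting links out of that index page might look something like this (a sketch only, using requests and BeautifulSoup; the 'article' filter is an assumption, the real selection rules are in step1_generatelinks.py):

import requests
from bs4 import BeautifulSoup

INDEX = "https://www.gov.gg/article/163276/States-Meeting-information-index"

html = requests.get(INDEX, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    if "article" in href:                      # crude filter, not the script's actual rule
        if href.startswith("/"):
            href = "https://www.gov.gg" + href
        links.add(href)

with open("links.txt", "w") as f:
    f.write("\n".join(sorted(links)))

print(len(links), "links written to links.txt")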

To accomplish all of this, run (from a terminal opened in the same directory):

python step1_generatelinks.py
(edit the links.txt file in a text editor to limit the dates for download; you may optionally specify a year or a range of years as parameters)
python step2_download.py
python step3_getkids.py
python step4_removeJBDAC.py

The Python files are commented.


I maintain my own index of the States’ Business section of their server, which means you can search things like Hansards properly. I have a tunnel into it so that it can be accessed from the outside internet.

This runs on a tiny dedicated server in my office. It uses ~5.5 W of power; when you make a query, power consumption doubles and one of the eight cores services your request. The data is tunnelled to it via Pagekite.

Yep, that’s it – $200 for better searching than the whole government website.

You may try it out here: h t t p :// govgg . pagekite . me/recoll/

(remove the spaces)

‘govgg’ identifies the service via a subdomain on the pagekite.me servers, which re-route the traffic to me. I handle the request and send you back the search results as a webpage. When you click on a file to view it, I redirect you (where possible) to the original on the real gov.gg server.
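
For what it’s worth, the local end of that arrangement is just the Pagekite client pointing the subdomain at whichever port the search front end listens on, roughly as below (the port number here is an assumption, not my actual setup):

pagekite.py 8080 govgg.pagekite.me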

If you do, there are some settings you really should make in your browser:

Because the data is stored locally to me, and your browser expects the files to be local to you, you have to set the correct location and the folder depth. You only have to do this once for each device you want to use the search tool on.

You must do this, otherwise it won’t work properly.

Enter: h t t p :// govgg . pagekite . me/recoll/settings

This will bring up the following dialog box:

1) Change the default folder depth from 2 to 5.
2) Change the ‘Locations’ field to h t t p s :// govgg . pagekite . me

Press <Save>

From then on, use: h t t p :// govgg . pagekite . me/recoll/

to access the search tool. You need to include the trailing slash, or the page will be malformed.

This is what it looks like:

You can enter complex searches, use date ranges, and drill into the document tree with the folder option (though that is typically date-oriented, too).
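
For instance, Recoll’s query language supports phrases, field filters and date ranges, so searches along these lines should work (illustrative queries only):

hansard date:2022-01-01/2022-12-31        (restrict to a date range)
"states of deliberation" dir:2023         (exact phrase, restricted to the 2023 folder tree)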

It should be more or less up to date. If not, drop me a message via the contacts page.