The problem with browsing user sites on tilde.institute using the stats page is that most of the sites are empty. The following is a catalog of non-empty sites.
The sites were selected from the more than 900 listed on the stats page on 2024-07-16. This is the second catalog of this sort that I have made; the first was the “random SDF homepages” feature on my own SDF homepage. Here, I decided to use a local LLM instead of a set of heuristics to determine which sites had content. I wrote a Python script that used several common Python libraries to retrieve and parse the sites’ index pages, and ollama-python with the Llama 3 8B model to analyze them. The script is, as Meta’s license would have it, “Built with Meta Llama 3.”
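A minimal sketch of the retrieval step, with requests and BeautifulSoup standing in for the exact libraries used (the URL pattern for user pages is also an assumption):

```
import requests
from bs4 import BeautifulSoup

def fetch_index(user: str) -> BeautifulSoup | None:
    # User pages are assumed to live at https://tilde.institute/~user/.
    url = f"https://tilde.institute/~{user}/"
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(resp.text, "html.parser")
```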
The prompt was:
Is a webpage with the following text likely empty, a placeholder, a test page, or otherwise not worth listing in a catalog because of sparse content? Truncated text is okay. Answer "YES" or "NO".
```
(Page text goes here.)
```
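With ollama-python, the classification call itself is short. A sketch, assuming the stock “llama3” model tag:

```
import ollama

PROMPT = (
    'Is a webpage with the following text likely empty, a placeholder, '
    'a test page, or otherwise not worth listing in a catalog because of '
    'sparse content? Truncated text is okay. Answer "YES" or "NO".\n'
    '```\n{text}\n```'
)

def classify(page_text: str) -> str:
    # generate() returns the model's completion under the "response" key.
    result = ollama.generate(model="llama3", prompt=PROMPT.format(text=page_text))
    return result["response"].strip()
```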
The script kept sites whose front pages either had the network answer “NO” or contained at least one of the HTML elements <audio>, <iframe>, <img>, <script>, or <video>.
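In code, that keep rule comes down to something like this sketch:

```
from bs4 import BeautifulSoup

MEDIA_TAGS = ["audio", "iframe", "img", "script", "video"]

def keep(soup: BeautifulSoup, answer: str) -> bool:
    # find() with a list matches the first tag whose name is in the list.
    has_media = soup.find(MEDIA_TAGS) is not None
    return has_media or answer.strip().upper().startswith("NO")
```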
Processing the 900-odd pages took an hour on a Ryzen 5 3500 CPU with Ollama configured to use only three threads (effectively three cores). The prompt ended up excluding sites that were directory listings and including some sites with only a single line of text. Because of Llama 3’s limited context length of 8192 tokens, my script did not feed the network raw HTML. Instead, text content was extracted from the pages, truncated to 6000 characters, and inserted into the prompt. This no doubt distorted the results.
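A sketch of the extraction and truncation; get_text() is an assumption, since any number of ways to pull visible text would do:

```
from bs4 import BeautifulSoup

def page_text(soup: BeautifulSoup, limit: int = 6000) -> str:
    # Collapse the page to visible text and cut it to fit the context window.
    return soup.get_text(separator=" ", strip=True)[:limit]
```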
Initially, I made a typo in the prompt: “Trucated text is okay.” The number of sites increased from 134 to 141 when I fixed the typo and reran the job.
The generated list is presented below without edits. It is also available as JSON.