This isn’t related to parallel computing, but I figured I’d share it here until I create another blog for miscellaneous topics.


Let’s say, hypothetically, you wanted to download all of the high-quality Super Nintendo ROMs from a website. The site is simply a list of links, each pointing directly at a file. Since the directory is flat, you could run a basic wget command against the URL, à la:

wget -m -np -c -w 3 -R "index.html*" "https://rom-site.blah/path/to/roms/"


However, this would give you every game, regardless of quality. Fortunately, ROM enthusiasts use suffixes to denote the status of ROMs:

[a] Alternate
[p] Pirate
[b] Bad Dump     (avoid these, they may not work!)
[t] Trained
[f] Fixed
[T-] OldTranslation
[T+] NewerTranslation
[h] Hack
(-) Unknown Year
[o] Overdump
[!] Verified Good Dump
(M#) Multilanguage (# of Languages)
(###) Checksum
(??k) ROM Size
ZZZ_ Unclassified
(Unl) Unlicensed

So we just want the ones with the [!] suffix. You may also want to filter on the (U) region tag to get only the US releases.
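
As a rough illustration of that rule, here is a minimal Python sketch that pulls the (...) and [...] tags out of a file name and keeps only verified good dumps, optionally limited to one region. The helper names tags and is_wanted are made up for this example, and "Some Game" is a placeholder title, not a file from the listing.

```python
import re

# Matches bracketed or parenthesised tags such as "[!]", "[b1]", "(U)", "(J)".
TAG_RE = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def tags(filename):
    """Return every (...) and [...] tag found in a ROM file name, in order."""
    return TAG_RE.findall(filename)


def is_wanted(filename, region="(U)"):
    """True for verified good dumps; pass region=None to accept any region."""
    found = tags(filename)
    if "[!]" not in found:
        return False
    return region is None or region in found


print(tags("2020 Super Baseball (U) [b1].zip"))  # ['(U)', '[b1]']
print(is_wanted("Some Game (U) [!].zip"))        # True
print(is_wanted("Some Game (J) [!].zip"))        # False
```

Comparing whole tags rather than searching for a substring avoids accidental matches inside longer tags.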

There is certainly some way of specifying this to wget with a regular expression, but I am definitely no wget or regex pro, so after a few minutes of unsuccessful attempts, I gave up and wrote a short Python script to get me what I wanted using Beautiful Soup.

Before writing any code I looked at the source of the target URL, and sure enough, the page was pretty much just a list of anchor tags, each with a direct link to a ROM file. Perfect.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Rom site</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h1>Index of /public/rom/SNES/</h1><hr><pre><a href="../">../</a>
        <a href="10%20Yard%20Fight%20%28A%26S%20NES%20Hack%29.zip">10 Yard Fight (A&amp;S NES Hack).zip</a>                   29-May-2019 21:02     56K
        <a href="1997%20New%20Year%20FD%20%28PD%29.zip">1997 New Year FD (PD).zip</a>                          29-May-2019 21:02    672K
        <a href="2.68%20MHz%20Demo%20%28PD%29%20%5Bo1%5D.zip">2.68 MHz Demo (PD) [o1].zip</a>                        29-May-2019 21:02    407K
        <a href="2.68%20MHz%20Demo%20%28PD%29.zip">2.68 MHz Demo (PD).zip</a>                             29-May-2019 21:02    407K
        <a href="2.noheader.zip">2.noheader.zip</a>                                     29-May-2019 21:02      1M
        <a href="2020%20Super%20Baseball%20%28J%29%20%5Ba1%5D%5BhI%5D.zip">2020 Super Baseball (J) [a1][hI].zip</a>               29-May-2019 21:02    637K
        <a href="2020%20Super%20Baseball%20%28J%29%20%5Bh1C%5D.zip">2020 Super Baseball (J) [h1C].zip</a>                  29-May-2019 21:02    634K
        <a href="2020%20Super%20Baseball%20%28J%29%20%5BhI%5D.zip">2020 Super Baseball (J) [hI].zip</a>                   29-May-2019 21:02    637K
        <a href="2020%20Super%20Baseball%20%28J%29.zip">2020 Super Baseball (J).zip</a>                        29-May-2019 21:02    634K
        <a href="2020%20Super%20Baseball%20%28U%29%20%5Bb1%5D.zip">2020 Super Baseball (U) [b1].zip</a>                   29-May-2019 21:02    612K
        <a href="2020%20Super%20Baseball%20%28U%29.zip">2020 Super Baseball (U).zip</a>                        29-May-2019 21:02    612K
        <a href="3%20Ninjas%20Kick%20Back%20%28U%29%20%5BT%2BFre1.00_GenerationIX%5D.zip">3 Ninjas Kick Back (U) [T+Fre1.00_GenerationIX]..&gt;</a> 29-May-2019 21:02      1M
        <a href="3%20Ninjas%20Kick%20Back%20%28U%29%20%5BT%2BGer1.00_Star%5D.zip">3 Ninjas Kick Back (U) [T+Ger1.00_Star].zip</a>        29-May-2019 21:02      1M
        <a href="3%20Ninjas%20Kick%20Back%20%28U%29.zip">3 Ninjas Kick Back (U).zip</a>                         29-May-2019 21:02      1M
        <a href="32768%20Color%20Demo%20by%20Joshua%20Cain%20%28PD%29.zip">32768 Color Demo by Joshua Cain (PD).zip</a>           29-May-2019 21:02     951
        <a href="3D%20Stereo%20World%20-%20Find%20the%20Hidden%20Images%20%28PD%29.zip">3D Stereo World - Find the Hidden Images (PD).zip</a>  29-May-2019 21:02    645K
        <a href="3x3%20Eyes%20-%20Juuma%20Houkan%20%28J%29%20%5Bb1%5D.zip">3x3 Eyes - Juuma Houkan (J) [b1].zip</a>               29-May-2019 21:02      1M
        <a href="3x3%20Eyes%20-%20Juuma%20Houkan%20%28J%29%20%5Bf1%5D.zip">3x3 Eyes - Juuma Houkan (J) [f1].zip</a>               29-May-2019 21:02      1M
        <a href="3x3%20Eyes%20-%20Juuma%20Houkan%20%28J%29%20%5Bh1C%5D.zip">3x3 Eyes - Juuma Houkan (J) [h1C].zip</a>              29-May-2019 21:02      1M
        <a href="3x3%20Eyes%20-%20Juuma%20Houkan%20%28J%29%20%5Bh2C%5D.zip">3x3 Eyes - Juuma Houkan (J) [h2C].zip</a>              29-May-2019 21:02      1M
        <a href="3x3%20Eyes%20-%20Juuma%20Houkan%20%28J%29.zip">3x3 Eyes - Juuma Houkan (J).zip</a>                    29-May-2019 21:02      1M
        <a href="3x3%20Eyes%20-%20Seima%20Kourinden%20%28J%29%20%5Bb1%5D.zip">3x3 Eyes - Seima Kourinden (J) [b1].zip</a>            29-May-2019 21:02    514K
        <a href="3x3%20Eyes%20-%20Seima%20Kourinden%20%28J%29%20%5Bo1%5D.zip">3x3 Eyes - Seima Kourinden (J) [o1].zip</a>            29-May-2019 21:02    852K
        <a href="3x3%20Eyes%20-%20Seima%20Kourinden%20%28J%29.zip">3x3 Eyes - Seima Kourinden (J).zip</a>                 29-May-2019 21:02    514K
        <a href="4%20Nin%20Shougi%20%28J%29%20%5Bh1C%5D.zip">4 Nin Shougi (J) [h1C].zip</a>                         29-May-2019 21:02    267K
        <a href="4%20Nin%20Shougi%20%28J%29.zip">4 Nin Shougi (J).zip</a>                               29-May-2019 21:02    267K
        <a href="4%20Puzzle%20%28PD%29.zip">4 Puzzle (PD).zip</a>                                  29-May-2019 21:02     12K
        <a href="46%20Okunen%20Monogatari%20-%20Harukanaru%20Eden%20he%20%28J%29%20%5Bh1C%5D.zip">46 Okunen Monogatari - Harukanaru Eden he (J) [..&gt;</a> 29-May-2019 21:02      1M
        <a href="46%20Okunen%20Monogatari%20-%20Harukanaru%20Eden%20he%20%28J%29%20%5Bo1%5D.zip">46 Okunen Monogatari - Harukanaru Eden he (J) [..&gt;</a> 29-May-2019 21:02      1M
        <a href="46%20Okunen%20Monogatari%20-%20Harukanaru%20Eden%20he%20%28J%29.zip">46 Okunen Monogatari - Harukanaru Eden he (J).zip</a>  29-May-2019 21:02      1M
        <a href="64%20Bit%20First%20Heart%20Club%20by%20Szuzy%20%28PD%29.zip">64 Bit First Heart Club by Szuzy (PD).zip</a>          29-May-2019 21:02    693K
        <a href="64MBIT%20SWCDX2%20Memory%20Explorer%20by%20neviksti%20%28PD%29.zip">64MBIT SWCDX2 Memory Explorer by neviksti (PD).zip</a> 29-May-2019 21:02     96K
    <!-- tons 
    more 
    roms -->
</body>
</html>
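
Two details in that listing are easy to miss: the href values are percent-encoded, and long display names are cut off with "..>", so a tag at the end of a long name may not appear in the visible text at all. The sketch below, using only the standard library, decodes one href copied from the listing to recover the full file name and builds the absolute URL. The base URL is the same placeholder used in the wget command above.

```python
from urllib.parse import unquote, urljoin

base = "https://rom-site.blah/path/to/roms/"  # placeholder, not a real site

# One anchor from the listing: its href and its (truncated) display text.
href = "3%20Ninjas%20Kick%20Back%20%28U%29%20%5BT%2BFre1.00_GenerationIX%5D.zip"
shown = "3 Ninjas Kick Back (U) [T+Fre1.00_GenerationIX]..>"

name = unquote(href)  # decode %20, %28, %5B and friends
print(name)           # 3 Ninjas Kick Back (U) [T+Fre1.00_GenerationIX].zip

# urljoin resolves the relative href against the directory URL.
print(urljoin(base, href))
```

Matching tags against the decoded href rather than the display text means truncated entries are not silently skipped.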


After peeking at the HTML, I know I just need to extract the links from all the anchors, but only collect the ones containing the [!] suffix. This can be done in fewer than 15 lines of Python:


First, install beautifulsoup4 and requests:
$ pip3 install beautifulsoup4 requests

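# good_roms.py
from urllib.parse import unquote, urljoin

import requests
from bs4 import BeautifulSoup

weburl = 'https://rom-site.blah/path/to/roms/'  # placeholder URL
data = requests.get(weburl)
data.raise_for_status()  # stop early on a 404 or similar
soup = BeautifulSoup(data.text, features='html.parser')

links = []
for anch in soup.find_all('a'):
    href = anch.get('href') or ''
    # Check the decoded href: it is percent-encoded in the page, and
    # long display names are truncated, so the link text can miss the tag.
    if '[!]' in unquote(href):
        links.append(urljoin(weburl, href))

# One URL per line, ready to be fed to wget -i
for link in links:
    print(link)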

Now I can just run the program and redirect the output to a text file.

python3 good_roms.py > good_roms.txt


Now that I have a text file with the URLs of all the good ROMs, I can give that file directly to wget and it will download just the good ones using the -i input file switch:

wget -c -i good_roms.txt
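
If wget isn’t available, roughly the same thing can be done from Python. This is only a sketch under a few assumptions: the helper names local_name and download_all are made up here, the 3-second pause mirrors the -w 3 from the first wget command, and skipping files that already exist is a crude stand-in for -c (it does not resume partial downloads).

```python
import os
import time
import urllib.request
from urllib.parse import unquote, urlsplit


def local_name(url):
    """Derive a local file name from the last, percent-decoded path segment."""
    return unquote(os.path.basename(urlsplit(url).path))


def download_all(list_path, delay=3.0):
    """Fetch every URL listed in list_path, skipping files already on disk."""
    with open(list_path) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    for url in urls:
        target = local_name(url)
        if os.path.exists(target):
            continue  # already downloaded; partial files are NOT resumed
        urllib.request.urlretrieve(url, target)
        time.sleep(delay)  # be polite to the server between requests
```

Calling download_all("good_roms.txt") from the directory where the ROMs should land would then walk the list one file at a time.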


That’s it! Make sure you have enough space for all the ROMs and watch them pile up one at a time:

--2019-01-25 21:27:02--  https://rom-site.blah/path/to/roms/YourFavoriteRom[!].bin
Reusing existing connection to [rom-site.blah]:443.
HTTP request sent, awaiting response... 200 OK
Length: 2097152 (2.0M) [application/octet-stream]
Saving to: ‘YourFavoriteRom[!].bin’

YourFavoriteRom[!].bin 100%[========================>]   2.00M   513KB/s    in 3.9s    

2019-01-25 21:27:09 (513 KB/s) - ‘YourFavoriteRom[!].bin’ saved [2097152/2097152]

FINISHED --2019-01-25 21:29:41--
Total wall clock time: 38m 47s
Downloaded: 693 files, 888M in 30m 38s (495 KB/s)