Downloading Links from a Website

Let’s say you found a treasure trove of information, pictures, documentation, or whatever else on the web and you just want to download all of it at one time.  Yes, you can just use software that someone has already written to accomplish this task, but you’re here because you’re going to script it yourself!  It’s pretty easy once you see it done.  Let me show you.

First, let’s import the libraries that we will use.  We need ‘BeautifulSoup‘ and ‘urllib‘, along with ‘os‘.

from bs4 import BeautifulSoup
import os
from urllib.request import urlretrieve, urlopen

Creating Variables

Now we need to set the variables for our script.

url – The page that you will scrape your links from.  In my example I want to download a bunch of old tube radio manuals from this page.
base_url – In my example the links on the page are relative to the main domain, so I need to define that separately.
ext – The extension of the files you want.  You do not want to download every little thing on the page.  In my example I just want ‘.pdf’ files.
dir_dl – The location you want your downloaded files to go.
log_file – We log everything we download so that if you run the script again it does not download the same files multiple times.
downloaded and lst_link – Blank lists that we will fill in as we work.




url = 'http://www.americanradiohistory.com/Service_Magazine.htm'   # page to scrape links from
base_url = 'http://www.americanradiohistory.com/'                  # links on the page are relative to this domain
ext = '.pdf'                                                       # only download files with this extension

dir_dl = 'c://python_dl//'            # folder the downloads go into
log_file = dir_dl + 'log_file.dat'    # record of everything already downloaded

downloaded = []   # links already downloaded
lst_link = []     # every link found on the page

mk_lst()

We need to create a function that builds a list of every link on the page.  First we need an object for the page itself; that is what the html_page variable is.  Next we use BeautifulSoup to parse through the page and grab all of its content.

Now we need to find every link.  We do this with ‘soup.findAll('a')‘.  Then we use a for loop to append each link’s href to the list ‘lst_link‘.
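If you have not used BeautifulSoup before, here is a tiny illustration, separate from the script, of what ‘findAll('a')‘ and ‘get('href')‘ give back.  The HTML snippet is made up just for this example:

from bs4 import BeautifulSoup

# Made-up snippet, just to show what findAll('a') returns.
sample = '<a href="manual1.pdf">Manual 1</a> <a href="index.htm">Home</a> <a name="top">Top</a>'
soup = BeautifulSoup(sample, 'html.parser')

for link in soup.findAll('a'):
    print(link.get('href'))   # prints manual1.pdf, index.htm, then None (no href attribute)

Notice that an anchor without an href gives back None, which is one reason the download function below wraps the link in str() before checking the extension.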


def mk_lst():
    global lst_link
    html_page = urlopen(url)                  # open the page we want to scrape
    soup = BeautifulSoup(html_page, 'lxml')   # parse the page content
    for link in soup.findAll('a'):            # find every anchor tag
        lst_link.append(link.get('href'))     # stash its href in lst_link

mk_dir()

This is just a simple utility function to make a directory as needed. I use this function in most of my scripts. It comes in handy.

def mk_dir(dir_name):
    if not os.path.exists(dir_name):
        print('Making - '+dir_name)
        os.makedirs(dir_name)
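
One way to make sure the download folder exists is to call it on the ‘dir_dl’ variable defined above before starting the downloads.  A small usage sketch:

mk_dir(dir_dl)   # create c://python_dl// if it does not exist yet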

dl_list()

This is the main engine that downloads the links.  First we create the log file if it does not exist already.  If it does exist, we read through it and place every link it contains into the ‘downloaded’ list.  This is needed so that we don’t download the entire page over and over every time we run the script; it lets the script skip links that have already been downloaded.

Then we loop through the ‘lst_link’ list.  If a link is not already in ‘downloaded’ and it ends with our preferred extension (‘.pdf’), we use the ‘urlretrieve’ function to download it.

After that we add the link to the ‘downloaded’ list and write it to the log file we defined at ‘log_file’.

def dl_list():
    # Create the log file if it does not exist yet.
    if not os.path.isfile(log_file):
        downloaded = []
        f = open(log_file, 'w')
        f.close()
    # Otherwise read it and build the list of links we already have.
    else:
        f = open(log_file, 'r')
        downloaded = f.read()
        downloaded = downloaded.split('\n')
        downloaded = list(filter(None, downloaded))
        f.close()

    for link in lst_link:
        if link not in downloaded:
            print(link)
            # Only grab files with the extension we want ('.pdf').
            if str(link).endswith(ext):
                urlretrieve((base_url + link),
                            (dir_dl + link.split('/')[-1]))
            # Remember the link and log it so we never fetch it twice.
            downloaded.append(link)
            with open(log_file, 'a') as log:
                log.write(str(link) + '\n')

Using the Script

From inside your Python interpreter, run the ‘mk_lst()’ function to parse through the site and create the list of links.

Then run the ‘dl_list()’ function to start the downloads.  The interpreter will list the downloads as they happen.  If for whatever reason the script crashes, you can just run it again; because every link is logged as the script works through it, nothing will be downloaded a second time.
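
Putting it all together, a run might look like this.  I also call ‘mk_dir(dir_dl)’ first so the download folder exists; apart from that it is just the two calls described above:

mk_dir(dir_dl)   # make sure the download folder exists
mk_lst()         # gather every link on the page into lst_link
dl_list()        # download each new .pdf and log it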

The downloads will show up in the folder defined at ‘dir_dl’.  In this example, it is ‘c:\python_dl\’.  Have fun.  Please send any feedback or comments to Contact@DREAM-Enterprise.com.  Thank you.