+1 vote
in Programming Languages by (26.1k points)

I am trying to find all <a> tags that have an "href" attribute on a webpage using the following BeautifulSoup code, but it also returns many tags I do not want. What am I missing in the code?

from bs4 import BeautifulSoup
import urllib.request as ur

req = ur.Request(url, None, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
rs = ur.urlopen(req)

soup = BeautifulSoup(rs, 'html.parser')
for sp in soup.find_all('a'):
    print(sp.text)

1 Answer

+3 votes
by (121k points)
Best answer

You need to pass the parameter "href=True" to the find_all() function. Without it, find_all('a') returns every <a> tag, whether or not it has an href attribute, which is why you are getting the unwanted values.

Here is the modified code:

from bs4 import BeautifulSoup
import urllib.request as ur

# `url` is assumed to hold the address of the page to scrape, as in the question
req = ur.Request(url, None, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
rs = ur.urlopen(req)

soup = BeautifulSoup(rs, 'html.parser')
# href=True keeps only the <a> tags that actually carry an href attribute
for sp in soup.find_all('a', href=True):
    print(sp.text)
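
To see the effect of href=True without fetching a page, here is a minimal sketch that runs on an inline HTML string. The snippet and the variable names (html, all_links, with_href, hrefs) are made up for illustration and are not from the original page; it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration only.
html = """
<a href="https://example.com/home">Home</a>
<a name="top">Named anchor, no href</a>
<a href="/about">About</a>
<a>Bare tag</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> tag, with or without an href attribute.
all_links = soup.find_all("a")

# Only the <a> tags that have an href attribute.
with_href = soup.find_all("a", href=True)

# Collect the link targets; indexing is safe because href is guaranteed here.
hrefs = [a["href"] for a in with_href]

for a in with_href:
    print(a["href"], a.get_text(strip=True))
```

With this snippet, all_links holds four tags while with_href holds only two, and the loop prints "https://example.com/home Home" and "/about About". Reading a["href"] instead of the tag text is often what you want when collecting links, since the text of a link can be empty.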

