+4 votes
in Programming Languages by (14.3k points)

I am trying to fetch video data (e.g. title, description, etc.) from YouTube and am using BeautifulSoup for that. If the title and description of the video are in English, the code works fine. But non-English titles and descriptions are not printed correctly. It just prints some unknown characters.

E.g. the URL https://www.youtube.com/watch?v=_XQaK3BfLxY has a title in Hindi: "एक सच्ची कहानी - कैसे लक्ष्मी जी बनी एक गरीब के घर में नौकरानी! | Lord Vishnu and Lakshmi Story". The following code prints "рдПрдХ рд╕рдЪреНрдЪреА рдХрд╣рд╛рдиреА - рдХреИрд╕реЗ рд▓рдХреНрд╖реНрдореА рдЬреА рдмрдиреА рдПрдХ рдЧрд░реАрдм рдХреЗ рдШрд░ рдореЗрдВ рдиреМрдХрд░рд╛рдиреА! | Lord Vishnu and Lakshmi Story" as the title.

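import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; my real headers are not shown here
url = "https://www.youtube.com/watch?v=_XQaK3BfLxY"
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, 'html5lib')
print(soup)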

How can I fix this code so that it prints non-English characters correctly? 

1 Answer

0 votes
by (24.7k points)

Non-English characters can be printed incorrectly when BeautifulSoup fails to autodetect the document's encoding. BeautifulSoup is usually very good at autodetecting a document's encoding: it uses a sub-library called "Unicode, Dammit" to detect the encoding and convert the document to Unicode. However, if you already know the document's encoding, you should pass it to the BeautifulSoup constructor as the from_encoding argument.
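
If you are not sure which encoding a page uses, one place to look is the charset declared in the HTTP Content-Type header. The sketch below is only an illustration: declared_charset is a hypothetical helper name, the UTF-8 fallback is an assumption, and the headers object is built by hand so the example runs without a network call. With urllib, response.headers is the same kind of object, so it should be usable in the same way.

```python
from email.message import Message


def declared_charset(headers, default="utf-8"):
    """Return the charset named in the Content-Type header, or a default.

    `headers` is expected to behave like response.headers from urllib,
    which is a subclass of email.message.Message.
    """
    charset = headers.get_content_charset()
    return charset or default


# Hand-built stand-in for response.headers (hypothetical header value).
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=UTF-8"

print(declared_charset(fake_headers))
```

The value returned here could then be passed as from_encoding. A page can declare one charset and actually use another, so treat the header as a hint and not a guarantee.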

Since you are fetching a YouTube page, you can try "utf-8" as the encoding.

The following code should work:

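import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; use your own headers
url = "https://www.youtube.com/watch?v=_XQaK3BfLxY"
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, 'html5lib', from_encoding="utf-8")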
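
To see why the encoding matters, here is a small sketch that reproduces the start of the garbled title without BeautifulSoup or a network call. The cp866 codec is my assumption: the garbled text in the question looks like UTF-8 bytes decoded with that code page, but I cannot tell from the post whether the wrong decoding happened during autodetection or somewhere else. simulate_wrong_decoding is a hypothetical helper name.

```python
def simulate_wrong_decoding(text, wrong_codec="cp866"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


# First word of the Hindi title from the question.
original = "एक"

# Decoding UTF-8 bytes as cp866 (assumed) gives the garbled form.
garbled = simulate_wrong_decoding(original)
print(garbled == "рдПрдХ")

# Decoding the same bytes as UTF-8 recovers the original text.
repaired = original.encode("utf-8").decode("utf-8")
print(repaired == original)
```

The bytes are the same in both cases. Only the codec used to decode them changes, which is what from_encoding="utf-8" pins down.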

...