Python: BeautifulSoup not printing non-English characters correctly

Question

asked Aug 30, 2020 in Programming Languages by pythonuser (73.8k points)

I am trying to fetch video data (e.g. title, description, etc.) from YouTube and am using BeautifulSoup for that. If the title and description of the video are in English, the code works fine. But non-English titles and descriptions are not printed correctly. It just prints some unknown characters.

E.g. the URL https://www.youtube.com/watch?v=_XQaK3BfLxY has a title in Hindi: "एक सच्ची कहानी - कैसे लक्ष्मी जी बनी एक गरीब के घर में नौकरानी! | Lord Vishnu and Lakshmi Story". The following code prints "рдПрдХ рд╕рдЪреНрдЪреА рдХрд╣рд╛рдиреА - рдХреИрд╕реЗ рд▓рдХреНрд╖реНрдореА рдЬреА рдмрдиреА рдПрдХ рдЧрд░реАрдм рдХреЗ рдШрд░ рдореЗрдВ рдиреМрдХрд░рд╛рдиреА! | Lord Vishnu and Lakshmi Story" as the title.

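import urllib.request
from bs4 import BeautifulSoup

# Example header only; the original post does not show how `headers` was defined.
headers = {"User-Agent": "Mozilla/5.0"}
url = "https://www.youtube.com/watch?v=_XQaK3BfLxY"
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, 'html5lib')
print(soup)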

How can I fix this code so that it prints non-English characters correctly?

1 Answer

answered Aug 30, 2020 by pkumar81 (349k points)
selected Mar 8 by pythonuser

Best answer

Non-English characters are not printed correctly if BeautifulSoup cannot autodetect the encoding. Usually, BeautifulSoup is very good at autodetecting a document's encoding. It uses a sub-library called "Unicode, Dammit" to detect the encoding and convert the document to Unicode. However, if you know the document's encoding, you can pass it to the BeautifulSoup constructor as the from_encoding argument.
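As a minimal sketch of both routes (detection through Unicode, Dammit and an explicit from_encoding), the snippet below works on an in-memory byte string instead of a live page. The sample HTML and the variable names are made up for illustration, and it uses the built-in "html.parser" only so that it runs without html5lib installed.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

# Hypothetical sample page, encoded as UTF-8 bytes (stands in for a real response body).
html = "<html><head><title>एक सच्ची कहानी</title></head><body></body></html>"
data = html.encode("utf-8")

# Let Unicode, Dammit decode the bytes, suggesting UTF-8 as the first encoding to try.
dammit = UnicodeDammit(data, ["utf-8"])
detected = dammit.original_encoding   # the encoding that was actually used
decoded = dammit.unicode_markup       # the document as a Unicode string

# Pass the known encoding straight to the constructor instead of relying on detection.
soup = BeautifulSoup(data, "html.parser", from_encoding="utf-8")
title = soup.title.string
```

Checking original_encoding is a quick way to see which encoding was used before deciding whether an override is needed.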

Since you are trying to access YouTube, you can try "UTF-8" as the encoding.

Try the following code and it should work.

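import urllib.request
from bs4 import BeautifulSoup

# `headers` is the same dictionary used in the question.
url = "https://www.youtube.com/watch?v=_XQaK3BfLxY"
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, 'html5lib', from_encoding="utf-8")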
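If the output still looks garbled after this change, the cause may be on the display side rather than in the parsing. The sketch below uses only the standard library to show that UTF-8 bytes read as code page 866 produce the same characters that start the garbled title in the question. That the asker's console used code page 866 is an assumption; the post does not say.

```python
# The first two characters of the Hindi title from the question.
title = "एक"

# Encode as UTF-8, then decode the bytes as code page 866
# (an assumed console encoding, not confirmed by the post).
garbled = title.encode("utf-8").decode("cp866")

# Compare against the start of the garbled output quoted in the question.
matches_question = garbled == "рдПрдХ"
print(matches_question)  # True
```

If this turns out to be the cause, making the console use UTF-8 is one thing to try, for example by setting the PYTHONIOENCODING environment variable to utf-8 or by calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7 and later.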
