How to download in-text images with Beautiful Soup

0 votes

I'm trying to use Beautiful Soup and requests to write a website scraper in Python. I can easily collect all of the text I want, but some of the text I'm trying to download contains inline images that are important. I want to replace each image with its title and add that to a string I can parse later, but I'm not sure how to do this.

This is an example of the kind of HTML I'm trying to parse:

    <td colspan="3"><b>"Assemble under Siegfried!"</b> 
        <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
        </a> This unit gains +10 attack for each 
        <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
        </a> and 
        <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
        </a> ally besides this unit.
    </td>

From this HTML I want to pull:

"Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit."

Using the normal get_text() method does not include the titles of the images, which is the problem.

Sep 10, 2018 in Python by bug_seeker
• 15,510 points
5,452 views

1 answer to this question.

0 votes

Try this:

html_data = """ <td colspan="3"><b>"Assemble under Siegfried!"</b> 
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
    </a> This unit gains +10 attack for each 
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
    </a> and 
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
    </a> ally besides this unit.
</td>"""
from bs4 import BeautifulSoup
html = BeautifulSoup(html_data, "html.parser")

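# Start with the bold heading, then add each link's title and the text that follows it.
texts = [html.find("b").get_text()]
for a in html.find_all("a"):
    texts.append(a.attrs.get("title"))
    # Skip past the <img> and the whitespace inside the link to reach the text after </a>.
    texts.append(a.next_element.next_element.next_element.strip())
print(" ".join(texts))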
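The chain of next_element calls depends on the exact whitespace in the markup, so it can break if the page layout changes slightly. As an alternative sketch (not the method from the answer itself), each image link can be swapped for its title before calling get_text(). The helper name text_with_image_titles is made up for illustration, and the HTML is a trimmed copy of the markup from the question:

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (some attributes trimmed).
html_data = """<td colspan="3"><b>"Assemble under Siegfried!"</b>
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png">
    </a> This unit gains +10 attack for each
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png">
    </a> and
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png">
    </a> ally besides this unit.
</td>"""


def text_with_image_titles(markup):
    """Return the text of `markup` with each image link replaced by its title."""
    soup = BeautifulSoup(markup, "html.parser")
    for link in soup.find_all("a", class_="image"):
        img = link.find("img")
        # Prefer the link's title; fall back to the image's alt text.
        label = link.get("title") or (img.get("alt") if img else "") or ""
        link.replace_with(label)
    # Collapse the whitespace and newlines left over from the markup.
    return " ".join(soup.get_text().split())


result = text_with_image_titles(html_data)
print(result)
```

Because the links are replaced inside the parsed tree, get_text() then sees the titles as ordinary text, and the final split/join only normalizes spacing.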

I'm not sure exactly what you want, but I suppose you need the attrs of the Tag.

Example:

from bs4 import BeautifulSoup

html = BeautifulSoup(html_data, "html.parser")
for a in html.find_all("a"):
    print(a.attrs.get("title"))

Output:

CONT
Black
White
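If some links might lack a title, a small sketch like the following may help. It uses a made-up two-link snippet (not the page from the question) and falls back to the alt text of the nested img:

```python
from bs4 import BeautifulSoup

# Tiny made-up snippet: the second link has no title attribute.
snippet = (
    '<a href="/a.png" title="CONT"><img alt="CONT" src="/a.png"></a>'
    '<a href="/b.png"><img alt="Black" src="/b.png"></a>'
)
soup = BeautifulSoup(snippet, "html.parser")

titles = []
for a in soup.find_all("a"):
    img = a.find("img")
    # a.get("title") returns None when the attribute is missing,
    # whereas a["title"] would raise KeyError.
    title = a.get("title")
    if title is None and img is not None:
        title = img.get("alt")
    titles.append(title)

print(titles)
```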

If you want to download the images:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

cdn_url = "http://some.com/"  # root URL of the site that serves the static content
html = BeautifulSoup(html_data, "html.parser")
for img in html.find_all("img"):
    # img_response.content holds the image bytes, which should be saved to a file
    img_response = requests.get(urljoin(cdn_url, img.attrs.get("src")))
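To actually write the images to disk, something along these lines might work. This is only a sketch: the function names image_urls and download_images and the "images" output directory are made up for illustration, and cdn_url is still the placeholder root URL rather than a real site.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Placeholder root URL; replace it with the real site.
cdn_url = "http://some.com/"


def image_urls(markup, base_url):
    """Return the absolute URL of every <img> in `markup` that has a src."""
    soup = BeautifulSoup(markup, "html.parser")
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]


def download_images(markup, base_url, out_dir="images"):
    """Fetch each image and write it into `out_dir` (hypothetical directory name)."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in image_urls(markup, base_url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail on HTTP errors instead of saving an error page
        # Use the last path segment of the URL as the file name.
        filename = os.path.basename(urlparse(url).path)
        path = os.path.join(out_dir, filename)
        with open(path, "wb") as f:
            f.write(response.content)
        saved.append(path)
    return saved


sample = '<img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png">'
print(image_urls(sample, cdn_url))
```

Calling download_images(html_data, cdn_url) would then perform the real requests. Note that two images sharing the same final path segment would overwrite each other in this sketch.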

answered Sep 10, 2018 by Priyaj
• 58,020 points
