Jan 22, 2015

Don't put html, head and body tags automatically into BeautifulSoup content

I was recently writing a parser with BeautifulSoup. The final step was rendering the contents of the edited text, like so:
from bs4 import BeautifulSoup

text = "<h1>Test text tag</h1>"
soup = BeautifulSoup(text, "html5")  # "html5" selects the html5lib parser

text = soup.renderContents()
print(text)
The rendered result comes back wrapped in <html>, <head> and <body> tags, so the printed output looks like this:
'<html><head></head><body><h1>Test text tag</h1></body></html>'
That's a feature of the html5lib library: it repairs incomplete HTML, for example by adding back required elements that are missing.
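To see that the wrapping comes from the parser rather than from BeautifulSoup itself, here is a small sketch that serializes the same fragment with two parsers. The helper name `render_with` is made up for this illustration, and the sketch assumes the bs4 package is installed (html5lib is optional here). Python's built-in html.parser should leave the fragment as it is, while html5lib adds the missing elements.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with(markup, parser_name):
    """Parse markup with the named parser and serialize the whole tree as text.

    Hypothetical helper for this post, not part of BeautifulSoup.
    """
    return str(BeautifulSoup(markup, parser_name))


fragment = "<h1>Test text tag</h1>"

for parser_name in ("html.parser", "html5lib"):
    try:
        print(parser_name, "->", render_with(fragment, parser_name))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        print(parser_name, "is not installed")
```

If the fragment never needs html5lib's repair behaviour, switching to html.parser may be enough on its own.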

The workaround is simple enough:
text = soup.body.renderContents()
This returns only what is inside the <body> tag, without the <html>, <head> and <body> wrappers.
Result is:
text = '<h1>Test text tag</h1>'
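The same workaround can be wrapped in a small helper. This is only a sketch: `render_fragment` is a name invented for this post, and it uses `decode_contents()` instead of `renderContents()` because on Python 3 `renderContents()` returns bytes, while `decode_contents()` returns text. The fallback to the whole tree is an assumption for parsers that do not create a <body> on their own.

```python
from bs4 import BeautifulSoup


def render_fragment(markup, parser="html5lib"):
    """Return the parsed markup as text, without html/head/body wrappers.

    Hypothetical helper for this post, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, parser)
    # html5lib always creates <body>; html.parser only has one if the
    # markup itself contains it, so fall back to the whole tree.
    root = soup.body if soup.body is not None else soup
    return root.decode_contents()
```

Calling `render_fragment("<h1>Test text tag</h1>")` should give back the original fragment as a string.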
