lxml
Introduction
lxml is a package used for processing XML in Python. To get started, simply import the package:
from lxml import etree
If you want to load XML from a string, do the following:
my_string = """<html>
<head>
<title>My Title</title>
</head>
<body>
<div>
<div class="abc123 sktoe-sldjkt dkjfg3-dlgsk">
<div class="glkjr-slkd dkgj-0 dklfgj-00">
<a class="slkdg43lk dlks" href="https://example.com/123456">
</a>
</div>
</div>
<div>
<div class="ldskfg4">
<span class="slktjoe" aria-label="123 comments, 43 Retweets, 4000 likes">Love it.</span>
</div>
</div>
<div data-amount="12">13</div>
</div>
<div>
<div class="abc123 sktoe-sls dkjfg-dlgsk">
<div class="glkj-slkd dkgj-0 dklfj-00">
<a class="slkd3lk dls" href="https://example.com/123456">
</a>
</div>
</div>
<div>
<div class="ldg4">
<span class="sktjoe" aria-label="1000 comments, 455 Retweets, 40000 likes">Love it.</span>
</div>
</div>
<div data-amount="122">133</div>
</div>
</body>
</html>"""
tree = etree.fromstring(my_string)
If you don’t want all of this code in your file, you can put this string into a .xml file and parse it as follows:
tree = etree.parse("example.xml")
Most of the time you’ll be working with .xml files that already exist, so this is the form you’ll see most often. Note that etree.parse returns an ElementTree, while etree.fromstring returns an Element; both support the xpath method used below. From there, you can use XPath expressions to pull data out of the tree.
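For example, with the sample document above loaded into tree, a single XPath expression can collect every href attribute. This is a minimal sketch; the exact expression depends on your document's structure.
# "//a/@href" matches the href attribute of every a element, anywhere in the document
hrefs = tree.xpath("//a/@href")
print(hrefs)
['https://example.com/123456', 'https://example.com/123456']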
Examples
How do I load a webpage I scraped using requests into an lxml tree?
import requests
import lxml.html
# note: without a User-Agent header, a website may block the request or serve a CAPTCHA instead of the page
my_headers = {'User-Agent': 'Mozilla/5.0'}
# scrape the webpage
response = requests.get("https://www.reddit.com/r/puppies/", headers=my_headers)
# load the webpage into an lxml tree
tree = lxml.html.fromstring(response.text)
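Once the page is loaded, the resulting tree supports the same xpath method shown above. As a quick sanity check (assuming the request succeeded), you can count the link elements on the page; the exact number depends on what the site served.
# count the anchor elements to confirm the page parsed into a usable tree
print(len(tree.xpath("//a")))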
How do I get the name of the root node from my lxml tree called tree?
# remember "/" gets the node starting at the root node and "*" is a
# wildcard that means "anything"
tree.xpath("/*")[0].tag
'html'
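If tree was created with etree.parse rather than etree.fromstring, it is an ElementTree rather than an Element; in that case getroot() is a shorter way to reach the same element. A minimal sketch under that assumption:
# only applies when tree came from etree.parse; fromstring already returns the root Element
root = tree.getroot()
root.tag
'html'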
If the root node is named "html", how do I get the names of the tags nested directly beneath it?
# remember, this syntax is a list comprehension.
# It is essentially a nice short-hand way of writing a loop in Python.
list_of_tags = [x.tag for x in tree.xpath("/html/*")]
print(list_of_tags)
['head', 'body']
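If you want every tag in the document rather than just the root's direct children, the // axis matches elements at any depth. A minimal sketch against the sample document:
# "//*" matches every element in the document, in document order
all_tags = [x.tag for x in tree.xpath("//*")]
print(all_tags[:5])
['html', 'head', 'title', 'body', 'div']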
How do I get the attributes of an element?
import pandas as pd
# as you can see, this prints the attributes in a dict-like object for each div element
# in the tree.
for element in tree.xpath("//div"):
print(element.attrib)
{}
{'class': 'abc123 sktoe-sldjkt dkjfg3-dlgsk'}
{'class': 'glkjr-slkd dkgj-0 dklfgj-00'}
{}
{'class': 'ldskfg4'}
{'data-amount': '12'}
{}
{'class': 'abc123 sktoe-sls dkjfg-dlgsk'}
{'class': 'glkj-slkd dkgj-0 dklfj-00'}
{}
{'class': 'ldg4'}
{'data-amount': '122'}
The output looks much like a dictionary. We can turn the attributes of each div element into a Pandas DataFrame if that’s easier for our analysis.
list_of_dicts = []
# wrapping element.attrib in `dict` is important here: attrib is an lxml _Attrib object
# rather than a plain dict, and pandas does not build the DataFrame correctly from it
for element in tree.xpath("//div"):
list_of_dicts.append(dict(element.attrib))
myDF = pd.DataFrame(list_of_dicts)
myDF.head(10)
                              class data-amount
0                               NaN         NaN
1  abc123 sktoe-sldjkt dkjfg3-dlgsk         NaN
2       glkjr-slkd dkgj-0 dklfgj-00         NaN
3                               NaN         NaN
4                           ldskfg4         NaN
5                               NaN          12
6                               NaN         NaN
7      abc123 sktoe-sls dkjfg-dlgsk         NaN
8         glkj-slkd dkgj-0 dklfj-00         NaN
9                               NaN         NaN
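From here, ordinary pandas operations apply. For example (a minimal sketch), you can keep only the rows that actually have a data-amount value:
# drop rows where the data-amount column is missing
print(myDF.dropna(subset=["data-amount"]))
   class data-amount
5    NaN          12
11   NaN         122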
How do I get the div elements with attribute "data-amount"?
for element in tree.xpath("//div[@data-amount]"):
print(element.attrib)
{'data-amount': '12'}
{'data-amount': '122'}
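If you only need the attribute values or the element text rather than the elements themselves, XPath can return those directly. A minimal sketch against the sample document:
# "@data-amount" returns the attribute values as strings
print(tree.xpath("//div/@data-amount"))
['12', '122']
# "text()" returns the text content of the matching elements
print(tree.xpath("//div[@data-amount]/text()"))
['13', '133']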