Awesome-omni-skill beautifulsoup4
Parse, search, and modify HTML/XML documents by building a navigable tree of tags and text.
install

source · Clone the upstream repo

```shell
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Claude Code · Install into ~/.claude/skills/

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/beautifulsoup4" ~/.claude/skills/diegosouzapw-awesome-omni-skill-beautifulsoup4 && rm -rf "$T"
```
manifest: skills/development/beautifulsoup4/SKILL.md
Imports
```python
import bs4
from bs4 import BeautifulSoup, Tag, Comment
from bs4.exceptions import FeatureNotFound, ParserRejectedMarkup
from bs4.dammit import UnicodeDammit
```
Core Patterns
Parse markup with an explicit parser ✅ Current
```python
from __future__ import annotations

from bs4 import BeautifulSoup

html_doc = "<html><body><p class='body strikeout'>Hello</p></body></html>"

# Always choose the parser explicitly for consistent behavior across environments.
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.find("p")
assert p is not None
print(p.name)        # "p"
print(p.get_text())  # "Hello"
```
- Prefer an explicit parser in `BeautifulSoup(markup, "html.parser")`; `"lxml"`, `"html5lib"`, or `"xml"`/`"lxml-xml"` are alternatives depending on your needs. Different parsers can produce different trees for invalid documents.
Parse from a file handle (context manager) ✅ Current
```python
from __future__ import annotations

from pathlib import Path

from bs4 import BeautifulSoup

path = Path("example.html")
path.write_text("<html><body><a href='/x'>Link</a></body></html>", encoding="utf-8")

with path.open("r", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

a = soup.find("a")
assert a is not None
print(a.get("href"))  # "/x"
```
- Pass an open file handle directly to `BeautifulSoup` to let the builder stream/handle encodings appropriately.
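When the input is raw bytes rather than an open file handle, you can hint the encoding instead of relying on detection. A minimal sketch using the `from_encoding` parameter (the markup here is invented for illustration):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

# Bytes input: BeautifulSoup detects the encoding, but a hint avoids misdetection.
data = "<html><body><p>héllo</p></body></html>".encode("latin-1")
soup = BeautifulSoup(data, "html.parser", from_encoding="latin-1")

print(soup.p.get_text())       # "héllo"
print(soup.original_encoding)  # typically "latin-1" here
```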
Find elements and navigate relatives ✅ Current
```python
from __future__ import annotations

from typing import Optional

from bs4 import BeautifulSoup, Tag

html_doc = """
<div id="root">
  <h1>Title</h1>
  <p>First</p>
  <p>Second <span>inner</span></p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

root: Optional[Tag] = soup.find(id="root")
assert root is not None
h1: Optional[Tag] = root.find("h1")
assert h1 is not None

# Navigate: the first <p> after the <h1> in document order.
next_p: Optional[Tag] = h1.find_next("p")
assert next_p is not None
print(next_p.get_text(strip=True))  # "First"

all_ps = root.find_all("p")
print([p.get_text(" ", strip=True) for p in all_ps])  # ["First", "Second inner"]
```
- Use `find`, `find_all`, and the `find_next*`/`find_previous*`/sibling/parent variants for tree navigation.
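The sibling and parent variants mentioned above can be sketched briefly (the markup is invented for illustration):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

html_doc = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")
assert first_p is not None

# Sibling navigation: the next <p> at the same level.
sibling = first_p.find_next_sibling("p")
assert sibling is not None
print(sibling.get_text())  # "two"

# Parent navigation: the enclosing <div>.
parent = first_p.find_parent("div")
assert parent is not None
print(parent.name)  # "div"
```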
Work with tag attributes (including multi-valued `class`) ✅ Current

```python
from __future__ import annotations

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p id='x' class='body strikeout'></p>", "html.parser")
p = soup.find("p")
assert isinstance(p, Tag)

# Dict-like access
print(p["id"])      # "x"
print(p.get("id"))  # "x"

# Multi-valued HTML attributes like class are lists by default.
print(p["class"])  # ["body", "strikeout"]

# If you always want a list (even for non-multivalued attrs), use get_attribute_list.
print(p.get_attribute_list("id"))     # ["x"]
print(p.get_attribute_list("class"))  # ["body", "strikeout"]

# Mutation
p["data-role"] = "demo"
del p["id"]
print(p.attrs)  # {'class': ['body', 'strikeout'], 'data-role': 'demo'}
```
- In HTML mode, `class`, `rel`, etc. are typically stored as `list[str]`. Use `Tag.get_attribute_list(name)` to normalize to a list.
Handle text nodes and comments safely ✅ Current
```python
from __future__ import annotations

from bs4 import BeautifulSoup, Comment
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello<!--secret--></p>", "html.parser")
p = soup.find("p")
assert p is not None

# Comments are special text nodes.
comment = p.find(string=lambda s: isinstance(s, Comment))
assert isinstance(comment, Comment)
print(comment)  # "secret"

# NavigableString is immutable; replace the node instead of editing in place.
text = p.find(string=lambda s: isinstance(s, NavigableString) and not isinstance(s, Comment))
assert isinstance(text, NavigableString)
text.replace_with("Hi")
print(p.get_text())  # "Hi"
```
- Treat `NavigableString` as immutable; use `replace_with(...)` to change text.
Configuration

- Parser selection (`features`):
  - `"html.parser"`: built-in, decent baseline.
  - `"lxml"`: fast (requires `lxml`).
  - `"html5lib"`: most lenient (slow; requires `html5lib`).
  - `"xml"`/`"lxml-xml"`: XML parsing mode (attribute handling differs from HTML).
- `parse_only`: pass a `SoupStrainer` to parse only parts of a document for speed/memory.
- `from_encoding`/`exclude_encodings`: hint or restrict encoding detection when input is bytes.
- Large text nodes with lxml: when using an lxml builder and documents may contain a single text node larger than 10,000,000 bytes, pass `huge_tree=True` to `BeautifulSoup(...)` to avoid lxml security limits truncating the parse.
- Multi-valued attributes:
  - Default (HTML): `class`, `rel`, etc. become lists.
  - To disable list conversion: `BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)`.
  - In XML mode, multi-valued attributes are not enabled by default; you can opt in via `multi_valued_attributes={'*': 'class'}`.
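The `parse_only` option above can be sketched with a `SoupStrainer` that keeps only `<a>` tags (a minimal example; the markup is invented for illustration):

```python
from __future__ import annotations

from bs4 import BeautifulSoup, SoupStrainer

html_doc = "<html><body><a href='/x'>X</a><p>skip me</p><a href='/y'>Y</a></body></html>"

# Parse only <a> tags; everything else is discarded during parsing,
# which saves time and memory on large documents.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

print([a.get("href") for a in soup.find_all("a")])  # ["/x", "/y"]
```

Note that a strained soup contains only the matched elements, so navigation outside them (e.g., to the discarded `<p>`) is not possible.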
Pitfalls
Wrong: Not specifying a parser (inconsistent trees)
```python
from bs4 import BeautifulSoup

html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc)  # parser not specified
print(soup.find("b"))
```
Right: Choose a parser explicitly
```python
from bs4 import BeautifulSoup

html_doc = "<p><b>badly nested</p></b>"
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find("b"))
```
Wrong: Treating `class` as a string in HTML mode

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")

# In HTML mode, soup.p["class"] is a list, so this fails.
classes = soup.p["class"].split()  # type: ignore[attr-defined]
print(classes)
```
Right: Use the list directly (or normalize with `get_attribute_list`)

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "html.parser")
classes = soup.p["class"]
print(classes)  # ["body", "strikeout"]

ids = soup.p.get_attribute_list("id")
print(ids)  # []
```
Wrong: Assuming multi-valued attributes exist in XML mode
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='body strikeout'></p>", "xml")

# In XML mode, "class" is a string by default; indexing returns a character.
first = soup.p["class"][0]
print(first)  # "b" (not "body")
```
Right: Opt in to multi-valued attributes when parsing XML
```python
from bs4 import BeautifulSoup

class_is_multi = {"*": "class"}
soup = BeautifulSoup("<p class='body strikeout'></p>", "xml", multi_valued_attributes=class_is_multi)
first = soup.p["class"][0]
print(first)  # "body"
```
Wrong: Editing a `NavigableString` "in place"

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)

# Strings are immutable; this does not update the parse tree.
text = NavigableString("Hi")
print(soup.p.get_text())  # still "Hello"
```
Right: Replace the existing node with `replace_with`

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
text = soup.p.string
assert isinstance(text, NavigableString)
text.replace_with("Hi")
print(soup.p.get_text())  # "Hi"
```
Wrong: lxml builder truncation with huge text nodes (missing `huge_tree=True`)

```python
from bs4 import BeautifulSoup

# If this markup contains a single >10,000,000 byte text node, lxml may stop early.
markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml")
print(soup.find("root") is not None)
```
Right: Enable huge tree support when needed
```python
from bs4 import BeautifulSoup

markup_with_huge_text = "<root>" + ("x" * 11_000_000) + "</root>"
soup = BeautifulSoup(markup_with_huge_text, "lxml", huge_tree=True)
print(soup.find("root") is not None)
```
References
Migration from v4.13.x
- Typing changes (4.14.0+): `find_*` methods gained overloads to improve type safety.
  - Prefer annotating results as `Optional[Tag]`, `Optional[NavigableString]`, `Sequence[Tag]`, etc.
  - Known edge case: `find_all("a", string="...")` may still confuse type checkers; refactor or use `typing.cast`.
```python
from __future__ import annotations

from typing import Optional, Sequence, cast

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<a>b</a>", "html.parser")

# Preferred: reflect optionality
a: Optional[Tag] = soup.find("a")

# Edge case: mixed filters may require a cast for static type checkers
tags = cast(Sequence[Tag], soup.find_all("a", string="b"))
print([t.get_text() for t in tags])
```
- `ResultSet` typing churn across 4.14.x: inheritance changed in 4.14.0/4.14.1/4.14.2; avoid depending on specific ABC inheritance.
  - If you need a stable container type at boundaries: `results = list(soup.find_all(...))`.
- lxml huge text nodes (4.14.3 note): if using an lxml builder and expecting extremely large text nodes, pass `huge_tree=True`.
API Reference
- `BeautifulSoup(markup, features=..., parse_only=..., from_encoding=..., exclude_encodings=..., element_classes=..., **kwargs)` - parse markup into a tree; specify `features` (parser) explicitly.
- `BeautifulSoup.find(...)` - return the first matching element (often `Tag | None`); supports tag name, attrs, and other filters.
- `BeautifulSoup.find_all(...)` - return all matching elements (list-like result set); convert to `list(...)` if you need a stable container type.
- `BeautifulSoup.find_next(...)` / `find_all_next(...)` - search forward in document order from a starting node.
- `BeautifulSoup.find_previous(...)` / `find_all_previous(...)` - search backward in document order from a starting node.
- `BeautifulSoup.find_next_sibling(...)` / `find_next_siblings(...)` - search among following siblings.
- `BeautifulSoup.find_previous_sibling(...)` / `find_previous_siblings(...)` - search among preceding siblings.
- `BeautifulSoup.find_parent(...)` / `find_parents(...)` - search upward to parents/ancestors.
- `BeautifulSoup.get_text(separator="...", strip=False)` - extract combined text content from a subtree.
- `BeautifulSoup.prettify()` - render formatted markup for debugging/inspection.
- `BeautifulSoup.contains_replacement_characters` - flag indicating replacement characters were introduced during entity/encoding handling (builder-dependent).
- `Tag.name` - the tag's name (e.g., `"a"`, `"p"`).
- `Tag.attrs` - dict of attributes; multi-valued HTML attributes may be lists.
- `Tag.get(key, default=None)` - safe attribute lookup.
- `Tag.get_attribute_list(name)` - normalize an attribute to a list regardless of internal storage.
- `UnicodeDammit(...)` - helper for detecting/decoding unknown encodings before parsing.
- `Comment` - class for HTML/XML comment nodes (a specialized string-like node).
- `ParserRejectedMarkup` - exception raised when the underlying parser rejects markup.
- `FeatureNotFound` - exception raised when the requested parser feature/builder is unavailable.
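To illustrate the `get_text` and `prettify` entries above, a minimal sketch (the markup is invented for illustration):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

# Join text nodes with a separator and strip surrounding whitespace.
print(soup.get_text(separator=" | ", strip=True))  # "one | two"

# Render indented markup for debugging.
print(soup.prettify())
```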