HTML to BlockNote Converter¶
The html_to_blocks() function parses HTML content and converts it into BlockNote blocks.
Function Signature¶
def html_to_blocks(html: str) -> List[Block]
Parameters¶
- html (str): HTML string to parse and convert
Returns¶
- List[Block]: List of BlockNote Block objects
Basic Usage¶
from blocknote.converter import html_to_blocks
html = "<h1>Title</h1><p>This is a paragraph.</p>"
blocks = html_to_blocks(html)
print(len(blocks)) # 2
print(blocks[0].type) # heading
print(blocks[0].props["level"]) # 1
print(blocks[1].type) # paragraph
Supported HTML Elements¶
Headings¶
html = "<h1>H1</h1><h2>H2</h2><h6>H6</h6>"
blocks = html_to_blocks(html)
# Results in 3 heading blocks with levels 1, 2, and 6
Paragraphs¶
Lists¶
Unordered Lists¶
html = """
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
"""
blocks = html_to_blocks(html)
# Results in 2 bulletListItem blocks
Ordered Lists¶
html = """
<ol>
<li>First step</li>
<li>Second step</li>
</ol>
"""
blocks = html_to_blocks(html)
# Results in 2 numberedListItem blocks
Blockquotes¶
html = "<blockquote>Important quote</blockquote>"
blocks = html_to_blocks(html)
# Results in 1 quote block
Checkboxes¶
html = """
<div><input type="checkbox" checked> Completed task</div>
<div><input type="checkbox"> Pending task</div>
"""
blocks = html_to_blocks(html)
# Results in 2 checkListItem blocks
# First with checked=True, second with checked=False
Text Styling¶
Basic Styles¶
html = "<p><strong>Bold</strong> and <em>italic</em> text</p>"
blocks = html_to_blocks(html)
block = blocks[0]
print(len(block.content)) # 3 inline content items
print(block.content[0].styles) # {"bold": True}
print(block.content[2].styles) # {"italic": True}
Supported Style Tags¶
| HTML Tag | BlockNote Style | Example |
|---|---|---|
| `<strong>`, `<b>` | `{"bold": True}` | **bold** |
| `<em>`, `<i>` | `{"italic": True}` | *italic* |
| `<u>` | `{"underline": True}` | <u>underline</u> |
| `<s>` | `{"strike": True}` | ~~strikethrough~~ |
| `<code>` | `{"code": True}` | `code` |
CSS Styles¶
html = '<p><span style="color: red; background-color: yellow;">Colored text</span></p>'
blocks = html_to_blocks(html)
content = blocks[0].content[0]
print(content.styles) # {"textColor": "red", "backgroundColor": "yellow"}
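The mapping from an inline style attribute to BlockNote style keys can be pictured with a small standalone sketch. The helper `css_to_styles` and its property table are hypothetical, written only to reproduce the output shown above; the library's own parsing may differ, for example in how it treats unknown properties or color formats.

```python
# Illustrative only: not the library's implementation.
# Maps the two CSS properties shown above to BlockNote style keys.
CSS_TO_STYLE_KEY = {
    "color": "textColor",
    "background-color": "backgroundColor",
}


def css_to_styles(style_attr: str) -> dict:
    """Parse an inline style string into a BlockNote-like styles dict."""
    styles = {}
    for declaration in style_attr.split(";"):
        # Skip empty fragments such as the one after a trailing semicolon.
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        key = CSS_TO_STYLE_KEY.get(prop.strip().lower())
        if key:
            styles[key] = value.strip()
    return styles


print(css_to_styles("color: red; background-color: yellow;"))
# {'textColor': 'red', 'backgroundColor': 'yellow'}
```

Properties outside the table are dropped in this sketch, which mirrors the limitation that only basic inline styles are supported.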
Nested Styles¶
html = "<p><strong><em>Bold and italic</em></strong></p>"
blocks = html_to_blocks(html)
content = blocks[0].content[0]
print(content.styles) # {"bold": True, "italic": True}
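One way to picture how nested tags end up as a single merged styles dict is a stack of currently open style tags. The sketch below uses only the standard library's `html.parser`; `StyleCollector` and `collect_runs` are hypothetical names for illustration and are not part of the library, whose implementation may work differently.

```python
from html.parser import HTMLParser

# Tag-to-style mapping taken from the table of supported style tags.
TAG_STYLES = {
    "strong": "bold", "b": "bold",
    "em": "italic", "i": "italic",
    "u": "underline",
    "s": "strike",
    "code": "code",
}


class StyleCollector(HTMLParser):
    """Collects (text, styles) runs, merging the styles of all open tags."""

    def __init__(self):
        super().__init__()
        self.open_styles = []  # stack of style names for open tags
        self.runs = []         # list of (text, styles dict) tuples

    def handle_starttag(self, tag, attrs):
        if tag in TAG_STYLES:
            self.open_styles.append(TAG_STYLES[tag])

    def handle_endtag(self, tag):
        style = TAG_STYLES.get(tag)
        if style in self.open_styles:
            # Remove the most recently opened occurrence of this style.
            idx = len(self.open_styles) - 1 - self.open_styles[::-1].index(style)
            del self.open_styles[idx]

    def handle_data(self, data):
        if data:
            self.runs.append((data, {name: True for name in self.open_styles}))


def collect_runs(html):
    parser = StyleCollector()
    parser.feed(html)
    parser.close()
    return parser.runs


print(collect_runs("<p><strong><em>Bold and italic</em></strong></p>"))
# [('Bold and italic', {'bold': True, 'italic': True})]
```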
Complex HTML Examples¶
Mixed Content Document¶
html = """
<h1>Document Title</h1>
<p>This is a <strong>paragraph</strong> with <em>mixed</em> formatting.</p>
<ul>
<li>First bullet point</li>
<li>Second bullet point</li>
</ul>
<blockquote>An important quote</blockquote>
<div><input type="checkbox" checked> Completed task</div>
"""
blocks = html_to_blocks(html)
print(f"Parsed {len(blocks)} blocks")
for i, block in enumerate(blocks):
    print(f"Block {i}: {block.type}")
    if block.content:
        text = "".join(c.text for c in block.content)
        print(f"  Content: {text}")
Table Handling¶
html = '<div class="blocknote-table">Table content</div>'
blocks = html_to_blocks(html)
print(blocks[0].type) # table
Error Handling¶
from blocknote.converter import html_to_blocks
# Invalid input type
try:
    blocks = html_to_blocks(123)
except TypeError as e:
    print(f"Error: {e}")  # Input must be a string
# Empty input
blocks = html_to_blocks("")
print(len(blocks)) # 0
# Malformed HTML (handled gracefully)
blocks = html_to_blocks("<p>Unclosed paragraph")
print(len(blocks)) # Still parses what it can
HTML Sanitization¶
The parser ignores script tags in the input, so their contents do not appear in the resulting blocks:
html = '<p><script>alert("xss")</script>Safe content</p>'
blocks = html_to_blocks(html)
# Script tags are ignored, only safe content is preserved
print(blocks[0].content[0].text) # "Safe content"
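The effect described above can be reproduced with a standalone text extractor that skips anything inside script tags. This is a sketch of the behavior, not the library's code: `SafeTextExtractor`, `extract_safe_text`, and the `SKIPPED_TAGS` set (which also includes `style`) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is dropped; the library's list may differ.
SKIPPED_TAGS = {"script", "style"}


class SafeTextExtractor(HTMLParser):
    """Collects text content while ignoring everything inside skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_safe_text(html):
    parser = SafeTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(extract_safe_text('<p><script>alert("xss")</script>Safe content</p>'))
# Safe content
```

Dropping script content during conversion is not a substitute for sanitizing HTML that will later be rendered in a browser.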
Advanced Usage¶
Custom Block Detection¶
The parser can detect custom BlockNote block types from a class name of the form blocknote-<type>:
html = '<div class="blocknote-customType">Custom content</div>'
blocks = html_to_blocks(html)
# If customType is a valid BlockType, it will be preserved
# Otherwise, defaults to paragraph
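The fallback rule in the comments above can be sketched as a small lookup. Everything here is hypothetical: `detect_block_type` is not a library function, and `KNOWN_TYPES` is a stand-in built from the block types mentioned on this page rather than the library's actual BlockType definition.

```python
# Assumed for illustration: the class prefix follows the examples above, and
# KNOWN_TYPES stands in for the library's BlockType definition.
CLASS_PREFIX = "blocknote-"
KNOWN_TYPES = {
    "paragraph", "heading", "quote", "table",
    "bulletListItem", "numberedListItem", "checkListItem",
}


def detect_block_type(class_attr, known_types=KNOWN_TYPES):
    """Return the type named by a blocknote-* class, else 'paragraph'."""
    for name in class_attr.split():
        if name.startswith(CLASS_PREFIX):
            candidate = name[len(CLASS_PREFIX):]
            if candidate in known_types:
                return candidate
    return "paragraph"


print(detect_block_type("blocknote-table"))       # table
print(detect_block_type("blocknote-customType"))  # paragraph
print(detect_block_type("note blocknote-quote"))  # quote
```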
Whitespace Handling¶
html = "<p>Text  with   multiple    spaces</p>"
blocks = html_to_blocks(html)
# Whitespace is preserved as-is
print(blocks[0].content[0].text) # "Text  with   multiple    spaces"
Empty Elements¶
html = "<p></p><h1></h1>"
blocks = html_to_blocks(html)
print(len(blocks)) # 2
print(len(blocks[0].content)) # 0 (empty content)
print(len(blocks[1].content)) # 0 (empty content)
Performance Considerations¶
Large Documents¶
For large HTML documents, a size check can warn before parsing starts:
def parse_large_html(html_content):
    # Consider chunking very large documents
    if len(html_content) > 1_000_000:  # about 1 MB of ASCII text (len counts characters)
        print("Warning: Large document, parsing may be slow")
    return html_to_blocks(html_content)
Memory Usage¶
The parser creates Block objects in memory. For very large documents, consider:
- Processing in chunks
- Streaming processing
- Limiting the depth of nested elements
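The first two suggestions can be combined by splitting the input at top-level element boundaries and converting one batch at a time. The sketch below is an assumption-laden illustration, not a library feature: `TopLevelSplitter`, `split_top_level`, and `iter_blocks_in_batches` are hypothetical names, the input is assumed to use explicit closing tags, and text or comments between top-level elements are dropped. The `convert` argument is meant to be `html_to_blocks`.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelSplitter(HTMLParser):
    """Collects one source string per top-level element of an HTML fragment."""

    def __init__(self):
        # Keep entities such as &amp; untouched so chunks can be re-parsed.
        super().__init__(convert_charrefs=False)
        self.depth = 0
        self.current = []
        self.chunks = []

    def flush(self):
        if self.current:
            self.chunks.append("".join(self.current))
            self.current = []

    def handle_starttag(self, tag, attrs):
        self.current.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            if self.depth == 0:
                self.flush()
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self.current.append(self.get_starttag_text())
        if self.depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return  # stray closing tag; ignore it
        self.current.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            self.flush()

    def handle_data(self, data):
        if self.depth > 0:  # text between top-level elements is dropped
            self.current.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.current.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.current.append(f"&#{name};")


def split_top_level(html):
    splitter = TopLevelSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # keep whatever remains if the input ended mid-element
    return splitter.chunks


def iter_blocks_in_batches(html, convert, max_chars=100_000):
    """Yield blocks batch by batch so callers can process them incrementally."""
    batch, size = [], 0
    for chunk in split_top_level(html):
        if batch and size + len(chunk) > max_chars:
            yield from convert("".join(batch))
            batch, size = [], 0
        batch.append(chunk)
        size += len(chunk)
    if batch:
        yield from convert("".join(batch))
```

A caller would iterate with something like `for block in iter_blocks_in_batches(html, html_to_blocks): ...` and handle each block before the next batch is parsed. Because a whole list or blockquote is one top-level element, it is never split across batches.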
Limitations¶
- Complex Tables: Only basic table support is provided
- Media Elements: Images and videos are not supported
- Custom Elements: Unknown HTML elements are ignored
- CSS Styles: Only basic inline styles are supported
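Because unknown and media elements are ignored rather than reported, it can help to audit the input before converting it. The following sketch counts tags that this page does not document as supported. `find_undocumented_tags` and `DOCUMENTED_TAGS` are hypothetical: the set is assembled from the examples on this page, not read from the library, and may be incomplete.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: built from the elements this page documents as supported.
DOCUMENTED_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
    "blockquote", "div", "input", "span",
    "strong", "b", "em", "i", "u", "s", "code",
}


class TagAuditor(HTMLParser):
    """Counts start tags that are not in DOCUMENTED_TAGS."""

    def __init__(self):
        super().__init__()
        self.undocumented = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in DOCUMENTED_TAGS:
            self.undocumented[tag] += 1


def find_undocumented_tags(html):
    auditor = TagAuditor()
    auditor.feed(html)
    auditor.close()
    return dict(auditor.undocumented)


print(find_undocumented_tags(
    '<p>Hi</p><img src="a.png"><video></video><img src="b.png">'
))
# {'img': 2, 'video': 1}
```

An empty result means every tag in the input appears in the documented set; it does not guarantee that the converter preserves every attribute or style on those tags.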
Best Practices¶
Input Validation¶
def safe_html_parse(html_input):
    if not isinstance(html_input, str):
        raise TypeError("HTML input must be a string")
    if not html_input.strip():
        return []
    return html_to_blocks(html_input)
Error Recovery¶
# Assumes Block and InlineContent are already imported from the library's schema module.
def robust_html_parse(html_input):
    try:
        return html_to_blocks(html_input)
    except Exception as e:
        print(f"HTML parsing failed: {e}")
        # Return a safe fallback: a single paragraph block with an error message
        return [Block(
            id="error",
            type="paragraph",
            content=[InlineContent(type="text", text="Content could not be parsed")]
        )]