C# - Truncating HTML content at the end of text blocks (block elements)
Usually, when we shorten or truncate textual content, we just cut it at a specific character index. That is already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:
- I define a character index n that serves as the truncation starting-point limit.
- The algorithm checks whether the content is at least n characters long (text only, not counting tags); if it is not, it returns the whole content.
- It then checks the range from character position n-x to n+x (text only) and searches for the ends of block nodes; x is a predefined offset value, e.g. n/5 to n/4.
- If several block nodes end within this range, the algorithm selects the one that ends closest to the limit index n.
- If no block node ends within the range, it finds the word boundaries within the same range, selects the one closest to n, and truncates at that position.
- It returns the truncated content as valid HTML (all tags closed at the end).
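The selection rules listed above can be sketched independently of any HTML parsing. This is only a minimal sketch under assumptions: the class `TruncationPoint`, the method `Choose`, and its parameters are hypothetical names, the tag-free text and the text offsets of block ends are assumed to have been computed already, and a "word boundary" is taken to mean a whitespace character that follows a non-whitespace character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TruncationPoint
{
    // Hypothetical helper: returns the text-only offset at which to cut.
    // text      - content with all tags stripped
    // blockEnds - text offsets at which candidate block elements end
    // limit     - the index n from the rules above
    // offset    - the tolerance x around n
    public static int Choose(string text, IEnumerable<int> blockEnds, int limit, int offset)
    {
        // Content shorter than the limit is returned whole.
        if (text.Length < limit)
            return text.Length;

        int min = Math.Max(0, limit - offset);
        int max = Math.Min(text.Length, limit + offset);

        // Prefer the block end closest to the limit, if any falls in range.
        List<int> blocks = blockEnds.Where(i => i >= min && i <= max).ToList();
        if (blocks.Count > 0)
            return blocks.OrderBy(i => Math.Abs(i - limit)).First();

        // Otherwise fall back to the closest word boundary in the same range
        // (assumption: whitespace preceded by a non-whitespace character).
        List<int> words = Enumerable.Range(min, max - min + 1)
            .Where(i => i > 0 && i < text.Length
                        && char.IsWhiteSpace(text[i])
                        && !char.IsWhiteSpace(text[i - 1]))
            .ToList();
        if (words.Count > 0)
            return words.OrderBy(i => Math.Abs(i - limit)).First();

        // No boundary at all in range: cut exactly at the limit.
        return limit;
    }
}
```

The returned offset is a position in the text, not in the markup, so it still has to be mapped back onto the node tree before the HTML is cut and its open tags are closed.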
My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bolds and italics (which are inline nodes and shouldn't count in the truncation process), etc. The final implementation will of course define which elements are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points, because we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements etc. are good candidates. Headers and inline elements aren't.
Example
Let's take this Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).
This DotNetFiddle shows the text of this question with limit markers added inside it (|min| represents character 750, |limit| represents character 1000 and |max| represents character 1250).
As can be seen from the example, the truncation boundary between two block nodes that is closest to character 1000 lies between </ol> and <p> (My content-editable generated...). This means that my HTML should be truncated right between these two tags. The result would be a little less than 1000 characters long text-wise, but the truncated content stays meaningful because it isn't cut somewhere in the middle of a text passage.
I hope this explains how things should work in relation to this algorithm.
The problem
The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to detect different kinds of elements (only block elements and no inline ones). And last but not least, I have to count only certain characters in my string and ignore those that belong to tags.
Possible solutions
- I could parse my content manually by creating an object tree representing the content nodes and their hierarchy.
- I could convert the HTML to something easier to manage, like Markdown, then simply search for the new line closest to my provided index n and convert back to HTML.
- I could use something like HTML Agility Pack, replacing my #1 parsing with it, and then somehow use XPath to extract the block nodes and truncate the content.
Second thoughts
- I'm sure I could make it work by doing #1, but it feels like I'm reinventing the wheel.
- I don't think there's any C# library for #2, so I would have to do the HTML-to-Markdown conversion manually as well, or run e.g. pandoc as an external process.
- I could use HAP, as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough using it. I'm afraid the bulk of the processing would still be outside HAP in my custom code.
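To give a feel for how much custom code the HAP route might need, here is a rough sketch that walks the parsed tree once, counts only text characters, and records the text offsets at which candidate block elements end. It assumes the HtmlAgilityPack NuGet package is referenced. The names `BlockEndScanner`, `FindBlockEnds` and `Candidates` are hypothetical, and the candidate set is an assumption based on the element list above (headers are deliberately excluded).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BlockEndScanner
{
    // Assumed set of truncation candidates; h1-h6 are left out on purpose
    // so that a cut never leaves a widowed header.
    private static readonly HashSet<string> Candidates =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "li", "ol", "ul", "blockquote", "pre" };

    // Returns the text-only offsets at which candidate block elements end.
    public static List<int> FindBlockEnds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var ends = new List<int>();
        int position = 0;
        Walk(doc.DocumentNode, ref position, ends);
        return ends;
    }

    private static void Walk(HtmlNode node, ref int position, List<int> ends)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Count decoded characters so "&amp;" counts as one character.
                position += HtmlEntity.DeEntitize(child.InnerText).Length;
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                Walk(child, ref position, ends);

                // Record where this block ends; skip duplicates such as
                // </li></ol>, which close at the same text offset.
                if (Candidates.Contains(child.Name)
                    && (ends.Count == 0 || ends[ends.Count - 1] != position))
                {
                    ends.Add(position);
                }
            }
        }
    }
}
```

Whitespace-only text nodes between tags are counted as well, so the offsets line up with what `InnerText` reports rather than with what a browser renders.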
How should one approach such a truncation algorithm? My head just seems too tired to come to a consensus (or solution).
Here is some sample code that can truncate the inner text. It uses the recursive capability of the InnerText property and the CloneNode method.
public static HtmlNode TruncateInnerText(HtmlNode node, int length)
{
    if (node == null)
        throw new ArgumentNullException("node");

    // nothing to do?
    if (node.InnerText.Length < length)
        return node;

    HtmlNode clone = node.CloneNode(false);
    TruncateInnerText(node, clone, clone, length);
    return clone;
}

private static void TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
{
    HtmlNode childClone;
    foreach (HtmlNode child in source.ChildNodes)
    {
        // is the expected size ok?
        int expectedSize = child.InnerText.Length + root.InnerText.Length;
        if (expectedSize <= length)
        {
            // yes, so clone the whole hierarchy
            childClone = child.CloneNode(true);
            current.ChildNodes.Add(childClone);
            continue;
        }

        // is it a text node? then crop it
        HtmlTextNode text = child as HtmlTextNode;
        if (text != null)
        {
            int remove = expectedSize - length;
            childClone = root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, text.InnerText.Length - remove));
            current.ChildNodes.Add(childClone);
            return;
        }

        // it's not a text node, so shallow clone it and dive in
        childClone = child.CloneNode(false);
        current.ChildNodes.Add(childClone);
        TruncateInnerText(child, root, childClone, length);
    }
}
And here is a sample C# console app that scrapes this question as an example and truncates it to 500 characters.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
        var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");

        // TruncateInnerText is the method shown above, placed in this same class
        var truncated = TruncateInnerText(post, 500);
        Console.WriteLine(truncated.OuterHtml);
        Console.WriteLine("Size: " + truncated.InnerText.Length);
    }
}
When you run it, it should display this:
<div class="post-text" itemprop="text">
<p>Mainly when shorten/truncate textual content truncate at specific character index. That's complicated in html anyway, want truncate html content (generated using content-editable <code>div</code>) using different measures:</p>
<ol>
<li>I define character index <code>n</code> serve truncating startpoint <em>limit</em></li>
<li>Algorithm check whether content at least <code>n</code> characters long (text only; not counting tags); if it's not return whole content</li>
<li>It then</li></ol></div>
Size: 500
Note: I have not truncated at a word boundary, just at a character boundary, and no, it's not at all following the suggestions in the comment :-)