C# - Truncating HTML content at the end of text blocks (block elements)

Mainly when we shorten/truncate textual content, we just truncate it at a specific character index. That's already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:

1. I would define a character index N that will serve as the truncation start point limit.
2. The algorithm will check whether the content is at least N characters long (text only; not counting tags); if it's not, it will return the whole content.
3. It would then check from character position N-X to N+X (text only) and search for ends of block nodes; X is a predefined offset value, likely about N/5 to N/4.
4. If several block nodes end within this range, the algorithm will select the one that ends closest to the limit index N.
5. If no block node ends within this range, it would find the closest word boundary within the same range, select the index closest to N, and truncate at that position.
6. Return the truncated content as valid HTML (all tags closed at the end).
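
Steps 2 to 5 can be sketched independently of any HTML parsing. The following is only a minimal illustration, assuming the plain text and the text offsets of candidate block ends have already been extracted; the class name `CutPointChooser`, the method `Choose`, and its parameters are hypothetical names made up for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CutPointChooser
{
    // text: the content with tags already stripped.
    // blockEnds: text offsets at which a candidate block element ends.
    // Returns the text offset to cut at; text.Length means "keep everything".
    public static int Choose(string text, IEnumerable<int> blockEnds, int n, int x)
    {
        // Step 2: content shorter than the limit is returned whole.
        if (text.Length < n)
            return text.Length;

        int lo = Math.Max(0, n - x);
        int hi = Math.Min(text.Length, n + x);

        // Steps 3 and 4: among block ends inside [lo, hi], the one closest to n wins.
        var inRange = blockEnds.Where(e => e >= lo && e <= hi).ToList();
        if (inRange.Count > 0)
            return inRange.OrderBy(e => Math.Abs(e - n)).First();

        // Step 5: otherwise take the closest word boundary in the same range.
        // A boundary here is the end of the text or the first whitespace after a word.
        int best = -1;
        for (int i = lo; i <= hi; i++)
        {
            bool boundary = i == text.Length
                || (i > 0 && char.IsWhiteSpace(text[i]) && !char.IsWhiteSpace(text[i - 1]));
            if (boundary && (best < 0 || Math.Abs(i - n) < Math.Abs(best - n)))
                best = i;
        }

        // Assumption: if the range holds no boundary at all, cut at n itself.
        return best >= 0 ? best : n;
    }
}
```

For example, `CutPointChooser.Choose("aaaa bbbb cccc dddd", new[] { 4, 9, 12 }, 10, 3)` picks offset 9, because 9 and 12 both fall in the range 7 to 13 and 9 is closer to 10. Step 6 (re-closing the tags) is deliberately left out of this sketch.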

My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, and bolds and italics (which are inline nodes and shouldn't count in the truncation process), etc. The final implementation will of course define which elements are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points because we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements, etc. are good ones. Headers and all inline elements aren't.
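
One way to pin down the candidate elements is a simple lookup set. This is just a sketch; the class name `TruncationCandidates` is hypothetical and the exact element list (in particular which void elements to include) is an assumption to be adjusted for the real editor output.

```csharp
using System;
using System.Collections.Generic;

public static class TruncationCandidates
{
    // Element names after which the content may be cut (assumed list).
    // Headers (h1 to h6) and inline elements (b, i, em, code, ...) are left out on purpose.
    public static readonly HashSet<string> Names =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "p", "li", "ol", "ul", "blockquote", "pre", // block elements
            "hr", "br", "img"                            // void elements (assumption)
        };

    public static bool IsCandidate(string tagName)
    {
        return Names.Contains(tagName);
    }
}
```

With this, `TruncationCandidates.IsCandidate("P")` is true while `IsCandidate("h2")` and `IsCandidate("em")` are false.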

Example

Let's take this Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).

As can be seen from the example, the truncation boundary between two block nodes that is closest to character 1000 is between </ol> and <p> (My content-editable generated...). This means the HTML should be truncated right between these two tags, which would result in content a little less than 1000 characters long text-wise, but it keeps the truncated content meaningful because it wouldn't be cut somewhere in the middle of a text passage.

I hope this explains how things should work in relation to this algorithm.

The problem

The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to detect different elements (only block elements and no inline ones). And last but not least, I have to count only the text characters in my string and ignore those that belong to tags.
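
To make the counting problem concrete, here is a rough sketch of a single pass that counts text characters while skipping tags, and records the text offset at which each block element closes. It is not a real HTML parser: it assumes well-formed markup without comments, scripts, or `<`/`>` inside attribute values, and it counts an entity such as `&amp;` as its raw characters. `BlockEndScanner` and `Scan` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class BlockEndScanner
{
    // Returns the text-only length and the text offsets at which
    // elements listed in blockNames are closed.
    public static (int TextLength, List<int> BlockEnds) Scan(string html, ISet<string> blockNames)
    {
        var ends = new List<int>();
        int textPos = 0; // characters seen so far outside of tags
        int i = 0;

        while (i < html.Length)
        {
            if (html[i] != '<')
            {
                // Ordinary text character: counts toward the limit.
                textPos++;
                i++;
                continue;
            }

            int close = html.IndexOf('>', i);
            if (close < 0)
                break; // unterminated tag: stop scanning

            // Tag content without the angle brackets, e.g. "p" or "/li".
            string tag = html.Substring(i + 1, close - i - 1).Trim();
            if (tag.StartsWith("/", StringComparison.Ordinal)
                && blockNames.Contains(tag.Substring(1).Trim()))
            {
                ends.Add(textPos); // a block element ends at this text offset
            }

            i = close + 1;
        }

        return (textPos, ends);
    }
}
```

For `<p>Hello <b>big</b> world</p><ul><li>one</li><li>two</li></ul>` with `p`, `li` and `ul` as block names, this reports a text length of 21 and block ends at offsets 15, 18, 21 and 21; the inline `</b>` is ignored.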

Possible solutions

1. I could parse my content manually by creating an object tree representing the content nodes and their hierarchy.
2. I could convert the HTML to something easier to manage, like Markdown, then search for the new line closest to my provided index N and convert back to HTML.
3. Use something like HTML Agility Pack to replace the parsing in #1, and then somehow use XPath to extract block nodes and truncate the content.

Second thoughts

1. I'm sure I could make it work by doing #1, but it feels like I'm reinventing the wheel.
2. I don't think there's a C# library for #2, so I would be doing the HTML to Markdown conversion manually as well, or running e.g. pandoc as an external process.
3. I could use HAP, as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough using it. I'm afraid the bulk of the processing would still be outside HAP, in my custom code.

How should one approach such a truncation algorithm? My head seems too tired to come to a consensus (or solution).

Here is some sample code that can truncate the inner text. It uses the recursive capability of the InnerText property and the CloneNode method.

    public static HtmlNode TruncateInnerText(HtmlNode node, int length)
    {
        if (node == null)
            throw new ArgumentNullException("node");

        // nothing to do?
        if (node.InnerText.Length < length)
            return node;

        HtmlNode clone = node.CloneNode(false);
        TruncateInnerText(node, clone, clone, length);
        return clone;
    }

    private static void TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
    {
        HtmlNode childClone;
        foreach (HtmlNode child in source.ChildNodes)
        {
            // is the expected size ok?
            int expectedSize = child.InnerText.Length + root.InnerText.Length;
            if (expectedSize <= length)
            {
                // yes, just clone the whole hierarchy
                childClone = child.CloneNode(true);
                current.ChildNodes.Add(childClone);
                continue;
            }

            // is it a text node? then crop it
            HtmlTextNode text = child as HtmlTextNode;
            if (text != null)
            {
                int remove = expectedSize - length;
                childClone = root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, text.InnerText.Length - remove));
                current.ChildNodes.Add(childClone);
                return;
            }

            // it's not a text node, shallow clone and dive in
            childClone = child.CloneNode(false);
            current.ChildNodes.Add(childClone);
            TruncateInnerText(child, root, childClone, length);
        }
    }

And here is a sample C# console app that will scrape this question as an example, and truncate it to 500 characters.

    using System;
    using HtmlAgilityPack;

    class Program
    {
        // TruncateInnerText (both overloads above) is assumed to live in this class.
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
            var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");
            var truncated = TruncateInnerText(post, 500);
            Console.WriteLine(truncated.OuterHtml);
            Console.WriteLine("Size: " + truncated.InnerText.Length);
        }
    }

When run, it should display this:

    <div class="post-text" itemprop="text">

    <p>Mainly when we shorten/truncate textual content we usually just truncate it at specific character index. That's already complicated in HTML anyway, but I want to truncate my HTML content (generated using content-editable <code>div</code>) using different measures:</p>

    <ol>
    <li>I would define character index <code>N</code> that will serve as truncating startpoint <em>limit</em></li>
    <li>Algorithm will check whether content is at least <code>N</code> characters long (text only; not counting tags); if it's not, it will just return the whole content</li>
    <li>It would then</li></ol></div>
    Size: 500

Note: I have not truncated at a word boundary, just at a character boundary, and no, it's not at all following the suggestions in the comment :-)
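
If a word boundary is wanted instead, the cropped string in the text-node branch could be passed through a small helper before the text node is created. The following is only a sketch under that assumption; `WordBoundary.Crop` is a hypothetical helper, not part of HTML Agility Pack, and it backs up to the last whitespace so that no word is split.

```csharp
using System;

public static class WordBoundary
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Shortens text to at most maxLength characters without splitting a word.
    // If the kept part contains no whitespace at all, it falls back to a hard cut.
    public static string Crop(string text, int maxLength)
    {
        if (maxLength <= 0)
            return string.Empty;
        if (text.Length <= maxLength)
            return text;

        // The cut already falls right after a complete word.
        if (char.IsWhiteSpace(text[maxLength]))
            return text.Substring(0, maxLength).TrimEnd();

        // Search backwards from the last kept character for whitespace.
        int lastSpace = text.LastIndexOfAny(Whitespace, maxLength - 1);
        return lastSpace < 0
            ? text.Substring(0, maxLength)
            : text.Substring(0, lastSpace).TrimEnd();
    }
}
```

For instance, `WordBoundary.Crop("It would then check", 10)` gives `"It would"` rather than `"It would t"`. Note that the resulting size can then be somewhat smaller than the requested length.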
