Monday, April 03, 2006

Generating DOCTYPE and ENTITY DECLs in XSLT

I ran into an interesting challenge recently, and since RoboHelp has been EOL'd, I thought I'd share the solution for anyone else in the community who needs to migrate content out of RoboHelp. RoboHTML has an export handler that converts content to DocBook, but unfortunately the markup it produces isn't very good or accurate, IMO.

Since the default RoboHTML export handler for DocBook wouldn't work for us, we created a new export handler based on the original. What I wanted was a book file that contained an entity declaration for each topic. The trick was generating the DOCTYPE declaration along with those entity declarations. XSLT provides a way to generate the PUBLIC and SYSTEM identifiers in the DOCTYPE declaration of an output document (the doctype-public and doctype-system attributes of xsl:output), but it does not provide a way to write out entity declarations as an internal DTD subset in that declaration.
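
For reference, here is roughly what XSLT 1.0 gives you out of the box. This is only a minimal sketch: the identifiers are borrowed from the stylesheet later in this post, and the bare book element is a placeholder I made up to keep the example short.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The processor writes the DOCTYPE itself, using the name of the
       first output element ("book") as the document type name. -->
  <xsl:output method="xml"
              doctype-public="-//COMPANY//DTD DocBook-Based Extension v1.0//EN"
              doctype-system="extended-docbook.dtd"/>

  <xsl:template match="/">
    <!-- Placeholder content; a real stylesheet would build the book here. -->
    <book/>
  </xsl:template>

</xsl:stylesheet>
```

The serialized result should start with a declaration along the lines of <!DOCTYPE book PUBLIC "-//COMPANY//DTD DocBook-Based Extension v1.0//EN" "extended-docbook.dtd">, and xsl:output offers no attribute for adding a [ ... ] block of entity declarations to it. That is the gap the rest of this post works around.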

Part of the solution is to use the TEXT output method instead of the XML output method. This was working splendidly until it came time to write the file entity references into the content. From what I could tell, the problem is with the MSXML parser that RoboHelp uses when exporting content. No matter how I tried to escape the "&", I would get &amp; in the entity reference, which of course would not resolve.

Instead of building it into the export handler, I came up with a stylesheet to process the *_toc.xml file that resulted from the DocBook export.

Here are some snippets from the resulting code. The critical components are generating the DOCTYPE decl, generating the ENTITY decl, and then creating the entity reference.

<xsl:output method="text"  indent="yes"/>

<xsl:template match="/">
  <xsl:text disable-output-escaping="yes">&lt;!DOCTYPE book PUBLIC "-//COMPANY//DTD DocBook-Based Extension v1.0//EN" "extended-docbook.dtd" [
    &lt;!ENTITY glossary SYSTEM "glossary.xml"&gt;
  </xsl:text>
  <xsl:for-each select="//tocentry/ulink[normalize-space(@url)!= '']">
    <xsl:call-template name="generate_entity_decl">
      <xsl:with-param name="url" select="@url" />
      <xsl:with-param name="title" select="@title" />
    </xsl:call-template>
  </xsl:for-each>
  <xsl:text disable-output-escaping="yes">]></xsl:text>
  <book>
    <title><xsl:value-of select="title"/></title>
    <bookinfo>
      <xsl:call-template name="generate_publisher_info">
        <xsl:with-param name="rootnode" select="." />
      </xsl:call-template>
    </bookinfo>
    <xsl:apply-templates/>
    <xsl:text disable-output-escaping="yes">&amp;</xsl:text>glossary;
  </book>
</xsl:template>

The above creates the DocType declaration, but relies on generating a valid entity name for the entity declarations:

  <!-- ==== entityName template: removes path separators from the URL ==== -->
  <xsl:template name="entityName">  
    <xsl:param name="url"  />
    
    <xsl:choose>
      <xsl:when test="contains($url,'/')">        
        <xsl:value-of select="substring-before($url,'/')"/>
        <xsl:call-template name="entityName">          
          <xsl:with-param name="url" select="substring-after($url, '/')"/>         
        </xsl:call-template>       
      </xsl:when>     
      <xsl:when test="contains($url,'\')">        
        <xsl:value-of select="substring-before($url,'\')"/>
        <xsl:call-template name="entityName">          
          <xsl:with-param name="url" select="substring-after($url, '\')"/>         
        </xsl:call-template>       
      </xsl:when> 
      <xsl:otherwise>       
        <xsl:value-of select="$url"/>       
      </xsl:otherwise>     
    </xsl:choose>
    
  </xsl:template>
  
  <!-- ======== outputs entity references ======== -->
  <xsl:template name="generate_file_url">
    <xsl:param name="url" select="''" />
    <xsl:param name="title" select="''" />
    
    <xsl:variable name="transformedURL">
      <xsl:value-of select="translate(string($url), ' ()','_' )" />
    </xsl:variable>
    
    <xsl:variable name="entity.file.name">
      <xsl:call-template name="entityName">
        <xsl:with-param name="url" select="$transformedURL" />
      </xsl:call-template>
    </xsl:variable>
    
    <xsl:variable name="entity.name">
      <xsl:choose>
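        <!-- Test for the same string that substring-before() splits on;
             otherwise a name containing "htm" or "xml" without the dot
             would produce an empty entity name. -->
        <xsl:when test="contains($entity.file.name, '.htm')">
          <xsl:value-of select="substring-before($entity.file.name, '.htm')" />
        </xsl:when>
        <xsl:when test="contains($entity.file.name, '.xml')">
          <xsl:value-of select="substring-before($entity.file.name, '.xml')" />
        </xsl:when>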
        <xsl:otherwise>
          <xsl:value-of select="$entity.file.name" />
        </xsl:otherwise>
      </xsl:choose> 
    </xsl:variable>
    
    <!-- outputs entity reference -->
    <xsl:text disable-output-escaping="yes">&amp;</xsl:text>
    <xsl:value-of select="$entity.name" />
    <xsl:text disable-output-escaping="yes">; 
    </xsl:text>
    
  </xsl:template>

The above templates create an entity name based on the path to the file: spaces become underscores, parentheses and path separators are dropped, and the file extension is stripped. I tried to account for XML, HTML, and unknown file extensions. Following is the template that creates the ENTITY declaration that needs to be part of the internal subset in the DOCTYPE declaration.

  <!-- This generates an entity declaration: <!ENTITY foo SYSTEM "foo.xml"> -->
  <xsl:template name="generate_entity_decl">
    <xsl:param name="url" select="''" />
    <xsl:param name="title" select="''" />
    
    <xsl:variable name="transformedURL">
      <xsl:value-of select="translate(string($url), ' ()','_' )" />
    </xsl:variable>
    
    <xsl:variable name="entity.file.name">
      <xsl:call-template name="entityName">
        <xsl:with-param name="url" select="$transformedURL" />
      </xsl:call-template>
    </xsl:variable>
    
    <xsl:variable name="entity.name">
      <xsl:choose>
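        <!-- Test for the same string that substring-before() splits on;
             otherwise a name containing "htm" or "xml" without the dot
             would produce an empty entity name. -->
        <xsl:when test="contains($entity.file.name, '.htm')">
          <xsl:value-of select="substring-before($entity.file.name, '.htm')" />
        </xsl:when>
        <xsl:when test="contains($entity.file.name, '.xml')">
          <xsl:value-of select="substring-before($entity.file.name, '.xml')" />
        </xsl:when>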
        <xsl:otherwise>
          <xsl:value-of select="$entity.file.name" />
        </xsl:otherwise>
      </xsl:choose> 
    </xsl:variable>
    
    <!-- outputs entity declaration -->
    <xsl:text disable-output-escaping='yes'>&lt;!ENTITY </xsl:text><xsl:value-of select='$entity.name'/><xsl:text> SYSTEM
"</xsl:text><xsl:value-of select='$transformedURL'/><xsl:text>"</xsl:text><xsl:text disable-output-escaping='yes'>&gt;
    </xsl:text>
  </xsl:template>

As you can see, several variables are duplicated between the two templates, and they could probably be factored out or computed more efficiently. The remaining task is to create the entity reference in the content. Since the RoboHelp export handler dumped everything as a tocentry/ulink, it's pretty easy to extract the URL and rewrite it as the entity reference we really want:

 
  <xsl:template match="tocpart">
    <chapter>
      <title>
        <xsl:value-of select="tocentry"/>
      </title>
      <xsl:apply-templates />
    </chapter>
  </xsl:template>
  
  <xsl:template match="tocchap">
    <xsl:choose>
      <xsl:when test="tocentry/ulink">
        <xsl:call-template name="generate_file_url">
          <xsl:with-param name="url" select="tocentry/ulink/@url"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <section>
          <title>
            <xsl:value-of select="tocentry" />
          </title>
          <xsl:apply-templates select="toclevel1"/>
        </section>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
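
On the duplicated variables: one way they might be folded together is a single named template that both generate_file_url and generate_entity_decl call. The sketch below is untested against RoboHelp's actual export, and the template name entity_name is my own invention for illustration. It leans on the fact that translate() deletes any character that has no counterpart in its third argument, so the recursive entityName template isn't needed. For paths that use a single kind of separator it should give the same names as the templates above. The root template is only a demo that lists the names, one per line.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: turns a topic URL into an entity name. -->
  <xsl:template name="entity_name">
    <xsl:param name="url" select="''"/>
    <!-- Space becomes "_"; "(", ")", "/" and "\" have no counterpart in
         the third argument, so translate() simply drops them. -->
    <xsl:variable name="flat" select="translate(string($url), ' ()/\', '_')"/>
    <xsl:choose>
      <xsl:when test="contains($flat, '.htm')">
        <xsl:value-of select="substring-before($flat, '.htm')"/>
      </xsl:when>
      <xsl:when test="contains($flat, '.xml')">
        <xsl:value-of select="substring-before($flat, '.xml')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$flat"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Demo: list the entity name for each linked topic, one per line. -->
  <xsl:template match="/">
    <xsl:for-each select="//tocentry/ulink[normalize-space(@url) != '']">
      <xsl:call-template name="entity_name">
        <xsl:with-param name="url" select="@url"/>
      </xsl:call-template>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

As a made-up example, a url of "Getting Started/Intro (v2).htm" is flattened to "Getting_StartedIntro_v2.htm" and comes out as the entity name Getting_StartedIntro_v2.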

The finer details will be left as an exercise to the reader, but this was a particularly interesting problem and solution. Hope this helps!
