Saturday, 7 May 2011

Creating Word documents in Rails

This week, I needed to create a Word document from data in a Rails app. Needless to say, there is not a Windows machine in sight. After a bit of googling around, and thinking about maybe trying to use OpenOffice to do some of the heavy lifting, I came across a couple of posts that suggested it might be possible to do what I needed by creating a docx file to use as a template, and then editing it. After all, a docx file is just a zip file with a bunch of xml files inside ...

Levente Bagi has a nice solution but it didn't really meet my needs, and seemed overly complicated in places. There was also this blog article which outlined the technique but didn't have a lot of detail. My problem was that I had to extract a bunch of stuff from an active record object (an Event) and then iterate through several associated objects (Event has_many Days, has_many Providers). So I ended up rolling my own - hopefully these notes will help anyone following on behind.

First lesson - don't try and use rubyzip or zipruby to compress the files when creating the docx file. For reasons I didn't really investigate, they don't work. I'm guessing the default compression is wrong for docx files, but don't have the stamina to wade through the documentation. Use system zip instead.

The approach I took was this:
  1. Create a template. Using MS Word, make a document that is the sort of thing you want to create programmatically. I originally wanted to add images but this complicates things unnecessarily.
  2. Save this as a docx file.
  3. Unzip the docx file. You get a folder containing several subfolders. One of these is called word, and inside that is a file called document.xml. Open it up with something that will format xml nicely - I used NetBeans. First I found the data that needed to come from the Event object. I replaced each piece with a new xml node containing, as text, the name of the method I wanted to call on the Event object. So in place of

    <w:t>My event</w:t>

    I had

    <w:t><insert>fd_event_name</insert></w:t>

    Use the same node name (insert) for every method to be called on this object.
  4. Next find the chunk of xml that represents the associated object. We are going to need to cut this out and put it in a new xml document so that we can iterate over it. So create a new empty document with the same namespace definitions as in document.xml, add a new node called <fragment/> and paste the text you cut from the template document inside it. In place of the cut text in the master template, add a new marker node - in my case, since the cut text displays information about each day of the event, I called the node <days/>. Now work through the fragment and, as before, add a new xml node containing the name of the method to call on the Day object. So in place of

    <w:t>Sat May 7th 2011</w:t>

    I had

    <w:t><insert>date</insert></w:t>

    One refinement I needed to make was to pass an index and count for each associated object so that I could have headings like "Day 1 of 5" - just as before, I added nodes (<index/> and <count/>) to the template where I needed these to appear.
  5. Repeat for other associated objects.
  6. Now we need to create a new Word document using these pieces. I created a method on the Event object:
    def create_docx
      # Substitute fields in the main template
      f = File.read("lib/docx_sections/template.xml")
      doc = substitute(f, self)

      # Insert one substituted fragment per day, just before the <days/> marker
      f = File.read("lib/docx_sections/day.xml")
      self.days.each_with_index do |day, i|
        doc.xpath("//days").before(substitute(f, day, i, self.days.size).xpath("//fragment").children)
      end

      # Same again for providers, before the <providers/> marker
      f = File.read("lib/docx_sections/provider.xml")
      self.providers.each_with_index do |provider, i|
        doc.xpath("//providers").before(substitute(f, provider, i, self.providers.size).xpath("//fragment").children)
      end

      # Remove the marker tags now that the fragments are in place
      doc.xpath("//days").remove
      doc.xpath("//providers").remove

      # Strip the whitespace between tags, then build the docx
      doc = doc.to_s.gsub(/(\n|\t|\r)/, ' ').gsub(/>\s*</, '><').squeeze(' ')
      build_docx(doc)
    end

    Let's go through this line by line. We read in the template.xml file, and call substitute with the file and self as parameters - we'll look at that method later. Then we do the same with the associations - read the template, iterate over the associated objects, call substitute. Then we remove the marker tags, compress the xml file to remove any whitespace we don't need, and build the docx file. Easy.
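
The whitespace-compression step is easy to try on its own. Here is a minimal sketch, assuming the intent is to drop the whitespace a pretty-printer leaves between tags; compact_xml is a name I made up for illustration, and the sample xml is not from the real template.

```ruby
# compact_xml is a hypothetical helper, not part of the original code.
def compact_xml(xml)
  xml.gsub(/[\n\t\r]/, " ")   # newlines and tabs become spaces
     .gsub(/>\s*</, "><")     # drop whitespace between adjacent tags
     .squeeze(" ")            # collapse runs of spaces to a single space
end

# Made-up sample, indented the way an xml formatter would leave it.
pretty = <<~XML
  <w:p>
    <w:r>
      <w:t>Day 1   of 5</w:t>
    </w:r>
  </w:p>
XML

puts compact_xml(pretty)
```

Note that squeeze also collapses repeated spaces inside the text itself, and the second gsub empties a run that contains only whitespace (for example one marked xml:space="preserve"), so this is only safe when the template does not depend on that spacing.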

    So what about the substitute method? It could hardly be simpler. Nokogiri makes it easy to replace the marker nodes we added with the content we want. Find each "insert" node, get the text it contains, call the method of that name on the object and replace the contents of the enclosing node with the result. Similarly, replace the index and count nodes with the parameters we passed in.
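    def substitute(xmlstring, obj, i = 0, count = 1)
      doc = Nokogiri::XML(xmlstring.clone)
      # Each <insert> holds a method name; the result replaces the whole
      # content of its parent (the enclosing <w:t>)
      doc.xpath("//insert").each do |n|
        n.parent.content = obj.send(n.text.to_sym)
      end
      # <index/> becomes the 1-based position, <count/> the total
      doc.xpath("//index").each do |n|
        n.parent.content = i + 1
      end
      doc.xpath("//count").each do |n|
        n.parent.content = count
      end

      doc
    end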
    Finally, the build_docx method is essentially stolen from Levente Bagi.
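    def build_docx(content)
      # Build a filename from the organiser and event names, with whitespace removed
      filename = "#{self.event_organiser.fd_name}_#{self.fd_event_name}".gsub(/\s+/, '')
      in_temp_dir do |temp_dir|
        # Copy the unzipped template, then overwrite document.xml with the generated content
        system("cp -r lib/word_template_files #{temp_dir}/plan_report")
        File.open("#{temp_dir}/plan_report/word/document.xml", "w") do |file|
          file.write(content)
        end
        # Zip with the system zip and copy the result to the download folder
        system("cd #{temp_dir}/plan_report; zip -r ../#{filename}.docx *")
        system("cp #{temp_dir}/#{filename}.docx /home/chaser/downloads")
      end
    end

    def in_temp_dir
      temp_dir = "/tmp/docx_#{Time.now.to_f.to_s}"
      Dir.mkdir(temp_dir)
      yield(temp_dir)
    ensure
      # Clean up even if the block raises
      system("rm -Rf #{temp_dir}")
    end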
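
The shell commands above interpolate the filename straight into a command line, so an event name containing a quote or a semicolon could break them. One possible variation, sketched here with the Ruby standard library, uses Dir.mktmpdir for the temporary folder and the argument-list form of system for zip. The safe_filename helper and the extra parameters (template_dir, output_dir, name) are my own assumptions, not part of the code above.

```ruby
require "fileutils"
require "tmpdir"

# Yield a fresh temporary directory. Dir.mktmpdir removes it when the
# block exits, even if the block raises.
def in_temp_dir(&block)
  Dir.mktmpdir("docx_", &block)
end

# Hypothetical helper: keep only characters that are safe in a filename.
def safe_filename(name)
  name.gsub(/[^A-Za-z0-9_-]/, "")
end

# Sketch of build_docx with paths passed in rather than hard-coded.
def build_docx(content, template_dir, output_dir, name)
  target = File.expand_path("#{safe_filename(name)}.docx", output_dir)
  in_temp_dir do |temp_dir|
    work = File.join(temp_dir, "plan_report")
    FileUtils.cp_r(template_dir, work)
    File.write(File.join(work, "word", "document.xml"), content)
    # Argument-list form: no shell is involved, so nothing is interpolated
    # into a command line. Still relies on the system zip being installed.
    system("zip", "-q", "-r", target, ".", chdir: work) or raise "zip failed"
  end
  target
end
```

Calling build_docx(doc, "lib/word_template_files", "/some/output/folder", "Organiser_My event") would then return the path of the new docx, where the output folder is whatever suits your app.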


As mentioned above, I originally hoped to be able to add images to this document. That would mean understanding how docx files handle assets, and frankly the users will probably want to change the images and layout to suit their needs anyway, so it's almost certainly not worth it. It would be nice to try though ...