Mahdi Jaberzadeh Ansari recently finished his work on a Master of Computer Science degree at the University of Bonn. As part of this degree, he has been working on an automated way to convert documents between Fidus Writer and Microsoft Word. Jaberzadeh Ansari’s work is part of the ongoing Opening Scholarly Communication in Social Sciences (OSCOSS) collaborative project between the University of Bonn and GESIS – Leibniz Institute for the Social Sciences in Cologne. The project is financed by the Deutsche Forschungsgemeinschaft. In this interview, Jaberzadeh Ansari describes some of the issues he has come across, critically assesses Fidus Writer and explains where he believes it will go next.
When you began working on the Fidus Writer to Word and Word to Fidus Writer filters half a year ago, what did you hope to achieve?
First of all, I have to correct your question. I didn’t try to write a filter for Fidus Writer. You already have some filters to import data from other sources into Fidus Writer. What I tried to create was a converter. My converter is a separate piece of software that can be used on any OS (as it has been written in Java), and it is also possible to integrate it with your web-based application.
We initially wanted to create tools like SourceTree and Git for collaboration on DOCX files. Following this line of thinking, we needed to understand the DOCX format completely, as well as test different ideas for merging and comparing DOCX files. We decided to create this converter as part of the OSCOSS project, both to help that project and to help us achieve some of our other goals.
One idea we had was to create a grammar for DOCX files with Xtext and convert it to Fidus-files with Xtend. But that idea was very ambitious and failed right from the beginning. Thus we tested some other ideas.
How did that work out? Did you run into technical issues with either Fidus Writer or Word?
Yes, we had lots of problems with both. On the Fidus Writer side, there was a shortage of documentation, and using the Fidus file type as a standard was problematic. On the DOCX side, the issue was the huge amount of documentation and the many exceptions in it. The same object can be represented in a DOCX file in several different ways, so dealing with all of these was time-consuming. Incompatibility between the two file types was another problem. There are lots of items in DOCX files that are not supported in Fidus files, and thus we ignored them.
Was it overall harder or easier than you had expected to create these filters? What part was harder/easier than you had thought?
No parts were specifically easy or hard. It was just time-consuming. It took more than 8 months to implement this software. The difficulty was dealing with each of the different features that DOCX files have. It is easier to create a new DOCX file, and harder to interpret what is inside an existing DOCX file. When creating a DOCX file, you can, for example, simply select a few possible ways of showing an image but if you need to read a DOCX file, you have to be careful about a lot of exceptions that exist in the OOXML standard that the DOCX format is based upon.
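To illustrate why reading is harder than writing, here is a minimal sketch that is not taken from the converter. It assumes the common case in which a picture in `word/document.xml` appears either inside a `w:drawing` element (DrawingML) or inside a legacy `w:pict` element (VML), so a reader has to look for both. The class and method names are hypothetical, and the sample markup is a stripped-down stand-in for a real document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ImageMarkupCounter {
    // Main WordprocessingML namespace used by word/document.xml.
    public static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts elements with the given local name in the WordprocessingML namespace. */
    public static int count(String documentXml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(documentXml)));
        return doc.getElementsByTagNameNS(W_NS, localName).getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: one DrawingML container and one VML container.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:drawing/></w:r></w:p>"
            + "<w:p><w:r><w:pict/></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.println("w:drawing containers: " + count(xml, "drawing"));
        System.out.println("w:pict containers: " + count(xml, "pict"));
    }
}
```

A writer can pick one of these forms and always emit it, while a reader that checks only one of them would silently miss images stored in the other.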
Word is hated among many of those who are working with journal formatting because there is much cleanup that needs to happen. Projects like Fidus Writer are trying to replace Word with an editor that requires much less cleanup. Yet most people writing scientific articles continue to use Word. Why do you think that is?
Many people also hate Windows. But if you look closely, Windows is still the most popular operating system among users. I know there are some specialist users who prefer open source operating systems, but I am talking about ordinary users.
I also hate using LaTeX. It is a piece of software with a design from the 1970s, but it is the most powerful editing system for scientific writing. I liked the idea behind Fidus Writer, but it seems that you are swimming against the current. I don’t mean to say that you should give up the idea of Fidus Writer. No, it is a good idea. I am speaking more about the technology, the file structure and the features. You sit with your programmers and decide to do something, then after some months you decide to change the design, and this happened several times during the months that I monitored the product. You kept the old structure and added some new options to it, for example for equations. But I think it is better to sit together and reach a stable design for the file and then start the programming. I think that fits software engineering practice better. [Ed.: We are not sure what Jaberzadeh Ansari is referring to here. With the exception of footnotes, the Fidus file format shouldn’t have changed since he started working on it.] You cannot create something that is both super complicated and super intuitive.
For example, we did an online survey and surprisingly the number of people who used Authorea for scientific purposes was almost zero out of 213 interviewees, even though it is claimed Authorea is a Google Docs for scientific purposes. Therefore, I think creating a tool for a special use must consider many different categories of end users. While I bet Fidus Writer can be popular among social scientists, I don’t think it can fulfill the needs of all types of scientific communities.
You created a little graphical user interface to be used with your filters. If these filters were to be used in a production environment, in what kinds of workflows would you think it would make sense to use your filters, and how would they best be integrated?
I made it at first as a command line product. Then I added a package to give it a graphical user interface. That package is just 3% to 5% of the code. I think it is possible to create a JAR file or a CGI program out of it and use it inside other environments. I think the integration of different programs in the back-end can be solved relatively easily.
I think the best approach is to provide both tools: develop a graphical user interface for users to install on their systems, and provide a web-based tool that receives files from the user, passes them to the CGI program and returns the result to the user.
Fidus Writer is written in Python and JavaScript, yet you chose to write your filter software in Java. What was the reasoning behind that, and was the difference in programming language ever an issue for you?
After failing to create some general solutions with Xtext, Xtend and XSLT, we decided to use libraries. We started implementations with some of them. Each of them has some limitations. We selected four different programming languages and looked for some libraries that had been written in those languages. Here is the list of the selected languages and the reasons for selecting each of them:
1. Python: The Fidus Writer backend has been developed in the Python language. Thus, this language was the first choice for developing a backend tool to support Fidus Writer.
2. JavaScript: A general solution using JavaScript could be used in almost all browsers and was the solution other web-based authoring tools used as well.
3. Node.js: The most popular server-side platform for developing web-based tools that support real-time connections and streams, Node.js (JavaScript) is an interesting option.
4. Java: The most popular and powerful object-oriented language, able to run on almost any platform, Java was one of the best languages for both desktop-based and web-based (server-side) tools.
Finally, we selected the docx4j Java library because of its powerful support for reading and writing DOCX files. If you want to know more details about the libraries that we considered, look at the following table. The most important point related to docx4j is its helper add-in, which generates the code for creating some parts of DOCX files.
Also, it is possible to extend the functionality of the library without coding. For example, when docx4j does not support some part of DOCX programmatically, we can inject the final OOXML markup for that part into the file that is created with docx4j. This ability is the result of docx4j using JAXB for modeling and parsing the DOCX file.
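The idea of injecting finished markup can be sketched without docx4j. The example below is an illustration only: it swaps docx4j and JAXB for the JDK's DOM API, and the class name, method names and sample markup are hypothetical. It parses a raw WordprocessingML fragment and appends it to the `w:body` of an existing document tree, which is the same general move as handing a library ready-made OOXML for a feature it cannot build programmatically.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RawMarkupInjector {
    // Main WordprocessingML namespace used by word/document.xml.
    public static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses an XML string into a namespace-aware DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Appends a raw markup fragment to the first w:body and returns the serialized result. */
    public static String inject(String documentXml, String fragmentXml) throws Exception {
        Document doc = parse(documentXml);
        Node body = doc.getElementsByTagNameNS(W_NS, "body").item(0);
        if (body == null) {
            throw new IllegalArgumentException("document has no w:body element");
        }
        // Copy the fragment into the target document before attaching it.
        Node fragment = doc.importNode(parse(fragmentXml).getDocumentElement(), true);
        // Simplification: a real document usually ends its body with w:sectPr,
        // and the fragment would have to be inserted before that element.
        body.appendChild(fragment);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String document = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p/></w:body></w:document>";
        String fragment = "<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:t>Injected</w:t></w:r></w:p>";
        System.out.println(inject(document, fragment));
    }
}
```

The fragment has to declare the `w` namespace itself, because it is parsed on its own before being merged into the document.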
At the end of this answer, I have to conclude that it is preferable to use libraries in the same language for a project. But what can be done when there is no library or the libraries that do exist have lots of deficiencies?
Saying “I hate this programming language” is like someone saying “I hate English, I want to speak French and people who want to speak with me must go and learn French.” Languages should be for communication and nothing more. No language has an inherent benefit over the others. Programming languages are used to develop software. Maybe someone doesn’t like a language, but when he needs to, he has to use it.
Library | python-docx 0.8.6 | docx4js | html-docx-js | mammoth.js | Apache POI | docx4j |
Language | Python | JavaScript (Node.js) | JavaScript | JavaScript | Java | Java |
Read, Write | RW | R | W | R | RW | RW |
Title | × | × | × | × | × | ✓ |
Subtitle | × | × | × | × | × | ✓ |
Author | × | × | × | × | × | ✓ |
Institute | × | × | × | × | × | ✓ |
Abstract | × | × | × | × | × | ✓ |
Keyword | × | × | × | × | × | ✓ |
Date | × | × | × | × | × | ✓ |
Header, Footer | ✓ | ✓ | × | × | ✓ | ✓ |
Footnote, Endnote | × | × | × | ✓ | ✓ | ✓ |
Bookmark | × | ✓ | × | × | ✓ | ✓ |
Cross-Reference | × | × | × | × | × | ✓ |
Page number | × | × | × | × | ✓ | ✓ |
Page break | ✓ | ✓ | ✓ | × | ✓ | ✓ |
Hyperlink | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Figure | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Equation | × | ✓ | × | × | × | ✓ |
Caption | × | ✓ | × | ✓ | × | ✓ |
Citation | × | × | × | × | × | ✓ |
Bibliography | × | × | × | × | × | ✓ |
Styling | ✓ | × | × | ✓ | ✓ | ✓ |
Bold, Italic, Underlining | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Superscript, Subscript | × | × | ✓ | ✓ | ✓ | ✓ |
Bullet list | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Numbered list | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Heading | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Comment | × | × | × | × | ✓ | ✓ |
Extensibility without programming | × | × | × | × | × | ✓ |
Total | 9 | 12 | 9 | 11 | 15 | 28 |
(Table may not be completely accurate/up-to-date as it was somewhat difficult to find information about some of the libraries.)
If you could travel back in time to the start of 2016 and start the programming of the filters from scratch, would you do something differently? If yes, what and why?
No, never. The reason is clear. I think we selected the best way and the fastest. Of course, if we were a group of 6 to 10 developers, then we could create a library that supports the needed features in JavaScript or Python and then use it. But even with 10 programmers, it would take more than 1 year to do something like that. At the moment the best choice is using docx4j for working with DOCX files.
Will your converter be made publicly available? If yes, under what license and where can people get it?
Yes, it has been published on GitHub at the following address:
https://github.com/mjza/MSThesis_Fidus_Docx_Converter
It is completely free and licensed under the MIT license.