SystemWeaver provides a simple Word Importer example program in our GitHub repository. It reads a Word document and creates a hierarchy of sections in SystemWeaver that corresponds to the headings of the document. This article covers steps to consider when using a Word importer.
Adjusting Paragraphs in Word
The MS Word importer assumes that a document is organized as a sequence of Heading and Normal paragraphs. For example:
Scope (Style: Heading 1)
Scope text (Style: Normal)
Introduction (Style: Heading 1)
Introduction text (Style: Normal)
Definitions (Style: Heading 1)
Abbreviations (Style: Heading 2)
Upon import, each Heading paragraph will be translated into a Section item, with the content of the Paragraph used as the item name. The following Normal paragraph will be used as the Description of the Section item.
If a Heading paragraph has a lower level (= higher number) than the preceding Heading, the corresponding Section becomes a Subsection of the preceding Section. Otherwise, it becomes a Subsection of the most recent Section whose Heading number is one lower. Avoid Heading sequences that skip a level, such as Heading 1 followed directly by Heading 3, since no clear section structure can be defined for them.
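The nesting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the importer's actual code: the `Section` class and `build_sections` function are hypothetical names, the input is assumed to be a list of (style, text) pairs already extracted from Word, and joining several consecutive Normal paragraphs into one description is an assumption. The sketch does not detect skipped levels such as Heading 1 followed by Heading 3.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for a SystemWeaver Section item."""
    name: str
    level: int
    description: str = ""
    subsections: list = field(default_factory=list)


def build_sections(paragraphs):
    """Turn (style, text) pairs into a tree of Section objects.

    `paragraphs` looks like [("Heading 1", "Scope"), ("Normal", "Scope text")].
    """
    root = Section(name="<document>", level=0)
    stack = [root]  # stack[-1] is the most recently created section
    for style, text in paragraphs:
        if style.startswith("Heading "):
            level = int(style.split()[1])
            # Pop until the top of the stack has a lower heading number:
            # that section is the parent of the new one.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(name=text, level=level)
            stack[-1].subsections.append(section)
            stack.append(section)
        elif style == "Normal" and stack[-1] is not root:
            # Assumption: consecutive Normal paragraphs are joined
            # into the description of the current section.
            current = stack[-1]
            current.description = (current.description + "\n" + text).strip()
    return root


example = [
    ("Heading 1", "Scope"),
    ("Normal", "Scope text"),
    ("Heading 1", "Introduction"),
    ("Normal", "Introduction text"),
    ("Heading 1", "Definitions"),
    ("Heading 2", "Abbreviations"),
]
tree = build_sections(example)
```

With the example document from above, `tree` has three top-level sections, and "Abbreviations" ends up as a subsection of "Definitions".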
If you converted a PDF to a Word file, note that Word does not handle all PDF formatting perfectly, so some adjustments may be required (see the example below for some strange formatting in a PDF).
- Split headings (typically in bold text, like the examples below) from the body text where a heading runs on into the text that follows it.
- Make sure the heading text is in the same paragraph as the paragraph number.
In the PDF example below, Word will be able to convert paragraph 3.2 correctly, while 3.2.1 will require subsequent adjustments in Word.
- Create proper paragraph breaks before each paragraph and paragraph number. If, for example, there is a table before a new text section, there may not be a proper paragraph break.
Tip: Turn on the display of paragraph marks in Word.
- Adjust the Heading level. For example, Word sometimes considers “1.2.3” to be Heading 2.
- Some paragraphs may not be numbered originally, but setting them to a numbered paragraph style such as Heading 2 could make the import easier. For example, a paragraph called “Annex A” could be set to Heading 1. (There are also configuration options during the import that could handle such cases, but as noted, it could be easier to fix in Word.)
- Make sure no section becomes too big or contains too many pictures, because that may exceed the data size limit of SystemWeaver. One way to deal with this is to break the section up into several sections (several Heading/Normal sequences).
In practice, you may have to review a document manually, which can take a couple of hours for a 100-page document.
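Part of that review could be supported by a small script. The following is a rough sketch under stated assumptions: the paragraphs have already been extracted from Word as (style, text) pairs (for example with a library such as python-docx, which is not used here), and `check_headings` is a hypothetical helper, not part of the importer. It flags three of the problems listed above: skipped Heading levels, a paragraph number whose depth does not match the Heading level, and a heading that contains only a number.

```python
import re

# Leading paragraph number such as "3", "3.2" or "3.2.1",
# optionally followed by a dot, then whitespace or end of text.
NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?(?:\s+|$)")


def check_headings(paragraphs):
    """Return a list of warning strings for a list of (style, text) pairs."""
    warnings = []
    previous_level = 0
    for index, (style, text) in enumerate(paragraphs):
        if not style.startswith("Heading "):
            continue
        level = int(style.split()[1])
        text = text.strip()
        if level > previous_level + 1:
            warnings.append(
                f"{index}: level jumps from {previous_level} to {level}: {text!r}"
            )
        match = NUMBER_RE.match(text)
        if match:
            number = match.group(1)
            depth = number.count(".") + 1
            if depth != level:
                warnings.append(
                    f"{index}: number {number} suggests Heading {depth}, "
                    f"but the style is Heading {level}"
                )
            if not text[match.end():].strip():
                warnings.append(f"{index}: heading contains only a number: {text!r}")
        previous_level = level
    return warnings


problems = check_headings([
    ("Heading 1", "1 Scope"),
    ("Normal", "Scope text"),
    ("Heading 3", "1.1.1 Details"),
    ("Heading 2", "1.2.3 Timing"),
    ("Heading 2", "1.3"),
])
for line in problems:
    print(line)
```

The warnings are only hints. A heading that legitimately starts with a number (a year, for instance) would also be flagged, so the result still needs a manual look.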
Performing the Import
It is recommended that your first import attempt be made into a library where it’s easy to scrap the results, e.g., a dedicated library for each import. This way, it may be easier to go back and fix things in Word and then repeat the import. If you have missed a paragraph break in Word, it will be apparent in SystemWeaver from the Section numbering in the Document view, since the numbering included in the Section names will differ from the automatic numbering in SystemWeaver (assuming that there were actual paragraph numbers in the original Word/PDF file). You will, of course, have to remove such paragraph numbers from the Section names at a later stage in SystemWeaver.
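The numbering comparison described above can also be illustrated in code. This is only a sketch: it assumes the imported sections are available as (level, name) pairs in document order, and the helper names `auto_numbers`, `find_mismatches` and `strip_number` are hypothetical. It does not call any SystemWeaver API.

```python
import re

# Paragraph number at the start of a section name, e.g. "2.1 Background".
NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")


def auto_numbers(levels):
    """Compute outline numbers ("1", "1.1", "2", ...) from heading levels."""
    counters = []
    numbers = []
    for level in levels:
        # Pad if a level was skipped, then drop counters of deeper levels.
        while len(counters) < level:
            counters.append(0)
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


def find_mismatches(sections):
    """Return (expected_number, name) for names whose own number differs."""
    levels = [level for level, _ in sections]
    mismatches = []
    for (_, name), expected in zip(sections, auto_numbers(levels)):
        match = NUMBER_RE.match(name)
        if match and match.group(1) != expected:
            mismatches.append((expected, name))
    return mismatches


def strip_number(name):
    """Remove a leading paragraph number from a section name, if present."""
    return NUMBER_RE.sub("", name, count=1)


imported = [
    (1, "1 Scope"),
    (1, "2 Introduction"),
    (2, "2.1 Background"),
    (1, "4 Definitions"),  # section 3 was swallowed by a missing paragraph break
]
mismatches = find_mismatches(imported)
```

Here `mismatches` points at "4 Definitions", which would be numbered 3 automatically, so the missing break is somewhere just before it. Once the structure is correct, `strip_number` shows how the leftover numbers could be removed from the names.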
Manual Adjustments After Import
In some cases, the indentation of some content, such as a specific table, can be negative, which makes the Description appear truncated. A quick way to fix this is to:
- Turn on the paragraph and formatting markers so you can clearly see that you have selected a complete paragraph.
- Select the whole description or section, then use the Paragraph/Paragraph... option and, under Indents and spacing, set Indentation, pixels/Left to 0.
In some cases, tables are not “anchored” to the correct paragraph when converted to Word, and may even be split across separate paragraphs. This has to be handled separately.