Structure in an unstructured world.

Reading unstructured data from Microsoft Word

Many businesses have data which is held in an unstructured format within Microsoft Word, product marketing data, such as descriptions, key features, etc.. which is used on retailer websites. This ‘information’ is generic and not linked to a specific, tangible structured data in any way..

As an example, if we wanted to link the marketing description, which is held in a separate word document for each of our products, to a specific product SKU to use in reporting or to load in to a new system, this would have to be done manually, which is massively inefficient and time consuming.

A solution

Using Alteryx, I addressed the challenge through utalising the Python tool and a little Python scripting to read a list of Word documents from a directory. This was built out in to a custom macro for reusability. Then for each file, read the content, identify each individual line of text and build a ‘list’. This list is then transformed in to a single record, consisting of multiple fields of data for each file.

For fields which have multiple values, such as ‘Key Features’ these are then extracted to create a field per value.

The result, a structured record for each document which can then be utilised as part of other data processes.

The how

The core to this process is the Python tool and the script which is contained within it.

from ayx import Alteryx
import pandas
import docx2txt
fp = Alteryx.read("#1")
print (fp)
filePath = fp.iloc[0]['FullPath']
print(filePath)
text = docx2txt.process(filePath)
content = []
for line in text.splitlines():
    if line != '':
        content.append(line)
print(content)

#Turn the variabe with html page into Pandas' Data frame
df = pandas.DataFrame({"content":[content]})

#Write the data frame to Alteryx workflow for downstream processing
df1 = pandas.DataFrame(df[df.columns[0]].values.tolist(), index=df.index)
Alteryx.write(df1,1)

From the Python tool, the data is then transposed, creating a single row of data for each processed Word document. The resultant data set is then passed as the output to the Macro to be transposed and processed by the remainder of a Alteryx workflow.

In this case, the structure of the word document is replicated with a with a multi row formula for each section with in the document. These formula, cycle through the data to extract the relevant portion of text for each section of text.

Using the directory tool, we can provide a wildcard input for the workflow to dynamically loop through any Word document in the directory location. At this point, it's important to highlight that this workflow is designed to process Word comments of the same format, i.e. the constant layout within the Word documents is uniform and consistent for all documents to be processed.

To extract the first section of text, the Product Title

IF Contains([Value],'Product Title')THEN Substring([Value],14) ELSE 0 ENDIF

That one was nice and easy, it's the first line after all. Finding the product description (3rd multi line formula), takes a little more thought.

IF StartsWith([Value], 'Product Description') THEN 
	[Row+1:Value] + '; ' +
	IF [Row+2:Value] != 'Product Description' THEN 
		[Row+2:Value] + '; ' +
 		IF [Row+3:Value] != 'Product Description' THEN 
			[Row+3:Value] + '; ' +
				IF [Row+4:Value] != 'Product Description' THEN 
					[Row+4:Value] + '; ' +
					IF [Row+5:Value] != 'Product Description' THEN 
						[Row+5:Value] + '; '
					ELSE '' 
					ENDIF
				ELSE '' 
				ENDIF
			ELSE '' 
		ENDIF
		ELSE ' '
	ENDIF 
ELSE 0 
ENDIF

With this one, we know the text section begins with 'Product Description' and the description is always 5 lines or less. So, we find the line that starts with 'Product Description' and then check the lines following, concatenating the values together to create a single text block.

It gets really complex when you have 10 lines and multiple variables to check, as with the 'Key Features'in this example.

IF StartsWith([Value], 'Key Features') THEN 
[Row+1:Value] + '; ' +
IF StartsWith([Row+1:Value], 'Block') OR StartsWith([Row+1:Value], 'Key Features')

THEN '' ELSE

[Row+2:Value] + '; ' +
IF StartsWith([Row+2:Value], 'Block') THEN '' ELSE
	[Row+3:Value] + '; ' +
	IF StartsWith([Row+3:Value], 'Block') THEN '' ELSE
	    [Row+4:Value] + '; ' + 
	    IF StartsWith([Row+4:Value], 'Block') THEN '' ELSE
		[Row+5:Value] + '; '+ 
		IF StartsWith([Row+6:Value], 'Block') THEN '' ELSE

                    [Row+6:Value] + ';' +

        		IF StartsWith([Row+7:Value], 'Block') THEN
							''
			    ELSE [Row+7:Value] + ';' +
				IF StartsWith([Row+8:Value], 'Block') OR

                                    StartsWith([Row+7:Value], 'Block') OR

                                      StartsWith([Row+6:Value], 'Block') THEN
							''
				    ELSE [Row+8:Value] + ';' +
					IF StartsWith([Row+9:Value], 'Block') OR

                                         StartsWith([Row+8:Value], 'Block') OR

                                          StartsWith([Row+7:Value], 'Block') OR

                                          StartsWith([Row+6:Value], 'Block')THEN
									''
	        			    ELSE [Row+9:Value] + ';' +
						IF StartsWith([Row+10:Value], 'Block') OR

                                                    StartsWith([Row+9:Value], 'Block') OR

                                                    StartsWith([Row+8:Value], 'Block') OR

                                                    StartsWith([Row+7:Value], 'Block') OR

                                                    StartsWith([Row+6:Value], 'Block')

                                                    THEN
										''
							ELSE [Row+10:Value]
						ENDIF 
					    ENDIF 
					ENDIF 
				ENDIF 
			ENDIF 
		ENDIF
	ENDIF
			 
ENDIF

ENDIF 
ELSE 0 
ENDIF

All the parsed text blocks are then union joined together and out put, in this case to an Excel

spreadsheet.

The future

In terms of Alteryx, the hard work of having to think through these such challenges and build your own solutions may very well be a thing of the past. With the introduction of the Alteryx Intelligence Suit in June 202, optical character recognition, text mining and natural language processing tools, gives the none data scientists among us (and I am definitely one of those) a fighting chance. Maybe more on the Intelligence Suit in a future post..

Search This Blog

Analytics and Things