Structure in an unstructured world.
Reading unstructured data from Microsoft Word
Many businesses have data which is held in an unstructured format within Microsoft Word, product marketing data, such as descriptions, key features, etc.. which is used on retailer websites. This ‘information’ is generic and not linked to a specific, tangible structured data in any way..
As an example, if we wanted to link the marketing description, which is held in a separate word document for each of our products, to a specific product SKU to use in reporting or to load in to a new system, this would have to be done manually, which is massively inefficient and time consuming.
A solution
Using Alteryx, I addressed the challenge through utalising the Python tool and a little Python scripting to read a list of Word documents from a directory. This was built out in to a custom macro for reusability. Then for each file, read the content, identify each individual line of text and build a ‘list’. This list is then transformed in to a single record, consisting of multiple fields of data for each file.
For fields which have multiple values, such as ‘Key Features’ these are then extracted to create a field per value.
The result, a structured record for each document which can then be utilised as part of other data processes.
The how
The core to this process is the Python tool and the script which is contained within it.
from ayx import Alteryximport pandasimport docx2txtfp = Alteryx.read("#1")print (fp)filePath = fp.iloc[0]['FullPath']print(filePath)text = docx2txt.process(filePath)content = []for line in text.splitlines():if line != '':content.append(line)print(content)#Turn the variabe with html page into Pandas' Data framedf = pandas.DataFrame({"content":[content]})#Write the data frame to Alteryx workflow for downstream processingdf1 = pandas.DataFrame(df[df.columns[0]].values.tolist(), index=df.index)Alteryx.write(df1,1)
From the Python tool, the data is then transposed, creating a single row of data for each processed Word document. The resultant data set is then passed as the output to the Macro to be transposed and processed by the remainder of a Alteryx workflow.
In this case, the structure of the word document is replicated with a with a multi row formula for each section with in the document. These formula, cycle through the data to extract the relevant portion of text for each section of text.
Using the directory tool, we can provide a wildcard input for the workflow to dynamically loop through any Word document in the directory location. At this point, it's important to highlight that this workflow is designed to process Word comments of the same format, i.e. the constant layout within the Word documents is uniform and consistent for all documents to be processed.
To extract the first section of text, the Product Title
IF Contains([Value],'Product Title')THEN Substring([Value],14) ELSE 0 ENDIF
That one was nice and easy, it's the first line after all. Finding the product description (3rd multi line formula), takes a little more thought.
IF StartsWith([Value], 'Product Description') THEN
[Row+1:Value] + '; ' +
IF [Row+2:Value] != 'Product Description' THEN
[Row+2:Value] + '; ' +
IF [Row+3:Value] != 'Product Description' THEN
[Row+3:Value] + '; ' +
IF [Row+4:Value] != 'Product Description' THEN
[Row+4:Value] + '; ' +
IF [Row+5:Value] != 'Product Description' THEN
[Row+5:Value] + '; '
ELSE ''
ENDIF
ELSE ''
ENDIF
ELSE ''
ENDIF
ELSE ' '
ENDIF
ELSE 0
ENDIF
With this one, we know the text section begins with 'Product Description' and the description is always 5 lines or less. So, we find the line that starts with 'Product Description' and then check the lines following, concatenating the values together to create a single text block.
It gets really complex when you have 10 lines and multiple variables to check, as with the 'Key Features'in this example.
IF StartsWith([Value], 'Key Features') THEN
[Row+1:Value] + '; ' +
IF StartsWith([Row+1:Value], 'Block') OR StartsWith([Row+1:Value], 'Key Features')
THEN '' ELSE
[Row+2:Value] + '; ' +
IF StartsWith([Row+2:Value], 'Block') THEN '' ELSE
[Row+3:Value] + '; ' +
IF StartsWith([Row+3:Value], 'Block') THEN '' ELSE
[Row+4:Value] + '; ' +
IF StartsWith([Row+4:Value], 'Block') THEN '' ELSE
[Row+5:Value] + '; '+
IF StartsWith([Row+6:Value], 'Block') THEN '' ELSE
[Row+6:Value] + ';' +
IF StartsWith([Row+7:Value], 'Block') THEN
''
ELSE [Row+7:Value] + ';' +
IF StartsWith([Row+8:Value], 'Block') OR
StartsWith([Row+7:Value], 'Block') OR
StartsWith([Row+6:Value], 'Block') THEN
''
ELSE [Row+8:Value] + ';' +
IF StartsWith([Row+9:Value], 'Block') OR
StartsWith([Row+8:Value], 'Block') OR
StartsWith([Row+7:Value], 'Block') OR
StartsWith([Row+6:Value], 'Block')THEN
''
ELSE [Row+9:Value] + ';' +
IF StartsWith([Row+10:Value], 'Block') OR
StartsWith([Row+9:Value], 'Block') OR
StartsWith([Row+8:Value], 'Block') OR
StartsWith([Row+7:Value], 'Block') OR
StartsWith([Row+6:Value], 'Block')
THEN
''
ELSE [Row+10:Value]
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ELSE 0
ENDIF
All the parsed text blocks are then union joined together and out put, in this case to an Excel
spreadsheet.
The future
In terms of Alteryx, the hard work of having to think through these such challenges and build your own solutions may very well be a thing of the past. With the introduction of the Alteryx Intelligence Suit in June 202, optical character recognition, text mining and natural language processing tools, gives the none data scientists among us (and I am definitely one of those) a fighting chance. Maybe more on the Intelligence Suit in a future post..
Comments
Post a Comment