Extract data from a document - HxGN SDx - Update 64

This functionality was modified in an update. For more information, see Extract data from a document (modified in an update).

You can extract data from a document, which updates all relationships between the tags and the document.

How can I configure DCOM to allow content extraction from Microsoft Office files?

You must enable DCOM permissions before HxGN SDx can access Microsoft Office applications.

This is mandatory if you want to extract content from:

a Microsoft Excel file using the Data Capture Datasheet Reader Pre-Processor.
any Microsoft Office file (97-2003) using the Data Capture Office Reader Pre-Processor.

To set the DCOM configuration for the respective file type application, complete the following steps:

Click Start > Administrative Tools > Component Services.
In the tree view, expand Component Services > Computers > My Computer > DCOM Config.
Based on the Microsoft Office file type, locate and right-click the respective DCOM Config component service:
- Microsoft Excel Application
- Microsoft Word 97-2003
- Microsoft PowerPoint Slide (97-2003)
On the shortcut menu, click Properties.
In the General tab, set the Authentication Level to None.
In the Identity tab, select The Launching User option.
In the Security tab, set the Launch and Activation Permissions to Customize, and click Edit.
1. Add the Administrators created by Server Manager.
2. Select the Allow check box for the following items:
  - Local Launch
  - Remote Launch
  - Local Activation
  - Remote Activation
  - Read
  - Special permissions

What is the purpose of a default template group?

When you use the Data Capture Content Discovery Task in the Desktop Client or Extract Content in the Web Client to extract content from multiple documents, the software automatically applies the templates and rules defined for the default template group. The default template group is considered only when a PDF file or a drawing file is attached to the document. To extract content successfully, ensure that templates and rules are configured for that template group. If you have not set any template group as the default, the software uses the DefaultDrawingTemplateGroup template group, which is provided with the software.

In the Web Client, when you extract content from a single document, the software automatically applies the templates and rules defined for the auto-selected default template group. The default template group is considered only when a PDF file or a drawing file is attached to the document. However, you can select and apply any other template group instead of the default template group. For more information, see Extract data from a document.

For the auto-selected default template group, the Match Tag Patterns option is pre-selected.
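
The selection rules above can be summarized in a short sketch. This is only an illustration of the described behavior, not the product's implementation: the function name, its parameters, and the set of file extensions are hypothetical, and only the DefaultDrawingTemplateGroup name comes from the text.

```python
from typing import Optional

# Hypothetical extension set; the text only says "a PDF file or a drawing file".
DEFAULT_GROUP_FILE_TYPES = {".pdf", ".dwg"}
# Name of the template group that ships with the software (from the text above).
FALLBACK_GROUP = "DefaultDrawingTemplateGroup"


def resolve_template_group(file_extension: str,
                           configured_default: Optional[str] = None,
                           user_choice: Optional[str] = None) -> Optional[str]:
    """Return the template group that would apply, or None if no default applies."""
    # An explicit selection replaces the default (single-document extraction).
    if user_choice:
        return user_choice
    # The default group is considered only for PDF and drawing files.
    if file_extension.lower() not in DEFAULT_GROUP_FILE_TYPES:
        return None
    # With no default chosen, fall back to the group provided with the software.
    return configured_default or FALLBACK_GROUP
```

For example, under these assumptions `resolve_template_group(".pdf")` returns the shipped fallback group, while a spreadsheet extension returns `None` because no default group is considered for it.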

How is the content extracted from a document with more than one file attached?

When processing a document with more than one file attached, the following scenarios are considered:

If different file types are attached, the software first checks for a file with the ISPFNMasterFile interface, and content is extracted from that file.
If a file with the ISPFNMasterFile interface is not found, content is extracted from the file with the highest priority. By default, files with the .dwg file extension have the highest priority. However, you can change the file priority in the Data Capture Central Settings module in the Desktop Client. For more information, see Manage file types and prioritize them for content extraction.
If more than one file of the highest priority is attached, you must choose the file to extract content from in the Extract Content window.
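
The three scenarios above amount to a small decision procedure, sketched below. The `AttachedFile` type, its fields, and the convention that a lower number means a higher priority are assumptions made for illustration. The sketch also assumes at most one master file per document, which the text does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttachedFile:
    name: str
    is_master: bool = False  # stands in for the ISPFNMasterFile interface
    priority: int = 0        # hypothetical: lower number = higher priority


def candidate_files(files: List[AttachedFile]) -> List[AttachedFile]:
    """Return the file(s) that extraction would consider.

    One result means the choice is automatic. Several results mean the user
    has to pick one in the Extract Content window.
    """
    if not files:
        return []
    # 1. A master file takes precedence over everything else.
    masters = [f for f in files if f.is_master]
    if masters:
        return masters[:1]
    # 2. Otherwise keep every file that shares the highest priority.
    best = min(f.priority for f in files)
    return [f for f in files if f.priority == best]
```

If two .dwg files share the highest priority, the function returns both, which corresponds to the case where you choose the file yourself.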

How can I extract double-byte characters from the content?

By default, Data Capture supports a few double-byte characters when extracting content from a document. To use this capability, ask your administrator to enable the Extract matching double-byte characters as alias tag option on the Master Tag Patterns page of the Data Capture Administration module in the Desktop Client. For more information, see Create a new master tag pattern.

Before you start content extraction, we recommend that you check whether Data Capture supports the double-byte character you want to extract by completing the following steps:

Go to https://www.dotnetfiddle.net in a browser.
Copy and paste the following Data Capture double-byte character conversion code snippet in the Code Editor box:

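using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        // Convert the character from Unicode to Windows code page 1252 and print the result.
        // A character that cannot be converted is printed as "?".
        Console.WriteLine(Encoding.GetEncoding(1252).GetChars(Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), Encoding.Unicode.GetBytes("<double-byte character to be converted>"))));
    }
}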
Click Run to execute the code and verify the following:
- If the output is a single-byte character, the double-byte character is converted successfully.
- If the output is a ?, the double-byte character conversion failed.

To configure Data Capture to extract unsupported double-byte characters, do the following:

Before doing this, set the query scope of your site to the (Scope Not Set) Configuration Top and load the Data Capture sample data. For more information on how to load the sample data, see Load the sample data for your site.

In the Desktop Client, under the Document classifications tree, expand the Data Capture Administration node.
Under the Data Capture Administration node, right-click the Data Capture Miscellaneous node, and click Show Template Documents.
Right-click the Double Byte Substitution List template in the tree view, and click Edit > Check Out to check out the template and edit the DoubleByteSubstitutionList.xml.
In the file, based on the type of double-byte character to be converted, copy and paste the following XML code and replace the child node value with the double-byte character's Unicode value and the parent node value with the corresponding single-byte character's Unicode value.

<SubstituteChar UnicodeValue="Single-byte character's unicode value"> <!-- Parent node -->
    <SubstituteeChar>Double-byte character's unicode value</SubstituteeChar> <!-- Child node -->
</SubstituteChar>

For example,

<SubstituteChar UnicodeValue="007E"> <!-- Unicode value for the single-byte tilde ~ -->
    <SubstituteeChar>ff5e</SubstituteeChar> <!-- Unicode value for the double-byte (fullwidth) tilde ～ -->
</SubstituteChar>

This sample code converts the double-byte character ～ to the single-byte character ~.
Check in the template with the updated XML file.

Data Capture applies the conditions in the template file when processing the content with double-byte characters.
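
If you need to add several substitutions, the Unicode values can be computed instead of looked up by hand. The following is a minimal sketch that builds one entry in the shape of the example above. The helper name `substitution_entry` is hypothetical. The four-digit hexadecimal formatting and the letter case (uppercase for the parent, lowercase for the child) simply mirror the example, and whether Data Capture is case-sensitive here is not stated. The sketch produces only the entry, not the surrounding structure of DoubleByteSubstitutionList.xml.

```python
import xml.etree.ElementTree as ET


def substitution_entry(single_byte: str, double_byte: str) -> str:
    """Build one SubstituteChar entry mapping double_byte to single_byte."""
    if len(single_byte) != 1 or len(double_byte) != 1:
        raise ValueError("each argument must be exactly one character")
    # Parent node: code point of the single-byte replacement character.
    parent = ET.Element("SubstituteChar",
                        UnicodeValue=format(ord(single_byte), "04X"))
    # Child node: code point of the double-byte character being replaced.
    child = ET.SubElement(parent, "SubstituteeChar")
    child.text = format(ord(double_byte), "04x")
    return ET.tostring(parent, encoding="unicode")


# Fullwidth tilde (U+FF5E) mapped to the ASCII tilde (U+007E).
print(substitution_entry("~", "\uff5e"))
```

The printed entry can then be pasted into the checked-out template file in the same way as the manual example.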

To extract content from a document, use the Web Client to perform the following steps:

To extract content from 3D model .zvf and .mdb2 files, we recommend that you install Microsoft Access database engine 2010 (64-bit) on the HxGN SDx application server.

Click Documents > All Documents.
To extract data, select a document from the All Documents list, and click Actions > Extract Content.
In the Extract Content window, do the following:
1. Select any file attached to the document from the Select File list.
2. Select a template group from the TemplateGroup/Template list to apply the processing rules using the corresponding Preprocessor Reader.
From Update 14 onwards, for PDF and drawing files, a template group which is set as default in the Data Capture Drawing Reader Pre-Processor and PDF Reader Pre-Processor is automatically selected and applied for extracting content. However, you can select and apply any template group from the TemplateGroup/Template list to process the content. For more information, see Manage drawing reader pre-processor templates and template groups and Manage PDF reader pre-processor templates.
Click OK.

Tip: To use preprocessed content files for processing the file, click Show more and apply more options as follows:

Click this	To do this
Use Existing PreProcessed Content Files	Process the file using the available preprocessed content XML file. To extract content using the preprocessed content files, we recommend that you attach the ContentFile.xml along with the corresponding file to the document. You must also attach GraphicsMapFile.xml if the file type supports graphical navigation.
Reader Pre-Processor	Select the appropriate Preprocessor Reader for processing the datasheet file.
OleDB Provider box (for a Hexagon 3D model)	Type the connection string to connect to the Microsoft Access database.
Match Tag Patterns check box (for a Hexagon 3D model)	Extract the tags based on the tag patterns defined in the Tag Discovery Patterns module.
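
For the OleDB Provider box, the following sketch shows how a connection string for the Microsoft Access database engine is typically composed. Microsoft.ACE.OLEDB.12.0 is the provider name installed by the Access database engine 2010. The helper name and the sample path are hypothetical, and the assumption that HxGN SDx expects the string in exactly this form should be confirmed with your administrator.

```python
def ace_connection_string(database_path: str) -> str:
    """Return an OLE DB connection string for the Access database engine."""
    if not database_path:
        raise ValueError("database_path must not be empty")
    # Microsoft.ACE.OLEDB.12.0 is the provider registered by the
    # Microsoft Access database engine 2010 redistributable.
    return f"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={database_path};"


# Hypothetical path to a 3D model database on the application server.
print(ace_connection_string(r"C:\Models\plant.mdb2"))
```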

Before you work with large amounts of data, we recommend that you configure the LicenseTimeoutSeconds property under the Site Settings node in HxGN SDx Server Manager, based on the size of the data. This setting prevents the license token from timing out, so the session is retained.
To view the status of content extraction from a selected document:
- On the Actions menu, click Show the detail form > Extract Content.
  
  For more information about the status of a document processed using the Data Capture, see Data Capture Document Status.
FDW tags are created without applying the ENS definition.
By default, the property Is Data Capture Rel is set to True on document to tag relationships SPFNDocRevMasterTag, SPFNDocRevAliasTag, FDWDocRevTag and SPFNFDWDocRevChildTag for Data Capture tags.
The master tag and the FDW tag are identified with the same icon. The alias tag is identified with a separate icon.
In the Desktop Client, the FDW tag is identified with the same icon as the master tag extracted using the Data Capture Content Discovery Task module.
You can select Match Tag Patterns to extract the tags based on the tag patterns defined in the Tag Discovery Patterns module. For a few file types, Match Tag Patterns is pre-selected.
Except for datasheet files, the Reader Pre-Processor is automatically selected based on the attached file type. For a datasheet file, you can select one of the following options as the base reader:
- Datasheet Reader
- PDF Reader
You can view the base reader set for different file types in the Data Capture Central Settings module in the Desktop Client. For more information, see Manage file types and prioritize them for content extraction.
For PDF files and Microsoft Office files, the PDF reader is selected by default as the base reader in the Data Capture Central Settings module in the Desktop Client. If the base reader for a file type other than PDF is set to the PDF reader, the PDF reader generates markup renditions when content is extracted from such files, and the software uses these renditions to retrieve the tag details. For more information, see Manage file types and prioritize them for content extraction.
After extracting data from the document, you can navigate to the document and tags in the Web Client. For more information, see View and manage Data Capture data using the Web Client.