This functionality was modified in an update. For more information, see Extract data from a document (modified in an update).
You can extract data from a document, thereby updating all relationships between tags and the document.
How can I configure DCOM to allow content extraction from Microsoft Office files?
You must enable DCOM permissions before HxGN SDx can access Microsoft Office applications.
This is mandatory if you want to extract content from Microsoft Office files.
To set the DCOM configuration for the respective file type application, complete the
following steps:
- Click Start > Administrative Tools > Component Services.
- In the tree view, expand Component Services > Computers > My Computer > DCOM Config.
- Based on the Microsoft Office file type, locate and right-click the respective DCOM Config component service.
- On the shortcut menu, click Properties.
- In the General tab, set the Authentication Level to None.
- In the Identity tab, select The Launching User option.
- In the Security tab, set the Launch and Activation Permissions to Customize, and click Edit.
- Add the Administrators created by Server Manager.
- Select the Allow check box for the following items:
  - Local Launch
  - Remote Launch
  - Local Activation
  - Remote Activation
  - Read
  - Special permissions
What is the purpose of a default template group?
When you use Data Capture Content Discovery Task in the Desktop Client or Extract Content in the Web Client to extract content from multiple documents, the software automatically applies the templates and rules defined for the default template group. The default template group is considered only when a PDF file or a drawing file is attached to the document. To extract content successfully, ensure that the templates and rules are configured for the template group. If you have not set any template group as the default, the software automatically uses the DefaultDrawingTemplateGroup template group, which is provided with the software.
In the Web Client, when you extract content from a single document, the software automatically applies the templates and rules defined for the automatically selected default template group. The default template group is considered only when a PDF file or a drawing file is attached to the document. However, you can select and apply any other template group instead of the default template group. For more information, see Extract data from a document.
For the automatically selected default template group, the Match Tag Patterns option is pre-selected.
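The selection rules described above can be summarized in a short sketch. This is only an illustration of the documented behavior, not the product's implementation: the function name, the parameter names, and the "pdf" and "drawing" labels are hypothetical, and only the DefaultDrawingTemplateGroup name comes from the text.

```python
from typing import Optional

# Name taken from the text above; everything else in this sketch is hypothetical.
FALLBACK_GROUP = "DefaultDrawingTemplateGroup"


def resolve_template_group(file_kind: str,
                           configured_default: Optional[str] = None,
                           user_choice: Optional[str] = None) -> Optional[str]:
    """Return the template group that would apply, or None if no default applies."""
    # In the Web Client, a template group picked by the user replaces the default.
    if user_choice:
        return user_choice
    # The default template group is considered only for PDF and drawing files.
    if file_kind not in {"pdf", "drawing"}:
        return None
    # Without a configured default, the group shipped with the software is used.
    return configured_default or FALLBACK_GROUP
```

For example, `resolve_template_group("pdf")` falls back to the shipped group, while passing `user_choice` models a user selecting another group for a single document.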
How is the content extracted from a document with more than one file attached?
When processing a document with more than one file attached, the following scenarios
are considered:
- If different file types are attached, the software first checks for a file with the ISPFNMasterFile interface, and content is extracted from that file.
- If a file with the ISPFNMasterFile interface is not found, content is extracted from the file with the highest priority. By default, files with the .dwg file extension have the highest priority. However, you can change the file priority in the Data Capture Central Settings module in the Desktop Client. For more information, see Manage file types and prioritize them for content extraction.
- If more than one file with the highest priority is attached, you must choose the file to extract content from in the Extract Content window.
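As a rough illustration of the order of checks described above, the following sketch models the decision. The AttachedFile class, its fields, and the numeric priority convention (a lower number means a higher priority) are assumptions made for this example and are not part of the product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttachedFile:
    name: str
    is_master: bool = False  # stands in for "has the ISPFNMasterFile interface"
    priority: int = 0        # hypothetical convention: lower number = higher priority


def pick_file(files: List[AttachedFile]) -> Optional[AttachedFile]:
    """Return the file to extract from, or None when the user must choose."""
    if not files:
        raise ValueError("the document has no attached files")
    # 1. A file with the master-file interface is used first.
    masters = [f for f in files if f.is_master]
    if masters:
        return masters[0]
    # 2. Otherwise, use the file with the highest priority.
    best = min(f.priority for f in files)
    candidates = [f for f in files if f.priority == best]
    # 3. Several files share the highest priority: the user has to decide.
    return candidates[0] if len(candidates) == 1 else None
```

A return value of None corresponds to the case where you choose the file yourself in the Extract Content window.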
How can I extract double-byte characters from the content?
By default, Data Capture supports only a few double-byte characters when extracting content from a document. To allow this extraction, ask your administrator to enable the Extract matching double-byte characters as alias tag option on the Master Tag Patterns page of the Data Capture Administration module in the Desktop Client. For more information, see Create a new master tag pattern.
Before you start content extraction, we recommend that you check whether the double-byte character you want to extract is supported by Data Capture by completing the following steps:
- Go to https://www.dotnetfiddle.net in a browser.
- Copy and paste the following Data Capture double-byte character conversion code snippet in the Code Editor box:

      using System;
      using System.Text;

      public class Program
      {
          public static void Main()
          {
              // Convert the character from Unicode to code page 1252 and print the result.
              Console.WriteLine(Encoding.GetEncoding(1252).GetChars(Encoding.Convert(Encoding.Unicode,
                  Encoding.GetEncoding(1252), Encoding.Unicode.GetBytes("<double-byte character to be converted>"))));
          }
      }

- Click Run to execute the code and verify the following:
  - If the output is a single-byte character, the double-byte character is converted successfully.
  - If the output is a ?, the double-byte character conversion failed.
To configure Data Capture to extract unsupported double-byte characters, first set the query scope of your site to the (Scope Not Set) Configuration Top and load the Data Capture sample data. For more information on how to load the sample data, see Load the sample data for your site. Then, do the following:
- In the Desktop Client, under the Document classifications tree, expand the Data Capture Administration node.
- Under the Data Capture Administration node, right-click the Data Capture Miscellaneous node, and click Show Template Documents.
- Right-click the Double Byte Substitution List template in the tree view, and click Edit > Check Out to check out the template and edit the DoubleByteSubstitutionList.xml file.
- In the file, based on the type of double-byte character to be converted, copy and paste the following XML code. Replace the child node value with the double-byte character's Unicode value and the parent node value with the corresponding single-byte character's Unicode value.

      <SubstituteChar UnicodeValue="Single-byte character's unicode value"> <!-- Parent node -->
          <SubstituteeChar>Double-byte character's unicode value</SubstituteeChar> <!-- Child node -->
      </SubstituteChar>

  For example,

      <SubstituteChar UnicodeValue="007E"> <!-- Unicode value for the single-byte tilde character ~ -->
          <SubstituteeChar>ff5e</SubstituteeChar> <!-- Unicode value for the double-byte tilde character ～ -->
      </SubstituteChar>

  This sample code converts the double-byte character ～ to the single-byte character ~.
- Check in the template with the updated XML file.
Data Capture applies the conditions in the template file when processing the content
with double-byte characters.
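If you need to prepare several entries, a small helper script can draft the XML fragment for you. The sketch below is not part of Data Capture. It assumes that Unicode NFKC normalization is an acceptable way to find the single-byte counterpart of a full-width character, and that "single-byte" means a code point no greater than U+00FF. The function name is hypothetical, and the output format mirrors the example in the steps above. Always review the generated values before you check in the template.

```python
import unicodedata
from typing import Optional


def substitution_entry(double_byte_char: str) -> Optional[str]:
    """Draft a SubstituteChar fragment, or return None if no single-byte match is found."""
    if len(double_byte_char) != 1:
        raise ValueError("pass exactly one character")
    # Assumption: NFKC maps full-width forms to their single-byte equivalents.
    candidate = unicodedata.normalize("NFKC", double_byte_char)
    # Assumption: treat code points up to U+00FF as single-byte characters.
    if len(candidate) != 1 or ord(candidate) > 0xFF:
        return None
    return (
        f'<SubstituteChar UnicodeValue="{ord(candidate):04X}">\n'
        f'    <SubstituteeChar>{ord(double_byte_char):04x}</SubstituteeChar>\n'
        f'</SubstituteChar>'
    )


# U+FF5E (full-width tilde) is the character used in the example above.
print(substitution_entry("\uff5e"))
```

A result of None means that normalization found no single-byte counterpart, in which case you have to decide on the substitute character yourself.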
To extract content from 3D model .zvf and .mdb2 files, we recommend that you install Microsoft Access database engine 2010 (64-bit) on the HxGN SDx application server.
To extract content from a document, use the Web Client to perform the following steps:
- Click Documents > All Documents.
- To extract data, select a document from the All Documents list, and click Actions > Extract Content.
- In the Extract Content window, do the following:
  - Select any file attached to the document from the Select File list.
  - Select a template group from the TemplateGroup/Template list to apply the processing rules using the corresponding Preprocessor Reader.
    From Update 14 onwards, for PDF and drawing files, the template group that is set as default in the Data Capture Drawing Reader Pre-Processor and PDF Reader Pre-Processor is automatically selected and applied for extracting content. However, you can select and apply any template group from the TemplateGroup/Template list to process the content. For more information, see Manage drawing reader pre-processor templates and template groups and Manage PDF reader pre-processor templates.
  - Click OK.
To use preprocessed content files for processing the file, click Show more and apply more options as follows:
Click this | To do this
Use Existing PreProcessed Content Files | Process the file using the preprocessed content XML file available. To extract content using the preprocessed content files, we recommend you attach the ContentFile.xml along with the corresponding file to the document. You must also attach GraphicsMapFile.xml if the file type supports graphical navigation.
Reader Pre-Processor | Select appropriate Preprocessor Reader for processing the datasheet file.
For Hexagon 3D model: OleDB Provider box and type the connection string. | Connect to the Microsoft Access database.
Match Tag Patterns check box. | Extract the tags based on the tag patterns defined in the Tag Discovery Patterns module.
- Before you work with a large amount of data, we recommend that you configure the LicenseTimeoutSeconds property under the Site Settings node in HxGN SDx Server Manager, based on the size of the data. This setting prevents the license token from timing out, so that the session is retained.
- To view the status of content extraction from a selected document:
  - On the Actions menu, click Show the detail form > Extract Content.
  For more information about the status of a document processed using Data Capture, see Data Capture Document Status.
- FDW tags are created without applying the ENS definition.
- By default, the property Is Data Capture Rel is set to True on the document-to-tag relationships SPFNDocRevMasterTag, SPFNDocRevAliasTag, FDWDocRevTag, and SPFNFDWDocRevChildTag for Data Capture tags.
- The master tag and the FDW tag are identified with the same icon. The alias tag is identified with a separate icon.
- In the Desktop Client, the FDW tag is identified with the same icon as the master tag extracted using the Data Capture Content Discovery Task module.
- You can select Match Tag Patterns to extract the tags based on the tag patterns defined in the Tag Discovery Patterns module. For a few file types, Match Tag Patterns is pre-selected.
- Except for the datasheet file, the Reader Pre-Processor is automatically selected based on the attached file type. For the datasheet file, you can select one of the following options as the base reader:
  - Datasheet Reader
  - PDF Reader
- You can view the base reader set for different file types in the Data Capture Central Settings module in the Desktop Client. For more information, see Manage file types and prioritize them for content extraction.
- For PDF files and Microsoft Office files, the PDF reader is selected by default as the base reader in the Data Capture Central Settings module in the Desktop Client. If the base reader is set to the PDF reader for any file type other than PDF, the PDF reader generates Markup renditions when content is extracted from such files, and the software uses these renditions to retrieve the tag details. For more information, see Manage file types and prioritize them for content extraction.
- After extracting data from the document, you can navigate to the document and tags in the Web Client. For more information, see View and manage Data Capture data using the Web Client.