# Area 1: Content Analysis and Auto Template Creation

This area covers detecting and interpreting complex data structures such as checkboxes, tables, blank lines, and signatures. The goal of extracting essential information from these documents is to evaluate their validity within the scope of the company and industry they apply to. The existing prototype requires a manually created template for each form type. We are currently developing a program that creates templates automatically by detecting essential keywords in table and blank box/line data types.

# 9/29/2020 update

Content analysis: successfully calculated Term Frequency-Inverse Document Frequency (TF-IDF) scores. We are analyzing the results and building suitable classifiers.
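For readers unfamiliar with the measure, here is a minimal sketch of a TF-IDF calculation in plain Python. It is not the project's code: the function name `tf_idf`, the whitespace tokenization, the unsmoothed `log(N / df)` weighting, and the sample strings are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times unsmoothed inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample text, not taken from the project's data.
docs = [
    "exemption certificate number",
    "resale certificate signature",
    "certificate date",
]
scores = tf_idf(docs)
```

With this weighting, a term that occurs in every document (here "certificate") scores zero, while terms specific to one document score higher, which is why the measure can help surface keywords that distinguish one form type from another.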

Automatic template creation: successfully identified the non-table input area. We are working on scalable table data extraction using computer vision techniques.
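As an illustration of what locating a blank fill-in line can involve, the sketch below scans a binarized page for long horizontal runs of ink pixels. This is an assumed, simplified approach and not necessarily the one used in the project; the function name `find_blank_lines`, the list-of-lists image representation, and the tiny sample page are hypothetical.

```python
def find_blank_lines(binary, min_length):
    """Find horizontal runs of ink pixels at least min_length wide.

    binary is a list of rows; each row is a list of 0 (background) or 1 (ink).
    Returns (row, start_col, end_col) tuples, with end_col inclusive.
    """
    runs = []
    for y, row in enumerate(binary):
        start = None
        # A trailing 0 sentinel closes any run that reaches the right edge.
        for x, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_length:
                    runs.append((y, start, x - 1))
                start = None
    return runs


# Made-up 4x8 page: short strokes on row 1, one long rule on row 3.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 1],
]
lines = find_blank_lines(page, min_length=5)
```

Only the long rule on the last row passes the length threshold; the short strokes, which stand in for printed label text, are ignored. A real scan would also need binarization, skew correction, and tolerance for broken or thick lines.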

This work is mainly based on Ryan Software’s data (tax certification).

# 8/19/2020 update

We are currently developing a program to automatically create templates by detecting essential keywords in table and blank box/line data types. Access to the GitHub repository for this project is available on request.
