# Area 2: Text Processing
We apply NLP and related techniques to scan the text in corporate reports and similar documents and extract "key items." These extracted items can then be linked with other machine-readable data to drive the necessary follow-up actions.
# Current Status
Preprocessing: Convert PDFs to text; separate them into sections for analysis.
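A minimal sketch of this step, assuming pdfminer.six for text extraction and a simple numbered-heading regex for section splitting; the real section markers depend on the report format:

```python
import re
from pdfminer.high_level import extract_text

def pdf_to_sections(pdf_path, heading_pattern=r"^\d+(\.\d+)*\s+\S.*$"):
    """Convert a machine-readable PDF to text and split it into sections.

    `heading_pattern` is a hypothetical default matching numbered headings
    such as "3.1 Credit Risk"; real reports may need a different pattern.
    """
    text = extract_text(pdf_path)
    sections = {"preamble": []}
    current = "preamble"
    for line in text.splitlines():
        if re.match(heading_pattern, line.strip()):
            current = line.strip()          # start a new section at each heading
            sections[current] = []
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```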
NLP: We use the Natural Language Toolkit (NLTK) to preprocess the text data and apply the following techniques:
- Latent Dirichlet Allocation (LDA)
Generate a dynamic number of topics for each section, and use these topics to find the candidate sentences that best summarize each topic (see the LDA sketch after this list).
- Term Frequency-Inverse Document Frequency (TF-IDF)
Summarize the content of each section. Terms are additionally weighted by task-related keywords to generate topic-specific summaries (see the TF-IDF sketch after this list).
- Pre-Trained Transformer Networks
Leverage question-answering capabilities to identify the issues raised and the actions required in the text. We are investigating priming and other techniques to enable few-shot learning on this specific task (see the question-answering sketch after this list).
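A sketch of the LDA step, assuming gensim and NLTK tokenizers; the number of topics is fixed here for illustration, whereas the pipeline chooses it dynamically per section:

```python
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def lda_key_sentences(section_text, num_topics=3, sentences_per_topic=2):
    """Fit LDA on the sentences of one section and, for each topic, keep the
    sentences that score highest under that topic."""
    stops = set(stopwords.words("english"))
    sentences = sent_tokenize(section_text)
    tokenized = [[w.lower() for w in word_tokenize(s)
                  if w.isalpha() and w.lower() not in stops]
                 for s in sentences]
    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = models.LdaModel(corpus, num_topics=num_topics,
                          id2word=dictionary, random_state=0)

    summary = {}
    for topic_id in range(num_topics):
        # Score each sentence by the probability LDA assigns it under this topic.
        scored = []
        for sentence, bow in zip(sentences, corpus):
            probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
            scored.append((probs.get(topic_id, 0.0), sentence))
        scored.sort(reverse=True)
        summary[topic_id] = [s for _, s in scored[:sentences_per_topic]]
    return summary
```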
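A sketch of the TF-IDF summarizer, assuming scikit-learn's TfidfVectorizer; the keyword list and boost factor are illustrative placeholders:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(section_text, num_sentences=3, task_keywords=(), keyword_boost=2.0):
    """Score each sentence by its summed TF-IDF weight, boosting terms from
    `task_keywords` to steer the summary toward a specific topic."""
    sentences = sent_tokenize(section_text)
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)                  # sentences x terms
    weights = np.ones(len(vectorizer.get_feature_names_out()))
    for keyword in task_keywords:
        if keyword in vectorizer.vocabulary_:
            weights[vectorizer.vocabulary_[keyword]] *= keyword_boost  # up-weight task terms
    scores = (tfidf.toarray() * weights).sum(axis=1)             # one score per sentence
    top = sorted(np.argsort(scores)[-num_sentences:])            # keep document order
    return [sentences[i] for i in top]
```

For example, passing `task_keywords=("liquidity", "capital")` would bias the summary toward sentences mentioning those terms.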
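A sketch of the question-answering step, assuming the Hugging Face `transformers` pipeline; the model name and the questions are placeholders rather than the project's actual prompts:

```python
from transformers import pipeline

# Extractive question answering over each section of the report.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

QUESTIONS = {
    "issue": "What issues were raised?",
    "action": "What actions are required?",
}

def extract_issues_actions(sections):
    """Ask each question against every section and collect the top answers."""
    rows = []
    for name, text in sections.items():
        row = {"section": name}
        for field, question in QUESTIONS.items():
            result = qa(question=question, context=text)
            row[field] = result["answer"]
            row[field + "_score"] = result["score"]
        rows.append(row)
    return rows
```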
Next: We plan to try new models and continue training on more documents to gain further insight into what information each section can yield. LDA can also be applied to connect related documents, as sketched below.
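One plausible way to connect documents with LDA, sketched here as an assumption rather than the implemented method, is to compare their topic distributions under a shared model (here with cosine similarity, given an already fitted gensim `dictionary` and `lda`):

```python
import numpy as np
from gensim.matutils import sparse2full

def topic_similarity(tokens_a, tokens_b, dictionary, lda):
    """Cosine similarity between the LDA topic distributions of two tokenized
    documents; values near 1 suggest the documents cover related topics."""
    vectors = []
    for tokens in (tokens_a, tokens_b):
        bow = dictionary.doc2bow(tokens)
        topics = lda.get_document_topics(bow, minimum_probability=0.0)
        vectors.append(sparse2full(topics, lda.num_topics))
    a, b = vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```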
Area 2 is mainly based on data from a foreign bank.
# 9/29/2020 Update
NLP: We have implemented methods for preprocessing documents in an NLP context. We use NLTK for tasks such as building frequency dictionaries and performing named entity recognition. We plan to try new models and continue reading documents to gain further insight into what information each section can yield.
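A sketch of that NLTK preprocessing (frequency dictionary and named entity recognition); the required corpora downloads are noted in the comments:

```python
import nltk
from nltk import FreqDist, word_tokenize, pos_tag, ne_chunk

# One-time downloads needed by the calls below:
# nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("maxent_ne_chunker"), nltk.download("words")

def frequency_dictionary(text, top_n=20):
    """Most frequent alphabetic tokens in a section of text."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return FreqDist(tokens).most_common(top_n)

def named_entities(text):
    """(entity text, entity label) pairs from NLTK's default NE chunker."""
    tree = ne_chunk(pos_tag(word_tokenize(text)))
    return [(" ".join(token for token, _ in subtree.leaves()), subtree.label())
            for subtree in tree if hasattr(subtree, "label")]
```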
Preprocessing: We convert PDFs (machine-readable or scanned) to text and separate them into sections for analysis.
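For the scanned PDFs, a sketch of an OCR fallback, assuming pdf2image and pytesseract; the project's actual OCR tooling is not specified here:

```python
import pytesseract
from pdf2image import convert_from_path

def scanned_pdf_to_text(pdf_path, dpi=300):
    """Rasterize each page of a scanned PDF and run Tesseract OCR over it."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```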
# Inputs/Outputs
# LDA
Input: Machine-readable PDF, desired summary length
Output: Overall summary of the document; for each section, topics and key sentences.
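A hypothetical wrapper matching this contract, reusing the `pdf_to_sections`, `lda_key_sentences`, and `tfidf_summary` sketches above; the names and return structure are illustrative, not the project's actual interface:

```python
def summarize_report(pdf_path, summary_length=5):
    """Overall summary plus, for each section, topics and key sentences."""
    sections = pdf_to_sections(pdf_path)                         # preprocessing sketch
    per_section = {name: lda_key_sentences(text) for name, text in sections.items()}
    overall = tfidf_summary(" ".join(sections.values()), num_sentences=summary_length)
    return {"overall_summary": overall, "sections": per_section}
```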
# Transformers
Input: Machine-readable PDF
Output: Spreadsheet of the issues and actions identified in each section of the text
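A sketch of producing that spreadsheet, assuming pandas (with openpyxl installed for `.xlsx`) and reusing the hypothetical `pdf_to_sections` and `extract_issues_actions` sketches above:

```python
import pandas as pd

def issues_actions_to_spreadsheet(pdf_path, out_path="issues_actions.xlsx"):
    """Write one row per section with the extracted issue and required action."""
    sections = pdf_to_sections(pdf_path)             # preprocessing sketch
    rows = extract_issues_actions(sections)          # transformer QA sketch
    pd.DataFrame(rows).to_excel(out_path, index=False)
```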