Description

United Nations documents contain data and information that pertain to their procedural and substantive function in the Organization. Much of this data is intended for human consumption, reflected in how the documents are written, stored, hosted, described, annotated, and formatted. The challenge, therefore, is to make such documents machine readable by exploiting the regular patterns by which they are prepared and published, such as their standard formats and the metadata stored both within each document and on the site where it is hosted. It will also require Natural Language Processing (NLP) techniques to extract more structured information from the text.
This combined approach involves a document-processing pipeline that adds structure and improves machine readability. We would start by crawling the online repository to retrieve the documents and their descriptive information, which can be done with basic libraries such as the Python requests library. The retrieved Word files are then converted into more machine-readable formats, such as HTML or XML, using software packages such as Tika or Antiword. The structure of these converted documents makes it possible to tag specific elements of the text (e.g. titles, paragraphs, tables) and then parse their content. Further processing can be carried out with NLP packages such as NLTK, CoreNLP, or spaCy, which provide built-in functionality for part-of-speech tagging, parsing, and named entity recognition, among others.
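The middle of the pipeline, tagging elements in the converted HTML and pulling structured references out of the text, can be sketched with the Python standard library alone. This is a minimal illustration, not the full pipeline: the HTML snippet is a hypothetical stand-in for Tika/Antiword output, and the regular expression covers only one common form of UN document symbol (e.g. A/RES/70/1); real symbols are more varied.

```python
import re
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects text from <h1> and <p> elements of a converted document."""
    def __init__(self):
        super().__init__()
        self._tag = None
        self.elements = []  # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag and data.strip():
            self.elements.append((self._tag, data.strip()))

# Hypothetical snippet standing in for the HTML produced by Tika/Antiword.
html = "<h1>Resolution 70/1</h1><p>Recalling its resolution A/RES/66/288 ...</p>"
parser = ParagraphExtractor()
parser.feed(html)

# Many General Assembly resolution symbols follow this pattern (A/RES/<session>/<number>).
symbol = re.compile(r"[A-Z]+/RES/\d+/\d+")
refs = [m for _, text in parser.elements for m in symbol.findall(text)]
print(refs)  # -> ['A/RES/66/288']
```

A full implementation would feed the paragraph text to an NLP package such as spaCy for named entity recognition rather than relying on regular expressions alone.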
Using this data structure, it would be possible to set up search/filtering and visualization interfaces that let users view linkages between resolutions, compare similar resolutions, or display other information related to specific queries based on the available data types.
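Once references have been extracted, a linkage query reduces to a lookup over the citation graph. The sketch below assumes a hypothetical mapping from each resolution symbol to the symbols it cites (the symbols shown are real in form, but the citation relationships here are illustrative only):

```python
# Hypothetical extracted records: each resolution mapped to the symbols it cites.
links = {
    "A/RES/70/1": ["A/RES/66/288", "A/RES/69/244"],
    "A/RES/69/244": ["A/RES/66/288"],
}

def cited_by(symbol, links):
    """Return, in sorted order, the resolutions whose text references `symbol`."""
    return sorted(src for src, cited in links.items() if symbol in cited)

print(cited_by("A/RES/66/288", links))  # -> ['A/RES/69/244', 'A/RES/70/1']
```

A search/filtering interface would expose queries like this one alongside metadata filters (date, organ, subject), and a visualization layer could render the same mapping as a citation network.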