Description

United Nations documents contain data and information that pertain to their procedural and substantive function in the Organization. Much of this data is intended for human consumption, reflected in how the documents are written, stored, hosted, described, annotated, and formatted. The challenge, therefore, is to make such documents machine readable by exploiting the regular patterns by which they are prepared and published, such as their standard formats and the metadata stored both within each document and on the site where it is hosted. It will also require Natural Language Processing (NLP) techniques to extract more structured information from the text.
This combined approach involves a document-processing pipeline that adds structure and improves machine readability. We would start by crawling the online repository to retrieve the documents and their descriptive information, which can be done with basic libraries such as the Python requests library. The retrieved Word files are then converted into more machine-readable formats, such as HTML or XML, using software packages such as Tika or Antiword. The structure of these converted documents makes it possible to tag specific elements of the text (e.g. titles, paragraphs, tables) and then parse their content. Further processing can be carried out with NLP packages such as NLTK, CoreNLP, or spaCy, which provide built-in functionality for part-of-speech tagging, parsing, and named entity recognition, among others.
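The middle of the pipeline, tagging elements in the converted HTML and pulling structured references out of the text, can be sketched with the Python standard library alone. This is a minimal illustration, not the full pipeline: the HTML snippet is a hypothetical stand-in for Tika/Antiword output, and the regular expression covers only one common form of UN document symbol (e.g. A/RES/70/1); real symbols are more varied.

```python
import re
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects text from <h1> and <p> elements of a converted document."""
    def __init__(self):
        super().__init__()
        self._tag = None
        self.elements = []  # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag and data.strip():
            self.elements.append((self._tag, data.strip()))

# Hypothetical snippet standing in for the HTML produced by Tika/Antiword.
html = "<h1>Resolution 70/1</h1><p>Recalling its resolution A/RES/66/288 ...</p>"
parser = ParagraphExtractor()
parser.feed(html)

# Many General Assembly resolution symbols follow this pattern (A/RES/<session>/<number>).
symbol = re.compile(r"[A-Z]+/RES/\d+/\d+")
refs = [m for _, text in parser.elements for m in symbol.findall(text)]
print(refs)  # -> ['A/RES/66/288']
```

A full implementation would feed the paragraph text to an NLP package such as spaCy for named entity recognition rather than relying on regular expressions alone.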
Using this data structure, it would be possible to set up search/filtering and visualization interfaces that let users view linkages between resolutions, compare similar resolutions, or display other information related to specific queries based on the available data types.
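Once references have been extracted, a linkage query reduces to a lookup over the citation graph. The sketch below assumes a hypothetical mapping from each resolution symbol to the symbols it cites (the symbols shown are real in form, but the citation relationships here are illustrative only):

```python
# Hypothetical extracted records: each resolution mapped to the symbols it cites.
links = {
    "A/RES/70/1": ["A/RES/66/288", "A/RES/69/244"],
    "A/RES/69/244": ["A/RES/66/288"],
}

def cited_by(symbol, links):
    """Return, in sorted order, the resolutions whose text references `symbol`."""
    return sorted(src for src, cited in links.items() if symbol in cited)

print(cited_by("A/RES/66/288", links))  # -> ['A/RES/69/244', 'A/RES/70/1']
```

A search/filtering interface would expose queries like this one alongside metadata filters (date, organ, subject), and a visualization layer could render the same mapping as a citation network.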