
    Assisted Data Extraction

    by Mahdi Moqri 02/24/2018 08:01 PM GMT


          Description

          The task is divided into four steps. Each step is either fully automated or assisted, meaning a person does part of the work by hand.

          1- Finding pages containing tables - assisted

          Input: yearbook PDF file - https://unstats.un.org/unsd/publications/statistical-yearbook/files/syb56/syb56.pdf 

          Output: a CSV file containing the page numbers of each table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/2_table_page_numbers/tables.csv

          Method: Recording the page numbers for each table in the yearbook (manually)

          2- Extracting the pages containing tables - automated

          Input: the CSV file containing the page numbers of each table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/2_table_page_numbers/tables.csv

          Output: a set of PDF files each containing a table - https://github.com/moqri/UN_YearBooks2OpenData/tree/master/3_table_pdfs 

          Method: PyPDF2 - https://pythonhosted.org/PyPDF2/
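
A minimal sketch of how step 2 could look, not necessarily the code in the repository. It assumes tables.csv has the columns `table`, `start_page`, and `end_page` (a hypothetical layout), that page numbers are 1-based positions in the PDF file rather than printed page numbers, and that a PyPDF2 release providing `PdfReader` and `PdfWriter` is installed. The function names are illustrative.

```python
import csv
import io
from pathlib import Path


def read_page_ranges(csv_text):
    """Parse rows of table,start_page,end_page into {table: (start, end)}.

    The three-column layout is an assumption; adapt it to the real tables.csv.
    """
    ranges = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = int(row["start_page"])
        end = int(row["end_page"])
        if end < start:
            raise ValueError(f"table {row['table']}: end page before start page")
        ranges[row["table"]] = (start, end)
    return ranges


def split_tables(pdf_path, ranges, out_dir):
    """Write one PDF per table, using 1-based inclusive page ranges."""
    # Imported lazily so the CSV helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(pdf_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for table, (start, end) in ranges.items():
        writer = PdfWriter()
        # reader.pages is 0-based, so shift the 1-based page numbers.
        for index in range(start - 1, end):
            writer.add_page(reader.pages[index])
        with open(out_dir / f"{table}.pdf", "wb") as handle:
            writer.write(handle)
```

With the yearbook downloaded as syb56.pdf, a call such as `split_tables("syb56.pdf", read_page_ranges(Path("tables.csv").read_text()), "3_table_pdfs")` would produce one file per table, for example 3_table_pdfs/2.pdf.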

          3- Extracting the data from each table - automated

          Input: a PDF file containing a table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/3_table_pdfs/2.pdf 

          Output: a CSV file containing table data - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/4_table_csvs/2.csv 

          Method: Tabula - http://tabula.technology/ 
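
The description does not say whether Tabula was run through its desktop application, the command line, or a wrapper. The sketch below assumes the tabula-py wrapper (which needs a Java runtime) and uses its `convert_into` function to turn every per-table PDF into a CSV with the same base name. The directory names follow the repository layout linked above; the helper names are illustrative.

```python
from pathlib import Path


def csv_path_for(pdf_path, csv_dir):
    """Map e.g. 3_table_pdfs/2.pdf to <csv_dir>/2.csv."""
    return Path(csv_dir) / (Path(pdf_path).stem + ".csv")


def convert_tables(pdf_dir, csv_dir):
    """Run Tabula over every PDF in pdf_dir and write one CSV per PDF."""
    # Imported lazily: tabula-py needs a Java runtime to be installed.
    import tabula

    Path(csv_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        tabula.convert_into(
            str(pdf_path),
            str(csv_path_for(pdf_path, csv_dir)),
            output_format="csv",
            pages="all",  # each input PDF holds a single table
        )
```

Calling `convert_tables("3_table_pdfs", "4_table_csvs")` would then write 4_table_csvs/2.csv from 3_table_pdfs/2.pdf, and likewise for the other tables.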

          4- Cleaning the data and creating tables - assisted

          Input: a CSV file containing table data - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/4_table_csvs/2.csv 

          Output: a CSV table with correct labels and rows - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/5_tables_cleaned/2.csv 

          Method: comparing the original table (from PDF) and the result (in CSV) and correcting the labels, alignments, etc. (manually)
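
The cleaning itself is manual, but a small check can point the reviewer at the rows most likely to be misaligned. The following is only an illustrative aid, not part of the described method: it assumes the first row of an extracted CSV is the header and reports every later row whose cell count differs from it.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (row_number, cell_count) for rows whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    # Row numbers are 1-based and count the header as row 1.
    return [
        (number, len(row))
        for number, row in enumerate(rows[1:], start=2)
        if len(row) != expected
    ]
```

Rows reported by `find_ragged_rows` are candidates to compare against the original PDF table first; rows with the expected width can still carry wrong labels, so they are not guaranteed to be correct.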

          Co-authors to your solution

          Parham Amiri

          Link to your concept design and documentation (Required by the final day of the Submission & Collaboration phase)

          https://github.com/moqri/UN_YearBooks2OpenData

          Link to an online working solution or prototype (Required by the final day of the Submission & Collaboration phase):

          https://github.com/moqri/UN_YearBooks2OpenData

          Link to a video or screencast of your solution or prototype (Required by the final day of the Submission & Collaboration phase):

          Link to source code of your solution or prototype above. (If you submitted a link to an online solution or prototype, or to a video of your solution or prototype, you must provide a link to the source code. This item is required by the final day of the submission phase):

          https://github.com/moqri/UN_YearBooks2OpenData
