    Assisted Data Extraction

    by Mahdi Moqri 02/24/2018 08:01 PM GMT

          Description

          The task is divided into four steps. Each step is either fully automated or assisted, meaning a person does part of the work manually.

          1- Finding pages containing tables - assisted

          Input: yearbook PDF file - https://unstats.un.org/unsd/publications/statistical-yearbook/files/syb56/syb56.pdf 

          Output: a csv file containing page numbers of each table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/2_table_page_numbers/tables.csv 

          Method: Recording the page numbers for each table in the yearbook (manually)
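Because the page list is recorded by hand, it may help to validate it before it feeds step 2. The following is a minimal sketch, assuming the CSV has a header row with columns named `table`, `first_page`, and `last_page`. Those column names and the `read_table_pages` helper are hypothetical, since the actual layout of `tables.csv` is not described here.

```python
import csv
import io


def read_table_pages(csv_text):
    """Parse page-number records into {table_id: (first_page, last_page)}.

    Assumes hypothetical columns `table`, `first_page`, `last_page`
    holding 1-based page numbers; adjust to the real tables.csv layout.
    """
    pages = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        first, last = int(row["first_page"]), int(row["last_page"])
        # Reject ranges that cannot be valid, such as a typo in a page number.
        if first < 1 or last < first:
            raise ValueError(
                f"bad page range for table {row['table']}: {first}-{last}"
            )
        pages[row["table"]] = (first, last)
    return pages
```

A check like this catches reversed or non-positive ranges early, before any PDF is split.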

          2- Extracting the pages containing tables - automated

          Input: a CSV file containing page numbers of each table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/2_table_page_numbers/tables.csv 

          Output: a set of PDF files each containing a table - https://github.com/moqri/UN_YearBooks2OpenData/tree/master/3_table_pdfs 

          Method: PyPDF2 - https://pythonhosted.org/PyPDF2/
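The page extraction in this step could look roughly like the sketch below. It is not the repository's actual script: the function names are hypothetical, it assumes PyPDF2 2.x or later (where the classes are named `PdfReader` and `PdfWriter`), and it assumes the recorded page numbers are 1-based positions in the PDF file rather than the page numbers printed in the yearbook.

```python
def page_indices(first_page, last_page):
    """Convert a 1-based inclusive page range to 0-based page indices."""
    return list(range(first_page - 1, last_page))


def extract_table_pdf(source_pdf, first_page, last_page, output_pdf):
    """Copy pages first_page..last_page of source_pdf into output_pdf."""
    # Imported here so page_indices works without PyPDF2 installed.
    # PdfReader/PdfWriter are the class names in PyPDF2 2.x and later.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source_pdf)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page):
        writer.add_page(reader.pages[index])
    with open(output_pdf, "wb") as handle:
        writer.write(handle)
```

Calling `extract_table_pdf` once per row of the page list would produce one PDF per table, matching the output described for this step.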

          3- Extracting the data from each table - automated

          Input: a PDF file containing a table - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/3_table_pdfs/2.pdf 

          Output: a CSV file containing table data - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/4_table_csvs/2.csv 

          Method: Tabula - http://tabula.technology/ 
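One way to script this step is through the `tabula-py` wrapper, sketched below. This is an assumption: the description names only Tabula itself, not how it was invoked. The sketch also assumes a Java runtime is installed (Tabula requires one), and the helper names are hypothetical.

```python
from pathlib import Path


def csv_path_for(pdf_path, output_dir):
    """Map a table PDF such as 3_table_pdfs/2.pdf to <output_dir>/2.csv."""
    return Path(output_dir) / (Path(pdf_path).stem + ".csv")


def pdf_table_to_csv(pdf_path, output_dir):
    """Run Tabula on every page of one table PDF and write a CSV."""
    # Assumption: the tabula-py wrapper (pip install tabula-py) and a Java
    # runtime are available; the source only names Tabula itself.
    import tabula

    target = csv_path_for(pdf_path, output_dir)
    tabula.convert_into(
        str(pdf_path), str(target), output_format="csv", pages="all"
    )
    return target
```

Keeping the file stem unchanged preserves the link between each table PDF and its CSV, which the manual comparison in step 4 relies on.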

          4- Cleaning the data and creating tables - assisted

          Input: a CSV file containing table data - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/4_table_csvs/2.csv 

          Output: a CSV table with correct labels and rows - https://github.com/moqri/UN_YearBooks2OpenData/blob/master/5_tables_cleaned/2.csv 

          Method: comparing the original table (from PDF) and the result (in CSV) and correcting the labels, alignments, etc. (manually)
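The manual comparison could be narrowed down with a small check that flags rows likely to be misaligned. The sketch below is an illustrative helper, not part of the described method: it assumes that a row whose column count differs from the most common count in the file is worth a closer look, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def rows_needing_review(csv_text):
    """Return 1-based row numbers whose column count is unusual."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    # Treat the most common column count as the expected table width.
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [
        number
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]
```

This only detects split or merged cells that change the row width. Wrong labels and shifted values within a row of the right width still need the side-by-side check against the PDF.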

          Co-authors to your solution

          Parham Amiri

          Link to concept design and documentation:

          https://github.com/moqri/UN_YearBooks2OpenData

          Link to an online working solution or prototype:

          https://github.com/moqri/UN_YearBooks2OpenData

          Link to source code of the solution or prototype:

          https://github.com/moqri/UN_YearBooks2OpenData
