
    Cybersecurity Classification by Machine Learning

    by Hao Sun 02/27/2018 06:49 AM GMT


          Concept Design

          Data Preprocess

          Data collection

          BeautifulSoup was used to crawl the source pages so that the policy files could be downloaded automatically.
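The link-extraction part of this step can be sketched as follows. This is a minimal illustration, not the team's crawler: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages, and the page content and URL are hypothetical because the source URLs are not given here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects the targets of <a> tags that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical index page standing in for the real policy listing.
sample_html = '<a href="/docs/policy_a.pdf">A</a> <a href="about.html">About</a>'
pdf_links = find_pdf_links(sample_html, "https://example.org/index.html")
```

Each collected URL could then be fetched with a library such as requests and saved to disk.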

          76* unlabeled Cybersecurity policy files were crawled from:

          193 human labeled Cybersecurity policy files were crawled from:

          Extract text from pdf files

          Before extracting text from the pdf files, we checked all files and rotated them to an upright orientation.

          Readable pdf files

          If the text in a policy file was machine-readable, PDFMiner was used to extract it.

          Slides and Images

          There were 10 outliers: 9 of them (Belgium, Brunei, Korea, Latvia, Malawi, Mauritius, Panama, South Africa and Uruguay) were in image format, and the other one (Italy) was locked. We converted those files into JPEG format and used pytesseract to recognize the text.

          Initial cleaning

          Unreadable ASCII characters were removed from the data set.

          The text encoding was set to UTF-8.
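A minimal sketch of this cleaning step is shown below. It assumes that "unreadable" means control characters and bytes that cannot be decoded as UTF-8; the exact rule used in the project is not stated, and `clean_text` is a hypothetical helper name.

```python
import unicodedata


def clean_text(raw_bytes):
    """Decode bytes as UTF-8 and drop characters that cannot be displayed."""
    # Undecodable bytes become U+FFFD so they can be filtered out below.
    text = raw_bytes.decode("utf-8", errors="replace")
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(ch)  # keep line breaks and tabs
        elif ch == "\ufffd" or unicodedata.category(ch).startswith("C"):
            continue  # drop replacement marks and control/format characters
        else:
            kept.append(ch)
    return "".join(kept)


sample = b"Cyber\x00security \xffpolicy\x07\n"
cleaned = clean_text(sample)
```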

          Special page splitter

          When extracting text from the pdf files, a special page-splitter marker was added to the end of each page so that page boundaries could be recognized later.
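The idea can be illustrated with a small sketch. The marker string `<<<PAGE_BREAK>>>` and both function names are assumptions made for this example; the actual marker is not given.

```python
PAGE_BREAK = "<<<PAGE_BREAK>>>"  # hypothetical marker string


def join_pages(pages):
    """Concatenate page texts, appending the marker after every page."""
    return "".join(page + "\n" + PAGE_BREAK + "\n" for page in pages)


def split_pages(text):
    """Recover the individual pages from text produced by join_pages."""
    parts = text.split(PAGE_BREAK)
    return [part.strip("\n") for part in parts if part.strip()]


document = join_pages(["Page one text.", "Page two text."])
pages = split_pages(document)
```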


          The googletrans package was used to translate non-English files into English.

          Sentence Extraction

          Filter sentences

          Titles and title-like sentences were eliminated first. Only sentences of normal length were kept.
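One possible form of this filter is sketched below. The heuristics (all capitals, or mostly capitalized words without end punctuation) and the word-count limits are assumptions chosen for illustration, since the actual rules and thresholds are not stated.

```python
def is_title_like(sentence):
    """Heuristic guess at whether a line is a heading rather than a sentence."""
    words = sentence.split()
    if not words or sentence.isupper():
        return True
    capitalized = sum(1 for w in words if w[0].isupper())
    no_end_punct = not sentence.rstrip().endswith((".", "!", "?"))
    return no_end_punct and capitalized / len(words) > 0.6


def filter_sentences(sentences, min_words=5, max_words=60):
    """Keep sentences that are not title-like and have a normal length."""
    return [
        s for s in sentences
        if not is_title_like(s) and min_words <= len(s.split()) <= max_words
    ]


candidates = [
    "NATIONAL CYBER SECURITY STRATEGY",
    "Chapter 2 Legal Measures",
    "The government will establish a national incident response team.",
    "Yes.",
]
kept = filter_sentences(candidates)
```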

          Predict period

          Sentence tokenization was a major problem, so we used an LSTM model implemented in Theano with GPU acceleration to predict periods. The approach was to take a pre-trained model and apply it to our data set. For example, roughly the first 80% of lines from the Europarl v7 monolingual English corpus were used as training data, the next 10% as development data and the last 10% as test data (preprocessing script here). The training set was about 40 million words. The corpus was obtained from the IWSLT 2012 TED task web page. The model reached an accuracy of up to 87%.

          Split sentences

          We split the text into sentences at the predicted periods. We also removed the page breaks so that the sentences were extracted correctly.
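A minimal sketch of the splitting step, assuming the period-prediction model has already inserted end punctuation and that pages are separated by a hypothetical `<<<PAGE_BREAK>>>` marker:

```python
import re

PAGE_BREAK = "<<<PAGE_BREAK>>>"  # hypothetical marker string


def split_sentences(punctuated_text):
    """Remove page-break markers, then split after ., ! or ?."""
    text = punctuated_text.replace(PAGE_BREAK, " ")
    text = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


raw = (
    "The agency shall report\n" + PAGE_BREAK + "\n"
    "incidents within 24 hours. Operators must comply."
)
sentences = split_sentences(raw)
```

Because the marker is removed before splitting, a sentence that continues across a page boundary is kept in one piece. A plain split like this also breaks on abbreviations, so real text would likely need extra handling.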

          Data cleaning

          First, unnecessary symbols, including punctuation, numbers and garbled text at the start of each sentence, were removed with regular expressions.

          Second, the Porter stemming algorithm was used to remove the more common morphological and inflectional endings from English words. Stop words were also removed to filter out the most common words, such as the, is, at, which, and on.
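The regular-expression and stop-word parts of this step can be sketched as below. The patterns and the short stop-word list are assumptions for illustration, and the stemming step is left out to keep the example free of dependencies.

```python
import re

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "to", "in"}


def clean_sentence(sentence):
    """Strip leading debris, punctuation and digits, then drop stop words."""
    # Remove anything before the first letter (numbering, garbled symbols).
    sentence = re.sub(r"^[^A-Za-z]+", "", sentence)
    # Replace every remaining non-letter, non-space character with a space.
    sentence = re.sub(r"[^A-Za-z\s]", " ", sentence)
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean_sentence(
    "3.2 §§ The agency, which is at risk, reports 24 incidents on time."
)
```

The resulting tokens could then be passed through a Porter stemmer, for example `PorterStemmer().stem(token)` from nltk.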

          Topic Generator

          More data cleaning

          Country names were removed.

          The five most frequent words and words that occurred fewer than 10 times were also removed.
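A minimal sketch of this frequency filter, assuming it is applied to tokenized sentences and that counts are taken over the whole corpus (`filter_by_frequency` is a hypothetical helper; the toy call uses smaller limits so the effect is visible):

```python
from collections import Counter


def filter_by_frequency(tokenized_sentences, top_n=5, min_count=10):
    """Drop the top_n most frequent words and words seen fewer than min_count times."""
    counts = Counter(t for sent in tokenized_sentences for t in sent)
    too_common = {word for word, _ in counts.most_common(top_n)}
    too_rare = {word for word, count in counts.items() if count < min_count}
    drop = too_common | too_rare
    return [[t for t in sent if t not in drop] for sent in tokenized_sentences]


corpus = [
    ["cyber", "security", "policy"],
    ["cyber", "policy", "law"],
    ["cyber", "security", "team"],
]
filtered = filter_by_frequency(corpus, top_n=1, min_count=2)
```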

          Essential words selection

            Essential words were selected by TF-IDF. The three top-ranked words of each sentence were selected to build topics.
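The selection can be sketched in plain Python as follows. It assumes each sentence is treated as a document and uses one common TF-IDF variant (relative term frequency times the log of inverse document frequency); the weighting actually used in the project is not specified.

```python
import math
from collections import Counter


def top_tfidf_words(tokenized_sentences, k=3):
    """Return the k highest-scoring TF-IDF words of every sentence."""
    n = len(tokenized_sentences)
    doc_freq = Counter()
    for sent in tokenized_sentences:
        doc_freq.update(set(sent))  # count each word once per sentence

    selected = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        scores = {
            w: (c / len(sent)) * math.log(n / doc_freq[w]) for w, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        selected.append(ranked[:k])
    return selected


corpus = [
    ["cyber", "incident", "response", "team"],
    ["cyber", "legal", "framework"],
    ["cyber", "training", "awareness", "training"],
]
essential = top_tfidf_words(corpus)
```

A word such as "cyber" that appears in every sentence gets a score of zero, so it is only selected when a sentence has fewer than three more distinctive words.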

            LDA modeling

            Ten topics were generated by LDA from the selected words.

            Here are the ten meaningful topics:

            Topic tagging

            Each sentence was tagged with its most similar topic. The similarity between a sentence and a topic was measured from the LDA results.

            Similarity matrix

            A similarity matrix of all sentences was generated from their degree of relevance to the ten topics. The matrix can be used as a reference for category labeling, since sentences that belong together tend to have high similarity scores.
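Topic tagging and the similarity matrix can be sketched as below. The sketch assumes each sentence is already represented by its topic distribution from LDA and that cosine similarity is the measure; the write-up does not name the measure, and the toy data uses three topics instead of ten for brevity.

```python
import math


def tag_topics(topic_dists):
    """Tag each sentence with the index of its highest-weighted topic."""
    return [max(range(len(d)), key=d.__getitem__) for d in topic_dists]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(topic_dists):
    """Pairwise cosine similarity between sentence topic distributions."""
    return [[cosine(u, v) for v in topic_dists] for u in topic_dists]


# Toy topic distributions for three sentences.
dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
tags = tag_topics(dists)
sim = similarity_matrix(dists)
```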

            Category Modeling

            Category Definition

            Before training our model, we had to make sure the categories were defined correctly, since our task is to predict the categories (labels). The exact definition of each category (including the sub-categories) can be found in the Global Cybersecurity Index & Cyberwellness Profiles:


            The category definitions of all 192 countries can be found in this book. We also collected additional definitions from papers and reports.

            Labeled category

            Each category, including each sub-category, was labeled with a specific definition. The labeled file was then saved for vectorization.


            The machine learning model could only deal with numeric data, so we used Word2vec to produce word embeddings, which are vector representations of words.

            Sentence Classification

            Sentence modeling

            Doc2vec was deployed to transform our sentences into numeric vectors.

            Classify by categories

            First, sentences were classified by calculating their similarity to each category. A category label was assigned when the statistical significance of the similarity was more than 95%.

            Classify by sub-categories

            After classification by category, sentences were tagged with the sub-categories whose similarity was of more than 95% significance.
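The two-stage tagging can be sketched as follows. This is only an illustration under stated assumptions: the vectors are toy values standing in for Doc2vec output, cosine similarity is assumed as the measure, the sub-category names are hypothetical, and a fixed similarity cutoff of 0.95 stands in for the 95% significance criterion because the way significance was computed is not described.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_label(vector, label_vectors, threshold):
    """Return the most similar label, or None if it does not reach the cutoff."""
    best, best_sim = None, -1.0
    for label, reference in label_vectors.items():
        sim = cosine(vector, reference)
        if sim > best_sim:
            best, best_sim = label, sim
    return best if best_sim >= threshold else None


def classify(vector, categories, sub_categories, threshold=0.95):
    """Tag a sentence vector with a category first, then a sub-category."""
    category = best_label(vector, categories, threshold)
    if category is None:
        return None, None
    sub = best_label(vector, sub_categories.get(category, {}), threshold)
    return category, sub


# Toy reference vectors; real ones would come from the labeled definitions.
categories = {"legal": [1.0, 0.0, 0.0], "technical": [0.0, 1.0, 0.0]}
sub_categories = {
    "legal": {"criminal legislation": [0.9, 0.0, 0.1], "regulation": [0.6, 0.0, 0.8]},
    "technical": {"CERT": [0.0, 1.0, 0.0]},
}

tagged = classify([0.98, 0.05, 0.1], categories, sub_categories)
untagged = classify([0.5, 0.5, 0.7], categories, sub_categories)
```

Sentences that do not reach the cutoff for any category stay untagged, which mirrors the rule that labels are only assigned above the 95% level.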


            • Anaconda (Jupyter)

            Related Python Packages

            • pyPDF, pdfminer
            • googletrans
            • Beautiful soup
            • requests
            • urllib
            • re
            • numpy
            • pandas
            • gensim
            • tensorflow
            • nltk
            • scikit-learn
            • pillow
            • collections
            • argparse
            • pytesseract
            Co-authors to your solution

            Leo Lu/Hanson Dong/Erica Xu/Clark Chen
