
[NLP Concept Summary] An Easy-to-Understand Taxonomy

냉철한 욱 2020. 4. 16. 19:23

* Word Representation Perspective (Word Embedding)

1. Discrete Representation : Local Representation

    1) One-hot Vector

        - One-hot Vector

    2) Count Based

       - Bag of Words (BoW)

       - Document-Term Matrix (DTM)

       - Term-Document Matrix (TDM)

       - Term Frequency-Inverse Document Frequency (TF-IDF)

       - N-gram Language Model (N-gram)

2. Continuous Representation

    1) Prediction Based (Distributed Representation)

        - Neural Network Language Model (NNLM) or Neural Probabilistic Language Model (NPLM)

        - Word2Vec

        - FastText

        - Embedding from Language Model (ELMo) (uses a Bidirectional Language Model (biLM))

    2) Count Based (Full Document)

        - Latent Semantic Analysis (LSA) <- built from the DTM

    3) Prediction Based and Count Based (Window-Based)

        - GloVe

*

A Discrete Representation expresses the value itself: a discrete representation encoded as integers.

A Continuous Representation expresses a word by encoding its relationships and attribute semantics: a continuous representation encoded as real numbers (the two are contrasted in the sketch below).
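To make the discrete vs. continuous contrast concrete, here is a minimal sketch (assuming scikit-learn is installed; the three documents are made up for illustration). The Document-Term Matrix and TF-IDF are sparse, count-based local representations, while LSA (a truncated SVD of the DTM, as listed under Count Based (Full Document) above) produces dense, real-valued document vectors.

```python
# Minimal sketch: discrete, count-based representations (DTM, TF-IDF) vs. a
# continuous one (LSA over the DTM). Documents are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# Discrete / count-based: Document-Term Matrix of integer counts.
dtm = CountVectorizer().fit_transform(docs)
print(dtm.toarray())        # integer counts, one row per document

# TF-IDF re-weights the same counts (still a sparse, local representation here).
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)          # same documents x vocabulary shape

# Continuous, count-based (full document): LSA = truncated SVD of the DTM,
# giving each document a dense real-valued vector in a latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(dtm)
print(lsa)
```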

 


* Language Model Perspective

1. Statistical Language Model

    1) Prediction Based

      - N-gram Language Model (N-gram)

      - Naive Bayes Classifier

   2) Topic Modeling

     - Latent Semantic Analysis (LSA)

     - Latent Dirichlet Allocation (LDA)

2. Neural Network Based Language Model

   1) Prediction Based

      - Multilayer Perceptron (MLP)

      - Neural Network Language Model (NNLM) or Neural Probabilistic Language Model (NPLM)

      - Recurrent Neural Network Language Model (RNNLM)

      - Character-level Recurrent Neural Network Language Model (Char RNNLM)

      - Bidirectional Language Model (biLM)

   => Evaluation metric: Perplexity (PPL); see the bigram sketch below
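As a concrete example of the evaluation metric, here is a minimal sketch (pure Python; the toy corpus is made up for illustration) of a bigram statistical language model with add-one smoothing, scored by perplexity: PPL = exp(-(1/N) Σ log p(w_i | w_{i-1})), where lower is better.

```python
# Minimal sketch: bigram language model with add-one (Laplace) smoothing,
# evaluated by perplexity on a held-out sentence. Toy data for illustration.
import math
from collections import Counter

train = "the cat sat on the mat . the dog sat on the log .".split()
test  = "the cat sat on the log .".split()

unigrams = Counter(train)
bigrams  = Counter(zip(train, train[1:]))
V = len(unigrams)                     # vocabulary size used for smoothing

def bigram_prob(prev, word):
    # Add-one smoothing gives unseen bigrams a small nonzero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

log_prob = sum(math.log(bigram_prob(p, w)) for p, w in zip(test, test[1:]))
N = len(test) - 1                     # number of predicted tokens
ppl = math.exp(-log_prob / N)
print(f"perplexity = {ppl:.2f}")      # lower perplexity = better language model
```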

 


* Task Perspective

1. Text Classification

    - Naive Bayes Classifier

    - Recurrent Neural Network (RNN) : many to one

    - Long Short-Term Memory (LSTM) : many to one

=> Evaluation metric: consider the F1 score; see the Naive Bayes sketch below
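Here is a minimal sketch of the Naive Bayes route listed above (assuming scikit-learn; the tiny labeled set is made up for illustration), evaluated with the F1 score.

```python
# Minimal sketch: Bag-of-Words features + Multinomial Naive Bayes for text
# classification, evaluated with the F1 score. Toy data for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

texts  = ["great movie", "loved this film", "terrible acting", "boring and bad"]
labels = [1, 1, 0, 0]                      # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(texts)               # Bag of Words (BoW) features
clf = MultinomialNB().fit(X, labels)

X_test = vec.transform(["bad acting", "loved this movie"])
pred = clf.predict(X_test)
print(pred, f1_score([0, 1], pred))        # F1 balances precision and recall
```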

 

2. Part-of-Speech Tagging (POS Tagging), Named Entity Recognition (NER)

    - Bidirectional Long Short-Term Memory (Bi-LSTM) : many to many

    - Bidirectional Long Short-Term Memory (Bi-LSTM) + Conditional Random Field (CRF) : many to many

=> Evaluation metric: consider the F1 score; see the Bi-LSTM tagger sketch below
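For the many-to-many setup above, here is a minimal sketch of a Bi-LSTM tagger (assuming PyTorch; the vocabulary and tag-set sizes are arbitrary placeholders). A CRF layer, if added, would replace the per-token scores with a globally normalized sequence score.

```python
# Minimal sketch: many-to-many Bi-LSTM tagger for POS tagging / NER,
# emitting one tag score vector per input token. Sizes are placeholders.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=1000, tagset_size=10, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, tagset_size)   # forward + backward states

    def forward(self, token_ids):                       # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))           # (batch, seq_len, 2*hidden)
        return self.out(h)                              # (batch, seq_len, tagset_size)

tagger = BiLSTMTagger()
scores = tagger(torch.randint(0, 1000, (2, 7)))         # 2 sentences of 7 tokens
print(scores.shape)                                     # torch.Size([2, 7, 10]): many to many
```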

3. Machine Translation, Chatbot, Text Summarization, Speech to Text

   - Long Short-Term Memory (LSTM) : sequence to sequence (seq2seq)

   - Bidirectional Long Short-Term Memory (Bi-LSTM) + Attention Mechanism : sequence to sequence (seq2seq)

   - Transformer

=> Evaluation metric: Bilingual Evaluation Understudy Score (BLEU Score); see the BLEU sketch below
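As a concrete example of the evaluation metric, here is a minimal sketch of BLEU scoring (assuming NLTK; the reference and hypothesis sentences are made up). Smoothing avoids zero scores when some higher-order n-grams of a short sentence have no match.

```python
# Minimal sketch: BLEU score of a translation hypothesis against one reference,
# as used to evaluate seq2seq / Transformer models. Sentences are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]

bleu = sentence_bleu(reference, hypothesis,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {bleu:.3f}")                               # closer to 1.0 is better
```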

4. Image Captioning