[TensorFlow] 04. Natural Language Processing

Machine Learning 2024. 5. 16. 13:25

Natural language processing (NLP) is the branch of artificial intelligence concerned with understanding human language.

Encoding language as numbers (Tokenizer)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

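The code above uses a Tokenizer object to encode each word in the sentences as a number.

num_words: the maximum number of words to keep, ranked by frequency. The cap is applied when text is converted to sequences; word_index itself still lists every word seen.
All words are converted to lowercase.
The more frequent a word, the lower its index.
Punctuation (? " ! ...) is removed automatically. With the default filters, the apostrophe (') is the one exception and is kept.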
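The rules above can be reproduced in a few lines of plain Python. This is a simplified sketch of the idea, not the Keras implementation: `split_words`, `build_word_index`, and `encode` are hypothetical helpers, and `FILTERS` is assumed to match the Tokenizer's default filter string.

```python
from collections import Counter

# Assumed to mirror the Tokenizer's default `filters` (note: no apostrophe).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TABLE = str.maketrans({ch: " " for ch in FILTERS})


def split_words(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    return text.lower().translate(TABLE).split()


def build_word_index(texts):
    """Rank words by frequency; index 1 goes to the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, num_words=None):
    """Map words to indices, keeping only indices below num_words."""
    ids = [word_index[w] for w in split_words(text) if w in word_index]
    if num_words is not None:
        ids = [i for i in ids if i < num_words]
    return ids


sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
]
word_index = build_word_index(sentences)
print(word_index)
# {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
print(encode('Today is a rainy day', word_index, num_words=5))
# [1, 2, 3]
```

With `num_words=5` only indices 1 to 4 survive, so 'rainy' (6) and 'day' (5) drop out of the sequence even though both are still present in `word_index`.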

sequences = tokenizer.texts_to_sequences(sentences)
print(sentences)
print(sequences)

['Today is a sunny day', 'Today is a rainy day', 'Is it sunny today?']
[[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [2, 7, 4, 1]]

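The texts_to_sequences() method of the Tokenizer object returns the encoded sequences.

This is how text is tokenized and encoded as numbers ahead of training. But what happens if, at test time, the model meets a word that was never encoded?

The OOV (out-of-vocabulary) token exists for this case.

Using the OOV token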

test_data = [
    'Today is a snowy day',
    'Will it be rainy tomorrow?'
]

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
[[1, 2, 3, 5], [7, 6]]

When the test data contains words that were never encoded, they are simply skipped. Here the two sentences are encoded as 'today is a day' and 'it rainy', which loses the original context and meaning.

tokenizer_oov = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer_oov.fit_on_texts(sentences)
word_index_oov = tokenizer_oov.word_index

test_sequences_oov = tokenizer_oov.texts_to_sequences(test_data)
print(word_index_oov)
print(test_sequences_oov)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]

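oov_token: the token that replaces any word not in the vocabulary

The OOV token takes index 1, and every unknown word in the sequences is replaced with 1.
The original meaning of the sentence is still lost, but the sequence now records that some word was there, which is an improvement.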
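The difference between dropping unknown words and replacing them can be sketched without TensorFlow. The `encode` helper below is hypothetical and only mimics the behavior described above; the word index is copied from the printed output.

```python
import string

# Word index copied from the output above.
word_index_oov = {'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4,
                  'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
PUNCT = str.maketrans('', '', string.punctuation)


def encode(text, word_index, oov_token=None):
    """Encode one sentence; unknown words become oov_token or are dropped."""
    ids = []
    for word in text.lower().translate(PUNCT).split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            ids.append(word_index[oov_token])
        # Otherwise the unknown word is silently skipped.
    return ids


text = 'Will it be rainy tomorrow?'
print(encode(text, word_index_oov))                     # [8, 7]
print(encode(text, word_index_oov, oov_token='<OOV>'))  # [1, 8, 1, 7, 1]
```

The second result has the same length as the input sentence, so the positions of the known words are preserved.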

Using padding

A model expects input of a fixed shape during training, so let's make every sequence the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences2 = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
    'I really enjoyed walking in the snow today'
]

tokenizer2 = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer2.fit_on_texts(sentences2)
sequences2 = tokenizer2.texts_to_sequences(sentences2)
padded = pad_sequences(sequences2)
print(padded)

[[ 0  0  0  2  3  4  5  6]
 [ 0  0  0  2  3  4  7  6]
 [ 0  0  0  0  3  8  5  2]
 [ 9 10 11 12 13 14 15  2]]

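The remaining space is padded with 0 up to the length of the longest sequence, and by default the values are right-aligned.

Right-aligning the values and padding the start with 0 is called pre-padding. The padding parameter switches to post-padding.

padded2 = pad_sequences(sequences2, padding='post')
print(padded2)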

[[ 2  3  4  5  6  0  0  0]
 [ 2  3  4  7  6  0  0  0]
 [ 3  8  5  2  0  0  0  0]
 [ 9 10 11 12 13 14 15  2]]
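To make the two padding modes concrete, here is a minimal pure-Python sketch. `pad` is a hypothetical helper that imitates the behavior shown above on plain lists; it is not how pad_sequences is implemented.

```python
def pad(sequences, padding="pre", value=0):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (width - len(seq))
        # "pre" puts the fill in front (right-aligned values), "post" after.
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded


sequences = [[2, 3, 4, 5, 6], [3, 8, 5, 2]]
print(pad(sequences))                  # [[2, 3, 4, 5, 6], [0, 3, 8, 5, 2]]
print(pad(sequences, padding="post"))  # [[2, 3, 4, 5, 6], [3, 8, 5, 2, 0]]
```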

To guard against overly long sentences, the maxlen parameter sets a maximum length.

padded3 = pad_sequences(sequences2, padding='post', maxlen=6)
print(padded3)

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [11 12 13 14 15  2]]

The sentence that exceeds the limit has lost its beginning. To cut off the end instead, use the truncating parameter.

padded4 = pad_sequences(sequences2, padding='post', maxlen=6, truncating='post')
print(padded4)

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [ 9 10 11 12 13 14]]
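The interaction between maxlen, padding, and truncating can be sketched for a single sequence. `fit_length` is a hypothetical helper written for illustration; its defaults ("pre" for both) are chosen to match the behavior seen in the outputs above.

```python
def fit_length(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Cut or pad one sequence so that it is exactly maxlen long."""
    seq = list(seq)
    if len(seq) > maxlen:
        # "pre" drops items from the front, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


long_seq = [9, 10, 11, 12, 13, 14, 15, 2]
print(fit_length(long_seq, 6, padding="post"))
# [11, 12, 13, 14, 15, 2]
print(fit_length(long_seq, 6, padding="post", truncating="post"))
# [9, 10, 11, 12, 13, 14]
print(fit_length([3, 8, 5, 2], 6, padding="post"))
# [3, 8, 5, 2, 0, 0]
```

Which end to cut depends on where the useful signal sits. Here, truncating from the front keeps the final word 'today' (index 2), while truncating from the end keeps the opening words.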

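Removing stopwords and cleaning text

Stopwords are words that carry no particular meaning for the task. This section uses the term loosely for anything worth stripping out before training:

Words that appear too often, such as 'the', 'and', 'but'
HTML tags
Profanity
Punctuation
Proper nouns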

import tensorflow_datasets as tfds

imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
  # item['text'] is a bytes object, so str() keeps the b'...' wrapper around the text
  imdb_sentences.append(str(item['text']))

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
print(tokenizer.word_index)

{'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5, 'is': 6, 'br': 7, 'in': 8, 'it': 9, 'i': 10, 'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15, ...

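The top-ranked words include 'the', 'and', 'a', and 'br', none of which carry meaning. Let's remove them: BeautifulSoup strips the HTML tags, str.translate deletes the punctuation, and a stopword list filters out the common words.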

from bs4 import BeautifulSoup
import string

stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
             "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
             "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
             "he", "hed", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how",
             "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "it", "its", "itself",
             "lets", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
             "our", "ours", "ourselves", "out", "over", "own", "same", "she", "shed", "shell", "shes", "should",
             "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then",
             "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through",
             "to", "too", "under", "until", "up", "very", "was", "we", "wed", "well", "were", "weve", "were",
             "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why",
             "whys", "with", "would", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself",
             "yourselves"]

table = str.maketrans('', '', string.punctuation)

imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
  # Decode the raw bytes and lowercase the review
  sentence = item['text'].decode('UTF-8').lower()
  # Strip HTML tags such as <br />; naming the parser avoids bs4's parser-guessing warning
  soup = BeautifulSoup(sentence, "html.parser")
  sentence = soup.get_text()
  words = sentence.split()
  filtered_sentence = ""
  for word in words:
    # Delete punctuation, then keep the word only if it is not a stopword
    word = word.translate(table)
    if word not in stopwords:
      filtered_sentence = filtered_sentence + word + " "
  imdb_sentences.append(filtered_sentence)

tokenizer = Tokenizer(num_words=25000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
print(tokenizer.word_index)

{'movie': 1, 'film': 2, 'not': 3, 'one': 4, 'like': 5, 'just': 6, 'good': 7, 'even': 8, 'no': 9, 'time': 10, 'really': 11, 'story': 12, 'see': 13, 'can': 14, 'much': 15, ...}
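To see what the cleaning loop does to a single review, here is a small stdlib-only sketch. It is an approximation under stated assumptions: Python's html.parser stands in for BeautifulSoup, `STOPWORDS` is a tiny illustrative subset of the list above, and the sample review is made up.

```python
import string
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text between tags (stand-in for soup.get_text())."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Tiny illustrative subset of the stopword list.
STOPWORDS = {"this", "is", "a", "the", "and", "it", "i"}
TABLE = str.maketrans("", "", string.punctuation)


def clean(review):
    """Lowercase, drop HTML tags, delete punctuation, filter stopwords."""
    text = strip_tags(review.lower())
    words = (word.translate(TABLE) for word in text.split())
    # Unlike the loop above, words that become empty are dropped too.
    return " ".join(word for word in words if word and word not in STOPWORDS)


sample = "This movie is GREAT! <br /><br />I loved the plot, and the acting."
print(clean(sample))  # movie great loved plot acting
```

Text on either side of a removed tag is joined without a separator, so a review written as `great!<br />I` would end up as the single token `greati`. The sample keeps a space before the tag to avoid that.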

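The result is cleaner than before, but some odd tokens remain. (The output above is truncated because it is very long.) Words joined by a dash (-) or slash (/) in the reviews, such as 'annoying-conclusion' or 'him/her', are not split: deleting the punctuation fuses the two halves into a single token. Let's fix this by putting spaces around those characters before the punctuation is removed.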

imdb_sentences = []
for item in train_data:
  sentence = item['text'].decode('UTF-8').lower()
  # Put spaces around separators so that joined words split instead of fusing
  sentence = sentence.replace(",", " , ")
  sentence = sentence.replace(".", " . ")
  sentence = sentence.replace("-", " - ")
  sentence = sentence.replace("/", " / ")
  # Strip HTML tags; naming the parser avoids bs4's parser-guessing warning
  soup = BeautifulSoup(sentence, "html.parser")
  sentence = soup.get_text()
  words = sentence.split()
  filtered_sentence = ""
  for word in words:
    # Delete punctuation, then keep the word only if it is not a stopword
    word = word.translate(table)
    if word not in stopwords:
      filtered_sentence = filtered_sentence + word + " "
  imdb_sentences.append(filtered_sentence)

tokenizer = Tokenizer(num_words=25000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
print(tokenizer.word_index)
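The effect of padding the separators with spaces can be checked on a short string. `tokens` is a hypothetical helper and the input phrase is made up for illustration; the translation table is built the same way as `table` above.

```python
import string

table = str.maketrans('', '', string.punctuation)


def tokens(text, pad_separators=False):
    """Delete punctuation from each word, optionally spacing out separators first."""
    if pad_separators:
        for ch in ",.-/":
            text = text.replace(ch, f" {ch} ")
    words = (word.translate(table) for word in text.split())
    # A lone separator becomes an empty string once its punctuation is deleted.
    return [word for word in words if word]


text = "an annoying-conclusion for him/her"
print(tokens(text))
# ['an', 'annoyingconclusion', 'for', 'himher']
print(tokens(text, pad_separators=True))
# ['an', 'annoying', 'conclusion', 'for', 'him', 'her']
```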

Now let's use the resulting encoding to convert a few simple sentences.

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[516, 5229, 147], [516, 6489, 147], [5229, 516]]

Decoding them again gives:

reverse_word_index = dict([(value, key) for (key, value) in tokenizer.word_index.items()])

for s in range(len(sentences)):
  decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[s]])
  print(decoded_review)

today sunny day  
today rainy day  
sunny today
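The round trip is lossy because stopwords such as 'is', 'a', and 'it' were filtered out before fitting, so they have no index at all. A minimal sketch of that round trip, using only the four indices visible in the output above (the `encode` and `decode` helpers are hypothetical):

```python
# Subset of the word index, with indices taken from the output above.
word_index = {'today': 516, 'sunny': 5229, 'day': 147, 'rainy': 6489}
reverse_word_index = {index: word for word, index in word_index.items()}


def encode(text):
    """Keep only the words that have an index; everything else is dropped."""
    words = text.lower().replace('?', '').split()
    return [word_index[word] for word in words if word in word_index]


def decode(ids):
    """Map indices back to words, using '?' for any unknown index."""
    return ' '.join(reverse_word_index.get(i, '?') for i in ids)


ids = encode('Is it sunny today?')
print(ids)          # [5229, 516]
print(decode(ids))  # sunny today
```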
