Natural Language Processing Workshop 01: Preprocessing with NLTK

University of Melbourne COMP90042 Workshop

Posted by YEY on March 6, 2020

Workshop 01: Preprocessing with NLTK

First of all, if you have never used iPython notebooks before: to run the code in this workbook, select a code cell and then either click the run command in the Cell menu or press shift + enter. In general, you need to run the cells in program order for the code to work correctly. The output of a given cell (any results, including plots) is displayed below the code once it has finished running. To make sure everything is working, try running the code below:

print("hello world")

Output:

hello world

OK, now let's do some simple preprocessing on an HTML fragment taken from the class website:

text = '''
<body>
    <!-- JavaScript plugins (requires jQuery) -->
    <script src="http://code.jquery.com/jquery.js"></script>
    <!-- Include all compiled plugins (below), or include individual files as needed -->
    <script src="js/bootstrap.min.js"></script>

    <div class="container">
      <div class="page-header">
  <h3>COMP90042 Natural Language Processing</h3>
</div>

The aims for this subject is for students to develop an understanding of the main algorithms used in natural 
language processing, for use in a diverse range of applications including text classification, machine 
translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language 
modelling, syntactic parsing and deep learning. The programming language used is Python, see 
<a href="python.html">the detailed configuration instructions</a> for more information on its use in the 
workshops, assignments and installation at home.
</body>
'''

First, let's remove the HTML markup using a regular expression:

import re

# Strip anything that looks like a tag: a "<" followed by non-">" characters and a closing ">"
text = re.sub("<[^>]+>", "", text).strip()
print(text)

Output:

COMP90042 Natural Language Processing


The aims for this subject is for students to develop an understanding of the main algorithms used in natural 
language processing, for use in a diverse range of applications including text classification, machine 
translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language 
modelling, syntactic parsing and deep learning. The programming language used is Python, see 
the detailed configuration instructions for more information on its use in the 
workshops, assignments and installation at home.
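
The regex is fine for this small snippet, but real-world HTML (comments, attributes that contain ">", script and style bodies) can trip up tag-stripping regexes. As a rough alternative sketch that is not part of the original workshop, Python's standard-library html.parser can collect the text nodes instead; the small input string here is just for illustration:

from html.parser import HTMLParser

# Collect the text content of every element; a real pipeline would usually
# also skip whatever sits inside <script> and <style> tags
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<div class="page-header"><h3>COMP90042 Natural Language Processing</h3></div>')
print("".join(extractor.parts))

For the rest of the workshop we keep working with the regex-cleaned text from above.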

Looking at that cleaned text, we can now see more clearly that there are three newline characters between the heading and the body text, and that there are also some newlines inside the text itself. Our sentence tokenizer will not handle the heading correctly, so we remove it, and then change the remaining newlines to spaces:

# Drop the heading (everything before the triple newline), then flatten the remaining newlines into spaces
text = text.split("\n\n\n")[1].replace("\n", " ")
print(text)

Output:

The aims for this subject is for students to develop an understanding of the main algorithms used in natural  language processing, for use in a diverse range of applications including text classification, machine  translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language  modelling, syntactic parsing and deep learning. The programming language used is Python, see  the detailed configuration instructions for more information on its use in the  workshops, assignments and installation at home.

Next, let's split the text into sentences. Although in this example a simple approach such as splitting on full stops with split(".") would be perfectly adequate, let's try NLTK's punkt sentence segmenter (loading the pre-trained English pickle model), which can handle more complicated cases that may appear in text:

import nltk
nltk.download('punkt')
sent_segmenter = nltk.data.load('tokenizers/punkt/english.pickle')

sentences = sent_segmenter.tokenize(text)
print(sentences)

Output:

['The aims for this subject is for students to develop an understanding of the main algorithms used in natural  language processing, for use in a diverse range of applications including text classification, machine  translation, and question answering.', 'Topics to be covered include part-of-speech tagging, n-gram language  modelling, syntactic parsing and deep learning.', 'The programming language used is Python, see  the detailed configuration instructions for more information on its use in the  workshops, assignments and installation at home.']


[nltk_data] Downloading package punkt to /Users/laujh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
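
To see why a trained segmenter is worth the trouble, here is a small illustrative check that is not part of the original workshop: a naive split(".") breaks in the middle of abbreviations, whereas the punkt model has learned which full stops usually end a sentence (it generally recognises common abbreviations such as "Dr."):

example = "Dr. Smith studies NLP. She teaches part-of-speech tagging."

# A naive split breaks on the full stop in "Dr."
print(example.split("."))

# The trained punkt model uses its learned abbreviation list and context
print(sent_segmenter.tokenize(example))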

NLTK also provides word tokenizers. For the second sentence (sentences[1]), let's compare a naive split on spaces with NLTK's regex-based WordPunctTokenizer():

word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()

tokenized_sentence = word_tokenizer.tokenize(sentences[1])
print(tokenized_sentence)
print(sentences[1].split(" "))

Output:

['Topics', 'to', 'be', 'covered', 'include', 'part', '-', 'of', '-', 'speech', 'tagging', ',', 'n', '-', 'gram', 'language', 'modelling', ',', 'syntactic', 'parsing', 'and', 'deep', 'learning', '.']
['Topics', 'to', 'be', 'covered', 'include', 'part-of-speech', 'tagging,', 'n-gram', 'language', '', 'modelling,', 'syntactic', 'parsing', 'and', 'deep', 'learning.']

As you can see, the NLTK tokenizer correctly separates the trailing comma and the final full stop from the words they are attached to. It also splits up the hyphenated word "part-of-speech", which may be the right behaviour for some applications and the wrong one for others. Notice too that the naive space split leaves an empty string behind where the text contained a double space.
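
If keeping hyphenated words together matters for your application, one alternative worth comparing is NLTK's Treebank-style tokenizer, exposed as nltk.word_tokenize (it relies on the same punkt resource we downloaded above). This is a small sketch, not part of the original workshop:

# The Treebank-style tokenizer also splits off commas and final full stops,
# but leaves internal hyphens ("part-of-speech", "n-gram") intact
print(nltk.word_tokenize(sentences[1]))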

Let's try lemmatization. NLTK's wordnet module provides a built-in lemmatizer, WordNetLemmatizer(); however, to use it we need to know each word's part of speech. In this example we will first try lemmatizing as a verb, and if the word comes back unchanged, fall back to lemmatizing it as a noun:

nltk.download('wordnet')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma

print([lemmatize(token) for token in tokenized_sentence])

Output:

[nltk_data] Downloading package wordnet to /Users/laujh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Topics', 'to', 'be', 'cover', 'include', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'language', 'model', ',', 'syntactic', 'parse', 'and', 'deep', 'learn', '.']
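
The verb-then-noun heuristic above is a workaround for not knowing each word's part of speech. A more principled approach is to run a part-of-speech tagger first and map its Penn Treebank tags onto WordNet's categories. The sketch below is not part of the original workshop and assumes the averaged_perceptron_tagger resource (the resource name can differ slightly between NLTK versions):

from nltk.corpus import wordnet

nltk.download('averaged_perceptron_tagger')

# Map Penn Treebank tag prefixes onto the POS categories WordNet understands
def penn_to_wordnet(tag):
    if tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged = nltk.pos_tag(tokenized_sentence)
print([lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged])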

Compare the lemmatized output with the result of stemming using the Porter Stemmer:

stemmer = nltk.stem.porter.PorterStemmer()
print([stemmer.stem(token) for token in tokenized_sentence])

Output:

['topic', 'to', 'be', 'cover', 'includ', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'languag', 'model', ',', 'syntact', 'pars', 'and', 'deep', 'learn', '.']
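
To make the contrast explicit, here is a quick side-by-side on a few of the words above (reusing the lemmatize helper defined earlier; not part of the original workshop). The stemmer happily produces non-words such as "languag" and "includ", while the lemmatizer maps to dictionary forms where it can:

for word in ["language", "included", "modelling", "syntactic"]:
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatize(word))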

Next up: the BPE algorithm
