Pre-trained vectors, NLP, word2vec: word embeddings for a particular topic?


Question

Is there a pre-trained vector set for a particular topic only? For example, for "java": I want a file containing only Java-related vectors, so that if I give the input "inheritance", cosine similarity shows me "polymorphism" and other related terms only. I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I am still not getting related words.
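The "cosine similarity shows me related terms" behaviour the question describes can be sketched in plain Python. The tiny 3-d vectors below are hand-made stand-ins, not from any real model; real embeddings would come from a model trained on Java-related text, but the lookup logic is the same.

```python
import math

# Hypothetical hand-made 3-d vectors for a few terms; real embeddings
# would be hundreds of dimensions and learned from a corpus.
vectors = {
    "inheritance": [0.9, 0.8, 0.1],
    "polymorphism": [0.85, 0.75, 0.2],
    "encapsulation": [0.8, 0.9, 0.15],
    "coffee": [0.1, 0.2, 0.95],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

print(most_similar("inheritance"))
# 'polymorphism' and 'encapsulation' outrank the off-topic 'coffee'
```

Whether the neighbours returned are actually on-topic depends entirely on what text the vectors were trained on, which is the crux of the answers below.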


Answer 1

Score: 0

Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code, you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/


Answer 2

Score: 0

Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they were trained on: both more specialized words, and word-vectors matching the word senses in those domains.

For example, the GoogleNews word-vectors were trained on news articles from around 2012, so their vector for 'Java' may be dominated by stories about the Indonesian island of Java as much as by the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference style of writing. But there could be other sets that better emphasize the word senses you need.

However, the best approach is often to train your own word-vectors from a training corpus that closely matches the topics/documents you are concerned about. The word-vectors are then well tuned to your domain of concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than a generic vector from someone else's corpus. ("Enough" has no firm definition, but usually means at least 5, and ideally dozens to hundreds, of representative, diverse uses.)

Let's consider your example goal: showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)

You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance'. That is a separate challenge, which might be tackled via (1) a hand-crafted glossary of multi-word phrases that should be combined; (2) statistical analysis of word pairs that occur together so often that they should be combined; or (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
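Option (1), the hand-crafted glossary, is simple enough to sketch directly. The glossary entries below are illustrative, not drawn from any real dataset:

```python
# Hand-crafted glossary mapping adjacent token pairs to a single token.
# Entries are illustrative examples only.
GLOSSARY = {
    ("input", "inheritance"): "input_inheritance",
    ("stack", "overflow"): "stack_overflow",
}

def merge_phrases(tokens):
    """Replace any adjacent token pair found in GLOSSARY with one token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in GLOSSARY:
            out.append(GLOSSARY[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases(["what", "is", "input", "inheritance"]))
# -> ['what', 'is', 'input_inheritance']
```

Running every sentence through such a merge step before training means the combined token gets its own vector, so 'input_inheritance' can then be compared against 'polymorphism' directly.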

(The multi-word phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and may still produce some combinations that a person would consider nonsense, while missing others that a person would consider natural.)


huangapple
  • Published on 2020-04-06 03:09:27
  • Please keep this link when reposting: https://java.coder-hub.com/61048035.html