PyTorch Study Notes, Advanced Topics (Time Series Representation)

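Earlier notes covered processing 2D data (such as images) and convolutional neural networks. This post introduces how to represent time-series data such as language and running text.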

### 1. Representation

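Text is string data, and PyTorch has no built-in data type for strings, so the text must first be encoded as numbers. A common approach is sequence representation: each word is represented by a vector.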
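
As a minimal sketch of this encoding step, each distinct word can first be mapped to an integer index, which PyTorch can then store in a tensor. The example sentence, the whitespace tokenization, and the names `sentence`, `tokens`, and `indices` are assumptions chosen for illustration.

```python
import torch

# Hypothetical example sentence; splitting on whitespace is a simplification.
sentence = "I love Rome and Paris"
tokens = sentence.lower().split()

# Assign each distinct word an integer index in order of first appearance.
word_to_ix = {}
for word in tokens:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)

# The sentence becomes a 1-D tensor of word indices, one entry per word.
indices = torch.tensor([word_to_ix[w] for w in tokens], dtype=torch.long)
print(indices.shape)  # torch.Size([5])
```

These indices are not yet a useful representation on their own; they serve as keys for turning each word into a vector.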

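```
[seq_len, feature_len]
```
The first dimension is the sequence length (the number of words), and the second is the length of the feature vector that represents each word.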
![Image description](https://m2.im5i.com/2022/01/01/UZ0vJW.png)  
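For the text shown in the figure, the encoding is **one-hot**, so the length of each vector equals the number of words in the vocabulary: a 1 in the first position denotes Rome, and so on. With a vocabulary of 3500 words and a sentence of 5 words, the sentence is represented by a tensor of shape [5, 3500].  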
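
The shape [5, 3500] can be checked with a short sketch that uses `torch.nn.functional.one_hot`. The word indices below are made up for illustration (only index 0 standing for Rome comes from the text), so treat this as one possible way to build the encoding.

```python
import torch
import torch.nn.functional as F

vocab_size = 3500  # vocabulary size used in the text

# Hypothetical word indices for a 5-word sentence (index 0 stands for "Rome").
word_indices = torch.tensor([0, 17, 256, 1024, 3499])

# F.one_hot returns an integer tensor with exactly one 1 in each row.
one_hot = F.one_hot(word_indices, num_classes=vocab_size)
print(one_hot.shape)   # torch.Size([5, 3500])
print(one_hot[0, :5])  # tensor([1, 0, 0, 0, 0])
```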
![Image description](https://m2.im5i.com/2022/01/01/UZ01Ax.png)  
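However, **one-hot** encoding has obvious drawbacks: the vectors are sparse and high-dimensional, so it is generally not used on large datasets. An encoding should also take the semantic relatedness between words into account.  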
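
Both drawbacks can be made concrete with a small sketch. The two word indices are hypothetical; the point is that a one-hot vector is almost entirely zeros, and that any two distinct one-hot vectors are orthogonal, so the encoding carries no information about how related two words are.

```python
import torch
import torch.nn.functional as F

vocab_size = 3500

# One-hot vectors for two different (hypothetical) words.
vecs = F.one_hot(torch.tensor([0, 1]), num_classes=vocab_size).float()
a, b = vecs[0], vecs[1]

# Sparsity: only one of the 3500 entries is non-zero.
nonzero_fraction = (a != 0).sum().item() / vocab_size

# Distinct one-hot vectors are orthogonal, so their cosine similarity is 0
# regardless of what the two words mean.
similarity = F.cosine_similarity(a, b, dim=0).item()
print(nonzero_fraction, similarity)
```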
### 2. Embedding and GloVe  
``` python 
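import torch
import torch.nn as nn

# 1. Embedding: a trainable lookup table from word index to dense vector
word_to_ix = {"hello": 0, "world": 1}
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
# 2 words in vocab, 5 dimensional embeddings
embeds = nn.Embedding(2, 5)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
# Example output (the weights are randomly initialized, so values differ per run):
'''
tensor([[ 0.2541, -1.3322,  0.9541,  0.6223,  0.5823]],
   grad_fn=<EmbeddingBackward0>)
'''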
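# 2. GloVe: pretrained word vectors, loaded here through the PyTorch-NLP
# package (pip install pytorch-nlp); the vectors are downloaded on first use,
# which can be a large download
from torchnlp.word_to_vector import GloVe
vectors = GloVe()
vectors['hello']  # pretrained vector for "hello"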