### Coursework coding instructions (please also see full coursework spec)

Version: Ready to submit

Please choose if you want to do either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models
We should be able to run both approaches from the same colab file

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

#### README
Approach 1 and Approach 2 can each be run independently. Please place the dataset in '/content/drive/MyDrive/data/task-1/'.



*   The .README FILE: [README](https://1drv.ms/u/s!Aqc8bRqOh1johe0FeGHe4AA6WHkaGg?e=GLvaaU) 
*   The Dataset: [Dataset](https://cs.rochester.edu/u/nhossain/humicroedit.html)






#### Approach 1: Pre-trained representations

##### Method 1: The BiLSTM model 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# You will need to download any word embeddings required for your code, e.g.:

# %cd /content/drive/My\ Drive/nlp-task1
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:

# ! pip install torch
# ! pip install torchtext

/content/drive/My Drive/nlp-task1


In [None]:
# Imports

import re
import torch
import nltk
import torch.nn as nn
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, random_split
from torchtext.data.utils import get_tokenizer
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
import codecs

nltk.download("stopwords")
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
# Load data
train_humicroedit_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train.csv')
train_funlines_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train_funlines.csv')

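# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
train_df = pd.concat([train_humicroedit_df, train_funlines_df], ignore_index=True)
# NOTE: the next line overwrites the combined frame, so only train.csv is used for training
train_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train.csv')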
test_df = pd.read_csv('/content/drive/MyDrive/data/task-1/dev.csv')

In [None]:
# Number of epochs
epochs = 20

# Proportion of training data for train compared to dev
train_proportion = 0.8

In [None]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    print("Training model.")

    for epoch in range(1, number_epoch+1):

        model.train()
        epoch_loss = 0
        epoch_sse = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:

            feature, target = batch

            feature, target = feature.to(device), target.to(device)

            # for RNN:
            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            # model.hidden = model.init_hidden()

            predictions = model(feature).squeeze(1)

            optimizer.zero_grad()

            loss = loss_fn(predictions, target)

            sse, __ = model_performance(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse

        valid_loss, valid_mse, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train MSE: {epoch_mse:.2f} | Train RMSE: {epoch_mse**0.5:.2f} | \
        Val. Loss: {valid_loss:.2f} | Val. MSE: {valid_mse:.2f} |  Val. RMSE: {valid_mse**0.5:.2f} |')

In [None]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_sse = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
            feature, target = batch

            feature, target = feature.to(device), target.to(device)

            # for RNN:
            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            # model.hidden = model.init_hidden()

            predictions = model(feature).squeeze(1)
            loss = loss_fn(predictions, target)

            # We get the mse
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            sse, __ = model_performance(pred, trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_sse/no_observations, np.array(pred_all), np.array(trg_all)

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse
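The training and evaluation loops accumulate the per-batch sum of squared errors (SSE) and divide by the total number of observations, instead of averaging the per-batch MSE values. A minimal sketch of why this matters when the last batch is smaller than the others; the numbers are made up for illustration and `batch_sse` is a hypothetical helper, not part of the notebook:

```python
# Illustrative only: toy predictions/targets split into unequal batches.
def batch_sse(preds, targets):
    """Sum of squared errors for one batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))

batches = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]),  # full batch of 4
    ([5.0], [1.0]),                                # final, smaller batch of 1
]

# What the notebook does: total SSE divided by total number of observations.
total_sse = sum(batch_sse(p, t) for p, t in batches)
n_obs = sum(len(t) for _, t in batches)
dataset_mse = total_sse / n_obs

# The naive alternative: average the per-batch MSEs, which over-weights the small batch.
mean_of_batch_mses = sum(batch_sse(p, t) / len(t) for p, t in batches) / len(batches)

print(dataset_mse, mean_of_batch_mses)
```

The two numbers differ whenever batch sizes are unequal, which is why the loops track `no_observations` separately.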

In [None]:
def create_vocab(data):
    """
    Creating a corpus of all the tokens used
    """
    tokenized_corpus = [] # Let us put the tokenized corpus in a list

    # define a tokenizer
    tokenizer = get_tokenizer("spacy")

    # define stopwords
    stopwordList = set(stopwords.words('english'))

    # define a lemmatizer
    lemmatizer = WordNetLemmatizer()

    for sentence in data:

        tokenized_sentence = tokenizer(sentence)
        valid_tokenized_sentence = []
        
        for word in tokenized_sentence:

            if re.match("^[A-Za-z]+$", word) and word not in stopwordList:

                valid_tokenized_sentence.append(lemmatizer.lemmatize(word))

        # for token in sentence.split(' '): # simplest split is
        #     tokenized_sentence.append(token)

        tokenized_corpus.append(valid_tokenized_sentence)

    # Create single list of all vocabulary
    vocabulary = []  # All distinct tokens (mostly words), in order of first appearance

    for sentence in tokenized_corpus:

        for token in sentence:

            if token not in vocabulary:
                vocabulary.append(token)

    return vocabulary, tokenized_corpus

In [None]:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''

    batch_labels = [l for f, l in batch]
    batch_features = [f for f, l in batch]

    batch_features_len = [len(f) for f, l in batch]

    seq_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

    for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
        seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

    batch_labels = torch.FloatTensor(batch_labels)

    return seq_tensor, batch_labels

class Task1Dataset(Dataset):

    def __init__(self, train_data, labels):
        self.x_train = train_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x_train[item], self.y_train[item]
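`collate_fn_padd` right-pads every sequence in a mini-batch with index 0, which is also the `<pad>` index of the embedding layer. A pure-Python sketch of the same idea, using made-up token IDs (the helper name `pad_batch` is hypothetical):

```python
def pad_batch(sequences, pad_index=0):
    """Right-pad each list of token IDs to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

# Three toy sequences of different lengths.
batch = [[4, 9, 2], [7], [3, 5]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Padding per batch, not to a global maximum, keeps the tensors as short as the longest headline in each batch.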

In [None]:
class BiLSTM(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size, device):
        super(BiLSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.device = device
        self.batch_size = batch_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
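        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        # NOTE: nn.LSTM applies dropout only between stacked layers, so with the
        # default num_layers=1 the dropout=0.2 argument has no effect (PyTorch warns about this).
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, dropout=0.2)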

        # The batch normalization layer that standardizes the inputs to a layer
        # for each mini-batch
        self.bn = nn.BatchNorm1d(hidden_dim * 2)

        # The linear layer that maps the concatenated hidden states to a single regression output
        self.hidden2label = nn.Linear(hidden_dim * 2, 1)
        self.hidden = torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
               torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)    

        # The activation layer that produces non-negative real outputs
        self.activation = nn.ReLU(True)

    def forward(self, sentence):

        # Before we've done anything, we don't have any hidden state.
        # Refer to the Pytorch documentation to see exactly why they have this dimensionality.
        # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
        self.hidden = torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
               torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)         

        embedded = self.embedding(sentence)
        embedded = embedded.permute(1, 0, 2)

        lstm_out, self.hidden = self.lstm(
            embedded.view(len(embedded), self.batch_size, self.embedding_dim), self.hidden)

        out = self.bn(lstm_out[-1])
        out = self.hidden2label(out)
        out = self.activation(out)

        # print(self.embedding_dim, self.hidden_dim, lstm_out.shape)

        return out



Proposed model
1. Each original headline is rewritten by substituting the edit word for the tagged word.
2. The edited headlines are tokenised and passed through a pre-trained word embedding model to obtain a distributed vector representation of each headline.
3. The encoder uses the pre-trained GloVe 6B vectors loaded in the code below, which map each word to a 100-dimensional feature vector.
4. A bidirectional LSTM (BiLSTM) reads the embedded headline and extracts higher-level features that carry sequential information.
5. Finally, the BiLSTM output is fed into a fully connected prediction module, which produces the predicted score.
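Step 1 of the list above can be sketched as follows, assuming the dataset marks the word to replace as `<word/>` in the `original` column and gives the replacement in `edit`. The headline is modelled on the first training example decoded later in this notebook; its capitalisation and the placement of the `<Isis/>` tag are assumptions, and `apply_edit` is a hypothetical helper name. Unlike the notebook cell, this sketch uses a non-greedy pattern and a callable replacement so that backslashes in the edit word are not interpreted.

```python
import re

def apply_edit(original, edit):
    """Replace the <word/> span in a headline with the edit word, then lower-case."""
    # A callable replacement returns the edit word literally.
    return re.sub(r"<.*?/>", lambda match: edit, original).lower()

headline = "France is ‘ hunting down its citizens who joined <Isis/> ’ without trial in Iraq"
edited = apply_edit(headline, "twins")
print(edited)
```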



In [None]:
## Data preprocessing

# replace the tagged word in each original headline with its edit word
train_df["edited"] = train_df.apply(lambda x: re.sub("<.*/>", f"{x.edit}", x.original), axis=1) 
test_df["edited"] = test_df.apply(lambda x: re.sub("<.*/>", f"{x.edit}", x.original), axis=1)

# make all edited lower case
train_df["edited"] = train_df["edited"].str.lower()
test_df["edited"] = test_df["edited"].str.lower()

In [None]:
# We set our training data and test data
training_data = train_df['edited']
test_data = test_df['edited']

# Creating word vectors
training_vocab, training_tokenized_corpus = create_vocab(training_data)
test_vocab, test_tokenized_corpus = create_vocab(test_data)

# Creating joint vocab from test and train:
joint_vocab, joint_tokenized_corpus = create_vocab(pd.concat([training_data, test_data]))

print("Vocab created.")

Vocab created.


In [None]:
# We create representations for our tokens
wvecs = [[0 for _ in range(100)]] # word vectors
word2idx = [("<pad>", 0)] # word2index
idx2word = [(0, "<pad>")]

# This is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.100d.txt', 'r','utf-8') as f:
  index = 1
  for line in f.readlines():
    # Ignore the first line - first line typically contains vocab, dimensionality
    if len(line.strip().split()) > 3:
      word = line.strip().split()[0]
      if word in joint_vocab:
          (word, vec) = (word,
                     list(map(float,line.strip().split()[1:])))
          wvecs.append(vec)
          word2idx.append((word, index))
          idx2word.append((index, word))
          index += 1

wvecs = np.array(wvecs)
word2idx = dict(word2idx)
idx2word = dict(idx2word)

vectorized_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in training_tokenized_corpus]

# To avoid any sentences being empty (if no words match to our word embeddings)
vectorized_seqs = [x if len(x) > 0 else [0] for x in vectorized_seqs]

print('Token representations created')

Token representations created
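The lookup step above drops tokens that have no pre-trained vector and falls back to the padding index when nothing is left. A small sketch of that behaviour on a toy vocabulary; the words, indices and the helper name `vectorize` are hypothetical:

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding, as in the notebook.
word2idx = {"<pad>": 0, "trump": 1, "tweet": 2, "cat": 3}

def vectorize(tokens, word2idx):
    """Map tokens to indices, dropping out-of-vocabulary tokens.

    A sentence with no known tokens becomes [0] so that it is never empty.
    """
    ids = [word2idx[tok] for tok in tokens if tok in word2idx]
    return ids if ids else [0]

print(vectorize(["trump", "tweet", "zzz"], word2idx))  # unknown "zzz" is dropped
print(vectorize(["zzz"], word2idx))                    # nothing known: padding only
```

Because unknown tokens are silently dropped, the coverage check in the next cell is a useful sanity test of how much of the vocabulary actually reaches the model.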


In [None]:
# check the coverage of word2idx w.r.t. vocab
counter = 0
wordList = word2idx.keys()
for word in joint_vocab:
    if word in wordList:
        counter += 1
    # else:
    #     print(word, len(word))

print(f"{ counter } / { len(joint_vocab) } of joint_vocab is in the word2idx")

13853 / 14480 of joint_vocab is in the word2idx


In [None]:
INPUT_DIM = len(word2idx)
EMBEDDING_DIM = 100
HIDDEN_DIM = 50
BATCH_SIZE = 128
lr = 5e-4

model = BiLSTM(EMBEDDING_DIM, HIDDEN_DIM, INPUT_DIM, BATCH_SIZE, device)
print("Model initialised.")

model.to(device)
# We provide the model with our embeddings
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

feature = vectorized_seqs

# 'feature' is a list of lists, each containing embedding IDs for word tokens
# print(len(feature), len(train_df['meanGrade']), train_df['meanGrade'][0])
train_and_dev = Task1Dataset(feature, train_df['meanGrade'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))

train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

train(train_loader, dev_loader, model, epochs)

Model initialised.
Dataloaders created.
Training model.
| Epoch: 01 | Train Loss: 0.74 | Train MSE: 0.74 | Train RMSE: 0.86 |         Val. Loss: 0.50 | Val. MSE: 0.50 |  Val. RMSE: 0.71 |
| Epoch: 02 | Train Loss: 0.49 | Train MSE: 0.49 | Train RMSE: 0.70 |         Val. Loss: 0.49 | Val. MSE: 0.49 |  Val. RMSE: 0.70 |
| Epoch: 03 | Train Loss: 0.43 | Train MSE: 0.43 | Train RMSE: 0.66 |         Val. Loss: 0.42 | Val. MSE: 0.42 |  Val. RMSE: 0.65 |
| Epoch: 04 | Train Loss: 0.39 | Train MSE: 0.39 | Train RMSE: 0.63 |         Val. Loss: 0.39 | Val. MSE: 0.39 |  Val. RMSE: 0.63 |
| Epoch: 05 | Train Loss: 0.35 | Train MSE: 0.35 | Train RMSE: 0.60 |         Val. Loss: 0.37 | Val. MSE: 0.37 |  Val. RMSE: 0.61 |
| Epoch: 06 | Train Loss: 0.31 | Train MSE: 0.31 | Train RMSE: 0.56 |         Val. Loss: 0.37 | Val. MSE: 0.37 |  Val. RMSE: 0.61 |
| Epoch: 07 | Train Loss: 0.27 | Train MSE: 0.27 | Train RMSE: 0.52 |         Val. Loss: 0.37 | Val. MSE: 0.37 |  Val. RMSE: 0.60 |
| Epoch: 08 | Train 

##### Method 2: The BERT model

In [None]:
# You will need to download any word embeddings required for your code, e.g.:

# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:

!pip install torch transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=4e50a3290d

In [None]:
# Imports

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split, DataLoader
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import codecs
import re
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
# Load data
train_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train.csv')
valid_df = pd.read_csv('/content/drive/MyDrive/data/task-1/dev.csv')
test_df = pd.read_csv('/content/drive/MyDrive/data/task-1/test.csv')
train_extra = pd.read_csv('/content/drive/MyDrive/data/task-1/train_funlines.csv')
train_df = pd.concat([train_df,train_extra], ignore_index=True)

In [None]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    
    print("Training model.")

    for epoch in range(1, number_epoch+1):

        model.train()
        epoch_loss = 0
        epoch_sse = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:

            input, attention, token, target = batch

            input, attention, token, target = input.to(device), attention.to(device), token.to(device), target.to(device)

            no_observations = no_observations + target.shape[0]

            predictions = model(input, attention, token)[0].squeeze(1)

            optimizer.zero_grad()

            loss = loss_fn(predictions, target)

            sse, __ = model_performance(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()
            scheduler.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse

        valid_loss, valid_mse, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train MSE: {epoch_mse:.2f} | Train RMSE: {epoch_mse**0.5:.4f} | \
        Val. Loss: {valid_loss:.2f} | Val. MSE: {valid_mse:.2f} |  Val. RMSE: {valid_mse**0.5:.4f} |')

In [None]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_sse = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
            input, attention, token, target = batch

            input, attention, token, target = input.to(device), attention.to(device), token.to(device), target.to(device)

            no_observations = no_observations + target.shape[0]

            predictions = model(input, attention, token)[0].squeeze(1)
            loss = loss_fn(predictions, target)

            # We get the mse
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            sse, __ = model_performance(pred, trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_sse/no_observations, np.array(pred_all), np.array(trg_all)

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse

In [None]:
class Task1Dataset(Dataset):

    def __init__(self, input, attention, token, labels):
        self.len = input.shape[0]
        self.x1_train = input.to(device)
        self.x2_train = attention.to(device)
        self.x3_train = token.to(device)
        self.y_train = labels.to(device)

    def __len__(self):
        return self.len

    def __getitem__(self, item):
        return self.x1_train[item], self.x2_train[item], self.x3_train[item], self.y_train[item]

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

# preprocess headline: strip the "<" and "/>" markers and collapse extra spaces

def preprocess(text):
    text = text.strip()
    text = text.replace("<", "").replace("/>", "")
    text = " ".join(text.split())
    return text

train_df["preprocess_headline"] = train_df["original"].progress_apply(preprocess)
valid_df["preprocess_headline"] = valid_df["original"].progress_apply(preprocess)
test_df["preprocess_headline"] = test_df["original"].progress_apply(preprocess)

In [None]:
# build the new headline: replace the "<word/>" span with the edit word and collapse extra spaces
def preprocess_newhead(text, new_word):
    text = text.strip()
    p = re.compile(r'\<(.*?)\/\>')
    text = p.sub(new_word, text)
    text = " ".join(text.split())
    return text
train_df["preprocess_new_headline"] = train_df.progress_apply(lambda row:preprocess_newhead(row['original'],row['edit']), axis=1)
valid_df["preprocess_new_headline"] = valid_df.progress_apply(lambda row:preprocess_newhead(row['original'],row['edit']), axis=1)
test_df["preprocess_new_headline"] = test_df.progress_apply(lambda row:preprocess_newhead(row['original'],row['edit']), axis=1)

In [None]:
# preprocess edited word, delete extra space
def preprocess_new_word(text):
    # text = text.lower()
    text = text.strip()
    text = " ".join(text.split())
    return text

train_df["preprocess_edit"] = train_df["edit"].progress_apply(preprocess_new_word)
valid_df["preprocess_edit"] = valid_df["edit"].progress_apply(preprocess_new_word)
test_df["preprocess_edit"] = test_df["edit"].progress_apply(preprocess_new_word)

In [None]:
# get original headlines, new headlines, edit words and labels
train_o_headlines = train_df["preprocess_headline"].tolist()
train_n_headlines = train_df["preprocess_new_headline"].tolist()
train_n_word = train_df["preprocess_edit"].tolist()
train_labels_l = train_df["meanGrade"].tolist()

valid_o_headlines = valid_df["preprocess_headline"].tolist()
valid_n_headlines = valid_df["preprocess_new_headline"].tolist()
valid_n_word = valid_df["preprocess_edit"].tolist()
valid_labels_l = valid_df["meanGrade"].tolist()

test_o_headlines = test_df["preprocess_headline"].tolist()
test_n_headlines = test_df["preprocess_new_headline"].tolist()
test_n_word = test_df["preprocess_edit"].tolist()
test_labels_l = test_df["meanGrade"].tolist()

In [None]:
# Initialize tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
# get encoded inputs; each example is the original headline paired with the new headline
train_encoded_inputs = tokenizer(train_o_headlines, train_n_headlines, padding='max_length', max_length=70, truncation=True, return_tensors="pt")
valid_encoded_inputs = tokenizer(valid_o_headlines, valid_n_headlines, padding='max_length', max_length=70, truncation=True, return_tensors="pt")
test_encoded_inputs = tokenizer(test_o_headlines, test_n_headlines, padding='max_length', max_length=70, truncation=True, return_tensors="pt")

In [None]:
# get input_id, attention_mask, token_type_id and labels(MeanGrades)
train_input_ids = train_encoded_inputs['input_ids']
train_attention_mask = train_encoded_inputs['attention_mask']
train_token_type_ids = train_encoded_inputs['token_type_ids']
train_labels = torch.Tensor(train_labels_l)

valid_input_ids = valid_encoded_inputs['input_ids']
valid_attention_mask = valid_encoded_inputs['attention_mask']
valid_token_type_ids = valid_encoded_inputs['token_type_ids']
valid_labels = torch.Tensor(valid_labels_l)

test_input_ids = test_encoded_inputs['input_ids']
test_attention_mask = test_encoded_inputs['attention_mask']
test_token_type_ids = test_encoded_inputs['token_type_ids']
test_labels = torch.Tensor(test_labels_l)

print(tokenizer.decode(train_input_ids.tolist()[0]))

[CLS] france is ‘ hunting down its citizens who joined isis ’ without trial in iraq [SEP] france is ‘ hunting down its citizens who joined twins ’ without trial in iraq [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
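The decoded example shows the sentence-pair layout used for BERT inputs: `[CLS] A [SEP] B [SEP]` followed by padding. The sketch below rebuilds that layout by hand for a toy pair to show how `token_type_ids` and `attention_mask` line up with the tokens. It works on whitespace tokens and not on WordPiece IDs, it does not handle truncation, and `encode_pair` is a hypothetical helper, not a `transformers` function.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Lay out a sentence pair BERT-style and pad it to max_length."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens are attended to (1); padding added below is masked out (0).
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length; truncation is not handled here")
    tokens = tokens + ["[PAD]"] * n_pad
    token_type_ids = token_type_ids + [0] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = encode_pair(
    "who joined isis".split(), "who joined twins".split(), max_length=10
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

Feeding both headlines as one pair lets the model compare the original and edited wording directly through attention across the two segments.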


In [None]:
# Get Dataset
batch = 16
train_dataset = Task1Dataset(train_input_ids, train_attention_mask, train_token_type_ids, train_labels)
valid_dataset = Task1Dataset(valid_input_ids, valid_attention_mask, valid_token_type_ids, valid_labels)
test_dataset = Task1Dataset(test_input_ids, test_attention_mask, test_token_type_ids, test_labels)

train_dataloader = DataLoader(train_dataset, batch_size=batch, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch, shuffle=True)

In [None]:
# Number of epochs
epochs = 1

In [None]:
# Training
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",num_labels = 1,output_attentions = False,output_hidden_states = False)
model.to(device)

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr = 3e-5, eps = 1e-8)

from transformers import get_linear_schedule_with_warmup

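# NOTE: num_warmup_steps is a number of optimizer steps, so 0.1 gives effectively no warm-up,
# and the schedule length assumes 8 epochs although `epochs` is set to 1 above.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0.1, num_training_steps = len(train_dataloader) * 8)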

train(train_dataloader, valid_dataloader, model, epochs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Training model.
| Epoch: 01 | Train Loss: 0.35 | Train MSE: 0.35 | Train RMSE: 0.5912 |         Val. Loss: 0.28 | Val. MSE: 0.28 |  Val. RMSE: 0.5256 |
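The learning-rate multiplier of a linear warm-up and decay schedule can be written out by hand. The function below is a plain-Python sketch of the rule we assume `get_linear_schedule_with_warmup` follows, not the library code itself; `linear_schedule_factor` and the step counts are hypothetical. It suggests that a warm-up value of `0.1` steps behaves like no warm-up at all, whereas an integer step count ramps up gradually.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warm-up.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total = 1000  # hypothetical number of optimizer steps

# Warm-up of 0.1 "steps": only step 0 is affected, then decay starts immediately.
print([round(linear_schedule_factor(s, 0.1, total), 3) for s in (0, 1, 2)])

# Warm-up of 100 steps (10% of the total): gradual ramp, then decay.
print([round(linear_schedule_factor(s, 100, total), 3) for s in (0, 1, 50, 100, 550, 1000)])
```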


In [None]:
# Test RMSE LOSS
model.eval()

test_input_ids = test_input_ids.to(device)
test_attention_mask = test_attention_mask.to(device)
test_token_type_ids = test_token_type_ids.to(device)
test_labels = test_labels.to(device)

with torch.no_grad():
  test_predictions = model(test_input_ids,
                           test_attention_mask,
                           test_token_type_ids)[0].squeeze(1)
  test_loss = torch.sqrt(((test_predictions - test_labels)**2).mean()).item()

print(f'| Test Loss: {test_loss:.5f} |')

| Test Loss: 0.53059 |


#### Approach 2: No pre-trained representations

##### Method 1: Word2Vec+SpaCy+Regression

In [None]:
!pip install torch spacy



In [None]:
import numpy as np
import pandas as pd
import nltk
import spacy
import gensim.models.keyedvectors as word2vec
import re

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

import gensim
import logging
import multiprocessing

from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load data
train_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train.csv')
valid_df = pd.read_csv('/content/drive/MyDrive/data/task-1/dev.csv')
test_df = pd.read_csv('/content/drive/MyDrive/data/task-1/test.csv')
# train_extra = pd.read_csv('/content/drive/MyDrive/data/task-1/train_funlines.csv')
# train_df = pd.concat([train_df,train_extra], ignore_index=True)

In [None]:
# get stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
## preprocess original headlines
# we expand common contractions, remove every token that is not purely alphabetic
# (which also drops punctuation), and lower-case title-case headlines except for
# their first letter, so that proper nouns in other headlines keep their capitals.

STOP_WORDS = set(nltk.corpus.stopwords.words('english'))

def preprocess(text):
    # text = text.lower()
    text = text.strip()
    text = text.replace("<", "").replace("/>", "")
    text = text.replace("’", "'")
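    # expand contractions with a leading space so the parts stay separate words ("it's" -> "it is")
    text = text.replace("'s", " is").replace("'ve", " have").replace("'m", " am").replace("'re", " are").replace("n't"," not")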
    # tokens = word_tokenize(text)
    # tagged_sent = pos_tag(tokens)
    # wnl = WordNetLemmatizer()
    # lemmas_sent = []
    # for tag in tagged_sent:
    #     wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    #     lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))
    # text = " ".join(lemmas_sent)
    for w in text.split(" "):
      if not w.isalpha():
        text = text.replace(w, "")
    text = " ".join(text.split())
    if all([w[0].isupper() for w in text.split(" ") if w not in STOP_WORDS]):
      text = text.lower()
      text = text[0].upper() + text[1:]
    
    return text

train_df["preprocess_headline"] = train_df["original"].progress_apply(preprocess)
valid_df["preprocess_headline"] = valid_df["original"].progress_apply(preprocess)
test_df["preprocess_headline"] = test_df["original"].progress_apply(preprocess)


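As a rough illustration of the cleaning steps in `preprocess`, the sketch below applies the same apostrophe normalisation, contraction replacement and non-alphabetic filtering to a made-up headline. The helper name `clean_headline` and the sample text are assumptions for illustration only, and the stop-word-aware case handling is left out so that the example needs no NLTK download.

```python
def clean_headline(text):
    """Simplified, hypothetical version of the cleaning done in `preprocess`."""
    # Drop the edit markers and normalise curly apostrophes.
    text = text.strip().replace("<", "").replace("/>", "")
    text = text.replace("’", "'")
    # Same literal replacements as the notebook: no space is inserted,
    # so "isn't" becomes "isnot".
    for short, full in [("'s", "is"), ("'ve", "have"), ("'m", "am"),
                        ("'re", "are"), ("n't", "not")]:
        text = text.replace(short, full)
    # Remove every whitespace-separated token that is not purely alphabetic.
    for w in text.split(" "):
        if not w.isalpha():
            text = text.replace(w, "")
    # Collapse repeated whitespace.
    return " ".join(text.split())


sample = "Senate isn't voting on the 2018 <budget/> , aides say"  # made-up headline
cleaned = clean_headline(sample)
print(cleaned)  # Senate isnot voting on the budget aides say
```

Because the contraction replacements insert no space, words such as "isn't" end up as a single token ("isnot") in the Word2Vec training text.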

In [None]:
# Preprocess the edited headlines: substitute the edit word for the marked
# word, then apply the same cleaning as `preprocess`.
import re
def new_head(text, new_word):
    # Replace the `<word/>` marker with the edit word.
    p = re.compile(r'<(.*?)/>')
    text = p.sub(new_word, text)
    text = text.replace("’", "'")
    # Literal replacements with no space inserted, e.g. "isn't" -> "isnot".
    text = text.replace("'s", "is").replace("'ve", "have").replace("'m", "am").replace("'re", "are").replace("n't","not")
    # Remove every occurrence of any token that is not purely alphabetic.
    for w in text.split(" "):
        if not w.isalpha():
            text = text.replace(w, "")
    text = " ".join(text.split())
    # Title-case headlines keep only their first letter in upper case.
    if text and all(w[0].isupper() for w in text.split() if w not in STOP_WORDS):
        text = text.lower()
        text = text[0].upper() + text[1:]
    return text

train_df["new_headline"] = train_df.progress_apply(lambda row: new_head(row['original'], row['edit']), axis=1)
valid_df["new_headline"] = valid_df.progress_apply(lambda row: new_head(row['original'], row['edit']), axis=1)
test_df["new_headline"] = test_df.progress_apply(lambda row: new_head(row['original'], row['edit']), axis=1)


In [None]:
# preprocess new word
def preprocess_new_word(text):
    text = text.lower()
    text = text.strip()
    text = " ".join(text.split())
    return text

train_df["preprocess_edit"] = train_df["edit"].progress_apply(preprocess_new_word)
valid_df["preprocess_edit"] = valid_df["edit"].progress_apply(preprocess_new_word)
test_df["preprocess_edit"] = test_df["edit"].progress_apply(preprocess_new_word)


In [None]:
# Get original word
def get_original_word(headline):
    start = "<"
    end = "/>"
    original_word = headline[(headline.index(start)+len(start)):headline.index(end)].strip().lower()
    return original_word

train_df["original_word"] = train_df["original"].progress_apply(get_original_word)
valid_df["original_word"] = valid_df["original"].progress_apply(get_original_word)
test_df["original_word"] = test_df["original"].progress_apply(get_original_word)


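Both `new_head` and `get_original_word` rely on the `<word/>` marker in the `original` column. The sketch below shows the two operations on a made-up headline with a single regular expression. The helper names `apply_edit` and `marked_word` are hypothetical, and the substitution uses a function replacement (so backslashes in the edit word stay literal), which differs slightly from the direct string replacement used in the notebook.

```python
import re

# Pattern for the `<word/>` edit marker used in the headlines.
MARKER = re.compile(r"<(.*?)/>")


def apply_edit(headline, new_word):
    """Replace the marked word with the edit word (hypothetical helper)."""
    # A function replacement keeps any backslashes in `new_word` literal.
    return MARKER.sub(lambda match: new_word, headline)


def marked_word(headline):
    """Return the word inside the marker, stripped and lower-cased."""
    return MARKER.search(headline).group(1).strip().lower()


headline = "Lawmakers debate new <Tax/> plan"  # made-up example
print(apply_edit(headline, "pizza"))  # Lawmakers debate new pizza plan
print(marked_word(headline))          # tax
```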

In [None]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp37-none-any.whl size=98051305 sha256=228817a03969c987c4c964b0c3a59fb32140e855b111b8435dd2b2c63750559c
  Stored in directory: /tmp/pip-ephem-wheel-cache-beb_704_/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
# Run spaCy over the preprocessed headlines, merging each named entity into a single token
import en_core_web_md
class EntityRetokenizeComponent:
    def __init__(self, pipeline):
        pass
    
    def __call__(self, doc):
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end], attrs={"LEMMA": str(doc[ent.start:ent.end])})
        return doc

spacy_pipeline = en_core_web_md.load()
retokenizer = EntityRetokenizeComponent(spacy_pipeline) 
spacy_pipeline.add_pipe(retokenizer, name='merge_entities', last=True)

train_df["headline_spacy"] = train_df["preprocess_headline"].progress_apply(spacy_pipeline)
valid_df["headline_spacy"] = valid_df["preprocess_headline"].progress_apply(spacy_pipeline)
test_df["headline_spacy"] = test_df["preprocess_headline"].progress_apply(spacy_pipeline)


In [None]:
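# Convert each sentence into a list of alphabetic tokens
def get_text(sentences):
    token_lists = []
    for sentence in sentences:
        # Replace every run of non-letters with a single space, then split
        sentence = re.sub(r"\s*[^A-Za-z]+\s*", " ", sentence)
        token_lists.append(sentence.split())
    return token_lists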
  
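To make the effect of this tokenisation concrete, here is a small self-contained sketch that applies the same regular expression to two made-up sentences. The function name `to_token_lists` is only an illustrative stand-in for `get_text`.

```python
import re


def to_token_lists(sentences):
    """Split each sentence into alphabetic tokens (hypothetical helper)."""
    token_lists = []
    for sentence in sentences:
        # Replace every run of non-letters with a single space, then split.
        letters_only = re.sub(r"\s*[^A-Za-z]+\s*", " ", sentence)
        token_lists.append(letters_only.split())
    return token_lists


examples = ["Senate passes tax-reform bill", "Budget cut by 20 percent"]
print(to_token_lists(examples))
# [['Senate', 'passes', 'tax', 'reform', 'bill'], ['Budget', 'cut', 'by', 'percent']]
```

Hyphenated words are split into their parts and digits are dropped, so the Word2Vec vocabulary only ever contains purely alphabetic tokens.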

In [None]:
# Train Word2Vec from scratch on the original and edited headlines of all three splits
all_headlines = []
for split_df in (train_df, valid_df, test_df):
    all_headlines += split_df["preprocess_headline"].tolist() + split_df["new_headline"].tolist()
text = get_text(all_headlines)
# gensim 3.x API: `size` was renamed to `vector_size` in gensim 4
model = gensim.models.Word2Vec(text, size=20, window=10, min_count=4, workers=multiprocessing.cpu_count())

In [None]:
# save model
model.wv.save_word2vec_format("/content/drive/MyDrive/data/task-1/word2vec_gensim_bin3",binary = True)

In [None]:
# get vocab
w2v_path = "/content/drive/MyDrive/data/task-1/word2vec_gensim_bin3"
w2v = word2vec.KeyedVectors.load_word2vec_format(w2v_path, binary=True)
vocab = set(w2v.vocab)

In [None]:
# Keep the non-stop-word tokens that have a Word2Vec vector
def tokenize(doc):
    tokens = []
    for word in doc:
        w = str(word)
        # Skip stop words
        if spacy_pipeline.vocab[word.text.lower()].is_stop:
            continue
        # Try the surface form, then the capitalised form, then lower case
        if w in vocab:
            tokens.append(w)
        else:
            capitalized_word = " ".join([x.capitalize() for x in w.split(" ")])
            if capitalized_word in vocab:
                tokens.append(capitalized_word)
            else:
                w = w.lower()
                if w in vocab:
                    tokens.append(w)

    return tokens

train_df["headline_tokens"] = train_df["headline_spacy"].progress_apply(tokenize)
valid_df["headline_tokens"] = valid_df["headline_spacy"].progress_apply(tokenize)
test_df["headline_tokens"] = test_df["headline_spacy"].progress_apply(tokenize)


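The vocabulary lookup inside `tokenize` tries three spellings of each token in turn. A minimal sketch of that fallback order, using a made-up vocabulary and a hypothetical helper `lookup`, could look like this:

```python
def lookup(word, vocab):
    """Return the vocabulary form of `word`, or None if no variant matches."""
    # 1. Exact surface form.
    if word in vocab:
        return word
    # 2. Each space-separated part capitalised, e.g. "new york" -> "New York".
    capitalized = " ".join(part.capitalize() for part in word.split(" "))
    if capitalized in vocab:
        return capitalized
    # 3. Fully lower-cased form.
    lowered = word.lower()
    return lowered if lowered in vocab else None


toy_vocab = {"Senate", "budget", "vote"}  # made-up vocabulary
results = [lookup(w, toy_vocab) for w in ["senate", "Budget", "VOTE", "pizza"]]
print(results)  # ['Senate', 'budget', 'vote', None]
```

Tokens for which no variant is found are simply dropped from the headline, which is why some headlines can end up with an empty token list.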

In [None]:
# get labels
y_train = train_df["meanGrade"]
y_valid = valid_df["meanGrade"]
y_test = test_df["meanGrade"]

In [None]:
# Mean of all in-vocabulary token vectors in the training headlines,
# used later as the fallback vector for out-of-vocabulary words
vecs = []
for tokens in train_df["headline_tokens"]:
    for token in tokens:
        if token in w2v.vocab:
            vecs.append(w2v[token])
avg_vec = np.nanmean(vecs, axis=0)

In [None]:
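# Build the feature vector of each example by concatenating:
#   1. the average vector of the headline tokens,
#   2. the vector of the original (replaced) word,
#   3. the vector of the edit word.
# Original and edit words without a vector fall back to avg_vec.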

def get_concat(df):
    feature1 = np.zeros((len(df), 20))
    feature2 = np.zeros((len(df), 20))
    feature3 = np.zeros((len(df), 20))
    
    for i, tokens in enumerate(df["headline_tokens"]):
        vecs = []
        for token in tokens:
            if token in w2v.vocab:
                vec = w2v[token]
                vecs.append(vec)
        if len(vecs) == 0:
            vecs.append(np.zeros(20))
        feature1[i,:] = np.mean(vecs, axis=0)
    
    for i, token in enumerate(df["original_word"]):
        if token in w2v.vocab:
            feature2[i,:] = w2v[token]
        else:
            feature2[i,:] = avg_vec
        
    for i, token in enumerate(df["preprocess_edit"]):
        if token in w2v.vocab:
            feature3[i,:] = w2v[token]
        else:
            feature3[i,:] = avg_vec
            
    return np.concatenate((feature1, feature2, feature3), axis=1)

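The feature layout built by `get_concat` can be checked on a toy example. The sketch below uses made-up 3-dimensional embeddings in place of the trained 20-dimensional Word2Vec vectors. The names `toy_vectors`, `fallback` and `featurize` are assumptions for illustration, and the fallback here is simply the mean of the toy vectors rather than the training-set average used in the notebook.

```python
import numpy as np

DIM = 3  # toy dimensionality; the notebook uses 20
# Made-up embeddings standing in for the trained Word2Vec vectors.
toy_vectors = {
    "senate": np.array([1.0, 0.0, 0.0]),
    "budget": np.array([0.0, 1.0, 0.0]),
    "vote": np.array([0.0, 0.0, 1.0]),
}
# Fallback for out-of-vocabulary words: the mean of all known vectors.
fallback = np.mean(list(toy_vectors.values()), axis=0)


def featurize(tokens, original_word, edit_word):
    """Concatenate [mean headline vector, original word, edit word]."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    # Headlines with no known token get a zero vector.
    headline_vec = np.mean(known, axis=0) if known else np.zeros(DIM)
    original_vec = toy_vectors.get(original_word, fallback)
    edit_vec = toy_vectors.get(edit_word, fallback)
    return np.concatenate((headline_vec, original_vec, edit_vec))


features = featurize(["senate", "vote"], "budget", "pizza")
print(features.shape)  # (9,)
print(features[:3])    # headline mean: [0.5 0.  0.5]
```

With 20-dimensional embeddings the same layout gives the 60 features per example that the regressors below are trained on.
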
In [None]:
# Build the feature matrices for the train, validation and test splits
x_train = get_concat(train_df)
x_valid = get_concat(valid_df)
x_test = get_concat(test_df)

In [None]:
# Linear regression
regressor = LinearRegression(normalize=True)
regressor.fit(x_train, y_train)
y_pred_v = regressor.predict(x_valid)
print(f"|Valid. RMSE error: {np.sqrt(mean_squared_error(y_valid, y_pred_v))}|")
y_pred_t = regressor.predict(x_test)
print(f"|Test RMSE error: {np.sqrt(mean_squared_error(y_test, y_pred_t))}|")

|Valid. RMSE error: 0.5715549311058523|
|Test RMSE error: 0.566636362185281|


In [None]:
# Ridge
regressor = Ridge(alpha=0.1, normalize = True, tol=1)
regressor.fit(x_train, y_train)
y_pred_v = regressor.predict(x_valid)
print(f"|Valid. RMSE error: {np.sqrt(mean_squared_error(y_valid, y_pred_v))}|")
y_pred_t = regressor.predict(x_test)
print(f"|Test RMSE error: {np.sqrt(mean_squared_error(y_test, y_pred_t))}|")


df_pred = pd.DataFrame({
    "id": test_df["id"],
    "pred": y_pred_t
})
df_pred.to_csv("task-1-output.csv", index=False)

|Valid. RMSE error: 0.5698993044642675|
|Test RMSE error: 0.5651532891228751|

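The scores printed above are root mean squared errors, i.e. the square root of the mean of the squared differences between the predicted and the annotated grades. As a small worked example with made-up numbers (the helper `rmse` is hypothetical and uses only the standard library):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


grades = [0.0, 1.0, 2.0, 3.0]       # made-up mean grades
predictions = [1.0, 1.0, 1.0, 1.0]  # a constant prediction
# Squared errors are 1, 0, 1, 4, so the MSE is 1.5 and the RMSE is sqrt(1.5).
score = rmse(grades, predictions)
print(round(score, 4))  # 1.2247
```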

##### Method 2: FeatureUnion feature extraction + Ridge regression

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse

In [None]:
# Proportion of training data for train compared to dev
train_proportion = 0.8

In [None]:
import spacy
from sklearn.decomposition import PCA, TruncatedSVD, SparsePCA
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
tokenizer = spacy.load("en_core_web_sm")

def spacy_tokenize(text):
    return [x.text for x in tokenizer(text)]

train_and_dev = train_df['edit']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit'], train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

count_vect = CountVectorizer(stop_words='english', tokenizer=spacy_tokenize)
train_counts = count_vect.fit_transform(training_data)

pca = SparsePCA(n_components=2)
svd = TruncatedSVD(n_components=2)
# svd = TruncatedSVD()
tfidf = TfidfTransformer()
# pca.fit(training_data)

combined_features = FeatureUnion([('svd', svd), ('tfidf', tfidf)])

x_features = combined_features.fit(train_counts, training_y).transform(train_counts)

regression_model = Ridge().fit(x_features, training_y)

# Train predictions
predicted_train = regression_model.predict(x_features)


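# Validate the model on dev.
# The TF-IDF transformer fitted here on the train + dev counts is not used further.
train_and_dev_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(train_and_dev_counts)

test_counts = count_vect.transform(dev_data)
# Note: this call refits the SVD and TF-IDF steps on the dev counts;
# `combined_features.transform(test_counts)` would reuse the training fit.
test_features = combined_features.fit(test_counts, dev_y).transform(test_counts)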

# Dev predictions
predicted = regression_model.predict(test_features)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)


Train performance:
| MSE: 0.17 | RMSE: 0.41 |

Dev performance:
| MSE: 0.32 | RMSE: 0.57 |

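A common pattern for this kind of pipeline is to fit every feature extractor on the training split only and then reuse the fitted state to transform the dev split, so that both matrices share the same columns. The sketch below illustrates that pattern with a plain bag-of-words in pure Python. The helper names and the sample edit words are made up, and it is not a substitute for `CountVectorizer` or `FeatureUnion`.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Fit step: map each training token to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Transform step: count known tokens per text, ignoring unseen tokens."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in text.lower().split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_texts = ["pizza", "tax cut", "pizza party"]  # made-up edit words
dev_texts = ["pizza tax", "unicorn"]

vocabulary = fit_vocabulary(train_texts)        # fitted on training data only
train_matrix = transform(train_texts, vocabulary)
dev_matrix = transform(dev_texts, vocabulary)   # same columns, no refit
print(vocabulary)  # {'cut': 0, 'party': 1, 'pizza': 2, 'tax': 3}
print(dev_matrix)  # [[0, 0, 1, 1], [0, 0, 0, 0]]
```

The dev text "unicorn" maps to an all-zero row because it was never seen during fitting, which is the behaviour a model trained on the training columns expects.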