The Essential Guide to Data Augmentation in NLP


Patrycja | Marketing Assistant at https://neptune.ai

There are many tasks in NLP, from text classification to question answering, but whatever you do, the amount of data you have to train your model heavily impacts model performance.

What can you do to make your dataset larger? The simple option: get more data. But acquiring and labeling additional observations can be an expensive and time-consuming process.

What can you do instead? Apply data augmentation to your text data.

Data augmentation techniques are used to generate additional, synthetic data from the data you already have. Augmentation methods are extremely popular in computer vision applications, but they are just as powerful for NLP.

In this article, we'll go through the major data augmentation methods for NLP that you can use to increase the size of your textual dataset and improve your model performance.

Data augmentation for computer vision vs NLP


In computer vision applications, data augmentation is done almost everywhere to get more training data and make the model generalize better. The main methods used involve:

* cropping
* flipping
* zooming
* rotation
* noise injection

In computer vision, these transformations are applied on the fly using data generators: as a batch of data is fed to your neural network, it is randomly transformed (augmented), so you don't need to prepare anything before training.

This isn't the case with NLP, where data augmentation should be done carefully due to the grammatical structure of the text. The methods discussed here are used before training: a new augmented dataset is generated beforehand and later fed into data loaders to train the model.
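For contrast, here is a minimal sketch of what on-the-fly augmentation looks like on the vision side, assuming PyTorch and torchvision are available; the dataset and parameter choices are illustrative, not part of this article's pipeline:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Random transformations applied every time a batch is drawn;
# no augmented copies of the dataset are ever stored on disk.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # cropping
    transforms.RandomHorizontalFlip(),      # flipping
    transforms.RandomRotation(15),          # rotation (degrees)
    transforms.ToTensor(),
])

# CIFAR-10 is used purely as an example dataset
train_set = datasets.CIFAR10("data/", train=True, download=True,
                             transform=train_transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```

Every epoch sees a differently transformed version of each image, which is exactly the "prepare nothing beforehand" property that text augmentation, as described below, does not share.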

Data Augmentation Methods


In this article, I will mainly focus on NLP data augmentation methods provided in a few open-source projects, starting with Easy Data Augmentation (EDA). So, let's dive into each of them.

Easy Data Augmentation


Easy Data Augmentation (EDA) uses traditional and very simple data augmentation methods. EDA consists of four simple operations that do a surprisingly good job of preventing overfitting and helping train more robust models.

Synonym Replacement: Randomly choose n words from the sentence that are not stop words, and replace each of them with one of its synonyms chosen at random. For example, given the sentence:

This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects n words (say two), for instance "article" and "techniques", and replaces them with "write-up" and "methods" respectively:

This write-up will focus on summarizing data augmentation methods in NLP.

Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word, and insert that synonym into a random position in the sentence. Do this n times. For example, given the sentence:

This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects n words (say two), for instance "article" and "techniques", finds their synonyms "write-up" and "methods" respectively, and inserts these synonyms at random positions in the sentence:

This article will focus on write-up summarizing data augmentation techniques in NLP methods.

Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times. For example, given the sentence:

This article will focus on summarizing data augmentation techniques in NLP.

The method randomly selects two words, say "article" and "techniques", and swaps them to create a new sentence:

This techniques will focus on summarizing data augmentation article in NLP.

Random Deletion: Randomly remove each word in the sentence with probability p. For example, given the sentence:

This article will focus on summarizing data augmentation techniques in NLP.

The method may remove, say, the words "will" and "techniques" from the sentence:

This article focus on summarizing data augmentation in NLP.

You can go to this repository if you want to apply these techniques to your projects.
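To give a feel for how simple these operations are, here is a minimal, self-contained sketch of three of them. This is an illustration under simplified assumptions (no stop-word filtering, WordNet as the synonym source), not the reference EDA implementation:

```python
import random
from nltk.corpus import wordnet  # needs a one-time nltk.download('wordnet')

def synonym_replacement(words, n=2):
    """Replace up to n randomly chosen words with a random WordNet synonym.
    (Stop-word filtering is omitted here for brevity.)"""
    words = words.copy()
    positions = list(range(len(words)))
    random.shuffle(positions)
    replaced = 0
    for i in positions:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
            replaced += 1
        if replaced == n:
            break
    return words

def random_swap(words, n=2):
    """Swap the positions of two randomly chosen words, n times."""
    words = words.copy()
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.2):
    """Remove each word independently with probability p."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]  # never return an empty sentence

sentence = "This article will focus on data augmentation techniques in NLP".split()
print(" ".join(synonym_replacement(sentence)))
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```

Random insertion follows the same pattern: draw a synonym as in `synonym_replacement`, then insert it at a random index instead of replacing the original word.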

Things to keep in mind while doing NLP Data Augmentation


As I said in the introduction, there are certain things we need to be careful of when doing augmentation in NLP. The main issue with training on augmented data is that, when it is done incorrectly, you heavily overfit to the augmented training data.

Some things to keep in mind:

* Do not validate using the augmented data.
* If you're doing k-fold cross-validation, always keep an original sample and its augmented versions in the same fold to avoid overfitting (see the sketch after this list).
* Always try different augmentation approaches and check which works better.
* A mix of different augmentation methods can also help, but don't overdo it.
* Experiment to determine the optimal number of samples to augment for the best results.
* Keep in mind that data augmentation in NLP does not always improve model performance.
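One way to enforce the fold rule above is to give every augmented sample the group id of the original sample it came from and split by group. A minimal sketch, assuming scikit-learn; the `source_id` column is a hypothetical bookkeeping field, not part of the competition data:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy frame: original rows plus augmented rows, where each augmented
# row carries the id of the original sample it was derived from.
df = pd.DataFrame({
    "text":      ["orig 0", "orig 1", "aug of 0", "aug of 1", "orig 2", "aug of 2"],
    "target":    [1, 0, 1, 0, 1, 1],
    "source_id": [0, 1, 0, 1, 2, 2],
})

# GroupKFold guarantees that rows sharing a source_id never straddle folds,
# so an augmented copy can never leak into the validation split.
gkf = GroupKFold(n_splits=3)
for train_idx, valid_idx in gkf.split(df, df["target"], groups=df["source_id"]):
    assert set(df.iloc[train_idx].source_id).isdisjoint(df.iloc[valid_idx].source_id)
```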

Data Augmentation workflow


In this section, we will try data augmentation on the Real or Not? NLP with Disaster Tweets competition hosted on Kaggle.

In one of my previous posts, I used the data from this competition to try different non-contextual embedding methods. Here, I will use the very same classification pipeline I used there, but I will add data augmentation to see if it improves the model performance.

First, let's load the training dataset and check the target class distribution:

```python
x = tweet.target.value_counts()
sns.barplot(x.index, x)
plt.gca().set_ylabel('samples')
```

We can see that there is a small class imbalance here. Let's generate some positive samples using the synonym replacement method.

Before data augmentation, we split the data into a train and a validation set, so that no samples in the validation set are used for data augmentation:

```python
train, valid = train_test_split(tweet, test_size=0.15)
```

Now we can augment the training dataset. I have chosen to generate 300 samples from the positive class:

```python
def augment_text(df, samples=300, pr=0.2):
    # aug_w2v is a word-level augmenter (see the sketch at the end of
    # this section); aug_p controls the proportion of words to replace
    aug_w2v.aug_p = pr
    new_text = []

    # select the minority (positive) class samples
    df_n = df[df.target == 1].reset_index(drop=True)

    # data augmentation loop
    for i in tqdm(np.random.randint(0, len(df_n), samples)):
        text = df_n.iloc[i]['text']
        augmented_text = aug_w2v.augment(text)
        new_text.append(augmented_text)

    # build a dataframe of augmented positives and shuffle it into the data
    new = pd.DataFrame({'text': new_text, 'target': 1})
    df = shuffle(pd.concat([df, new]).reset_index(drop=True))
    return df

train = augment_text(train)
```

We can now use this augmented text data to train the model. If you are interested in learning how to build the entire pipeline, from data preparation for NLP to training a classifier and running inference, you can check my other article.

So, did data augmentation with synonym replacement work?

| | Without Data Augmentation | With Data Augmentation |
|---|---|---|
| ROC AUC score | 0.775 | 0.785 |

With data augmentation, we got a good boost in model performance (AUC). Playing with different techniques and tuning the hyperparameters of the data augmentation methods can improve results even further, but I will leave it for now. If you'd like to do that, I prepared a notebook where you can play with things.
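The workflow above assumes an augmenter object `aug_w2v` already exists. As a sketch of how such an object might be created with the nlpaug library (which provides the `augment` interface used above), here are two word-level options; the embedding file path is illustrative and must point to a model you have downloaded:

```python
# a sketch, assuming nlpaug is installed (pip install nlpaug)
import nlpaug.augmenter.word as naw

# Option 1: simple WordNet-based synonym replacement
# (requires the nltk wordnet corpus to be downloaded)
aug_syn = naw.SynonymAug(aug_src='wordnet', aug_p=0.2)

# Option 2: embedding-based replacement, matching the aug_w2v name above;
# the path below is illustrative, not bundled with the library
aug_w2v = naw.WordEmbsAug(
    model_type='word2vec',
    model_path='GoogleNews-vectors-negative300.bin',
    action='substitute',
)

print(aug_syn.augment("This article will focus on data augmentation in NLP"))
```

The WordNet option is lighter and needs no pretrained embeddings; the word2vec option tends to produce more context-plausible substitutions at the cost of loading a large model.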

