Kaggle Solution: What’s Cooking ? (Text Mining Competition)

Updated in December 2023.


Tutorial on Text Mining, XGBoost and Ensemble Modeling in R

I came across the What’s Cooking competition on Kaggle last week. At first, I was intrigued by its name. When I checked it, I realized that the competition was about to finish. My bad! It was a text mining competition that ran for 103 days and ended on 20th December 2015.

Still, I decided to test my skills. I downloaded the data set, built a model and managed to get a score of 0.79817 in the end. Although my submission wasn’t accepted for the leaderboard because the competition was over, I could still check my score. It would have placed me in the top 20 percent.

I used Text Mining, XGBoost and Ensemble Modeling to get this score, and I did it all in R. It took me less than 6 hours to achieve this milestone. I teamed up with Rohit Hinduja, who is currently interning at Analytics Vidhya.

To help beginners in R, here is my solution in a tutorial format. In the article below, I’ve adopted a step-by-step methodology to explain the solution. This tutorial requires prior knowledge of R and Machine Learning.

I am confident that this tutorial can improve your R coding skills and approaches.

Let’s get started.

Before you start…

Here’s a quick approach for beginners to put up a tough fight in any Kaggle competition:

What’s Cooking ?

Yeah! I could sense straight away that it was a text mining competition. The data set had a list of id, ingredients and cuisine. There were 20 types of cuisine in the data set. The participants were asked to predict the cuisine based on the available ingredients.

The ingredients were available in the form of a text list. That’s where text mining was used. Before reaching the modeling stage, I cleaned the text using pre-processing methods. Finally, with the available set of variables, I used an ensemble of XGBoost models.

Note: My system configuration is core i5 processor, 8GB RAM and 1TB Hard Disk. 


Below is my solution to this competition:

Step 1. Hypothesis Generation

Though many people don’t believe in it, this step does wonders when done intuitively. Hypothesis Generation can help you to think ‘out of data’. It also helps you understand the data and the relationships between the variables. It should ideally be done after you’ve looked at the problem statement (but not the data).

Before exploring the data, you must think carefully about the problem statement. Which features could influence your outcome variable? Think in these terms and write down your findings. I did the same. Below is the list of factors which I thought could help me in determining a cuisine:

Taste: Different cuisines are cooked to taste different. If you know the taste of the food, you can estimate the type of cuisine.

Smell: Smell can also help us determine a cuisine type.

Serving Type: We can identify the cuisine by looking at the way it is being served. What are the dips it is served with?

Hot or Cold: Some cuisines are served hot, while others are served cold.

Group of ingredients and spices: More precisely, after tasting a dish, we can figure out the cuisine from the mix of ingredients used. For example, you are unlikely to find pasta as an ingredient in any Indian cooking.

Liquid Served: Some cuisines are represented by the type of drinks served with food.

Location: The location of eating can be a factor in determining cuisine.

Duration of cooking: Some cuisines tend to have longer cooking cycles. Others might have more fast food style cooking.

Order of pouring ingredients: At times, the same set of ingredients is added in a different order in different cuisines.

Percentage of ingredients which are staple crops / animals in the country of the cuisine:  A lot of cooking historically has been developed based on the availability of the ingredients in the country. A high percentage here could be a good indicator.

Step 2. Download and Understand the Data Set

The data set contains a list of id, cuisine and ingredients, and is available in json format. The dependent variable is cuisine. The independent variable is ingredients. The train data set is used for creating the model. The test data set is used for checking the accuracy of the model. If you are still confused between the two, remember that the test data set does not have the dependent variable.

Since the data is available in text format, I was determined to quickly build a corpus of ingredients (next step). Here is a snapshot of the data set in json format:

Step 3. Basics of Text Mining

For this solution, I’ve used R (precisely R Studio 0.99.484) in a Windows environment.

Text Mining / Natural Language Processing helps computers understand text and derive useful information from it. Several brands use this technique to analyse customer sentiment on social media. It relies on a pre-defined set of commands to clean the data. Since text mining is often applied to sentiment data, the incoming text can be loosely structured, multilingual or poorly spelled.

Some of the commonly used techniques in text mining are:

Bag of Words : This technique creates a ‘bag’ or group of words by counting the number of times each word has appeared, and uses these counts as independent variables.

Deal with Punctuation : This can be tricky at times. Your tool (R or Python) would read ‘data mining’ and ‘data-mining’ as two different terms, although they mean the same thing. Hence, we should remove the punctuation too.

Remove Stopwords : Stopwords are words which add no value to the text. They don’t describe any sentiment. Examples are ‘i’, ‘me’, ‘myself’, ‘they’, ‘them’ and many more. Hence, we should remove such words too. In addition to stopwords, you may find other words which are repeated but add no value. Remove them as well.

Stemming or Lemmatization : This means bringing a word back to its root. It is generally used for words which are similar and differ only by tense. For example, ‘play’, ‘playing’ and ‘played’ can be stemmed into one word, ‘play’, since all three connote the same action.

I’ve used these techniques in my solution too.
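To make the techniques above concrete, here is a minimal base R sketch on two invented ingredient strings. The strings, the two-word stopword list and the variable names are assumptions made for illustration; stemming is left out because it needs an extra package.

```r
# Toy ingredient lists (invented for illustration, not taken from the data set)
recipes <- c("Olive-Oil, the Garlic and Tomatoes",
             "garlic, ginger and the soy sauce")
stop_words <- c("the", "and")                 # tiny hand-picked stopword list

clean <- tolower(recipes)                     # lowercase everything
clean <- gsub("[[:punct:]]", " ", clean)      # replace punctuation with spaces
tokens <- strsplit(trimws(clean), "\\s+")     # split each recipe into words
tokens <- lapply(tokens, function(w) w[!w %in% stop_words])  # drop stopwords

# Bag of words: one row per recipe, one column per word, cells are counts
vocab <- sort(unique(unlist(tokens)))
bag <- t(sapply(tokens, function(w) as.integer(table(factor(w, levels = vocab)))))
colnames(bag) <- vocab
rownames(bag) <- paste0("recipe", seq_along(recipes))
bag
```

Each column of `bag` plays the role of one independent variable, which is the same idea the document-term matrix implements at full scale later on.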

Step 4. Importing and Combining Data Set

Since the data set is in json format, I need a different set of libraries to perform this step. jsonlite offers an easy way to import json data in R. This is how I’ve done it:

1. Import Train and Test Data Set

setwd('D:/Kaggle/Cooking')
install.packages('jsonlite')
library(jsonlite)
train <- fromJSON("train.json")
test <- fromJSON("test.json")

2. Combine both train and test data sets. This will make our text cleaning process less painful. If I do not combine them, I’ll have to clean the train and test data sets separately, and this would take a lot of time.

But first I need to add the dependent variable to the test data set. The data can then be combined using the rbind (row-bind) function.

#add dependent variable
test$cuisine <- NA
#combine data set
combi <- rbind(train, test)

Step 5. Pre-Processing using tm package (Text Mining)

As explained above, here are the steps used to clean the list of ingredients. I’ve used tm package for text mining.

1. Create a Corpus of Ingredients (Text)

#load package (run install.packages('tm') first if it is not installed)
library(tm)
#create corpus
corpus <- Corpus(VectorSource(combi$ingredients))

2.  Convert text to lowercase

#content_transformer() keeps each entry a valid tm document after tolower
corpus <- tm_map(corpus, content_transformer(tolower))
corpus[[1]]

3. Remove Punctuation

corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]

4.  Remove Stopwords

corpus <- tm_map(corpus, removeWords, c(stopwords('english')))
corpus[[1]]

5. Remove Whitespaces

corpus <- tm_map(corpus, stripWhitespace)
corpus[[1]]

6. Perform Stemming

corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

7. After we are done with pre-processing, the text needs to be converted into plain text documents so that tm can treat every entry as a text document in the next step.

corpus <- tm_map(corpus, PlainTextDocument)

8. For further processing, we’ll create a document matrix where the terms will be categorized in columns.

#document matrix
frequencies <- DocumentTermMatrix(corpus)
frequencies

Step 6. Data Exploration

1. Computing frequency column wise to get the ingredient with highest frequency

#organizing frequency of terms
freq <- colSums(as.matrix(frequencies))
length(freq)
ord <- order(freq)
ord
#if you wish to export the matrix (to see how it looks) to an excel file
m <- as.matrix(frequencies)
dim(m)
write.csv(m, file = 'matrix.csv')
#check most and least frequent words
freq[head(ord)]
freq[tail(ord)]
#check our table of 20 frequencies
head(table(freq),20)
tail(table(freq),20)

We see that there are many terms (ingredients) which occur only once, twice or thrice. Such ingredients won’t add any value to the model. However, we need to be careful about removing ingredients, as it might cause loss of data. Hence, I’ll remove only the terms having a frequency of less than 3.

#remove sparse terms
sparse <- removeSparseTerms(frequencies, 1 - 3/nrow(frequencies))
dim(sparse)
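The threshold arithmetic is easier to follow on a toy matrix. The sketch below is base R only and assumes, as the tm documentation describes, that removeSparseTerms keeps the terms whose share of empty documents is below the limit. The matrix values and the limit of 0.5 are invented for illustration.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns), invented values
dtm <- matrix(c(1, 0, 0,
                1, 1, 0,
                2, 0, 0,
                1, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("salt", "basil", "saffron")))

sparse_limit <- 0.5                     # plays the role of 1 - 3/nrow(frequencies)
doc_freq <- colSums(dtm > 0)            # number of documents containing each term
sparsity <- 1 - doc_freq / nrow(dtm)    # share of documents where the term is absent
kept <- dtm[, sparsity < sparse_limit, drop = FALSE]
colnames(kept)
```

Note that the rule counts the documents a term appears in, not its total number of occurrences. With a limit of 1 - 3/nrow(frequencies), a term therefore has to appear in more than about 3 recipes to survive under this assumption.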

2. Let’s visualize the data now. But first, we’ll create a data frame.

#create a data frame for visualization
wf <- data.frame(word = names(freq), freq = freq)
head(wf)
#plot terms which appear atleast 10,000 times
library(ggplot2)
#assumed base plot: terms with freq above 10,000 on the x axis, counts on the y axis
chart <- ggplot(subset(wf, freq > 10000), aes(x = word, y = freq))
chart <- chart + geom_bar(stat = 'identity', color = 'black', fill = 'white')
chart <- chart + theme(axis.text.x=element_text(angle=45, hjust=1))
chart

Here we see that salt, oil, pepper are among the highest occurring ingredients. You can change the freq values (in graph above) to visualize the frequency of ingredients.

3. We can also find the level of correlation between two ingredients. For example, if you have an ingredient in mind which might be highly correlated with others, we can check it. Here I am checking the correlation of salt and oil with the other variables. I’ve set the correlation limit to 0.30, which means I’ll only get the terms whose correlation is at least 0.30.

#find associated terms
findAssocs(frequencies, c('salt','oil'), corlimit=0.30)
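To see what such an association score measures, here is a small base R sketch. It assumes the score is the ordinary Pearson correlation between the count columns of two terms across documents; the toy matrix is invented for illustration.

```r
# Toy document-term counts: 5 recipes x 3 terms (invented values)
dtm <- matrix(c(1, 1, 0,
                1, 1, 0,
                0, 0, 1,
                1, 0, 1,
                0, 0, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("salt", "pepper", "sugar")))

# Correlation of the 'salt' column with every other term column
assoc <- cor(dtm[, "salt"], dtm[, c("pepper", "sugar")])[1, ]
assoc

# Keep only the terms at or above the correlation limit
assoc[assoc >= 0.30]
```

In this toy case pepper tends to appear in the same recipes as salt, so it passes the limit, while sugar does not.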

4. We can also create a word cloud to check the most frequent terms. It is easy to build and gives an enhanced understanding of ingredients in this data. For this, I’ve used the package ‘wordcloud’.

#create wordcloud
library(wordcloud)
set.seed(142)
#plot word cloud
wordcloud(names(freq), freq, min.freq = 2500, scale = c(6, .1), colors = brewer.pal(4, "BuPu"))

#plot 5000 most used words
wordcloud(names(freq), freq, max.words = 5000, scale = c(6, .1), colors = brewer.pal(6, 'Dark2'))

5. Now I’ll make final structural changes in the data.

#create sparse as data frame
newsparse <- as.data.frame(as.matrix(sparse))
dim(newsparse)
#check if all words are appropriate
colnames(newsparse) <- make.names(colnames(newsparse))
#check for the dominant dependent variable
table(train$cuisine)

Here I find that ‘italian’ is the most popular of all the cuisines available. Using this information, I’ve added the dependent variable ‘cuisine’ to the data frame newsparse, filling the test rows (which have no cuisine) with ‘italian’ as a placeholder.

#add cuisine
newsparse$cuisine <- as.factor(c(train$cuisine, rep('italian', nrow(test))))
#split data
mytrain <- newsparse[1:nrow(train),]
mytest <- newsparse[-(1:nrow(train)),]

Step 7. Model Building

As my first attempt, I couldn’t think of any algorithm better than naive bayes. Since I have a multi-class categorical variable, I expected naive bayes to do wonders. But, to my surprise, the naive bayes model ran on and on without finishing. Perhaps my machine specifications aren’t powerful enough.

Next, I tried Boosting. Thankfully, the model computed without any trouble. Boosting is a technique which converts weak learners into strong learners. In simple terms, I built three XGBoost models. All three were weak, meaning their accuracy wasn’t good. I combined (ensembled) the predictions of the three models to produce a strong model. To know more about boosting, you can refer to this introduction.

The reason I used boosting is that it works great on sparse matrices. Since I have a sparse matrix here, I expected it to give good results. A sparse matrix is a matrix which has a large number of zeroes in it. Its opposite is a dense matrix, which has very few zeroes. XGBoost, in particular, delivers exceptional results on sparse matrices.
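The difference between the two storage forms can be checked with a short sketch. It uses the Matrix package and an invented 1000 x 200 matrix with 2,000 non-zero cells; the sizes are assumptions for illustration and have nothing to do with the real data.

```r
library(Matrix)

set.seed(1)
# Invented matrix: 1000 rows x 200 columns, all zeroes to begin with
dense <- matrix(0, nrow = 1000, ncol = 200)
# Switch on 2000 randomly chosen cells (sample() picks distinct positions)
dense[sample(length(dense), 2000)] <- 1

sparsity <- mean(dense == 0)            # share of cells that are zero
sparse <- Matrix(dense, sparse = TRUE)  # stores only the non-zero cells
nnz <- nnzero(sparse)                   # number of stored non-zero cells

sparsity
nnz
object.size(dense)
object.size(sparse)
```

Because the sparse form only stores the non-zero cells and their positions, it takes far less memory here than the dense form, which is what makes wide ingredient matrices manageable.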

I did parameter tuning on the XGBoost models to ensure that every model behaves in a different way. To read more on XGBoost, here’s the comprehensive documentation: XGBoost

Below is my complete code. I’ve used the packages xgboost and Matrix. The package ‘Matrix’ is used to create a sparse matrix quickly.

library(xgboost)
library(Matrix)

Now, I’ve created a sparse matrix of the train data set using xgb.DMatrix. I’ve kept the set of independent variables and removed the dependent variable.

# creating the matrix for training the model
ctrain <- xgb.DMatrix(Matrix(data.matrix(mytrain[,!colnames(mytrain) %in% c('cuisine')])), label = as.numeric(mytrain$cuisine)-1)
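The `as.numeric(...) - 1` in the label is there because XGBoost expects multiclass labels numbered from 0 up to num_class - 1, while R factors are numbered from 1. A minimal base R sketch of the round trip, using an invented four-row factor and made-up predictions:

```r
# Invented cuisine labels; factor levels are sorted alphabetically
cuisine <- factor(c("italian", "greek", "indian", "greek"))
levels(cuisine)

# Encode: factor codes start at 1, so subtract 1 to start at 0
labels <- as.numeric(cuisine) - 1
labels

# Decode: made-up 0-based predictions mapped back to cuisine names
predicted <- c(2, 0, 1, 0)
decoded <- levels(cuisine)[predicted + 1]
decoded
```

The same `levels(...)[prediction + 1]` lookup is what turns the numeric model output back into cuisine names in the prediction code further down.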

I’ve created a sparse matrix for the test data set too. This is done to create a watchlist, which is a list of the sparse forms of the train and test data sets. It is passed as a parameter to the xgboost model to report train and test error as the model runs.

dtest <- xgb.DMatrix(Matrix(data.matrix(mytest[,!colnames(mytest) %in% c('cuisine')])))
watchlist <- list(train = ctrain, test = dtest)

To understand the modeling part, I suggest you read this document. I’ve built just 3 models with different parameters. You can even create 40 – 50 models for ensembling. In the code below, I’ve used objective = "multi:softmax" because this is a multiclass classification problem.

Among other parameters, eta, min_child_weight, max.depth and gamma directly control the model complexity and help prevent the model from overfitting. Larger values of min_child_weight and gamma, and smaller values of eta and max.depth, make the model more conservative.

#train multiclass model using softmax
#first model
xgbmodel <- xgboost(data = ctrain, max.depth = 25, eta = 0.3, nrounds = 200, objective = "multi:softmax", num_class = 20, verbose = 1, watchlist = watchlist)
#second model
xgbmodel2 <- xgboost(data = ctrain, max.depth = 20, eta = 0.2, nrounds = 250, objective = "multi:softmax", num_class = 20, watchlist = watchlist)
#third model
xgbmodel3 <- xgboost(data = ctrain, max.depth = 25, gamma = 2, min_child_weight = 2, eta = 0.1, nrounds = 250, objective = "multi:softmax", num_class = 20, verbose = 2, watchlist = watchlist)

#predict 1
xgbmodel.predict <- predict(xgbmodel, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict.text <- levels(mytrain$cuisine)[xgbmodel.predict + 1]
#predict 2
xgbmodel.predict2 <- predict(xgbmodel2, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict2.text <- levels(mytrain$cuisine)[xgbmodel.predict2 + 1]
#predict 3
xgbmodel.predict3 <- predict(xgbmodel3, newdata = data.matrix(mytest[, !colnames(mytest) %in% c('cuisine')]))
xgbmodel.predict3.text <- levels(mytrain$cuisine)[xgbmodel.predict3 + 1]

#data.table is needed for the keyed tables below
library(data.table)
#data frame for predict 1 (columns assumed from the names set below: test id and predicted cuisine)
submit_match1 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict.text))
colnames(submit_match1) <- c('id','cuisine')
submit_match1 <- data.table(submit_match1, key = 'id')
#data frame for predict 2
submit_match2 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict2.text))
colnames(submit_match2) <- c('id','cuisine')
submit_match2 <- data.table(submit_match2, key = 'id')
#data frame for predict 3
submit_match3 <- cbind(as.data.frame(test$id), as.data.frame(xgbmodel.predict3.text))
colnames(submit_match3) <- c('id','cuisine')
submit_match3 <- data.table(submit_match3, key = 'id')

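Now I have three weak learners. You can check their accuracy using the commands below. Keep in mind that mytest$cuisine holds only the ‘italian’ placeholder, so the result is a true accuracy only when the comparison is made on rows whose real cuisine is known (for example, a held-out part of mytrain).

#share of predictions that match the label column
mean(xgbmodel.predict.text == mytest$cuisine)
mean(xgbmodel.predict2.text == mytest$cuisine)
mean(xgbmodel.predict3.text == mytest$cuisine)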
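The accuracy calculation itself is simple once true labels are available. Here is a minimal base R sketch on invented vectors, standing in for the true and predicted cuisines of a held-out slice of labelled rows; the values and variable names are assumptions for illustration.

```r
# Hypothetical true and predicted cuisines for five held-out recipes
actual    <- c("greek", "indian", "italian", "italian", "greek")
predicted <- c("greek", "italian", "italian", "italian", "indian")

# Accuracy as the share of exact matches
accuracy <- mean(predicted == actual)
accuracy

# The same number from a confusion table: correct predictions sit on the diagonal.
# This only works when rows and columns list the same classes in the same order.
confusion <- table(actual = actual, predicted = predicted)
confusion
sum(diag(confusion)) / length(actual)
```

With the real data this would mean training on part of mytrain, for example 80% of its rows, and predicting the remaining rows, whose cuisines are known.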

The simple key is ensembling. Now, I have three data frames, one each for predict, predict2 and predict3. I’ve copied the ‘cuisine’ column from the first two into the third. With this step, I get all the predicted cuisines in one data frame, so I can easily ensemble the predictions.

#ensembling
submit_match3$cuisine2 <- submit_match2$cuisine
submit_match3$cuisine1 <- submit_match1$cuisine

I’ve used a Mode function to extract the predicted value with the highest frequency per id.

#function to find the most frequent value row wise
Mode <- function(x) {
  u <- unique(x)
  u[which.max(tabulate(match(x, u)))]
}
#vote over the three prediction columns only (leaving out id, which would otherwise take part in the vote)
votes <- as.data.frame(submit_match3)[, c("cuisine","cuisine2","cuisine1")]
y <- apply(votes, 1, Mode)
final_submit <- data.frame(id = submit_match3$id, cuisine = y)
#view submission file
data.table(final_submit)
#final submission
write.csv(final_submit, 'ensemble.csv', row.names = FALSE)
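The voting step can be tried in isolation on a toy table. The sketch below reuses the same Mode function on three invented rows; the ids, the predictions and the column names model1 to model3 are hypothetical.

```r
# Same helper as above: returns the most frequent value, first one wins a tie
Mode <- function(x) {
  u <- unique(x)
  u[which.max(tabulate(match(x, u)))]
}

# Invented predictions from three models for three recipes
votes <- data.frame(id = c(11, 12, 13),
                    model1 = c("greek", "thai", "indian"),
                    model2 = c("greek", "indian", "french"),
                    model3 = c("italian", "indian", "thai"),
                    stringsAsFactors = FALSE)

# Vote across the model columns only; the id column must not take part
vote_cols <- c("model1", "model2", "model3")
votes$ensemble <- as.vector(apply(votes[, vote_cols], 1, Mode))
votes
```

Recipe 13 shows the tie-breaking behaviour: when all three models disagree, the first column listed wins, so the order of the columns decides which model is trusted most.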

After following the steps mentioned above, you can easily get the same score as mine (0.798). As you have seen, I haven’t used any brainy method to improve this model. I just applied my basics. Since I’ve just started, I would like to see if I can push this further to the highest level now.

End Notes

With this, I finish this tutorial for now! There are many things in this data set which you can try at your end. Due to time constraints, I couldn’t spend much time on it during the competition. But it’s time you put on your thinking caps. I failed at Naive Bayes. So, why don’t you create an ensemble of naive bayes models? Or maybe create clusters of ingredients and build a model over them?

These strategies might give you a better score, and perhaps more knowledge. In this tutorial, I’ve built a predictive model on the What’s Cooking ? data set hosted by Kaggle. I took a step-wise approach to cover the various stages of model building. I used text mining and an ensemble of 3 XGBoost models. XGBoost in itself is a deep topic. I plan to cover it in depth in my forthcoming articles. I’d suggest you practice and learn.

Did you find the article useful? Share with us if you have done similar kind of analysis before. Do let us know your thoughts about this article in the box below.




How Does Text Mining Tell A Lot About Brand Image?

This article was published as a part of the Data Science Blogathon.


A vast amount of textual data is generated daily through posts, likes, and tweets on social networking sites such as Facebook, Instagram, and Twitter. This data contains a lot of information we can harness to generate insights. Still, most of this data is unstructured and is not ready for statistical analysis. The importance of managing unstructured data for business benefit becomes clear when we consider that around 80% of business data is unstructured. With the exponential growth of social media, its share will keep increasing with time.

This vast amount of information can help us create valuable insights, but this information is highly unstructured and needs to be processed for analysis. This article looks at the results of a use case created using data extracted from Twitter to generate insights about a brand’s image after a scandal about the brand became public.

Analyzing this unstructured text data can help marketers in customer experience management, brand monitoring, etc., by transforming a massive amount of unstructured customer feedback into actionable insights. One common issue while analyzing this enormous amount of free-form text is that no human can read it in a reasonable amount of time. In this case, text mining is the answer to dealing with unstructured data and unlocking the value of customer feedback. This article investigates, as a use case, how Tweets made on a topic unlocked valuable insights.

Text Mining Application


One of the most significant controversies that recently rocked the whole car industry occurred when Volkswagen (VW) cheated on pollution emissions tests in the US. The VW scandal has raised the eyebrows of customers worldwide. Dubbed the “diesel dupe”, the German car giant has admitted to cheating emissions tests in the US. According to the Environmental Protection Agency (EPA), some cars sold in America had devices in diesel engines that could detect when they were tested, changing the performance accordingly to improve results.

The EPA’s findings cover 482,000 cars in the US only. But VW has admitted that about 11 million vehicles worldwide are fitted with the so-called “defeat device”. Under such circumstances, it is interesting to analyze customers’ Tweets to see what they are saying about the company. To create this use case, tweets were extracted using the search criterion “Volkswagen” just after the Volkswagen (VW) emission scandal became public. The aim of analyzing the tweets related to VW was to understand the current perception of the consumer about VW and its cars in light of the scandal.



The approach taken is broadly divided into three steps, as depicted in Figure 1.

Figure 1. Steps in Text Analytics

In the first step, the data is extracted from Twitter using the search criterion “Volkswagen”. It involves creating a Twitter application using the Developer section of Twitter and writing code in Python or R to use a credential object to establish a secure connection and extract Tweets on the desired topic. For example, R libraries, “twitteR” and “ROAuth”, can be used to extract and store the raw data in a Comma Separated Values (CSV) file. The following code in R shows how the tweets were extracted.

Next, we need to remove line breaks and collapse all the lines into one long string using the “paste” function. The string, stored in a vector object, is converted to lowercase. To clean the text, we must also remove blank spaces, usernames, punctuation, and stop words. Finally, we split the string on the regular expression “\W”, which matches non-word characters, to obtain a list of words from the Tweets. After obtaining the list of words, we are ready to analyze the data. To begin the analysis, we calculate the number of unique words. Then we build a table of word types and their corresponding frequencies. The following code is used to perform these steps.
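A minimal base R sketch of these steps is given here. The three tweets and the short stop word list are invented for illustration, and removing stop words is an assumption about the cleaning step; the sketch also splits on “\\W+” (one or more non-word characters) so that runs of spaces do not produce empty words.

```r
# Invented tweets standing in for the extracted data
tweets <- c("@autofan Volkswagen CHEATED on emissions tests!",
            "Emissions scandal: the deception is annoying",
            "volkswagen emissions recall announced")
stop_words <- c("the", "is", "on")       # tiny hand-picked stop word list

text <- paste(tweets, collapse = " ")    # join everything into one long string
text <- tolower(text)                    # convert to lowercase
text <- gsub("@\\w+", " ", text)         # remove usernames
text <- gsub("[[:punct:]]", " ", text)   # remove punctuation

words <- unlist(strsplit(text, "\\W+"))  # split on non-word characters
words <- words[nchar(words) > 0 & !words %in% stop_words]

n_unique <- length(unique(words))                # number of unique words
freq <- sort(table(words), decreasing = TRUE)    # word types and their frequencies
n_unique
head(freq, 3)
```

The resulting frequency table is exactly the kind of input a word cloud function takes: the word names and their counts.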

Finally, we created a corpus of frequent words and generated the following word cloud.

This “Word Cloud”, generated from the Tweets, helps us visually represent the words that appear more frequently and aids in understanding their prominence in the text analyzed.


We faced some challenges in this exercise. The information extracted from Twitter is highly unstructured, so the data needs to be preprocessed and cleaned before techniques of statistical analysis can be applied. In addition, the amount of data that can be extracted from Twitter and processed for further analysis is limited due to restrictions.

Text Mining Analysis

In the use case, our journey started with two thousand Tweets, accounting for 27,157 words. But, after preprocessing and data cleaning, we arrived at 2,919 unique words. We created a frequency table for these unique words along with the count of their occurrence, and finally, we generated a word cloud from it. When we looked at the frequently occurring words in the word cloud, we found:

“cheated, deception, annoying, emissions, deaths”

The prevalence of words indicating negative sentiment among the customers is not surprising, since the Tweets we analyzed were made just after one of the biggest scandals in the auto industry was exposed. At that point, sentiment towards the brand, and the industry in general, is expected to be negative. But, if we look closely, we find some more interesting words that lead to valuable insight into the topic. One such example of words from the word cloud is:

“elon, musk”

The auto industry, especially the diesel car makers, lost a considerable amount of credibility and power to influence customers and the government, as illustrated by the rise of electric car makers such as Tesla. While the Volkswagen emissions scandal has prompted investigations into other car brands in both Europe and the US, Tesla Motors CEO Elon Musk said customers might seriously consider that it is time to give up on fossil fuels and embrace new technology. From 2015, when this scandal happened, to date, we have indeed witnessed a phenomenal rise in Tesla, if not the electric car industry as a whole. Another such set of words is:

“cars, deception, cheated”

Germans are generally considered the best in engineering, and their cars are associated with performance, quality, and reliability. Still, as evident from our analysis of data from Twitter, the VW scandal dented the reputation of not only VW but also the German car manufacturers.


As this article shows through the use case, a text mining application using straightforward techniques can indicate a fundamental shift in the direction and pace at which the industry will move. It shows how, using text mining, we can measure customers’ perceptions about a company or its brand and why such a perception occurred. This method can be beneficial for marketers in analyzing the gap between brand identity and brand image. In this article, we used text analytics descriptively to understand the perception of customers about a particular company and its brands after an event has occurred, and the same exercise can be repeated for other brands and other events. Similarly, we can also use text analytics in a predictive manner to understand the future outcome of events.

Thus, we can safely conclude that the application of text analytics on social media content is the way for different business industries to understand consumer sentiment about a brand and decide their future course of action. And for this, the reader can start analyzing with simple techniques like generating a word cloud before leaping into the ocean of text mining, waiting to be explored further.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Difference Between Data Mining And Web Mining?

Data mining is the procedure of exploration and analysis of huge quantities of data to find meaningful patterns and rules. On the other hand, web mining defines the process of using data mining techniques to extract useful data patterns and trends from web-based records and services, server logs, and hyperlinks.

Read this article to learn more about Data Mining and Web Mining and how they are different from each other.

What is Data Mining?

Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through a large amount of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is the analysis of observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and beneficial to the data owner.

Data mining is the process of selection, exploration, and modeling of large quantities of data to discover regularities or relations that are at first unknown, in order to obtain clear and useful results for the owner of the database. It is the exploration and analysis, by automatic or semi-automatic means, of huge quantities of data to find meaningful patterns and rules.

Data Mining is similar to Data Science. It is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is carried out through software that may be simple or highly specialized.

By outsourcing data mining, the work can be completed more quickly and with lower operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are tons of information available on various platforms, but very little of it is accessible as knowledge.

The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or for company development. There are many powerful instruments and techniques available to mine data and find better insight from it.

What is Web Mining?

Web mining is the process of using data mining techniques to extract useful patterns, trends, and data from the web, drawing on web-based records and services, server logs, and hyperlinks. The main goal of web mining is to find patterns in web data by collecting and analyzing data to get important insights.

Web mining can broadly be viewed as the application of adapted data mining techniques to the internet, whereas data mining is the application of algorithms to find patterns in mostly structured data, embedded in a knowledge discovery process.

Web mining is distinctive in that it deals with multiple data types. The web has several aspects that yield multiple approaches for the mining process: web pages contain text, web pages are connected via hyperlinks, and user activity can be monitored via web server logs.

Difference between Data Mining and Web Mining

The following table highlights all the major differences between data mining and web mining −


| Data Mining | Web Mining |
|---|---|
| Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through a large amount of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. | Web mining is the process of using data mining techniques to extract useful patterns, trends, and data from the web, drawing on web-based records and services, server logs, and hyperlinks. |
| The main goal of data mining is the analysis of observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and beneficial to the data owner. | The main goal of web mining is to find patterns in web data by collecting and analyzing data to get important insights. |
| It is the process of selection, exploration, and modeling of large quantities of data to discover regularities or relations that are at first unknown, in order to obtain clear and useful results for the owner of the database. | Web mining can broadly be viewed as the application of adapted data mining techniques to the internet. |
| This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. | The web has several aspects that yield multiple approaches for the mining process: web pages contain text, web pages are connected via hyperlinks, and user activity can be monitored via web server logs. |
| Data mining uses business intelligence applications that draw on information to improve business activities. | Web mining uses data analytics to convert raw data into a meaningful format. |
| The target users of data mining include data engineers and data scientists. | The target users of web mining include data analysts or web analysts. |
| Data mining uses tools like machine learning. | Web mining uses tools like PageRank, Apache logs, Scrapy, etc. |
| Many organizations use data mining results in their decision-making process. | Web mining is significant for driving the existing data mining process. |


To conclude, the main objective of data mining is to analyze observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and beneficial to the data owner. In contrast, the primary objective of web mining is to find patterns in web data by collecting and analyzing data to get important insights.

Monopoly And A Monopolistic Competition


Companies usually become monopolies by gaining control of the entire supply chain, from production to sales, which is known as vertical integration. Alternatively, a business can buy up competing companies until it is the only significant player in the market. This is called horizontal integration. Let us look at monopoly and monopolistic competition in detail.

What is a Monopoly?

A monopoly is a type of market structure where one company has dominance in a sector or an industry. Monopolies are devoid of competition and hence they are not encouraged in a free market economy. Many governments enact laws to counter monopoly so that no single organization can control the market and exploit the customers using its dominant position in the market.

In a monopoly, competition among firms is absent because only one player is dominant. Moreover, no substitute products are available in the market. The sole player therefore has a free hand to set prices, which are no longer determined by demand and supply.

In a monopoly, the organization in the dominant position can go against the market forces to control the market. It can become a price maker and ignore the conditions of the market while determining the prices.

Types of Monopolies

Monopolies are divided into the following types −

Pure Monopoly

Natural Monopoly

A natural monopoly is derived from the unique abilities or attributes of the company. Companies that have unique research and development facilities, such as pharma companies, fall under this category.

Public Monopolies

Public monopolies provide essential services to the public and are controlled by the government. Energy companies and water supply companies are good examples of public monopolies.

What is a Monopolistic Competition?

The idea of monopolistic competition falls in between pure monopoly and perfect competition, and it has elements of both. In monopolistic competition, every player has some market power and can act as a price setter.

Competing Companies in Monopolistic Competition

In the case of monopolistic competition, there are usually many companies that vie for market share.

Examples include restaurants such as McDonald’s and Burger King, which offer similar products and compete for a similar market share.

Product Differentiation

Comparative actions

The action of one company in the case of monopolistic competition does not lead to a change in the strategy of other players. For example, a price change in McDonald’s burgers won’t necessarily affect Burger King’s burger price. This is in contrast to oligopoly where a price change by one player may start a price war.

Price setting

As mentioned above, the players in monopolistic competition are price setters and not price takers. However, product differentiation is necessary if a company wants to increase the price of its products.

Demand elasticity

The demand is highly elastic in the case of monopolistic competition. If the price of a product is increased, the consumers would often shift to a competing product because there are options in the hands of the consumers to select a similar product at a lower price.

Economic profit

A company in monopolistic competition can make excess economic profit in the short run but when we consider the long-term aspects, the economic profit comes to zero.

Advertising

Differences between Monopoly and Monopolistic Competition

The following differences exist between monopoly and monopolistic competition −

| Monopoly | Monopolistic Competition |
| --- | --- |
| Only one dominant player exists, and it controls the market. | Many players compete for market share. |
| The seller hardly faces any competition. | There is palpable competition. |
| Demand and supply are controlled by the dominant player. | Demand and supply depend on the competition between the players. |
| Entry and exit are difficult. | Barriers to entry and exit are low. |
| The existence of a sole player prevents consumers from exerting influence on product pricing. | Consumers may exert influence on product pricing. |
| Product variety is limited and depends on the sole player’s will. | Consumer pressure may lead the players to create new product variants. |
| Product predictability is higher. | Product predictability is lower, as there are many players. |


It can be seen that monopolies and monopolistic competition differ in many ways. Firms in monopolistic competition do hold some monopoly-like pricing power over their own differentiated products, but they still face rivals. Monopolies come in different types, yet their defining feature, a single player dominating the market, is common to all of them. Because monopolistic competition leaves room for competing firms and for consumer influence, it is considered a better option than pure monopoly.


Qns 1. Why is monopoly considered unfavorable for consumers?

Ans. Monopolies can use their sole existence in a market as a source to exploit the consumers by maximizing profit and offering low-quality products. That is why a monopoly is considered unfavorable for consumers.

Qns 2. What is price discrimination?

Ans. Price discrimination is the practice of charging different customers different prices for the same product. It is possible in a monopoly because the dominant player has the discretion to determine the price of a product.

Qns 3. Between monopoly and monopolistic competition, which is more common?

Ans. Monopolistic competition is more common than monopolies in the business world.

Multicollinearity: Problem, Detection And Solution

This article was published as a part of the Data Science Blogathon.

Multicollinearity causes the following 2 primary issues –

1. Multicollinearity generates high variance of the estimated coefficients and hence, the coefficient estimates corresponding to those interrelated explanatory variables will not be accurate in giving us the actual picture. They can become very sensitive to small changes in the model.

2. Consequently, the t-ratios for the individual slopes may be affected, leading to insignificant coefficients. It is also possible that the adjusted R-squared of a model is quite good and the overall F-test statistic is significant, yet some of the individual coefficients are statistically insignificant. This scenario can indicate the presence of multicollinearity, because multicollinearity affects the coefficients and their p-values but does not affect the goodness-of-fit statistics or the overall model significance.

How do we measure Multicollinearity?

A very simple test known as the VIF test is used to assess multicollinearity in our regression model. The variance inflation factor (VIF) identifies the strength of correlation among the predictors.

Now we may wonder why we need VIFs at all, rather than simply using the pairwise correlations.

Since multicollinearity is the correlation amongst the explanatory variables, it seems quite logical to use the pairwise correlations between all predictors in the model to assess the degree of correlation. However, suppose we have five predictors and none of the pairwise correlations is exceptionally high. It is still possible that three predictors together explain a very high proportion of the variance in a fourth predictor.
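The scenario described above can be reproduced with simulated data. The following is only an illustrative sketch, not part of the original analysis: it uses four made-up predictors instead of five, where `x4` is built (by assumption) as the sum of three independent predictors plus a little noise. No single pairwise correlation is very high, yet the three predictors jointly explain almost all of the variance in the fourth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Three independent predictors and a fourth that is (almost) their sum.
# These variables are synthetic and exist only for this illustration.
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x1 + x2 + x3 + rng.normal(scale=0.3, size=n)
X = np.column_stack([x1, x2, x3, x4])

# Largest absolute pairwise correlation, ignoring the diagonal.
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(4)))

# R-squared from regressing x4 on x1, x2, x3 (with an intercept).
A = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(A, x4, rcond=None)
resid = x4 - A @ coef
r_squared = 1 - resid.var() / x4.var()

print(f"largest pairwise correlation: {max_pairwise:.2f}")
print(f"R-squared of x4 on the other three: {r_squared:.2f}")
```

With this construction the pairwise correlations involving `x4` sit at a moderate level, while the joint R-squared is close to 1, which is exactly the situation a correlation matrix alone would miss.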

I know this sounds like a multiple regression model itself, and this is exactly what VIFs do. Of course, the original model has a dependent variable (Y), but we don’t need to worry about it while calculating multicollinearity. The formula of VIF is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Here $R_j^2$ is the R-squared of the model that regresses one individual predictor on all the other predictors. The subscript j indexes the predictors, and each predictor has one VIF. So, more precisely, VIFs use a multiple regression model to calculate the degree of multicollinearity. Suppose we have four predictors: X1, X2, X3, and X4. To calculate the VIFs, each independent variable in turn becomes the dependent variable and is regressed on the other three. Each of these models produces an R-squared value indicating the percentage of the variance in that individual predictor that the set of other predictors explains.
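The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the code used for the article: the helper name `vif` and the synthetic predictors are hypothetical, and each auxiliary regression includes an intercept.

```python
import numpy as np

def vif(X):
    """Return one VIF per column of X (rows are observations).

    Each column is regressed on all the other columns plus an intercept,
    and VIF_j = 1 / (1 - R_j^2) is computed from that regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: x3 is nearly a copy of x1; x2 and x4 are unrelated.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.2 * rng.normal(size=n)
x4 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3, x4]))
print(np.round(vifs, 2))
```

In this setup the two entangled predictors (`x1` and `x3`) receive large VIFs, while the two unrelated ones stay close to 1.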

The name “variance inflation factor” was coined because the VIF tells us the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 10 indicates that the existing multicollinearity is inflating the variance of the coefficient 10 times compared to a model with no multicollinearity. The variances in question are the squared standard errors of the coefficient estimates, which indicate the precision of these estimates. The standard errors are used to calculate the confidence intervals of the coefficient estimates.

Larger standard errors produce wider confidence intervals, and hence less precise coefficient estimates. With intervals that wide, an estimated coefficient may even come out with the opposite sign.
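The inflation can be written out explicitly. As a sketch, assuming an ordinary least squares model with an intercept and constant error variance (the symbols $\sigma^2$, $n$, and $s_j^2$ are notation introduced here for illustration, not taken from the article), the standard textbook expression for the sampling variance of the j-th slope is:

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \frac{1}{1 - R_j^2}
  = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \mathrm{VIF}_j
```

Here $\sigma^2$ is the error variance, $n$ the number of observations, and $s_j^2$ the sample variance of predictor j. The first factor is the variance the coefficient would have if predictor j were uncorrelated with the others ($R_j^2 = 0$), and the VIF multiplies it. Since the standard error is the square root of the variance, a VIF of 10 inflates the standard error by a factor of $\sqrt{10} \approx 3.16$.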

VIFs start at 1 and have no upper limit. The lower the value, the better. VIFs between 1 and 5 suggest that the correlation is not severe enough to warrant corrective measures. VIFs greater than 5 represent critical levels of multicollinearity, where the coefficient estimates may not be trusted and the statistical significance is questionable. Whether multicollinearity needs to be reduced therefore depends on its severity.

How can we fix Multi-Collinearity in our model?

The potential solutions include the following:

1. Simply drop some of the correlated predictors. From a practical point of view, there is no point in keeping 2 very similar predictors in our model. Hence, VIF is also widely used as a variable selection criterion when we have a lot of predictors to choose from.

3. Do some linear transformation e.g., add/subtract 2 predictors to create a new bespoke predictor.

4. As an extension of the previous 2 points, another very popular technique is to perform Principal components analysis (PCA). PCA is used when we want to reduce the number of variables in our data but we are not sure which variable to drop. It is a type of transformation where it combines the existing predictors in a way only to keep the most informative part.

It then creates new variables known as Principal components that are uncorrelated. So, if we have 10-dimensional data then a PCA transformation will give us 10 principal components and will squeeze maximum possible information in the first component and then the maximum remaining information in the second component and so on. The primary limitation of this method is the interpretability of the results as the original predictors lose their identity and there is a chance of information loss. At the end of the day, it is a trade-off between accuracy and interpretability.
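The PCA option can be illustrated with a small NumPy sketch. This is an assumption-laden example rather than the article's own code: the three predictors are synthetic, the data are standardized first, and the components are obtained from a singular value decomposition instead of a dedicated PCA library.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Synthetic predictors: x2 is almost a copy of x1, x3 is unrelated.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize so that every predictor contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
components = Z @ Vt.T                 # principal component scores
explained = s**2 / np.sum(s**2)       # share of variance per component

# The new variables are uncorrelated with each other.
pc_corr = np.corrcoef(components, rowvar=False)
max_off_diag = np.max(np.abs(pc_corr - np.eye(3)))

print("explained variance ratio:", np.round(explained, 3))
print("largest correlation between components:", max_off_diag)
```

Three input dimensions give three components, ordered by how much variance they carry. The last component carries almost nothing here because `x1` and `x2` are nearly redundant, so dropping it loses little information, at the cost of the interpretability trade-off described above.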

How to calculate VIF (R and Python Code):

I am using a subset of the house price data from Kaggle. The dependent/target variable in this dataset is “SalePrice”. There are around 80 predictors (both quantitative and qualitative) in the actual dataset. For simplicity, I have selected 10 predictors that, based on my intuition, should be suitable predictors for the sale price of the houses. Please note that I did not apply any treatment, e.g., creating dummies for the qualitative variables. This example is just for representation purposes.

The following table describes the predictors I chose and their description.

The code below shows how to calculate VIF in R. For this we need to install the ‘car’ package. There are other packages available in R as well.

The output is shown below. As we can see, most of the predictors have VIF <= 5.

Now, if we want to do the same thing in Python, please see the code and output below.

Please note that in the Python code I have added a column of intercept/constant to my data set before calculating the VIFs. This is because the variance_inflation_factor function in Python does not add an intercept by default while calculating the VIFs. Hence, we may often come across very different results in the R and Python output. For details, please see this discussion here.
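To see why the intercept matters, here is a NumPy-only sketch. It does not call the `variance_inflation_factor` function itself; the helper `vif_first_column` and the two synthetic predictors are hypothetical. It assumes that, when no constant column is present, the auxiliary regression is fitted through the origin and its R-squared is computed against the uncentered total sum of squares.

```python
import numpy as np

def vif_first_column(X, add_intercept):
    """VIF of column 0 of X, regressed on the remaining columns."""
    y = X[:, 0]
    A = X[:, 1:]
    if add_intercept:
        A = np.column_stack([np.ones(len(y)), A])
        total = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
    else:
        total = np.sum(y ** 2)               # uncentered: nothing absorbs the mean
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ssr = np.sum((y - A @ coef) ** 2)
    r2 = 1.0 - ssr / total
    return 1.0 / (1.0 - r2)

# Two independent predictors whose means are far from zero,
# as is typical for quantities such as areas or prices.
rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(loc=50, scale=5, size=n)
x2 = rng.normal(loc=100, scale=10, size=n)
X = np.column_stack([x1, x2])

vif_with_const = vif_first_column(X, add_intercept=True)
vif_without_const = vif_first_column(X, add_intercept=False)

print(f"with intercept:    {vif_with_const:.2f}")
print(f"without intercept: {vif_without_const:.2f}")
```

Although the two predictors are independent, the version without an intercept reports a large VIF, because the regression through the origin uses `x2` to explain the mean level of `x1`. Adding the constant column first removes that effect, which is consistent with the discrepancy between R and Python described above.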


Tofu: It’s What’s For Dinner

K. C. Mackey (CAS’13) distributes Humane League brochures to passing students on Marsh Plaza. Photos by Vernon Doucette

K. C. Mackey was standing on the sidewalk near Marsh Plaza, offering passers-by a warm smile and a Humane League brochure, whose cover featured two spotted piglets and a fuzzy chick. Some smiled back and grabbed a brochure. Others waved a dismissive hand or ignored her completely. In the course of a half hour, she had two conversations about animal rights and reducing meat consumption, both with friends. One changed the topic, the other said she is slowly cutting meat from her diet and switching from milk to soy products.

Mackey was among a handful of students, all of them members of the BU Vegetarian Society, passing out leaflets this particular day. Founded four years ago, the club aims to end animal cruelty, save the environment, and halt world hunger by promoting a vegetarian or vegan lifestyle. Members meet every other week, organize potlucks, and host events like last fall’s 10 Billion Lives Tour, where participants received $1 apiece from the Farm Animal Rights Movement to view a four-minute video about factory farm abuses. The society also protests organizations like the Ringling Bros. and Barnum & Bailey circus for its alleged abuse of elephants. And they invite speakers such as Gene Bauer, cofounder and president of Farm Sanctuary, and Zoe Weil, an educator at the Institute for Humane Education, to campus to discuss such issues as animal rights and ways to live a kinder and more compassionate life.

Last fall, the club hosted a talk by Breeze Harper, author of Sistah Vegan. Harper discussed the way a vegan diet can overcome sexism and racism.

Opting for a vegetarian or vegan lifestyle remains relatively rare in the United States. Only 5 percent of American adults consider themselves vegetarians and 2 percent vegans, according to a July 2012 Gallup poll. The survey also reported that single adults are more than twice as likely as married adults to be vegetarian and that women are more likely than men to opt for the lifestyle.

Every vegetarian or vegan has a conversion tale. Rachel Atcheson (CAS’13), 2012–2013 Vegetarian Society president, was a high school senior when she learned about factory farms’ pollution, their alleged abuse of animals, and the health risks of eating meat that could be contaminated with salmonella or E. coli as a result of unclean slaughterhouse practices. “I had been a very long-haired hippie before that,” she says, but this newfound knowledge pushed her toward eliminating meat completely from her diet. Two months later, she went vegan, cutting out eggs and dairy products.

For club treasurer Victoria Brown (SAR’15), the choice to become vegan was the result of her growing concern for animal rights and her awareness of the health benefits of veganism—among them, lower cholesterol and blood pressure levels and a decreased risk for heart disease and cancer. She announced her decision to her mother three days before her 15th birthday. Instead of cake, she got a cantaloupe slice adorned with candles.

Another member, Robert Borgoño (CGS’13), says his decision is anchored in a concern for the environment, animal rights, and world hunger. Raising animals for meat requires significantly more land and water than does raising fruits and vegetables, he says. In fact, last year the Journal of Animal Science reported that it takes nearly 53 gallons of water, about 75 square feet of land for grazing and growing feed, and 1,036 Btus of fossil fuel energy—enough to power a microwave for 18 minutes—to produce a quarter pound hamburger.

Borgoño says some of his friends have forsworn meat because of factory farm worker abuse. Whether for the sake of humans or animals, he says, “I’m trying to reduce the amount of suffering that my life is causing.”

Baby steps are key to successfully converting to a vegetarian or vegan lifestyle, club members say. They suggest starting with a Meatless Monday, reading about dietary options online, visiting a nutritionist to find out which foods provide necessary vitamins and minerals otherwise found in meat, dairy, and eggs, or attending one of their club meetings.

“Our goal is not to make everyone vegan or vegetarian,” says Borgoño, the club’s vice president, “but to reduce meat consumption.”

The BU Vegetarian Society meets every other Wednesday during the school year from 6:30 to 8 p.m. in the CAS Environmental Lounge, Room 442.
