
Text classification using a CNN written in TensorFlow


Problem statement:

You are supposed to build a model that automatically classifies an article under one of four categories: Finance, Law, Fashion, or Lifestyle. Use data from leading magazines to train the model.


Solution:
 
GitHub repo: link

In the past, I used NLTK and Python to solve this problem, but neural networks have often proven more accurate on NLP tasks. I researched text classification libraries and different approaches to the problem and decided to use a CNN.

I used Denny Britz's code to implement the CNN (convolutional neural network). Here is the link to his blog post.

Below, I describe the files and the procedure I followed to get the data, train the model, and test the model, followed by the results.

First, I went to a leading newspaper, The Guardian, and looked for the labels, i.e. Finance, Law, Fashion, and Lifestyle. Scraping all the data from the same source helps keep the articles homogeneous.

I used Goose and BeautifulSoup to scrape the articles. The code is uploaded to GitHub. The folder structure and a description of the data files are as follows:

raw_data/                         Files related to training and testing
├── collect_url_data.py           Python script that scrapes articles
├── data                          Training data folder
│   ├── fashion_7000.txt          7000 training samples for class fashion
│   ├── finance_7000.txt          7000 training samples for class finance
│   ├── law_7000.txt              7000 training samples for class law
│   └── lifestyle_7000.txt        7000 training samples for class lifestyle
├── fashion                       Original scraped data and cleaned data for fashion
│   ├── fashion_7000.txt          7000 training samples for class fashion
│   ├── fashion_original.txt      Original scraped data
│   ├── log                       Log output of the Python script
│   ├── test_fashion.txt          Test data for fashion, 1001 samples
│   ├── urls.txt                  URLs that were scraped
│   └── urltext.txt               Raw text from the URLs
├── finance                       Original scraped data and cleaned data for finance
│   ├── finance.txt               Raw text from the URLs
│   ├── finance_7000.txt          7000 training samples for class finance
│   ├── finance_urls.txt          URLs scraped for finance
│   ├── log_finance               Log output
│   ├── original_finance.txt      Original scraped file
│   └── test_finance.txt          Test samples for finance
├── law                           Data folder for law
│   ├── law.txt                   Scraped data for law
│   ├── law_7000.txt              7000 training samples for law
│   ├── law_urls.txt              URLs scraped for law
│   ├── log_law                   Log output
│   ├── original_law.txt          Original scraped data for law
│   └── test_law.txt              Test data for law
└── lifestyle                     Data folder for lifestyle
    ├── lifestyle.txt             Cleaned data for lifestyle
    ├── lifestyle_7000.txt        7000 training samples for lifestyle
    ├── lifestyle_urls.txt        URLs collected for scraping
    ├── log_lifestyle             Log output of the script
    ├── original_lifestyle.txt    Original scraped data
    └── test_lifestyle.txt        Test data for lifestyle


I scraped the above categories using the Python script. Each folder holds the raw data and the cleaned data for its category. I removed unnecessary lines using sed.
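
As a rough illustration of the extraction step, here is a minimal sketch that turns one article's HTML into a single line of text. It is not the code from the repo: the post used Goose and BeautifulSoup, while this sketch swaps in Python's standard-library html.parser so that it runs without extra installs. The names ParagraphExtractor and article_to_line, and the choice of writing one article per line, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # cleaned text of each finished paragraph
        self._in_p = False     # True while the parser is inside a <p>
        self._chunks = []      # raw text pieces of the current paragraph

    def flush_paragraph(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                # An unclosed <p> ends when the next one begins.
                self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.flush_paragraph()
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> (headings, menus, captions) is ignored.
        if self._in_p:
            self._chunks.append(data)


def article_to_line(html):
    """Return the paragraph text of one article as a single line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # keep text from a trailing unclosed <p>
    return " ".join(parser.paragraphs)
```

Keeping each article on one line makes later cleanup easy, since tools such as sed work line by line.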

Once the data was ready, I went through the basics of neural networks and made the appropriate changes to the TensorFlow code to solve the problem. The changes include changing the source files and increasing the array size on lines 16-20 and 52 of this script.
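
To make that kind of change concrete, here is a minimal sketch of a four-class data loader. The original data helper in Denny Britz's code reads two files and assigns two-element labels; one plausible reading of "increasing the array size" is widening that label vector to four entries, but that reading is an assumption, and this is not the actual modified script. The function names, the class order, and the one-sample-per-line file layout are also assumptions.

```python
# Order of the classes decides which position in the label vector is set to 1.
# This particular order is an assumption, not taken from the repo.
CLASSES = ["fashion", "finance", "law", "lifestyle"]


def one_hot(index, size):
    """Return a list of `size` zeros with a 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def load_data_and_labels(class_files):
    """Read one sample per line from each (class_name, path) pair.

    Returns (texts, labels), where labels[i] is the one-hot vector
    for the class that texts[i] was read from.
    """
    texts, labels = [], []
    for index, (_name, path) in enumerate(class_files):
        with open(path, encoding="utf-8") as handle:
            # Skip blank lines so they never become training samples.
            samples = [line.strip() for line in handle if line.strip()]
        texts.extend(samples)
        labels.extend(one_hot(index, len(class_files)) for _ in samples)
    return texts, labels
```

With the folder layout above, class_files could be built as [(name, "raw_data/data/%s_7000.txt" % name) for name in CLASSES].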

Once the script was ready and the required Python libraries were installed, I ran the code successfully, and the run created a new folder called runs, which holds the final results.

Here is a screenshot of the results: picture link

Results:

I was able to get 94% accuracy using the second checkpoint in the runs folder.

Here is the Google Sheet link.

Unfortunately, there were a few empty lines because I had split the articles in the training data, and the CNN predicted a label for those lines anyway. I could have avoided this mistake by removing empty lines first.
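
One way to guard against this is to drop blank lines before handing the text to the model, while remembering the original line numbers so the predictions can still be matched back to the file. The sketch below shows that idea together with a plain accuracy calculation. Both function names are hypothetical, and this is not the evaluation code used for the results above.

```python
def non_empty_samples(lines):
    """Yield (line_number, text) for every line that is not blank.

    Line numbers start at 1 and refer to the original input, so each
    prediction can be traced back to the line it came from.
    """
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text:
            yield number, text


def accuracy(predicted, expected):
    """Return the fraction of positions where the two label lists agree."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)
```

Only the texts produced by non_empty_samples would be passed to the classifier, so blank lines never receive a label and never count toward the accuracy.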

If you have more questions, feel free to reach out to me at shanker.mani0@gmail.com.

Happy hacking!
