
Text classification using a CNN written in TensorFlow


Problem statement :

You are supposed to build a model that automatically classifies an article under one of four categories: Finance, Law, Fashion or Lifestyle. Use data from leading magazines to train the model.


Solution:
 
Github Repo : link

In the past, I used NLTK and Python to solve this kind of problem, but neural networks have often proven more accurate on NLP tasks. I researched text classification libraries and different approaches to the problem, and decided to use a CNN.

I have used Denny Britz's code to implement the CNN (convolutional neural network). Here is the link to his blog post.

Below I describe the files and the procedure I followed to get the data, train the model and test it, and then the results.

First, I went to The Guardian, a leading newspaper, and looked for sections matching the labels, i.e. Finance, Law, Fashion and Lifestyle. Scraping all the data from the same source helps keep the articles homogeneous.

I used Goose and BeautifulSoup to scrape the articles. The code is uploaded to the GitHub repo. The folder structure and a description of the data files are as follows:

raw_data/                         Contains files related to train and test
├── collect_url_data.py           Python script that scrapes articles
├── data                          Training data folder
│   ├── fashion_7000.txt          7000 training data for class fashion
│   ├── finance_7000.txt          7000 training data for class finance
│   ├── law_7000.txt              7000 training data for class law
│   └── lifestyle_7000.txt        7000 training data for class lifestyle
├── fashion                       From original scraped data and cleaned one
│   ├── fashion_7000.txt          7000 training data for class fashion
│   ├── fashion_original.txt      Original scraped data
│   ├── log                       Log output of python
│   ├── test_fashion.txt          test data for python 1001 samples
│   ├── urls.txt                  urls which were scraped
│   └── urltext.txt               raw text from urls
├── finance                       From original scraped data and cleaned one
│   ├── finance.txt               raw text from urls
│   ├── finance_7000.txt          7000 training data for class finance
│   ├── finance_urls.txt          urls scraped for finance
│   ├── log_finance               log output
│   ├── original_finance.txt      Original scraped file
│   └── test_finance.txt          test sample for finance
├── law                           Data folder for law
│   ├── law.txt                   scraped data for law
│   ├── law_7000.txt              7000 training samples for law
│   ├── law_urls.txt              urls scraped for law
│   ├── log_law                   log output
│   ├── original_law.txt          original scraped data for law
│   └── test_law.txt              test data for law
└── lifestyle                     Data folder for lifestyle
    ├── lifestyle.txt             cleaned data for lifestyle
    ├── lifestyle_7000.txt        7000 training samples for lifestyle
    ├── lifestyle_urls.txt        urls collected for scraping
    ├── log_lifestyle             log output of the script
    ├── original_lifestyle.txt    original scraped data
    └── test_lifestyle.txt        test data for lifestyle


I scraped the above categories using the Python script. Each folder holds the raw data and the cleaned data for its category. I removed the unnecessary lines using sed.
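
The exact sed commands are not listed here. As a rough Python equivalent of that cleaning step, the sketch below drops blank lines and very short fragments (navigation text, share buttons and the like) from a list of scraped lines. The function name `clean_lines` and the `min_words` threshold are illustrative assumptions, not the commands that were actually run.

```python
def clean_lines(lines, min_words=5):
    """Return scraped lines with blanks and very short fragments removed.

    min_words is an assumed threshold; tune it for your own data.
    """
    cleaned = []
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        if len(text.split()) >= min_words:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = [
        "Share this article",
        "",
        "The central bank  raised interest rates by a quarter point on Thursday.",
    ]
    for line in clean_lines(raw):
        print(line)
```

A word-count threshold is a blunt filter; it may also discard short but legitimate sentences, so it is worth checking a sample of the dropped lines.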

Once the data was ready, I went through the basics of neural networks and made the appropriate changes to the TensorFlow code. The changes include pointing the code at the new source files and increasing the array size on lines 16-20 and 52 in this script
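
To illustrate the kind of change involved: Denny Britz's original code is written for two classes (positive and negative), so moving to four classes means widening each one-hot label to length four. The following is a minimal sketch in plain Python, not the actual edited lines; the function names `one_hot` and `build_dataset` and the class order are assumptions made for illustration.

```python
CLASSES = ["fashion", "finance", "law", "lifestyle"]  # assumed order


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]


def build_dataset(samples_by_class, classes=CLASSES):
    """Flatten {class_name: [text, ...]} into parallel text and label lists."""
    texts, labels = [], []
    for index, name in enumerate(classes):
        for text in samples_by_class.get(name, []):
            texts.append(text)
            labels.append(one_hot(index, len(classes)))
    return texts, labels


if __name__ == "__main__":
    sample = {"fashion": ["new season coats"], "law": ["court rules on appeal"]}
    texts, labels = build_dataset(sample)
    for text, label in zip(texts, labels):
        print(label, text)
```

In practice each list of samples would be read from the training files described above (for example `data/fashion_7000.txt`), one sample per line.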

Once the script was ready and the required Python libraries were installed, I was able to run the code successfully. TensorFlow created a new folder called runs, which holds the final results.

Here is a screenshot of the results.




Picture link

Results:

I was able to get 94% accuracy using the second checkpoint in the runs folder.

Here is the google sheet link

Unfortunately, the data contained a few empty lines, because I had split the articles while preparing the training data, and the CNN predicted a label for those lines anyway. I could have avoided this mistake by removing the empty lines beforehand.
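
One way to keep such lines out of the evaluation is sketched below, under the assumption that the samples, their true labels and the predicted labels are available as parallel lists. The function name `accuracy_skipping_blanks` is hypothetical and not part of the code used for the results above.

```python
def accuracy_skipping_blanks(texts, true_labels, predicted_labels):
    """Accuracy over non-empty samples only.

    Returns None when every sample is blank.
    """
    correct = 0
    total = 0
    for text, truth, guess in zip(texts, true_labels, predicted_labels):
        if not text.strip():
            continue  # blank line: nothing to classify
        total += 1
        if truth == guess:
            correct += 1
    return correct / total if total else None


if __name__ == "__main__":
    texts = ["court rules on appeal", "", "new season coats"]
    truth = ["law", "law", "fashion"]
    guess = ["law", "finance", "fashion"]
    print(accuracy_skipping_blanks(texts, truth, guess))
```

Filtering the blanks out of the input file before running predictions would have the same effect and also keeps them out of the results sheet.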

If you have more questions, feel free to reach out to me at shanker.mani0@gmail.com.

Happy hacking !
