
Web scraping using the Python package Goose

Web scraping is a powerful technique for collecting large amounts of data from the internet. Companies with quality data thrive in today's world when it comes to machine learning.


Let's take a scenario. You set out to build the world's best restaurant review classification system. You collect all the reviews from several restaurants and use a fancy deep learning algorithm to do the classification. It turns out your classification algorithm is not doing well out in public. What went wrong?

Well, machine learning is all about capturing a pattern and generalizing it so well that the model also performs well on unseen data. Given the situation you are in, you have these options: try a GPU, incorporate the latest ML techniques, build an ensemble of many models, revisit feature engineering... or

Get more data. As trivial as it might sound, fetching more data gives an ML algorithm more examples of the patterns within the data, which usually helps it perform better on unseen data.

I am going to talk about a not-so-famous Python package that works really well and is quick to install and start scraping with.

[Goose3](https://pypi.org/project/goose3/)


Let's see the steps to install and scrape:


$ pip3 install goose3
$ python3
>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> # Download the page and extract the main article from it
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> # cleaned_text holds the article body without markup
>>> article.cleaned_text[:150]
"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi"
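
Since the goal is to get more data, you will usually want to run the same extraction over many pages. Here is a minimal sketch of how that could look. The helper names `collect_articles` and `scrape_with_goose` are my own and not part of goose3; the only goose3 calls assumed are `Goose()`, `extract(url=...)`, and the `title` and `cleaned_text` attributes shown above. The extractor is passed in as a function so the loop can be tried out without hitting the network.

```python
from typing import Callable, Dict, Iterable, List


def collect_articles(urls: Iterable[str],
                     extract: Callable[[str], object]) -> List[Dict[str, str]]:
    """Call `extract` on every URL and keep the pages that yield some text.

    `extract` is any function that takes a URL and returns an object with
    `title` and `cleaned_text` attributes (hypothetical helper, not goose3 API).
    """
    rows = []
    for url in urls:
        try:
            article = extract(url)
        except Exception as exc:  # e.g. network errors or unparsable pages
            print(f"skipping {url}: {exc}")
            continue
        text = (article.cleaned_text or "").strip()
        if not text:
            continue  # nothing usable was extracted from this page
        rows.append({"url": url, "title": article.title, "text": text})
    return rows


def scrape_with_goose(urls: Iterable[str]) -> List[Dict[str, str]]:
    """Wire the loop above to a real Goose instance."""
    # Imported here so collect_articles can be used without goose3 installed.
    from goose3 import Goose

    with Goose() as g:
        return collect_articles(urls, lambda u: g.extract(url=u))
```

Calling `scrape_with_goose([...])` with a list of article URLs gives back a list of dictionaries that you can write to a CSV file or feed into your training pipeline. Pages that fail to download or come back empty are skipped instead of stopping the whole run.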


