python,class,beautifulsoup
I'm scraping the data off this website to create a table. I plan on creating a function to iterate through every subject but testing on just Accounting & Finance first. So far I have the following code: import os import requests from bs4 import BeautifulSoup import pandas as pd main_url...

python,beautifulsoup,urllib
I am trying to extract some information from this website, i.e. the line which says: Scale(Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec = 1.72 kpc/arcmin = 0.10 Mpc/degree but everything after the : is variable depending on galtype. I have written code which uses beautifulsoup and...

python,html,beautifulsoup,list-comprehension
I've narrowed my HTML down and I want to pull the hrefs from each line IF the content following the a tag is past 2010. What's the best way to do this? I'll post my code first, and then the HTML. Code: links = [STEM_URL + row.a["href"] for row in...

python-2.7,selenium,beautifulsoup
I'm trying to grab data from: http://www.boerse-frankfurt.de/de/etfs/ishares+msci+world+momentum+factor+ucits+etf+DE000A12BHF2 The types of data I'm looking for are located in the classes named singlebox list_component. Let's say I want to extract the Total Expense Ratio (0.30%). It is located in a td class called: right column-datavalue lastColOfRow. But if I do: dues =...

python,python-3.x,beautifulsoup
I'm making a web spider to automate some of my work. I have a table with lots of drivers and different versions for different operating systems. So far everything works fine, but I'm having a hard time separating the links for each operating system. I'll post part of the html...

python,beautifulsoup,encode
OK, so as a beginner web scraper I feel as though I've seen both used, seemingly interchangeably, when converting the default unicode of text in HTML. I know contents() is a list object, but other than that, what the heck is the difference? I've noticed that .encode("utf-8") seems to work more...

python,beautifulsoup
I am trying to get it to print the title of each book and its chapters, but only once per book and title. So basically "The First Book of Jacob" Chapters 1-7 instead of it iterating over all the books. Here is the page layout (url included...

python,python-3.x,beautifulsoup
I'm trying to write a web spider to gather some links and text. I have a table I'm working with, and the second cell of each row has a number in it. All I want to do is get that number; if it's the one I need, then grab...

beautifulsoup
I am trying to learn beautifulsoup to scrape HTML and have a difficult challenge. The HTML I am trying to scrape is not well formatted, and with my lack of knowledge of beautifulsoup I am kind of stuck... The HTML I am trying to scrape is as below <table> <tr> <td><b>Value 1<b/>HiddenValue1</td>...

python,html,regex,wordpress,beautifulsoup
So I need to scrape a site using Python, but the problem is that the markup is random, unstructured, and proving hard to work with. For example <p style='font-size: 24px;'> <strong>Title A</strong> </p> <p> <strong> First Subtitle of Title A </strong> "Text for first subtitle" </p> Then it will...

python,html,beautifulsoup
I'm currently trying to extract the html elements which have a text on their own and wrap them with a special tag. For example, my HTML looks like this: <ul class="myBodyText"> <li class="fields"> This text still has children <b> Simple Text </b> <div class="s"> <ul class="section"> <li style="padding-left: 10px;"> Hello...

python,beautifulsoup
I'm trying to access the data in the Table in this URL. I am using the code below but I'm coming across the Error AttributeError: 'NoneType' object has no attribute 'find' in the line data = iter(soup.find("table", {"class": "xtTblCon"}).find("div", {"id": "MATURITYY%"}).find_all_next("li")). The code is as follows: from bs4 import BeautifulSoup...

python,beautifulsoup,python-requests
Why is it that the penultimate line of this snippet completes successfully, but the last one gives the error: TypeError: 'NoneType' object is not callable? What is different inside the scope of the function, and how can it be fixed? import requests from bs4 import BeautifulSoup def findDiv(soup): print soup.body.FindAll("div")...
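The error in this snippet comes from the method name: BeautifulSoup defines `find_all` (and the legacy alias `findAll`), not `FindAll`. Accessing an unknown attribute on a tag is treated as a search for a child tag of that name, which returns `None`, and calling `None` raises the `TypeError`. A minimal sketch of the behavior and the fix, on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><div>one</div><div>two</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute access is treated as a child-tag lookup, so
# soup.body.FindAll is None, and calling None raises the TypeError
assert soup.body.FindAll is None

# The correct method name is find_all
divs = soup.body.find_all("div")
print([d.get_text() for d in divs])  # ['one', 'two']
```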

python,beautifulsoup,html-parsing
I am trying to remove the header cells from an html table using BeautifulSoup. I have something like: <tr> <th> head1 </th> <th> head2 </th> </tr> I am using the following code to remove all the header cells: soup = BeautifulSoup(url) for headless in soup.find_all('th'): headless.decompose() This works great, except...

python,graph,beautifulsoup,label,networkx
I'm trying to extract one set of values from a URL. This set has a unique list of numbers. The number of elements in this list should be equal to the number of nodes. So the label that these nodes get should come from the list extracted. How can this...

python,regex,beautifulsoup
I've isolated a line of HTML procured from BeautifulSoup that i want to run regex on, but I keep getting AttributeError: 'NoneType' object has no attribute 'groups' I read another stackoverflow question (using regex on beautiful soup tags) but I can't see what I need to do to fix my...

python,web-scraping,beautifulsoup,screen-scraping
I'm having a little bit of an issue: I would like to take this data, for item in g_data: print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[0]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[1]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[2]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[3]["href"] and use the...

python,beautifulsoup,html-parsing
I have been trying to scrape the data from a website which is using a good amount of tables. I have been researching on the beautifulsoup documentation as well as here on stackoverflow but am still lost. Here is the said table: <form action="/rr/" class="form"> <table border="0" width="100%" cellpadding="2" cellspacing="0"...

python,python-3.x,beautifulsoup,python-3.4
I am trying to install Beautiful Soup 4 in Python 3.4. I installed it from the command line (got the invalid syntax error because I had not converted it), ran the 2to3.py conversion script on bs4, and now I get a new invalid syntax error. >>> from bs4 import BeautifulSoup...

python,python-2.7,beautifulsoup,urllib2
I want my output to be like: count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford. The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie...

html,parsing,beautifulsoup,html-parsing
print [(element['name'], element['value']) for element in soup.find_all('input')] I copied this code to get the value of an input and it throws this error: File "messager.py", line 116, in main print [(element['name'], element['value']) for element in soup.find_all('input')] File "C:\PYTHON27\lib\site-packages\bs4\element.py", line 905, in __getitem__ return self.attrs[key] KeyError: 'value' If I only provide...
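The `KeyError` here is raised for `input` tags that simply have no `value` attribute. A tag's attributes behave like a dict, so `.get()` returns `None` for a missing key instead of raising. A minimal sketch on made-up form markup:

```python
from bs4 import BeautifulSoup

html = '''
<form>
  <input name="user" value="alice">
  <input name="comment">
</form>
'''
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError
pairs = [(el.get("name"), el.get("value")) for el in soup.find_all("input")]
print(pairs)  # [('user', 'alice'), ('comment', None)]
```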

python,table,web-scraping,beautifulsoup,html-table
This webpage... http://www.nfl.com/player/tombrady/2504211/gamelogs has multiple tables on it. Within the HTML all of the tables are labeled the exact same: <table class="data-table1" width="100%" border="0" summary="Game Logs For Tom Brady In 2014"> I can scrape data from only the first table (Preseason table) but I do not know how to skip...

python,beautifulsoup
When using BeautifulSoup4, I can run this code to get one "Shout" without problems. When I use the for loop, I get the error AttributeError: 'NavigableString' object has no attribute 'children' class Shout: def __init__(self, user, msg, date): self.user = user self.msg = msg self.date = date def getShouts(): #s...

python,loops,beautifulsoup,mechanize,bs4
I'm attempting to scrape several pages of results from the county search tool here: http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main But I can't seem to figure out how to iterate over more than just the first page. import csv from mechanize import Browser from bs4 import BeautifulSoup url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main' br = Browser() br.set_handle_robots(False) br.open(url)...

html,web,web-scraping,beautifulsoup
I've been playing around with scraping webpages using BeautifulSoup for a few weeks now. An issue I recently ran into, and hadn't seen before, is where the content of the webpage is different from what's shown as the page's source code and what's given in the url request response. For...

python,python-2.7,web,web-scraping,beautifulsoup
I am trying an example from the BeautifulSoup docs and found it acting weird. When I try to access the next_sibling value, instead of the "body", a '\n' comes into the picture. html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were...
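The `'\n'` is expected behavior: the whitespace between `</head>` and `<body>` in the markup is itself a `NavigableString` node in the tree, so `.next_sibling` returns it before the `body` tag. `find_next_sibling()` skips over string nodes to the next tag. A sketch on a trimmed version of the docs example:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# The newline between </head> and <body> is a real node in the tree
assert soup.head.next_sibling == "\n"

# find_next_sibling() skips whitespace strings and returns the next tag
body = soup.head.find_next_sibling()
print(body.name)  # body
```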

python,beautifulsoup,mechanize
I am trying to log into my instagram via a python script using argparse. It seems to connect but it prints out "This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private...

javascript,python,html,json,beautifulsoup
I am using BeautifulSoup to get the HTML of a webpage. That works fine so far. But what I really want are the contents of this javascript chunk inside the HTML, which is encapsulated with <script type="text/javascript"> and then inside that tag, eventually there is a giant array thing that...

python,beautifulsoup
I'm trying to access the string Out of Stock using BeautifulSoup but cannot find the way to it: <span style="color: #727272; font-size: 14px; font-weight: normal;"> <strong>Price: $790</strong> (Out of stock) </span> Can anybody give hints how can I do this?...
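The "(Out of stock)" text is a bare `NavigableString` sitting directly inside the `span`, after the `strong` tag, so one way to reach it is via the `strong` tag's `next_sibling`. A sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

html = '''<span style="color: #727272; font-size: 14px; font-weight: normal;">
<strong>Price: $790</strong>
(Out of stock)
</span>'''
soup = BeautifulSoup(html, "html.parser")

# The text after </strong> is the strong tag's next sibling string node
status = soup.find("strong").next_sibling.strip()
print(status)  # (Out of stock)
```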

python,web-scraping,beautifulsoup
I am first going to point out that I am new to all of this, but I'm struggling with trying to get to a nested table's cells. Here is the square footage field I am trying to get to, down around line 282: view-source:http://services.wakegov.com/realestate/Account.asp?id=0355891 'square_feet': soup.findAll('table')[10].findAll('tr')[15].get_text().strip(), The error I receive is:...

python,python-2.7,beautifulsoup
I need to retrieve an image from a website using Python. However, the image is not in the form of a linked file, but as a GIF Data URI. How do I download this and store it in a .gif file?
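A data URI embeds the image bytes directly in the attribute as base64, so there is nothing to download: split off the header up to the first comma and base64-decode the remainder. A sketch, using a hypothetical 1x1 transparent GIF as the data URI value:

```python
import base64

# A 1x1 transparent GIF as a data URI (hypothetical example value)
data_uri = ("data:image/gif;base64,"
            "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

# Everything after the first comma is the base64-encoded image payload
header, encoded = data_uri.split(",", 1)
image_bytes = base64.b64decode(encoded)

with open("image.gif", "wb") as f:
    f.write(image_bytes)

print(image_bytes[:6])  # GIF file signature
```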

python,html,regex,beautifulsoup,html-parsing
I am using Beautifulsoup for easy scraping. I have figured out there are more than 5 divs in the webpage which I want to scrape. Their names are different but have a pattern. These divs are: divnewthing divnew divnewstring etc. So the pattern is a divnew* kind of regular expression. And I am...

python,table,statistics,beautifulsoup,screen-scraping
On a player stat page. How can I make my anchor point the year "2014" and grab specific numbers in the 2014 column (scrape numbers to the right of 2014) The code below is skipping the "Passing" table (with all of the career passing stats) and trying to grab stats...

python,xml,beautifulsoup,enthought
I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file. You can see a sample of the data here. Warning: this will initiate a download of a 108MB zip...

python,string,beautifulsoup
My problem is that I want to print only the results with '1', not '-1', but when I use find() I just get '1' or '-1'. I know that is working, but is there any function to print only the results with '1': not the number, but the whole line? import requests import...

python,printing,count,beautifulsoup
Is there any function in beautiful soup to count the number of lines retrieved? Or is there any other way this can be done? from bs4 import BeautifulSoup import string content = open("webpage.html","r") soup = BeautifulSoup(content) divTag = soup.find_all("div", {"class":"classname"}) for tag in divTag: ulTags = tag.find_all("ul", {"class":"classname"}) for tag...
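There is no dedicated counting function; `find_all()` returns a plain Python list, so `len()` on the result gives the count. A minimal sketch on made-up markup matching the class names in the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="classname"><ul class="classname"><li>a</li></ul></div>
<div class="classname"><ul class="classname"><li>b</li><li>c</li></ul></div>
'''
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so len() gives the number of matches
div_tags = soup.find_all("div", {"class": "classname"})
li_tags = soup.find_all("li")
print(len(div_tags), len(li_tags))  # 2 3
```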

python,table,data,beautifulsoup,scrape
In python using beautiful soup I want to be able to grab specific text/numbers from a sortable table online. http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go I have attempted this about a million times and can't figure it out. This is the best I could do: from bs4 import BeautifulSoup import urllib2 import requests import pymongo...

python,beautifulsoup
I am running python 3.5 with BeautifulSoup4 and getting an error when I attempt to pass the plain text of a webpage to the constructor. The source code I am trying to run is import requests from bs4 import BeautifulSoup tcg = 'http://magic.tcgplayer.com/db/deck_search_result.asp?Format=Commander' sourcecode = requests.get(tcg) plaintext = sourcecode.text soup...

python,pandas,beautifulsoup
I am using the code below to try and extract the data from the table in this URL. I asked the same question here and got an Answer for it. However, despite the code from the Answer working at that time, I've now come to realize that data in the...

python,html,beautifulsoup,html-parsing
I've written a script using beautifulsoup4 that works in one machine but not another. The reason is that on that other machine, BeautifulSoup() constructor auto-convert <br> to <br/> whereas it's not the behaviour on my machine. Believe it or not, it matters to my script. I figured that the two...

xml,python-2.7,beautifulsoup
I'm having an issue where my code is returning the information I want from XML with the tags included, when I only want the information between the tags. My output looks like [<weekendingdate>2015-05-02</weekendingdate>] but it should be 2015-05-02. Thanks for the help! Below is my attempt and the XML code. Attempt:...
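The search returns the `Tag` objects themselves; reading each result's `.text` (or `.string`) strips the surrounding markup. A sketch, using a made-up document around the tag shown in the question:

```python
from bs4 import BeautifulSoup

xml = "<report><weekendingdate>2015-05-02</weekendingdate></report>"
soup = BeautifulSoup(xml, "html.parser")

# The search returns Tag objects; .text gives only the enclosed content
dates = [tag.text for tag in soup.find_all("weekendingdate")]
print(dates)  # ['2015-05-02']
```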

python,regex,url,beautifulsoup,matching
https://example.net/users/x Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract contents from every URL using beautiful soup. from bs4 import BeautifulSoup from urllib.request import urlopen import re content = urlopen(re.compile(r"https://example.net/users/[0-9]//")) soup = BeautifulSoup(content) Is this...
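No: `urlopen()` takes one concrete URL, not a regular expression, so a pattern can't fan out into requests. The usual approach is a plain loop that formats each URL in turn. A sketch that only builds the URL strings (the fetch-and-parse step is left as a comment, since hitting 200000 pages is site-specific):

```python
# urlopen() cannot expand a regex; generate each concrete URL instead
base = "https://example.net/users/{}"

urls = [base.format(i) for i in range(1, 201)]  # 1..200 for illustration
print(urls[0], urls[-1])

# Each page would then be fetched and parsed one at a time, e.g.:
# for url in urls:
#     soup = BeautifulSoup(urlopen(url), "html.parser")
```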

python,html,regex,beautifulsoup,lxml
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string. For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml. On the other hand, if my string contained only a...

python,python-2.7,beautifulsoup
Example: Sometimes the HTML is: <div id="1"> <div id="2"> this is the text i do NOT want </div> this is the text i want here </div> Other times it's just: <div id="1"> this is the text i want here </div> I want to get only the text in the one...
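To keep only the text that belongs directly to the outer `div` and skip anything inside nested tags, collect its string children with `recursive=False`. This works for both layouts, since the nested `div` is simply absent in the second. A sketch on the first layout:

```python
from bs4 import BeautifulSoup

html = '''<div id="1">
<div id="2"> this is the text i do NOT want </div>
this is the text i want here
</div>'''
soup = BeautifulSoup(html, "html.parser")

# text=True with recursive=False yields only the div's own string children,
# not the strings nested inside the inner div
outer = soup.find("div", id="1")
own_text = " ".join(
    s.strip() for s in outer.find_all(text=True, recursive=False)
).strip()
print(own_text)  # this is the text i want here
```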

python,html,beautifulsoup,html-parsing
<div class="meaning"><span class="hinshi">［副］</span>物事の重点・大勢を述べるときに用いる。</div> All I need from this is おもに。もっぱら。物事の重点・大勢を述べるときに用いる. Usually the hinshi class is separate from the sentences I'm trying to parse, but for some of them they seem to be combined together. Is there any way to just print the sentence while ignoring the ［副］?...
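One way to drop the embedded part-of-speech marker is to `decompose()` the `hinshi` span before reading the div's text. A sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

html = ('<div class="meaning"><span class="hinshi">［副］</span>'
        '物事の重点・大勢を述べるときに用いる。</div>')
soup = BeautifulSoup(html, "html.parser")

# Remove the part-of-speech span, then read the remaining text
div = soup.find("div", class_="meaning")
div.find("span", class_="hinshi").decompose()
print(div.get_text())  # 物事の重点・大勢を述べるときに用いる。
```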

python,html,escaping,beautifulsoup
I would like to wrap some words that are not already links with anchor links in BeautifulSoup. I use this to achieve it: from bs4 import BeautifulSoup import re text = ''' replace this string ''' soup = BeautifulSoup(text) pattern = 'replace' for txt in soup.findAll(text=True): if re.search(pattern,txt,re.I) and txt.parent.name...

python,web-scraping,beautifulsoup
I'm trying to get a hold of the data under the columns having the code "SEVNYXX", where "XX" are the numbers that follow (eg. 01, 02, etc) on the site using Python. With the code below I can get the first row of all the Columns data that I want....

python,python-2.7,web-scraping,beautifulsoup,urlopen
Let's say I want to scrape the data here. I can do it nicely using urlopen and BeautifulSoup in Python 2.7. Now if I want to scrape data from the second page with this address. What I get is the data from the first page! I looked at the page...

python,regex,parsing,beautifulsoup,python-requests
I'm using Python 2.7, BeautifulSoup4, regex, and requests on Windows 7. I've scraped some code from a website and I am having problems parsing and extracting the bits I want and storing them in a dictionary. What I'm after is text that is presented as follows in the code: @CAD_DTA\">I...

python,html,forms,beautifulsoup,html-parsing
I'm trying to use BeautifulSoup to extract input fields for a specific form only. Extracting the form using the following: soup.find('form') Now I want to extract all input fields which are a child to that form only. How can I do that with BS?...
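`find_all()` can be called on any tag, not just on the soup, and it then searches only that tag's descendants. So grab the form first and search within it. A sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '''
<input name="outside">
<form action="/login">
  <input name="user"><input name="pass">
</form>
'''
soup = BeautifulSoup(html, "html.parser")

# Searching from the form tag scopes the query to its descendants only
form = soup.find("form")
names = [inp["name"] for inp in form.find_all("input")]
print(names)  # ['user', 'pass']
```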

python,html,beautifulsoup
so I'm a web scraping noob, and ran into some HTML format I've never seen before. All the info I need is in a completely flat hierarchy. I need to grab the Date/MovieName/Location/Amenities. It's laid out like this: <div class="caption"> <strong>July 1</strong> <br> <em>Top Gun</em> <br> "Location: Millennium Park"...
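When everything sits flat inside one container separated by `<br>` tags, the tag boundaries don't help much, but `.stripped_strings` walks the container and yields each text fragment in document order, which can then be unpacked positionally. A sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

html = '''<div class="caption">
<strong>July 1</strong><br>
<em>Top Gun</em><br>
"Location: Millennium Park"
</div>'''
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each non-empty text fragment in document order
fields = list(soup.find("div", class_="caption").stripped_strings)
print(fields)  # ['July 1', 'Top Gun', '"Location: Millennium Park"']
```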

python,python-2.7,web-scraping,beautifulsoup
Sorry if that was a vague title. I'm trying to scrape the number of XKCD web-comics on a consistent basis. I saw that http://xkcd.com/ always has their newest comic on the front page along with a line further down the site saying: Permanent link to this comic: http://xkcd.com/1520/ Where 1520...

python,csv,beautifulsoup
I can't seem to figure out the proper indents/clause placements to get this to loop through more than one page. This code currently prints out a CSV file fine, but only does it for the first page. Any help? #THIS WORKS BUT ONLY PRINTS THE FIRST PAGE from bs4 import...

python-2.7,beautifulsoup
I have an HTML snippet which looks like the following: <div class="myTestCode"> <strong>Abc: </strong> test1</br> <strong>Def: </strong> test2</br> </div> How do I parse it in Beautiful Soup to get: Abc: test1, Def: test2 This is what I have tried so far: data = """<div class="myTestCode"> <strong>Abc: </strong> test1</br> <strong>Def: </strong>...
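One way to rebuild the `label: value` pairs is to walk each `strong` tag and take the string node that immediately follows it (with `html.parser`, the stray `</br>` end tags have no matching opener and are dropped). A sketch on the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '''<div class="myTestCode">
<strong>Abc: </strong> test1</br>
<strong>Def: </strong> test2</br>
</div>'''
soup = BeautifulSoup(html, "html.parser")

# Each label is a <strong> tag; its value is the text node right after it
pairs = [s.get_text().strip().rstrip(":") + ": " + s.next_sibling.strip()
         for s in soup.find_all("strong")]
print(", ".join(pairs))  # Abc: test1, Def: test2
```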

python,css,beautifulsoup
I am beginning web scraping with BeautifulSoup in Python. The website I am trying to parse is "http://www.moneycontrol.com/india/stockpricequote/computers-software/techmahindra/TM4" My code is as below: previous_close = content.select(".gD_12 PB3"); I get the following error when the line is interpreted: previous_close = content.select(".gD_12 PB3"); File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1313, in select 'Unsupported or invalid CSS selector: "%s"'...
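In a CSS selector, classes on the same element are chained with dots, not spaces: `.gD_12 PB3` asks for a descendant element named `PB3`, which is what the selector engine rejects. If one element carries both classes, the selector should be `.gD_12.PB3`. A sketch, with a made-up price value:

```python
from bs4 import BeautifulSoup

html = '<div class="gD_12 PB3">1,035.45</div>'
soup = BeautifulSoup(html, "html.parser")

# One element with two classes: chain them with dots, no space
previous_close = soup.select(".gD_12.PB3")[0].text
print(previous_close)  # 1,035.45
```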

python,flask,beautifulsoup,jinja2
I want to create content snippets for my home page. An example post looks something like <p>Your favorite Harry Potter characters enter the Game of Thrones universe, and you'll never guess what happens!</p> <readmore/> <p>...they all die</p> On the home page I only want the things before <readmore/> to show...

python,table,website,beautifulsoup
I am new to this website and programming in general, so bear with me, please, as my formatting for the question may be incorrect. I am trying to extract data from a website for personal use. I only want the precipitation at the top of the hour. I am nearly...

python,web-scraping,beautifulsoup,linkedin,mechanize
I am trying to scrape some web pages from LinkedIn using BeautifulSoup and I keep getting error "HTTP Error 999: Request denied". Is there a way around to avoid this error. If you look at my code, I have tried Mechanize and URLLIB2 and both are giving me the same...

python,beautifulsoup
I am very new to all this and am having a hard time getting specific text outside of any tags using BeautifulSoup. Here is my code: from bs4 import BeautifulSoup soup = BeautifulSoup(''' <li id="SalesRank" style="list-style : none"> <b>Sellers Rank:</b> #81 in Fun (<a href="http://www.google.com">See Top 100</a>) </li> ''') theRank...
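The "#81 in Fun" fragment is a bare string node between the `<b>` and `<a>` tags, so it can be reached as the sibling that follows the `<b>` tag; the trailing parenthesis that opens the link just needs trimming. A sketch on the markup from the question:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<li id="SalesRank" style="list-style : none">
<b>Sellers Rank:</b>
#81 in Fun (<a href="http://www.google.com">See Top 100</a>)
</li>
''', "html.parser")

# The rank is the bare string right after the <b> tag; trim the
# trailing parenthesis that opens the link
raw = soup.find("b").next_sibling
the_rank = raw.strip().rstrip("( ").strip()
print(the_rank)  # #81 in Fun
```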

python,beautifulsoup,scrape
I want to get the data (name, city and address) located in a div tag from an HTML file like this: <div class="mainInfoWrapper"> <h4 itemprop="name">name</h4> <div> <a href="/Wiki/Province/Tehran"></a> city <a href="/Wiki/City/Tehran"></a> Address </div> </div> I don't know how I can get the data I want in that specific tag. Obviously I'm using python...

python,web-scraping,beautifulsoup
I want to save images from a url to a special folder, for example 'my_images', but not to the default (where my *.py file is). Is it possible to do that? Because my code saves all images to the folder with the *.py file. Here is my code: import urllib.request from bs4 import BeautifulSoup import re...

python,html,css-selectors,beautifulsoup,html-parsing
I want to select a table tag which has the value of class attribute as: drug-table data-table table table-condensed table-bordered So I tried the below code: for i in soup.select('table[class="drug-table data-table table table-condensed table-bordered"]'): print(i) But it fails to work: ValueError: Unsupported or invalid CSS selector: "table[class="drug-table" spaces in the...
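The `class` attribute is treated as a list of separate classes, so either chain them in a CSS selector with dots or pass a single class to `find_all` (which matches if the tag has that class among its classes). A sketch:

```python
from bs4 import BeautifulSoup

html = ('<table class="drug-table data-table table table-condensed '
        'table-bordered"><tr><td>x</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

# CSS: chain the classes with dots instead of spaces
via_select = soup.select("table.drug-table.data-table.table-condensed")

# find_all: class_ matches any tag carrying that class
via_find = soup.find_all("table", class_="drug-table")

print(len(via_select), len(via_find))  # 1 1
```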

python,html,web-scraping,beautifulsoup,html-parsing
On this page the final score (number) of each team has the same class name class="finalScore". When I call the final score of the away team (on top) the code calls that number without a problem. If ... favLastGM = 'A' When I try to call the final score of...

python,html,beautifulsoup
I've been able to isolate a row in a html table using Beautiful Soup in Python 2.7. Been a learning experience, but happy to get that far. Unfortunately I'm a bit stuck on this next bit. I need to get the link that follows the "Select document Remittance Report I...

python,html,xml,beautifulsoup,cdata
I used beautiful soup to get CDATA from an html page, but I have to extract contents from it and put them in a csv file. This is my code: from bs4 import BeautifulSoup from urllib.request import urlopen import re import csv f = open('try.html') ff = csv.writer(open("profiletry.csv", "w")) ff.writerow(["cdata"])...

javascript,python,html,web-scraping,beautifulsoup
In my attempt to make a scraper, I found a website that uses javascript a lot in its code. Is it possible to retrieve the output of the script, e.g. <html> <head> <title>Python</title> </head> <body> <script type="text/javascript" src='test.js'></script> <p> some stuff <br> more stuff <br> code <br> video <br> picture <br>...

python,html,beautifulsoup,html-parsing
For example, I have this: result = soup.select('div#test > div.filters > span.text') I want to limit the result of the above list to 10 items. In case of find_all() one can use the limit argument but what about select()?...
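Older versions of `select()` had no `limit` argument, but since it returns a list, ordinary slicing does the job (newer bs4 versions do accept `limit=` on `select` as well). A sketch on made-up markup matching the selector:

```python
from bs4 import BeautifulSoup

html = ("<div id='test'><div class='filters'>"
        + "".join("<span class='text'>item %d</span>" % i for i in range(25))
        + "</div></div>")
soup = BeautifulSoup(html, "html.parser")

# Slice the result list to cap it at 10 items
result = soup.select("div#test > div.filters > span.text")[:10]
print(len(result))  # 10
```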

python,html,xml-parsing,beautifulsoup,html-parsing
I am a newbie looking at HTML code for the first time. For my research I need to know the number of tags and attributes in a webpage. I looked at various parsers and found Beautiful Soup to be one of the most preferred ones. The following code (taken from...
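For the counting itself, `find_all(True)` matches every tag in the document, and each tag's `.attrs` is a dict, so the attribute count is a sum of dict sizes. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div id="a" class="x"><p>hi</p><img src="i.png" alt=""></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; .attrs is a dict per tag
tags = soup.find_all(True)
n_attrs = sum(len(t.attrs) for t in tags)
print(len(tags), n_attrs)  # 3 4
```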

python,regex,string,beautifulsoup
I'm trying to extract some data from a web page. I'm using Beautiful Soup 4 and regexes. The problem is that it returns an error but I can't figure out why the error is raised. Here is a piece of my code: urls = soup.findall('a',href = re.compile(r'/katalog/stavebnictvi/'+'.')) Here is the...
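The likely culprit is the method name: BeautifulSoup defines `find_all` (or the camel-case alias `findAll`), not `findall`. Also note the concatenated `'.'` means the regex only matches hrefs with at least one character after the path. A sketch of the corrected call on made-up anchors:

```python
import re
from bs4 import BeautifulSoup

html = '''<a href="/katalog/stavebnictvi/firma-1">A</a>
<a href="/katalog/stavebnictvi/">B</a>'''
soup = BeautifulSoup(html, "html.parser")

# find_all (not findall); the pattern requires one char after the slash
urls = soup.find_all("a", href=re.compile(r"/katalog/stavebnictvi/."))
print([a["href"] for a in urls])  # ['/katalog/stavebnictvi/firma-1']
```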

python,xml,beautifulsoup,fogbugz,fogbugz-api
I'm trying to parse a list of cases that is returned from the Fogbugz API. Current code: from fogbugz import FogBugz from datetime import datetime, timedelta import fbSettings fb = FogBugz(fbSettings.URL, fbSettings.TOKEN) resp = fb.search(q='project:"Python Testing"',cols='ixBug') print resp.cases.case.ixbug.string The problem is that the XML has multiple cases returned simply as...

python,beautifulsoup
I'm downloading Excel files from a website using beautifulsoup4. I only need to download the files; I don't need to rename them, just download them to a folder relative to where the code is. The function takes in a beautifulsoup call, searches for <a>, then makes a call to the...

python,regex,string,beautifulsoup,html-parsing
I've got a little project where I’m trying to download a series of wallpapers from a web page. I'm new to python. I'm using the urllib library, which is returning a long string of web page data which includes <a href="http://website.com/wallpaper/filename.jpg"> I know that every filename I need to download...

python,beautifulsoup,web-crawler,bs4
I'm new to python and learning it. Basically I am trying to pull all the links from my e-commerce store products that is stored in the html below. I'm getting no results returned though and I can't seem to figure out why not. <h3 class="two-lines-name"> <a title="APPLE IPOD IPOD A1199...
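Assuming each product link is the `<a>` inside an `<h3 class="two-lines-name">`, a CSS selector scoped to that heading collects the hrefs. A sketch with a made-up href value:

```python
from bs4 import BeautifulSoup

html = '''<h3 class="two-lines-name">
<a title="APPLE IPOD A1199" href="/products/ipod-a1199">APPLE IPOD</a>
</h3>'''
soup = BeautifulSoup(html, "html.parser")

# Select only anchors that live inside the product-name headings
links = [a["href"] for a in soup.select("h3.two-lines-name a")]
print(links)  # ['/products/ipod-a1199']
```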

python,unicode,beautifulsoup
I'm playing around with BeautifulSoup, scraping a table and its contents, and I've noticed I get different outputs based on how I end it: if I print it outright I get an output that has no unicode notation. html = urlopen('http://www.bcsfootball.org').read() soup = BeautifulSoup(html) for row in soup('table', {'class':'mod-data'})[0].tbody('tr'):...