
Web Scraping all ResearchGate Publications in Python

by ITECHNEWS
July 1, 2022
in Data Science, Leading Stories

Prerequisites

Basic knowledge of scraping with CSS selectors

If you haven’t scraped with CSS selectors before, I have a dedicated blog post on how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.

Separate virtual environment

If you haven’t worked with a virtual environment before, have a look at my dedicated blog post, the Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

Reduce the chance of being blocked

There’s a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; it covers eleven methods to bypass blocks from most websites.

Install the libraries, then download the Chromium build that Playwright drives:

pip install parsel playwright
playwright install chromium

Full Code

from parsel import Selector
from playwright.sync_api import sync_playwright
import json


def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:

        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

        publications = []
        page_num = 1

        while True:
            page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
            selector = Selector(text=page.content())

            for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
                title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
                title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
                publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
                publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
                publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
                publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
                authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
                source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

                publications.append({
                    "title": title,
                    "link": title_link,
                    "source_link": source_link,
                    "publication_type": publication_type,
                    "publication_date": publication_date,
                    "publication_doi": publication_doi,
                    "publication_isbn": publication_isbn,
                    "authors": authors
                })

            print(f"page number: {page_num}")

            # checks if next page arrow key is greyed out `attr(rel)` (inactive) and breaks out of the loop
            if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
                break
            else:
                page_num += 1


        print(json.dumps(publications, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_publications(query="coffee")

Code explanation

Import libraries:

from parsel import Selector
from playwright.sync_api import sync_playwright
import json

  • parsel: parses HTML/XML documents and supports XPath.
  • playwright: renders the page in a browser instance.
  • json: converts a Python dictionary to a JSON string.

Define a function and open playwright with a context manager:

def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:
        # ...

  • query: str: a type hint indicating that query is expected to be a str. Python does not enforce it at runtime.

Launch a browser instance and open a new page with a custom user-agent:

browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

  • p.chromium.launch(): launches a Chromium browser instance.
  • headless: explicitly tells playwright to run in headless mode, even though that is the default value.
  • slow_mo: slows down playwright operations by the given number of milliseconds (50 here).
  • browser.new_page(): opens a new page. user_agent makes the request look like it comes from a real user's browser. If omitted, the parameter defaults to None and the browser's own user-agent is sent, which in headless Chromium identifies itself as HeadlessChrome. You can check what your own user-agent is.

Add a temporary list and a page counter, set up a while loop, and open a new URL:

publications = []
page_num = 1

while True:
    page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
    selector = Selector(text=page.content())
    # ...

  • goto(): makes a request to the search URL with the passed query and page parameters.
  • Selector(): takes the HTML returned by page.content() and parses it.
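The URL passed to goto() interpolates the query as-is, which works for a single word such as "coffee" but may break for queries with spaces or characters like "&". A minimal sketch of one way to build the URL with percent-encoding, using only the standard library; build_search_url and BASE_SEARCH_URL are hypothetical names, not part of the original script:

```python
from urllib.parse import urlencode

# Same endpoint the script requests; the constant name is an assumption.
BASE_SEARCH_URL = "https://www.researchgate.net/search/publication"


def build_search_url(query: str, page_num: int = 1) -> str:
    """Return the search URL with the query and page number percent-encoded."""
    return f"{BASE_SEARCH_URL}?{urlencode({'q': query, 'page': page_num})}"


print(build_search_url("coffee"))
# https://www.researchgate.net/search/publication?q=coffee&page=1
print(build_search_url("coffee & health", page_num=2))
# https://www.researchgate.net/search/publication?q=coffee+%26+health&page=2
```

The result can then be passed straight to page.goto().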

Iterate over publication results on each page, extract the data, and append it to the temporary list:

for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
    title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
    title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
    publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
    publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
    publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
    publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
    authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
    source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

    publications.append({
        "title": title,
        "link": title_link,
        "source_link": source_link,
        "publication_type": publication_type,
        "publication_date": publication_date,
        "publication_doi": publication_doi,
        "publication_isbn": publication_isbn,
        "authors": authors
    })

  • css(): parses data from the passed CSS selector(s). Every CSS query is translated to XPath by the cssselect package under the hood.
  • ::text / ::attr(attribute): extracts textual or attribute data from the node.
  • get() / getall(): returns the first matched value, or a list of all matched values.
  • xpath("normalize-space()"): returns the full text of the matched node, including text inside nested elements, with surrounding whitespace stripped and internal whitespace collapsed.

Check if the next page is present and paginate:

# checks if the next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
    break
else:
    page_num += 1
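The check above depends on the ninth pagination button carrying a rel attribute on the last page. If that markup ever changes, the while True loop has no other exit. A minimal sketch of a guard, assuming a hypothetical max_pages cap and a fetch_page callable that stands in for page.goto() plus parsing (neither is part of the original script):

```python
from typing import Callable, Iterator, List, Tuple


def paginate(
    fetch_page: Callable[[int], Tuple[List[dict], bool]],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield items page by page until the last page or the page cap is reached."""
    for page_num in range(1, max_pages + 1):
        items, is_last_page = fetch_page(page_num)
        yield from items
        # Stop on the last page, or when a page unexpectedly has no results.
        if is_last_page or not items:
            break


def fake_fetch(page_num: int) -> Tuple[List[dict], bool]:
    """Stand-in for a real page load: three pages with two results each."""
    items = [{"title": f"result {page_num}-{i}"} for i in range(1, 3)]
    return items, page_num == 3


results = list(paginate(fake_fetch))
capped = list(paginate(fake_fetch, max_pages=2))
print(len(results), len(capped))  # 6 4
```

In the real scraper, fetch_page would load the page, run the CSS selectors, and return the extracted publications together with the result of the rel check.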

Print extracted data, and close browser instance:

print(json.dumps(publications, indent=2, ensure_ascii=False))

browser.close()

# call the function
scrape_researchgate_publications(query="coffee")
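The ensure_ascii=False argument keeps non-ASCII characters, such as the ones in author names, readable instead of escaping them. A short sketch of the difference, plus one possible way to write the results to a file rather than printing them; save_publications and the output file name are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Sample record in the same shape the scraper produces.
publications = [{"title": "Example", "authors": ["Gülşen Berat Torusdağ"]}]

escaped = json.dumps(publications)                                 # \u00fc style escapes
readable = json.dumps(publications, indent=2, ensure_ascii=False)  # keeps "ü" as-is


def save_publications(items: list, path: Path) -> None:
    """Write the items as UTF-8 encoded, human-readable JSON."""
    path.write_text(json.dumps(items, indent=2, ensure_ascii=False), encoding="utf-8")


out = Path(tempfile.gettempdir()) / "researchgate_publications.json"
save_publications(publications, out)
restored = json.loads(out.read_text(encoding="utf-8"))

print(escaped)
print(readable)
```

Passing encoding="utf-8" explicitly avoids depending on the platform's default encoding when writing and reading the file.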

Part of the JSON output:

[
   {
      "title":"The Social Life Of Coffee Turkey’S Local Coffees",
      "link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
      "source_link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
      "publication_type":"Conference Paper",
      "publication_date":"Apr 2022",
      "publication_doi":null,
      "publication_isbn":null,
      "authors":[
         "Gülşen Berat Torusdağ",
         "Merve Uçkan Çakır",
         "Cinucen Okat"
      ]
   },
   {
      "title":"Coffee With The Algorithm",
      "link":"https://www.researchgate.netpublication/359599064_Coffee_with_the_Algorithm?_sg=3KHP4SXHm_BSCowhgsa4a2B0xmiOUMyuHX2nfqVwRilnvd1grx55EWuJqO0VzbtuG-16TpsDTUywp0o",
      "source_link":"https://www.researchgate.netNone",
      "publication_type":"Chapter",
      "publication_date":"Mar 2022",
      "publication_doi":"DOI: 10.4324/9781003170884-10",
      "publication_isbn":"ISBN: 9781003170884",
      "authors":[
         "Jakob Svensson"
      ]
   }, ... other publications
   {
      "title":"Coffee In Chhattisgarh", # last publication
      "link":"https://www.researchgate.netpublication/353118247_COFFEE_IN_CHHATTISGARH?_sg=CsJ66DoWjFfkMNdujuE-R9aVTZA4kVb_9lGiy1IrYXls1Nur4XFMdh2s5E9zkF5Skb5ZZzh663USfBA",
      "source_link":"https://www.researchgate.netNone",
      "publication_type":"Technical Report",
      "publication_date":"Jul 2021",
      "publication_doi":null,
      "publication_isbn":null,
      "authors":[
         "Krishan Pal Singh",
         "Beena Nair Singh",
         "Dushyant Singh Thakur",
         "Anurag Kerketta",
         "Shailendra Kumar Sahu"
      ]
   }
]
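In the output above, the links read researchgate.netpublication/... and missing source links show up as researchgate.netNone. This happens because the f-strings concatenate the base URL directly with a relative href, or with None when nothing matched. A minimal sketch of helpers that could be applied to the extracted values, using only the standard library; absolute_url and strip_label are hypothetical names, not part of the original script:

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.researchgate.net/"


def absolute_url(href: Optional[str]) -> Optional[str]:
    """Join a relative href onto the base URL; keep None for missing links."""
    if not href:
        return None
    return urljoin(BASE_URL, href)


def strip_label(value: Optional[str], label: str) -> Optional[str]:
    """Remove a leading label such as 'DOI: ' from an extracted value."""
    if value is None:
        return None
    if value.startswith(label):
        value = value[len(label):]
    return value.strip()


print(absolute_url("publication/359599064_Coffee_with_the_Algorithm"))
# https://www.researchgate.net/publication/359599064_Coffee_with_the_Algorithm
print(absolute_url(None))
# None
print(strip_label("DOI: 10.4324/9781003170884-10", "DOI:"))
# 10.4324/9781003170884-10
```

With these, a missing source link is serialized as null in the JSON output instead of a string ending in "None".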

Links

  • GitHub Gist
Source: Dmitriy Zub