Ujeebu Extract
Introduction
Ujeebu Extract converts a news or blog article into structured JSON data. It extracts the main text and HTML bodies, the author, the publish date, embeddable media such as YouTube videos and Twitter cards, and RSS or social feeds (Facebook/Twitter timelines, YouTube channels), among other relevant pieces of data.
To use the API, subscribe to a plan here and connect to:
GET https://api.ujeebu.com/extract
Parameters
Name | Type | Description | Default Value |
---|---|---|---|
url REQUIRED | string | URL of article to be extracted. | - |
raw_html | string | HTML of article to be extracted. When this is passed, article extraction is carried out on the value of this parameter (i.e. without fetching the article from url); however, the extractor still relies on url to resolve relative links and relatively referenced assets in the provided HTML. | |
js | boolean | indicates whether to execute JavaScript or not. Set to 'auto' to let the extractor decide. | false |
text | boolean | indicates whether API should return extracted text. | true |
html | boolean | indicates whether API should extract html. | true |
media | boolean | indicates whether API should extract media. | false |
feeds | boolean | indicates whether API should extract RSS feeds. | false |
images | boolean | indicates whether API should extract all images present in HTML. | true |
author | boolean | indicates whether API should extract article's author. | true |
pub_date | boolean | indicates whether API should extract article's publish date. | true |
partial | number | number of characters, or percentage (if a percent sign is present), of the text/html to be returned. 0 means all. | 0 |
is_article | boolean | when true, returns the probability [0-1] of the URL being an article. Anything scoring 0.5 and above should be an article, though this may vary slightly from one site to another. | true |
quick_mode | boolean | when true, does a quick analysis of the content instead of the normal advanced parsing. Usually cuts down response time by about 30% to 60%. | false |
strip_tags | csv-string | indicates which tags to strip from the extracted article HTML. Expects a comma separated list of tag names/css selectors. | form |
timeout | number | maximum number of seconds before request timeout | 60 |
js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | timeout /2 |
scroll_down | boolean | indicates whether to scroll down the page; applies only when js is enabled. | true |
image_analysis | boolean | indicates whether API should analyse images for minimum width and height (see parameters min_image_width and min_image_height for more details). | true |
min_image_width | number | minimum width of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 200 |
min_image_height | number | minimum height of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 100 |
image_timeout | number | image fetching timeout in seconds. | 2 |
return_only_enclosed_text_images | boolean | indicates whether to return only images that are enclosed within extracted article HTML. | true |
proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'mobile', 'custom'. | rotating |
proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US |
custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port . applicable and required only if proxy_type=custom | null |
auto_proxy | boolean | when true, automatically falls back to a more advanced proxy option when the rotating proxy is not working. It moves to the next proxy option until the content is retrieved, and stops only when content is available or all options have been exhausted. Please note that you are billed only for the top option attempted. | false |
session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests through the same proxy instance. Sessions remain active for 30 minutes. | null |
pagination | boolean | extract and concatenate multiple-page articles. | true |
pagination_max_pages | number | indicates the maximum number of pages to extract when pagination is enabled. | 30 |
UJB-headerName | string | indicates which headers to send to the target URL. This can be useful when the article is behind a paywall, for example, and you need to pass your authentication cookies. | null |
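To illustrate how these parameters combine on a request, here is a minimal Python sketch using only the standard library. The `build_extract_params` helper and the example article URL are ours, not part of the API; booleans are serialized as `'true'`/`'false'` strings, which is a common query-string convention and an assumption here.

```python
from urllib.parse import urlencode

API_URL = "https://api.ujeebu.com/extract"

def build_extract_params(url, **options):
    """Build query-string parameters for the /extract endpoint.

    Booleans are serialized as 'true'/'false'; any parameter left out
    falls back to the API's documented default.
    """
    params = {"url": url}
    for name, value in options.items():
        if isinstance(value, bool):
            params[name] = "true" if value else "false"
        else:
            params[name] = str(value)
    return params

# Hypothetical request: render JS only when needed, extract media,
# return only the first half of the text/html, via a rotating proxy.
params = build_extract_params(
    "https://example.com/some-article",
    js="auto",
    media=True,
    partial="50%",
    proxy_type="rotating",
)
query = urlencode(params)  # append to API_URL and send with an ApiKey header
```

An actual call would issue `GET API_URL?query` with your `ApiKey` header, as in the cURL examples further down.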
Responses
Status | Meaning | Description | Schema |
---|---|---|---|
200 | OK | successful operation | SuccessResponse |
400 | Bad Request | Invalid parameter value | APIResponseError |
Schemas
Article Schema
{
"url": "string",
"canonical_url": "string",
"title": "string",
"text": "string",
"html": "string",
"summary": "string",
"image": "string",
"images": ["string"],
"media": ["string"],
"language": "string",
"author": "string",
"pub_date": "string",
"modified_date": "string",
"site_name": "string",
"favicon": "string",
"encoding": "string"
}
Properties
Name | Type | Description |
---|---|---|
url | string | the URL parameter. |
canonical_url | string | the final (resolved) URL. |
title | string | the title of the article. |
text | string | the extracted text. |
html | string | the extracted html. |
summary | string | summary (if available) of the article text. |
image | string | main image of the article. |
images | [string] | all images present in article. |
media | [string] | all media present in article. |
language | string | language code of article text. |
author | string | author of article. |
pub_date | string | publication date of article. |
modified_date | string | last modified date of article. |
site_name | string | name of site hosting article. |
favicon | string | favicon of site hosting article. |
encoding | string | character encoding of article text. |
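Putting the schema to use: a small Python sketch (the sample values are ours) that decodes a response body and reads the documented fields defensively, so optional fields that are absent come back as `None` instead of raising a `KeyError`:

```python
import json

# Trimmed response body in the documented shape (values are made up).
raw = '''
{
  "article": {
    "url": "https://example.com/a",
    "title": "Sample title",
    "text": "Body text...",
    "author": "Jane Doe",
    "language": "en"
  },
  "js": false
}
'''

data = json.loads(raw)
article = data.get("article", {})
title = article.get("title")        # "Sample title"
pub_date = article.get("pub_date")  # None: absent in this trimmed sample
```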
Success Response example
{
"article": {
"text": "The killings of unarmed Palestinians by Israeli snipers this past fortnight marks a new chapter in the degradation of human life in Gaza and the Occupied Territories. It also makes one think about the continuing scandal of the Rohingya genocide in Myanmar, followed by their shameful treatment in India, Bangladesh, Indonesia and Thailand, among other Asian countries. Then there is Kashmir, which the Indian military continues to occupy with complete indifference to the lives of the men, women and children who live there. In each case, the targeted communities are Muslim though they are products of very different contextual histories. The global dynamics of genocide are not, in spite of appearances, primarily about Muslimness. The obsession with projecting Muslims as a coordinated global category is a collaborative project of highly specific Western and Muslim political theologies. It should not be viewed as a self-evident, universal fact.\nToday’s genocidal projects move me to reflect on the fate of ethnic, racial and other biominorities in our world. By biominorities I mean those whose difference (ethnic, religious, racial) from their national majorities is seen as a form of bodily threat to the national ethnos. There is something odd about the relationship of such biominorities to the typology of today’s genocidal projects. One type, which the Israeli killings in Gaza exemplify, is what we may call carceral genocide, genocide by confinement, concentration and starvation. The historical prototype of this is to be found in the Nazi concentration camps. The other might be called diasporic genocide, genocide by dispersion, extrusion and uprooting, where biominorities are forced out of any form of stability and forced to move until they die. Palestinians under Israeli occupation represent the first type, Rohingyas represent the second.\nWhat accounts for this bipolar condition of biominorities in this early decade of the 21st century? 
Put another way: why does the Israeli state not simply push Palestinians out of their land using its overwhelming military might, forcing them to join their brethren in other parts of the Middle East or North Africa, or die on the way? Conversely, why did Myanmar not simply create a carceral Rohingya state where this biominority could be confined, policed, starved and “concentrated” to death? These counterfactual questions force one to look more closely at the menu of genocidal strategies in play today.\nIn Myanmar’s case, the key factor, as many commentators have pointed out, is that Rohingyas occupy rich agricultural lands on the Western coast, which are now ripe for building ports and infrastructure across the Bay of Bengal. Rohingyas are deeply embedded in their land, which they have developed over centuries. Incarcerating them is no solution for the Myanmar military. They need to go, and the murder, rape and armed aggression directed at them is intended to push them out. The ethnocidal Buddhist monkhood which provides the ideological fuel for this extrusion is the willing partner of the militarised state. The Buddhist majority of Myanmar is in fact awash in an ocean of minorities, many of which are well-armed, belligerent and based in inaccessible ecological zones. But Rohingyas are not experienced in armed resistance and they are geographically concentrated in land which the state needs for its global projects. Thus, they are ripe for murderous expulsion. While their Muslim identity is a source of ideological fuel for the Buddhist majority, their relative weakness and location in vital global stretches along the Mynamar coast are more relevant.\n\nPARANOID SOVEREIGNTY\nWhy does Israel not follow a similar policy of expulsion, extrusion and displacement in the case of the Palestinian population in the Occupied Territories, including Gaza? Why adopt the option of incarceration and killing with impunity? The fundamental reason is near at hand. 
Palestinians under Israeli rule will not leave willingly because they are the legitimate occupants of their lands and because they have a long tradition of militant resistance, supported at different times by other Middle Eastern states, most recently Iran. They are stubborn and, thus, they have to be concentrated, starved and killed until they elect exit.\nBut there is more to the Israeli case than this. Israel needs its captive Palestinian population for without it neither the current power of the religious right nor the populist authoritarianism of Benjamin Netanhayu has any justification for existence. Like Kurds in Turkey, Jews in Hungary, Muslims in India and other visible biominorities, Palestinians in Israel are the guarantee of a permanent state of paranoid sovereignty. This paranoid sovereignty is Israel’s major claim to the sympathies and armed assistance of the United States since Israel would be far more susceptible to moderate voices if Palestinians were to disappear or exit. An outbreak of democracy is the last thing the Israeli religious and political right want, and the Donald Trump White House also hates any hint of moderation in any of its client states. The Israeli policy of aggressive and ongoing settler colonialism is intended to produce a continuous border theater in which Palestinians are indispensable in the creation of a permanent state of paranoid sovereignty.\nSo, what do the Palestinian and Rohingya cases (extreme ideal types, as it were) teach us? That solutions to the “problem” of biominorities depend on whether you want to keep the despised minority in order to avoid actually producing some semblance of democracy, or whether you want to delink the group from their lands or resources, with no pressing need to use their presence as a pretext for an ever-militant militarised state. 
You either need the minority to keep paranoid sovereignty alive, or you need their resources more than you need their biominor threat.\nWhat then of other genocidal trends we see in different regimes and regions across the world? Do these thoughts about Palestinians and Rohingyas offer us a more general insight? That both Rohingyas and Palestinians are Muslim does not account for the very different ways in which Myanmar and Israel treat them. The loose post 9/11 discourse of the Muslim threat allows the two states (and others) to legitimise their violence, but the global dynamics of genocide are not primarily about Muslimness. The fact is that all nation states rely on some idea, however covert, of ethnic purity and singularity. Biominor plurality is thus always a threat to modern nation states. The question is what combination of extrusion and incarceration a particular nation state finds useful. As they consider the possibilities, Israel and Myanmar offer them two radical options, which just happen to have Muslim communities as their targets. But today’s varieties of genocide are not as much about religion as they are about paranoid and/or predatory nation-states.\nArjun Appadurai is the Goddard Professor of Media, Culture, and Communication at New York University.",
"html": "<p> The killings of unarmed Palestinians by Israeli snipers this past fortnight marks a new chapter in the degradation of human life in Gaza and the Occupied Territories. It also makes one think about the continuing scandal of the Rohingya genocide in Myanmar, followed by their shameful treatment in India, Bangladesh, Indonesia and Thailand, among other Asian countries. Then there is Kashmir, which the Indian military continues to occupy with complete indifference to the lives of the men, women and children who live there. In each case, the targeted communities are Muslim though they are products of very different contextual histories. The global dynamics of genocide are not, in spite of appearances, primarily about Muslimness. The obsession with projecting Muslims as a coordinated global category is a collaborative project of highly specific Western and Muslim political theologies. It should not be viewed as a self-evident, universal fact. </p><p> Today’s genocidal projects move me to reflect on the fate of ethnic, racial and other biominorities in our world. By biominorities I mean those whose difference (ethnic, religious, racial) from their national majorities is seen as a form of bodily threat to the national ethnos. There is something odd about the relationship of such biominorities to the typology of today’s genocidal projects. One type, which the Israeli killings in Gaza exemplify, is what we may call carceral genocide, genocide by confinement, concentration and starvation. The historical prototype of this is to be found in the Nazi concentration camps. The other might be called diasporic genocide, genocide by dispersion, extrusion and uprooting, where biominorities are forced out of any form of stability and forced to move until they die. Palestinians under Israeli occupation represent the first type, Rohingyas represent the second. </p><p> What accounts for this bipolar condition of biominorities in this early decade of the 21st century? 
Put another way: why does the Israeli state not simply push Palestinians out of their land using its overwhelming military might, forcing them to join their brethren in other parts of the Middle East or North Africa, or die on the way? Conversely, why did Myanmar not simply create a carceral Rohingya state where this biominority could be confined, policed, starved and “concentrated” to death? These counterfactual questions force one to look more closely at the menu of genocidal strategies in play today. </p><p> In Myanmar’s case, the key factor, as many commentators have pointed out, is that Rohingyas occupy rich agricultural lands on the Western coast, which are now ripe for building ports and infrastructure across the Bay of Bengal. Rohingyas are deeply embedded in their land, which they have developed over centuries. Incarcerating them is no solution for the Myanmar military. They need to go, and the murder, rape and armed aggression directed at them is intended to push them out. The ethnocidal Buddhist monkhood which provides the ideological fuel for this extrusion is the willing partner of the militarised state. The Buddhist majority of Myanmar is in fact awash in an ocean of minorities, many of which are well-armed, belligerent and based in inaccessible ecological zones. But Rohingyas are not experienced in armed resistance and they are geographically concentrated in land which the state needs for its global projects. Thus, they are ripe for murderous expulsion. While their Muslim identity is a source of ideological fuel for the Buddhist majority, their relative weakness and location in vital global stretches along the Mynamar coast are more relevant. 
</p><figure><a href=\"https://scroll.in/subscribe?utm_source=internal&utm_medium=articleinline\"><img src=\"https://s01.sgp1.digitaloceanspaces.com/inline/879591-llweiqsnvq-1526920853.jpg\" alt=\"\"></a></figure><h3><strong> Paranoid sovereignty </strong></h3><p> Why does Israel not follow a similar policy of expulsion, extrusion and displacement in the case of the Palestinian population in the Occupied Territories, including Gaza? Why adopt the option of incarceration and killing with impunity? The fundamental reason is near at hand. Palestinians under Israeli rule will not leave willingly because they are the legitimate occupants of their lands and because they have a long tradition of militant resistance, supported at different times by other Middle Eastern states, most recently Iran. They are stubborn and, thus, they have to be concentrated, starved and killed until they elect exit. </p><p> But there is more to the Israeli case than this. Israel needs its captive Palestinian population for without it neither the current power of the religious right nor the populist authoritarianism of Benjamin Netanhayu has any justification for existence. Like Kurds in Turkey, Jews in Hungary, Muslims in India and other visible biominorities, Palestinians in Israel are the guarantee of a permanent state of paranoid sovereignty. This paranoid sovereignty is Israel’s major claim to the sympathies and armed assistance of the United States since Israel would be far more susceptible to moderate voices if Palestinians were to disappear or exit. An outbreak of democracy is the last thing the Israeli religious and political right want, and the Donald Trump White House also hates any hint of moderation in any of its client states. The Israeli policy of aggressive and ongoing settler colonialism is intended to produce a continuous border theater in which Palestinians are indispensable in the creation of a permanent state of paranoid sovereignty. 
</p><p> So, what do the Palestinian and Rohingya cases (extreme ideal types, as it were) teach us? That solutions to the “problem” of biominorities depend on whether you want to keep the despised minority in order to avoid actually producing some semblance of democracy, or whether you want to delink the group from their lands or resources, with no pressing need to use their presence as a pretext for an ever-militant militarised state. You either need the minority to keep paranoid sovereignty alive, or you need their resources more than you need their biominor threat. </p><p> What then of other genocidal trends we see in different regimes and regions across the world? Do these thoughts about Palestinians and Rohingyas offer us a more general insight? That both Rohingyas and Palestinians are Muslim does not account for the very different ways in which Myanmar and Israel treat them. The loose post 9/11 discourse of the Muslim threat allows the two states (and others) to legitimise their violence, but the global dynamics of genocide are not primarily about Muslimness. The fact is that all nation states rely on some idea, however covert, of ethnic purity and singularity. Biominor plurality is thus always a threat to modern nation states. The question is what combination of extrusion and incarceration a particular nation state finds useful. As they consider the possibilities, Israel and Myanmar offer them two radical options, which just happen to have Muslim communities as their targets. But today’s varieties of genocide are not as much about religion as they are about paranoid and/or predatory nation-states. </p><p><em> Arjun Appadurai is the Goddard Professor of Media, Culture, and Communication at New York University. </em></p>",
"images": [
"https://s01.sgp1.digitaloceanspaces.com/inline/879591-llweiqsnvq-1526920853.jpg"
],
"author": "Arjun Appadurai",
"pub_date": "2018-05-22 08:00:00",
"is_article": 1,
"url": "https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states",
"canonical_url": "https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states",
"title": "Across the world, genocidal states are attacking Muslims. Is Islam really their target?",
"language": "en",
"image": "https://s01.sgp1.digitaloceanspaces.com/facebook/879591-ylsreufeki-1526956060.jpg",
"summary": "As Israel incarcerates Palestinians and Mynmar drives out its Rohingyas, a reflection on the predicament of ethnic and racial biominorities.",
"modified_date": "2018-05-22 08:00:00",
"site_name": "Scroll.in",
"favicon": "https://scroll.in/static/assets/apple-touch-icon-144x144.b71c766a62abe812b4b37e7f21e91e56.003.png",
"encoding": "utf-8",
"pages": [
"https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states"
]
},
"time": 10.981818914413452,
"js": false,
"pagination": false
}
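The `is_article` field in the response above is a probability rather than a flag; here is a tiny sketch of applying the 0.5 rule of thumb from the parameters table (the `looks_like_article` function name is ours):

```python
def looks_like_article(response_json, threshold=0.5):
    """Return True when the is_article score clears the threshold.

    0.5 is the documented rule of thumb, but it may vary slightly
    from one site to another.
    """
    score = response_json.get("article", {}).get("is_article", 0)
    return score >= threshold

# The example response above scores 1, well above the threshold.
print(looks_like_article({"article": {"is_article": 1}}))  # True
```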
Error Response Schema
{
"url": "string",
"message": "string",
"error_code": "string",
"errors": ["string"]
}
Properties
Name | Type | Description |
---|---|---|
url | string | Given URL |
message | string | Error message |
error_code | string | Error code |
errors | [string] | List of all errors |
Response Codes
Code | Billed | Meaning | Suggestion |
---|---|---|---|
200 | Yes | Successful request | - |
400 | NO | Some required parameter is missing (URL) | Set the missing parameter |
401 | NO | Missing API-KEY | Provide API-KEY |
404 | YES | Provided URL not found | Provide a valid URL |
408 | YES | Request timeout | Increase timeout parameter, use premium proxy or force JS |
429 | NO | Too many requests | Upgrade your plan |
500 | NO | Internal error | Retry the request or contact us |
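One way to act on these codes in client code, sketched in Python; the `decide` helper and its decision names are ours, and the table above remains the source of truth:

```python
def decide(status_code):
    """Map a documented /extract status code to a coarse next step."""
    if status_code == 200:
        return "ok"
    if status_code == 429:
        return "back_off"      # too many requests: slow down or upgrade plan
    if status_code in (408, 500):
        return "retry"         # note: 408 is billed, per the table above
    if status_code in (400, 401, 404):
        return "fix_request"   # correct parameters, API key, or URL first
    return "unknown"
```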
Examples
Stripping tags
If you want to delete some HTML element(s) before extraction is carried out, use the strip_tags parameter
to pass a comma-separated list of CSS selectors of elements to delete.
The example below removes any meta, form and input tags, as well as any element with class hidden:
curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
"https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&strip_tags=meta,form,.hidden,input"
- cURL
- NodeJs
- Python
- Java
- PHP
- Go
curl --location --request GET \
  'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form' \
  --header 'ApiKey: <API Key>'

var request = require('request');
var options = {
  'method': 'GET',
  'url': 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form',
  'headers': {
    'ApiKey': '<API Key>'
  }
};
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});

import requests

url = "https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form"
headers = {
  'ApiKey': '<API Key>'
}

response = requests.request("GET", url, headers=headers)
print(response.text)

OkHttpClient client = new OkHttpClient().newBuilder().build();
Request request = new Request.Builder()
  .url("https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form")
  .method("GET", null)
  .addHeader("ApiKey", "<API Key>")
  .build();
Response response = client.newCall(request).execute();

<?php
$curl = curl_init();
curl_setopt_array($curl, array(
  CURLOPT_URL => 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form',
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => '',
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => 'GET',
  CURLOPT_HTTPHEADER => array(
    'ApiKey: <API Key>'
  ),
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;

package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
)

func main() {
  url := "https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form"

  client := &http.Client{}
  req, err := http.NewRequest("GET", url, nil)
  if err != nil {
    fmt.Println(err)
    return
  }
  req.Header.Add("ApiKey", "<API Key>")

  res, err := client.Do(req)
  if err != nil {
    fmt.Println(err)
    return
  }
  defer res.Body.Close()

  body, err := ioutil.ReadAll(res.Body)
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(string(body))
}
Passing custom headers
curl -i \
-H 'UJB-Username: username' \
-H 'UJB-Authorisation: Basic dXNlcm5hbWU6cGFzc3dvcmQ=' \
-H 'ApiKey: <API Key>' \
-X GET \
https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html
- cURL
- NodeJs
- Python
- Java
- PHP
- Go
curl --location --request GET \
  'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0' \
  --header 'UJB-User-Agent: Custom user agent' \
  --header 'ApiKey: <API Key>'

var request = require('request');
var options = {
  'method': 'GET',
  'url': 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0',
  'headers': {
    'UJB-User-Agent': 'Custom user agent',
    'ApiKey': '<API Key>'
  }
};
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});

import requests

url = "https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0"
headers = {
  'UJB-User-Agent': 'Custom user agent',
  'ApiKey': '<API Key>'
}

response = requests.request("GET", url, headers=headers)
print(response.text)

OkHttpClient client = new OkHttpClient().newBuilder().build();
Request request = new Request.Builder()
  .url("https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0")
  .method("GET", null)
  .addHeader("UJB-User-Agent", "Custom user agent")
  .addHeader("ApiKey", "<API Key>")
  .build();
Response response = client.newCall(request).execute();

<?php
$curl = curl_init();
curl_setopt_array($curl, array(
  CURLOPT_URL => 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0',
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => '',
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => 'GET',
  CURLOPT_HTTPHEADER => array(
    'UJB-User-Agent: Custom user agent',
    'ApiKey: <API Key>'
  ),
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;

package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
)

func main() {
  url := "https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0"

  client := &http.Client{}
  req, err := http.NewRequest("GET", url, nil)
  if err != nil {
    fmt.Println(err)
    return
  }
  req.Header.Add("UJB-User-Agent", "Custom user agent")
  req.Header.Add("ApiKey", "<API Key>")

  res, err := client.Do(req)
  if err != nil {
    fmt.Println(err)
    return
  }
  defer res.Body.Close()

  body, err := ioutil.ReadAll(res.Body)
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(string(body))
}
The code above will return the following response:
{
"article": {
"text": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API.\nExtracting clean article text from blogs and news sites (a.k.a. boilerplate removal) comes in handy for a variety of applications such as offline reading, article narration and generating article previews. It is also often a prerequisite for further content processing such as sentiment analysis, content summarization, text classification and other tasks that fall under the natural language processing (NLP) umbrella.\n\nWHY IS BOILERPLATE REMOVAL A DIFFICULT PROBLEM?\nThe main difficulty in extracting clean text from html lies in determining which blocks of text contribute to the article; different articles use different mark-up, and the text can be located anywhere in the DOM tree. It is also not uncommon for the parts of the DOM that contain the meat of the article to not be contiguous, and include loads of boilerplate (ads, forms, related article snippets...). Some publications also use JavaScript to generate the article content to ward off web scrapers, or simply as a coding preference.\nArticles contain other important info like author and publication date, which are also not straightforward to extract. Take the example of dates. Though you can achieve rather decent date extraction with regular expressions, you might need to identify the actual publication date vs. some other date mentioned in the article. Furthermore, one would need to run tens of regular expressions per supported language, and in doing so dramatically affect performance.\n\nSO HOW DO YOU EXTRACT TEXT AND OTHER DATA FROM A WEB PAGE?\nTwo sets of techniques are commonly used: statistical and Machine Learning based. 
Most statistical methods work by computing heuristics like link density, the frequency of certain characters, distance from the title, etc..., then combining them to form a probability score that represents the likelihood that an html block contains the main article text. A good explanation of these techniques can be found here . Machine learning techniques on the other hand rely on training mathematical models on a large set of documents that are represented by their key features and feeding them into a ML model.\nBoth techniques have their merits, with the statistical/heuristics method being the less computationally intensive of the two, on top of providing acceptable results in most cases. ML based techniques on the other hand tend to work better in complex cases and perform well on outliers, however as with any Machine Learning based algorithms, the quality of the training data is key. The two techniques are also sometimes used in tandem for better accuracy.\nIn some cases, extractors can fail due to a never-seen-before html structure, or simply bad mark-up. In such cases, it's customary to use per-domain rules that rely on CSS and/or DOM selectors. This is obviously a site dependent technique, and cannot be standardized by any means, but might help if we're scraping a small set of known publications, and provided regular checks are performed to make sure their html structure didn't change.\n\nTHE OPEN SOURCE OFFERING\n\nREADABILITY\nReadability is one of the oldest and probably the most used algorithms for text extraction, though it has considerably changed since it was first released. It also has several adaptations in different languages.\n\nMERCURY\nMercury is written in JavaScript and is based on Readability. 
It is also known for using custom site rules.\n\nBOILERPIPE\nBoilerPipe is Java based, uses a heuristics based algorithm just like readability and can be demo'ed here .\n\nDRAGNET\nDragNet uses a combination of heuristics and a Machine Learning Model . It comes pre-trained out of the box but can also be trained on a custom data set .\n\nNEWSPAPER\nNewsPaper is written in Python and is based on a package called Goose , which is also another decent extractor written in Scala. NewsPaper offers the advantage of extracting other data pieces like main keywords and article summary.\n\nUJEEBU VS. OPEN SOURCE\nUjeebu uses heuristics much like the other packages, from which it draws heavily, but resorts to a model to determine which heuristics to use. It also uses a model to determine if JavaScript is needed. This is paramount since JavaScript execution can dramatically slow down the extraction process, so it's important to know if it's needed or not upfront. Ujeebu supports extraction on multi-page articles, can identify rich media on demand and has built-in proxy and IP rotation.\nIn what follows, we compare the capabilities of Ujeebu with those of open source tools.\n\nPERFORMANCE\nWe ran Ujeebu and the aforementioned open source packages against a list of 338 URLs , then compared their output against the manually extracted version of those articles. Our sample represents 9 of the most languages on the Web. Namely, English, Spanish, Chinese, Russian, German, French, Portuguese, Arabic and Italian.\nOn the open source front, Readability stands out on top. We used the default version of DragNet, so the results were not the greatest, but pretty sure we could have had (much) better results had we trained it on our own multilingual model. 
Mercury on the other hand performed pretty well on western languages, but didn't do as well on Arabic, Russian and Chinese.\nUjeebu scores better across the board and on all languages, slightly outperforming Readability on text and html extraction, but besting all extractors on the rest of data with a large margin.\nThe extraction scores (out of 100) are based on computing text similarity between each extractor's output and the manual data set:\nExtractor Text Title Author Publication Date\nUjeebu 95.21 91.4 61.52 48.63\nBoilerpipe 88.92 - - -\nDragNet 75.95 - - -\nMercury 62.76 60.92 12.5 25.65\nNewsPaper 90.07 92.5 - 26.76\nReadability 94.85 87.84 32.64 -\n\nCONCLUSION\nWhile the current open source offering exhibits decent performance for text extraction, Ujeebu extracts more info from articles, and incorporates several capabilities from the get go which would require substantial effort to get right if done in-house (pagination, rich media, rendering js heavy pages, using a proxy, etc...).\nDon't take our word for it though, feel free to experiment with our test set here , try Ujeebu for yourself on our demo page , or get your own trial key and get started by using one of our several examples in the language of your choice .",
"html": "<p> Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API. </p><p> Extracting clean article text from blogs and news sites (a.k.a. boilerplate removal) comes in handy for a variety of applications such as offline reading, article narration and generating article previews. It is also often a prerequisite for further content processing such as sentiment analysis, content summarization, text classification and other tasks that fall under the natural language processing (NLP) umbrella. </p><h2> Why is boilerplate removal a difficult problem? </h2><p> The main difficulty in extracting clean text from html lies in determining which blocks of text contribute to the article; different articles use different mark-up, and the text can be located anywhere in the DOM tree. It is also not uncommon for the parts of the DOM that contain the meat of the article to not be contiguous, and include loads of boilerplate (ads, forms, related article snippets...). Some publications also use JavaScript to generate the article content to ward off web scrapers, or simply as a coding preference. </p><p> Articles contain other important info like author and publication date, which are also not straightforward to extract. Take the example of dates. Though you can achieve rather decent date extraction with regular expressions, you might need to identify the actual publication date vs. some other date mentioned in the article. Furthermore, one would need to run tens of regular expressions per supported language, and in doing so dramatically affect performance. </p><h2> So how do you extract text and other data from a web page? </h2><p> Two sets of techniques are commonly used: statistical and Machine Learning based. 
Most statistical methods work by computing heuristics like link density, the frequency of certain characters, distance from the title, etc..., then combining them to form a probability score that represents the likelihood that an html block contains the main article text. A good explanation of these techniques can be found <a href=\"https://stackoverflow.com/questions/3652657/what-algorithm-does-readability-use-for-extracting-text-from-urls\"> here </a> . Machine learning techniques on the other hand rely on training mathematical models on a large set of documents that are represented by their key features and feeding them into a ML model. </p><p> Both techniques have their merits, with the statistical/heuristics method being the less computationally intensive of the two, on top of providing acceptable results in most cases. ML based techniques on the other hand tend to work better in complex cases and perform well on outliers, however as with any Machine Learning based algorithms, the quality of the training data is key. The two techniques are also sometimes used in tandem for better accuracy. </p><p> In some cases, extractors can fail due to a never-seen-before html structure, or simply bad mark-up. In such cases, it's customary to use per-domain rules that rely on CSS and/or DOM selectors. This is obviously a site dependent technique, and cannot be standardized by any means, but might help if we're scraping a small set of known publications, and provided regular checks are performed to make sure their html structure didn't change. </p><h2> The Open source offering </h2><h3> Readability </h3><p><a href=\"https://www.readability.com\"> Readability </a> is one of the oldest and probably the most used algorithms for text extraction, though it has considerably changed since it was first released. It also has several <a href=\"https://github.com/masukomi/arc90-readability\"> adaptations </a> in different languages. 
</p><h3> Mercury </h3><p><a href=\"https://mercury.postlight.com/web-parser/\"> Mercury </a> is written in JavaScript and is based on Readability. It is also known for using custom site rules. </p><h3> BoilerPipe </h3><p><a href=\"https://github.com/kohlschutter/boilerpipe\"> BoilerPipe </a> is Java based, uses a heuristics based algorithm just like readability and can be <a href=\"https://boilerpipe-web.appspot.com/\"> demo'ed here </a> . </p><h3> DragNet </h3><p><a href=\"https://github.com/dragnet-org/dragnet\"> DragNet </a> uses a <a href=\"https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets\"> combination of heuristics and a Machine Learning Model </a> . It comes pre-trained out of the box but can also be <a href=\"https://techblog.gumgum.com/articles/text-extraction-using-dragnet-and-diffbot\"> trained on a custom data set </a> . </p><h3> NewsPaper </h3><p><a href=\"https://newspaper.readthedocs.io/en/latest/\"> NewsPaper </a> is written in Python and is based on a package called <a href=\"https://github.com/GravityLabs/goose\"> Goose </a> , which is also another decent extractor written in Scala. NewsPaper offers the advantage of extracting other data pieces like main keywords and article summary. </p><h2> Ujeebu vs. Open Source </h2><p> Ujeebu uses heuristics much like the other packages, from which it draws heavily, but resorts to a model to determine which heuristics to use. It also uses a model to determine if JavaScript is needed. This is paramount since JavaScript execution can dramatically slow down the extraction process, so it's important to know if it's needed or not upfront. Ujeebu supports extraction on multi-page articles, can identify rich media on demand and has built-in proxy and IP rotation. </p><p> In what follows, we compare the capabilities of Ujeebu with those of open source tools. 
</p><figure><img src=\"https://ujeebu.com/blog/content/images/2019/08/ujeebu-opensource-comparison-2.png\" alt=\"\"><figcaption> Feature Comparison: Ujeebu extraction API vs. Open Source </figcaption></figure><h2> Performance </h2><p> We ran Ujeebu and the aforementioned open source packages against a list of <a href=\"https://github.com/ujeebu/api-examples/blob/master/ujeebu-benchmark-urls.json\"> 338 URLs </a> , then compared their output against the manually extracted version of those articles. Our sample represents 9 of the most languages on the Web. Namely, English, Spanish, Chinese, Russian, German, French, Portuguese, Arabic and Italian. </p><p> On the open source front, Readability stands out on top. We used the default version of DragNet, so the results were not the greatest, but pretty sure we could have had (much) better results had we trained it on our own multilingual model. Mercury on the other hand performed pretty well on western languages, but didn't do as well on Arabic, Russian and Chinese. </p><p> Ujeebu scores better across the board and on all languages, slightly outperforming Readability on text and html extraction, but besting all extractors on the rest of data with a large margin. 
</p><p> The extraction scores (out of 100) are based on computing text similarity between each extractor's output and the manual data set: </p><table id=\"results\"><tr><th> Extractor </th><th> Text </th><th> Title </th><th> Author </th><th> Publication Date </th></tr><tr><td> Ujeebu </td><td><b> 95.21 </b></td><td> 91.4 </td><td><b> 61.52 </b></td><td><b> 48.63 </b></td></tr><tr><td> Boilerpipe </td><td> 88.92 </td><td> - </td><td> - </td><td> - </td></tr><tr><td> DragNet </td><td> 75.95 </td><td> - </td><td> - </td><td> - </td></tr><tr><td> Mercury </td><td> 62.76 </td><td> 60.92 </td><td> 12.5 </td><td> 25.65 </td></tr><tr><td> NewsPaper </td><td> 90.07 </td><td><b> 92.5 </b></td><td> - </td><td> 26.76 </td></tr><tr><td> Readability </td><td> 94.85 </td><td> 87.84 </td><td> 32.64 </td><td> - </td></tr></table><h2> Conclusion </h2><p> While the current open source offering exhibits decent performance for text extraction, Ujeebu extracts more info from articles, and incorporates several capabilities from the get go which would require substantial effort to get right if done in-house (pagination, rich media, rendering js heavy pages, using a proxy, etc...). </p><p> Don't take our word for it though, feel free to <a href=\"https://github.com/ujeebu/api-examples/blob/master/ujeebu-benchmark-urls.json\"> experiment with our test set here </a> , <a href=\"https://ujeebu.com/blog/demo\"> try Ujeebu for yourself on our demo page </a> , or <a href=\"https://ujeebu.com/blog/pricing\"> get your own trial key </a> and get started by using one of our <a href=\"https://ujeebu.com/blog/docs\"> several examples in the language of your choice </a> . </p>",
"images": [
"https://ujeebu.com/blog/content/images/2019/08/ujeebu-opensource-comparison-2.png"
],
"author": "Sam",
"pub_date": "2019-08-09 12:42:25",
"is_article": 1,
"url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html",
"canonical_url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html/",
"title": "Extracting clean data from blog and news articles",
"language": "en",
"image": "https://ujeebu.com/blog/content/images/2021/05/ujb-blog-benchmark.png",
"summary": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API",
"modified_date": "2021-05-02 20:22:34",
"site_name": "Ujeebu blog",
"favicon": "https://ujeebu.com/blog/favicon.png",
"encoding": "utf-8",
"pages": ["https://ujeebu.com/blog/how-to-extract-clean-text-from-html/"]
},
"time": 6.366053104400635,
"js": false,
"pagination": false
}
Using Proxies​
We realize your scraping activities might be blocked once in a while. To help you achieve the most success, we developed a multi-tiered proxy offering that lets you select the proxy type that best fits your needs.
Your API calls go through our rotating proxy by default.
The default proxy uses top IPs that will get the job done most of the time.
If the default rotating proxy works well for your needs, there's no need to do anything.
For tougher URLs, set proxy_type to one of the following options:
- advanced: US IPs only.
- premium: US IPs only. Premium proxies which work well with social media and shopping sites.
- residential: Geo-targeted residential IPs which work on "tough" sites that aren't accessible with the options above. Please note that data scraped via residential IPs is currently metered on requests exceeding 5MB.
- mobile: US IPs only. If geo targeting is not a primary concern of yours, these IPs are even more stealthy than our residential proxies. Like residential, this option is also currently metered and incurs additional credits past the 5MB request length mark.
- custom: Set your own proxy; see the custom proxy section.
tip
We won't bill for failing requests that aren't 404s.
info
A request's length also includes assets downloaded with the page when JS rendering is on.
info
To use a premium proxy from a specific country, set the proxy_country parameter to the ISO 3166-1 alpha-2 country code of one of the following:
Supported countries
- Algeria: DZ
- Angola: AO
- Benin: BJ
- Botswana: BW
- Burkina Faso: BF
- Burundi: BI
- Cameroon: CM
- Central African Republic: CF
- Chad: TD
- Democratic Republic of the Congo: CD
- Djibouti: DJ
- Egypt: EG
- Equatorial Guinea: GQ
- Eritrea: ER
- Ethiopia: ET
- Gabon: GA
- Gambia: GM
- Ghana: GH
- Guinea: GN
- Guinea Bissau: GW
- Ivory Coast: CI
- Kenya: KE
- Lesotho: LS
- Liberia: LR
- Libya: LY
- Madagascar: MG
- Malawi: MW
- Mali: ML
- Mauritania: MR
- Morocco: MA
- Mozambique: MZ
- Namibia: NA
- Niger: NE
- Nigeria: NG
- Republic of the Congo: CG
- Rwanda: RW
- Senegal: SN
- Sierra Leone: SL
- Somalia: SO
- Somaliland: ML
- South Africa: ZA
- South Sudan: SS
- Sudan: SD
- Swaziland: SZ
- Tanzania: TZ
- Togo: TG
- Tunisia: TN
- Uganda: UG
- Western Sahara: EH
- Zambia: ZM
- Zimbabwe: ZW
- Afghanistan: AF
- Armenia: AM
- Azerbaijan: AZ
- Bangladesh: BD
- Bhutan: BT
- Brunei: BN
- Cambodia: KH
- China: CN
- East Timor: TL
- Hong Kong: HK
- India: IN
- Indonesia: ID
- Iran: IR
- Iraq: IQ
- Israel: IL
- Japan: JP
- Jordan: JO
- Kazakhstan: KZ
- Kuwait: KW
- Kyrgyzstan: KG
- Laos: LA
- Lebanon: LB
- Malaysia: MY
- Maldives: MV
- Mongolia: MN
- Myanmar: MM
- Nepal: NP
- North Korea: KP
- Oman: OM
- Pakistan: PK
- Palestine: PS
- Philippines: PH
- Qatar: QA
- Saudi Arabia: SA
- Singapore: SG
- South Korea: KR
- Sri Lanka: LK
- Syria: SY
- Taiwan: TW
- Tajikistan: TJ
- Thailand: TH
- Turkey: TR
- Turkmenistan: TM
- United Arab Emirates: AE
- Uzbekistan: UZ
- Vietnam: VN
- Yemen: YE
- Albania: AL
- Andorra: AD
- Austria: AT
- Belarus: BY
- Belgium: BE
- Bosnia and Herzegovina: BA
- Bulgaria: BG
- Croatia: HR
- Cyprus: CY
- Czech Republic: CZ
- Denmark: DK
- Estonia: EE
- Finland: FI
- France: FR
- Germany: DE
- Gibraltar: GI
- Greece: GR
- Hungary: HU
- Iceland: IS
- Ireland: IE
- Italy: IT
- Kosovo: XK
- Latvia: LV
- Liechtenstein: LI
- Lithuania: LT
- Luxembourg: LU
- Macedonia: MK
- Malta: MT
- Moldova: MD
- Monaco: MC
- Montenegro: ME
- Netherlands: NL
- Northern Cyprus: CY
- Norway: NO
- Poland: PL
- Portugal: PT
- Romania: RO
- Russia: RU
- San Marino: SM
- Serbia: RS
- Slovakia: SK
- Slovenia: SI
- Spain: ES
- Sweden: SE
- Switzerland: CH
- Ukraine: UA
- United Kingdom: GB
- Bahamas: BS
- Belize: BZ
- Bermuda: BM
- Canada: CA
- Costa Rica: CR
- Cuba: CU
- Dominican Republic: DO
- El Salvador: SV
- Greenland: GL
- Guatemala: GT
- Haiti: HT
- Honduras: HN
- Jamaica: JM
- Nicaragua: NI
- Panama: PA
- Puerto Rico: PR
- Trinidad And Tobago: TT
- United States: US
- Australia: AU
- Fiji: FJ
- New Caledonia: NC
- New Zealand: NZ
- Papua New Guinea: PG
- Solomon Islands: SB
- Vanuatu: VU
- Argentina: AR
- Bolivia: BO
- Brazil: BR
- Chile: CL
- Colombia: CO
- Ecuador: EC
- Falkland Islands: FK
- French Guiana: GF
- Guyana: GY
- Mexico: MX
- Paraguay: PY
- Peru: PE
- Suriname: SR
- Uruguay: UY
- Venezuela: VE
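As an illustration, here is how a request routed through a German premium proxy could be built. The endpoint and parameter names come from this documentation; the article URL is a placeholder, and building the query string with urlencode is just one convenient way to do it:

```python
from urllib.parse import urlencode

# Illustrative sketch: an extract request routed through a German premium proxy.
params = {
    "url": "https://example.com/some-article",  # placeholder target URL
    "proxy_type": "premium",
    "proxy_country": "DE",  # ISO 3166-1 alpha-2 code from the list above
}
request_url = "https://api.ujeebu.com/extract?" + urlencode(params)
print(request_url)
# Send this URL with your HTTP client of choice, passing the
# 'ApiKey: <API Key>' header.
```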
Using Ujeebu Extract with your own proxy​
To use your own proxy, set the proxy_type parameter to custom, then set the custom_proxy parameter to your proxy in the format scheme://host:port, and set the proxy credentials using the custom_proxy_username and custom_proxy_password parameters.
info
If you're using the GET HTTP verb and custom_proxy_password contains special characters, URL-encode it before passing it.
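Python's standard library can do this encoding; the password value below is made up for illustration:

```python
from urllib.parse import quote

password = "p@ss:w/rd#2024"  # hypothetical password with special characters
# safe="" ensures ':' and '/' are encoded too
encoded = quote(password, safe="")
print(encoded)  # p%40ss%3Aw%2Frd%232024
```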
curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
'https://api.ujeebu.com/scrape?url=https://ipinfo.io&response_type=raw&proxy_type=custom&custom_proxy=http://proxyhost:8889&custom_proxy_username=user&custom_proxy_password=pass'
Credits​
Credit cost per request:
Proxy Type | No JS | w/ JS | Geo Targeting | Metered |
---|---|---|---|---|
rotating | 5 | 10 | US | No |
advanced | 10 | 15 | US | No |
premium | 12 | 17 | US | No |
residential | 10 | 20 | Multiple countries | +2 credits per MB after 5MB |
mobile | 30 | 35 | US | +2 credits per MB after 5MB |
custom | 5 | 10 | Custom | No |
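As a rough sketch, the table above can be turned into a cost estimator. The helper below is illustrative, not part of the API; in particular, rounding the metered overage up to whole megabytes is an assumption, not documented behavior:

```python
import math

# Credit cost table from above: (no-JS cost, with-JS cost, metered?)
COSTS = {
    "rotating":    (5, 10, False),
    "advanced":    (10, 15, False),
    "premium":     (12, 17, False),
    "residential": (10, 20, True),
    "mobile":      (30, 35, True),
    "custom":      (5, 10, False),
}

def estimate_credits(proxy_type, js=False, request_mb=0.0):
    """Estimate the credit cost of a single request (hypothetical helper)."""
    no_js, with_js, metered = COSTS[proxy_type]
    credits = with_js if js else no_js
    if metered and request_mb > 5:
        # +2 credits per MB past the 5MB mark (rounded up here, an assumption)
        credits += 2 * math.ceil(request_mb - 5)
    return credits

print(estimate_credits("residential", js=True, request_mb=8))  # 26
```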
Article Preview API​
Introduction​
Extracts a preview of an article (article card). This is faster than the extract endpoint as it doesn't do any in-depth analysis of the content of the article, and instead mostly relies on its meta tags.
To use the API, subscribe to a plan here and connect to:
GET https://api.ujeebu.com/card
Parameters​
Name | Type | Description | Default Value |
---|---|---|---|
url REQUIRED | string | URL of article to be extracted. | - |
js | boolean | indicates whether to execute JavaScript or not, set to 'auto' to let the extractor decide. | false |
timeout | number | maximum number of seconds before request timeout. | 60 |
js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | 60 |
proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'mobile', 'custom'. | rotating |
proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US |
custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port. Applicable and required only if proxy_type=custom. | null |
auto_proxy | boolean | enables a more advanced proxy when the default rotating proxy fails. The API moves to the next proxy option until content is retrieved, stopping when content is available or all options have been tried. Note that you are billed only for the most advanced option attempted. | false |
session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests through the same proxy instance. Sessions remain active for 30 minutes. | null |
UJB-headerName | string | indicates which headers to send to the target URL. This can be useful when, for example, the article is behind a paywall and you need to pass your authentication cookies. | null |
Responses​
Status | Meaning | Description | Schema |
---|---|---|---|
200 | OK | successful operation | SuccessResponse |
400 | Bad Request | Invalid parameter value | APIResponseError |
Schemas​
Article Card Schema​
{
"url": "string",
"lang": "string",
"favicon": "string",
"title": "string",
"summary": "string",
"author": "string",
"date_published": "string",
"date_modified": "string",
"image": "string",
"site_name": "string"
}
Properties​
Name | Type | Description |
---|---|---|
url | string | the URL parameter. |
lang | string | the language of the article. |
favicon | string | Domain favicon. |
title | string | the title of the article. |
summary | string | the description of the article. |
author | string | the author of the article. |
date_published | string | the publish date of the article. |
date_modified | string | the modified date of the article. |
image | string | the main image of the article. |
site_name | string | the site name of the article. |
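On the client side, the card fields above map naturally onto a small typed container. A minimal Python sketch (the ArticleCard class and the sample dict are illustrative, not part of the API):

```python
from dataclasses import dataclass, fields

@dataclass
class ArticleCard:
    # Field names mirror the Article Card schema above; all values are strings.
    url: str = ""
    lang: str = ""
    favicon: str = ""
    title: str = ""
    summary: str = ""
    author: str = ""
    date_published: str = ""
    date_modified: str = ""
    image: str = ""
    site_name: str = ""

    @classmethod
    def from_response(cls, data: dict) -> "ArticleCard":
        # Ignore extra keys (e.g. 'time') that the response may carry alongside the card.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})

card = ArticleCard.from_response({"title": "Sample", "lang": "en", "time": 1.5})
print(card.title, card.lang)
```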
Error Response Schema​
{
"url": "string",
"message": "string",
"errors": ["string"]
}
Properties​
Name | Type | Description |
---|---|---|
url | string | Given URL |
message | string | Error message |
errors | [string] | List of all errors |
Code Example​
This will get the meta info for the article https://ujeebu.com/blog/how-to-extract-clean-text-from-html/.
- cURL
- NodeJs
- Python
- Java
- PHP
- Go
curl --location --request GET 'https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0' \
--header 'ApiKey: <API Key>'
var request = require('request');
var options = {
  'method': 'GET',
  'url': 'https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0',
  'headers': {
    'ApiKey': '<API Key>'
  }
};
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});
import requests

url = "https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0"
payload = {}
headers = {'ApiKey': '<API Key>'}

response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
OkHttpClient client = new OkHttpClient().newBuilder().build();
Request request = new Request.Builder()
  .url("https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0")
  .method("GET", null)
  .addHeader("ApiKey", "<API Key>")
  .build();
Response response = client.newCall(request).execute();
<?php
$curl = curl_init();
curl_setopt_array($curl, array(
  CURLOPT_URL => 'https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0',
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => '',
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => 'GET',
  CURLOPT_HTTPHEADER => array('ApiKey: <API Key>'),
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
)

func main() {
  url := "https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0"
  method := "GET"
  client := &http.Client{}
  req, err := http.NewRequest(method, url, nil)
  if err != nil {
    fmt.Println(err)
    return
  }
  req.Header.Add("ApiKey", "<API Key>")
  res, err := client.Do(req)
  if err != nil {
    fmt.Println(err)
    return
  }
  defer res.Body.Close()
  body, err := ioutil.ReadAll(res.Body)
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(string(body))
}
The code above will generate the following response:
{
"author": "Sam",
"title": "Extracting clean data from blog and news articles",
"summary": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API",
"date_published": "2019-08-09 12:42:25",
"date_modified": "2021-05-02 20:22:34",
"favicon": "https://ujeebu.com/blog/favicon.png",
"charset": "utf-8",
"image": "https://ujeebu.com/blog/content/images/2021/05/ujb-blog-benchmark.png",
"lang": "en",
"keywords": [],
"site_name": "Ujeebu blog",
"time": 1.501387119293213
}
Usage endpoint​
Introduction​
To keep track of how much credit you're using programmatically, use the /account endpoint.
Calls to this endpoint do not count against your requests-per-second limit, but you can make at most 10 /account calls per minute.
To use the API:
GET https://api.ujeebu.com/account
Usage Endpoint Code Example​
This will get the current usage of the account with the given ApiKey.
- cURL
- NodeJs
- Python
- Java
- PHP
- Go
curl --location --request GET 'https://api.ujeebu.com/account' \
--header 'ApiKey: <API Key>'
var request = require('request');
var options = {
  'method': 'GET',
  'url': 'https://api.ujeebu.com/account',
  'headers': {
    'ApiKey': '<API Key>'
  }
};
request(options, function (error, response) {
  if (error) throw new Error(error);
  console.log(response.body);
});
import requests

url = "https://api.ujeebu.com/account"
payload = {}
headers = {'ApiKey': '<API Key>'}

response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
OkHttpClient client = new OkHttpClient().newBuilder().build();
Request request = new Request.Builder()
  .url("https://api.ujeebu.com/account")
  .method("GET", null)
  .addHeader("ApiKey", "<API Key>")
  .build();
Response response = client.newCall(request).execute();
<?php
$curl = curl_init();
curl_setopt_array($curl, array(
  CURLOPT_URL => 'https://api.ujeebu.com/account',
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_ENCODING => '',
  CURLOPT_MAXREDIRS => 10,
  CURLOPT_TIMEOUT => 0,
  CURLOPT_FOLLOWLOCATION => true,
  CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
  CURLOPT_CUSTOMREQUEST => 'GET',
  CURLOPT_HTTPHEADER => array('ApiKey: <API Key>'),
));
$response = curl_exec($curl);
curl_close($curl);
echo $response;
package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
)

func main() {
  url := "https://api.ujeebu.com/account"
  method := "GET"
  client := &http.Client{}
  req, err := http.NewRequest(method, url, nil)
  if err != nil {
    fmt.Println(err)
    return
  }
  req.Header.Add("ApiKey", "<API Key>")
  res, err := client.Do(req)
  if err != nil {
    fmt.Println(err)
    return
  }
  defer res.Body.Close()
  body, err := ioutil.ReadAll(res.Body)
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(string(body))
}
The code above will generate the following response:
{
"balance": 0,
"days_till_next_billing": 0,
"next_billing_date": null,
"plan": "TRIAL",
"quota": "5000",
"requests_per_second": "1",
"total_requests": 14,
"used": 95,
"used_percent": 1.9,
"userid": "8155"
}
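Note that quota and requests_per_second come back as strings in this sample, so cast them before doing arithmetic. A minimal sketch of deriving remaining credit from such a response (the dict below just mirrors a subset of the sample above):

```python
account = {"quota": "5000", "used": 95}  # subset of the sample response above

quota = int(account["quota"])  # quota arrives as a string
remaining = quota - account["used"]
used_percent = round(100 * account["used"] / quota, 1)
print(remaining, used_percent)  # 4905 1.9
```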