
Ujeebu Extract

Introduction

Ujeebu Extract converts a news or blog article into structured JSON data. It extracts the main text and HTML bodies, the author, the publish date, any embeddable media such as YouTube and Twitter cards, and the RSS or social feeds (Facebook/Twitter timelines or YouTube channels), among other relevant pieces of data.

To use the API, subscribe to a plan and connect to:


GET https://api.ujeebu.com/extract

Parameters

| Name | Type | Description | Default Value |
|---|---|---|---|
| url (REQUIRED) | string | URL of article to be extracted. | - |
| raw_html | string | HTML of article to be extracted. When this is passed, article extraction is carried out on the value of this parameter (i.e. without fetching the article from url); however, the extractor still relies on url to resolve relative links and relatively referenced assets in the provided HTML. | |
| js | boolean | indicates whether to execute JavaScript or not. Set to 'auto' to let the extractor decide. | false |
| text | boolean | indicates whether API should return extracted text. | true |
| html | boolean | indicates whether API should extract html. | true |
| media | boolean | indicates whether API should extract media. | false |
| feeds | boolean | indicates whether API should extract RSS feeds. | false |
| images | boolean | indicates whether API should extract all images present in HTML. | true |
| author | boolean | indicates whether API should extract article's author. | true |
| pub_date | boolean | indicates whether API should extract article's publish date. | true |
| partial | number | number of characters, or percentage if a percent sign is present, of text/html to be returned. 0 means all. | 0 |
| is_article | boolean | when true, returns the probability [0-1] of URL being an article. Anything scoring 0.5 and above should be an article, but this may vary slightly from one site to another. | true |
| quick_mode | boolean | when true, does a quick analysis of the content instead of the normal advanced parsing. Usually cuts down response time by about 30% to 60%. | false |
| strip_tags | csv-string | indicates which tags to strip from the extracted article HTML. Expects a comma-separated list of tag names/CSS selectors. | form |
| timeout | number | maximum number of seconds before the request times out. | 60 |
| js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | timeout/2 |
| scroll_down | boolean | indicates whether to scroll down the page or not. This applies only when js is enabled. | true |
| image_analysis | boolean | indicates whether API should analyse images for minimum width and height (see parameters min_image_width and min_image_height for more details). | true |
| min_image_width | number | minimum width of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 200 |
| min_image_height | number | minimum height of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 100 |
| image_timeout | number | image fetching timeout in seconds. | 2 |
| return_only_enclosed_text_images | boolean | indicates whether to return only images that are enclosed within extracted article HTML. | true |
| proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'custom'. | rotating |
| proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US |
| custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port. Applicable and required only if proxy_type=custom. | null |
| auto_proxy | string | enable a more advanced proxy by default when rotating proxy is not working. It will move to the next proxy option until it gets the content and will only stop when content is available or none of the options worked. Please note that you are billed only on the top option attempted. | false |
| session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests from the same proxy instance. Sessions remain active for 30 minutes. | null |
| pagination | boolean | extract and concatenate multiple-page articles. | true |
| pagination_max_pages | string | indicates the number of pages to extract when pagination is enabled. | 30 |
| UJB-headerName | string | indicates which headers to send to target URL. This can be useful when the article is behind a paywall, for example, and you need to pass your authentication cookies. | null |
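The parameters above are passed as a query string on the GET endpoint. The following is a minimal sketch of assembling such a URL with Python's standard library. The helper name `build_extract_url` is hypothetical, and sending booleans as lowercase `true`/`false` is an assumption (the curl examples on this page pass `js=0`, so numeric flags appear to be accepted as well).

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.ujeebu.com/extract"  # endpoint given on this page


def build_extract_url(article_url, **params):
    """Return the full GET URL for an extract call (hypothetical helper)."""
    if params.get("proxy_type") == "custom" and not params.get("custom_proxy"):
        # custom_proxy is documented as required when proxy_type=custom
        raise ValueError("custom_proxy is required when proxy_type=custom")

    query = {"url": article_url}
    for name, value in params.items():
        if value is None:
            continue  # leave unset so the documented default applies
        if isinstance(value, bool):
            # Assumption: booleans are accepted as lowercase true/false
            value = "true" if value else "false"
        query[name] = value
    # urlencode percent-encodes values, so "50%" is sent as "50%25"
    return API_ENDPOINT + "?" + urlencode(query)


print(build_extract_url(
    "https://example.com/post",
    js=True,
    partial="50%",
    strip_tags="form,aside",
))
```

Leaving a parameter out, rather than sending an explicit value, keeps the request short and lets the defaults in the table apply.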

Responses

| Status | Meaning | Description | Schema |
|---|---|---|---|
| 200 | OK | successful operation | SuccessResponse |
| 400 | Bad Request | Invalid parameter value | APIResponseError |

Schemas

Article Schema

{
"url": "string",
"canonical_url": "string",
"title": "string",
"text": "string",
"html": "string",
"summary": "string",
"image": "string",
"images": ["string"],
"media": ["string"],
"language": "string",
"author": "string",
"pub_date": "string",
"modified_date": "string",
"site_name": "string",
"favicon": "string",
"encoding": "string"
}

Properties

| Name | Type | Description |
|---|---|---|
| url | string | the URL parameter. |
| canonical_url | string | the final (resolved) URL. |
| title | string | the title of the article. |
| text | string | the extracted text. |
| html | string | the extracted html. |
| summary | string | summary (if available) of the article text. |
| image | string | main image of the article. |
| images | [string] | all images present in article. |
| media | [string] | all media present in article. |
| language | string | language code of article text. |
| author | string | author of article. |
| pub_date | string | publication date of article. |
| modified_date | string | last modified date of article. |
| site_name | string | name of site hosting article. |
| favicon | string | favicon of site hosting article. |
| encoding | string | character encoding of article text. |

Success Response example

{
"article": {
"text": "The killings of unarmed Palestinians by Israeli snipers this past fortnight marks a new chapter in the degradation of human life in Gaza and the Occupied Territories. It also makes one think about the continuing scandal of the Rohingya genocide in Myanmar, followed by their shameful treatment in India, Bangladesh, Indonesia and Thailand, among other Asian countries. Then there is Kashmir, which the Indian military continues to occupy with complete indifference to the lives of the men, women and children who live there. In each case, the targeted communities are Muslim though they are products of very different contextual histories. The global dynamics of genocide are not, in spite of appearances, primarily about Muslimness. The obsession with projecting Muslims as a coordinated global category is a collaborative project of highly specific Western and Muslim political theologies. It should not be viewed as a self-evident, universal fact.\nToday’s genocidal projects move me to reflect on the fate of ethnic, racial and other biominorities in our world. By biominorities I mean those whose difference (ethnic, religious, racial) from their national majorities is seen as a form of bodily threat to the national ethnos. There is something odd about the relationship of such biominorities to the typology of today’s genocidal projects. One type, which the Israeli killings in Gaza exemplify, is what we may call carceral genocide, genocide by confinement, concentration and starvation. The historical prototype of this is to be found in the Nazi concentration camps. The other might be called diasporic genocide, genocide by dispersion, extrusion and uprooting, where biominorities are forced out of any form of stability and forced to move until they die. Palestinians under Israeli occupation represent the first type, Rohingyas represent the second.\nWhat accounts for this bipolar condition of biominorities in this early decade of the 21st century? 
Put another way: why does the Israeli state not simply push Palestinians out of their land using its overwhelming military might, forcing them to join their brethren in other parts of the Middle East or North Africa, or die on the way? Conversely, why did Myanmar not simply create a carceral Rohingya state where this biominority could be confined, policed, starved and “concentrated” to death? These counterfactual questions force one to look more closely at the menu of genocidal strategies in play today.\nIn Myanmar’s case, the key factor, as many commentators have pointed out, is that Rohingyas occupy rich agricultural lands on the Western coast, which are now ripe for building ports and infrastructure across the Bay of Bengal. Rohingyas are deeply embedded in their land, which they have developed over centuries. Incarcerating them is no solution for the Myanmar military. They need to go, and the murder, rape and armed aggression directed at them is intended to push them out. The ethnocidal Buddhist monkhood which provides the ideological fuel for this extrusion is the willing partner of the militarised state. The Buddhist majority of Myanmar is in fact awash in an ocean of minorities, many of which are well-armed, belligerent and based in inaccessible ecological zones. But Rohingyas are not experienced in armed resistance and they are geographically concentrated in land which the state needs for its global projects. Thus, they are ripe for murderous expulsion. While their Muslim identity is a source of ideological fuel for the Buddhist majority, their relative weakness and location in vital global stretches along the Mynamar coast are more relevant.\n\nPARANOID SOVEREIGNTY\nWhy does Israel not follow a similar policy of expulsion, extrusion and displacement in the case of the Palestinian population in the Occupied Territories, including Gaza? Why adopt the option of incarceration and killing with impunity? The fundamental reason is near at hand. 
Palestinians under Israeli rule will not leave willingly because they are the legitimate occupants of their lands and because they have a long tradition of militant resistance, supported at different times by other Middle Eastern states, most recently Iran. They are stubborn and, thus, they have to be concentrated, starved and killed until they elect exit.\nBut there is more to the Israeli case than this. Israel needs its captive Palestinian population for without it neither the current power of the religious right nor the populist authoritarianism of Benjamin Netanhayu has any justification for existence. Like Kurds in Turkey, Jews in Hungary, Muslims in India and other visible biominorities, Palestinians in Israel are the guarantee of a permanent state of paranoid sovereignty. This paranoid sovereignty is Israel’s major claim to the sympathies and armed assistance of the United States since Israel would be far more susceptible to moderate voices if Palestinians were to disappear or exit. An outbreak of democracy is the last thing the Israeli religious and political right want, and the Donald Trump White House also hates any hint of moderation in any of its client states. The Israeli policy of aggressive and ongoing settler colonialism is intended to produce a continuous border theater in which Palestinians are indispensable in the creation of a permanent state of paranoid sovereignty.\nSo, what do the Palestinian and Rohingya cases (extreme ideal types, as it were) teach us? That solutions to the “problem” of biominorities depend on whether you want to keep the despised minority in order to avoid actually producing some semblance of democracy, or whether you want to delink the group from their lands or resources, with no pressing need to use their presence as a pretext for an ever-militant militarised state. 
You either need the minority to keep paranoid sovereignty alive, or you need their resources more than you need their biominor threat.\nWhat then of other genocidal trends we see in different regimes and regions across the world? Do these thoughts about Palestinians and Rohingyas offer us a more general insight? That both Rohingyas and Palestinians are Muslim does not account for the very different ways in which Myanmar and Israel treat them. The loose post 9/11 discourse of the Muslim threat allows the two states (and others) to legitimise their violence, but the global dynamics of genocide are not primarily about Muslimness. The fact is that all nation states rely on some idea, however covert, of ethnic purity and singularity. Biominor plurality is thus always a threat to modern nation states. The question is what combination of extrusion and incarceration a particular nation state finds useful. As they consider the possibilities, Israel and Myanmar offer them two radical options, which just happen to have Muslim communities as their targets. But today’s varieties of genocide are not as much about religion as they are about paranoid and/or predatory nation-states.\nArjun Appadurai is the Goddard Professor of Media, Culture, and Communication at New York University.",
"html": "<p> The killings of unarmed Palestinians by Israeli snipers this past fortnight marks a new chapter in the degradation of human life in Gaza and the Occupied Territories. It also makes one think about the continuing scandal of the Rohingya genocide in Myanmar, followed by their shameful treatment in India, Bangladesh, Indonesia and Thailand, among other Asian countries. Then there is Kashmir, which the Indian military continues to occupy with complete indifference to the lives of the men, women and children who live there. In each case, the targeted communities are Muslim though they are products of very different contextual histories. The global dynamics of genocide are not, in spite of appearances, primarily about Muslimness. The obsession with projecting Muslims as a coordinated global category is a collaborative project of highly specific Western and Muslim political theologies. It should not be viewed as a self-evident, universal fact. </p><p> Today’s genocidal projects move me to reflect on the fate of ethnic, racial and other biominorities in our world. By biominorities I mean those whose difference (ethnic, religious, racial) from their national majorities is seen as a form of bodily threat to the national ethnos. There is something odd about the relationship of such biominorities to the typology of today’s genocidal projects. One type, which the Israeli killings in Gaza exemplify, is what we may call carceral genocide, genocide by confinement, concentration and starvation. The historical prototype of this is to be found in the Nazi concentration camps. The other might be called diasporic genocide, genocide by dispersion, extrusion and uprooting, where biominorities are forced out of any form of stability and forced to move until they die. Palestinians under Israeli occupation represent the first type, Rohingyas represent the second. </p><p> What accounts for this bipolar condition of biominorities in this early decade of the 21st century? 
Put another way: why does the Israeli state not simply push Palestinians out of their land using its overwhelming military might, forcing them to join their brethren in other parts of the Middle East or North Africa, or die on the way? Conversely, why did Myanmar not simply create a carceral Rohingya state where this biominority could be confined, policed, starved and “concentrated” to death? These counterfactual questions force one to look more closely at the menu of genocidal strategies in play today. </p><p> In Myanmar’s case, the key factor, as many commentators have pointed out, is that Rohingyas occupy rich agricultural lands on the Western coast, which are now ripe for building ports and infrastructure across the Bay of Bengal. Rohingyas are deeply embedded in their land, which they have developed over centuries. Incarcerating them is no solution for the Myanmar military. They need to go, and the murder, rape and armed aggression directed at them is intended to push them out. The ethnocidal Buddhist monkhood which provides the ideological fuel for this extrusion is the willing partner of the militarised state. The Buddhist majority of Myanmar is in fact awash in an ocean of minorities, many of which are well-armed, belligerent and based in inaccessible ecological zones. But Rohingyas are not experienced in armed resistance and they are geographically concentrated in land which the state needs for its global projects. Thus, they are ripe for murderous expulsion. While their Muslim identity is a source of ideological fuel for the Buddhist majority, their relative weakness and location in vital global stretches along the Mynamar coast are more relevant. 
</p><figure><a href=\"https://scroll.in/subscribe?utm_source=internal&amp;utm_medium=articleinline\"><img src=\"https://s01.sgp1.digitaloceanspaces.com/inline/879591-llweiqsnvq-1526920853.jpg\" alt=\"\"></a></figure><h3><strong> Paranoid sovereignty </strong></h3><p> Why does Israel not follow a similar policy of expulsion, extrusion and displacement in the case of the Palestinian population in the Occupied Territories, including Gaza? Why adopt the option of incarceration and killing with impunity? The fundamental reason is near at hand. Palestinians under Israeli rule will not leave willingly because they are the legitimate occupants of their lands and because they have a long tradition of militant resistance, supported at different times by other Middle Eastern states, most recently Iran. They are stubborn and, thus, they have to be concentrated, starved and killed until they elect exit. </p><p> But there is more to the Israeli case than this. Israel needs its captive Palestinian population for without it neither the current power of the religious right nor the populist authoritarianism of Benjamin Netanhayu has any justification for existence. Like Kurds in Turkey, Jews in Hungary, Muslims in India and other visible biominorities, Palestinians in Israel are the guarantee of a permanent state of paranoid sovereignty. This paranoid sovereignty is Israel’s major claim to the sympathies and armed assistance of the United States since Israel would be far more susceptible to moderate voices if Palestinians were to disappear or exit. An outbreak of democracy is the last thing the Israeli religious and political right want, and the Donald Trump White House also hates any hint of moderation in any of its client states. The Israeli policy of aggressive and ongoing settler colonialism is intended to produce a continuous border theater in which Palestinians are indispensable in the creation of a permanent state of paranoid sovereignty. 
</p><p> So, what do the Palestinian and Rohingya cases (extreme ideal types, as it were) teach us? That solutions to the “problem” of biominorities depend on whether you want to keep the despised minority in order to avoid actually producing some semblance of democracy, or whether you want to delink the group from their lands or resources, with no pressing need to use their presence as a pretext for an ever-militant militarised state. You either need the minority to keep paranoid sovereignty alive, or you need their resources more than you need their biominor threat. </p><p> What then of other genocidal trends we see in different regimes and regions across the world? Do these thoughts about Palestinians and Rohingyas offer us a more general insight? That both Rohingyas and Palestinians are Muslim does not account for the very different ways in which Myanmar and Israel treat them. The loose post 9/11 discourse of the Muslim threat allows the two states (and others) to legitimise their violence, but the global dynamics of genocide are not primarily about Muslimness. The fact is that all nation states rely on some idea, however covert, of ethnic purity and singularity. Biominor plurality is thus always a threat to modern nation states. The question is what combination of extrusion and incarceration a particular nation state finds useful. As they consider the possibilities, Israel and Myanmar offer them two radical options, which just happen to have Muslim communities as their targets. But today’s varieties of genocide are not as much about religion as they are about paranoid and/or predatory nation-states. </p><p><em> Arjun Appadurai is the Goddard Professor of Media, Culture, and Communication at New York University. </em></p>",
"images": [
"https://s01.sgp1.digitaloceanspaces.com/inline/879591-llweiqsnvq-1526920853.jpg"
],
"author": "Arjun Appadurai",
"pub_date": "2018-05-22 08:00:00",
"is_article": 1,
"url": "https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states",
"canonical_url": "https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states",
"title": "Across the world, genocidal states are attacking Muslims. Is Islam really their target?",
"language": "en",
"image": "https://s01.sgp1.digitaloceanspaces.com/facebook/879591-ylsreufeki-1526956060.jpg",
"summary": "As Israel incarcerates Palestinians and Mynmar drives out its Rohingyas, a reflection on the predicament of ethnic and racial biominorities.",
"modified_date": "2018-05-22 08:00:00",
"site_name": "Scroll.in",
"favicon": "https://scroll.in/static/assets/apple-touch-icon-144x144.b71c766a62abe812b4b37e7f21e91e56.003.png",
"encoding": "utf-8",
"pages": [
"https://scroll.in/article/879591/from-israel-to-myanmar-genocidal-projects-are-less-about-religion-and-more-about-predatory-states"
]
},
"time": 10.981818914413452,
"js": false,
"pagination": false
}
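A response shaped like the example above can be consumed as in the sketch below. The `summarize` helper and the trimmed `sample` payload are hypothetical. The date format string is inferred from the single `pub_date` value shown in the example and may not hold for every response. The 0.5 threshold comes from the `is_article` parameter description.

```python
from datetime import datetime

# Trimmed stand-in for a success response; field names follow the example,
# values are placeholders.
sample = {
    "article": {
        "title": "Example title",
        "text": "First paragraph.\nSecond paragraph.",
        "author": "Jane Doe",
        "pub_date": "2018-05-22 08:00:00",
        "is_article": 1,
        "images": ["https://example.com/a.jpg"],
    },
    "time": 10.98,
    "js": False,
    "pagination": False,
}


def summarize(response, threshold=0.5):
    """Pull a few commonly used fields out of a success response."""
    article = response.get("article", {})
    score = article.get("is_article")
    pub = article.get("pub_date")
    return {
        "title": article.get("title"),
        # Scores of 0.5 and above are documented as likely articles
        "looks_like_article": score is not None and score >= threshold,
        # Assumption: dates use the "YYYY-MM-DD HH:MM:SS" layout seen in the example
        "published": datetime.strptime(pub, "%Y-%m-%d %H:%M:%S") if pub else None,
        # The example text separates paragraphs with newlines
        "paragraphs": (article.get("text") or "").split("\n"),
    }


info = summarize(sample)
print(info["title"], info["looks_like_article"], info["published"])
```

Using `.get` throughout keeps the helper tolerant of fields that were switched off with parameters such as `author` or `pub_date`.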

Error Response Schema

{
"url": "string",
"message": "string",
"error_code": "string",
"errors": ["string"]
}

Properties

| Name | Type | Description |
|---|---|---|
| url | string | Given URL |
| message | string | Error message |
| error_code | string | Error code |
| errors | [string] | List of all errors |
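One way to surface the error schema in client code is to wrap it in an exception, as sketched below. `ExtractError` and `unwrap` are hypothetical names, the payload values are placeholders rather than real API output, and the sketch assumes every non-200 response body follows the error schema above.

```python
class ExtractError(Exception):
    """Carries the fields of the error schema: url, message, error_code, errors."""

    def __init__(self, status, payload):
        self.status = status
        self.url = payload.get("url")
        self.error_code = payload.get("error_code")
        self.errors = payload.get("errors") or []
        super().__init__(payload.get("message") or "unknown error")


def unwrap(status, payload):
    """Return the article on success; raise ExtractError otherwise."""
    if status == 200:
        return payload["article"]
    raise ExtractError(status, payload)


# Hypothetical error payload; the values are placeholders.
try:
    unwrap(400, {
        "url": "",
        "message": "url is required",
        "error_code": "MISSING_PARAM",
        "errors": ["url is required"],
    })
except ExtractError as exc:
    print(exc.status, exc.error_code, exc, exc.errors)
```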

Response Codes

| Code | Billed | Meaning | Suggestion |
|---|---|---|---|
| 200 | Yes | Successful request | - |
| 400 | No | Some required parameter is missing (URL) | Set the missing parameter (url) |
| 401 | No | Missing API-KEY | Provide API-KEY |
| 404 | Yes | Provided URL not found | Provide a valid URL |
| 408 | Yes | Request timeout | Increase timeout parameter, use premium proxy or force JS |
| 429 | No | Too many requests | Upgrade your plan |
| 500 | No | Internal error | Retry the request or contact us |
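The suggestions in the table can be turned into a simple retry policy. The sketch below is one possible reading, not a prescribed client behaviour: `plan_retry` is a hypothetical helper, doubling the timeout is an arbitrary choice, and `max_timeout` is a cap chosen for this example rather than a documented API limit.

```python
# Billing flags transcribed from the response-codes table.
BILLED = {200: True, 400: False, 401: False, 404: True,
          408: True, 429: False, 500: False}


def plan_retry(status, params, max_timeout=120):
    """Return the parameters for a retry, or None if retrying will not help.

    max_timeout is an arbitrary cap for this sketch, not an API limit.
    """
    if status == 408:
        current = params.get("timeout", 60)  # 60 is the documented default
        if current >= max_timeout:
            return None
        # The table suggests increasing the timeout after a 408
        return {**params, "timeout": min(current * 2, max_timeout)}
    if status == 500:
        return dict(params)  # retry the same request unchanged
    return None  # 400, 401, 404 and 429 need a change on the caller's side


print(plan_retry(408, {"url": "https://example.com/post"}))
print(plan_retry(401, {"url": "https://example.com/post"}))
```

Since 408 responses are billed, capping the number of timeout retries keeps costs predictable.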

Examples

Stripping tags

To delete HTML elements before the extraction is carried out, use the strip_tags parameter to pass a comma-separated list of CSS selectors for the elements to delete. The example below removes any meta, form and input tags, as well as any element with class hidden:

curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
"https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&strip_tags=meta,form,.hidden,input"
curl --location --request GET 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form' \
--header 'ApiKey: <API Key>'
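For comparison, the first curl call can be sketched with Python's standard library as follows. The helper name `strip_tags_request` is hypothetical, and rejecting selectors that contain a comma is an assumption based on the parameter being a comma-separated list. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ENDPOINT = "https://api.ujeebu.com/extract"


def strip_tags_request(article_url, selectors, api_key):
    """Build (but do not send) a GET request that strips the given selectors."""
    for selector in selectors:
        if "," in selector:
            # Assumption: a comma inside one selector would be read as a separator
            raise ValueError(f"selector must not contain a comma: {selector!r}")
    query = urlencode({"url": article_url, "strip_tags": ",".join(selectors)})
    # The API key travels in the ApiKey header, as in the curl examples
    return Request(f"{API_ENDPOINT}?{query}", headers={"ApiKey": api_key})


req = strip_tags_request(
    "https://ujeebu.com/blog/how-to-extract-clean-text-from-html",
    ["meta", "form", ".hidden", "input"],
    "<API Key>",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call; a real API key is needed for that step.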

Passing custom headers

The extract endpoint (https://api.ujeebu.com/extract) forwards any header with the `UJB-` prefix to the target URL:
curl -i \
-H 'UJB-Username: username' \
-H 'UJB-Authorisation: Basic dXNlcm5hbWU6cGFzc3dvcmQ=' \
-H 'ApiKey: <API Key>' \
-X GET \
"https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html"
curl --location --request GET 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0' \
--header 'UJB-User-Agent: Custom user agent' \
--header 'ApiKey: <API Key>'

The code above will return the following response:

{
"article": {
"text": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API.\nExtracting clean article text from blogs and news sites (a.k.a. boilerplate removal) comes in handy for a variety of applications such as offline reading, article narration and generating article previews. It is also often a prerequisite for further content processing such as sentiment analysis, content summarization, text classification and other tasks that fall under the natural language processing (NLP) umbrella.\n\nWHY IS BOILERPLATE REMOVAL A DIFFICULT PROBLEM?\nThe main difficulty in extracting clean text from html lies in determining which blocks of text contribute to the article; different articles use different mark-up, and the text can be located anywhere in the DOM tree. It is also not uncommon for the parts of the DOM that contain the meat of the article to not be contiguous, and include loads of boilerplate (ads, forms, related article snippets...). Some publications also use JavaScript to generate the article content to ward off web scrapers, or simply as a coding preference.\nArticles contain other important info like author and publication date, which are also not straightforward to extract. Take the example of dates. Though you can achieve rather decent date extraction with regular expressions, you might need to identify the actual publication date vs. some other date mentioned in the article. Furthermore, one would need to run tens of regular expressions per supported language, and in doing so dramatically affect performance.\n\nSO HOW DO YOU EXTRACT TEXT AND OTHER DATA FROM A WEB PAGE?\nTwo sets of techniques are commonly used: statistical and Machine Learning based. 
Most statistical methods work by computing heuristics like link density, the frequency of certain characters, distance from the title, etc..., then combining them to form a probability score that represents the likelihood that an html block contains the main article text. A good explanation of these techniques can be found here . Machine learning techniques on the other hand rely on training mathematical models on a large set of documents that are represented by their key features and feeding them into a ML model.\nBoth techniques have their merits, with the statistical/heuristics method being the less computationally intensive of the two, on top of providing acceptable results in most cases. ML based techniques on the other hand tend to work better in complex cases and perform well on outliers, however as with any Machine Learning based algorithms, the quality of the training data is key. The two techniques are also sometimes used in tandem for better accuracy.\nIn some cases, extractors can fail due to a never-seen-before html structure, or simply bad mark-up. In such cases, it's customary to use per-domain rules that rely on CSS and/or DOM selectors. This is obviously a site dependent technique, and cannot be standardized by any means, but might help if we're scraping a small set of known publications, and provided regular checks are performed to make sure their html structure didn't change.\n\nTHE OPEN SOURCE OFFERING\n\nREADABILITY\nReadability is one of the oldest and probably the most used algorithms for text extraction, though it has considerably changed since it was first released. It also has several adaptations in different languages.\n\nMERCURY\nMercury is written in JavaScript and is based on Readability. 
It is also known for using custom site rules.\n\nBOILERPIPE\nBoilerPipe is Java based, uses a heuristics based algorithm just like readability and can be demo'ed here .\n\nDRAGNET\nDragNet uses a combination of heuristics and a Machine Learning Model . It comes pre-trained out of the box but can also be trained on a custom data set .\n\nNEWSPAPER\nNewsPaper is written in Python and is based on a package called Goose , which is also another decent extractor written in Scala. NewsPaper offers the advantage of extracting other data pieces like main keywords and article summary.\n\nUJEEBU VS. OPEN SOURCE\nUjeebu uses heuristics much like the other packages, from which it draws heavily, but resorts to a model to determine which heuristics to use. It also uses a model to determine if JavaScript is needed. This is paramount since JavaScript execution can dramatically slow down the extraction process, so it's important to know if it's needed or not upfront. Ujeebu supports extraction on multi-page articles, can identify rich media on demand and has built-in proxy and IP rotation.\nIn what follows, we compare the capabilities of Ujeebu with those of open source tools.\n\nPERFORMANCE\nWe ran Ujeebu and the aforementioned open source packages against a list of 338 URLs , then compared their output against the manually extracted version of those articles. Our sample represents 9 of the most languages on the Web. Namely, English, Spanish, Chinese, Russian, German, French, Portuguese, Arabic and Italian.\nOn the open source front, Readability stands out on top. We used the default version of DragNet, so the results were not the greatest, but pretty sure we could have had (much) better results had we trained it on our own multilingual model. 
Mercury on the other hand performed pretty well on western languages, but didn't do as well on Arabic, Russian and Chinese.\nUjeebu scores better across the board and on all languages, slightly outperforming Readability on text and html extraction, but besting all extractors on the rest of data with a large margin.\nThe extraction scores (out of 100) are based on computing text similarity between each extractor's output and the manual data set:\nExtractor Text Title Author Publication Date\nUjeebu 95.21 91.4 61.52 48.63\nBoilerpipe 88.92 - - -\nDragNet 75.95 - - -\nMercury 62.76 60.92 12.5 25.65\nNewsPaper 90.07 92.5 - 26.76\nReadability 94.85 87.84 32.64 -\n\nCONCLUSION\nWhile the current open source offering exhibits decent performance for text extraction, Ujeebu extracts more info from articles, and incorporates several capabilities from the get go which would require substantial effort to get right if done in-house (pagination, rich media, rendering js heavy pages, using a proxy, etc...).\nDon't take our word for it though, feel free to experiment with our test set here , try Ujeebu for yourself on our demo page , or get your own trial key and get started by using one of our several examples in the language of your choice .",
"html": "<p> Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API. </p><p> Extracting clean article text from blogs and news sites (a.k.a. boilerplate removal) comes in handy for a variety of applications such as offline reading, article narration and generating article previews. It is also often a prerequisite for further content processing such as sentiment analysis, content summarization, text classification and other tasks that fall under the natural language processing (NLP) umbrella. </p><h2> Why is boilerplate removal a difficult problem? </h2><p> The main difficulty in extracting clean text from html lies in determining which blocks of text contribute to the article; different articles use different mark-up, and the text can be located anywhere in the DOM tree. It is also not uncommon for the parts of the DOM that contain the meat of the article to not be contiguous, and include loads of boilerplate (ads, forms, related article snippets...). Some publications also use JavaScript to generate the article content to ward off web scrapers, or simply as a coding preference. </p><p> Articles contain other important info like author and publication date, which are also not straightforward to extract. Take the example of dates. Though you can achieve rather decent date extraction with regular expressions, you might need to identify the actual publication date vs. some other date mentioned in the article. Furthermore, one would need to run tens of regular expressions per supported language, and in doing so dramatically affect performance. </p><h2> So how do you extract text and other data from a web page? </h2><p> Two sets of techniques are commonly used: statistical and Machine Learning based. 
Most statistical methods work by computing heuristics like link density, the frequency of certain characters, distance from the title, etc..., then combining them to form a probability score that represents the likelihood that an html block contains the main article text. A good explanation of these techniques can be found <a href=\"https://stackoverflow.com/questions/3652657/what-algorithm-does-readability-use-for-extracting-text-from-urls\"> here </a> . Machine learning techniques on the other hand rely on training mathematical models on a large set of documents that are represented by their key features and feeding them into a ML model. </p><p> Both techniques have their merits, with the statistical/heuristics method being the less computationally intensive of the two, on top of providing acceptable results in most cases. ML based techniques on the other hand tend to work better in complex cases and perform well on outliers, however as with any Machine Learning based algorithms, the quality of the training data is key. The two techniques are also sometimes used in tandem for better accuracy. </p><p> In some cases, extractors can fail due to a never-seen-before html structure, or simply bad mark-up. In such cases, it's customary to use per-domain rules that rely on CSS and/or DOM selectors. This is obviously a site dependent technique, and cannot be standardized by any means, but might help if we're scraping a small set of known publications, and provided regular checks are performed to make sure their html structure didn't change. </p><h2> The Open source offering </h2><h3> Readability </h3><p><a href=\"https://www.readability.com\"> Readability </a> is one of the oldest and probably the most used algorithms for text extraction, though it has considerably changed since it was first released. It also has several <a href=\"https://github.com/masukomi/arc90-readability\"> adaptations </a> in different languages. 
</p><h3> Mercury </h3><p><a href=\"https://mercury.postlight.com/web-parser/\"> Mercury </a> is written in JavaScript and is based on Readability. It is also known for using custom site rules. </p><h3> BoilerPipe </h3><p><a href=\"https://github.com/kohlschutter/boilerpipe\"> BoilerPipe </a> is Java based, uses a heuristics based algorithm just like readability and can be <a href=\"https://boilerpipe-web.appspot.com/\"> demo'ed here </a> . </p><h3> DragNet </h3><p><a href=\"https://github.com/dragnet-org/dragnet\"> DragNet </a> uses a <a href=\"https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets\"> combination of heuristics and a Machine Learning Model </a> . It comes pre-trained out of the box but can also be <a href=\"https://techblog.gumgum.com/articles/text-extraction-using-dragnet-and-diffbot\"> trained on a custom data set </a> . </p><h3> NewsPaper </h3><p><a href=\"https://newspaper.readthedocs.io/en/latest/\"> NewsPaper </a> is written in Python and is based on a package called <a href=\"https://github.com/GravityLabs/goose\"> Goose </a> , which is also another decent extractor written in Scala. NewsPaper offers the advantage of extracting other data pieces like main keywords and article summary. </p><h2> Ujeebu vs. Open Source </h2><p> Ujeebu uses heuristics much like the other packages, from which it draws heavily, but resorts to a model to determine which heuristics to use. It also uses a model to determine if JavaScript is needed. This is paramount since JavaScript execution can dramatically slow down the extraction process, so it's important to know if it's needed or not upfront. Ujeebu supports extraction on multi-page articles, can identify rich media on demand and has built-in proxy and IP rotation. </p><p> In what follows, we compare the capabilities of Ujeebu with those of open source tools. 
</p><figure><img src=\"https://ujeebu.com/blog/content/images/2019/08/ujeebu-opensource-comparison-2.png\" alt=\"\"><figcaption> Feature Comparison: Ujeebu extraction API vs. Open Source </figcaption></figure><h2> Performance </h2><p> We ran Ujeebu and the aforementioned open source packages against a list of <a href=\"https://github.com/ujeebu/api-examples/blob/master/ujeebu-benchmark-urls.json\"> 338 URLs </a> , then compared their output against the manually extracted version of those articles. Our sample represents 9 of the most languages on the Web. Namely, English, Spanish, Chinese, Russian, German, French, Portuguese, Arabic and Italian. </p><p> On the open source front, Readability stands out on top. We used the default version of DragNet, so the results were not the greatest, but pretty sure we could have had (much) better results had we trained it on our own multilingual model. Mercury on the other hand performed pretty well on western languages, but didn't do as well on Arabic, Russian and Chinese. </p><p> Ujeebu scores better across the board and on all languages, slightly outperforming Readability on text and html extraction, but besting all extractors on the rest of data with a large margin. 
</p><p> The extraction scores (out of 100) are based on computing text similarity between each extractor's output and the manual data set: </p><table id=\"results\"><tr><th> Extractor </th><th> Text </th><th> Title </th><th> Author </th><th> Publication Date </th></tr><tr><td> Ujeebu </td><td><b> 95.21 </b></td><td> 91.4 </td><td><b> 61.52 </b></td><td><b> 48.63 </b></td></tr><tr><td> Boilerpipe </td><td> 88.92 </td><td> - </td><td> - </td><td> - </td></tr><tr><td> DragNet </td><td> 75.95 </td><td> - </td><td> - </td><td> - </td></tr><tr><td> Mercury </td><td> 62.76 </td><td> 60.92 </td><td> 12.5 </td><td> 25.65 </td></tr><tr><td> NewsPaper </td><td> 90.07 </td><td><b> 92.5 </b></td><td> - </td><td> 26.76 </td></tr><tr><td> Readability </td><td> 94.85 </td><td> 87.84 </td><td> 32.64 </td><td> - </td></tr></table><h2> Conclusion </h2><p> While the current open source offering exhibits decent performance for text extraction, Ujeebu extracts more info from articles, and incorporates several capabilities from the get go which would require substantial effort to get right if done in-house (pagination, rich media, rendering js heavy pages, using a proxy, etc...). </p><p> Don't take our word for it though, feel free to <a href=\"https://github.com/ujeebu/api-examples/blob/master/ujeebu-benchmark-urls.json\"> experiment with our test set here </a> , <a href=\"https://ujeebu.com/blog/demo\"> try Ujeebu for yourself on our demo page </a> , or <a href=\"https://ujeebu.com/blog/pricing\"> get your own trial key </a> and get started by using one of our <a href=\"https://ujeebu.com/blog/docs\"> several examples in the language of your choice </a> . </p>",
"images": [
"https://ujeebu.com/blog/content/images/2019/08/ujeebu-opensource-comparison-2.png"
],
"author": "Sam",
"pub_date": "2019-08-09 12:42:25",
"is_article": 1,
"url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html",
"canonical_url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html/",
"title": "Extracting clean data from blog and news articles",
"language": "en",
"image": "https://ujeebu.com/blog/content/images/2021/05/ujb-blog-benchmark.png",
"summary": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API",
"modified_date": "2021-05-02 20:22:34",
"site_name": "Ujeebu blog",
"favicon": "https://ujeebu.com/blog/favicon.png",
"encoding": "utf-8",
"pages": ["https://ujeebu.com/blog/how-to-extract-clean-text-from-html/"]
},
"time": 6.366053104400635,
"js": false,
"pagination": false
}

Using Proxies​

We realize your scraping activities might be blocked once in a while. To help you succeed, we developed a multi-tiered proxy offering that lets you select the proxy type that best fits your needs.

Your API calls go through our rotating proxy by default. The default proxy uses top IPs that will get the job done most of the time. If the default rotating proxy works well for your needs, there is nothing to change. For tougher URLs, set proxy_type to one of the following options:

  • advanced: US IPs only.
  • premium: US IPs only. Premium proxies which work well with social media and shopping sites.
  • residential: Geo targeted residential IPs which work on "tough" sites that aren't accessible with the options above. Please note that data scraped via the residential IPs is currently metered on requests exceeding 1MB.
  • custom: Set your own proxy (see the custom proxy section below).
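
As a rough illustration of the options above, the sketch below builds the query string for an extract call on a chosen proxy tier. The helper name `build_extract_url` and the example article URL are made up for this example; the endpoint, the `url` and `proxy_type` parameters and the tier names come from this page. It only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

EXTRACT_ENDPOINT = "https://api.ujeebu.com/extract"
PROXY_TYPES = {"rotating", "advanced", "premium", "residential", "custom"}


def build_extract_url(article_url, proxy_type="rotating"):
    """Return the GET URL for an extract call on the given proxy tier (hypothetical helper)."""
    if proxy_type not in PROXY_TYPES:
        raise ValueError(f"unknown proxy_type: {proxy_type!r}")
    params = {"url": article_url}
    # 'rotating' is the documented default, so it can be left out of the query.
    if proxy_type != "rotating":
        params["proxy_type"] = proxy_type
    # urlencode percent-encodes the target URL so its own '?' and '&' stay intact.
    return f"{EXTRACT_ENDPOINT}?{urlencode(params)}"


print(build_extract_url("https://example.com/post", proxy_type="premium"))
```

The resulting URL would then be requested with the ApiKey header, as in the curl examples on this page.
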
tip

We won't bill for failing requests that aren't 404s.

info

The size of a request also includes assets downloaded with the page when JS rendering is on.

info

To use a premium proxy from a specific country, set the proxy_country parameter to the ISO 3166-1 alpha-2 code of one of the following countries:

Supported countries
  • Algeria: DZ
  • Angola: AO
  • Benin: BJ
  • Botswana: BW
  • Burkina Faso: BF
  • Burundi: BI
  • Cameroon: CM
  • Central African Republic: CF
  • Chad: TD
  • Democratic Republic of the Congo: CD
  • Djibouti: DJ
  • Egypt: EG
  • Equatorial Guinea: GQ
  • Eritrea: ER
  • Ethiopia: ET
  • Gabon: GA
  • Gambia: GM
  • Ghana: GH
  • Guinea: GN
  • Guinea Bissau: GW
  • Ivory Coast: CI
  • Kenya: KE
  • Lesotho: LS
  • Liberia: LR
  • Libya: LY
  • Madagascar: MG
  • Malawi: MW
  • Mali: ML
  • Mauritania: MR
  • Morocco: MA
  • Mozambique: MZ
  • Namibia: NA
  • Niger: NE
  • Nigeria: NG
  • Republic of the Congo: CG
  • Rwanda: RW
  • Senegal: SN
  • Sierra Leone: SL
  • Somalia: SO
  • Somaliland: ML
  • South Africa: ZA
  • South Sudan: SS
  • Sudan: SD
  • Swaziland: SZ
  • Tanzania: TZ
  • Togo: TG
  • Tunisia: TN
  • Uganda: UG
  • Western Sahara: EH
  • Zambia: ZM
  • Zimbabwe: ZW
  • Afghanistan: AF
  • Armenia: AM
  • Azerbaijan: AZ
  • Bangladesh: BD
  • Bhutan: BT
  • Brunei: BN
  • Cambodia: KH
  • China: CN
  • East Timor: TL
  • Hong Kong: HK
  • India: IN
  • Indonesia: ID
  • Iran: IR
  • Iraq: IQ
  • Israel: IL
  • Japan: JP
  • Jordan: JO
  • Kazakhstan: KZ
  • Kuwait: KW
  • Kyrgyzstan: KG
  • Laos: LA
  • Lebanon: LB
  • Malaysia: MY
  • Maldives: MV
  • Mongolia: MN
  • Myanmar: MM
  • Nepal: NP
  • North Korea: KP
  • Oman: OM
  • Pakistan: PK
  • Palestine: PS
  • Philippines: PH
  • Qatar: QA
  • Saudi Arabia: SA
  • Singapore: SG
  • South Korea: KR
  • Sri Lanka: LK
  • Syria: SY
  • Taiwan: TW
  • Tajikistan: TJ
  • Thailand: TH
  • Turkey: TR
  • Turkmenistan: TM
  • United Arab Emirates: AE
  • Uzbekistan: UZ
  • Vietnam: VN
  • Yemen: YE
  • Albania: AL
  • Andorra: AD
  • Austria: AT
  • Belarus: BY
  • Belgium: BE
  • Bosnia and Herzegovina: BA
  • Bulgaria: BG
  • Croatia: HR
  • Cyprus: CY
  • Czech Republic: CZ
  • Denmark: DK
  • Estonia: EE
  • Finland: FI
  • France: FR
  • Germany: DE
  • Gibraltar: GI
  • Greece: GR
  • Hungary: HU
  • Iceland: IS
  • Ireland: IE
  • Italy: IT
  • Kosovo: XK
  • Latvia: LV
  • Liechtenstein: LI
  • Lithuania: LT
  • Luxembourg: LU
  • Macedonia: MK
  • Malta: MT
  • Moldova: MD
  • Monaco: MC
  • Montenegro: ME
  • Netherlands: NL
  • Northern Cyprus: CY
  • Norway: NO
  • Poland: PL
  • Portugal: PT
  • Romania: RO
  • Russia: RU
  • San Marino: SM
  • Serbia: RS
  • Slovakia: SK
  • Slovenia: SI
  • Spain: ES
  • Sweden: SE
  • Switzerland: CH
  • Ukraine: UA
  • United Kingdom: GB
  • Bahamas: BS
  • Belize: BZ
  • Bermuda: BM
  • Canada: CA
  • Costa Rica: CR
  • Cuba: CU
  • Dominican Republic: DO
  • El Salvador: SV
  • Greenland: GL
  • Guatemala: GT
  • Haiti: HT
  • Honduras: HN
  • Jamaica: JM
  • Nicaragua: NI
  • Panama: PA
  • Puerto Rico: PR
  • Trinidad And Tobago: TT
  • United States: US
  • Australia: AU
  • Fiji: FJ
  • New Caledonia: NC
  • New Zealand: NZ
  • Papua New Guinea: PG
  • Solomon Islands: SB
  • Vanuatu: VU
  • Argentina: AR
  • Bolivia: BO
  • Brazil: BR
  • Chile: CL
  • Colombia: CO
  • Ecuador: EC
  • Falkland Islands: FK
  • French Guiana: GF
  • Guyana: GY
  • Mexico: MX
  • Paraguay: PY
  • Peru: PE
  • Suriname: SR
  • Uruguay: UY
  • Venezuela: VE
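
A minimal sketch of how a client might prepare the parameters for a country-targeted premium request. The helper `premium_proxy_params` is hypothetical, and normalizing the code to upper case is an assumption (ISO 3166-1 alpha-2 codes are conventionally written in upper case); the check only validates the shape of the code, not whether the country is in the list above.

```python
from urllib.parse import urlencode


def premium_proxy_params(article_url, country_code):
    """Build query parameters for a premium proxy in a given country (hypothetical helper)."""
    code = country_code.strip().upper()
    # Shape check only: two ASCII letters, as in ISO 3166-1 alpha-2.
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter country code: {country_code!r}")
    return {
        "url": article_url,
        "proxy_type": "premium",  # proxy_country is documented as valid for premium only
        "proxy_country": code,
    }


params = premium_proxy_params("https://example.com/post", "fr")
query = urlencode(params)
print(query)
```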

Using Ujeebu Extract with your own proxy​

To use your own proxy, set the proxy_type parameter to custom, then set the custom_proxy parameter to your proxy in the format scheme://host:port. Pass the proxy credentials using the custom_proxy_username and custom_proxy_password parameters.

info

If you're using the GET HTTP verb and custom_proxy_password contains special characters, URL-encode it before passing it.

curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
'https://api.ujeebu.com/scrape?url=https://ipinfo.io&response_type=raw&proxy_type=custom&custom_proxy=http://proxyhost:8889&custom_proxy_username=user&custom_proxy_password=pass'
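
The URL-encoding advice above can be sketched in Python with the standard library, which percent-encodes every parameter value. The helper name `custom_proxy_url`, the proxy host and the credentials are placeholders; the extract endpoint is used here as an assumption that it accepts the same custom proxy parameters as the call above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def custom_proxy_url(endpoint, target_url, proxy, username, password):
    """Build a GET URL that routes the request through a custom proxy (hypothetical helper)."""
    params = {
        "url": target_url,
        "proxy_type": "custom",
        "custom_proxy": proxy,  # scheme://host:port
        "custom_proxy_username": username,
        "custom_proxy_password": password,
    }
    # urlencode escapes characters such as '@', '&' and '=' in the password,
    # so they cannot be mistaken for query-string delimiters.
    return f"{endpoint}?{urlencode(params)}"


url = custom_proxy_url(
    "https://api.ujeebu.com/extract",
    "https://example.com/post",
    "http://proxyhost:8889",
    "user",
    "p@ss&word=1",
)
# Decoding the query string gives back the original password.
decoded = parse_qs(urlsplit(url).query)
print(decoded["custom_proxy_password"][0])
```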

Credits​

Credit cost per request:

Proxy Type | No JS | w/ JS | Geo Targeting | Metered
rotating | 5 | 10 | US | No
advanced | 10 | 15 | US | No
premium | 12 | 17 | US | No
residential | 10 | 20 | Multiple countries | +2 credits per MB after 5MB
custom | 5 | 10 | Custom | No
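
To make the table concrete, here is a rough estimator for the credit cost of a single request. It is a sketch, not the billing logic: the function name is made up, the residential surcharge uses the 5MB threshold from the table (the proxy list earlier on this page mentions 1MB), and rounding partial megabytes up is an assumption.

```python
import math

# (credits without JS, credits with JS) per proxy type, taken from the table above.
CREDITS = {
    "rotating": (5, 10),
    "advanced": (10, 15),
    "premium": (12, 17),
    "residential": (10, 20),
    "custom": (5, 10),
}

RESIDENTIAL_FREE_MB = 5  # threshold from the table
RESIDENTIAL_CREDITS_PER_MB = 2


def estimate_credits(proxy_type, js=False, response_mb=0.0):
    """Estimate the credits one request would cost (hypothetical helper)."""
    no_js, with_js = CREDITS[proxy_type]
    cost = with_js if js else no_js
    if proxy_type == "residential" and response_mb > RESIDENTIAL_FREE_MB:
        # Assumption: each started MB beyond the threshold is billed in full.
        extra_mb = math.ceil(response_mb - RESIDENTIAL_FREE_MB)
        cost += RESIDENTIAL_CREDITS_PER_MB * extra_mb
    return cost


print(estimate_credits("premium", js=True))
print(estimate_credits("residential", response_mb=7.2))
```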

Article Preview API​

Introduction​

Extracts a preview of an article (article card). This is faster than the extract endpoint as it doesn't do any in-depth analysis of the content of the article, and instead mostly relies on its meta tags.

To use the API, subscribe to a plan here and connect to:


GET https://api.ujeebu.com/card

Parameters​

Name | Type | Description | Default Value
url (REQUIRED) | string | URL of article to be extracted. | -
js | boolean | indicates whether to execute JavaScript or not. Set to 'auto' to let the extractor decide. | false
timeout | number | maximum number of seconds before request timeout. | 60
js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | 60
proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'custom'. | rotating
proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US
custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port. Applicable and required only if proxy_type=custom. | null
auto_proxy | string | enables a more advanced proxy by default when the rotating proxy is not working. It moves to the next proxy option until it gets the content, and stops only when content is available or none of the options worked. Please note that you are billed only on the top option attempted. | false
session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests from the same proxy instance. Sessions remain active for 30 minutes. | null
UJB-headerName | string | indicates which headers to send to target URL. This can be useful when the article is behind a paywall, for example, and you need to pass your authentication cookies. | null
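
A small sketch of how a client could assemble a card request from these parameters, checking the session_id constraint (1 to 16 alphanumeric characters) before sending anything. The function name `build_card_url` and the example values are illustrative, and only ASCII letters and digits are accepted here, which is an assumption about what "alphanumeric" means for the API.

```python
import re
from urllib.parse import urlencode

CARD_ENDPOINT = "https://api.ujeebu.com/card"
# Assumption: "alphanumeric" means ASCII letters and digits.
SESSION_ID_RE = re.compile(r"[A-Za-z0-9]{1,16}")


def build_card_url(article_url, session_id=None, js=None, timeout=None):
    """Build the GET URL for a card request (hypothetical helper)."""
    params = {"url": article_url}
    if js is not None:
        params["js"] = js  # e.g. "auto", or 0 as in the curl example on this page
    if timeout is not None:
        params["timeout"] = int(timeout)
    if session_id is not None:
        if not SESSION_ID_RE.fullmatch(session_id):
            raise ValueError("session_id must be 1 to 16 alphanumeric characters")
        params["session_id"] = session_id
    return f"{CARD_ENDPOINT}?{urlencode(params)}"


print(build_card_url("https://example.com/post", session_id="crawl42", js="auto"))
```

Reusing the same session_id across calls within the 30-minute window is what keeps them on the same proxy instance, per the table above.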

Responses​

Status | Meaning | Description | Schema
200 | OK | successful operation | SuccessResponse
400 | Bad Request | Invalid parameter value | APIResponseError

Schemas​

Article Card Schema​

{
"url": "string",
"lang": "string",
"favicon": "string",
"title": "string",
"summary": "string",
"author": "string",
"date_published": "string",
"date_modified": "string",
"image": "string",
"site_name": "string"
}
Properties​
Name | Type | Description
url | string | the URL parameter.
lang | string | the language of the article.
favicon | string | Domain favicon.
title | string | the title of the article.
summary | string | the description of the article.
author | string | the author of the article.
date_published | string | the publish date of the article.
date_modified | string | the modified date of the article.
image | string | the main image of the article.
site_name | string | the site name of the article.

Error Response Schema​

{
"url": "string",
"message": "string",
"errors": ["string"]
}
Properties​
Name | Type | Description
url | string | Given URL
message | string | Error message
errors | [string] | List of all errors
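
One way a client might turn the two documented outcomes (200 and 400) into Python values is sketched below. The exception class `CardError`, the function `parse_card_response` and the error body in the usage lines are invented for illustration; only the field names url, message and errors come from the schema above.

```python
import json


class CardError(Exception):
    """Raised when the API answers with the error schema (hypothetical class)."""

    def __init__(self, url, message, errors):
        super().__init__(message)
        self.url = url
        self.errors = errors


def parse_card_response(status, body):
    """Return the decoded card on 200, raise CardError otherwise."""
    data = json.loads(body)
    if status == 200:
        return data
    raise CardError(
        url=data.get("url"),
        message=data.get("message", "unknown error"),
        errors=data.get("errors") or [],
    )


# Illustrative error body, not a captured API response.
bad_body = '{"url": "not-a-url", "message": "Invalid parameter value", "errors": ["url is invalid"]}'
try:
    parse_card_response(400, bad_body)
except CardError as exc:
    print(exc, exc.errors)
```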

Code Example​

This will get the meta info for article https://ujeebu.com/blog/how-to-extract-clean-text-from-html/

curl --location --request GET 'https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0' \
--header 'ApiKey: <API Key>'

The code above will generate the following response:

{
"author": "Sam",
"title": "Extracting clean data from blog and news articles",
"summary": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API",
"date_published": "2019-08-09 12:42:25",
"date_modified": "2021-05-02 20:22:34",
"favicon": ":///blog/favicon.png",
"charset": "utf-8",
"image": "https://ujeebu.com/blog/content/images/2021/05/ujb-blog-benchmark.png",
"lang": "en",
"keywords": [],
"site_name": "Ujeebu blog",
"time": 1.501387119293213
}
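
The response above can be loaded into a small typed structure, for instance to work with the dates as datetime objects. This is a sketch: the `ArticleCard` class and helper names are made up, and the timestamp format is inferred from the sample response rather than from a documented guarantee, so unparseable values fall back to None.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # inferred from the sample response


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a card timestamp, returning None when it is missing or malformed."""
    if not value:
        return None
    try:
        return datetime.strptime(value, TIMESTAMP_FORMAT)
    except ValueError:
        return None


@dataclass
class ArticleCard:
    title: Optional[str]
    author: Optional[str]
    site_name: Optional[str]
    published: Optional[datetime]
    modified: Optional[datetime]


def parse_card(body: str) -> ArticleCard:
    data = json.loads(body)
    return ArticleCard(
        title=data.get("title"),
        author=data.get("author"),
        site_name=data.get("site_name"),
        published=parse_timestamp(data.get("date_published")),
        modified=parse_timestamp(data.get("date_modified")),
    )


# Abridged copy of the sample response above.
sample = json.dumps({
    "author": "Sam",
    "title": "Extracting clean data from blog and news articles",
    "date_published": "2019-08-09 12:42:25",
    "date_modified": "2021-05-02 20:22:34",
    "site_name": "Ujeebu blog",
})
card = parse_card(sample)
print(card.author, card.published, card.modified)
```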

Usage endpoint​

Introduction​

To keep track of how much credit you're using programmatically, you can use the /account endpoint in your program.

Calls to this endpoint do not count against your calls-per-second limit, but you can make at most 10 /account calls per minute.

To use the API:


GET https://api.ujeebu.com/account
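
Since /account is limited to 10 calls per minute, a polling client may want to throttle itself. Below is a minimal sliding-window sketch; the class name is hypothetical, and how the API actually measures the minute (sliding or fixed window) is not stated on this page, so the sliding window is an assumption on the conservative side.

```python
import time
from collections import deque


class AccountCallLimiter:
    """Client-side limiter: at most max_calls per period seconds (hypothetical class)."""

    def __init__(self, max_calls=10, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.period - (now - self.calls[0])

    def record(self):
        """Register that a call was just made."""
        self.calls.append(self.clock())


limiter = AccountCallLimiter()
if limiter.wait_time() == 0.0:
    limiter.record()  # the GET to /account would be issued here
```

The clock argument exists so the limiter can be exercised with a fake clock instead of real waiting.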

Usage Endpoint Code Example​

This will get the current usage of the account associated with the given ApiKey.

curl --location --request GET 'https://api.ujeebu.com/account' \
--header 'ApiKey: <API Key>'

The code above will generate the following response:

{
"balance": 0,
"days_till_next_billing": 0,
"next_billing_date": null,
"plan": "TRIAL",
"quota": "5000",
"requests_per_second": "1",
"total_requests": 14,
"used": 95,
"used_percent": 1.9,
"userid": "8155"
}
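
A short sketch of reading this response in a program, for example to stop a crawl before the quota runs out. The helper name `summarize_usage` is made up. Note that quota arrives as a string in the sample, so it is converted first; treating used_percent as used divided by quota is an inference that happens to match the sample values (95 of 5000 is 1.9%).

```python
import json


def summarize_usage(account):
    """Derive remaining credits and usage percentage from an /account response (hypothetical helper)."""
    quota = int(account["quota"])  # a string in the sample response
    used = int(account["used"])
    remaining = max(quota - used, 0)
    percent = round(100 * used / quota, 2) if quota else 0.0
    return {"remaining": remaining, "used_percent": percent}


# Abridged copy of the sample response above.
sample = json.loads('{"plan": "TRIAL", "quota": "5000", "used": 95, "used_percent": 1.9}')
summary = summarize_usage(sample)
print(summary)
```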