Article Extractor API

Introduction

Ujeebu Extract converts a news or blog article into structured JSON data. It extracts the main text and HTML bodies, the author, the publish date, any embeddable media such as YouTube videos and Twitter cards, and the RSS or social feeds (Facebook/Twitter timelines or YouTube channels), among other relevant pieces of data.

To use the API, subscribe to a plan here and connect to:


GET https://api.ujeebu.com/extract

Parameters

| Name | Type | Description | Default Value |
| --- | --- | --- | --- |
| url (REQUIRED) | string | URL of article to be extracted. | - |
| raw_html | string | HTML of article to be extracted. When this is passed, article extraction is carried out on the value of this parameter (i.e. without fetching article from url), however the extractor still relies on url to resolve relative links and relatively referenced assets in the provided html. | - |
| js | boolean | indicates whether to execute JavaScript or not. Set to 'auto' to let the extractor decide. | false |
| text | boolean | indicates whether API should return extracted text. | true |
| html | boolean | indicates whether API should extract html. | true |
| media | boolean | indicates whether API should extract media. | false |
| feeds | boolean | indicates whether API should extract RSS feeds. | false |
| images | boolean | indicates whether API should extract all images present in HTML. | true |
| author | boolean | indicates whether API should extract article's author. | true |
| pub_date | boolean | indicates whether API should extract article's publish date. | true |
| partial | number | number of characters or percentage of text (if percent sign is present) of text/html to be returned. 0 means all. | 0 |
| is_article | boolean | when true returns the probability [0-1] of URL being an article. Anything scoring 0.5 and above should be an article, but this may slightly vary from one site to another. | true |
| quick_mode | boolean | when true, does a quick analysis of the content instead of the normal advanced parsing. Usually cuts down response time by about 30% to 60%. | false |
| strip_tags | csv-string | indicates which tags to strip from the extracted article HTML. Expects a comma separated list of tag names/css selectors. | form |
| timeout | number | maximum number of seconds before request timeout | 60 |
| js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | timeout/2 |
| scroll_down | boolean | indicates whether to scroll down the page or not, this applies only when js is enabled. | true |
| image_analysis | boolean | indicates whether API should analyse images for minimum width and height (see parameters min_image_width and min_image_height for more details). | true |
| min_image_width | number | minimum width of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 200 |
| min_image_height | number | minimum height of the images kept in the HTML (if image_analysis is false this parameter has no effect). | 100 |
| image_timeout | number | image fetching timeout in seconds. | 2 |
| return_only_enclosed_text_images | boolean | indicates whether to return only images that are enclosed within extracted article HTML. | true |
| proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'custom'. | rotating |
| proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US |
| custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port. applicable and required only if proxy_type=custom | null |
| auto_proxy | string | enable a more advanced proxy by default when rotating proxy is not working. It will move to the next proxy option until it gets the content and will only stop when content is available or none of the options worked. Please note that you are billed only on the top option attempted. | false |
| session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests from the same proxy instance. Sessions remain active for 30 minutes | null |
| pagination | boolean | extract and concatenate multiple-page articles. | true |
| pagination_max_pages | string | indicates the number of pages to extract when pagination is enabled. | 30 |
| UJB-headerName | string | indicates which headers to send to target URL. This can be useful when article is behind a paywall for example, and that you need to pass your authentication cookies. | null |
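
As a minimal sketch of how these parameters combine into a request, the Python snippet below builds the query string with the standard library and sends it with the `ApiKey` header used in this page's curl examples. The helper names `build_extract_url` and `fetch_article` are hypothetical, and encoding booleans as 1/0 is an assumption that mirrors the `js=0` seen in the curl examples.

```python
import json
import urllib.parse
import urllib.request

EXTRACT_ENDPOINT = "https://api.ujeebu.com/extract"


def build_extract_url(article_url, **params):
    """Return the full GET URL for the extract endpoint."""
    query = {"url": article_url}
    for name, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent as 1/0, like `js=0` in the curl examples.
            value = int(value)
        query[name] = value
    # urlencode percent-encodes the article URL so its own '?' and '&' survive.
    return EXTRACT_ENDPOINT + "?" + urllib.parse.urlencode(query)


def fetch_article(api_key, article_url, **params):
    """Call the endpoint and return the decoded JSON body (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(article_url, **params),
        headers={"ApiKey": api_key},
    )
    # Give the HTTP client slightly longer than the API-side timeout (default 60).
    client_timeout = params.get("timeout", 60) + 5
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.load(response)


print(build_extract_url("https://example.com/post", js=False, timeout=30))
```

Only `build_extract_url` runs here; `fetch_article` is defined but not called, so the snippet can be tried without an API key.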

Responses

| Status | Meaning | Description | Schema |
| --- | --- | --- | --- |
| 200 | OK | successful operation | SuccessResponse |
| 400 | Bad Request | Invalid parameter value | APIResponseError |

Schemas

Article Schema

{
"url": "string",
"canonical_url": "string",
"title": "string",
"text": "string",
"html": "string",
"summary": "string",
"image": "string",
"images": ["string"],
"media": ["string"],
"language": "string",
"author": "string",
"pub_date": "string",
"modified_date": "string",
"site_name": "string",
"favicon": "string",
"encoding": "string"
}

Properties

| Name | Type | Description |
| --- | --- | --- |
| url | string | the URL parameter. |
| canonical_url | string | the final (resolved) URL. |
| title | string | the title of the article. |
| text | string | the extracted text. |
| html | string | the extracted html. |
| summary | string | summary (if available) of the article text. |
| image | string | main image of the article. |
| images | [string] | all images present in article. |
| media | [string] | all media present in article. |
| language | string | language code of article text. |
| author | string | author of article. |
| pub_date | string | publication date of article. |
| modified_date | string | last modified date of article. |
| site_name | string | name of site hosting article. |
| favicon | string | favicon of site hosting article. |
| encoding | string | character encoding of article text. |

Success Response example

{
"article": {
"text": "I began learning German at the age of 13, and I\u2019m still trying to explain to myself why it was love at first sound. The answer must surely be: the excellence of my teacher. At an English public school not famed for its cultural generosity, Mr King was that rare thing: a kindly and intelligent man who, in the thick of the second world war, determinedly loved the Germany that he knew was still there somewhere.\nRather than join the chorus of anti-German propaganda, he preferred, doggedly, to inspire his little class with the beauty of the language, and of its literature and culture. One day, he used to say, the real Germany will come back. And he was right. Because now it has.\nWhy was it love at first sound for me? Well...",
"html": "<p><span>I<\/span> began learning German at the age of 13, and I’m still trying to explain to myself why it was love at first sound. The answer must surely be: the excellence of my teacher. At an English public school not famed for its cultural generosity, Mr King was that rare thing: a kindly and intelligent man who, in the thick of the second world war, determinedly loved the Germany that he knew was still there somewhere.<\/p><p>Rather than join the chorus of anti-German propaganda, he preferred, doggedly, to inspire his little class with the beauty of the language, and of its literature and culture. One day, he used to say, the real Germany will come back. And he was right. Because now it has....",
"media": [],
"images": [],
"author": "John le Carr\u00e9",
"pub_date": "2017-07-01 23:05:12",
"is_article": 1,
"url": "https:\/\/www.theguardian.com\/education\/2017\/jul\/02\/why-we-should-learn-german-john-le-carre",
"canonical_url": "https:\/\/www.theguardian.com\/education\/2017\/jul\/02\/why-we-should-learn-german-john-le-carre",
"title": "Why we should learn German | John le Carr\u00e9",
"language": "en",
"image": "https:\/\/i.guim.co.uk\/img\/media\/f19eff6f7e1751d88b38e725cfbe6687084d5f64\/0_235_9010_5405\/master\/9010.jpg?width=1200&height=630&quality=85&auto=format&fit=crop&overlay-align=bottom%2Cleft&overlay-width=100p&overlay-base64=L2ltZy9zdGF0aWMvb3ZlcmxheXMvdG8tb3BpbmlvbnMtYWdlLTIwMTcucG5n&enable=upscale&s=efeec857dffdb94cd84c4b652b4e287f",
"summary": "To help make the European debate decent and civilised, it is now more important than ever to value the skills of the linguist",
"modified_date": "2017-12-02 03:00:56",
"site_name": "the Guardian",
"favicon": "https:\/\/static.guim.co.uk\/images\/favicon-32x32.ico",
"encoding": "utf-8",
"time": 0.85
}
}
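
A short Python sketch of reading such a response is shown below. It applies the 0.5 threshold from the `is_article` parameter description; the `summarize` helper and the choice of fields are hypothetical, and the sample payload is a trimmed copy of the response above.

```python
import json

ARTICLE_THRESHOLD = 0.5  # taken from the is_article parameter description


def summarize(response_body):
    """Pick a few fields out of a decoded success response."""
    article = response_body["article"]
    score = article.get("is_article")
    return {
        "title": article.get("title"),
        "author": article.get("author"),
        "pub_date": article.get("pub_date"),
        # Treat a missing score as "not known to be an article".
        "looks_like_article": score is not None and score >= ARTICLE_THRESHOLD,
    }


sample = json.loads(
    '{"article": {"title": "Why we should learn German | John le Carr\\u00e9",'
    ' "author": "John le Carr\\u00e9", "pub_date": "2017-07-01 23:05:12",'
    ' "is_article": 1}}'
)
print(summarize(sample))
```

Because the threshold "may slightly vary from one site to another", a caller may want to make `ARTICLE_THRESHOLD` configurable per site.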

Error Response Schema

{
"url": "string",
"message": "string",
"error_code": "string",
"errors": ["string"]
}

Properties

| Name | Type | Description |
| --- | --- | --- |
| url | string | Given URL |
| message | string | Error message |
| error_code | string | Error code |
| errors | [string] | List of all errors |

Response Codes

| Code | Billed | Meaning | Suggestion |
| --- | --- | --- | --- |
| 200 | Yes | Successful request | - |
| 400 | No | A required parameter (URL) is missing | Set the missing parameter |
| 401 | No | Missing API-KEY | Provide API-KEY |
| 404 | Yes | Provided URL not found | Provide a valid URL |
| 408 | Yes | Request timeout | Increase timeout parameter, use premium proxy or force JS |
| 429 | No | Too many requests | Upgrade your plan |
| 500 | No | Internal error | Retry the request or contact us |
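
The billing column above can be encoded in a small lookup, sketched below in Python. The `BILLED` mapping copies the table; the `RETRYABLE` set and the `classify` helper are assumptions about how a client might react, not behaviour defined by the API.

```python
# Billing column of the response codes table (True means the request is billed).
BILLED = {200: True, 400: False, 401: False, 404: True, 408: True, 429: False, 500: False}

# Assumption: these failures may succeed on a later attempt, so a client could retry them.
RETRYABLE = {408, 429, 500}


def classify(status_code):
    """Return (billed, retryable); billed is None for codes the table does not list."""
    return BILLED.get(status_code), status_code in RETRYABLE


for code in (200, 404, 408, 429):
    print(code, classify(code))
```

Since 408 responses are billed, a client that retries timeouts may want to cap the number of attempts.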

Examples

Stripping tags

To delete HTML elements before the extraction is carried out, pass the strip_tags parameter a comma-separated list of CSS selectors for the elements to delete. The example below removes any meta, form and input tags, as well as any element with the class hidden.

curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
"https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&strip_tags=meta,form,.hidden,input"

curl --location --request GET 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0&strip_tags=aside,form' \
--header 'ApiKey: <API Key>'
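
When selectors are assembled in code, characters such as `#` need percent-encoding, otherwise the rest of the URL is read as a fragment. The Python sketch below joins a list of selectors and encodes the result; the `strip_tags_url` helper and the `#comments` selector are hypothetical additions for illustration.

```python
import urllib.parse


def strip_tags_url(article_url, selectors):
    """Build an extract URL whose strip_tags value is a comma-separated selector list."""
    query = {"url": article_url, "strip_tags": ",".join(selectors)}
    # urlencode escapes ',' and '#' so the selector list arrives intact.
    return "https://api.ujeebu.com/extract?" + urllib.parse.urlencode(query)


url = strip_tags_url(
    "https://ujeebu.com/blog/how-to-extract-clean-text-from-html",
    ["meta", "form", ".hidden", "input", "#comments"],
)
print(url)
```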

Passing custom headers

The extract endpoint (https://api.ujeebu.com/extract) forwards any header with the `UJB-` prefix to the target URL.

curl -i \
-H 'UJB-Username: username' \
-H 'UJB-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=' \
-H 'ApiKey: <API Key>' \
-X GET \
'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html'

curl --location --request GET 'https://api.ujeebu.com/extract?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html&js=0' \
--header 'UJB-User-Agent: Custom user agent' \
--header 'ApiKey: <API Key>'

The code above will return the following response:

{
"article": {

"author": "Sam",
"pub_date": "2019-08-09 12:42:25",
"is_article": 1,
"url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html",
"canonical_url": "https://ujeebu.com/blog/how-to-extract-clean-text-from-html/",
"title": "Extracting clean data from blog and news articles",
"site_name": "Ujeebu blog",
"favicon": "https://ujeebu.com/blog/favicon.png",
"encoding": "utf-8",
"pages": ["https://ujeebu.com/blog/how-to-extract-clean-text-from-html/"]
},
"time": 6.366053104400635,
"js": false,
"pagination": false
}
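
The header convention from this section can be sketched in Python as follows. The `forwardable_headers` helper is hypothetical; it assumes, as the `UJB-User-Agent` example suggests, that you prefix the name of the header you want the target site to receive.

```python
import base64


def forwardable_headers(api_key, target_headers):
    """Return request headers: ApiKey plus each target header with the UJB- prefix."""
    headers = {"ApiKey": api_key}
    for name, value in target_headers.items():
        headers["UJB-" + name] = value
    return headers


# Basic credentials are "user:password" encoded as base64.
credentials = base64.b64encode(b"username:password").decode("ascii")
print(forwardable_headers("<API Key>", {
    "User-Agent": "Custom user agent",
    "Authorization": "Basic " + credentials,
}))
```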

Using Proxies

We realize your scraping activities might be blocked once in a while. To help you achieve the most success, we developed a multi-tiered proxy offering that lets you select the proxy type that best fits your needs.

Your API calls go through our rotating proxy by default. The default proxy uses top IPs that will get the job done most of the time. If the default rotating proxy works well for your needs, you don't need to do anything. For tougher URLs, set proxy_type to one of the following options:

  • rotating: Default.
  • advanced: US IPs only.
  • premium: US IPs only. Premium proxies which work well with social media and shopping sites.
  • residential: Geo-targeted residential IPs which work on "tough" sites that aren't accessible with the other options. Please note that data scraped via non-US residential IPs is currently metered once a request exceeds 1MB. Keep in mind that all assets associated with an HTML page count toward this total, not just the HTML itself. US residential IPs are not metered. Please refer to the Credits section for more details.
  • custom: Set your own proxy. See the custom proxy section.

Tip: We won't bill for failing requests that aren't 404s.

Info: A request's length also includes assets downloaded with the page when JS rendering is on.

Info: To use a premium proxy from a specific country, set the proxy_country parameter to the ISO 3166-1 alpha-2 code of one of the following countries:

Supported countries
  • Algeria: DZ
  • Angola: AO
  • Benin: BJ
  • Botswana: BW
  • Burkina Faso: BF
  • Burundi: BI
  • Cameroon: CM
  • Central African Republic: CF
  • Chad: TD
  • Democratic Republic of the Congo: CD
  • Djibouti: DJ
  • Egypt: EG
  • Equatorial Guinea: GQ
  • Eritrea: ER
  • Ethiopia: ET
  • Gabon: GA
  • Gambia: GM
  • Ghana: GH
  • Guinea: GN
  • Guinea Bissau: GW
  • Ivory Coast: CI
  • Kenya: KE
  • Lesotho: LS
  • Liberia: LR
  • Libya: LY
  • Madagascar: MG
  • Malawi: MW
  • Mali: ML
  • Mauritania: MR
  • Morocco: MA
  • Mozambique: MZ
  • Namibia: NA
  • Niger: NE
  • Nigeria: NG
  • Republic of the Congo: CG
  • Rwanda: RW
  • Senegal: SN
  • Sierra Leone: SL
  • Somalia: SO
  • Somaliland: ML
  • South Africa: ZA
  • South Sudan: SS
  • Sudan: SD
  • Swaziland: SZ
  • Tanzania: TZ
  • Togo: TG
  • Tunisia: TN
  • Uganda: UG
  • Western Sahara: EH
  • Zambia: ZM
  • Zimbabwe: ZW
  • Afghanistan: AF
  • Armenia: AM
  • Azerbaijan: AZ
  • Bangladesh: BD
  • Bhutan: BT
  • Brunei: BN
  • Cambodia: KH
  • China: CN
  • East Timor: TL
  • Hong Kong: HK
  • India: IN
  • Indonesia: ID
  • Iran: IR
  • Iraq: IQ
  • Israel: IL
  • Japan: JP
  • Jordan: JO
  • Kazakhstan: KZ
  • Kuwait: KW
  • Kyrgyzstan: KG
  • Laos: LA
  • Lebanon: LB
  • Malaysia: MY
  • Maldives: MV
  • Mongolia: MN
  • Myanmar: MM
  • Nepal: NP
  • North Korea: KP
  • Oman: OM
  • Pakistan: PK
  • Palestine: PS
  • Philippines: PH
  • Qatar: QA
  • Saudi Arabia: SA
  • Singapore: SG
  • South Korea: KR
  • Sri Lanka: LK
  • Syria: SY
  • Taiwan: TW
  • Tajikistan: TJ
  • Thailand: TH
  • Turkey: TR
  • Turkmenistan: TM
  • United Arab Emirates: AE
  • Uzbekistan: UZ
  • Vietnam: VN
  • Yemen: YE
  • Albania: AL
  • Andorra: AD
  • Austria: AT
  • Belarus: BY
  • Belgium: BE
  • Bosnia and Herzegovina: BA
  • Bulgaria: BG
  • Croatia: HR
  • Cyprus: CY
  • Czech Republic: CZ
  • Denmark: DK
  • Estonia: EE
  • Finland: FI
  • France: FR
  • Germany: DE
  • Gibraltar: GI
  • Greece: GR
  • Hungary: HU
  • Iceland: IS
  • Ireland: IE
  • Italy: IT
  • Kosovo: XK
  • Latvia: LV
  • Liechtenstein: LI
  • Lithuania: LT
  • Luxembourg: LU
  • Macedonia: MK
  • Malta: MT
  • Moldova: MD
  • Monaco: MC
  • Montenegro: ME
  • Netherlands: NL
  • Northern Cyprus: CY
  • Norway: NO
  • Poland: PL
  • Portugal: PT
  • Romania: RO
  • Russia: RU
  • San Marino: SM
  • Serbia: RS
  • Slovakia: SK
  • Slovenia: SI
  • Spain: ES
  • Sweden: SE
  • Switzerland: CH
  • Ukraine: UA
  • United Kingdom: GB
  • Bahamas: BS
  • Belize: BZ
  • Bermuda: BM
  • Canada: CA
  • Costa Rica: CR
  • Cuba: CU
  • Dominican Republic: DO
  • El Salvador: SV
  • Greenland: GL
  • Guatemala: GT
  • Haiti: HT
  • Honduras: HN
  • Jamaica: JM
  • Nicaragua: NI
  • Panama: PA
  • Puerto Rico: PR
  • Trinidad And Tobago: TT
  • United States: US
  • Australia: AU
  • Fiji: FJ
  • New Caledonia: NC
  • New Zealand: NZ
  • Papua New Guinea: PG
  • Solomon Islands: SB
  • Vanuatu: VU
  • Argentina: AR
  • Bolivia: BO
  • Brazil: BR
  • Chile: CL
  • Colombia: CO
  • Ecuador: EC
  • Falkland Islands: FK
  • French Guiana: GF
  • Guyana: GY
  • Mexico: MX
  • Paraguay: PY
  • Peru: PE
  • Suriname: SR
  • Uruguay: UY
  • Venezuela: VE

Using Ujeebu Extract with your own proxy

To use your own proxy, set the proxy_type parameter to custom, set the custom_proxy parameter to your proxy in the format scheme://host:port, and pass the proxy credentials in the custom_proxy_username and custom_proxy_password parameters.

Info: If you use the GET HTTP verb and custom_proxy_password contains special characters, URL-encode it before passing it.

curl -i \
-H 'ApiKey: <API Key>' \
-X GET \
'https://api.ujeebu.com/scrape?url=https://ipinfo.io&response_type=raw&proxy_type=custom&custom_proxy=http://proxyhost:8889&custom_proxy_username=user&custom_proxy_password=pass'
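
The URL-encoding advice above can be sketched in Python: `urllib.parse.urlencode` percent-encodes every value, including special characters in the password. The `custom_proxy_query` helper, the proxy host, and the sample password are hypothetical.

```python
import urllib.parse


def custom_proxy_query(target_url, proxy, username, password):
    """Return an encoded query string for a request routed through a custom proxy."""
    query = {
        "url": target_url,
        "proxy_type": "custom",
        "custom_proxy": proxy,
        "custom_proxy_username": username,
        "custom_proxy_password": password,
    }
    # Characters such as '@', '&' and '#' in the password are percent-encoded here.
    return urllib.parse.urlencode(query)


query_string = custom_proxy_query(
    "https://ipinfo.io", "http://proxyhost:8889", "user", "p@ss&word#1"
)
print(query_string)
```

Without the encoding step, the `&` in this password would be read as the start of a new parameter.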

Credits

Credit cost per request:

| Proxy Type | No JS | w/ JS | Geo Targeting | Metered |
| --- | --- | --- | --- | --- |
| rotating | 5 | 10 | US | No |
| advanced | 10 | 15 | US | No |
| premium | 12 | 17 | US | No |
| residential (US) | 30 | 35 | US | No |
| residential | 10 | 20 | Multiple countries | +10 credits per MB after 1MB |
| custom | 5 | 10 | Custom | No |

Info: Consumed credits are returned in the Ujb-credits header.
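
For budgeting, the credit table above can be turned into a rough estimator, sketched below in Python. The `residential_us` key and the `estimate_credits` helper are hypothetical names, and rounding each started megabyte up is an assumption; the `Ujb-credits` response header remains the authoritative figure.

```python
import math

# (credits without JS, credits with JS) per proxy type, copied from the table above.
CREDITS = {
    "rotating": (5, 10),
    "advanced": (10, 15),
    "premium": (12, 17),
    "residential_us": (30, 35),  # hypothetical key for the "residential (US)" row
    "residential": (10, 20),
    "custom": (5, 10),
}


def estimate_credits(proxy_type, js=False, response_mb=0.0):
    """Estimate the credits one request consumes."""
    no_js, with_js = CREDITS[proxy_type]
    cost = with_js if js else no_js
    if proxy_type == "residential" and response_mb > 1:
        # Assumption: every started MB beyond the first adds 10 credits.
        cost += 10 * math.ceil(response_mb - 1)
    return cost


print(estimate_credits("premium", js=True))
print(estimate_credits("residential", js=True, response_mb=2.5))
```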

Article Preview API

Introduction

Extracts a preview of an article (article card). This is faster than the extract endpoint because it doesn't analyze the article content in depth and instead relies mostly on the article's meta tags.

To use the API, subscribe to a plan here and connect to:


GET https://api.ujeebu.com/card

Parameters

| Name | Type | Description | Default Value |
| --- | --- | --- | --- |
| url (REQUIRED) | string | URL of article to be extracted. | - |
| js | boolean | indicates whether to execute JavaScript or not, set to 'auto' to let the extractor decide. | false |
| timeout | number | maximum number of seconds before request timeout. | 60 |
| js | boolean | indicates whether to execute JavaScript or not. | true |
| js_timeout | number | when js is enabled, indicates how many seconds the API should wait for the JS engine to render the supplied URL. | 60 |
| proxy_type | string | indicates type of proxy to use. Possible values: 'rotating', 'advanced', 'premium', 'residential', 'custom'. | rotating |
| proxy_country | string | country ISO 3166-1 alpha-2 code to proxy from. Valid only when premium proxy type is chosen. | US |
| custom_proxy | string | URI for your custom proxy in the following format: scheme://user:pass@host:port. applicable and required only if proxy_type=custom | null |
| auto_proxy | string | enable a more advanced proxy by default when rotating proxy is not working. It will move to the next proxy option until it gets the content and will only stop when content is available or none of the options worked. Please note that you are billed only on the top option attempted. | false |
| session_id | alphanumeric | alphanumeric identifier with a length between 1 and 16 characters, used to route multiple requests from the same proxy instance. Sessions remain active for 30 minutes | null |
| UJB-headerName | string | indicates which headers to send to target URL. This can be useful when article is behind a paywall for example, and that you need to pass your authentication cookies. | null |

Responses

| Status | Meaning | Description | Schema |
| --- | --- | --- | --- |
| 200 | OK | successful operation | SuccessResponse |
| 400 | Bad Request | Invalid parameter value | APIResponseError |

Schemas

Article Card Schema

{
"url": "string",
"lang": "string",
"favicon": "string",
"title": "string",
"summary": "string",
"author": "string",
"date_published": "string",
"date_modified": "string",
"image": "string",
"site_name": "string"
}

Properties

| Name | Type | Description |
| --- | --- | --- |
| url | string | the URL parameter. |
| lang | string | the language of the article. |
| favicon | string | Domain favicon. |
| title | string | the title of the article. |
| summary | string | the description of the article. |
| author | string | the author of the article. |
| date_published | string | the publish date of the article. |
| date_modified | string | the modified date of the article. |
| image | string | the main image of the article. |
| site_name | string | the site name of the article. |

Error Response Schema

{
"url": "string",
"message": "string",
"errors": ["string"]
}

Properties

| Name | Type | Description |
| --- | --- | --- |
| url | string | Given URL |
| message | string | Error message |
| errors | [string] | List of all errors |

Code Example

This will get the meta info for article https://ujeebu.com/blog/how-to-extract-clean-text-from-html/

curl --location --request GET 'https://api.ujeebu.com/card?url=https://ujeebu.com/blog/how-to-extract-clean-text-from-html/&js=0' \
--header 'ApiKey: <API Key>'

The code above will generate the following response:

{
"author": "Sam",
"title": "Extracting clean data from blog and news articles",
"summary": "Several open source tools allow the extraction of clean text from article HTML. We list the most popular ones below, and run a benchmark to see how they stack up against the Ujeebu API",
"date_published": "2019-08-09 12:42:25",
"date_modified": "2021-05-02 20:22:34",
"favicon": ":///blog/favicon.png",
"charset": "utf-8",
"image": "https://ujeebu.com/blog/content/images/2021/05/ujb-blog-benchmark.png",
"lang": "en",
"keywords": [],
"site_name": "Ujeebu blog",
"time": 1.501387119293213
}
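
The sample response carries keys that the Article Card Schema does not list (`charset`, `keywords`, `time`) and omits `url`, so a typed client may want to tolerate both cases. The Python sketch below does that; the `ArticleCard` dataclass and `card_from_json` helper are hypothetical names, not part of any official client.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArticleCard:
    """Fields listed in the Article Card Schema; all optional in this sketch."""
    url: Optional[str] = None
    lang: Optional[str] = None
    favicon: Optional[str] = None
    title: Optional[str] = None
    summary: Optional[str] = None
    author: Optional[str] = None
    date_published: Optional[str] = None
    date_modified: Optional[str] = None
    image: Optional[str] = None
    site_name: Optional[str] = None


def card_from_json(payload):
    """Build an ArticleCard, ignoring keys outside the schema (e.g. charset, time)."""
    known = {field.name for field in fields(ArticleCard)}
    return ArticleCard(**{k: v for k, v in payload.items() if k in known})


card = card_from_json({
    "author": "Sam",
    "title": "Extracting clean data from blog and news articles",
    "date_published": "2019-08-09 12:42:25",
    "lang": "en",
    "charset": "utf-8",
    "time": 1.501387119293213,
})
print(card.title, card.lang, card.url)
```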

Usage endpoint

Introduction

To track your credit usage programmatically, call the /account endpoint from your program.

Calls to this endpoint do not count toward your calls-per-second limit, but you can make only 10 /account calls per minute.

To use the API:


GET https://api.ujeebu.com/account

Usage Endpoint Code Example

This will get the current usage of the account with the given ApiKey

curl --location --request GET 'https://api.ujeebu.com/account' \
--header 'ApiKey: <API Key>'

The code above will generate the following response:

{
"balance": 0,
"days_till_next_billing": 0,
"next_billing_date": null,
"plan": "TRIAL",
"quota": "5000",
"requests_per_second": "0",
"concurrent_requests": 1,
"total_requests": 14,
"used": 95,
"used_percent": 1.9,
"userid": "8155"
}
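
A small Python sketch of working with this response follows. It assumes `quota` and `used` are expressed in the same unit, which is consistent with the sample's `used_percent` of 1.9; the `usage_summary` helper is a hypothetical name.

```python
def usage_summary(account):
    """Derive remaining quota and percentage used from an /account response."""
    quota = int(account["quota"])  # returned as a string in the sample response
    used = account["used"]
    return {
        "remaining": quota - used,
        "used_percent": round(100 * used / quota, 1) if quota else 0.0,
    }


sample = {"plan": "TRIAL", "quota": "5000", "used": 95, "used_percent": 1.9}
print(usage_summary(sample))
```

Because only 10 /account calls are allowed per minute, a program that polls usage may want to cache this summary rather than recompute it on every request.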