API Reference
astel.agent
User agent for processing domain rules, thus allowing the crawler to fetch the pages without getting blocked.

UserAgent
A user agent for processing domain rules so that the crawler can respect them.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the user agent | required |
Source code in astel/agent.py
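A minimal usage sketch (the domain and robots.txt content below are illustrative, and the commented results are the expected outcomes rather than guaranteed output):

```python
from astel.agent import UserAgent

# Illustrative robots.txt content.
robots_txt = """
User-agent: *
Disallow: /private
Crawl-delay: 2
"""

agent = UserAgent("my-crawler")
agent.respect("example.com", robots_txt)

agent.can_access("example.com", "https://example.com/private")  # expected: False
agent.can_access("example.com", "https://example.com/blog")     # expected: True
agent.get_crawl_delay("example.com")                            # expected: "2"
```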
acknowledged_domains: List[str] (property)
The domains that have been acknowledged by the user agent.

can_access(domain, url)
Determines whether the given URL can be accessed by the user agent for the specified domain.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain of the URL. | required |
url | str | A string representing the URL to access. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | A boolean indicating whether the URL can be accessed for the specified domain. |
Source code in astel/agent.py
get_crawl_delay(domain)
Return the crawl delay for the given domain if it has been acknowledged, and None otherwise.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain to check the crawl delay for. | required |
Returns:
Type | Description |
---|---|
Union[str, None] | A string representing the crawl delay for the given domain if it has been acknowledged, and None otherwise. |
Source code in astel/agent.py
get_request_rate(domain)
Return the request rate of the given domain if it has been acknowledged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain whose request rate is sought. | required |
Returns:
Type | Description |
---|---|
Union[RequestRate, None] | An instance of RequestRate for the given domain if it has been acknowledged, and None otherwise. |
Source code in astel/agent.py
get_site_maps(domain)
Return the site maps associated with the given domain if the domain is acknowledged, otherwise return None.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain to retrieve site maps for. | required |
Returns:
Type | Description |
---|---|
Union[list[str], None] | A list of strings representing the site maps associated with the domain, or None if the domain has not been acknowledged. |
Source code in astel/agent.py
respect(domain, robots_txt)
Process the rules in the given robots.txt content and associate them with the given domain, if the domain has not already been acknowledged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain to be acknowledged. | required |
robots_txt | str | A string representing the content of the robots.txt file. | required |
Source code in astel/agent.py
astel.crawler
Crawler module.
This module defines the Crawler class that can be used to crawl websites asynchronously.

Crawler
An asynchronous web crawler that can be used to extract, process, and follow links in webpages.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
urls | Iterable[str] | The URLs to start the crawler with. | required |
options | CrawlerOptions | The options to use for the crawler. | None |
Source code in astel/crawler.py
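A minimal sketch of starting the crawler (the seed URL is illustrative; options left unset are assumed to be filled in with defaults by the library):

```python
import asyncio

from astel.crawler import Crawler

async def main() -> None:
    # Start crawling from one or more seed URLs.
    crawler = Crawler(["https://example.com"])
    await crawler.run()
    print(crawler.done)         # URLs that have been crawled
    print(crawler.total_pages)  # total number of pages queued

asyncio.run(main())
```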
agent: str (property)
The user agent used by the crawler.

done: set[str] (property)
The URLs that have been crawled by the crawler.

limit: int (property)
The maximum number of pages to crawl. It is used as a fail-safe to prevent the crawler from running indefinitely.

num_workers: int (property)
The number of worker tasks used by the crawler.

options: CrawlerOptions (writable property)
The options used by the crawler.

parser_factory: ParserFactory (property)
The parser factory object used by the crawler to parse HTML responses.

rate_limiter: limiters.RateLimiter (property)
The rate limiter used by the crawler.

start_urls: Set[str] (property)
The URLs that the crawler was started with.

total_pages: int (property)
The total number of pages queued by the crawler.

urls_seen: set[parsers.Url] (property)
The URLs that have been seen by the crawler.
filter(*args, **kwargs)
Add URL filters to the crawler.
Filters can be used to determine which URLs should be ignored.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*args | Filter | A list of Filter objects to add to the crawler. | () |
**kwargs | Any | Keyword arguments from which to create filters (see astel.filters.create_from_kwarg). | {} |
Returns:
Name | Type | Description |
---|---|---|
Crawler | Self | The Crawler instance. |
Raises:
Type | Description |
---|---|
ValueError | If a filter could not be created from the given keyword arguments. |
Examples:
>>> crawler.filter(filters.StartsWith("scheme", "http"))
>>> crawler.filter(filters.Matches("https://example.com"))
>>> crawler.filter(domain__in=["example.com"])
Source code in astel/crawler.py
on(event, handler)
Add an event handler to the crawler.
An event is emitted when:
- a request is ready to be sent (Event.REQUEST): the httpx.Request object is passed to the handler.
- a response is received (Event.RESPONSE): the httpx.Response object is passed to the handler.
- an error occurs (Event.ERROR): the Error object is passed to the handler.
- a URL is done being processed (Event.DONE): the astel.parsers.Url object is passed to the handler.
- a URL is found in a page (Event.URL_FOUND): the astel.parsers.Url object is passed to the handler.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
event | str | The event to add the handler to. | required |
handler | Callable | The handler to add to the event. | required |
Source code in astel/crawler.py
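A hedged sketch of registering a response handler. The import path of Event is an assumption (it is referenced above but not defined in this section), and the handler signature follows the astel.events module docstring, which says handlers receive the event data plus the current crawler through the crawler kwarg:

```python
import httpx

from astel.crawler import Crawler
from astel.events import Event  # assumed import path for the Event enum referenced above

def log_response(response: httpx.Response, *, crawler: Crawler) -> None:
    # Handlers receive the event data and the current crawler (via the `crawler` kwarg).
    print(response.status_code, response.url)

crawler = Crawler(["https://example.com"])
crawler.on(Event.RESPONSE, log_response)
```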
parse_site_map(site_map_path) (async)
Parse a sitemap.xml file and return the URLs found in it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
site_map_path | str | The URL of the sitemap.xml file. | required |
Returns:
Type | Description |
---|---|
Set[parsers.Url] | The URLs found in the sitemap.xml file. |
Source code in astel/crawler.py
reset()
Reset the crawler.
Source code in astel/crawler.py
retry(handler)
Set a handler to determine whether a request should be retried.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
handler | Callable | A function that determines whether a request should be retried (see RetryHandler in astel.options). | required |
Returns:
Name | Type | Description |
---|---|---|
Crawler | Self | The Crawler instance. |
Source code in astel/crawler.py
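A hedged sketch of a retry handler. The exact handler signature is not spelled out in this reference; here it is assumed to receive the httpx.Response and return a bool:

```python
import httpx

from astel.crawler import Crawler

def retry_on_server_error(response: httpx.Response) -> bool:
    # Assumed signature: retry whenever the server answered with a 5xx status.
    return response.status_code >= 500

crawler = Crawler(["https://example.com"]).retry(retry_on_server_error)
```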
run() (async)
Run the crawler.
Source code in astel/crawler.py
stop(*, reset=False)
Stop the crawler's current execution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reset | bool | Optionally, reset the crawler on the same call. Defaults to False. | False |
Source code in astel/crawler.py
astel.errors

Error
Bases: Exception
Base class for exceptions in this package.
Source code in astel/errors.py
InvalidConfigurationError
Bases: Error
Raised when a rate limiter configure call is invalid.
Source code in astel/errors.py
astel.events
Event handlers for the crawler.
This module defines the event handlers that can be used to perform some action when a specific event occurs, like storing information about the pages crawled, logging errors, or stopping the execution.
The handlers are called with the current Crawler instance (passed through the crawler kwarg) and the event data.

DoneHandler
Bases: Protocol
Handler for when a crawler finishes processing a URL.
Source code in astel/events.py
ErrorHandler
Bases: Protocol
Handler for errors that occur during a crawler execution.
Source code in astel/events.py
EventEmitter
Bases: Protocol
Protocol for an event emitter.
Source code in astel/events.py
RequestHandler
Bases: Protocol
Handler for requests made by a crawler.
Source code in astel/events.py
ResponseHandler
Bases: Protocol
Handler for responses received by a crawler.
Source code in astel/events.py
UrlFoundHandler
Bases: Protocol
Handler for when a URL is found in a page.
Source code in astel/events.py
astel.filters
Filters for URLs.
Some URLs in a webpage may not be relevant to your use cases. This module defines the filters that can be used to filter out URLs from the crawler's execution based on their properties.

CallableFilter
Bases: Protocol
Callable filter interface.
Source code in astel/filters.py
Contains
Bases: TextFilter
Filter URLs based on a text substring.
Examples:
>>> from astel.filterers.filters import Contains
>>> domain_contains = Contains("domain", "example")
>>> domain_contains.filter(ParsedUrl(domain="https://example.com", ...)) # True
Source code in astel/filters.py
EndsWith
Bases: TextFilter
Filter URLs based on a text suffix.
Examples:
>>> from astel.filterers.filters import EndsWith
>>> domain_ends_with = EndsWith("domain", ".com")
>>> domain_ends_with.filter(ParsedUrl(domain="https://example.com", ...)) # True
Source code in astel/filters.py
Filter
Bases: ABC, Generic[T]
Base class for filters.
Filters are used to determine if a URL should be processed or not. They can be combined using the bitwise operator &: filter1 & filter2 will return a new filter that passes only if both filter1 and filter2 pass.
New filters can be created by subclassing this class and implementing the _apply method (see the sketch after the source reference below).
Examples:
>>> from astel.filterers.filters import In
>>> domain_in_list = In("domain", ["example.com"])
>>> html_or_php = In(lambda url: url.path.split(".")[-1], ["html", "php"])
>>> my_filter = domain_in_list & html_or_php
Source code in astel/filters.py
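A hedged sketch of a custom filter. The base class documents that subclasses implement _apply, but its exact signature is not shown in this reference; here it is assumed to receive the value of the URL property selected at construction time and return a bool:

```python
from astel.filters import Filter

class IsHttps(Filter[str]):
    def _apply(self, value: str) -> bool:
        # Assumption: _apply receives the value of the URL property chosen at
        # construction ("scheme" below) and returns whether the URL passes.
        return value == "https"

only_https = IsHttps("scheme")
```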
__init__(url_prop, param=None, *, _inverted=False, _chained=None)
Initializes the filter with the given URL property.
Source code in astel/filters.py
filter(url)
Applies the filter to the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | Url | The URL to filter. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the URL passes the filter, False otherwise. |
Source code in astel/filters.py
In
Bases: Filter[Sequence[str]]
Filter URLs based on a group of values.
Examples:
>>> from astel.filterers.filters import In
>>> domain_in_list = In("domain", ["example.com"])
>>> domain_in_list.filter(ParsedUrl(domain="https://example.com", ...)) # True
Source code in astel/filters.py
Matches
Bases: Filter[Union[Pattern, str]]
Filter URLs based on a regular expression.
Examples:
>>> from astel.filterers.filters import Matches
>>> domain_matches = Matches("domain", r"example\..+")
>>> domain_matches.filter(ParsedUrl(domain="https://example.com", ...)) # True
Source code in astel/filters.py
StartsWith
Bases: TextFilter
Filter URLs based on a text prefix.
Examples:
>>> from astel.filterers.filters import StartsWith
>>> domain_starts_with = StartsWith("domain", "example")
>>> domain_starts_with.filter(ParsedUrl(domain="https://example.com", ...)) # True
Source code in astel/filters.py
TextFilter
Bases: Filter[str], ABC
Base class for text filters.
Filters URLs based on a text value.
Source code in astel/filters.py
create_from_kwarg(key, value)
Create a filter from a key-value pair.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to create the filter from. | required |
value | FilterParameter | The filter parameter. | required |
Returns:
Type | Description |
---|---|
Union[Filter, None] | The created filter, or None if the key is invalid. |
Source code in astel/filters.py
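A small hedged sketch, assuming the double-underscore key format shown in Crawler.filter's examples (e.g. domain__in):

```python
from astel import filters

# Build a filter from a keyword-style key, as Crawler.filter(**kwargs) does internally.
domain_filter = filters.create_from_kwarg("domain__in", ["example.com"])
if domain_filter is not None:
    ...  # use it, e.g. pass it to crawler.filter(domain_filter)
```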
astel.limiters
Rate limiting module.
Most websites have rate limits to prevent abuse and to ensure that their servers are not overloaded. This module defines the rate limiters that can be used to limit the number of requests sent to a website.

NoLimitRateLimiter
Bases: RateLimiter
A limiter that does not limit the requests. Keep in mind that sending a lot of requests per second can result in throttling or even bans.
Source code in astel/limiters.py
configure(*args, **kwargs)
Does nothing.
Source code in astel/limiters.py
limit() (async)
Asynchronously sleeps for 0 seconds.
Source code in astel/limiters.py
PerDomainRateLimiter
Bases: RateLimiter
Limit the number of requests per domain, using the domain's specified limiter instance if given and the default limiter otherwise.
Source code in astel/limiters.py
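A hedged usage sketch. The constructor arguments are not documented in this section, so the limiter is created with no arguments here and a per-domain limiter is registered explicitly:

```python
from astel.limiters import PerDomainRateLimiter, StaticRateLimiter

limiter = PerDomainRateLimiter()  # assumption: constructible without arguments
# Wait one second between requests to example.com.
limiter.add_domain("example.com", StaticRateLimiter(1.0))

async def fetch(url: str) -> None:
    await limiter.limit(url)  # delays according to the URL's domain
    ...  # perform the request
```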
add_domain(domain, limiter=None)
Adds a new domain to the limited domains, with an optional rate limiter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
domain | str | A string representing the domain name to add. | required |
limiter | RateLimiter | An optional RateLimiter instance to use for the domain. | None |
Raises:
Type | Description |
---|---|
InvalidUrlError | If the given URL does not contain a valid domain. |
Source code in astel/limiters.py
configure(config)
Configures the rate at which requests are made to a domain by defining its corresponding limiter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | RateLimiterConfig | The configuration to apply. | required |
Raises:
Type | Description |
---|---|
InvalidConfigurationError | If the new computed token rate is less than or equal to 0. |
Source code in astel/limiters.py
extract_domain(url) (staticmethod)
Extracts the domain from a given URL.
Returns:
Name | Type | Description |
---|---|---|
str | str | A string representing the domain name extracted from the URL. |
Source code in astel/limiters.py
limit(url) (async)
Limit the requests to the given URL by its domain.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to limit. | required |
Raises:
Type | Description |
---|---|
InvalidConfigurationError | If no limiter is found for the domain. |
Source code in astel/limiters.py
RateLimiter
Bases: ABC
Base class for rate limiters.
Source code in astel/limiters.py
configure(config) (abstractmethod)
Configures the rate limiter to respect the rules defined by the domain with the given parameters.
In the case of a crawl delay, the crawl delay is ignored.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | RateLimiterConfig | The configuration to apply. | required |
Source code in astel/limiters.py
limit(*args, **kwargs) (abstractmethod, async)
Asynchronously limits the specified URL.
Source code in astel/limiters.py
RateLimiterConfig
Bases: TypedDict
Rate limiting configuration.
Attributes:
Name | Type | Description |
---|---|---|
domain | str | The domain to crawl. |
crawl_delay | str | A string representing the delay between each crawl. |
Source code in astel/limiters.py
StaticRateLimiter
Bases: RateLimiter
Limit the number of requests per second by waiting for a specified amount of time between requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
time_in_seconds | float | The amount of time to wait between requests. | required |
Source code in astel/limiters.py
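A minimal sketch of plugging this limiter into a crawler through the rate_limiter option (the one-second delay is illustrative, and it is assumed that options left unset are merged with defaults):

```python
from astel.crawler import Crawler
from astel.limiters import StaticRateLimiter

# Wait one second between consecutive requests.
crawler = Crawler(
    ["https://example.com"],
    options={"rate_limiter": StaticRateLimiter(1.0)},
)
```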
limit() (async)
Limit by waiting for the specified amount of time.
Source code in astel/limiters.py
TokenBucketRateLimiter
Bases: RateLimiter
Limit the requests by using the token bucket algorithm.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokens_per_second | float | The amount of tokens to add to the bucket per second. | required |
Source code in astel/limiters.py
configure(config)
Configures the rate at which requests are made to a domain by setting the tokens per second.
Source code in astel/limiters.py
consume(tokens=1)
Check if the given number of tokens can be consumed and decrease the number of available tokens if possible.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokens | int | The number of tokens to consume. Default is 1. | 1 |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the tokens could be consumed, False otherwise. |
Source code in astel/limiters.py
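A short hedged sketch of the token bucket in isolation (the refill rate is illustrative):

```python
from astel.limiters import TokenBucketRateLimiter

limiter = TokenBucketRateLimiter(5.0)  # refill the bucket with 5 tokens per second

if limiter.consume():   # take one token if available
    ...  # send a request
if limiter.consume(3):  # or take several tokens at once
    ...  # send a burst of requests
```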
astel.options
Options module.
This module defines the options that can be used to configure the crawler's behavior.

CrawlerOptions
Bases: TypedDict
Crawler options.
Attributes:
Name | Type | Description |
---|---|---|
client | AsyncClient | An instance of httpx.AsyncClient to use for the requests. |
workers | int | The number of worker tasks to run in parallel. |
limit | int | The maximum number of pages to crawl. |
user_agent | str | The user agent to use for the requests. |
parser_factory | ParserFactory | A factory function to create a parser instance. |
rate_limiter | RateLimiter | The rate limiter to limit the number of requests sent per second. |
event_limiter_factory | Callable[[], EventEmitter] | A factory function to create an event limiter for the crawler. |
retry_for_status_codes | list[int] | A list of status codes for which the crawler should retry the request. |
Source code in astel/options.py
ParserFactory
Bases: Protocol
Callable that creates a parser instance.
Source code in astel/options.py
RetryHandler
Bases: Protocol
Callable that determines whether the crawler should retry the request.
Source code in astel/options.py
merge_with_default_options(options=None)
Merge the given options with the default options.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
options | CrawlerOptions | The options to merge. | None |
Returns:
Name | Type | Description |
---|---|---|
CrawlerOptions | CrawlerOptions | The merged options. |
Source code in astel/options.py
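A small sketch of filling in defaults for a partial options dict (the keys shown are taken from CrawlerOptions above):

```python
from astel.options import merge_with_default_options

# Only override a couple of options; the rest come from the library defaults.
options = merge_with_default_options({"workers": 4, "limit": 100})
```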
astel.parsers
Parsers for extracting links from webpages and sitemaps.
This module defines the parsers that can be used to extract the links from the content of a webpage or a sitemap.

BaseParser
Bases: InitParserMixin, ABC
Base class to be used for implementing new parser classes.
Source code in astel/parsers.py
HTMLAnchorsParser
Bases: InitParserMixin, HTMLParser
A parser that extracts the URLs from a webpage and filters them out with the given filterer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
base | str | The base URL to use to resolve relative URLs. | None |
Source code in astel/parsers.py
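A hedged sketch of parsing a snippet of HTML. The found_links attribute name is taken from the Parser.feed documentation below:

```python
from astel.parsers import HTMLAnchorsParser

parser = HTMLAnchorsParser(base="https://example.com")
parser.feed('<a href="/about">About</a> <a href="https://example.org">Other</a>')

# The links found in the snippet, resolved against the base URL.
print(parser.found_links)
```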
InitParserMixin
Helper mixin to initialize the parser with a base URL.
Source code in astel/parsers.py
Parser
Bases: Protocol
Parses the content of a file (webpages or sitemaps, for example) to extract the links of interest.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
base | Union[str, None] | The base URL to use to resolve relative URLs. Defaults to None. | None |
Source code in astel/parsers.py
feed(text)
Process the content of a website and update the found_links attribute.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The content of the website | required |
Source code in astel/parsers.py
reset(base=None)
Reset the parser to its initial state.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
base | Union[str, None] | The base URL to use to resolve relative URLs. Defaults to None. | None |
Source code in astel/parsers.py
SiteMapParser
Bases: InitParserMixin
Parses a sitemap file to extract the links of interest.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
base | str | The base URL to use to resolve relative URLs. | None |
Source code in astel/parsers.py
Url
Bases: Protocol
Model of a URL for the library to work with.
Source code in astel/parsers.py
parse_url(url, base=None)
Parse a URL into its components.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to parse | required |
base | str | The base URL to use to resolve relative URLs. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
Url | Url | The parsed URL |
Source code in astel/parsers.py
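A short hedged sketch. The Url attributes accessed below (domain, path) are assumed from the filter examples earlier in this reference:

```python
from astel.parsers import parse_url

url = parse_url("/about", base="https://example.com")
print(url.domain)  # expected: "example.com" (attribute assumed from the filters examples)
print(url.path)    # expected: "/about"
```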