Parser Recipes

Practical extractor recipes for plain text, HTML selectors, JSON paths, and noisy proxy sources.

Proxy sources are rarely polite. One page gives you a clean host:port list, another hides the values in a table, another returns JSON objects with ip and port split apart, and another gives you a huge page full of unrelated numbers. The extractor is built for that mess, but the trick is choosing the parser that matches the shape of the source.

The safest workflow is simple: run one source first, look at the parser and source metrics, then scale the same setup to a larger source list. If the first run is noisy, do not immediately throw the source away. Tighten the selector, JSON path, or dedupe mode until the source either becomes useful or proves it is not worth keeping.

Choose A Parser

Use Plain text for raw lists, pasted blobs, and files that already contain proxy-shaped strings. This is the fastest path when the source is mostly host:port, scheme://host:port, or credentialed proxy lines.

Use HTML when the proxy data is visible on a page but mixed into markup. Tables, code blocks, text areas, list items, and useful data-* attributes are all good HTML parser candidates.

Use JSON when the source is an API-style response. JSON can work with arrays of full proxy strings or with object lists where host, port, scheme, username, and password are separate fields.

Use Auto for the first look at an unknown source. Auto is great for discovery, but if you plan to reuse the source, convert a good Auto result into an explicit parser setup so future runs are easier to reason about.
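If you want a rough feel for the shape of a source before committing to a parser, a small sniffing heuristic can help. The sketch below is not how Auto works internally. The function name guess_parser and its rules are assumptions made purely for illustration.

```python
import json
import re

# Tags that commonly wrap proxy data on HTML pages (an illustrative list).
HTML_HINT = re.compile(
    r"<(html|body|table|tr|td|div|pre|code|textarea|ul|li)\b", re.IGNORECASE
)


def guess_parser(body: str) -> str:
    """Guess which parser fits a response body: 'json', 'html', or 'plain'."""
    text = body.strip()
    # A JSON body starts with an object or array and must parse cleanly.
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass
    # Any recognizable tag is treated as a sign of markup.
    if HTML_HINT.search(text):
        return "html"
    # Everything else is handled as raw text.
    return "plain"
```

A guess like this is only a starting point. Once it points you at a parser, set that parser explicitly so later runs do not depend on the heuristic.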

Plain Text Recipe

Start with Plain text when the source is already close to a proxy list. Paste a small sample, choose Host + port dedupe for mixed schemes, and enable Prefer strongest if the same endpoint appears in several forms.

http://192.0.2.10:8080
socks5://198.51.100.25:1080
203.0.113.44:3128:user:pass

After the run, compare Candidate count, Duplicate count, and Proxy count. A high candidate count with a low proxy count usually means the input includes too much unrelated text. A high duplicate count is less scary: it often means the source repeats the same endpoint with several schemes or formatting styles.

For pasted lists in common public formats, Host + port dedupe is usually the best first cleanup pass. Full dedupe, which compares the entire proxy string, is better when entries that differ only by scheme or credentials must stay separate.
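To make the difference between the two dedupe modes concrete, here is a minimal sketch of parsing proxy-shaped lines and deduplicating them. It is not the extractor's implementation. The names Proxy, parse_line, strength, dedupe, and summarize are hypothetical, the ranking behind "strongest" is a guess, and every non-empty line is counted as a candidate, which may differ from how the tool counts.

```python
import re
from typing import NamedTuple, Optional


class Proxy(NamedTuple):
    scheme: Optional[str]
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None


LINE = re.compile(
    r"""^(?:(?P<scheme>[a-z][a-z0-9]*)://)?        # optional scheme
        (?P<host>[a-z0-9.-]+):(?P<port>\d{1,5})    # host:port
        (?::(?P<user>[^:\s]+):(?P<password>\S+))?$ # optional :user:pass
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_line(line):
    """Return a Proxy for a proxy-shaped line, or None for anything else."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    port = int(match["port"])
    if not 1 <= port <= 65535:
        return None
    scheme = match["scheme"]
    return Proxy(scheme.lower() if scheme else None, match["host"], port,
                 match["user"], match["password"])


def strength(proxy):
    # Assumed ranking: an explicit scheme and credentials carry more detail.
    return (proxy.scheme is not None) + (proxy.username is not None)


def dedupe(proxies, mode="host_port", prefer_strongest=True):
    """Keep one entry per key: (host, port) or the full parsed proxy."""
    kept = {}
    for proxy in proxies:
        key = (proxy.host, proxy.port) if mode == "host_port" else proxy
        current = kept.get(key)
        if current is None or (prefer_strongest and strength(proxy) > strength(current)):
            kept[key] = proxy
    return list(kept.values())


def summarize(text, mode="host_port"):
    """Count candidates, duplicates, and surviving proxies for a pasted blob."""
    candidates = [line for line in text.splitlines() if line.strip()]
    parsed = [p for p in map(parse_line, candidates) if p is not None]
    unique = dedupe(parsed, mode)
    return {"candidates": len(candidates),
            "duplicates": len(parsed) - len(unique),
            "proxies": len(unique)}
```

Running summarize on the same sample with mode="host_port" and then mode="full" shows how many entries differ only by scheme or credentials.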

HTML Table Recipe

Use HTML when the source page has visible proxy rows. Begin with a broad selector such as:

table tbody tr

If that extracts too much, narrow it to the specific table or row class. If it extracts nothing, try the containers where proxy sites often place raw text:

pre
code
textarea
[data-proxy]
[data-host]

The HTML parser can read text, table cells, and child elements; useful attributes such as href, content, value, title, and data-*; and script blocks that look like JSON or contain proxy-shaped values. That is powerful, but it also means a broad selector can pull in noise. Use Extraction scope count to tell whether you selected the right area of the page.
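To see why a broad row selector pulls in more than proxies, it can help to walk a table by hand. The sketch below uses Python's standard html.parser module to approximate what table tbody tr and [data-proxy] would reach. It is an illustration, not the extractor's parser. RowCollector and rows_to_candidates are hypothetical names, the host and port are assumed to sit in the first two columns, and the returned row count is only a rough stand-in for an extraction scope metric.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr> plus any data-proxy attribute values."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one list of cell strings per table row
        self.attr_values = []  # values found in data-proxy attributes
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-proxy" and value:
                self.attr_values.append(value.strip())
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_to_candidates(html):
    """Return (candidate strings, number of rows that were in scope)."""
    collector = RowCollector()
    collector.feed(html)
    candidates = list(collector.attr_values)
    for cells in collector.rows:
        # Assumption: host in column one, numeric port in column two.
        if len(cells) >= 2 and cells[1].isdigit():
            candidates.append(f"{cells[0]}:{cells[1]}")
    return candidates, len(collector.rows)
```

A header row counts toward the scope but yields no candidate, which is the kind of gap between scope and output that a narrower selector closes.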

JSON Path Recipe

Use JSON when the source looks like an API response. If every item is a full proxy string, point at the string array:

data.items[*].proxy

If every item is an object, point at the object array and let the extractor infer the proxy from fields:

data.items[*]

The object inference path is useful for responses shaped like this:

{
  "host": "192.0.2.10",
  "port": 8080,
  "protocol": "http",
  "username": "demo",
  "password": "secret"
}

If a JSON path returns nothing, remove the path for one run and let Auto or the JSON parser scan the whole payload to confirm its shape. Then add the path back in a tighter form.
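The two path styles above can be sketched in a few lines. This is a minimal illustration of dotted paths with a [*] wildcard and of building a proxy string from separate fields. It is not the extractor's path engine, and the names select and to_proxy_string are hypothetical. The field names come from the sample object, and accepting scheme as an alternative to protocol is an assumption.

```python
import json


def select(node, path):
    """Resolve a dotted path such as 'data.items[*].proxy' against parsed JSON."""
    current = [node]
    for part in path.split("."):
        wildcard = part.endswith("[*]")
        key = part[:-3] if wildcard else part
        next_nodes = []
        for item in current:
            if key:
                # Skip anything that lacks the requested key.
                if not isinstance(item, dict) or key not in item:
                    continue
                item = item[key]
            if wildcard:
                # [*] fans out over every element of a list.
                if isinstance(item, list):
                    next_nodes.extend(item)
            else:
                next_nodes.append(item)
        current = next_nodes
    return current


def to_proxy_string(item):
    """Turn a full proxy string or a host/port object into one proxy string."""
    if isinstance(item, str):
        return item
    if isinstance(item, dict) and "host" in item and "port" in item:
        scheme = item.get("protocol") or item.get("scheme")
        prefix = f"{scheme}://" if scheme else ""
        auth = ""
        if item.get("username") and item.get("password"):
            auth = f"{item['username']}:{item['password']}@"
        return f"{prefix}{auth}{item['host']}:{item['port']}"
    return None


payload = json.loads(
    '{"data": {"items": [{"host": "192.0.2.10", "port": 8080, '
    '"protocol": "http", "username": "demo", "password": "secret"}]}}'
)
proxies = [to_proxy_string(item) for item in select(payload, "data.items[*]")]
```

An empty result from select is the signal to step back to the parent array and inspect what is actually there.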

Clean Up A Noisy Source

When a source is noisy, change one variable at a time. Tighten the parser first, then adjust dedupe, then adjust fetch settings.

| Problem | Better move |
| --- | --- |
| Auto found nothing useful | Force the parser that matches the source shape. |
| HTML found too much | Use a narrower selector and check extraction scope count. |
| JSON path found nothing | Select the parent array, then tighten the path again. |
| Duplicate count is high | Switch from Full to Host + port dedupe. |
| Source fails intermittently | Increase timeout and lower concurrency for the test run. |

The goal is not the biggest possible output. The goal is a repeatable setup that produces normalized proxies with a low error rate and understandable source metrics.
