How to extract href attributes from an HTML page using grep & regex

You can use a regular expression to grep for href="..." attributes in an HTML page like this:

grep-href.sh

grep -oP "(HREF|href)=\"\K.+?(?=\")"
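
The command above reads HTML from standard input, so it needs something piped into it. A minimal sketch of how it might be tried out, assuming GNU grep built with PCRE support (needed for -P); the HTML snippet and the variable names are made up for illustration:

```shell
# Hypothetical sample input; any HTML on stdin works the same way.
html='<a href="/index.html">Home</a>
<img src="/logo.png"><a HREF="/files/image.png">Image</a>'

# Print the value of every double-quoted href/HREF attribute, one per line.
# The src attribute is ignored because the pattern only looks for href.
hrefs=$(printf '%s\n' "$html" | grep -oP "(HREF|href)=\"\K.+?(?=\")")
printf '%s\n' "$hrefs"
```

With this input the command should print /index.html and /files/image.png on separate lines.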

grep is run with -o (print only the matched part of each line instead of the whole line) and -P (use the Perl-compatible regular expression engine, which is required for features like \K and lookahead assertions). The regular expression is basically

regex.txt

href=".*"

where, to stop at the first closing quote, .+ is used in non-greedy mode (.+?) instead of the greedy .*:

regex-nongreedy.txt

href=".+?"
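
To see why the non-greedy form matters, the following sketch runs both variants on a line that contains two links. The sample line and variable names are hypothetical, and GNU grep with -P support is assumed:

```shell
# Hypothetical line containing two links.
line='<a href="/a.html">A</a> <a href="/b.html">B</a>'

# Greedy: .+ runs up to the last double quote on the line, giving one long match.
greedy=$(printf '%s\n' "$line" | grep -oP 'href=".+"')

# Non-greedy: .+? stops at the first closing quote, giving one match per link.
lazy=$(printf '%s\n' "$line" | grep -oP 'href=".+?"')

printf 'greedy:\n%s\nnon-greedy:\n%s\n' "$greedy" "$lazy"
```

The greedy pattern should report a single match spanning both tags, while the non-greedy one should report href="/a.html" and href="/b.html" separately.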

This will give us hits like

example-link.html

href="/files/image.png"

Since we only want the content of the quotes (") and not the href="..." part, we can use \K, which resets the start of the reported match (similar in effect to a positive lookbehind assertion), to remove the href=" part:

regex-lookbehind.txt

href=\"\K.+?\"

but we also want to get rid of the closing double quote. To do this, we can use a positive lookahead assertion ((?=\")):

regex-lookaround.txt

href=\"\K.+?(?=\")
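
The effect of each step can be checked on a single line. This is only a sketch: the sample line and variable names are invented, single quotes are used so the double quotes in the pattern need no escaping, and GNU grep with -P support is assumed:

```shell
# Hypothetical sample line.
line='<a href="/files/image.png">Image</a>'

# Whole attribute, including href= and both quotes.
full=$(printf '%s\n' "$line" | grep -oP 'href=".+?"')

# \K drops everything matched so far (href=") from the reported match.
with_k=$(printf '%s\n' "$line" | grep -oP 'href="\K.+?"')

# The lookahead (?=") requires a closing quote but does not include it.
value=$(printf '%s\n' "$line" | grep -oP 'href="\K.+?(?=")')

printf '%s\n' "$full" "$with_k" "$value"
```

The three results should be href="/files/image.png", then /files/image.png" with the trailing quote, then the bare /files/image.png.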

Now we want to match both href and HREF to get some case insensitivity (mixed-case spellings such as Href are still not matched):

regex-case.txt

(href|HREF)=\"\K.+?(?=\")
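
As an alternative that this post does not use, grep's -i option makes the whole pattern case-insensitive, which also covers mixed-case spellings. A small comparison, with a made-up sample line and variable names, assuming GNU grep with -P support:

```shell
# Hypothetical line with a mixed-case and an upper-case attribute name.
line='<a Href="/mixed.html">x</a> <a HREF="/upper.html">y</a>'

# The alternation only knows the two spellings href and HREF.
alt=$(printf '%s\n' "$line" | grep -oP '(href|HREF)="\K.+?(?=")')

# -i ignores case for the whole pattern, so Href matches as well.
ci=$(printf '%s\n' "$line" | grep -oiP 'href="\K.+?(?=")')

printf 'alternation:\n%s\nwith -i:\n%s\n' "$alt" "$ci"
```

Note that -i applies to the entire expression, so a file extension added to the pattern later (such as .png) would then also match in upper case.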

Often we want to specifically match one file type. For example, we could match only .png:

match-png.txt

(href|HREF)=\"\K.+?\.png(?=\")

To avoid overly long matches on some pages, we want to use [^\"]+? instead of .+?:

match-png-safe.txt

(href|HREF)=\"\K[^\"]+?\.png(?=\")

This disallows matches containing " characters, which prevents a match from extending beyond a single attribute value.
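
The difference shows up when a non-PNG link precedes a PNG link on the same line. A sketch with an invented sample line and variable names, assuming GNU grep with -P support:

```shell
# Hypothetical line: the first link is not a PNG, the second one is.
line='<a href="/page.html">Page</a> <a href="/img/map.png">Map</a>'

# .+? may cross quotes, so the match starts at the first href and
# keeps growing until it finally reaches a .png followed by a quote.
loose=$(printf '%s\n' "$line" | grep -oP '(href|HREF)="\K.+?\.png(?=")')

# [^"]+? cannot cross a quote, so only the second attribute value matches.
strict=$(printf '%s\n' "$line" | grep -oP '(href|HREF)="\K[^"]+?\.png(?=")')

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

Here the loose pattern should return a string that begins with /page.html and runs through to /img/map.png, while the strict pattern should return only /img/map.png.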

Usage example:

wget-grep-png.sh

wget -qO- https://nasagrace.unl.edu/data/NASApublication/maps/ | grep -oP "(href|HREF)=\"\K[^\"]+?\.png(?=\")"

Output:

output.txt

/data/NASApublication/maps/GRACE_SFSM_20201026.png
[...]
