The Process
Choose a website to steal, and get started! For this tutorial, I will use this article from sive.rs.
Download the HTML with curl
curl https://sive.rs/confab
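If you would rather keep a local copy than print it to the terminal, curl can also write to a file; confab.html is just an example name:
# -s hides the progress meter, -o saves the HTML to a file
curl -s -o confab.html https://sive.rs/confab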
Extract the text with html2text
Install html2text with your distro's package manager.
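For example (the exact package name is an assumption and may differ between distros):
# Debian/Ubuntu
sudo apt install html2text
# Arch
sudo pacman -S html2text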
curl https://sive.rs/confab | html2text
Extra: glow!
Let's bring in a tool from my favorite FOSS group, charmbracelet.
We can simply pipe our output to glow, and watch the magic!
curl https://sive.rs/confab | html2text | glow -
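For long articles it can be nicer to page the output. A small variation, assuming your glow build has the -p (pager) flag:
# -s keeps curl quiet; -p opens glow's pager instead of dumping to the terminal
curl -s https://sive.rs/confab | html2text | glow -p -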
Projects
Pull Pictures from a Website
curl is used to fetch the whole website's HTML. Then grep finds all the .jpg links and wget downloads them. xargs is used to "execute commands from standard input", so every link grep prints becomes an argument to wget.
curl https://wallhaven.cc | grep -o -e 'http[^"]*\.jpg' | xargs wget
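If the page links a lot of images, xargs can also run several downloads at once. A sketch, assuming your xargs supports the -P flag:
# -n 1 hands wget one URL at a time, -P 4 keeps up to four downloads running
curl -s https://wallhaven.cc | grep -o -e 'http[^"]*\.jpg' | xargs -n 1 -P 4 wget -q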
Turn it into a script
This essentially wraps that command in a small script that takes the page URL as its first argument.
#!/bin/sh
# usage: ./script <url>
pics_dir="$(basename "$1")"
# remove $pics_dir if it already exists
[ -d "$pics_dir" ] && rm -rf "$pics_dir"
mkdir "$pics_dir"; cd "$pics_dir" || exit
curl "$1" | grep -o -e 'http[^"]*\.jpg' | xargs wget
Pull All Articles from a Website
This builds on the tutorial above.
Get a List of Links
- Get the HTML of the blog's index page.
- Keep only the lines that have <li> in them.
- Replace everything up to and including href= with https://sive.rs
- Replace everything after > with nothing.
- Redirect it all to links.txt
curl https://sive.rs/blog | grep '<li>' | sed "s/^.*href=/https\:\/\/sive\.rs/" | sed "s/>.*$//" > links.txt
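It's worth a quick sanity check on the list before pulling anything:
# peek at the first few links and count them
head -n 5 links.txt
wc -l links.txt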
Get All Articles
Loop through all links and pull them.
for i in $(cat links.txt); do
  curl "$i" | html2text > "$(basename "$i")" &
done
The & makes them all execute in parallel. Much faster!
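If you put that loop in a script, adding wait at the end makes the shell block until every background curl has finished. A minimal sketch:
#!/bin/sh
# fetch every article in links.txt in parallel, then wait for all of them
for i in $(cat links.txt); do
  curl -s "$i" | html2text > "$(basename "$i")" &
done
wait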
Viewer Script
ls . | fzf | xargs glow -
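fzf lets you pick one of the downloaded articles interactively, and xargs passes the pick along to glow. The same idea works as a tiny wrapper script; the name view.sh is made up:
#!/bin/sh
# pick an article with fzf, then render it with glow
article="$(ls | fzf)" || exit
glow "$article"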