The Process
Choose a website to steal, and get started! For this tutorial, I will use this article from sive.rs.
Download the HTML with curl
curl https://sive.rs/confab
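If you would rather keep a local copy than print it to the terminal, curl can also write to a file; confab.html is just an example name:
# -s hides the progress meter, -o saves the HTML to a file
curl -s -o confab.html https://sive.rs/confab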
Extract the text with html2text
Install html2text with your distro's package manager.
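For example (the exact package name is an assumption and may differ between distros):
# Debian/Ubuntu
sudo apt install html2text
# Arch
sudo pacman -S html2text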
curl https://sive.rs/confab | html2text
Extra: glow!
Let's bring in a tool from my favorite FOSS group, charmbracelet.
We can simply pipe our output to glow, and watch the magic!
curl https://sive.rs/confab | html2text | glow -
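For long articles it can be nicer to page the output. A small variation, assuming your glow build has the -p (pager) flag:
# -s keeps curl quiet; -p opens glow's pager instead of dumping to the terminal
curl -s https://sive.rs/confab | html2text | glow -p -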
Projects
Pull Pictures from a Website
curl is used to fetch the whole website's HTML. Then grep finds all the .jpg links and wget downloads them. xargs is used to "execute commands from standard input", so every link grep prints becomes an argument to wget.
curl https://wallhaven.cc | grep -o -e 'http[^"]*\.jpg' | xargs wget
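If the page links a lot of images, xargs can also run several downloads at once. A sketch, assuming your xargs supports the -P flag:
# -n 1 hands wget one URL at a time, -P 4 keeps up to four downloads running
curl -s https://wallhaven.cc | grep -o -e 'http[^"]*\.jpg' | xargs -n 1 -P 4 wget -q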
Turn it into a script
This essentially wraps that command in a small script that takes the page URL as its first argument.
#!/bin/sh
# usage: ./script <url>
pics_dir="$(basename "$1")"
# remove $pics_dir if it already exists
[ -d "$pics_dir" ] && rm -rf "$pics_dir"
mkdir "$pics_dir"; cd "$pics_dir" || exit
curl "$1" | grep -o -e 'http[^"]*\.jpg' | xargs wget
Pull All Articles from a Website
This builds on the tutorial above.
Get a List of Links
- Get the HTML of the blog's index page.
- Keep only the lines that have <li> in them.
- Replace everything up to and including href= with https://sive.rs
- Replace everything after > with nothing.
- Redirect it all to links.txt
curl https://sive.rs/blog | grep '<li>' | sed "s/^.*href=/https\:\/\/sive\.rs/" | sed "s/>.*$//" > links.txt
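It's worth a quick sanity check on the list before pulling anything:
# peek at the first few links and count them
head -n 5 links.txt
wc -l links.txt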
Get All Articles
Loop through all links and pull them.
for i in $(cat links.txt); do
  curl "$i" | html2text > "$(basename "$i")" &
done
The & makes them all execute in parallel. Much faster!
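If you put that loop in a script, adding wait at the end makes the shell block until every background curl has finished. A minimal sketch:
#!/bin/sh
# fetch every article in links.txt in parallel, then wait for all of them
for i in $(cat links.txt); do
  curl -s "$i" | html2text > "$(basename "$i")" &
done
wait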
Viewer Script
ls . | fzf | xargs glow -
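fzf lets you pick one of the downloaded articles interactively, and xargs passes the pick along to glow. The same idea works as a tiny wrapper script; the name view.sh is made up:
#!/bin/sh
# pick an article with fzf, then render it with glow
article="$(ls | fzf)" || exit
glow "$article"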