RFP General Description
This assignment consists of three tasks involving web scraping for various types of listings:
- Task #1: Scraping of Real Estate Listings
- Task #2: Scraping of Business Directory Listings
- Task #3: Scraping of Event Listings
Task #1: Scraping of Real Estate Listings
Description: Extract real estate listings from a website similar to realtor.com. An example of such a listing can be found here: https://www.realtor.com/realestateandhomes-detail/201-Ridgecrest-Dr_Greenville_SC_29609_M68057-00430?property_id=6805700430&from=ab_mixed_view_card
Approximate size: 60,000 listings.
Data Requirement: The data should be provided in a CSV file format, which can be sent via email. The CSV should include the following fields:
- URL
- Agent Phone
- Agent Email
- Agent Name
- Brokerage Name
- Acres
- Price
- Has a House?
- Bedrooms
- Bathrooms
- Square Feet
- Listing Headline/Title
- External Property Link
- Land ID URL
- Address
- ZIP Code
- City/Town
- State
- County
- Latitude
- Longitude
- Short Description
- Detailed Property Description
- Available Financing
- Annual Taxes
- Tax Rate
- Taxes Without Exemption
- Amenities
- Images
- Main Image
- YouTube Video Link
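For illustration only, the sketch below shows one way the Task #1 CSV could be laid out so that every delivery has the same column order. The function name `listings_to_csv` and the "|" separator for multiple image URLs are assumptions made for this example, not requirements of the RFP.

```python
import csv
import io

# Column order follows the field list above.
TASK1_FIELDS = [
    "URL", "Agent Phone", "Agent Email", "Agent Name", "Brokerage Name",
    "Acres", "Price", "Has a House?", "Bedrooms", "Bathrooms", "Square Feet",
    "Listing Headline/Title", "External Property Link", "Land ID URL",
    "Address", "ZIP Code", "City/Town", "State", "County", "Latitude",
    "Longitude", "Short Description", "Detailed Property Description",
    "Available Financing", "Annual Taxes", "Tax Rate",
    "Taxes Without Exemption", "Amenities", "Images", "Main Image",
    "YouTube Video Link",
]


def listings_to_csv(listings):
    """Serialize listing dicts to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=TASK1_FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for listing in listings:
        row = dict(listing)
        # Assumption: several image URLs share one cell, separated by "|".
        if isinstance(row.get("Images"), list):
            row["Images"] = "|".join(row["Images"])
        writer.writerow(row)
    return buffer.getvalue()
```

Fields that a listing does not provide are written as empty cells rather than omitted, so the header stays identical across deliveries.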
Attachments and Images: All attachments and images should be uploaded to an S3 bucket.
Process: The first round of scraping should be conducted as soon as possible.
Starting October 15th, new listings and updates to existing listings are to be delivered weekly for the next 6-10 weeks.
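As a rough sketch of how the weekly deliveries could separate new listings from updated ones, the example below compares two scrape snapshots keyed by listing URL. Using the URL as the listing identifier and a SHA-256 hash of the fields as the change signal are assumptions for this example; a vendor may prefer a different key or comparison.

```python
import hashlib
import json


def fingerprint(listing):
    """Stable hash of a listing's fields, used to detect changes."""
    canonical = json.dumps(listing, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def weekly_delta(previous, current):
    """Split this week's listings into new and updated URLs.

    Both arguments map a listing URL to its dict of scraped fields.
    """
    new = [url for url in current if url not in previous]
    updated = [
        url
        for url in current
        if url in previous
        and fingerprint(current[url]) != fingerprint(previous[url])
    ]
    return sorted(new), sorted(updated)
```

Listings that are unchanged since the previous week appear in neither list, which keeps each weekly file limited to what actually changed.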
Task #2: Scraping of Business Directory Listings
Description: Extract business listings from a website similar to allpages.com. Only Texas listings are required.
Approximate size: 200,000 to 300,000 listings.
Data Requirement: The data should be provided in a CSV file format, which can be sent via email. The CSV should include the following fields:
- URL
- Business Category
- Business Name
- Description
- County
- City/Town
- ZIP Code
- Address
- Phone Number
- Email
- Website Link
- Business Hours
Timeline: Expected completion by September 15th.
Data validation:
If you offer this service, provide a quote for validating each company website link by checking that it does not return a 4xx HTTP status. Links that return 4xx should be marked as such in the CSV file.
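The check described above could look roughly like the following sketch, which uses only the Python standard library. The function names and the flag values "ok", "4xx", and "no_response" are hypothetical; the RFP only requires that 4xx links be marked.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as HTTPError
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None  # DNS failure, refused connection, timeout, malformed URL


def link_flag(status):
    """Map a status code to the value written to the CSV (hypothetical labels)."""
    if status is None:
        return "no_response"
    if 400 <= status < 500:
        return "4xx"
    return "ok"
```

Some servers answer HEAD requests with 405 even though the page loads normally, so a vendor may want to retry such links with GET before marking them.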
If you offer this service, provide a quote for phone number validation and describe the validation methods available.
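One possible validation method is a format check, sketched below for US numbers. This is an assumption about what "validation" could mean at its simplest: it confirms that a number is well formed under the North American Numbering Plan, not that the line is in service, which would require an external lookup service.

```python
import re

# Optional country code 1, then area code and exchange that start with 2-9.
NANP = re.compile(r"^1?([2-9]\d{2})([2-9]\d{2})(\d{4})$")


def normalize_us_phone(raw):
    """Return the number as '(AAA) EEE-NNNN', or None if it is malformed."""
    digits = re.sub(r"\D", "", raw or "")
    match = NANP.match(digits)
    if match is None:
        return None
    return "({}) {}-{}".format(*match.groups())
```

Numbers that fail the check can be left in the CSV unchanged and marked in a separate column, in the same way as the website links.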
Task #3: Scraping of Event Listings
Description: Extract events from 70-100 websites similar to this one:
https://members.lufkintexas.org/events
Data Requirement: The data should be provided in a CSV file format and include the following fields:
- URL
- Listing Type
- Event Listing ID
- Event Type (Category)
- Image
- Primary Photo
- Listing Headline/Title
- State
- County
- City/Town
- ZIP Code
- Address
- Venue Name
- Map Location
- Event Start Date
- Event End Date
- Event Day of the Week
- Event Start Time
- Event End Time
- Event Description
- External Website Link
- Facebook Link
- Ticket Price
- Ticket Details
Implementation:
An automatic mechanism for data extraction must be developed. The extracted CSV files and their associated images must be automatically uploaded to an S3 bucket on a weekly basis.
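The upload step could be sketched as follows, assuming the boto3 AWS SDK for Python is used. The bucket name, the `task/date/filename` key layout, and the function names are hypothetical; the weekly trigger itself (for example a cron job) is outside this sketch.

```python
from datetime import date
from pathlib import Path


def s3_key(task, run_date, filename):
    """Build an object key of the form '<task>/<YYYY-MM-DD>/<filename>'."""
    return f"{task}/{run_date.isoformat()}/{filename}"


def upload_run(bucket, task, run_date, paths, client=None):
    """Upload the CSV and image files of one weekly run; return the keys written."""
    if client is None:
        import boto3  # third-party AWS SDK, imported only when needed

        client = boto3.client("s3")
    keys = []
    for path in map(Path, paths):
        key = s3_key(task, run_date, path.name)
        client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

Grouping objects under a dated prefix keeps each weekly delivery separate, so an earlier run is never overwritten by a later one.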
Data Quality Check and Project Acceptance Criteria
To ensure the scraped data meets quality standards, we will perform random sampling checks.
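For reference, a random sampling check of this kind can be as simple as the sketch below. The sample size and the optional seed are illustrative parameters; no specific sample size or acceptance threshold is implied here.

```python
import random


def sample_for_review(rows, sample_size, seed=None):
    """Pick rows for manual comparison against the source website.

    A fixed seed makes the selection reproducible so both parties can
    review the same rows.
    """
    rows = list(rows)
    if sample_size >= len(rows):
        return rows
    return random.Random(seed).sample(rows, sample_size)
```

Each sampled row would then be compared field by field against the live listing it was scraped from.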
RFP Response
Please provide the estimated time and price for each Task separately (Task 1, Task 2, Task 3).
Specifically for Task #3, please include the price for scraping one website.
We look forward to your detailed proposal and timeline for each task.