Data Engineer Intern - Web Crawling

About Sayari: 

Sayari is the counterparty and supply chain risk intelligence provider trusted by government agencies, multinational corporations, and financial institutions. Its intuitive network analysis platform surfaces hidden risk through integrated corporate ownership, supply chain, trade transaction and risk intelligence data from over 250 jurisdictions. Sayari is headquartered in Washington, D.C., and its solutions are used by thousands of frontline analysts in over 35 countries.


Our company culture is defined by a dedication to our mission of using open data to enhance visibility into global commercial and financial networks, a passion for finding novel approaches to complex problems, and an understanding that diverse perspectives create optimal outcomes. We embrace cross-team collaboration, encourage training and learning opportunities, and reward initiative and innovation. If you like working with supportive, high-performing, and curious teams, Sayari is the place for you.


Internship Description:

Sayari is looking for a Data Engineer Intern specializing in web crawling to join its Data Engineering team! Sayari has developed a robust web crawling project that collects hundreds of millions of documents every year from a diverse set of sources around the world. These documents serve as source records for Sayari’s flagship graph product, a global network of corporate and trade entities and relationships. As a member of Sayari's data team, your primary objective will be to maintain and improve Sayari’s web crawling framework, with an emphasis on scalability and reliability. You will work with our Product and Software Engineering teams to ensure our crawling deployment meets product requirements and integrates efficiently with our ETL pipelines.


This is a remote, paid internship with an expected workload of 20 to 30 hours a week.


Job Responsibilities:
  • Investigate and implement web crawlers for new sources
  • Maintain and improve existing crawling infrastructure
  • Improve metrics and reporting for web crawling
  • Help improve and maintain ETL processes
  • Contribute to development and design of Sayari’s data product


Required Skills & Experience:
  • Experience with Python
  • Experience managing web crawling at scale with any framework; Scrapy is a plus
  • Experience working with Kubernetes
  • Experience working collaboratively with git
  • Experience working with selectors such as: XPath, CSS, JMESPath
  • Experience with WebDev tools (Chrome/Firefox)


Desired Skills & Experience:
  • Experience with Apache projects such as Spark, Avro, NiFi, and Airflow
  • Experience with datastores Postgres and/or RocksDB
  • Experience working on a cloud platform like GCP, AWS, or Azure
  • Working knowledge of API frameworks, primarily REST
  • Understanding of or interest in knowledge graphs
  • Experience with *nix environments
  • Experience with reverse engineering
  • Proficient in bypassing anti-crawling techniques
  • Experience with JavaScript


$20 - $25 an hour
Average salary estimate

$46,800 / yearly (est.)
Min: $41,600
Max: $52,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

What You Should Know About Data Engineer Intern - Web Crawling, Sayari

Sayari is looking for an enthusiastic Data Engineer Intern specializing in web crawling to join our Data Engineering team. As a trusted provider of counterparty and supply chain risk intelligence, we serve government agencies, multinational corporations, and financial institutions. This internship offers the opportunity to contribute to a robust web crawling project that collects hundreds of millions of documents annually from diverse global sources. Your mission will be to maintain and improve our web crawling framework, focusing on scalability and reliability. You will collaborate closely with our Product and Software Engineering teams to ensure our crawling deployment meets product requirements and integrates efficiently with our ETL pipelines. We value a supportive work environment that fosters curiosity, collaboration, and continuous learning. This remote, paid internship expects a commitment of 20-30 hours a week, with a pay range of $20 - $25 per hour. To be successful, you should have experience with Python, managing web crawling at scale, and tools like Kubernetes and Git. If you’re passionate about data engineering and enjoy tackling complex challenges, Sayari is a place to grow your skills and make a real impact.

Frequently Asked Questions (FAQs) for Data Engineer Intern - Web Crawling Role at Sayari
What are the responsibilities of a Data Engineer Intern at Sayari?

The Data Engineer Intern at Sayari will focus on investigating and implementing web crawlers for new data sources, maintaining the existing crawling infrastructure, and improving metrics and reporting for web crawling. Additionally, the intern will help enhance and maintain ETL processes and contribute to the development and design of Sayari’s data products.

What skills are required for the Data Engineer Intern position at Sayari?

To excel as a Data Engineer Intern at Sayari, candidates should have experience with Python and with managing web crawling at scale in any framework; Scrapy is a plus. Experience with Kubernetes, collaborative Git workflows, selectors such as XPath, CSS, and JMESPath, and web development tools (Chrome/Firefox) is required. Proficiency in bypassing anti-crawling techniques is listed as a desired skill rather than a requirement.

Is the Data Engineer Intern position at Sayari remote?

Yes, the Data Engineer Intern role at Sayari is fully remote. Interns are expected to work between 20-30 hours per week, providing flexibility while contributing significantly to our exciting web crawling projects from anywhere.

What qualifications are beneficial for a Data Engineer Intern seeking a role at Sayari?

While not mandatory, desirable skills for the Data Engineer Intern position at Sayari include experience with Apache projects like Spark and Airflow, familiarity with PostgreSQL or RocksDB, and exposure to cloud platforms like GCP, AWS, or Azure. Understanding API frameworks and knowledge graphs can also give candidates an edge.

What is the pay rate for a Data Engineer Intern at Sayari?

The pay rate for the Data Engineer Intern position at Sayari ranges from $20 to $25 an hour, depending on the candidate's skills and experience. This competitive compensation reflects Sayari's commitment to valuing intern contributions while offering a meaningful learning experience.

Common Interview Questions for Data Engineer Intern - Web Crawling
Can you explain your experience with web crawling and its relevant frameworks?

When answering this question, describe specific projects where you’ve managed web crawling tasks, detailing the frameworks used, like Scrapy. Discuss challenges encountered, and how you optimized processes, which showcases your hands-on experience and problem-solving skills.

How do you ensure the reliability and scalability of web crawlers?

Discuss your approach to building robust crawling systems, including strategies for error handling, retries, and monitoring performance metrics. Highlight your understanding of load balancing and distributed systems, which reflects your technical depth in this field.
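
As one concrete talking point for the retry strategy mentioned above, here is a minimal Python sketch of retrying a failed fetch with exponential backoff and jitter. It is an illustration only, not Sayari's implementation: the function name `fetch_with_retries`, its parameters, and the injected `fetch` callable are all hypothetical.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    `fetch` is a hypothetical callable supplied by the caller (for example a
    thin wrapper around an HTTP client). `sleep` is injectable so the wait can
    be stubbed out in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The base delay doubles after each failed attempt; random jitter
            # keeps many workers from retrying at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

In a real crawler you would likely catch only transient errors (timeouts, HTTP 429 or 5xx responses) rather than every exception, and record each retry as a metric so that reliability can be monitored.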

What experience do you have working with Python for data engineering tasks?

Share specific examples of Python projects you've worked on, particularly those relating to data processing or web scraping. Include libraries or frameworks used, such as Pandas or Requests, to demonstrate your programming proficiency and familiarity with the language's best practices.

How familiar are you with Git for collaborative work?

Explain your experience using Git in team projects, focusing on version control practices, branching strategies, and handling merge conflicts. Demonstrating effective communication and collaboration skills through Git conveys your ability to work well in a team environment.

What methods have you used to improve ETL processes?

Talk about specific optimizations you've implemented in ETL processes, such as streamlining data flow, reducing processing times, or automating tasks. This shows your practical understanding of data engineering workflows and your ability to enhance data pipeline efficiency.

Describe a challenge you faced when working on a crawling project, and how you overcame it.

Present an engaging narrative about a specific challenge, detailing your thought process for troubleshooting. Focus on the outcome and what you learned, which will give insight into your problem-solving ability and resilience in challenging situations.

What experiences do you have with Kubernetes and container orchestration?

Outline your hands-on experience deploying applications using Kubernetes, discussing concepts like pods, services, and deployments. Your comfort level with container orchestration demonstrates your readiness to work in modern cloud environments.

How do you approach learning new technologies or skills?

Share your personal strategies for continuous learning, such as online courses or hands-on practice. Mention any recent technologies you’ve explored related to data engineering, showcasing your commitment to professional growth and adaptability.

What interests you about working at Sayari as a Data Engineer Intern?

Express your alignment with Sayari’s mission and commitment to innovative data solutions. Emphasize your eagerness to work in a collaborative environment and contribute to meaningful projects, which reflects your genuine interest in the company and its culture.

Can you discuss your experience with data stores like PostgreSQL?

Describe your familiarity with PostgreSQL, including any projects where you utilized it for data storage or management. Discuss your understanding of database optimization and querying techniques, which can demonstrate your data handling expertise.

Employment type: Internship, remote
Date posted: March 17, 2025
